# openwebtext

**Repository Path**: bsyonline/openwebtext

## Basic Information

- **Project Name**: openwebtext
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-03-19
- **Last Updated**: 2025-03-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# OpenWebText

This project is an open clone of the GPT-2 WebText dataset described in the [OpenAI paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). It is still very much a work in progress.

Huge thanks to [jcpeterson](https://github.com/jcpeterson/openwebtext) for letting me use his download code. His version of OpenWebText is super well written, so please check it out!

## Dependencies

Pipenv and Python 3.

To install the Python dependencies:
```
pipenv install
```

[Newspaper](https://github.com/codelucas/newspaper#get-it-now) dependencies:

On Ubuntu:
```
sudo apt-get install libxml2-dev libxslt-dev
```
On OS X:
```
brew install libxml2 libxslt
```

## Usage

1. Get the list of URLs from Reddit:

```
pipenv run python get_urls.py
```

2. Download data from URLs:

```
pipenv run python download.py
```

The resulting files are written to `data/` and named `{domain}-{sha256 hash of url}.txt`.
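To make the naming scheme concrete, here is a minimal sketch of how such a file name could be derived from a URL. It is not taken from `download.py`: the helper name `output_filename` is hypothetical, and it assumes that the domain is the URL's network location and that the hash is the hex SHA-256 digest of the UTF-8 encoded URL. The actual script may extract the domain or compute the hash differently.

```python
import hashlib
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a `{domain}-{sha256 hash of url}.txt` name (hypothetical helper)."""
    # Assumption: "domain" means the URL's network location, e.g. "example.com".
    domain = urlparse(url).netloc
    # Assumption: the hash is the hex SHA-256 digest of the UTF-8 encoded URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{digest}.txt"


if __name__ == "__main__":
    # Prints the domain, a dash, 64 hex characters, and the .txt suffix.
    print(output_filename("https://example.com/some/article"))
```

Because the name depends only on the URL, the same URL always maps to the same file, which would make it easy to skip pages that have already been downloaded.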

Enjoy!