# openwebtext

**Repository Path**: bsyonline/openwebtext

## Basic Information

- **Project Name**: openwebtext
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-03-19
- **Last Updated**: 2025-03-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# OpenWebText

This project is a clone of the GPT-2 WebText dataset as outlined in the [OpenAI paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). This project is still heavily WIP.

Huge thanks to [jcpeterson](https://github.com/jcpeterson/openwebtext) for letting me use his download code. His version of OpenWebText is super well written, so please check it out!

## Dependencies

Pipenv and Python 3.

To install Python dependencies:

```
pipenv install
```

[Newspaper](https://github.com/codelucas/newspaper#get-it-now) dependencies:

On Ubuntu:

```
sudo apt-get install libxml2-dev libxslt-dev
```

On OS X:

```
brew install libxml2 libxslt
```

## Usage

1. Get the list of URLs from Reddit:

```
pipenv run python get_urls.py
```

2. Download data from the URLs:

```
pipenv run python download.py
```

The resulting files are written to `data/`, named in the format `{domain}-{sha256 hash of url}.txt`.

Enjoy!
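
The `{domain}-{sha256 hash of url}.txt` naming scheme can be illustrated with a minimal sketch. This is not the code in `download.py`. The helper name `output_filename` is hypothetical, and it assumes that "domain" means the URL's network location and that the hash is the hex SHA-256 digest of the UTF-8 encoded URL string; the actual script may normalize URLs or domains differently.

```python
import hashlib
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a `{domain}-{sha256 hash of url}.txt` style name for a URL.

    Hypothetical helper for illustration only; not taken from download.py.
    """
    # Assumption: "domain" is the network location part of the URL.
    domain = urlparse(url).netloc
    # Assumption: the hash is the hex SHA-256 digest of the UTF-8 URL string.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{digest}.txt"


if __name__ == "__main__":
    # Prints a name of the form "example.com-<64 hex characters>.txt".
    print(output_filename("https://example.com/some/article"))
```

Hashing the full URL gives every page a distinct, filesystem-safe name, while the domain prefix makes it easy to group the files in `data/` by source.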