GitHub project

A cleaner version of this page is coming soon, but for now here are some fun datasets:

Linux Kernel (6.2MB)
The above is only the kernel. The examples in my blog post were trained on the full Linux code base. That is:
$ git clone
$ cd linux
$ find . -name "*.[ch]" | shuf | xargs cat > linux.txt
(This gives a 474MB file, which I used as the training input)
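
If you want to rebuild the file yourself, here is a slightly more defensive sketch of the same pipeline. It is not the exact command I ran: it assumes GNU find, shuf and xargs (for the NUL-separated options), and concat_sources is a made-up helper name used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: concatenate every C source and header under a directory, in random order.
# concat_sources is a hypothetical helper name, not something from the original page.
concat_sources() {
  local src_dir="$1" out_file="$2"
  # -print0, -z and -0 keep file names NUL-separated, so names containing
  # spaces or newlines are passed to cat intact.
  find "$src_dir" -type f -name '*.[ch]' -print0 \
    | shuf -z \
    | xargs -0 cat > "$out_file"
}
```

Calling concat_sources linux linux.txt from the directory that contains the clone should produce the same kind of file as the one-liner above. The order is random on every run, so two runs give files of equal size but different contents.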

All works of Shakespeare concatenated (4.6MB)
Leo Tolstoy's War and Peace (3.3MB)

Free books in general (including War and Peace) can be found in

Wikipedia: 100MB of Wikipedia data from the Hutter Prize

The Stacks Project, which is where the LaTeX dataset on Algebraic Geometry came from.