What You Get When You Only Train on Public Domain Text
Mr. Chatterbox is a 340M-parameter model trained exclusively on 28,000 Victorian-era texts from the British Library — definitively public domain, zero copyright exposure. Simon Willison's writeup documents both what it proves and what it falls short of: the corpus is large enough to train something coherent, but not large enough to be useful by Chinchilla norms.
Read more →
