The premise of Talkie is elegantly constrained: train a 13B language model entirely on English text published before January 1, 1931 — books, newspapers, scientific journals, patents, case law — and see what a model knows and can do when its training data ends nearly a century ago. No Wikipedia. No Stack Overflow. No knowledge of transistors, World War II, or Python.

The project comes from Nick Levine, David Duvenaud, and Alec Radford — yes, the Radford who wrote the original GPT paper, co-authored GPT-2, and built Whisper. That’s an unusual combination of names for what is explicitly not a product launch. The Hugging Face release is Apache 2.0, the weights are open, and the point is research.

The research question is about generalization. Modern evaluations of language model capabilities are hobbled by contamination: if a model was trained on the internet, it has likely seen the answer to every benchmark question at some point. You can’t cleanly separate what a model learned from what it memorized. Talkie sidesteps this by making contamination structurally impossible — nothing in the model’s training data mentions the things you’d want to test for. The researchers describe it as “contamination-free by construction.”

The most striking finding is what the model can do despite this constraint. Fed Python programming examples entirely in context — through demonstrations, not training — Talkie can write working code. Simple code, “one-line programs or small modifications,” but code nonetheless. A model that has never read a word about digital computers is abstracting the structure of a programming language from a handful of examples and applying it. Whether you find this obvious or surprising probably reveals something about how you think about what LLMs are doing.
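To make this concrete, here is the shape such a prompt takes: a few task/code demonstrations followed by a new task, all supplied as plain context. The demonstrations below are invented for illustration; the actual examples fed to Talkie aren’t reproduced in the release.

```python
# Illustrative few-shot prompt for in-context code generation.
# These demonstrations are hypothetical stand-ins, not the ones
# used in the Talkie experiments.
demonstrations = [
    ("Double every number in a list.",
     "def double_all(xs):\n    return [2 * x for x in xs]"),
    ("Return the longest word in a sentence.",
     "def longest_word(s):\n    return max(s.split(), key=len)"),
]

query = "Sum the even numbers in a list."

# Concatenate demonstrations and the new task into one context window.
prompt = ""
for task, code in demonstrations:
    prompt += f"Task: {task}\nCode:\n{code}\n\n"
prompt += f"Task: {query}\nCode:\n"

print(prompt)  # fed to the model as context; no gradient updates
```

Everything the model knows about Python syntax has to come from those few lines of context.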

There’s also a less flashy but arguably more important technical finding: raw OCR text from historical archives achieves only about 30% of the learning efficiency of human-transcribed text. The team estimated this by comparing learning curves across differently sourced subsets of their 260-billion-token corpus. Over three times the compute for the same effective signal. Given how many historical digitization projects rely on automated OCR, and given that this text ends up in modern pretraining corpora, the gap suggests that a lot of “historical data” in current foundation model training is substantially noisier than its volume implies. Fixing that is an obvious direction for future work.
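The post doesn’t spell out the estimation procedure, but the logic of reading an efficiency ratio off two learning curves is easy to sketch: find how many OCR tokens it takes to reach the validation loss that clean text reaches at a given token count. The loss numbers below are made up; only the method is the point.

```python
import numpy as np

# Hypothetical learning-curve measurements: validation loss after a
# given number of training tokens, one curve per data source.
tokens = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
loss_transcribed = np.array([3.59, 3.42, 3.26, 3.13, 3.00])
loss_ocr         = np.array([3.79, 3.60, 3.42, 3.27, 3.13])

# Loss that clean text reaches after n tokens.
n = 1e9
target = np.interp(np.log10(n), np.log10(tokens), loss_transcribed)

# Token count at which the OCR curve reaches that same loss.
# Loss decreases with tokens, so reverse both arrays: np.interp
# needs its x-coordinates in increasing order.
log_n_ocr = np.interp(target, loss_ocr[::-1], np.log10(tokens)[::-1])

print(f"effective efficiency of OCR text: {n / 10**log_n_ocr:.0%}")
# ~30% with these made-up curves: over 3x the tokens for the same loss
```

The efficiency ratio is just the horizontal shift between the two curves: how far right you have to slide along the token axis before the OCR curve catches up to the clean one.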

Two model variants are released. talkie-1930-13b-base is the raw pretrained model. talkie-1930-13b-it is an instruction-tuned checkpoint, fine-tuned on demonstration pairs extracted from pre-1931 reference works and built primarily to power the chat interface on the project’s website. A parallel control model, talkie-web-13b-base, was trained on FineWeb with identical architecture and training compute, allowing controlled comparisons between what a modern data distribution teaches and what a historical one does.
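Assuming the checkpoints follow standard Hugging Face conventions, loading one looks like any other causal LM. The repository path below is a guess based on the model names in the release, not a verified Hub path.

```python
# Hypothetical loading snippet; the repo id is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "talkie/talkie-1930-13b-base"  # assumed org/name, check the Hub
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

prompt = "The most remarkable invention of the modern age is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=60, do_sample=True,
                     temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The same pattern should work for the instruction-tuned checkpoint and the FineWeb control.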

The researchers acknowledge that the temporal filter is imperfect. The model has some knowledge of FDR and World War II despite the 1930 cutoff, likely leaked through misdated documents: texts written after the cutoff but about earlier periods slipped past the filter, carrying references to later events with them. The plan is to release the training data eventually, since all pre-1931 US-published works are now out of copyright.

On standard benchmarks, Talkie underperforms models trained on modern data, which is expected. What’s more interesting is the shape of the results: where exactly it fails and where it doesn’t, and what that distribution says about what a century of extra text teaches a model versus what emergent generalization provides for free.