“Pretraining Language Models for Diachronic Linguistic Change Discovery” by @tom-lippincott.bsky.social, @cmessner.bsky.social, & more shows that efficient pretraining techniques produce useful models over corpora too large for easy manual inspection and too small for “typical” LLM approaches: (3/5)
Posts by Craig Messner
Maybe a bit afield, but I always like telling students about Hugh Kenner's experiments with n-gram modeling for literary style in the early 1980s (so roughly contemporaneous with Jelinek's adoption of them for ASR). As found in Byte magazine in 1984: archive.org/details/byte...
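For readers who have not seen the technique, here is a minimal sketch of character-level n-gram text generation, the general idea behind this kind of style experiment. It is not Kenner's program: the function names, the toy sample text, and the choice of a 3-character context are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_model(text, order=3):
    """Map each `order`-character context to the characters observed after it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context].append(text[i + order])
    return model

def generate(model, seed, length=80, rng=None):
    """Extend `seed` one character at a time by sampling observed followers.

    Assumes len(seed) equals the order the model was built with.
    """
    rng = rng or random.Random(0)
    order = len(seed)
    out = seed
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:  # unseen context: stop rather than invent a continuation
            break
        out += rng.choice(followers)
    return out

# Hypothetical toy corpus; a real experiment would use a full literary text.
sample = "the cat sat on the mat and the cat ate the rat on the mat"
model = build_model(sample, order=3)
print(generate(model, seed="the", length=60))
```

Because followers are stored with repeats, sampling from the list reproduces the source text's observed frequencies. Raising `order` makes the output hew more closely to the source's wording.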
this is both predictable and informative
I regret to inform you that while LLMs may well automate numerous rote linguistic tasks they cannot as yet replace your teammates in CS2
New to me as well, will be using this in class this semester!
From my own local extrapolations, it feels like methods tied to classic quant dh/cultural analytics are getting broader play, and that "humanities machine learning" (which extends ML fields like interpretability, data-efficient training, evaluation, etc.) is emerging underneath
The real question is: who is winning?
Any leads on research out there that examines the disjunction between human-recoverable and SAE-recoverable features? (Perhaps after arxiv.org/html/2506.15...: which features are "naturally" distinguishable by humans but represented densely by SAEs, which end up in "noisy" dense reps.)
Anyone have leads on instruction tuning for models trained solely on historical data? Turns out texts from 1750 have very few "reddit-like" constructs.
Interestingly, this reads to me a lot like a description of how close reading works in practice, especially post-New Historicism -- "why this word here, knowing what we know about its contemporaneous use"
More importantly, its frustrations will hopefully serve as useful critical lessons. I also hereby claim the use of the name "Poetaster" for any further such systems!
I teach a machine learning class for students in traditionally less computational fields (cdh.jhu.edu/teaching/4/). Recent students had questions about RAG, so I used EmbeddingGemma's release as an excuse to put together an example in the form of a poetry criticism game
github.com/messner1/poe...
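For anyone unfamiliar with RAG, the retrieval step can be sketched roughly as below. This is not the code from the linked repository and it does not call EmbeddingGemma: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the passages, query, and prompt format are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, passages, k=1):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

# Hypothetical reference notes standing in for a document store.
passages = [
    "the sonnet turns on a volta in line nine",
    "blank verse is unrhymed iambic pentameter",
    "a villanelle repeats two refrains across nineteen lines",
]

query = "which form repeats refrains"
context = retrieve(query, passages, k=1)

# The retrieved text is placed in the prompt that a generator model would see.
prompt = "Context: " + " ".join(context) + "\nQuestion: " + query
print(prompt)
```

Swapping `embed` for a learned embedding model changes the vectors but not the shape of the pipeline: embed, rank by similarity, then pass the top passages to the generator.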