1) 'baby-like' LMs are just capped at a smaller # of words (e.g. 10M in #BabyLM)
But starting w/ tokenized text solves a huge part of the problem: figuring out, over time, where the words begin and end in the first place. Baby input doesn't come pre-chewed. (cf. the zero-resource folks, Dupoux et al.) 2/4 #BabyLM
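A toy illustration of the point above (my own sketch, not from the thread): a tokenized corpus hands the learner word boundaries for free, while real child input is closer to a continuous stream, and recovering the words from it is itself a learning problem.

```python
# Typical LM training input: boundaries already given.
tokenized = ["the", "baby", "sees", "the", "dog"]

# Closer to what a child actually receives: no boundaries.
raw_stream = "".join(tokenized)
print(raw_stream)  # "thebabyseesthedog"

# The segmentation step that tokenized corpora skip: the learner
# must discover that "thebabyseesthedog" = the|baby|sees|the|dog.
# This is the zero-resource segmentation task the post alludes to.
```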
4. #BabyLM challenge description paper, co-authored by Lucas Georges Gabriel Charpentier
babylm.github.io
Are transformers really all we need? I doubt it. We tested alternative backbones for language models in low-resource scenarios — #Mamba, #xLSTM, and #HGRN2 — and they work surprisingly well!
📄 Paper: aclanthology.org/2024.conll-b...
Thanks for being part of the #BabyLM Challenge! 👶
How can we train LLMs with <100M words? In our #BabyLM paper, we introduce a new language+vision self-synthesis training recipe to tackle this question:
Our model learns over 4 phases -- most crucially, self-captioning unseen images to generate synthetic language data
arxiv.org/abs/2411.00828