My lab is hiring a software engineer to support our #NeuroAI research: careers.epfl.ch/job/Lausanne.... Please consider applying if you want to build out the infrastructure enabling models of the human brain & mind (e.g., www.Brain-Score.org). We will start screening applications this week.
Posts by Martin Schrimpf
DNN models of the brain are getting bigger. Are we replacing one complicated system in vivo with another in silico?
In new work, we seek the *smallest* DNN models of visual cortex, balancing prediction with parsimony.
It turns out the resulting models are surprisingly small!
rdcu.be/e5H8G
I believe the results in your and Ebrahim's paper, but I do not understand why this particular configuration is so important. If you agree with point 2, then that is much more general than what we did in 2021 (more models & data) and with a more stringent metric -- and the core claim stands.
2. NWP task performance is correlated with brain alignment in a larger set of models and datasets (going beyond our 2021 set).
I understand your pushback to be that the NWP-correlates-with-brain-alignment result does _not_ hold when using the *exact* 2021 models and datasets, *but* with a different metric.
(maybe more as a personal summary from this, don't feel obliged to respond.)
I believe we agree on two things:
1. The results from Schrimpf et al. 2021 with the exact same specifications (datasets, metrics, models) are perfectly reproducible from the open-source code.
I personally see Brain-Score as an evolving set of benchmarks that is improved over time (and not as a static goalpost). Indeed our community is updating it with more rigorous alignment tests and better models. I hope you will consider contributing!
In vision, Yamins & Hong et al 2014 first established a correspondence between object classification accuracy and ventral stream alignment on a dataset that is very easy by today's standards; which has now been extended to ImageNet, larger and more diverse neural data etc. See Brain-Score.org/vision
I guess what you mean is whether we should move past the particular methodologies we used in the 2021 paper by testing alignment more stringently and building even better brain models -- I absolutely think so!
The core claim you mean is "Models that perform better at predicting the next word in a sequence also better predict brain measurements" -- and yes, that indeed has been validated and extended by many follow-up studies. As you said yourself, the results can also be perfectly replicated.
The Re-Align Challenge is now LIVE!
We're inviting you to explore what properties of vision models and data lead to convergences and divergences in representational alignment.
Get started: huggingface.co/spaces/repre...
What's your sense as to why that is? Our intuition from the 2025 EMNLP paper is that more scaled models develop a lot more capabilities beyond formal "core" language processing; I'm curious if you agree
correction: the original implementation was incorrect and @kartikpradeepan.bsky.social updated the model PR thanks to @ebrahimfeghhi.bsky.social linking the open source code. Updates here: bsky.app/profile/msch...
hi Ebrahim, I responded in this thread: bsky.app/profile/msch.... Happy to discuss more
Either way I'm glad the OASM model is now part of the open-source community platform, this will be a great reference point. With the new benchmarks soon on Brain-Score, we can encourage the development of models that generalize much better than what we did 5 years ago
Regarding the original claims: Badr & others reproduced the correlation to NWP performance with the new benchmarks and newer models, so I see no reason to consider the 2021 claims invalid. These new benchmarks enforce stronger generalization (great!) but that doesn't mean the old ones were wrong.
The AlKhamissi et al 2025 benchmarks are the most stringent afaict since they split on stories instead of contiguous k-folds, which prevents within-story temporal autocorrelation from leaking into the test set. (L)LMs indeed score much higher than OASM here. I'm glad OASM is now integrated in Brain-Score as a useful reference!
Thanks @kartikpradeepan.bsky.social for confirming that this model indeed scores highly on the earlier benchmarks with part-of-sentence-splits! Building on Feghhi & Hadidi et al 2024, AlKhamissi et al 2025 had identified the most stringent benchmarks. We should have merged this PR sooner. 1/
Some re-mapping is necessary even for predicting one brain's activity from another, esp. in higher areas. Linear regression is one of the more restrictive ways to achieve this between two brains so we use the same for models. @neuranna.bsky.social wrote about this here: arxiv.org/abs/2208.10668
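For concreteness, here's a minimal sketch of the linear-predictivity idea with synthetic data: fit a cross-validated linear mapping from model features to brain responses and score held-out predictions by Pearson correlation. This is a plain NumPy illustration, not the actual Brain-Score implementation, which differs in details like regularization and ceiling normalization.

```python
import numpy as np

def linear_predictivity(model_features, brain_responses, n_folds=5, seed=0):
    """Cross-validated linear mapping from model features to brain responses.

    Fits ordinary least squares on the training folds and reports the Pearson
    correlation between predicted and held-out responses, averaged over voxels
    and folds.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(model_features.shape[0])
    folds = np.array_split(order, n_folds)
    scores = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        X_tr, X_te = model_features[train_idx], model_features[test_idx]
        Y_tr, Y_te = brain_responses[train_idx], brain_responses[test_idx]
        # least-squares weights mapping features -> responses
        W, *_ = np.linalg.lstsq(X_tr, Y_tr, rcond=None)
        Y_hat = X_te @ W
        # Pearson r per voxel, averaged over voxels
        r = [np.corrcoef(Y_hat[:, v], Y_te[:, v])[0, 1] for v in range(Y_te.shape[1])]
        scores.append(np.nanmean(r))
    return float(np.mean(scores))

# synthetic example: responses are a noisy linear readout of the features
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))          # 200 stimuli x 20 model features
Y = X @ rng.normal(size=(20, 5)) + 0.1 * rng.normal(size=(200, 5))  # 5 "voxels"
print(linear_predictivity(X, Y))
```

Note that the same procedure applies unchanged when predicting one brain's activity from another's, which is what motivates using the same restrictive mapping class for models.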
I'll continue in this thread where @ebrahimfeghhi.bsky.social has been helpful with linking the code. I would like to remind you that there is a human at the other end of the screen and that no information will be lost by keeping this friendly.
Thanks Ebrahim! Would you be interested in submitting this model directly to Brain-Score? Alternatively I can let cursor attempt it again but as you pointed out, it doesn't necessarily get it right
Nima you're very much welcome to update the PR. You are even more welcome to use the Brain-Score platform as we stated previously. I don't know how we can reach common ground if you don't either use the same benchmark implementation, or release your model code.
I am of course happy to be proven wrong, but I find the framing of this preprint a bit frustrating. We gave similar feedback before, yet the manuscript doesn't seem to engage with the counter-evidence. I would appreciate clarification on the results discrepancy -- please feel free to update the PR!
This is significantly lower than the paper's reported number and far below gpt2-xl (which in the paper is outperformed by oasm). So something does not track here, either in the preprint's re-implementation of the benchmark or my reconstruction of the model.
3. I implemented and submitted the authors' model to Brain-Score (see PR#355 github.com/brain-score/...). The implementation follows the paper as I could not find a code release. It obtains a ceiling-normalized score of 0.34 on the criticized Pereira2018 benchmark.
-- this work includes null models such as randomly-assigned stimulus responses. Brain-Score Language includes benchmarks that use this stronger form of generalization, which we flagged about a year ago.
2. Splitting across larger temporal chunks (e.g. stories) is indeed a stronger form of generalization than splitting across smaller chunks (e.g. sentences). @bkhmsi.bsky.social tackled this in his EMNLP'25 paper, where we identified the most stringent evaluation of brain alignment to be linear predictivity with story splits.
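To make the splitting point concrete, here is a rough sketch (with hypothetical story labels) of leave-one-story-out splits: every sentence of the held-out story is excluded from training, so within-story temporal autocorrelation cannot leak across the train/test boundary the way it can with contiguous k-folds inside a story.

```python
import numpy as np

# toy metadata: each sentence belongs to a story (hypothetical labels)
stories = np.array(["story_a"] * 4 + ["story_b"] * 4 + ["story_c"] * 4)

def story_splits(story_labels):
    """Leave-one-story-out splits: hold out all sentences of one story
    at a time, training on the remaining stories only."""
    for story in np.unique(story_labels):
        test = np.where(story_labels == story)[0]
        train = np.where(story_labels != story)[0]
        yield train, test

for train_idx, test_idx in story_splits(stories):
    held_out = stories[test_idx][0]
    print(held_out, "->", len(train_idx), "train /", len(test_idx), "test sentences")
```

The same idea is available off the shelf, e.g. scikit-learn's GroupKFold with stories as groups.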
Perhaps the strongest support for this point is that recent LLMs confirm the original prediction: as their task performance improved, their alignment to the human brain further increased (see e.g. Shen et al. 2025).
Thank you Dan for the ping! As far as I can tell, all of the original claims hold, for the following reasons:
1. The relationship between next-word prediction performance and brain alignment has been replicated in several other studies (eg Caucheteux et al 2022; De Varda et al 2025; Mischler 2024).
Looking forward to presenting at the #AAAI #NeuroAI workshop, including 3 projects that were just accepted to ICLR! arxiv.org/abs/2509.24597, arxiv.org/abs/2510.03684, arxiv.org/abs/2506.13331
Re-Align is back for its 4th edition at ICLR 2026!
We invite submissions on representational alignment, spanning ML, Neuroscience, CogSci, and related fields.
Tracks: Short (≤5p), Long (≤10p), Challenge (blog)
Deadline: Feb 5, 2026 for papers
representational-alignment.github.io/2026/