This reads as a bad-faith misreading of our point: earlier in this thread you indicated you understood that our paper argues the NWP correlation in Schrimpf et al. 2021 does not hold, and now you're writing as if we've said the opposite. Thanks @neural-reckoning.org, I'm stepping back from this now.
Posts by Nima Hadidi
We did not say that "the results can also be perfectly replicated", and certainly not regarding the NWP correlation shown in Schrimpf et al. 2021. Our Figure 4 shows specifically that this result does not replicate under any activation extraction choice on any of the 3 datasets under contiguous splits.
Putting such speculation aside, we’re curious: do you still view the AlKhamissi 2025 results as a reproduction of the Schrimpf 2021 results despite the differences? Do you still reject any results from our study, which directly replicates the Schrimpf 2021 experiments with more rigorous methods? 3/3
Our Ext. Data Fig. 1 shows that adding GloVe to PWR closes much of the gap from PWR to GPT2-XL on Pereira2018, suggesting untrained→trained improvement is largely explainable by coarse semantic features. 2/3
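The kind of baseline comparison described above (does adding word embeddings to a simple PWR baseline close the gap?) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not our actual pipeline: `pwr`, `glove_feats`, the shapes, and the ridge settings are all made-up stand-ins.

```python
# Hypothetical sketch: compare predictivity of PWR features alone vs.
# PWR + word-embedding features, using ridge regression with CV.
# All data here is synthetic; "glove_feats" stands in for real GloVe vectors.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
pwr = rng.normal(size=(n, 3))           # e.g. position + word-rate features
glove_feats = rng.normal(size=(n, 50))  # stand-in for GloVe embeddings
# Simulated response: mostly PWR-driven, plus a smaller semantic component.
y = pwr @ rng.normal(size=3) + 0.1 * (glove_feats @ rng.normal(size=50))

score_pwr = cross_val_score(Ridge(alpha=1.0), pwr, y, cv=5).mean()
score_both = cross_val_score(
    Ridge(alpha=1.0), np.hstack([pwr, glove_feats]), y, cv=5
).mean()
# If coarse semantic features carry real signal, the concatenated model
# should close some of the gap left by PWR alone.
```

In a real analysis the gap-closing claim would of course be made with proper variance partitioning rather than a single nested comparison like this.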
Because trained models beat untrained ones, and untrained-model predictivity is fully explained by position and word rate (PWR), we think AlKhamissi et al.'s early NWP–brain correlation mostly reflects the shift from an untrained model to one that is “trained enough”; as for what “trained enough” means… 1/3
AlKhamissi et al. evaluated the NWP correlation *across the training process within individual models*, not across a zoo of pretrained models as in Schrimpf et al. Their results are consistent with ours (and Caucheteux et al.'s): NWP correlation is less robust at later training stages.
Hi Martin, thanks for the response. To be clear, we were already splitting on "stories". We used the more general term "contiguous" splits since not all of these datasets are stories (e.g., Fedorenko is sentences and Pereira is brief passages; only Blank is "stories"). We stated this in Section 4.9:
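For readers unfamiliar with the distinction, a minimal sketch of "shuffled" vs. "contiguous" splits, using sklearn-style splitters. The shapes and group structure here are invented for illustration; the point is only that contiguous splits hold out whole stimulus blocks, while shuffled splits can place time points from the same block on both sides.

```python
# Hypothetical sketch: shuffled vs. contiguous cross-validation splits.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold

rng = np.random.default_rng(0)
n_trs = 120
X = rng.normal(size=(n_trs, 10))        # model features per time point
# Each time point belongs to one contiguous passage/sentence/story block.
groups = np.repeat(np.arange(6), n_trs // 6)

# Shuffled splits: time points from the same block can land in both train
# and test, letting slow temporal/semantic structure leak across the split.
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)

# Contiguous splits: hold out whole blocks, so no within-block leakage.
contiguous = GroupKFold(n_splits=5)

for train, test in contiguous.split(X, groups=groups):
    # No block appears on both sides of a contiguous split.
    assert set(groups[train]).isdisjoint(set(groups[test]))
```

Splitting on "stories" in the original pipelines is the special case of this where each group is one story.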
Hi Anna, a successful replication of OASM on Pereira 2018 has now been shown in Brain-Score!
Thanks to @kartikpradeepan.bsky.social, who just replicated our results using OASM with shuffled splits on Pereira 2018: github.com/brain-score/...
Does this change your interpretation of the results in Schrimpf et al. 2021? @mschrimpf.bsky.social
Also, LITCoder (Binhuraib et al. 2025) reports similarly high predictivity on naturalistic listening datasets with an OASM-like baseline under shuffled splits, so effects like the one we see with OASM appear widespread across datasets and pipelines when splits are shuffled.
For what it’s worth, our GPT2-XL Pereira results look quite comparable to Brain-Score: shuffled vs. contiguous scores, and the layerwise pattern in the shuffled case, both closely track what Kauf et al. 2024 show in their appendix figures.
Practically, we don’t think we can implement a Brain-Score submission quickly right now (near-term deadlines), and we’ve already shared a reference implementation. If someone more familiar with Brain-Score wants to implement it there, that’d be ideal; we’re happy to answer questions / help validate.
Thanks, Anna. Agree it would be clean and helpful to have the exact same OASM definition implemented inside Brain-Score. We actually found Brain-Score hard to extend/debug for what we needed (custom splits, variance partitioning, and modern ridge tooling like himalaya), so we’re not familiar with it.
We take these claims very seriously. When a high-profile researcher claims that the results of our work cannot be replicated, using a vibe-coded model that doesn't even attempt to model the correct features, we believe it is appropriate to state this plainly.
Hi Anna, we are happy to have a cordial discussion, and we are trying to contribute positively by ensuring that highly influential results are robust. However, this is now the second time that Martin or his group have claimed that our results do not replicate.
Great, we're excited to hear your response! If you have difficulty replicating our results next time, please reach out. Agreed, let's keep it friendly on both ends.
Martin, we linked our code in the previous thread. We also had a link to our code in Feghhi et al., 2024, which contained the OASM results and which your group cited in AlKhamissi et al., 2024.
@mschrimpf.bsky.social has publicly claimed that he can't replicate our results. Meanwhile what he's actually done is vibe-code a model that has nothing more than its acronym (and not even what it stands for) in common with ours.
Yeah, we'd love that discussion as well!
Thanks for promoting our work!