This reads as a bad-faith misreading of our point: earlier in this thread you indicated you understood that our paper argues the NWP correlation in Schrimpf et al. 2021 does not hold, and now you're writing as if we've said the opposite. Thanks @neural-reckoning.org, I'm stepping back from this now.
Posts by Nima Hadidi
We did not say that "the results can also be perfectly replicated", and certainly not regarding the NWP correlation shown in Schrimpf et al. 2021. Our Figure 4 shows specifically that this result does not replicate under any activation extraction choice on any of the 3 datasets under contiguous splits.
Putting such speculation aside, we’re curious: do you still view the AlKhamissi 2025 results as a reproduction of the Schrimpf 2021 results despite the differences? Do you still reject any results from our study, which directly replicates the Schrimpf 2021 experiments with more rigorous methods? 3/3
Our Ext. Data Fig. 1 shows that adding GloVe to PWR closes much of the gap from PWR to GPT2-XL on Pereira2018, suggesting untrained→trained improvement is largely explainable by coarse semantic features. 2/3
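The kind of baseline comparison described above (does adding word embeddings to a simple PWR baseline close the gap?) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not our actual pipeline: `pwr`, `glove_feats`, the shapes, and the ridge settings are all made-up stand-ins.

```python
# Hypothetical sketch: compare predictivity of PWR features alone vs.
# PWR + word-embedding features, using ridge regression with CV.
# All data here is synthetic; "glove_feats" stands in for real GloVe vectors.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
pwr = rng.normal(size=(n, 3))           # e.g. position + word-rate features
glove_feats = rng.normal(size=(n, 50))  # stand-in for GloVe embeddings
# Simulated response: mostly PWR-driven, plus a smaller semantic component.
y = pwr @ rng.normal(size=3) + 0.1 * (glove_feats @ rng.normal(size=50))

score_pwr = cross_val_score(Ridge(alpha=1.0), pwr, y, cv=5).mean()
score_both = cross_val_score(
    Ridge(alpha=1.0), np.hstack([pwr, glove_feats]), y, cv=5
).mean()
# If coarse semantic features carry real signal, the concatenated model
# should close some of the gap left by PWR alone.
```

In a real analysis the gap-closing claim would of course be made with proper variance partitioning rather than a single nested comparison like this.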
Because trained models beat untrained ones, and untrained-model predictivity is fully explained by position and word rate (PWR), we think AlKhamissi et al.'s early NWP–brain correlation mostly reflects the shift from an untrained model to one that is “trained enough”; as for what “trained enough” means… 1/3
AlKhamissi et al. evaluated the NWP correlation *across the training process within individual models*, not across a zoo of pretrained models as in Schrimpf et al. Their results are consistent with ours (and Caucheteux et al.'s): NWP correlation is less robust at later training stages.
Hi Martin, thanks for the response. To be clear, we were already splitting on "stories". We used the more general term "contiguous" splits since not all of these datasets are stories (e.g., Fedorenko is sentences and Pereira is brief passages; only Blank is "stories"). We stated this in Section 4.9:
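For readers unfamiliar with the distinction, a minimal sketch of "shuffled" vs. "contiguous" splits, using sklearn-style splitters. The shapes and group structure here are invented for illustration; the point is only that contiguous splits hold out whole stimulus blocks, while shuffled splits can place time points from the same block on both sides.

```python
# Hypothetical sketch: shuffled vs. contiguous cross-validation splits.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold

rng = np.random.default_rng(0)
n_trs = 120
X = rng.normal(size=(n_trs, 10))        # model features per time point
# Each time point belongs to one contiguous passage/sentence/story block.
groups = np.repeat(np.arange(6), n_trs // 6)

# Shuffled splits: time points from the same block can land in both train
# and test, letting slow temporal/semantic structure leak across the split.
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)

# Contiguous splits: hold out whole blocks, so no within-block leakage.
contiguous = GroupKFold(n_splits=5)

for train, test in contiguous.split(X, groups=groups):
    # No block appears on both sides of a contiguous split.
    assert set(groups[train]).isdisjoint(set(groups[test]))
```

Splitting on "stories" in the original pipelines is the special case of this where each group is one story.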
Hi Anna, a successful replication of OASM on Pereira 2018 has now been shown in Brain-Score!
Thanks to @kartikpradeepan.bsky.social, who just replicated our results using OASM with shuffled splits on Pereira 2018: github.com/brain-score/...
Does this change your interpretation of the results in Schrimpf et al. 2021? @mschrimpf.bsky.social
Also, LITCoder (Binhuraib et al. 2025) reports similarly high predictivity on naturalistic listening datasets with an OASM-like baseline under shuffled splits, so effects like the one we see with OASM appear widespread across datasets and pipelines when splits are shuffled.
For what it’s worth, our GPT2-XL Pereira results look quite comparable to Brain-Score: shuffled vs. contiguous scores, and the layerwise pattern in the shuffled case, both closely track what Kauf et al. 2024 show in their appendix figures.
Practically, we don’t think we can implement a Brain-Score submission quickly right now (near-term deadlines), and we’ve already shared a reference implementation. If someone more familiar with Brain-Score wants to implement it there, that’d be ideal; we’re happy to answer questions / help validate.
Thanks, Anna. Agree it would be clean and helpful to have the exact same OASM definition implemented inside Brain-Score. We actually found Brain-Score hard to extend/debug for what we needed (custom splits, variance partitioning, and modern ridge tooling like himalaya), so we’re not familiar with it.
We take these claims very seriously. When a high-profile researcher claims that the results of our work cannot be replicated, using a vibe-coded model that doesn't even attempt to model the correct features, we believe it is appropriate to state this plainly.
Hi Anna, we are happy to have a cordial discussion, and we are trying to contribute positively by ensuring that highly influential results are robust. However, this is now the second time that Martin or his group have claimed that our results do not replicate.
Great, we're excited to hear your response! If you have difficulty replicating our results next time, please reach out. Agreed, let's keep it friendly on both ends.
Martin, we linked our code in the previous thread. We also had a link to our code in Feghhi et al., 2024, which contained the OASM results and which your group cited in AlKhamissi et al., 2024.
@mschrimpf.bsky.social has publicly claimed that he can't replicate our results. Meanwhile what he's actually done is vibe-code a model that has nothing more than its acronym (and not even what it stands for) in common with ours.
Yeah, we'd love that discussion as well!
Thanks for promoting our work!