Introducing Global PIQA, a new multilingual benchmark for 100+ languages. This benchmark is the outcome of this year's MRL shared task, in collaboration with 300+ researchers from 65 countries. This dataset evaluates physical commonsense reasoning in culturally relevant contexts.
Posts by Itay Itzhak @ COLM
Had a blast at CoLM! It really was as good as everyone says, congrats to the organizers!
This week I'll be in New York giving talks at NYU, Yale, and Cornell Tech.
If you're around and want to chat about LLM behavior, safety, interpretability, or just say hi - DM me!
Thrilled to be part of this work led by @adisimhi.bsky.social!
ManagerBench reveals a critical problem:
✅ LLMs can recognize harm
❌ But often choose it anyway to meet goals
🤔 Or overcorrect and become ineffective
We need better balance!
A must-read for safety folks!
Traveling to #COLM2025 this week, and here's some work from our group and collaborators:
Cognitive biases, hidden knowledge, CoT faithfulness, model editing, and LM4Science
See the thread for details and reach out if you'd like to discuss more!
At #ACL2025 and not sure what to do next? GEM² is the place to be for awesome talks on the future of LLM evaluation. Come hear @GabiStanovsky, @EliyaHabba, @LChoshen and others rethink what it means to actually evaluate LLMs beyond accuracy and vibes. Thursday @ Hall C!
In Vienna for #ACL2025, and already had my first (vegan) Austrian sausage!
Now hungry for discussing:
- LLM behavior
- Interpretability
- Biases & hallucinations
- Why eval is so hard (but so fun)
Come say hi if that's your vibe too!
@boknilev.bsky.social @gabistanovsky.bsky.social
Huge thanks to my co-authors
@boknilev @GabiStanovsky!
Preprint: arxiv.org/abs/2507.07186
Webpage: itay1itzhak.github.io/planted-in-...
We'd love your thoughts, critiques, and ideas!
Let's talk about building more interpretable and trustworthy LLMs!
#NLProc #Bias #CognitiveAI
Takeaway:
Cognitive biases are not introduced during instruction tuning.
They're planted in pretraining and only surfaced by finetuning.
If we want fairer models, we need to look deeper into the pretraining pipeline.
Step 2: Cross-tuning.
We swap instruction datasets between models with different pretraining.
Result: Biases follow the pretrained model!
PCA clearly shows models group by pretraining base, not by instruction.
The bias "signature" stays intact, no matter the finetuning!
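The cross-tuning check above can be sketched in a few lines. This is an illustrative reconstruction with synthetic numbers, not the paper's actual code: each finetuned model is summarized by a vector of bias scores, and PCA on those vectors should cluster models by pretraining base rather than by instruction dataset.

```python
# Sketch of the cross-tuning analysis (synthetic data, for illustration only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical bias-score "signatures" (one score per bias benchmark) for
# two pretraining bases; each base is finetuned on two instruction datasets.
base_signature = {
    "base_A": rng.normal(0.0, 1.0, 8),
    "base_B": rng.normal(2.0, 1.0, 8),
}

models, labels = [], []
for base, sig in base_signature.items():
    for instr in ["instr_X", "instr_Y"]:
        # Finetuning only perturbs the base signature slightly.
        models.append(sig + rng.normal(0.0, 0.1, 8))
        labels.append(base)

proj = PCA(n_components=2).fit_transform(np.array(models))

# If biases follow pretraining, models sharing a base sit close on PC1.
for base in base_signature:
    pts = proj[[i for i, lbl in enumerate(labels) if lbl == base]]
    print(base, pts[:, 0].round(2))
```

Under this toy setup, the within-base distances in the PCA projection come out far smaller than the between-base distances, mirroring the "models group by pretraining base" result.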
Step 1: Training randomness.
We finetune the same model 3Γ with different seeds.
Result: Some variation in bias scores, but behavior patterns stay stable compared to MMLU variance.
Aggregating across seeds reveals consistent trends.
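The seed experiment boils down to simple aggregation. A minimal sketch with made-up bias scores (not the paper's numbers): finetune with several seeds, then report the mean and spread per bias.

```python
# Illustrative seed-aggregation sketch with hypothetical bias scores
# (one score per finetuning seed; not the paper's actual numbers).
import statistics

bias_scores = {
    "certainty_bias": [0.61, 0.58, 0.64],  # seeds 1-3
    "framing_bias":   [0.42, 0.45, 0.40],
}

for bias, scores in bias_scores.items():
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    print(f"{bias}: mean={mean:.2f}, sd={sd:.2f}")
```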
We introduce a two-step causal framework to disentangle the effects of:
- Pretraining
- Instruction tuning
- Training randomness
Bottom line: pretraining is the origin of bias. Finetuning? Just the messenger.
#CausalInference #TrustworthyAI #NLP
New paper alert!
Instruction-tuned LLMs show amplified cognitive biases, but are these new behaviors, or pretraining ghosts resurfacing?
Excited to share our new paper, accepted to CoLM 2025!
See thread below.
#BiasInAI #LLMs #MachineLearning #NLProc
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their steps (CoT) aren't necessarily revealing their true reasoning. Spoiler: the transparency can be an illusion. (1/9)
Are you recovering from your @colmweb.org abstract submission? GEM has a non-archival track that lets you submit a two-page abstract in parallel!
Our workshop deadline is soon, please consider submitting your evaluation paper!
You can find our call for papers at gem-benchmark.com/workshop
New paper alert!
Curious how small prompt tweaks impact LLM accuracy but don't want to run endless inferences? We got you. Meet DOVE - a dataset built to uncover these sensitivities.
Use DOVE for your analysis or contribute samples - we're growing and welcome you aboard!
1/13 LLM circuits tell us where the computation happens inside the model, but the computation varies by token position, a key detail often ignored!
We propose a method to automatically find position-aware circuits, improving faithfulness while keeping circuits compact.
Super interesting! Have you tested how LAP handles more diverse paraphrasing? For example, do you think it would also work for code functions with similar roles?
New preprint!
Ever wonder whether verbalized CoTs correspond to the internal reasoning process of the model?
We propose a novel parametric faithfulness approach, which erases information contained in CoT steps from the model parameters to assess CoT faithfulness.
arxiv.org/abs/2502.14829
We usually blame hallucinations on uncertainty or missing knowledge. But what if I told you that LLMs hallucinate even when they *know* the correct answer - and they do it with *high certainty*?
Check out our new paper that challenges assumptions on AI trustworthiness!
GEM is so back! Our workshop for Generation, Evaluation, and Metrics is coming to an ACL near you.
Evaluation in the world of GenAI is more important than ever, so please consider submitting your amazing work.
CfP can be found at gem-benchmark.com/workshop
Why not try the straightforward approach: label high-quality texts and train an LM to classify them? Of course this should be done separately for different types of texts - a great scientific paper ≠ a great novel.
(Similar to how Llama 3 pretraining used quality scores from Llama 2 and RoBERTa)
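A minimal sketch of that straightforward approach, assuming a tf-idf + logistic-regression classifier (an illustration, not what Llama actually used): label a handful of texts as high or low quality and fit a classifier; in practice you would train one per text type, with far more data.

```python
# Toy quality classifier: tf-idf features + logistic regression.
# Texts and labels are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "We present a rigorous evaluation across twelve benchmarks.",  # high quality
    "In this work we derive a closed-form bound on the error.",    # high quality
    "click here free prize winner claim now!!!",                   # low quality
    "buy cheap followers best deal limited offer",                 # low quality
]
labels = [1, 1, 0, 0]  # 1 = high quality, 0 = low quality

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Score an unseen text; higher class means higher estimated quality.
print(clf.predict(["We analyze convergence of the proposed estimator."]))
```

The same pipeline could then score pretraining documents in bulk, keeping only those above a quality threshold.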