Posts by Leshem (Legend) Choshen @EMNLP
Don't be misled by the survival framing: yes, health and climate, but also AI for fairness, freedom, and the world
survivalandflourishing.fund/2026/applica...
I am sure there are tons of other philanthropies like that I don't know about?
Money💶, academic money🧑🏫 to be exact, is almost a mystery🕵️
GPU hours, hidden philanthropies, industry websites, governmental lesser-known funds.
How do you know who wants to fund your research?
Starting a thread for hidden grants, locations to find them, and ideas for a change
alphaxiv.org/abs/2604.13076
Alignment to animals 🐄🐖🐓
Just read a fun paper training/evaluating models to care about animal welfare.
On one hand, it shows we can plug values into these systems.
On the other…
A weirder question: what values are we absorbing as we happily suckle on AI-generated words? 🤖🥛
My search fails to find all the papers I know worked on this before (prompts, not medical). Where are the multi-prompt papers on how to evaluate on multiple prompts (e.g., Polo)? Where are the papers on prompt instability (e.g., DOVE by Habba and SoWA by Mizrahi)?
❤️
With 100K tokens and thinking about thinking about thinking, the Granite model told me happy April Fool's.
Didn't see any jokes this year. Did we become too dry for that?
Would you share something fun in the comments?
Notice the dent?
We did too. Looking into the data, we saw that at this point, models often noticed that they don't have to answer our questions!
Saying things like "why do I bother, it will never work" or "I am a god"
But with more test-time compute they got bored and did it.
The method is quite simple: instead of thinking for N steps, we ask the model to think about what it would have arrived at after >>N steps.
Then you train the model to do that, and recursively ask it to then think about that. It works too!
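A minimal sketch of how I read that recipe (not the paper's actual code; `generate` and `finetune` below are hypothetical stand-ins for an LLM inference call and a fine-tuning routine):

```python
from typing import List, Tuple

# Hypothetical stand-ins, not a real API: plug in your own inference and SFT calls.
def generate(model, prompt: str) -> str:
    raise NotImplementedError("your LLM inference call goes here")

def finetune(model, pairs: List[Tuple[str, str]]):
    raise NotImplementedError("your supervised fine-tuning routine goes here")

def long_think(model, question: str, n_steps: int) -> str:
    # Teacher pass: let the model actually reason for n_steps before answering.
    return generate(model, f"Think step by step for {n_steps} steps, then answer: {question}")

def distillation_pair(model, question: str, n_steps: int) -> Tuple[str, str]:
    # Target: the answer the model reaches after genuinely thinking for n_steps.
    target = long_think(model, question, n_steps)
    # Input: a short prompt asking the model to predict that outcome directly.
    prompt = (f"Without reasoning step by step, what would you conclude "
              f"after {n_steps} steps of thought? {question}")
    return prompt, target

def recursive_thinking_about_thinking(model, questions: List[str], n_steps: int = 32, rounds: int = 3):
    # Each round: distill "think for n_steps" into a direct prediction,
    # then ask the distilled model about an even longer (>> n_steps) horizon.
    for _ in range(rounds):
        pairs = [distillation_pair(model, q, n_steps) for q in questions]
        model = finetune(model, pairs)
        n_steps *= 4  # next round targets a much longer imagined horizon
    return model
```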
Lol
By MIT and IBM
Why inference scale when you can scale scale?
Our thinking about thinking changes scaling completely!
By asking the model to think about what it would have thought for N steps
It breaks benchmark after benchmark
Games, Scientific discovery, proteins...
#AI
And remember, you are at the bottom of the food chain, act accordingly 😈
More tips on rebuttals (and graphics, writing, or LaTeX)
docs.google.com/document/d/1...
Good luck!🍀
What did I miss?
Help busy ACs/reviewers help you
Make your rebuttal skimmable.
State the bottom line early in every paragraph. Repeat the reviewer’s criticism clearly, then your response or change.
Say upfront whether you agree, politely disagree, or offer a small correction.
Clarity saves everyone time.
Do rather than argue.
Evidence is stronger than opinion.
When possible, add an experiment, ablation, citation, or pointer to a section/figure.
If the answer is already in the paper, point to it directly.
Clarify or do — don’t debate.
People respond better when they feel heard.
A useful structure is:
Concern → acknowledgment → clarification → evidence
*Use their own words for more empathy
Example:
“We understand the concern about dataset size. Our dataset is comparable to prior work (X, Y), and we show…”
It’s “peer” review 🤗
Treat reviewers as thoughtful colleagues, not opponents.
Be friendly, take their time seriously, and respond with clarity and respect.
Your paper will improve — and they might actually change their mind when it counts.
Your rebuttal has two audiences:
The reviewer
The Area Chair
Sometimes the reviewer will not change their mind.
Your job is to show that the concern has been addressed clearly enough for the AC to see that. Or (rarely) that the reviewer is unreasonable.
Rebuttal season is here, yey🤞
With many asking me,
I compiled the most common misconceptions
Hope the tips help 🧵
All tips:
docs.google.com/document/d/14Wax8M5w8F_8miDlYJ9-I6wqpelxlXjCEUbkNzNMqqE/edit?tab=t.0#heading=h.rfq27f356vmm
#AI
🤖📈🧠
Finally, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
📝Paper: arxiv.org/abs/2603.16848
🤗Data (900K judgements!): huggingface.co/datasets/ibm...
💻Code: github.com/IBM/Anchor-S...
As demonstrated, a lot of data is being wasted in anchor-based evaluation. Following that, we conduct a power analysis and compute sufficient benchmark sizes, finding that standard benchmarks are too small to distinguish between competitive models reliably.
We propose an 'informativeness' measure for anchors and show that it correlates with anchor quality.
For example, best-performing and worst-performing models make poor anchors. Because these extreme anchors consistently outscore or lag behind the rest, they offer little insight into how the other models compare.
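To make that intuition concrete, here is a toy calculation of my own (explicitly not the paper's informativeness measure): an anchor that almost always wins or almost always loses yields near-constant comparison outcomes, so it barely separates the candidates.

```python
import numpy as np

def outcome_entropy(win_rates):
    """Average binary entropy (bits) of candidate-vs-anchor outcomes."""
    p = np.clip(np.asarray(win_rates, dtype=float), 1e-9, 1 - 1e-9)
    return float(np.mean(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))

# Hypothetical win rates of four candidate models against three different anchors:
mid_anchor    = [0.35, 0.45, 0.55, 0.65]  # mid-strength anchor: outcomes vary across candidates
top_anchor    = [0.02, 0.03, 0.05, 0.04]  # best model as anchor: candidates almost never win
bottom_anchor = [0.97, 0.96, 0.98, 0.99]  # worst model as anchor: candidates almost always win

print(outcome_entropy(mid_anchor))     # high entropy: comparisons carry information
print(outcome_entropy(top_anchor))     # low entropy: little signal for ranking candidates
print(outcome_entropy(bottom_anchor))  # low entropy: little signal for ranking candidates
```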
However, the choice of anchor is critical.
Our experiments show that *a poor anchor can dramatically reduce correlation with human rankings*, making the evaluation unreliable.
To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single *anchor*.
The main drawback of full pairwise comparison is that the cost of evaluation grows fast.
Specifically, as the number of evaluated models increases, the number of model pairs to compare grows quadratically.
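The arithmetic behind that, as a quick back-of-the-envelope sketch (counts of model pairs per prompt, assuming the anchor is itself one of the evaluated models):

```python
def full_pairwise(n_models: int) -> int:
    # Every model is judged against every other model.
    return n_models * (n_models - 1) // 2

def anchor_based(n_models: int) -> int:
    # Every model is judged only against the single anchor.
    return n_models - 1

for n in (10, 50, 200):
    print(n, full_pairwise(n), anchor_based(n))
# 10  ->    45 pairs vs   9
# 50  ->  1225 pairs vs  49
# 200 -> 19900 pairs vs 199
```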
The ``LLM-as-a-judge’’ (LMJ ) paradigm has become a standard method for evaluating open-ended generation.
A primary setting for LMJ is pairwise comparisons, where we ask whether model A's response is better than B's or vice versa
Do you run pairwise evaluation?
Do you test your models on the Arena-Hard and AlpacaEval benchmarks?
You probably want to read this 🧵👇
arxiv.org/abs/2603.16848
By Shachar Don-Yehiya, me, and Omri Abend
🤖📈🧠 #AI