
Posts by Leshem (Legend) Choshen @EMNLP

Preview
AI Safety Grants & Funding Discover AI safety grants, fellowships, and funding opportunities from top funders including Open Philanthropy, NSF, DARPA, and more.

A long list of security-focused ones: aisecurityandsafety.org/en/grants/

2 days ago 0 0 0 0
Preview
SFF-2026 S-Process Grant Round Application Announcement | Survival and Flourishing Fund

Don't be misled by the survival framing: yes, health and climate, but also AI for fairness, freedom, and the world
survivalandflourishing.fund/2026/applica...
I'm sure there are tons of other philanthropies like this that I don't know about?

2 days ago 1 0 1 0
How to apply for and get compute grants (for students) · Idle Words

Compute (GPU) funds
nightingal3.github.io/blog/2026/04...

2 days ago 0 0 1 0
Post image

Money💶, academic money🧑‍🏫 to be exact, is almost a mystery🕵️
GPU hours, hidden philanthropies, industry websites, lesser-known governmental funds.

How do you know who wants to fund your research?
Starting a thread for hidden grants, where to find them, and ideas for change

2 days ago 2 0 1 0

alphaxiv.org/abs/2604.13076

4 days ago 0 0 0 0
Post image

Alignment to animals 🐄🐖🐓

Just read a fun paper training/evaluating models to care about animal welfare.

On one hand, it shows we can plug values into these systems.

On the other…
A weirder question: what values are we absorbing as we happily suckle on AI-generated words? 🤖🥛

4 days ago 2 0 1 0
Robustness as an Emergent Property of Task Performance Robustness is often regarded as a critical future challenge for real-world applications, where stability is essential. However, as models often learn tasks in a similar order, we hypothesize that easier tasks will be easier regardless of how they are presented to the model. Indeed, in this paper, we show that as models approach high performance on a task, robustness is effectively achieved. Through an empirical analysis of multiple models across diverse datasets and configurations (e.g., paraphrases, different temperatures), we find a strong positive correlation. Moreover, we find that robustness is primarily driven by task-specific competence rather than inherent model-level properties, challenging current approaches that treat robustness as an independent capability. Thus, from a high-level perspective, we may expect that as new tasks saturate, model robustness on these tasks will emerge accordingly. For researchers, this implies that explicit efforts to measure and improve robustness may warrant reduced emphasis, as such robustness is likely to develop alongside performance gains. For practitioners, it acts as a sign that indeed the tasks that the literature deals with are unreliable, but on easier past tasks, the models are reliable and ready for real‑world deployment.

Did you check it on easy questions, which are shown here to matter a lot?
arxiv.org/html/2602.03...

1 week ago 0 0 1 0

My search failed to find the papers I know worked on this before (prompts, not medical). Where are the multi-prompt papers on how to evaluate over multiple prompts (e.g., Polo)? Where are the papers on prompt instability (e.g., DOVE by Habna and SOWA by Mizrahi)?

1 week ago 0 0 1 0

❤️

2 weeks ago 0 0 0 0

With 100K tokens and thinking about thinking about thinking, the Granite model wished me a happy April Fools'.

Didn't see any jokes this year. Did we become too dry for that?
Would you share something fun in the comments?

2 weeks ago 3 0 0 0

Notice the dent?

We did too. Looking into the data, we saw that at this point models often noticed that they don't have to answer our questions!
They said things like "why do I bother, it will never work" or "I am a god".
But with more test-time compute they got bored and did it anyway.

2 weeks ago 3 0 1 1

The method is quite simple: instead of thinking for N steps, we ask the model to predict what it would have arrived at after >>N steps.
Then we train the model to do that, and recursively ask it to think about that prediction. It works too!
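
A minimal sketch of how that recursive step could look (my reading of the thread, not the paper's actual recipe; the prompt wording and the `generate` call are assumptions):

```python
# Hedged sketch: ask the model to predict the conclusion it would reach
# after many more reasoning steps, then feed that prediction back in.
# `generate(prompt) -> str` stands in for any LLM call (assumption).

def meta_think(generate, question, rounds=3, horizon=1000):
    answer = generate(f"Think briefly and answer: {question}")
    for _ in range(rounds):
        prompt = (
            f"Question: {question}\n"
            f"Your current conclusion: {answer}\n"
            f"Predict the conclusion you would reach after roughly {horizon} "
            f"more reasoning steps. Give only that conclusion."
        )
        answer = generate(prompt)  # each round targets a deeper, imagined computation
    return answer
```

Training would then, presumably, fine-tune on these (question, predicted deep conclusion) pairs so the "deep" answer comes out in a single pass.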

2 weeks ago 2 0 1 0
Lol

By MIT and IBM
Why inference scale when you can scale scale?

Our thinking about thinking changes scaling completely!

By asking the model to think about what it would have thought for N steps

It breaks benchmark after benchmark:
Games, scientific discovery, proteins...
#AI

2 weeks ago 4 0 1 0
Preview
Ever Growing Academic Writing: Call for Collaboration. This aims to help academic writers. If you have any additions or corrections, please add them or comment...

And remember, you are at the bottom of the food chain, act accordingly 😈

More tips on rebuttals (and graphics, writing, or LaTeX)
docs.google.com/document/d/1...

Good luck!🍀
What did I miss?

3 weeks ago 0 0 0 0

Help busy ACs/reviewers help you
Make your rebuttal skimmable.

State the bottom line early in every paragraph. Repeat the reviewer’s criticism clearly, then your response or change.
Say upfront whether you agree, politely disagree, or offer a slight correction.
Clarity saves everyone time.

3 weeks ago 0 0 1 0

Do rather than argue.
Evidence is stronger than opinion.
When possible, add an experiment, ablation, citation, or pointer to a section/figure.
If the answer is already in the paper, point to it directly.
Clarify or do — don’t debate.

3 weeks ago 0 0 1 0

People respond better when they feel heard.
A useful structure is:
Concern → acknowledgment → clarification → evidence
*Use their own words for more empathy

Example:
“We understand the concern about dataset size. Our dataset is comparable to prior work (X, Y), and we show…”

3 weeks ago 0 0 1 0

It’s “peer” review 🤗
Treat reviewers as thoughtful colleagues, not opponents.
Be friendly, take their time seriously, and respond with clarity and respect.
Your paper will improve — and they might actually change their mind when it counts.

3 weeks ago 0 0 1 0

Your rebuttal has two audiences:

The reviewer
The Area Chair

Sometimes the reviewer will not change their mind.
Your job is to show that the concern has been addressed clearly enough for the AC to see that. Or (rarely) that the reviewer is unreasonable.

3 weeks ago 0 0 1 0
Post image

Rebuttal season is here, yay 🤞

With many asking me,
I compiled the most common misconceptions

Hope the tips help 🧵

All tips:
docs.google.com/document/d/14Wax8M5w8F_8miDlYJ9-I6wqpelxlXjCEUbkNzNMqqE/edit?tab=t.0#heading=h.rfq27f356vmm
#AI
🤖📈🧠

3 weeks ago 5 2 1 0
Preview
Mediocrity is the key for LLM as a Judge Anchor Selection The ``LLM-as-a-judge'' paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Har...

Finally, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.

📝Paper: arxiv.org/abs/2603.16848
🤗Data (900K judgements!): huggingface.co/datasets/ibm...
💻Code: github.com/IBM/Anchor-S...

4 weeks ago 0 0 0 0
Post image

As demonstrated, a lot of data is wasted in anchor-based evaluation. Following that, we conduct a power analysis and compute sufficient benchmark sizes, finding that standard benchmark sizes are insufficient and fail to reliably distinguish between competitive models.
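
For intuition, here is a minimal power-analysis sketch in the same spirit (illustrative numbers and the textbook normal-approximation formula, not the paper's exact procedure):

```python
# Hedged sketch: how many prompts are needed to tell a model that wins
# 53% of pairwise judgements apart from a coin flip (50%)?
# Standard one-sample proportion power formula; numbers are illustrative.
from math import ceil, sqrt
from statistics import NormalDist

def required_benchmark_size(p_win=0.53, alpha=0.05, power=0.8):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p0 = 0.5
    n = (z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p_win * (1 - p_win))) ** 2 / (p_win - p0) ** 2
    return ceil(n)

print(required_benchmark_size())  # ≈ 2,180 prompts for a 3-point edge
```

Benchmarks with only a few hundred prompts sit well below that, in line with the point above.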

4 weeks ago 0 0 1 0
Post image

We propose an `informativeness’ measure for anchors and show that it correlates with anchor quality.

4 weeks ago 0 0 1 0
Post image

For example, best-performing and worst-performing models make poor anchors. Because these extreme anchors consistently outscore or lag behind the rest, they offer little insight into how the other models compare.
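
A toy sketch of that intuition (my reading; the paper's actual informativeness measure may differ): prefer the anchor whose win rates against the pool are least one-sided.

```python
# Hedged sketch: score candidate anchors by how one-sided their judgements are.
# win_rates[anchor][model] = fraction of prompts where the anchor beats the model.
# Data and the scoring rule are illustrative assumptions.

def pick_anchor(win_rates):
    def one_sidedness(anchor):
        rates = win_rates[anchor].values()
        return sum(abs(r - 0.5) for r in rates) / len(rates)
    return min(win_rates, key=one_sidedness)  # most "mediocre" = least one-sided

win_rates = {
    "best":  {"m1": 0.95, "m2": 0.92, "m3": 0.97},  # beats everyone -> tells us little
    "worst": {"m1": 0.04, "m2": 0.06, "m3": 0.03},  # loses to everyone -> tells us little
    "mid":   {"m1": 0.35, "m2": 0.55, "m3": 0.62},  # splits the pool -> informative
}
print(pick_anchor(win_rates))  # -> "mid"
```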

4 weeks ago 0 0 1 0
Post image

However, the choice of anchor is critical.
Our experiments show that *a poor anchor can dramatically reduce correlation with human rankings*, making the evaluation unreliable.

4 weeks ago 0 0 1 0
Post image

To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single *anchor*.

4 weeks ago 0 0 1 0
Post image

The main drawback of this approach is that the cost of evaluation grows fast.

Specifically, as the number of evaluated models increases, the number of model pairs to compare grows quadratically.
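
To make the arithmetic concrete (a minimal sketch, not from the paper):

```python
# All-pairs comparisons grow quadratically; a single anchor grows linearly.
def all_pairs(n_models: int) -> int:
    return n_models * (n_models - 1) // 2

def anchored(n_models: int) -> int:
    return n_models - 1  # every model vs. one fixed anchor

for n in (10, 50, 200):
    print(n, all_pairs(n), anchored(n))
# 10 models: 45 vs 9; 50 models: 1,225 vs 49; 200 models: 19,900 vs 199 (per prompt)
```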

4 weeks ago 0 0 1 0
Post image

The ``LLM-as-a-judge’’ (LMJ) paradigm has become a standard method for evaluating open-ended generation.
A primary setting for LMJ is pairwise comparison, where we ask whether model A's response is better than B's or vice versa.

4 weeks ago 1 0 1 0
Post image

Do you run pairwise evaluation?
Do you test your models on the Arena-Hard and AlpacaEval benchmarks?

You probably want to read this 🧵👇

arxiv.org/abs/2603.16848
By Shachar Don Yehia, me and Omri Abend
🤖📈🧠 #AI

4 weeks ago 5 0 1 0
Preview
Large Reasoning Models Struggle to Transfer Parametric Knowledge Across Scripts In this work, we analyze shortcomings in cross-lingual knowledge transfer in large, modern reasoning LLMs. We demonstrate that the perceived gap in knowledge transfer is primarily a script barrier. Fi...

Finally, they show that prompting or training (SFT) the model to think about the right entities across languages improves scores and closes some of the language gap (a sketch of the prompting idea follows below).

arxiv.org/abs/2603.17070
@lucasbandarkar.bsky.social Alan Ansell & Trevor Cohn
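
A minimal sketch of what the prompting variant could look like (the wording and the `generate` call are my assumptions, not the paper's prompt):

```python
# Hedged sketch: nudge the model to first name the key entities in a script
# it knows them in (here English/Latin), then answer in the question's language.

def cross_script_prompt(question: str, question_lang: str) -> str:
    return (
        f"Question ({question_lang}): {question}\n"
        "Step 1: List the key entities in the question and write each in English (Latin script).\n"
        "Step 2: Recall what you know about those entities.\n"
        f"Step 3: Answer the original question in {question_lang}."
    )

# answer = generate(cross_script_prompt("Кто изобрёл телефон?", "Russian"))
```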

4 weeks ago 1 0 0 0