Posts by Danny To Eun Kim
🎉 Excited to share a new agent evaluation perspective with our newly accepted #SIGIR2026 paper:
"Evaluation of Agents under Simulated AI Marketplace Dynamics."
arxiv.org/abs/2604.14256
@acmsigir.bsky.social
We argue that the dominant evaluation paradigms — Static Benchmarking and Arena — are missing something fundamental: competition and evolving user preferences.
👨‍💻 In the real world, users (and agents too!) don't always query a single fixed system. They choose between competing providers, switch based on experience, and drive market-share dynamics over time.
❌ Yet our benchmarks still treat systems as if they operate in isolation.
One core question our framework lets you ask:
❓Will a model you deploy survive and get enough user traffic in a competitive AI market?
To motivate this, we simulated two market-entry scenarios — introducing a new model mid-simulation into an existing marketplace:
⚡(1) A strong model entering a lightly concentrated market. When the strong model joined, it captured market share far beyond what its benchmark scores alone would proportionally justify. A competitive market amplifies quality signals in ways static evaluation cannot capture.
🔒(2) An average model entering a highly concentrated market. When it entered, it struggled to achieve the market share you'd expect from its benchmark scores alone. Users had already formed strong preferences, which an average model takes a long time to shift.
🤔 No static benchmark would have predicted either outcome.
Our framework reconceptualizes evaluation by placing systems as participants in a competitive AI marketplace.
Instead of accuracy scores, we ask:
Does your system retain users? Does it gain market share? Does it survive competitive pressure? What about its short- and long-term dynamics?
Key Ideas:
🔄Simulation of repeated user–system interactions with evolving preferences
📊Longitudinal metrics: market share, retention, market health, and fairness
📋A research agenda for marketplace simulation, metrics, optimization, and adoption in evaluation campaigns (e.g., TREC)
Amazing collaboration with Alireza Salemi, @hamedzamani.bsky.social, and @841io.bsky.social.
arxiv.org/abs/2604.14256
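To make the marketplace-simulation idea above concrete, here is a minimal toy sketch in Python. It is my illustration, not the paper's code: users pick among competing systems via a softmax over evolving preference scores, a new model enters mid-simulation, and we track market share and per-step retention. All names and numbers (system qualities, learning rate, temperature, entry step) are made up.

import math
import random

random.seed(0)

N_USERS, STEPS, ENTRY_STEP = 1000, 200, 100
quality = {"A": 0.60, "B": 0.55}       # hypothetical incumbent qualities
ENTRANT, ENTRANT_QUALITY = "C", 0.75   # a strong model entering mid-simulation
LR, TEMP = 0.1, 0.1                    # preference learning rate, choice temperature

# each simulated user keeps an evolving preference score per system
prefs = [{s: 0.5 for s in quality} for _ in range(N_USERS)]
last_choice = [None] * N_USERS

def choose(p):
    # softmax choice: systems with higher preference are picked more often
    systems = list(p)
    weights = [math.exp(p[s] / TEMP) for s in systems]
    return random.choices(systems, weights=weights)[0]

for t in range(STEPS):
    if t == ENTRY_STEP:                # the new model joins the marketplace
        quality[ENTRANT] = ENTRANT_QUALITY
        for p in prefs:
            p[ENTRANT] = 0.5           # users start from a neutral prior
    counts = {s: 0 for s in quality}
    retained = 0
    for u in range(N_USERS):
        s = choose(prefs[u])
        counts[s] += 1
        retained += (s == last_choice[u])   # user stuck with last step's pick
        last_choice[u] = s
        # noisy experienced quality nudges the preference: evolving preferences
        experienced = quality[s] + random.gauss(0, 0.1)
        prefs[u][s] += LR * (experienced - prefs[u][s])
    if t % 50 == 0 or t == ENTRY_STEP:
        share = {s: round(c / N_USERS, 2) for s, c in counts.items()}
        print(f"t={t:3d} market share={share} retention={retained / N_USERS:.2f}")

In this toy, the softmax choice rule amplifies small quality gaps into large share gaps once preferences converge, which is the flavor of dynamic the thread describes; none of the constants come from the paper.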
Announcing the release of the NTCIR Tip-of-the-Tongue (ToT) Shared Task participation guidelines: ntcir-tot.github.io/guidelines.
Please reshare and spread the word. #NTCIR2026 #NTCIRToT #NTCIR2026ToT
/1
We tend to conflate "autonomy" with "reliability" in AI agents. But autonomy without trust is catastrophically dangerous.
Our new paper formalizes uncertainty quantification (UQ) for LLM agents and proposes a new lens: agent uncertainty as a conditional uncertainty-reduction process.
📄 huggingface.co/papers/2602....
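The "conditional uncertainty reduction" lens can be illustrated with a toy calculation (my sketch, not the paper's formalism): treat the agent's belief over candidate answers as a distribution, and measure how many bits of entropy an action's observation removes. The distributions below are invented.

import math

def entropy(dist):
    # Shannon entropy (bits) of a {outcome: probability} map
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# hypothetical belief over candidate answers before the agent acts
prior = {"Paris": 0.40, "Lyon": 0.35, "Marseille": 0.25}
# hypothetical belief after a retrieval/tool observation comes back
posterior = {"Paris": 0.90, "Lyon": 0.07, "Marseille": 0.03}

print(f"H(prior)           = {entropy(prior):.3f} bits")
print(f"H(posterior | obs) = {entropy(posterior):.3f} bits")
print(f"reduction          = {entropy(prior) - entropy(posterior):.3f} bits")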
🎭 How do LLMs (mis)represent culture?
🧮 How often?
🧠 Misrepresentations = missing knowledge? spoiler: NO!
At #CHI2026 we are bringing ✨TALES✨ a participatory evaluation of cultural (mis)reps & knowledge in multilingual LLM-stories for India
📜 arxiv.org/abs/2511.21322
1/10
#ChatGPT has begun putting ads in its responses.
Check out our paper on "how fair ranking can positively impact the LLM response and content/ad exposure".
dl.acm.org/doi/10.1145/...
#ChatGPT has begun putting ads in its responses.
Check out our paper on "Ads detection and integration in the era of LLMs".
ceur-ws.org/Vol-4038/pap...
As AI increasingly supports shopping and ads, it's worth remembering that retrieval often shapes who gets exposure in the final generated output. In a recent paper, @teknology.bsky.social uses methods from fair ranking to assess and address exposure bias in downstream generation.
841.io/doc/fairrag....
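A rough sketch of what the fair-ranking lens measures (my illustration, not the paper's method): model user attention with a log position discount and compare how much normalized exposure each content provider gets under different rankings. The provider names and rankings are made up.

import math
from collections import defaultdict

def expected_exposure(ranking):
    # attention modeled with a log position discount, as in DCG-style exposure
    exposure = defaultdict(float)
    for rank, provider in enumerate(ranking, start=1):
        exposure[provider] += 1.0 / math.log2(rank + 1)
    total = sum(exposure.values())
    return {p: round(e / total, 2) for p, e in exposure.items()}

# two hypothetical rankings over documents from three (made-up) ad providers
top_heavy  = ["acme", "acme", "acme", "globex", "initech"]
reshuffled = ["acme", "globex", "initech", "acme", "acme"]

print("top-heavy: ", expected_exposure(top_heavy))    # acme dominates exposure
print("reshuffled:", expected_exposure(reshuffled))   # same docs, fairer split

In a RAG pipeline this matters because what sits at the top of the retrieved list is also what the generator tends to cite and surface.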
[Image: advertisement generation and detection in RAG]
Excited to present at #CLEF2025 #Touché Lab (Session 2) shared task "Advertisement in RAG"🇪🇸!
@webis.de
🗓️Sept 9 (Tue)
⏲️5:20 PM (CEST) / 11:20 AM (EDT)
📍Florentino Sanz Room
🧠https://arxiv.org/abs/2507.00509
Join us for insights on #RAG + advertising!
Some exciting news! 🤗 After 3 amazing years at TREC, the Tip-of-the-Tongue (ToT) shared task will be a core task at NTCIR-19 in 2026. The new track will focus on tip-of-the-tongue information needs in English and East Asian languages.
More details coming soon. See you all in Tokyo next year!
Gentle reminder 📢
All run submissions for the Tip-of-the-Tongue (ToT) Track are due next Wednesday (Aug 27).
More info: trec-tot.github.io/guidelines
#TREC2025 #TRECToT #TREC2025ToT
This year's TREC Tip-of-the-Tongue (ToT) track will be amazing! Based on our rigorous experiments on synthetic ToT query generation presented at #SIGIR2025, we extended the track to open-domain ToT queries.
We provide code for baseline systems, and submissions are due by August 27!
To Eun Kim just presented the work on "Tip of the Tongue Query Elicitation for Simulated Evaluation" at #SIGIR2025. The approach will be used in the #TREC2025 Tip-of-the-Tongue track, and we had some sweets at the poster :)
The paper is available online: dl.acm.org/doi/10.1145/...
Hello TREC-ToTers!
We have released the test queries for the TREC 2025 Tip-of-the-Tongue (TREC-ToT) Track. Please see the guidelines for more information: trec-tot.github.io/guidelines. The run submission deadline will tentatively be in August. #TREC2025 #TRECToT #TREC2025ToT
Please spread the word!
❓How do LLMs respond to fair ranking in RAG?
🤩 See how fair ranking boosts downstream utility while promoting fairer attribution of cited sources.
Catch our oral presentation at #ICTIR2025!
#SIGIR2025 @841io.bsky.social
[Image: Dory from Finding Nemo with the quote: "I remember it like it was yesterday. Of course, I don't remember yesterday."]
Do not forget to participate in the #TREC2025 Tip-of-the-Tongue (ToT) Track :)
The corpus and baselines (with run files) are now available and easily accessible via the ir_datasets API and the HuggingFace Datasets API.
More details are available at: trec-tot.github.io/guidelines
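If you want to try it, loading via ir_datasets looks roughly like this. The dataset identifier below follows the track's naming pattern and is my guess; check trec-tot.github.io/guidelines for the real IDs (likewise for the HuggingFace repository name).

import ir_datasets

# illustrative dataset ID; confirm the real one in the track guidelines
dataset = ir_datasets.load("trec-tot/2025")

# queries and docs come back as namedtuples; print one of each to see the fields
print(next(dataset.queries_iter()))
print(next(dataset.docs_iter()))

# the corpus is also on the HuggingFace Hub; the repo name below is a placeholder
# from datasets import load_dataset
# corpus = load_dataset("trec-tot/corpus-2025")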
[Image: an overview figure for "Research Borderlands: Analysing Writing Across Research Cultures" by Shaily Bhatt, Tal August, and Maria Antoniak. The authors survey and interview interdisciplinary researchers (§3) to develop a framework of writing norms that vary across research cultures (§4) and operationalise them using computational metrics (§5). They then use this evaluation suite for two large-scale quantitative analyses: (a) surfacing variations in writing across 11 communities (§6); (b) evaluating the cultural competence of LLMs when adapting writing from one community to another (§7).]
🖋️ Curious how writing differs across (research) cultures?
🚩 Tired of “cultural” evals that don't consult people?
We engaged with interdisciplinary researchers to identify & measure ✨cultural norms✨in scientific writing, and show that❗LLMs flatten them❗
📜 arxiv.org/abs/2506.00784
[1/11]
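For a flavour of what "operationalising norms as computational metrics" can look like, here is a toy example (mine, not the paper's metric suite): two surface statistics, hedging rate and sentence-length variation, that are commonly used in scientific-writing analysis. The hedge list is illustrative only.

import re
import statistics

# a few hedging cues often studied in scientific-writing analyses (illustrative)
HEDGES = {"may", "might", "could", "possibly", "perhaps", "suggest", "suggests", "likely"}

def style_metrics(text):
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    lengths = [len(s.split()) for s in sentences]
    words = [w.lower().strip(".,;:") for w in text.split()]
    return {
        "avg_sentence_len": statistics.fmean(lengths),
        "sentence_len_stdev": statistics.pstdev(lengths),
        "hedge_rate": sum(w in HEDGES for w in words) / len(words),
    }

sample = ("Our results suggest the effect may be robust. "
          "We evaluate on three datasets. Performance is strong.")
print(style_metrics(sample))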
Hello TREC-ToTers! 👋🏽
Excited to announce the release of the TREC 2025 Tip-of-the-Tongue (TREC-ToT) Track guidelines: trec-tot.github.io/guidelines. We will release test queries in July, and the run submission deadline will be in August. #TREC2025 #TRECToT #TREC2025ToT
Please register to participate:
Related paper here!
bsky.app/profile/841i...
Ever trusted a metric that works great on average, only for it to fail in your specific use case?
In our #NAACL2025 paper (w/ @841io.bsky.social), we show why global evaluations are not enough and why context matters more than you think.
📄 aclanthology.org/2025.finding...
#NLP #Evaluation
(🧵1/9)
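One way to see why a globally validated metric can fail in context (a made-up illustration, not the paper's data): pooled correlation with human judgments can be strongly positive even when the within-domain correlation is negative, a Simpson's-paradox-style reversal.

import math
import statistics

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# invented scores: within each domain the metric tracks humans *negatively*,
# but the "medical" domain sits higher than "news" on both axes
metric = {"news": [0.1, 0.2, 0.3], "medical": [0.7, 0.8, 0.9]}
human  = {"news": [0.2, 0.15, 0.1], "medical": [0.9, 0.85, 0.8]}

pooled_m = metric["news"] + metric["medical"]
pooled_h = human["news"] + human["medical"]
print(f"pooled correlation: {pearson(pooled_m, pooled_h):.2f}")       # looks great
for domain in metric:
    print(f"{domain}: {pearson(metric[domain], human[domain]):.2f}")  # fails in context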
If you're interested in OpenAI including shopping results, you might also be interested in @teknology.bsky.social's paper relating retrieval diversity/fairness and generation by downstream RAG models. This has implications for individuals selling products online.
arxiv.org/abs/2409.11598