Posts by Danny To Eun Kim
🎉 Excited to share a new agent evaluation perspective with our newly accepted #SIGIR2026 paper:
"Evaluation of Agents under Simulated AI Marketplace Dynamics."
arxiv.org/abs/2604.14256
@acmsigir.bsky.social
We argue that the dominant evaluation paradigms — Static Benchmarking and Arena — are missing something fundamental: competition and evolving user preferences.
👨‍💻 In the real world, users (and agents too!) don't always query a single fixed system. They choose between competing providers, switch based on experience, and drive market-share dynamics over time.
❌ Yet our benchmarks still treat systems as if they operate in isolation.
One core question our framework lets you ask:
❓Will a model you deploy survive and get enough user traffic in a competitive AI market?
To motivate this, we simulated two market-entry scenarios — introducing a new model mid-simulation into an existing marketplace:
⚡(1) A strong model entering a lightly concentrated market. When the strong model joined, it captured market share far beyond what its benchmark scores alone would proportionally justify. A competitive market amplifies quality signals in ways static evaluation cannot capture.
🔒(2) An average model entering a highly concentrated market. When it entered, it struggled to achieve the market share you'd expect from its benchmark scores alone. Users had already formed strong preferences, which an average model takes a long time to shift.
🤔 No static benchmark would have predicted either outcome.
Our framework reconceptualizes evaluation by placing systems as participants in a competitive AI marketplace.
Instead of accuracy scores, we ask:
Does your system retain users? Does it gain market share? Does it survive competitive pressure? What about its short- and long-term dynamics?
Key Ideas:
🔄Simulation of repeated user–system interactions with evolving preferences
📊Longitudinal metrics: market share, retention, market health, and fairness
📋A research agenda for marketplace simulation, metrics, optimization, and adoption in evaluation campaigns (e.g., TREC)
Amazing collaboration with Alireza Salemi, @hamedzamani.bsky.social, and @841io.bsky.social.
arxiv.org/abs/2604.14256
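To make the marketplace-simulation idea above concrete, here is a minimal toy sketch in Python. It is my illustration, not the paper's code: users pick among competing systems via a softmax over evolving preference scores, a new model enters mid-simulation, and we track market share and per-step retention. All names and numbers (system qualities, learning rate, temperature, entry step) are made up.

import math
import random

random.seed(0)

N_USERS, STEPS, ENTRY_STEP = 1000, 200, 100
quality = {"A": 0.60, "B": 0.55}       # hypothetical incumbent qualities
ENTRANT, ENTRANT_QUALITY = "C", 0.75   # a strong model entering mid-simulation
LR, TEMP = 0.1, 0.1                    # preference learning rate, choice temperature

# each simulated user keeps an evolving preference score per system
prefs = [{s: 0.5 for s in quality} for _ in range(N_USERS)]
last_choice = [None] * N_USERS

def choose(p):
    # softmax choice: systems with higher preference are picked more often
    systems = list(p)
    weights = [math.exp(p[s] / TEMP) for s in systems]
    return random.choices(systems, weights=weights)[0]

for t in range(STEPS):
    if t == ENTRY_STEP:                # the new model joins the marketplace
        quality[ENTRANT] = ENTRANT_QUALITY
        for p in prefs:
            p[ENTRANT] = 0.5           # users start from a neutral prior
    counts = {s: 0 for s in quality}
    retained = 0
    for u in range(N_USERS):
        s = choose(prefs[u])
        counts[s] += 1
        retained += (s == last_choice[u])   # user stuck with last step's pick
        last_choice[u] = s
        # noisy experienced quality nudges the preference: evolving preferences
        experienced = quality[s] + random.gauss(0, 0.1)
        prefs[u][s] += LR * (experienced - prefs[u][s])
    if t % 50 == 0 or t == ENTRY_STEP:
        share = {s: round(c / N_USERS, 2) for s, c in counts.items()}
        print(f"t={t:3d} market share={share} retention={retained / N_USERS:.2f}")

In this toy, the softmax choice rule amplifies small quality gaps into large share gaps once preferences converge, which is the flavor of dynamic the thread describes; none of the constants come from the paper.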
Announcing the release of the NTCIR Tip-of-the-Tongue (ToT) Shared Task participation guidelines: ntcir-tot.github.io/guidelines.
Please reshare and spread the word. #NTCIR2026 #NTCIRToT #NTCIR2026ToT
/1
We tend to conflate "autonomy" with "reliability" in AI agents. But autonomy without trust is catastrophically dangerous.
Our new paper formalizes uncertainty quantification (UQ) for LLM agents and proposes a new lens: agent uncertainty as a conditional uncertainty-reduction process.
📄 huggingface.co/papers/2602....
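The "conditional uncertainty reduction" lens can be illustrated with a toy calculation (my sketch, not the paper's formalism): treat the agent's belief over candidate answers as a distribution, and measure how many bits of entropy an action's observation removes. The distributions below are invented.

import math

def entropy(dist):
    # Shannon entropy (bits) of a {outcome: probability} map
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# hypothetical belief over candidate answers before the agent acts
prior = {"Paris": 0.40, "Lyon": 0.35, "Marseille": 0.25}
# hypothetical belief after a retrieval/tool observation comes back
posterior = {"Paris": 0.90, "Lyon": 0.07, "Marseille": 0.03}

print(f"H(prior)           = {entropy(prior):.3f} bits")
print(f"H(posterior | obs) = {entropy(posterior):.3f} bits")
print(f"reduction          = {entropy(prior) - entropy(posterior):.3f} bits")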
🎭 How do LLMs (mis)represent culture?
🧮 How often?
🧠 Misrepresentations = missing knowledge? spoiler: NO!
At #CHI2026 we are bringing ✨TALES✨ a participatory evaluation of cultural (mis)reps & knowledge in multilingual LLM-stories for India
📜 arxiv.org/abs/2511.21322
1/10
#ChatGPT has begun putting ads in its responses.
Check out our paper on "how fair ranking can positively impact the LLM response and content/ad exposure".
dl.acm.org/doi/10.1145/...
#ChatGPT has begun putting ads in its responses.
Check out our paper on "Ads detection and integration in the era of LLMs".
ceur-ws.org/Vol-4038/pap...
As AI increasingly supports shopping and ads, it's worth remembering that retrieval often shapes who gets exposure in the final generated output. In a recent paper, @teknology.bsky.social uses methods from fair ranking to assess and address exposure bias in downstream generation.
841.io/doc/fairrag....
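A rough sketch of what the fair-ranking lens measures (my illustration, not the paper's method): model user attention with a log position discount and compare how much normalized exposure each content provider gets under different rankings. The provider names and rankings are made up.

import math
from collections import defaultdict

def expected_exposure(ranking):
    # attention modeled with a log position discount, as in DCG-style exposure
    exposure = defaultdict(float)
    for rank, provider in enumerate(ranking, start=1):
        exposure[provider] += 1.0 / math.log2(rank + 1)
    total = sum(exposure.values())
    return {p: round(e / total, 2) for p, e in exposure.items()}

# two hypothetical rankings over documents from three (made-up) ad providers
top_heavy  = ["acme", "acme", "acme", "globex", "initech"]
reshuffled = ["acme", "globex", "initech", "acme", "acme"]

print("top-heavy: ", expected_exposure(top_heavy))    # acme dominates exposure
print("reshuffled:", expected_exposure(reshuffled))   # same docs, fairer split

In a RAG pipeline this matters because what sits at the top of the retrieved list is also what the generator tends to cite and surface.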
[Image: advertisement generation and detection in RAG]
Excited to present at #CLEF2025 #Touché Lab (Session 2) shared task "Advertisement in RAG"🇪🇸!
@webis.de
🗓️Sept 9 (Tue)
⏲️5:20 PM (CEST) / 11:20 AM (EDT)
📍Florentino Sanz Room
🧠https://arxiv.org/abs/2507.00509
Join us for insights on #RAG + advertising!
Some exciting news! 🤗 After 3 amazing years at TREC, the Tip-of-the-Tongue (ToT) shared task will be a core task at NTCIR-19 in 2026. The new track will focus on tip-of-the-tongue information needs in English and East Asian languages.
More details coming soon. See you all in Tokyo next year!
Gentle reminder 📢
All run submissions for the Tip-of-the-Tongue (ToT) Track are due next Wednesday (Aug 27).
More info: trec-tot.github.io/guidelines
#TREC2025 #TRECToT #TREC2025ToT
This year's TREC Tip-of-the-Tongue (ToT) track will be amazing! Based on our rigorous experiments on synthetic ToT query generation presented at #SIGIR2025, we extended the track to open-domain ToT queries.
We provide code for baseline systems, and submissions are due by August 27!
To Eun Kim just presented the work on "Tip of the Tongue Query Elicitation for Simulated Evaluation" at #SIGIR2025. The approach will be used in the #TREC2025 Tip-of-the-Tongue track, and we had some sweets at the poster :)
The paper is available online: dl.acm.org/doi/10.1145/...
Hello TREC-ToTers!
We have released the test queries for the TREC 2025 Tip-of-the-Tongue (TREC-ToT) Track. Please see the guidelines for more information: trec-tot.github.io/guidelines. The run submission deadline will tentatively be in August. #TREC2025 #TRECToT #TREC2025ToT
Please spread the word!
❓How do LLMs respond to fair ranking in RAG?
🤩 See how fair ranking boosts downstream utility while promoting fairer attribution of cited sources.
Catch our oral presentation at #ICTIR2025!
#SIGIR2025 @841io.bsky.social
[Image: Dory from Finding Nemo with the quote: "I remember it like it was yesterday. Of course, I don't remember yesterday."]
Do not forget to participate in the #TREC2025 Tip-of-the-Tongue (ToT) Track :)
The corpus and baselines (with run files) are now available and easily accessible via the ir_datasets API and the HuggingFace Datasets API.
More details are available at: trec-tot.github.io/guidelines
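If you want to try it, loading via ir_datasets looks roughly like this. The dataset identifier below follows the track's naming pattern and is my guess; check trec-tot.github.io/guidelines for the real IDs (likewise for the HuggingFace repository name).

import ir_datasets

# illustrative dataset ID; confirm the real one in the track guidelines
dataset = ir_datasets.load("trec-tot/2025")

# queries and docs come back as namedtuples; print one of each to see the fields
print(next(dataset.queries_iter()))
print(next(dataset.docs_iter()))

# the corpus is also on the HuggingFace Hub; the repo name below is a placeholder
# from datasets import load_dataset
# corpus = load_dataset("trec-tot/corpus-2025")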
[Image: an overview figure for "Research Borderlands: Analysing Writing Across Research Cultures" by Shaily Bhatt, Tal August, and Maria Antoniak. The authors survey and interview interdisciplinary researchers (§3) to develop a framework of writing norms that vary across research cultures (§4) and operationalise them using computational metrics (§5). They then use this evaluation suite for two large-scale quantitative analyses: (a) surfacing variations in writing across 11 communities (§6); (b) evaluating the cultural competence of LLMs when adapting writing from one community to another (§7).]
🖋️ Curious how writing differs across (research) cultures?
🚩 Tired of “cultural” evals that don't consult people?
We engaged with interdisciplinary researchers to identify & measure ✨cultural norms✨in scientific writing, and show that❗LLMs flatten them❗
📜 arxiv.org/abs/2506.00784
[1/11]
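For a flavour of what "operationalising norms as computational metrics" can look like, here is a toy example (mine, not the paper's metric suite): two surface statistics, hedging rate and sentence-length variation, that are commonly used in scientific-writing analysis. The hedge list is illustrative only.

import re
import statistics

# a few hedging cues often studied in scientific-writing analyses (illustrative)
HEDGES = {"may", "might", "could", "possibly", "perhaps", "suggest", "suggests", "likely"}

def style_metrics(text):
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    lengths = [len(s.split()) for s in sentences]
    words = [w.lower().strip(".,;:") for w in text.split()]
    return {
        "avg_sentence_len": statistics.fmean(lengths),
        "sentence_len_stdev": statistics.pstdev(lengths),
        "hedge_rate": sum(w in HEDGES for w in words) / len(words),
    }

sample = ("Our results suggest the effect may be robust. "
          "We evaluate on three datasets. Performance is strong.")
print(style_metrics(sample))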
Hello TREC-ToTers! 👋🏽
Excited to announce the release of the TREC 2025 Tip-of-the-Tongue (TREC-ToT) Track guidelines: trec-tot.github.io/guidelines. We will release test queries in July, and the run submission deadline will be in August. #TREC2025 #TRECToT #TREC2025ToT
Please register to participate:
Related paper here!
bsky.app/profile/841i...
Ever trusted a metric that works great on average, only for it to fail in your specific use case?
In our #NAACL2025 paper (w/ @841io.bsky.social), we show why global evaluations are not enough and why context matters more than you think.
📄 aclanthology.org/2025.finding...
#NLP #Evaluation
(🧵1/9)
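One way to see why a globally validated metric can fail in context (a made-up illustration, not the paper's data): pooled correlation with human judgments can be strongly positive even when the within-domain correlation is negative, a Simpson's-paradox-style reversal.

import math
import statistics

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# invented scores: within each domain the metric tracks humans *negatively*,
# but the "medical" domain sits higher than "news" on both axes
metric = {"news": [0.1, 0.2, 0.3], "medical": [0.7, 0.8, 0.9]}
human  = {"news": [0.2, 0.15, 0.1], "medical": [0.9, 0.85, 0.8]}

pooled_m = metric["news"] + metric["medical"]
pooled_h = human["news"] + human["medical"]
print(f"pooled correlation: {pearson(pooled_m, pooled_h):.2f}")       # looks great
for domain in metric:
    print(f"{domain}: {pearson(metric[domain], human[domain]):.2f}")  # fails in context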
If you're interested in OpenAI including shopping results, you might also be interested in @teknology.bsky.social's paper relating retrieval diversity/fairness and generation by downstream RAG models. This has implications for individuals selling products online.
arxiv.org/abs/2409.11598