Siqi Liu (刘思奇) (@liusiqi) Bsky

📌 Mark Your Calendar: Live Game Arena Event This Monday!

We are releasing two new games, Poker and Werewolf, along with an updated Chess leaderboard next Monday, February 2. The event runs daily from 9:30 AM PT to 11:30 AM PT through February 4.

2 months ago

Research Engineer, Game Theory & Multi-Agent Systems London, UK

We have exciting (and unconventional) stuff cooking, and we are hiring a strong research engineer for the GDM Game Theory team in London.

Consider applying if you are interested in the intersection of game theory, multiagent systems and LLMs!
job-boards.greenhouse.io/deepmind/job...

6 months ago

Joint work with @drimgemp.bsky.social, @lukemarris.bsky.social, Georgios Piliouras, Nicolas Heess and @sharky6000.bsky.social.

1 year ago

Re-evaluating Open-Ended Evaluation of Large Language Models
A case study using the livebench.ai leaderboard.

Frontier models are often compared on crowdsourced user prompts, but those prompts can be low-quality, biased and redundant, making "performance on average" hard to trust.

Come find us at #ICLR2025 to discuss game-theoretic evaluation (shorturl.at/0QtBj)! See you in Singapore!

1 year ago

[🧵1/N] Thrilled to share our work "Re-evaluating Open-Ended Evaluation of Large Language Models"! 🚀 Popular LLM leaderboards (think Elo/Chatbot Arena) are useful, but are they telling the whole story? We find issues w/ redundancy & bias. 🤔
Paper @ ICLR 2025: arxiv.org/abs/2502.20170 #LLM #ICLR2025

1 year ago

🥁Introducing Gemini 2.5, our most intelligent model with impressive capabilities in advanced reasoning and coding.

Now integrating thinking capabilities, 2.5 Pro Experimental is our most performant Gemini model yet. It’s #1 on the LM Arena leaderboard. 🥇

1 year ago

Deviation Ratings: A General, Clone-Invariant Rating Method
Many real-world multi-agent or multi-task evaluation scenarios can be naturally modelled as normal-form games due to inherent strategic (adversarial, cooperative, and mixed motive) interactions. These...

[🧵1/N] Please check out our new paper (arxiv.org/abs/2502.11645) on game-theoretic evaluation. It is the first method that results in clone-invariant ratings in N-player, general-sum interactions. Co-authors: @liusiqi.bsky.social, Ian Gemp, Georgios Piliouras, @sharky6000.bsky.social 🎉

1 year ago
