Indeed, unfortunately politics can't be treated like some vacuum that doesn't interact with anything else- especially when the body in power is stifling scientific research.
That being said, it does feel like there isn't much subject-specific academic discussion going on here, unlike old Twitter.
Posts by LMAnalysis
Yeah, my position is that pure scaling has worked up till now, at least. Hallucinations still happen, but significant progress is being made nevertheless, for instance on reasoning. I do agree that new ideas need to be implemented; TTC (test-time compute) alone isn't enough to address the fundamental problems.
I kind of disagree: pure scaling has resulted in continued improvements in performance where stagnation could have been expected. Just throwing pure compute at Grok resulted in a SOTA LLM, and its problems seem to stem from core flaws of the transformer architecture.
Maybe Titans and LLaDA are on the horizon!
While we don't have many other benchmarks for it yet, Livebench.ai's low-contamination benchmark puts Gemini in second. It isn't quite o1-level for reasoning and language- but it pulls ahead on math and coding.
Gemini might just be the best publicly available general-purpose model right now!
And this story is repeated across the board. In fact, in every single category and language, Gemini is in first place or tied for first! I don't know if there has been such a universally strong general-purpose model release since the original GPT-4. (5)
With a +45-point leap over its predecessor, Gemini takes a commanding +29-point lead for hard prompts.
With style control, this lead shrinks to +19 points, but the leap over the previous Gemini remains at +46 points. Clearly, Google cooked something up with this release! (4)
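For intuition on what these point gaps actually mean: under the Bradley-Terry / Elo model behind arena-style leaderboards, a gap of d points corresponds to an expected head-to-head win rate of 1 / (1 + 10^(-d/400)). A quick sketch (a toy calculation for intuition, not LMArena's actual pipeline):

```python
# Convert an Elo-scale rating gap into an expected head-to-head win rate.
# Toy calculation for intuition; not LMArena's actual pipeline.

def expected_win_rate(gap_points: float) -> float:
    """Probability the higher-rated model wins a single pairwise vote."""
    return 1.0 / (1.0 + 10.0 ** (-gap_points / 400.0))

# Point gaps mentioned in this thread.
for gap in (15, 19, 29, 45):
    print(f"+{gap:>2} points -> {expected_win_rate(gap):.1%} expected win rate")
```

So the +29-point lead translates to winning roughly 54% of head-to-head votes- modest per vote, but a real gap over thousands of battles.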
With style control, the new Gemini's lead expands to +36 over the previous Gemini, though it's now tied with GPT-4o. Sure, this could just be Google figuring out how to bypass LMArena's style control filters. So let's take a look at hard prompts. (3)
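Since style control keeps coming up: as I understand LMArena's approach, they fit the usual Bradley-Terry logistic regression on pairwise votes but add style features (like response-length differences) as extra covariates, so the model coefficients reflect quality with the style effect regressed out. A minimal sketch on made-up data- everything here (model names, skill numbers, the single style feature) is invented for illustration:

```python
# Toy sketch of style-controlled Bradley-Terry ratings, in the spirit of
# LMArena's style control. All data below is synthetic and for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model-a", "model-b", "model-c"]  # hypothetical names
m = len(models)
rng = np.random.default_rng(0)

true_skill = np.array([1.2, 1.0, 1.1])  # assumed latent quality (log-odds units)
style_bias = 0.8                        # assumed voter preference for longer answers

rows, labels = [], []
for _ in range(5000):
    a, b = rng.choice(m, size=2, replace=False)
    style_delta = rng.normal()  # e.g. normalized length(a) - length(b)
    p_a_wins = 1 / (1 + np.exp(-((true_skill[a] - true_skill[b])
                                 + style_bias * style_delta)))
    x = np.zeros(m + 1)
    x[a], x[b], x[m] = 1.0, -1.0, style_delta  # model indicators + style covariate
    rows.append(x)
    labels.append(int(rng.random() < p_a_wins))

clf = LogisticRegression(fit_intercept=False).fit(np.array(rows), np.array(labels))

# Coefficients on the model columns are style-controlled strengths;
# 400 / ln(10) rescales log-odds onto the familiar Elo axis.
scale = 400 / np.log(10)
for name, coef in zip(models, clf.coef_[0][:m]):
    print(f"{name}: {coef * scale:+.0f} Elo-scale points (style-controlled)")
print(f"recovered style coefficient: {clf.coef_[0][m]:+.2f} (true: {style_bias})")
```

On this synthetic data the fit recovers both the latent skill gaps and the style bias; the real pipeline uses more style features and careful normalization.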
On the overall leaderboard of lmarena.ai, Gemini makes a 15-point jump over its two-week-old predecessor to regain first place in human preference.
It looks like an incremental upgrade in the endless optimization war between Google and OpenAI over the last few months- until you start filtering. (2)
LMArena categories overview. Gemini-exp-1206 is first or tied for first across all categories.
The new Gemini release from Google has mostly flown under the radar- perhaps understandably so.
Regaining the #1 spot on the lmarena.ai overall leaderboard feels like Google just finetuned their model for human preference again- but taking a closer look reveals truly remarkable performance... 🧵
Yeah, it seems really cool, a much more powerful finetuning method, but based on the sign-up form it seems quite selective.
Also, it will only be rolled out next year.
A sneak peek of some topics I want to write about in the near future!
- Approaching benchmark limits: what comes next?
- Rankings and style control at LMArena: the 'hidden details' of the leaderboard
- o1's (not so) groundbreaking performance...
- Gemini's new release is probably cooler than you think!
My main focus for now will be the benchmarks of LMSys/LMArena, as they have been widely adopted as the 'gold standard' for a model's performance in everyday use.
But I will also cover many other benchmarks, showcasing their strengths and weaknesses in furthering our understanding of an LLM's performance.
Hello BlueSky🦋! This page will be all about benchmarks of large language models.
I've decided to create it for two key reasons:
Firstly, benchmarking LLMs is becoming more difficult.
And secondly, interpreting benchmarks can be difficult.