Indeed, unfortunately politics can't be treated like some vacuum that doesn't interact with anything else- especially when the body in power is stifling scientific research.
That being said, it does feel like there isn't much subject-specific academic discussion going on here, unlike old Twitter.
Posts by LMAnalysis
Yeah, my position is that pure scaling has worked up till now, at least. Hallucinations still happen, but significant progress is being made nevertheless, for instance on reasoning. I do agree that new ideas need to be implemented; TTC (test-time compute) alone isn't enough to address the fundamental problems.
I kind of disagree: pure scaling has resulted in continued improvements in performance where stagnation could have been expected. Just throwing pure compute at Grok resulted in a SOTA LLM, and its problems seem to stem from core flaws of the transformer architecture.
Maybe Titans and LLaDA are on the horizon!
While we don't have many other benchmarks for it yet, Livebench.ai's low-contamination benchmark puts Gemini in second. It isn't quite o1-level for reasoning and language- but it pulls ahead on math and coding.
Gemini might just be the best publicly available general-purpose model right now!
And this story is repeated across the board. In fact, in every single category and language, Gemini is in first place or tied for first! I don't know if there has been such a universally strong general-purpose model release since the original GPT-4. (5)
With a +45-point leap over its predecessor, Gemini takes a commanding +29-point lead for hard prompts.
With style control, this lead shrinks to +19 points, but the leap over the previous Gemini remains at +46 points. Clearly, Google cooked something up with this release! (4)
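For intuition on what these point gaps actually mean: under the Bradley-Terry / Elo model behind arena-style leaderboards, a gap of d points corresponds to an expected head-to-head win rate of 1 / (1 + 10^(-d/400)). A quick sketch (a toy calculation for intuition, not LMArena's actual pipeline):

```python
# Convert an Elo-scale rating gap into an expected head-to-head win rate.
# Toy calculation for intuition; not LMArena's actual pipeline.

def expected_win_rate(gap_points: float) -> float:
    """Probability the higher-rated model wins a single pairwise vote."""
    return 1.0 / (1.0 + 10.0 ** (-gap_points / 400.0))

# Point gaps mentioned in this thread.
for gap in (15, 19, 29, 45):
    print(f"+{gap:>2} points -> {expected_win_rate(gap):.1%} expected win rate")
```

So the +29-point lead translates to winning roughly 54% of head-to-head votes- modest per vote, but a real gap over thousands of battles.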
With style control, the new Gemini's lead expands to +36 over the previous Gemini, though it's now tied with GPT-4o. Sure, this could just be Google figuring out how to bypass LMArena's style control filters. So let's take a look at hard prompts. (3)
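Since style control keeps coming up: as I understand LMArena's approach, they fit the usual Bradley-Terry logistic regression on pairwise votes but add style features (like response-length differences) as extra covariates, so the model coefficients reflect quality with the style effect regressed out. A minimal sketch on made-up data- everything here (model names, skill numbers, the single style feature) is invented for illustration:

```python
# Toy sketch of style-controlled Bradley-Terry ratings, in the spirit of
# LMArena's style control. All data below is synthetic and for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model-a", "model-b", "model-c"]  # hypothetical names
m = len(models)
rng = np.random.default_rng(0)

true_skill = np.array([1.2, 1.0, 1.1])  # assumed latent quality (log-odds units)
style_bias = 0.8                        # assumed voter preference for longer answers

rows, labels = [], []
for _ in range(5000):
    a, b = rng.choice(m, size=2, replace=False)
    style_delta = rng.normal()  # e.g. normalized length(a) - length(b)
    p_a_wins = 1 / (1 + np.exp(-((true_skill[a] - true_skill[b])
                                 + style_bias * style_delta)))
    x = np.zeros(m + 1)
    x[a], x[b], x[m] = 1.0, -1.0, style_delta  # model indicators + style covariate
    rows.append(x)
    labels.append(int(rng.random() < p_a_wins))

clf = LogisticRegression(fit_intercept=False).fit(np.array(rows), np.array(labels))

# Coefficients on the model columns are style-controlled strengths;
# 400 / ln(10) rescales log-odds onto the familiar Elo axis.
scale = 400 / np.log(10)
for name, coef in zip(models, clf.coef_[0][:m]):
    print(f"{name}: {coef * scale:+.0f} Elo-scale points (style-controlled)")
print(f"recovered style coefficient: {clf.coef_[0][m]:+.2f} (true: {style_bias})")
```

On this synthetic data the fit recovers both the latent skill gaps and the style bias; the real pipeline uses more style features and careful normalization.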
On the overall leaderboard of lmarena.ai, Gemini makes a 15-point jump over its two-week-old predecessor to regain first place in human preference.
It looks like an incremental upgrade in the endless optimization war between Google and OpenAI over the last few months- until you start filtering. (2)
LMArena categories overview. Gemini-exp-1206 is first or tied for first across all categories.
The new Gemini release from Google has mostly flown under the radar- perhaps understandably so.
Regaining the #1 spot on the lmarena.ai overall leaderboard feels like Google just finetuned their model for human preference again- but taking a closer look reveals truly remarkable performance... 🧵
Yeah, it seems really cool, a much more powerful finetuning method, but based on the sign-up form it seems quite selective.
Also, it will only be rolled out next year.
A sneak peek of some topics I want to write about in the near future!
- Approaching benchmark limits: what comes next?
- Rankings and style control at LMArena: the 'hidden details' of the leaderboard
- o1's (not so) groundbreaking performance...
- Gemini's new release is probably cooler than you think!
My main focus for now will be the benchmarks of LMSys/LMArena, as they have been widely adopted as the 'gold standard' for a model's performance in everyday use.
But I will also cover many other benchmarks, showcasing their strengths and weaknesses in furthering our understanding of an LLM's performance.
Hello BlueSky🦋! This page will be all about benchmarks of large language models.
I've decided to create it for two key reasons:
Firstly, benchmarking LLMs is becoming more difficult.
And secondly, interpreting benchmarks can be difficult.