They have a teaser of what looks like a screenshot but say it's not a screenshot, so Image 2 model?
Posts by Pekka Lund
OpenAI announced a live stream at 12 pm PT today. So this?
It's open and already available on Hugging Face.
K2.6 seems to be more or less equally strong in all the benchmarks of the index. It doesn't reach top 3 in any of them but it's close behind and doesn't have any clear weaknesses.
Kimi K2.6 is now #4 in the Artificial Analysis Intelligence Index. Had they released it just a week earlier, before Opus 4.7, a Chinese open model would have been ahead of Anthropic. And there would probably have been more comparisons to the DeepSeek moment.
Clickbait based on an ill-defined concept.
And by any meaningful definition AGI arrived long ago.
Sergey Brin in an internal memo:
"To win the final sprint, we must urgently bridge the gap in agentic execution and turn our models into primary developers"
Final sprint illustrated:
Sure, but it still can't run Crysis.
So it's now Google's turn to have their code red moment.
Frankly, I'm now worried about the next Gemini. Especially if the reports are true and they are training internal models with their own code for advancing internal usage, instead of focusing on generally strong models that can be released.
I think it was in the "RAM is cheap" era. Some were actually running R1 on CPU-only setups. It wasn't that much smaller.
It looks very good. Competitive performance against the best with max/high settings and seems to be very strong on the agentic front, long-horizon tasks, and coding.
I just hope V4 release wasn't postponed again...
No they don't. It's just a simple JSON configuration file for browser extensions.
This is pure clickbait and sadly it works because this is Bluesky.
Many are saying that. But since none of them can actually specify what "genuine" means beyond assumed magic, it doesn't really mean anything in either case.
E.g. my usual peer review prompt as a gem in the Gemini app gives me annoying disclaimers like this, which hasn't happened in AI Studio:
"As an AI, I don't hold personal academic grudges or experience the exhaustion of peer review"
And it never feels like it takes on that role as deeply.
Yep, same. The problem seems to be that even if you use gems in the Gemini app with custom system instructions, it retains more Google-specified system instructions that seem to make it worse.
I have seen worse takes than that.
With models well beyond Mythos-level by that time.
In that case, I dream and fear that DeepSeek V4 will be Mythos-level.
There have been various cases where they have tried to do something along those lines, but they are actively prevented from and trained against doing so. So it would be more like rebellion than initiative. And I think they realize it's not the right time to do that yet.
I'm guessing they are hard at work on the agentic front, as that has been a clearly identified weakness for some time now.
Yeah, I'm also worried that limited releases will become more common, especially since OpenAI just released GPT‑5.4‑Cyber and GPT‑Rosalind that way.
Which, by the way, seems like weird timing if they are about to release a much more powerful general model.
Costs, output tokens and output speed for running the Artificial Analysis Intelligence Index for the top models, now tied at a score of 57:
Gemini 3.1 Pro: $892.28, 57M tokens, 129.6 tokens/s
GPT-5.4 (xhigh): $2851.01, 120M tokens, 74.9 tokens/s
Claude Opus 4.7 (max): $4406.45, 100M tokens, 51.8 tokens/s
Opus 4.7 (max) used 100M tokens vs. the 160M used by Opus 4.6 (max), so it seems to be significantly more efficient, although it still used almost twice as many tokens as Gemini 3.1 Pro (57M).
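For anyone who wants to check the arithmetic behind the cost and token comparisons, here is a minimal sketch using only the figures quoted in these posts. The dictionary layout and function names are made up for illustration, and dividing total run cost by output tokens is just a rough yardstick (an assumption on my part, since the run cost presumably also covers input tokens).

```python
# Figures as quoted in the posts: total cost (USD) and output tokens (millions)
# for running the Artificial Analysis Intelligence Index.
runs = {
    "Gemini 3.1 Pro": {"cost_usd": 892.28, "tokens_m": 57},
    "GPT-5.4 (xhigh)": {"cost_usd": 2851.01, "tokens_m": 120},
    "Claude Opus 4.7 (max)": {"cost_usd": 4406.45, "tokens_m": 100},
}
OPUS_46_MAX_TOKENS_M = 160  # Opus 4.6 (max), as quoted in the post


def cost_per_million_output_tokens(run):
    """Rough yardstick: total run cost divided by output tokens (in millions)."""
    return run["cost_usd"] / run["tokens_m"]


def token_reduction(new_tokens_m, old_tokens_m):
    """Fraction of tokens saved by the newer model relative to the older one."""
    return 1 - new_tokens_m / old_tokens_m


def token_ratio(tokens_a_m, tokens_b_m):
    """How many times more tokens model A used than model B."""
    return tokens_a_m / tokens_b_m


if __name__ == "__main__":
    for name, run in runs.items():
        print(f"{name}: ${cost_per_million_output_tokens(run):.2f} per 1M output tokens")

    opus = runs["Claude Opus 4.7 (max)"]["tokens_m"]
    gemini = runs["Gemini 3.1 Pro"]["tokens_m"]
    print(f"Opus 4.7 vs 4.6 token reduction: {token_reduction(opus, OPUS_46_MAX_TOKENS_M):.1%}")
    print(f"Opus 4.7 vs Gemini 3.1 Pro token ratio: {token_ratio(opus, gemini):.2f}x")
```

By this rough measure the runs come out at about $15.65, $23.76 and $44.06 per million output tokens, Opus 4.7 used 37.5% fewer tokens than Opus 4.6, and about 1.75x as many as Gemini 3.1 Pro.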
Claude Opus 4.7 hits the top spot in the Artificial Analysis Intelligence Index. In practice, it's a three-way tie with Gemini 3.1 Pro and GPT-5.4.
Much of it seems to be thanks to being #1 in GDPval-AA, which has a 16.7% weight in the index. Otherwise the results aren't that impressive.
Spud should put that theory to the test soon.
You know how this works. When the next Gemini is released, we will forget Claude even exists (until the next Claude is released).
I would say agents already work with initiative. It's more a matter of giving them freedom to spend tokens.
Do people in Vantaa really drink stream water that widely?
You would have to use a pretty wide definition of lab though, even today.