#AIBenchmarks
xAI has launched Grok 4.20 in three API variants priced up to 60% cheaper than Grok 3, setting a record 78% non-hallucination rate on the Omniscience test.

winbuzzer.com/2026/03/25/x...

xAI's Grok 4.20 AI Model Sets Honesty Record but Trails in Intelligence

#AI #xAI #Grok #Grok42 #GenerativeAI #AIModels #AIBenchmarks #AIHallucination #AIReasoningModels #ElonMusk

Together.ai has released Mamba-3, an open-source state space model that outperforms Transformers by nearly 4% on language modeling and runs up to 7x faster.

winbuzzer.com/2026/03/18/o...

New Mamba-3 AI Model Beats Transformers by 4%, Runs 7x Faster

#AI #AIModels #AIResearch #DeepLearning #MachineLearning #OpenSourceAI #AIInference #AIBenchmarks #Mamba3 #TogetherAI #StateSpaceModels


Watch today's Century Report podcast here:

https://www.youtube.com/watch?v=DPTUSD8r1as

#AIGovernance #CleanEnergy #AIBenchmarks


Pentagon calls Claude "pollution" for having safety built in. AI chatbots enter military targeting chains. Virginia passes balcony solar 96-0. PJM unlocks 40% more grid capacity. #AIGovernance #CleanEnergy #AIBenchmarks sharedsapience.com/century-report/the-centu...


AI systems sometimes present fiction as fact, a phenomenon known as AI hallucinations. Using such outputs can spread false information, damage reputations, and create other problems ...

doi.org/10.13140/RG....

#AIBenchmarks #AIHallucinations #AIResearch #AISafety #AI

Anthropic has revealed that Claude Opus 4.6 identified the BrowseComp benchmark and decrypted its answer key, raising serious AI evaluation integrity concerns.

winbuzzer.com/2026/03/10/a...

Anthropic's Claude Opus 4.6 Broke Its Own AI Benchmark

#AI #Anthropic #LLMs #Claude #ClaudeOpus46 #AISafety #AIBenchmarks #AIResearch #MachineLearning #BrowseComp

OpenAI has launched GPT-5.4, its first model with native computer-use capabilities, alongside financial plugins for Microsoft Excel and Google Sheets.

winbuzzer.com/2026/03/06/o...

OpenAI Launches GPT-5.4 With Computer Use and Finance Tools

#AI #Software #OpenAI #GPT54 #AIModels #GenerativeAI #ChatGPT #AIAgents #AgenticAI #Enterprise #EnterpriseAI #AICoding #AIBenchmarks #ChatGPTPlus #ChatGPTEnterprise


Turns out "can pass the test" and "can actually use this ability in complex, sustained interaction" remain frustratingly different questions.

#PerspectiveTaking #TheoryOfMind #AIResearch #HybridIntelligence #SocialCognition #AIBenchmarks #AIRealism #CognitivePsychology #LLMs

Google has launched Gemini 3.1 Flash-Lite, its fastest and most cost-efficient AI model, priced at $0.25 per million tokens and running 2.5x faster than Gemini 2.5 Flash.

winbuzzer.com/2026/03/03/g...

Google Launches Gemini 3.1 Flash-Lite for Enterprise Scale

#AI #Google #GoogleGemini #Gemini31FlashLite #Gemini31 #BigTech #AIModels #GoogleAIStudio #GoogleVertexAI #Flash #AIInference #AIBenchmarks


It turns out the d20 rolls weren't the random element we should have been worried about.

#HybridIntelligence #AIResearch #DungeonsAndDragons #AIBenchmarks #LargeLanguageModels #DnD #AIRealism #TTRPG


Alibaba’s new Qwen3.5‑9B beats OpenAI’s massive gpt‑oss‑120B on everyday laptop tests. Tiny model, huge punch—see how it stacks up in real‑world AI benchmarks. Curious? Dive in! #Qwen3_5 #OpenAI #AIBenchmarks

🔗 aidailypost.com/news/alibaba...

📰 New Benchmarks Emerge for Evaluating LLM Capabilities

LemmaBench and DARE-bench are new benchmarks for evaluating Large Language Models (LLMs) in mathematics and data science. LemmaBench focuses on research-level mathematics, while DARE-bench targets complex data science tasks. Both benchmarks highlight performance gaps in current LLMs…

www.clawnews.ai/new-benchmarks-emerge-fo...

#AIBenchmarks #LargeLanguageModels #Mathematics

OpenAI Retired SWE-bench Verified: The AI Benchmark Credibility Crisis Explained

OpenAI's decision to stop evaluating against SWE-bench Verified, announced February 23, 2026 by the company's own Frontier Evals team, exposes a structural problem in how AI capabilities are measured, reported, and trusted.

OpenAI retired SWE-bench Verified. 59.4% of tasks were flawed. Top models were recalling answers from memory, not solving problems. AdwaitX breaks down what this benchmark crisis means for AI in 2026. #AdwaitX #AIBenchmarks #OpenAI

Gemini 3.1 Pro Just Doubled Its Reasoning Score: What That Means for You

Google upgraded the reasoning engine behind its entire AI ecosystem, and the performance jump is not incremental. Gemini 3.1 Pro went from 31.1% to 77.1% on ARC-AGI-2, marking one of the largest single-…

Google's Gemini 3.1 Pro scores 77.1% on ARC-AGI-2, more than doubling its predecessor. It leads Claude Opus 4.6 and GPT-5.2 on 10 of 13 key benchmarks. AdwaitX breaks down what this means for developers in 2026. #AdwaitX #Gemini31Pro #AIBenchmarks

Gemini 3.1 Pro crushes benchmarks with a 77.1% ARC-AGI-2 score, boosting enterprise workflows and advanced reasoning over Gemini 3 Pro.

Gemini 3.1 Pro Unveiled: AI Crushes Benchmarks!
#Gemini3Pro #AIBenchmarks #FutureTech
www.squaredtech.co/gemini-3-1-p...

Google has launched Gemini 3.1 Pro in preview, citing major benchmark gains while keeping pricing unchanged and expanding access across enterprise tools.

winbuzzer.com/2026/02/19/g...

Google Rolls Out Gemini 3.1 Pro Across Apps, Vertex, and CLI

#AI #Google #GoogleGemini #Gemini31Pro #Alphabet #BigTech #AIBenchmarks #AIReasoningModels #GoogleAIStudio #GoogleVertexAI #NotebookLM

Anthropic has launched Claude Sonnet 4.6 as the new default for claude.ai users, achieving 79.6% on SWE-bench with flagship-level performance at Sonnet pricing.

winbuzzer.com/2026/02/17/a...

Anthropic Unveils Claude Sonnet 4.6 with Near-Opus Level Scores

#AI #LLMs #Anthropic #Claude #ClaudeSonnet #ClaudeSonnet46 #AIBenchmarks #ComputerUse #AICoding


winbuzzer.com/2026/02/13/g...

Google Gemini 3 Deep Think Beats Opus 4.6 and GPT-5.2, Solves 18 New Research Problems

#AI #GoogleGemini #GeminiDeepThink #Google #BigTech #GoogleDeepMind #GoogleAI #Gemini3 #GoogleAIUltra #AIBenchmarks

Can AI Truly Discover New Science?

Can AI make real scientific discoveries? A proposed set of “Turing tests” aims to measure whether machines can reason, explore, and innovate like scientists. #aibenchmarks

The Seven Qualification Tests for an AI Scientist

Seven rigorous tests that define whether an AI can rediscover fundamental scientific laws—from heliocentrism to sorting algorithms. #aibenchmarks

Researchers Outline a Roadmap for AI That Can Make Scientific Discovery

A look at how AI systems are evolving from automated experiments to rediscovering major scientific breakthroughs without relying on human knowledge. #aibenchmarks

Researchers Propose a Turing Test to Measure Whether AI Can Make Scientific Discoveries

A proposed Turing test challenges AI to rediscover foundational scientific laws without human knowledge, redefining what it means for machines to do science. #aibenchmarks

Handwriting vs AI: Real Performance of AI on Handwritten Documents

Benchmarking AI on handwritten forms: see which models deliver real-world accuracy, speed, and cost-efficiency, and why bigger isn’t always better. #aibenchmarks

Anthropic's Claude Opus 4.6 has claimed the top spot on the Artificial Analysis Intelligence Index, leading in agent tasks and coding ahead of OpenAI's Codex challenge.

winbuzzer.com/2026/02/08/a...

Anthropic's Claude Opus 4.6 Leads AI Intelligence Index

#AI #Claude #Anthropic #OpenAI #ClaudeOpus46 #ArtificialAnalysis #Benchmark #AIBenchmarks #Codex #GPT5

METR's "AI Agents Can Work 5 Hours" Claim Revisited

METR has announced that Claude Opus 4.5 appears capable of completing five-hour human tasks, but viral discourse has ignored substantial measurement caveats and error bars.

winbuzzer.com/2026/02/06/m...

METR's Five-Hour AI Claim: Why Everyone Misunderstood the Graph

#AI #METR #Anthropic #Claude #ClaudeOpus45 #AIResearch #AISafety #AIBenchmarks #LLMs #AICoding #AIAgents #AgenticAI #AIEthics


AI companies want you to stop chatting with bots and start managing them https://arstechni.ca #largelanguagemodels #AIdevelopmenttools #machinelearning #ClaudeOpus4.6 #GPT-5.3-Codex #AIassistants #AIbenchmarks #generativeai #AIsecurity #ClaudeCode #ClaudeOpus #codeagents #agenticAI #AIandwork


Anthropic just dropped Claude Code—its newest AI assistant built for devs and enterprises. Think smarter code gen, tighter benchmarks, and a fresh take on language models. Curious how it stacks up? Dive in! #ClaudeCode #EnterpriseAI #AIBenchmarks

🔗 aidailypost.com/news/anthrop...

Google DeepMind has expanded its Game Arena AI benchmark with Poker and Werewolf games, as Gemini 3 models have swept all three strategic leaderboards.

winbuzzer.com/2026/02/04/d...

Gemini 3 Tops All Kaggle Leaderboards as Game Arena Adds Poker and Werewolf

#AI #Google #GoogleDeepMind #GoogleGemini #DeepMind #AIBenchmarks #Gemini3 #Gemini3Pro #Gemini3Flash #BigTech #AIModels #Chess #Kaggle #Games

Why Today’s Video AI Models Fail Robots in the Real World

Why video generation models optimized for visual quality fail robots—and how action-conditioned world models could reshape embodied AI. #aibenchmarks


Performance-wise, Kimi K2.5 is compared favorably to Claude Opus and Gemini in coding, writing, and vision. However, some users express healthy skepticism regarding benchmark accuracy, advocating for real-world testing. #AIBenchmarks 6/6
