#BENCHMARKS

Awww look at that beautiful smile Julio & yay for #BenchMarks day. You look pawtastic. Have a great day my friend. 🥰❤️💛🐾


#Benchmarks picture day! I’m 2 1/2 years old now and Mom says I am not allowed to grow anymore! Can you believe that every month on the 13th we do a photo shoot on my bench!?! I think a Great Blue Heron may have pooped on my bench because it was really dirty today!🤭
#BandanasMakeEverythingBetter

Berkeley: Every Major AI Agent Benchmark Can Be Hacked
UC Berkeley researchers achieved near-perfect scores on eight major AI agent benchmarks without solving a single task, exposing systemic flaws in how the industry measures progress.

awesomeagents.ai/news/berkeley-agent-benc...

#Benchmarks #AiAgents #Security

Stanford's AI Index 2026 - US Edge Over China Is Gone
Stanford HAI's 2026 AI Index finds the US-China model gap has effectively closed, GenAI has hit 53% global adoption faster than any prior technology, and young software developers are the first casualties of the labor shift.

awesomeagents.ai/news/stanford-ai-index-2...

#Benchmarks #China #AiJobs

AI Models Pass Vision Tests Without Seeing the Images
A Stanford study shows frontier AI models achieve 70-80% of visual benchmark scores with no images provided, exposing a fundamental flaw in how multimodal AI is evaluated.

awesomeagents.ai/news/mirage-ai-vision-be...

#Benchmarks #Research #Hallucination

Arcee's Trinity-Large: 398B Open Reasoning at $0.90
Arcee AI ships Trinity-Large-Thinking, a 398B sparse MoE reasoning model under Apache 2.0 that hits 91.9% on PinchBench for $0.85 per million output tokens on OpenRouter.

awesomeagents.ai/news/arcee-trinity-large...

#OpenSource #AiAgents #Benchmarks

KellyBench is a useful reality check for long-running agents: every frontier model tested lost money across the season, and only Opus 4.6 and GPT-5.4 avoided ruin in every seed. aintelligencehub.com/articles/kel... #AIAgents #Benchmarks #AIResearch
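The name KellyBench and the "avoided ruin" framing suggest the benchmark is built around Kelly-style bankroll management (an assumption; the linked article has the actual rules). A minimal sketch of the underlying math, showing why over-betting ruins an agent even when it has a real edge:

```python
import math

def kelly_fraction(p: float, b: float) -> float:
    """Kelly-optimal fraction of bankroll for a bet that wins with
    probability p and pays b-to-1 on a win: f* = p - (1 - p) / b.
    A negative result means there is no edge and you shouldn't bet."""
    return p - (1.0 - p) / b

def log_growth(f: float, p: float, b: float) -> float:
    """Expected log-growth of bankroll per bet when staking fraction f."""
    return p * math.log(1 + f * b) + (1 - p) * math.log(1 - f)

# With a 60% win rate at even odds, Kelly says stake ~20% of bankroll.
# Staking double that flips expected log-growth negative -- the slow-ruin
# failure mode a season-long betting agent has to avoid.
f_star = kelly_fraction(0.6, 1.0)
safe = log_growth(f_star, 0.6, 1.0)        # positive: bankroll compounds
reckless = log_growth(2 * f_star, 0.6, 1.0)  # negative: ruin in the long run
```

All names and numbers here are illustrative; they are not taken from KellyBench itself.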

Instruction Following Leaderboard: IFEval Rankings 2026
Rankings of AI models on IFEval and IFBench, the two main benchmarks for measuring how reliably LLMs follow precise formatting, length, and content constraints.

awesomeagents.ai/leaderboards/instruction...

#Benchmarks #Rankings #Llm
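For context on what these leaderboards score: IFEval-style benchmarks attach programmatically verifiable constraints to each prompt and check responses with code rather than a judge model. A hypothetical mini-checker illustrating the idea (not IFEval's actual implementation; its real checkers live in the open-source repo):

```python
def at_most_n_words(text: str, n: int) -> bool:
    """Verifiable length constraint: 'answer in at most n words'."""
    return len(text.split()) <= n

def exactly_k_bullets(text: str, k: int) -> bool:
    """Verifiable format constraint: 'use exactly k bullet points'."""
    return sum(line.lstrip().startswith("- ") for line in text.splitlines()) == k

def strict_accuracy(text: str, checks) -> float:
    """Fraction of constraints the response verifiably satisfies."""
    return sum(check(text) for check in checks) / len(checks)

reply = "- fast\n- cheap\n- reliable"
score = strict_accuracy(reply, [lambda t: at_most_n_words(t, 10),
                                lambda t: exactly_k_bullets(t, 3)])
# Both constraints pass mechanically, so score == 1.0
```

Because every check is mechanical, scores are reproducible and cheap, which is why instruction following is one of the few capabilities with a stable leaderboard.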

EXAONE 4.5: LG's Open VLM Beats GPT-5-mini on STEM
LG AI Research released EXAONE 4.5, a 33B open-weight vision-language model that posts higher STEM scores than GPT-5-mini and Claude 4.5 Sonnet - but a non-commercial license caps its real-world reach.

awesomeagents.ai/news/lg-exaone-4-5-open-...

#OpenSource #Llm #Benchmarks

GLM-5.1 Tops SWE-Bench Pro With Zero NVIDIA Hardware
Z.ai's GLM-5.1 scores 58.4 on SWE-bench Pro, edging out GPT-5.4 and Claude Opus 4.6, after being trained on 100,000 Huawei Ascend chips with no US silicon.

awesomeagents.ai/news/glm-5-1-swe-bench-p...

#Huawei #OpenSource #Benchmarks


🚀 New in my toolkit: a single‑file Ollama benchmark 🤖
⚡️ The script 👉 short.b1project.com/2D5u1c runs a lightweight benchmark against any Ollama‑powered LLM and reports time to first token (TTFT), tokens per second (TPS), and total tokens.
🔧 Drop in a model name, tweak the prompt, and hit ▶️
Happy testing!
#AI #ML #OpenAI #Ollama #Benchmarks
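The linked script isn't reproduced here, but a hypothetical equivalent can be sketched against Ollama's documented streaming /api/generate endpoint: it emits NDJSON events, and the final event carries eval_count (output tokens) and eval_duration (in nanoseconds), while TTFT is just the wall-clock time to the first non-empty chunk:

```python
import json
import time
import urllib.request

def summarize(ttft_s, final: dict) -> dict:
    """Turn Ollama's final stream event into TTFT/TPS/token metrics."""
    tokens = final.get("eval_count", 0)
    secs = final.get("eval_duration", 0) / 1e9  # Ollama reports nanoseconds
    return {"ttft_s": ttft_s, "tokens": tokens,
            "tps": tokens / secs if secs else 0.0}

def benchmark(model: str, prompt: str, host: str = "http://localhost:11434"):
    """Stream one generation from a local Ollama server, timing TTFT."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": True}).encode(),
        headers={"Content-Type": "application/json"},
    )
    start, ttft, final = time.monotonic(), None, {}
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            event = json.loads(line)
            if ttft is None and event.get("response"):
                ttft = time.monotonic() - start  # first visible token
            if event.get("done"):
                final = event
    return summarize(ttft, final)

# summarize() is pure, so the math can be shown with canned numbers:
# 100 tokens over 4s of eval time -> 25 tokens/second.
metrics = summarize(0.12, {"eval_count": 100, "eval_duration": 4_000_000_000})
```

Calling `benchmark("llama3", "Hello")` needs a running Ollama server; the original script's exact flags and output format may differ.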

AutoAgent Builds Its Own Harness, Tops Two Benchmarks
Kevin Gu's MIT-licensed AutoAgent lets a meta-agent engineer and hill-climb its own agent harness overnight, claiming the top GPT-5 slot on TerminalBench and first place on SpreadsheetBench.

awesomeagents.ai/news/autoagent-self-opti...

#OpenSource #DeveloperTools #Benchmarks


GLM-5.1 topped SWE-bench Pro at 58.4, then lost to Mythos at 77.8% the same day. The real story: 8-hour autonomous runs and 6K+ tool calls. Endurance over intelligence? #AI #coding #benchmarks www.implicator.ai/glm-5-1-works-eight-hour...

Qualcomm Snapdragon X2 Elite Review: New ASUS And HP Laptops Tested
Laptops powered by the Qualcomm Snapdragon X2 Elite go on sale soon and we've taken two machines for a spin through an array of benchmarks.

Laptops powered by the #Qualcomm #Snapdragon X2 Elite go on sale soon and we've taken two machines for a spin through an array of #benchmarks.

hothardware.com/reviews/qual...


New Linux Driver to Detect Malicious HID Devices
Linux 7.1 to Fix Long-Standing Battery Reporting Limits for HID Devices
For years, Linux users with high-end gaming peripherals and wireless accesso...

#Technology #Desktop #Linux #benchmarking […]

[Original post on archynewsy.com]


CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and E...

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad et al.

Action editor: Lu Jiang

https://openreview.net/forum?id=XB1cwXHV0c

#reward #benchmark #benchmarks


AI Top 40 chart launches: weights SWE-bench/ARC-AGI 4x over Arena. GPT-5.4 #1, Claude #2 despite Arena lead. Meta tested 27 Llama variants in one month. #AI #benchmarks #LLM www.implicator.ai/forty-models-ranked-arce...


AI Top 40 ranks models across 10 benchmarks. Contamination-resistant tests weighted 4x higher than Arena. GPT-5.4 leads. Updates weekly. #AI #LLM #benchmarks www.implicator.ai/implicator-ai-launches-t...
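The 4x weighting both posts describe can be sketched as a plain weighted average. The weights and benchmark keys below are hypothetical; the actual AI Top 40 formula and score normalization aren't given in the posts:

```python
# Hypothetical weights mirroring the description: contamination-resistant
# benchmarks count 4x relative to Arena.
WEIGHTS = {"swe_bench": 4.0, "arc_agi": 4.0, "arena": 1.0}

def composite(scores: dict) -> float:
    """Weighted average over whichever benchmarks a model has scores for."""
    total = sum(WEIGHTS[k] for k in scores)
    return sum(WEIGHTS[k] * v for k, v in scores.items()) / total

# Why a model can lead Arena yet rank lower overall: the heavily
# weighted benchmarks dominate the composite.
leader = composite({"swe_bench": 0.80, "arc_agi": 0.60, "arena": 0.70})
arena_star = composite({"swe_bench": 0.55, "arc_agi": 0.40, "arena": 0.95})
# leader ranks above arena_star despite arena_star's Arena lead
```

This matches the "GPT-5.4 #1, Claude #2 despite Arena lead" pattern in the post, under the stated assumptions.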


📈 GPT-5.4 is out with ~33% fewer factual errors vs GPT-5.2 and better coding scores.

Steady gains. No single 'AGI moment'.

The real story is how fast incremental progress compounds.

#GPT5 #OpenAI #LLM #Coding #Benchmarks


🧠 AI jumped from near-zero to 37% on an 'unbreakable' expert exam — in just 14 months.

Humans still lead. But the gap is closing faster than anyone expected.

What happens when it hits 80%?

#AI #Benchmarks #Research #AGI #ML


CIS Benchmarks March 2026 Update

📖 Read more: helpnet.short.gy/gHXdNb

#cybersecurity #cybersecuritynews #benchmarks


📊 ARC-AGI-3 is out — and it's humbling today's best models.

Frontier AI still can't crack flexible, general reasoning at human level.

The gap is real. AGI hype needs a reality check.

#AGI #Benchmarks #AI #Research #ARC

The $500 GPU That Outperforms Claude Sonnet on Coding Benchmarks
A $500 RTX 5070 running Qwen 3.5 Coder 32B outperforms Claude Sonnet 4.6 on HumanEval at 40 tokens per second. The local AI revolution has reached consumer hardware.

#AI #LLM #Benchmarks

pooya.blog/blog/500-gpu-beats-claud...


Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

Chang Yang, Ruiyu Wang, Junzhe Jiang et al.

Action editor: Hanie Sedghi

https://openreview.net/forum?id=Xb6d5lGLb2

#benchmarks #npsolver #complexity

ARC-AGI-3 Launches - AI Agents Must Learn, Not Memorize
ARC Prize Foundation launched ARC-AGI-3 today with a fully open-source agent toolkit. The best AI in the preview phase scored 12.58% against a human baseline of 100%.

awesomeagents.ai/news/arc-agi-3-interacti...

#Benchmarks #OpenSource #AiAgents
