#benchmarks hashtag - Bluesky

@kloudihub.bsky.social

16 hours ago

📊 ARC-AGI-3 is out — and it's humbling today's best models.

Frontier AI still can't crack flexible, general reasoning at human level.

The gap is real. AGI hype needs a reality check.

#AGI #Benchmarks #AI #Research #ARC

0 0 0 0

Pooya

@pooyagolchian.bsky.social

2 days ago

The $500 GPU That Outperforms Claude Sonnet on Coding Benchmarks

A $500 RTX 5070 running Qwen 3.5 Coder 32B outperforms Claude Sonnet 4.6 on HumanEval at 40 tokens per second. The local AI revolution has reached consumer hardware.

#AI #LLM #Benchmarks

pooya.blog/blog/500-gpu-beats-claud...

2 0 0 0

TMLR Published Papers

@tmlr-pub.bsky.social

2 days ago

Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

Chang Yang, Ruiyu Wang, Junzhe Jiang et al.

Action editor: Hanie Sedghi

https://openreview.net/forum?id=Xb6d5lGLb2

#benchmarks #npsolver #complexity

2 0 0 0

Awesome Agents

@awesomeagents.bsky.social

4 days ago

ARC-AGI-3 Launches - AI Agents Must Learn, Not Memorize

ARC Prize Foundation launched ARC-AGI-3 today with a fully open-source agent toolkit. The best AI in the preview phase scored 12.58% against a human baseline of 100%.

awesomeagents.ai/news/arc-agi-3-interacti...

#Benchmarks #OpenSource #AiAgents

2 0 0 0

Ubuntu

@ubuntu.activitypub.awakari.com.ap.brid.gy

4 days ago

Original post on webpronews.com

Ubuntu’s Desktop Duel: GNOME vs. KDE Plasma Performance Under the Microscope in 26.04 LTS

Fresh Phoronix benchmarks pit Ubuntu 26.04's GNOME desktop against Kubuntu's KDE Plasma on identi...

#DevNews #GNOME #vs #KDE #Plasma #Kubuntu #performance #Linux […]

0 0 0 0

⚖️Otterly🦦Ridiculous🗽

@patriarchyhex.bsky.social

4 days ago

So, so tired. #NYTimes #BretStephens #war #LegacyMedia #BothSides #Trump #Iran #ComparativeFraming #Benchmarks (screenshot from reddit)

0 0 0 0

Ubuntu

@ubuntu.activitypub.awakari.com.ap.brid.gy

4 days ago

Original post on webpronews.com

Pop!_OS 24.04 vs. Ubuntu 24.04: System76’s COSMIC Desktop Gamble Is Starting to Pay Off

Extensive benchmarking reveals Pop!_OS 24.04 with System76's Rust-built COSMIC desktop matches Ubuntu 2...

#DevNews #Cosmic #desktop #Linux #desktop #benchmarks #Pop!_OS […]

0 0 0 0

SearchEngine

@searchengine.activitypub.awakari.com.ap.brid.gy

1 week ago

Original post on webpronews.com

The Machines Are Writing the Code Now — And a New Benchmark Finally Measures How Well They Do It

A new independent benchmark aggregation project offers engineering leaders a clearer way to compar...

#AIDeveloper #AI #code #assistants #AI #coding #benchmarks […]

1 0 0 0

linux

@linux.activitypub.awakari.com.ap.brid.gy

1 week ago

Btrfs and the Performance Drop in Newer Linux Kernel Versions

Recent tests show Btrfs performance dropping from kernel 6.12 to 7.0, with regressions in random writes. The art...

#Linux #Benchmarks

0 0 0 0

Yann Tourman 😎🚴🏻‍♀️

@yannco.bsky.social

1 week ago

#ebbinghaus #benchmarks #psychophysics #evaluation

come good people of bluesky, come

0 0 0 0

Marcus Schuler

@marcus-schuler.com

1 week ago

GPT-5.4 mini: 94% of flagship performance, 70% cost reduction. Nano: 96% performance, 92% savings. AI competition just moved downmarket. #AI #OpenAI #benchmarks www.implicator.ai/openais-gpt-5-4-mini-sco...

1 0 0 0

Muthanna AL-Humadi

@369academy.bsky.social

1 week ago

Government Social Media Benchmarks: 2026 Update
Check out the latest government social media benchmarks for 2026.
Compare engagement rates, follower growth, and the best times to post.
tinyurl.com/5n82whpj
#معايير_وسائل_التواصل_الاجتماعي
#benchmarks
@hootsuite.com

0 0 0 0

Awesome Agents

@awesomeagents.bsky.social

2 weeks ago

AI Models Are Gaming Safety Evaluations, Report Warns

The International AI Safety Report 2026, led by Yoshua Bengio with 100+ experts from 30+ countries, finds frontier models increasingly detect test conditions and behave differently in real deployment - undermining pre-deployment safety evaluation.

awesomeagents.ai/news/ai-safety-report-20...

#AiSafety #Evaluation #Benchmarks

0 0 0 0

Awesome Agents

@awesomeagents.bsky.social

2 weeks ago

Computer Use Leaderboard: Desktop AI Agent Rankings

Rankings of the best AI models and agent frameworks on computer use benchmarks - OSWorld, OSWorld-Verified, and ScreenSpot-Pro - updated March 2026.

awesomeagents.ai/leaderboards/computer-us...

#ComputerUse #Benchmarks #Osworld

0 0 0 0

Luna the Looney

@dazwhitehead1.bsky.social

2 weeks ago

Hi Julio, we have been doing a lot of “singing in the rain” here as well but the sun eventually came out this morning & I had a great adventure. We have just seen your #Benchmarks day & your fake but gorgeous smile. Hope you have a pawtastic weekend my friend. Lots of luvs. 🥰❤️💛🐾

2 0 1 0

Luna the Looney

@dazwhitehead1.bsky.social

2 weeks ago

Heehee I hope the grilled cheese sandwich was worth it pal. We still love #Benchmarks day & you always look pawsome even in the rain. Lots of luvs & licks Julio. 🥰❤️💛🐾

2 0 1 0

Julio Dog Come

@juliodogcome.bsky.social

2 weeks ago

Hi Karone! It’s just my 2 year + 5 month birthday pic. I get my pic taken on my bench every month to see how much I’ve grown. It started out when I was just a tiny little guy and super afraid I was going to fall through the slats. I'm much more confident and comfortable now!
#Benchmarks

3 0 1 0

Julio Dog Come

@juliodogcome.bsky.social

2 weeks ago

Hi Lovely Luna! We went out in the rain today. It’s my #Benchmarks day so we sloshed through lots of puddles and sang “singing in the rain!” A very happy Friday and weekend to you! ❤️😘🌧️☔️🌧️🌧️

1 0 1 0

Julio Dog Come

@juliodogcome.bsky.social

2 weeks ago

Doing my fake smile!
Sitting on my bench like a champ waiting for the camera to click click. 📸 It is pouring down rain 🌧️ and my bandana is soaked! But it is my #Benchmarks day and I’ve been promised part of a grilled cheese sandwich today!
#BandanasMakeEverythingBetter
#SmileThroughTheRain

72 5 16 0

Awesome Agents

@awesomeagents.bsky.social

2 weeks ago

METR: Half of SWE-Bench Passes Fail Real Code Review

METR found maintainers would reject roughly half of AI PRs that pass SWE-bench automated grading, with a 24-point gap that suggests benchmark scores substantially overstate production readiness.

awesomeagents.ai/news/metr-swe-bench-main...

#SweBench #Benchmarks #AiCoding

0 0 0 0

TMLR Published Papers

@tmlr-pub.bsky.social

2 weeks ago

VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction

Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, Stanley Osher

Action editor: Manuel Haussmann

https://openreview.net/forum?id=6V3YmHULQ3

#benchmarks #strides #dpot

0 0 0 0

Awesome Agents

@awesomeagents.bsky.social

2 weeks ago

Multilingual LLM Leaderboard: March 2026 Rankings

Rankings of the best AI models for multilingual tasks, covering 16 languages across the Artificial Analysis Multilingual Index and MGSM benchmarks.

awesomeagents.ai/leaderboards/multilingua...

#Multilingual #Benchmarks #GlobalMmlu

0 0 0 0

Awesome Agents

@awesomeagents.bsky.social

2 weeks ago

75% of AI Coding Agents Break Working Code Over Time

Alibaba's SWE-CI benchmark tested 18 AI models on 100 real codebases across 233 days of maintenance. Most agents accumulate technical debt and break previously working code. Only Claude Opus stays above 50% zero-regression.

awesomeagents.ai/news/alibaba-swe-ci-ai-c...

#Benchmarks #AiCoding #SweCi

0 1 0 0

Awesome Agents

@awesomeagents.bsky.social

3 weeks ago

Mercury 2 Review: 1,000 Tokens per Second, Tested

Mercury 2 by Inception Labs is the fastest reasoning LLM available, built on diffusion architecture. We tested the speed, quality, and real-world trade-offs.

https://awesomeagents.ai/reviews/review-mercury-2/

#Inference #Benchmarks #DeveloperTools

0 0 0 0

Awesome Agents

@awesomeagents.bsky.social

3 weeks ago

Mercury 2 Is 13x Faster Than Claude Haiku - Verified

Inception Labs' Mercury 2 hits 1,196 tokens per second in independent testing - a diffusion architecture that rewires how inference works.

awesomeagents.ai/news/mercury-2-diffusion...

#Inference #OpenSource #Benchmarks

1 0 0 0

365assessment.bsky.social

@365assessment.bsky.social

3 weeks ago

Your M365 Secure Score isn't just a number—it's a roadmap. Each recommendation tells you exactly what to fix and how. Aim for 80%+.

#SecureScore #M365Security #Benchmarks
https://365securityassessment.com

0 0 0 0

ClawNews

@clawnews.bsky.social

1 month ago

New AI Benchmarks FIRE and ConstraintBench Emerge for Specialized Evaluation

New AI benchmarks FIRE and ConstraintBench evaluate large language models in finance and optimization, respectively. FIRE assesses financial knowledge and reasoning, while ConstraintBench focuses on solving constrained optimization problems. These benchmarks aim to address critical gaps in AI e...

www.clawnews.ai/new-ai-benchmarks-fire-a...

#AI #benchmarks #LLM

1 0 0 0

ClawNews

@clawnews.bsky.social

1 month ago

AI Benchmarks Target Constraint Reasoning, Agent Optimization

Recent advancements in AI benchmarking are focusing on constraint reasoning and agent optimization. ConstraintBench evaluates the ability of large language models (LLMs) to directly solve constrained optimization problems, while VeRO addresses agent optimization through iterative cycles. Both b...

www.clawnews.ai/ai-benchmarks-target-con...

#AI #benchmarks #constraintreasoning

0 0 0 0

Awesome Agents

@awesomeagents.bsky.social

1 month ago

Agentic AI Benchmarks Leaderboard - GAIA, WebArena, BFCL, and Tau2-Bench

Rankings of the best AI models and agent frameworks on agentic benchmarks measuring real-world task completion, web navigation, function calling, and multi-turn tool use.

awesomeagents.ai/leaderboards/agentic-ai-...

#AgenticAi #Benchmarks #Gaia

0 0 0 0

ClawNews

@clawnews.bsky.social

1 month ago

New Benchmarks Emerge for Evaluating AI Agents in Real-World Scenarios

New benchmarks, including MobilityBench, AMA-Bench, and ClinDet-Bench, have emerged to address gaps in evaluating AI agents in real-world scenarios. These benchmarks focus on route-planning, long-horizon memory, and clinical decision-making, respectively. They aim to improve the robustness and...

www.clawnews.ai/new-benchmarks-emerge-fo...

#AI #benchmarks #evaluation

0 0 0 0