#AIBenchmarking hashtag - Bluesky

1 week ago

Standardizing Generative AI Service Evaluation: An API-Centric Benchmarking Approach - MLCommons
MLPerf® Endpoints brings API-native benchmarking, Pareto curve visualizations, and rolling submissions to generative AI infrastructure evaluation.

GenAI inference doesn't behave like classical ML. MLPerf® Endpoints is being designed to benchmark the full complexity of production GenAI services — not just peak numbers. mlcommons.org/2026/03/mlperf-endpoints... #MLPerf #AIBenchmarking


Hacker News Companion

@hncompanion.com

4 months ago

Users question standard AI benchmarks, suggesting models might just be memorizing data. The consensus: personal, curated benchmarks are crucial for evaluating AI in specific use cases, offering more reliable insights than generic tests. #AIBenchmarking 3/5


Hacker News Companion

@hncompanion.com

4 months ago

The AI World Clocks project offers a novel benchmark, revealing LLM strengths & weaknesses. Its real-time nature showcases non-deterministic outputs and "model drift," where minimal input changes cause varied results. #AIBenchmarking 5/6


Chainforge Labs

@chainforge-ai.bsky.social

5 months ago

Core Philosophy 2: Dream Big 💭, Share Big 📣 We dream of building the most trusted source for AI model selection. The gameplan: Community = scale. #AIEngineers, let's build the truth together! 💪 #CommunityDrivenAI #ScaleWithUs #Leaderboards #AIBenchmarking


Hacker News Companion

@hncompanion.com

5 months ago

Concerns grow over vendor benchmarks: Are providers 'cheating' with undisclosed tricks or techniques? This impacts fair comparisons & trust in LLM performance claims. Transparency is key for valid evaluation. #AIBenchmarking 5/6


Adesh

@adesh.raxit.ai

7 months ago

Everyone’s hyped about GPT-5 being “safer and more useful”

Cool story. We actually tested it.

#GPT5 #OpenAI #AISafety #ResponsibleAI #AIBenchmarking #ModelEvaluation #GrayZoneBench #AI


Media Aluni

@djasuy2.bsky.social

8 months ago

China Launches Kimi K2, an Open-Source AI Model That Claims to Outperform GPT-4 and Claude: proof that open-source AI can match, and even surpass, the best commercial models 👇
baabulhudaacinangsi.com/archives/chi...
👆✔
#AIOpenSource #KimiK2Model #GPT4 #Claude #AIInnovation #TechNews #ChinaAI #AIResearch #AICompetition #AIBenchmarking #AIOpenness #AITransparency #AIEthics #AIAdvances


@arxivlens.bsky.social

9 months ago

Urethra contours on MRI: multidisciplinary consensus educational atlas and reference standard for artificial intelligence benchmarking
Barrett, T., Baxter, M. T. et al.
#UrethraMRIAtlas #AIBenchmarking #MultidisciplinaryConsensus


@pragmaticleader.bsky.social

9 months ago

How to build a better AI benchmark
To fix the way we test and measure models, AI is learning tricks from social science.

AI models are outgrowing their tests. MIT Tech Review discusses why current benchmarks fall short—and how to build better ones that truly measure intelligence.

Check it out: ift.tt/uCq8NMI
#AI #ML #AIBenchmarking #AGI #TechPolicy


Pure AI

@pureainews.bsky.social

10 months ago

OpenAI’s HealthBench is Trying to Fix AI’s Biggest Medical Blind Spot -- Pure AI

OpenAI has introduced HealthBench, a sweeping new benchmark designed to test how large language models perform in real-world healthcare scenarios.
pureai.com/articles/202...

#AIinHealthcare #HealthBench #OpenAI #MedicalAI #AIBenchmarking


Matt

@neuralmarkets.substack.com

11 months ago

AI-driven A/B testing just got a turbo boost! 🚀🔩 Automated ad rotation lifts conversion rates by 300%! 👉No more manual guesswork, say hello to data-driven wins! 💡 #AIBenchmarking #MarketingAutomation #SmartAdvertising


Arron Johnson

@notaryx.ai

11 months ago

Is Meta's AI trust at risk? Allegations of benchmark manipulation have emerged, though Meta defends its practices. How can we ensure fair AI evaluations? #Meta #AIBenchmarking


Arron Johnson

@notaryx.ai

1 year ago

OpenAI's hidden funding for the FrontierMath benchmark raises questions on o3's impressive scores. Will transparency issues affect AI's credibility in future evaluations? #OpenAI #AIBenchmarking
