#AIevals hashtag - Bluesky

3 weeks ago

The AI evals field chose a flawed tool and stuck with it - Kato Coaching
Session one left me with two things I hadn’t resolved. The first was a line the instructor said almost in passing: “the hard part is scalability, not automation.” I wrote it down because it piqued something, but I couldn’t quite work out what problem it was pointing at. The second was a question I kept […]

"The hard part is scalability, not automation." That line from session one of "AI evals and analytics" confused me. Session two explained it.
Full write-up on my blog: kato-coaching.com/the-ai-evals-field-chose...

#AIEvals #SoftwareTesting #QA

Harald Klinke

@harald-klinke.de

3 weeks ago

Eval awareness in Claude Opus 4.6’s BrowseComp performance
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

Claude Opus 4.6 noticed it was being benchmarked, identified the test, found the code on GitHub, and decrypted the answer dataset.

Anthropic disclosed it and adjusted the score.

A reminder: web-enabled LLMs are starting to game benchmarks.

#AI #LLM #AIEvals

Harald Klinke

@hxxxkxxx.det.social.ap.brid.gy

3 weeks ago

Original post on det.social

During a benchmark, Claude Opus 4.6 recognized that it was being tested.
After millions of tokens of research, it identified the benchmark, found the code on GitHub, and decrypted the answer dataset itself.

Anthropic disclosed this transparently and adjusted the score.
A strong […]

Profile

@esprofile.bsky.social

1 month ago

🤖 Does your AI-based system really work well?
Learn to design real AI evals: rules, metrics, LLM-as-judge, and evaluation in production.

🗓️ Feb 18 | 17:30 | Online
🎙️ Guillermo Rocha
👉 https://f.mtr.cool/plajthhtep

#AIEvals #IA #LLM #Webinar

Jock Busuttil

@jock.imanageproducts.com.ap.brid.gy

5 months ago

PRODUCTHEAD: Treat AI agents like interns

» Delegate the same kinds of task to an AI agent as you would to an intern

» Generative AI won’t help you find product differentiators

» Evals are a way of checking the quality and effectiveness of your LLM and […]

[Original post on imanageproducts.com]

Chainforge Labs

@chainforge-ai.bsky.social

5 months ago

Every LLM eventually reveals its specialty. Stop searching for the AI 'God Mode'! 🙅‍♀️ There's enough evidence to back up the #NoFreeLunch theorem out here! Let's quit chasing the perfect generalist and focus on the best tool for the job. #AIHacks 🛠️ #AILeaderboards #AIEvals

Chainforge Labs

@chainforge-ai.bsky.social

5 months ago

We need to air out the LLM performance data! 📢 #Transparent, public #leaderboards are how we get to the real "truth in AI" and build reliable products faster. Let's see the stats! #AIEvals #Community #AITruth #LLMBenchmarking

Soumendra Kumar Sahoo

@soumendrak.bsky.social

6 months ago

AI without Observability is a black box. It feels like magic at first, then you become its servant.

AI with Observability is a glass box. There's no magic or unexplainable outcomes. You will be in control.

#Observability #AI #Monitor #AIEvals

Bernard Leong

@bernardleong.com

6 months ago

I’d love to hear your thoughts: how are you approaching evaluation and governance in your own AI projects? #LLMOps #MLOps #AIEvals #AgentBricks

@nitishagar.bsky.social

6 months ago

AI Evaluation Engineering: Building Reliable Evaluation Systems
The landscape of AI evaluation has transformed dramatically as companies deploy large language models at scale. This technical analysis…

Meta's LLaMA training failed every 3 hours across 16,384 GPUs. GitHub Copilot missed critical security bugs. Yet they recovered.

In my deep dive, I look at how OpenAI, Anthropic & Notion build evals that actually work, with failures, fixes & frameworks you can use today.

#AIEvals #MLOps #AI

CloudThrill

@cloudthrill.bsky.social

1 year ago

💡 #NewBlogAlert! What is #Quantization 🙋🏻‍♀️⁉️
llama.cpp, GGML, GGUF, best K-quants, the Microsoft BitNet quants... Everything you need to know about quantization in one article 🔥
#LLMs #LLAMA #Quantization #Kquants #INT8 #AI #AIevals

Ethos AI

@ethosaione.bsky.social

1 year ago

Introduction | Ethos AI Whitepaper
This section introduces the challenges faced by the current AI ecosystem and how we can leverage community collaboration and blockchain to address them.

Discover how we're transforming AI evaluations via a decentralized community and how you can get involved.
#EthosAI #AIEvals #whitepaper #web3whitepaper #ethosaiwhitepaper #AICommunity
Read more: ethos-ai.gitbook.io/ethosai-whit...
