#evals hashtag - Bluesky

1 month ago

The best approach to compare LLM outputs: See how you can create a repeatable approach to measure LLM output quality, with a detailed guide on setup, metrics to consider, the evals loop and observab...

#ai #evals #observability

3 months ago

How to Do Evals on a Bloated RAG Pipeline: Comparing metrics across datasets and models. First published on Towards Data Science.

#Large #Language #Models #Editors #Pick #Evals #Llm #Llm #Evaluation #Rag

4 months ago

Evals: a guarantee of AI quality and return on investment. OpenAI has published a framework that few people noticed. ...

#AI #evals #OpenAI #метрики #KPI #ROI #LLM #prompt #engineering #AI #evaluation

4 months ago

Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult: Anthropic released Claude Opus 4.5 this morning, which they call "best model in the world for coding, agents, and computer...

#generative-ai #llms #anthropic #claude #evals #llm-pricing […]

[Original post on simonwillison.net]

4 months ago

[Translation] LLM Evals: the driving force of a new era of AI in business. The other day OpenAI published a short article on their blog...

#ии #искусственный #интеллект #LLM #openai #evals #benchmarks #бенчмарки #llm #evals #оценки

Nicole Hennig

@nic221.bsky.social

4 months ago

Why it takes months to tell if new AI models are good www.seangoedecke.com/are-new-models… #AI #evals #benchmarks #vibes

4 months ago

Agent design is still hard: Armin Ronacher presents a cornucopia of lessons learned from building agents over the past few months. There are several agent abstraction libr...

#armin-ronacher #definitions #ai #prompt-engineering #generative-ai #llms #evals #ai-agents […]

Emerging Patterns in Building GenAI Products Patterns from our colleagues' work building with Generative AI

4 months ago

Building more with GPT-5.1-Codex-Max: Hot on the heels of yesterday's Gemini 3 Pro release comes a new model from OpenAI called GPT-5.1-Codex-Max. (Remember ...

#ai #openai #generative-ai #llms #evals […]

[Original post on simonwillison.net]

Fazzaro

@jonfazzaro.bsky.social

4 months ago

"As with testing, we run evals as part of the build pipeline for a Gen-AI system. Unlike tests, they aren't simple binary pass/fail results, instead we have to set thresholds, togeth..."

buff.ly/9rpFGh6 #testing #ai #evals #softwareengineering #developerexperience #gra…

Gerrit Eicker

@eicker.bsky.social

5 months ago

#OpenAI launched #AgentKit, a #toolkit for building and deploying #AIagents, at its Dev Day event. AgentKit includes #AgentBuilder for designing agent logic, #ChatKit for embedding chat interfaces, #Evals for measuring #agentperformance, and access to OpenAI’s connector registry.…

GitHub - wolfeidau/go-mcp-evals: A Go library and CLI for evaluating Model Context Protocol (MCP) servers using Claude. A Go library and CLI for evaluating Model Context Protocol (MCP) servers using Claude. - wolfeidau/go-mcp-evals

5 months ago

AgentKit from OpenAI: how the era of chaos in the world of AI agents ended. Until today, building and launching AI agents ...

#openai #AgentKit #Agent #Builder #ChatKit #Connector #Registry #Evals #ChatGPT #api #ChatGPT

Mark Wolfe

@mark.wolfe.id.au

5 months ago

Been working on my own evals tool, for learning, as well as to have some control over how things work. LLMs really are a "unique" system to work with. Still new so lots to improve github.com/wolfeidau/go... #golang #evals #mcp

SearchEngine

@searchengine.activitypub.awakari.com.ap.brid.gy

6 months ago

CompileBench: Can AI Compile 22-year-old Code? Interesting new LLM benchmark from Piotr Grabowski and Piotr Migdał: how well can different models han...

#go #ai #prompt-engineering #generative-ai #llms […]

[Original post on simonwillison.net]

What is the ARC AGI Benchmark and its significance in evaluating LLM capabilities in 2025 A Comprehensive Guide to Understanding Abstract Reasoning Assessment in Large Language Models

Murali Krishnan

@muralirk.bsky.social

6 months ago

benchmarks are powerful to help us learn more.

ARC-AGI is one such thing.

more details at labs.adaline.ai/p/what-is-th...

#AI #AGI #Benchmark #testing #evals #LLM

Nicole Hennig

@nic221.bsky.social

7 months ago

Text Shot: Researchers from Inclusion AI, which is affiliated with Alibaba’s Ant Group, proposed a new model leaderboard and benchmark that focuses more on a model’s performance in real-life scenarios. They argue that LLMs need a leaderboard that takes into account how people use them and how much people prefer their answers compared to the static knowledge capabilities models have.

Stop benchmarking in the lab: Inclusion Arena shows how LLMs perform in production venturebeat.com/ai/stop-benchmarking-in-... #AI #benchmarks #evals

7 months ago

Awakari App

Quoting Artificial Analysis: gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits [...] We’re se...

#evals #openai #deepseek #ai #qwen #llms #gpt-oss #generative-ai

7 months ago

Claude Opus 4.1: Surprise new model from Anthropic today - Claude Opus 4.1, which they describe as "a drop-in replacement for Opus 4". My favorite thing about this model is t...

#ai #generative-ai #llms #llm #anthropic #claude #evals […]

[Original post on simonwillison.net]

7 months ago