5. Run experiments to validate fixes
All inside Gemini CLI + Arize AX Skills.
No guessing.
📍 Friday, 9–10:30am at the Gemini CLI “Terminal Velocity” booth
arize.com/google-next...
Posts by Arize AI
We get asked this a lot:
“How do you actually improve an agent?”
At Google Next, we’ll walk you through the full loop:
1. Instrument your agent (capture traces + tool calls)
2. Query traces (find where things break)
3. Identify failure patterns
4. Build eval datasets
Observe 2026 is for the engineers building the evals, the harnesses, and the feedback loops that make that possible.
The AI Agent Evals Conference | June 4 | Shack15, San Francisco
Register → arize.com/observe/
#Observe2026 #AgentEvals #AIEngineering
Most agents fail silently.
A tool call starts failing. Outputs get less reliable. Workflows degrade over time.
Without visibility, you don’t catch it until users do.
At Google Next, we’ll show how to handle this in production with a simple workflow:
Trace → Evaluate → Improve
We let an agent optimize a RAG system for 8 hours.
No human in the loop on this one. Just eval → improve → repeat.
Recall@5: 39% → 75%
Turns out, you can Ralph-loop your way to success.
arize.com/blog/how-ar...
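For reference, Recall@5 (the metric in the post above) is just the fraction of relevant documents that show up in the top 5 retrieved results. A minimal sketch with made-up document IDs (not the actual RAG system from the post):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)

# Hypothetical example: 3 of the 4 relevant docs land in the top 5
retrieved = ["d1", "d7", "d3", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d3", "d4"}
print(recall_at_k(retrieved, relevant, k=5))  # → 0.75
```

An automated eval→improve loop just recomputes this after each change and keeps the change only if the number goes up.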
We’re heading to Google Next. (April 22-24)
Come talk to us at booth #3722!
If you’re building agents and want to debug things without guessing, come find us.
arize.com/google-next...
Last week at the AI Builders meetup in Seattle, @jimbobbennett.dev spoke about boosting Claude Code performance using Prompt Learning.
If you missed it, don't worry - Jim recorded a video of the session.
youtu.be/ES43SEXArvk
Arize AX and Phoenix skills are now available in the awesome-copilot repo, allowing GitHub Copilot users to install them with one command!
All our individual skills are there, as well as plugins that bundle all the skills into a single install.
awesome-copilot.github.com/plugins/?q=...
Congrats to Mistral on the Voxtral TTS release! mistral.ai/news/voxtra...
We’ve been experimenting with evals + tracing for voice agents using OpenInference + Phoenix. Excited to see what builders create with it! (Try it out with Pipecat!)
Guess how many 🌟 Phoenix has on GitHub?
Yes, this meme has been all over our internal slack today.
github.com/Arize-ai/ph...
Most agent failures don't crash. They produce confident, wrong output that feeds the next agent in the pipeline. Now scale that across 100 agents per employee.
Who's watching?
arize.com/blog/100-ai...
Plus: annotation queue improvements, CLI commands for spaces, Bedrock bearer token auth, and more → docs.arize.com/ax/release-...
Saved Views on Tracing — save filters, columns, sort order, and time range, then reuse them without reapplying the same setup each session.
Python & JS SDKs now support Evaluators, Prompts, and AI Integrations with full CRUD and versioning. If you can do it in the UI, you can script it.
Alyx now supports multi-span analysis, richer dataset page context, and auto-opens with context when navigating between pages.
Evaluator setup streamlined — updated preview, fewer steps from setup to validation, and configurable CPU/memory for online task evaluators.
New from Arize AX this week 🧵
Dashboard exports, smarter Alyx, SDK upgrades, and more.
Full release notes → docs.arize.com/ax/release-...
Trying out evals can be hard if you work in a regulated industry and can't send your traces to an external SaaS platform without paperwork and approvals.
Phoenix is free, open source, and runs locally in a Docker container. Your traces never leave your machine.
arize.com/docs/phoeni...
The hardest part of AI platforms in banking?
Serving 50 teams at different maturity levels without forcing one workflow.
Phoenix plus AX solves this.
arize.com/blog/why-ba...
deploying AI agents is easy. knowing if they're actually working? harder.
join @ArizeAI + @M12VC at @GitHub HQ in SF on 3/31 with Amanda Foster and Nancy Chauhan for real talk on building and evaluating agents in prod 👇
luma.com/mxlgbdvw
Here’s what worked for us:
• Middle truncation (keep the start + end, drop the middle)
• Memory with retrieval instead of stuffing everything into context
• Deduplicating messages and pruning tool outputs
• Sub-agents to isolate high-volume tasks
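The first technique in that list, middle truncation, can be sketched in a few lines. This is a simplified version operating on a plain message list (not Alyx's actual implementation, which would count tokens rather than messages):

```python
def middle_truncate(messages: list[str], max_items: int) -> list[str]:
    """Keep the start and end of a conversation, drop the middle.

    Early messages usually carry the system prompt and task framing;
    recent messages carry the current state, so both ends are preserved.
    """
    if len(messages) <= max_items:
        return messages
    head = max_items // 2
    tail = max_items - head - 1  # reserve one slot for the truncation marker
    return messages[:head] + ["[... earlier messages truncated ...]"] + messages[-tail:]

msgs = [f"msg-{i}" for i in range(10)]
print(middle_truncate(msgs, 5))
# → ['msg-0', 'msg-1', '[... earlier messages truncated ...]', 'msg-8', 'msg-9']
```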
Worth a read if you’re building agents.
Part 2 of our deep dive into how we built Alyx: context windows
arize.com/blog/how-to...
Once an agent starts running, context becomes the bottleneck fast.
If you're building with LLMs and want a clear path from first prompt to production, this tutorial covers the full workflow in Arize AX.
Get started below ⬇️
arize.com/docs/ax/pro...
💻 Create: System and user message templates, variables, save to Prompt Hub with versioning
🧪 Test: Run on a dataset, add LLM-as-a-Judge evaluators, see how it performs
📈 Optimize: Improve from evaluation feedback, compare versions, validate before production
We just released a new Prompt Tutorial for Arize AX: create, test, and optimize prompts with real data and evaluation.
It's easy to tweak a prompt until it "feels" better without knowing if it actually improved.
This tutorial walks you through a repeatable create → test → optimize workflow:
Agents are great at iteration. Humans often still have to decide what the objective should be.
The whole story here: arize.com/blog/how-we...
What was surprising was how little human input shaped the outcome. Across the whole process the guidance was basically: “run the evals,” “that shortcut makes the output worse,” and “measure tweet coverage instead of link counts.”
Three decisions ended up shaping several rounds of autonomous work.
The agent handled the loop extremely well: run the evals, diagnose failures, fix the code, repeat. It quickly cleaned up issues like hallucinated links and structural problems.