An OpenTelemetry span tree illustrates ATIF steps of a finance assistant calling a financial search tool and processing results.
Get visibility into agent benchmarking execution using ATIF
Four feature descriptions showcase a user interface for managing experiments, including tracking progress, completion, stopping, and resuming tasks.
Experiments kicked off from the UI have all the bells and whistles: live progress, a durable runtime, cancellation, and resumption.
Three model experiments display settings, user prompts, and evaluation results for AI support agents' performance.
A dashboard displays multiple ExperimentJobs queued and running on various workers, showing progress percentages and completed tasks.
Phoenix 14 introduces Experiment Jobs. Hit Run and Phoenix queues an ExperimentJob per instance. A background daemon fans them out to workers and streams progress back.
This propagates to the clients as well.
phoenix CLI and REST API now support pulling spans by OTEL attribute values so you can start debugging targeted parts of your agent topology.
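As a rough sketch of what that looks like from a client, the helper below builds an attribute filter and fetches matching spans. Note: the endpoint path and filter-expression syntax here are illustrative assumptions, not the documented Phoenix REST API; check the Phoenix docs for the real grammar.

```python
import json
import urllib.parse
import urllib.request

PHOENIX_URL = "http://localhost:6006"  # default local Phoenix port; adjust as needed


def build_attribute_filter(key: str, value: str) -> str:
    """Build a span filter expression on an OTel attribute.

    The expression syntax is an illustrative assumption, not the
    documented Phoenix filter grammar.
    """
    return f"attributes['{key}'] == '{value}'"


def fetch_spans(filter_expr: str):
    """GET spans matching the filter (endpoint path is an assumption)."""
    qs = urllib.parse.urlencode({"filter": filter_expr})
    with urllib.request.urlopen(f"{PHOENIX_URL}/v1/spans?{qs}", timeout=10) as r:
        return json.load(r)
```

With something like this you can pull, say, only the spans for one agent in your topology, e.g. `fetch_spans(build_attribute_filter("agent.name", "finance"))`.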
When debugging agents, you need to get to the problem QUICKLY! Phoenix now ships quick list-to-details navigation for agent conversations, plus resizable drawers so you can jump between conversations fast. Sessions also support Vim bindings for pagination!
Also: attribute builders for raw OTel spans, OITracer for redacting sensitive data before export, and full openinference-semantic-conventions re-exported so you never need a second dependency.
No config ceremony. Just wrap and ship.
traceTool, traceAgent, traceChain, withSpan for functions. @observe for class methods. Context setters propagate session IDs, user IDs, and metadata to every child span.
@arizeai/phoenix-otel 1.0 is out.
One import to trace TypeScript agents. Wrap a function, get a span — inputs, outputs, errors, span kind, all recorded automatically.
Some quality-of-life improvements
Arize Phoenix now supports Python 3.14 across all SDKs and adds new REST endpoints, including API key and secret rotation.
arize.com/docs/phoeni...
Guess how many 🌟 Phoenix has on GitHub?
Yes, this meme has been all over our internal slack today.
github.com/Arize-ai/ph...
We just added Streamdown rendering for our LLM outputs, and they now look and perform so well. Particularly loving the bash and code rendering 🤟
The compromised packages have been pulled from PyPI and the LiteLLM team has rotated maintainer credentials. The situation is still developing, and further lateral movement has been reported. We're monitoring and will update if anything changes for Phoenix users.
LiteLLM is NOT a core Phoenix dep. It's an optional extra for phoenix-evals. We've already pinned it below compromised versions and shipped a fix.
PR: github.com/Arize-ai/ph...
The malicious code runs on Python interpreter startup (no import needed). Docker image users of LiteLLM Proxy unaffected.
Phoenix users:
→ Check your installed version: pip show litellm
→ If you're on 1.82.7 or 1.82.8, uninstall and reinstall pinned below the compromised range: pip install "litellm<1.82.7"
→ Rotate any credentials in your environment. The payload targeted env vars and secrets
→ Hold off on upgrading litellm until the maintainers confirm all-clear
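The first two checks above can be scripted. A minimal sketch (the function names are mine for illustration, not a Phoenix utility):

```python
from importlib.metadata import PackageNotFoundError, version

# Versions compromised per the advisory above
COMPROMISED = {"1.82.7", "1.82.8"}


def classify(v: str) -> str:
    """Map an installed litellm version string to an advisory action."""
    if v in COMPROMISED:
        return "COMPROMISED: uninstall, reinstall with litellm<1.82.7, rotate credentials"
    return "not a known-compromised version"


def check_litellm() -> str:
    """Check the litellm installed in the current environment, if any."""
    try:
        return classify(version("litellm"))
    except PackageNotFoundError:
        return "litellm not installed"
```

Because the payload runs at interpreter startup rather than on import, run this from an environment that does not have a compromised litellm on the path (or read the version out of `pip show litellm` instead).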
Post 1
PSA: LiteLLM versions 1.82.7 and 1.82.8 on PyPI were compromised with a credential-stealing payload.
Are you using Phoenix's optional LiteLLM extra for phoenix-evals directly or through DSPy, Smolagents, or CrewAI?
Did you install or upgrade to litellm 1.82.7 or 1.82.8 from PyPI?
See: 🧵
Track exactly what changed in your prompts with diffs!
Arize Phoenix announces new AI providers, highlighting features like prompt running, model comparison, and output evaluation.
The interface displays various outputs responding to the question about the meaning of life, offering philosophical perspectives and insights.
Arize Phoenix 13.11.0 — Adds @perplexity_ai and @togethercompute as built-in providers for benchmarking and evaluation.
Arize AI Phoenix v13.10 now supports Cerebras, Fireworks AI, Groq, and Moonshot (Kimi), as well as OpenAI's GPT 5.4 models, letting you compare hundreds more models side by side for benchmarking, task evaluation, or LLM-judge building.
#AI #LLM #OpenSource #Observability #Evals
Anthropic's Claude Agent SDK lets you build AI agents that autonomously read files, run commands, search the web, and edit code — the same tools and agent loop that power Claude Code, now programmable in Python and TypeScript.
Docs: arize.com/docs/phoeni...
Once you separate those two, debugging agent behavior becomes dramatically easier. Full demo/blog from Elizabeth Hutton on how to evaluate tool calling agents: arize.com/blog/how-to...
You need to measure two different behaviors: did the agent choose the correct tool, and did it call the tool correctly?
From the outside everything looks fine because the right tool is triggered. But a single incorrect argument can make the entire action wrong.
This is why evaluating tool-using agents can’t be reduced to a single score.
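A minimal sketch of scoring those two behaviors separately. The dict shapes and names here are illustrative, not a Phoenix API:

```python
def evaluate_tool_call(expected: dict, actual: dict) -> dict:
    """Score a tool call on two separate axes:
    1) tool selection: did the agent pick the expected tool?
    2) call correctness: given the right tool, were the arguments right?
    """
    tool_selected = expected["tool"] == actual["tool"]
    args_correct = tool_selected and expected["args"] == actual["args"]
    return {"tool_selection": tool_selected, "call_correct": args_correct}


# Right tool, wrong date argument: selection passes, the call still fails.
expected = {"tool": "financial_search", "args": {"ticker": "NVDA", "date": "2024-05-01"}}
actual = {"tool": "financial_search", "args": {"ticker": "NVDA", "date": "2023-05-01"}}
print(evaluate_tool_call(expected, actual))
# {'tool_selection': True, 'call_correct': False}
```

Aggregating each axis over a dataset gives you the two separate percentages rather than one blended score, which is exactly how the 100% vs 36% gap shows up.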
At first glance that looks contradictory. It isn't. The agent was consistently choosing the correct tool; the failures happened after the decision, in how the tool was called. Common examples: wrong dates, missing parameters, incorrect values, or schema mismatches.
This is the chart that should make every AI engineer pause. In the demo agent we evaluated:
⚪ Tool selection: 100%
⚪ Matches expected tool calls: 36%