Observability Beyond Accuracy: Tracing the Full Evaluation Lifecycle
Most teams measure accuracy. The best teams measure the entire evaluation lifecycle — from input telemetry to judge reasoning traces. Here is what you are missing without trace-level observability.
Priya Sharma
Walk into any team operating LLMs at scale and ask them what their evaluation accuracy is. You will get a number. Ask them why it changed last sprint, and the conversation gets much harder. Accuracy is a result metric — it tells you what happened, not what caused it. Observability is the practice of building infrastructure to answer the "why".
What Accuracy Alone Misses
In their landmark survey of foundation model risks, Bommasani et al. at Stanford's Center for Research on Foundation Models documented a pattern they termed the "evaluation bottleneck": most teams had extensive metrics for model capability but almost no instrumentation for understanding the conditions under which capable models failed.[1] Three years on, in enterprise production deployments, this pattern persists.
Without trace-level observability you cannot distinguish a model that scores 0.85 on simple tasks and 0.40 on complex ones from a model that scores about 0.63 uniformly, even though on an evenly split test set both report nearly the same aggregate accuracy. Nor can you tell a latency spike caused by tokenisation overhead from one caused by upstream API congestion, or a degradation caused by a retrieval quality regression from one caused by a model update. Each distinction matters enormously for the remediation path.
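To make the first distinction concrete, here is a minimal sketch that breaks accuracy down by task stratum. The two models, the stratum labels, and the even 100/100 split are hypothetical values chosen for illustration, not measurements.

```python
from collections import defaultdict


def accuracy_by_stratum(results):
    """Return (overall_accuracy, {stratum: accuracy}) for (stratum, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for stratum, passed in results:
        totals[stratum] += 1
        passes[stratum] += int(passed)
    per_stratum = {s: passes[s] / totals[s] for s in totals}
    overall = sum(passes.values()) / sum(totals.values())
    return overall, per_stratum


# Hypothetical model A: strong on simple tasks, weak on complex ones.
model_a = (
    [("simple", True)] * 85 + [("simple", False)] * 15
    + [("complex", True)] * 40 + [("complex", False)] * 60
)
# Hypothetical model B: roughly uniform across both strata.
model_b = (
    [("simple", True)] * 63 + [("simple", False)] * 37
    + [("complex", True)] * 62 + [("complex", False)] * 38
)

overall_a, strata_a = accuracy_by_stratum(model_a)
overall_b, strata_b = accuracy_by_stratum(model_b)
print(overall_a, strata_a)
print(overall_b, strata_b)
```

Both models report the same aggregate figure, yet only the per-stratum view shows that model A needs work on complex tasks specifically.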
The Three Observability Layers
Layer 1 — Input Telemetry
Every evaluation run produces structured inputs: scenario IDs, prompt templates, configuration parameters, context injections. Capturing these as structured telemetry events is the foundation. Without input telemetry, you cannot reproduce a failure, cannot correlate score changes to configuration changes, and cannot build a meaningful regression baseline.
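One way to capture these inputs is a small structured event with a deterministic fingerprint, so that a score change can be matched to a configuration change. This is a sketch only: the `InputEvent` class, its field names, and the truncated SHA-256 fingerprint are assumptions made for illustration, not a prescribed schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class InputEvent:
    """Hypothetical structured record of one evaluation run's inputs."""
    scenario_id: str
    prompt_template: str
    config: dict = field(default_factory=dict)
    context_ids: tuple = ()

    def fingerprint(self) -> str:
        # Sorting keys makes the hash independent of dict insertion order.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


baseline = InputEvent("scn-001", "qa_v1", {"temperature": 0.0, "max_tokens": 256})
reordered = InputEvent("scn-001", "qa_v1", {"max_tokens": 256, "temperature": 0.0})
changed = InputEvent("scn-001", "qa_v1", {"temperature": 0.7, "max_tokens": 256})

print(baseline.fingerprint() == reordered.fingerprint())  # same inputs, same fingerprint
print(baseline.fingerprint() == changed.fingerprint())    # config change is detected
```

Storing the fingerprint next to each score gives you a cheap key for grouping runs into regression baselines.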
The OpenTelemetry Semantic Conventions for Generative AI (v0.3.0, 2024) provide a standardised schema for this layer, including span attributes for gen_ai.request.model, gen_ai.usage.input_tokens, and the full prompt content under gen_ai.prompt.[2] Following these conventions means your traces can be exported over OTLP to any backend that accepts it, such as Grafana, Honeycomb, or AWS X-Ray, without vendor lock-in.
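As a rough illustration, the attribute names mentioned above can be assembled into a plain dictionary before being attached to a span. The helper `to_span_attributes` and its `capture_prompt` flag are hypothetical; making prompt capture opt-in is a design choice assumed here because prompts can contain sensitive data. The sketch deliberately avoids importing any tracing library.

```python
def to_span_attributes(model, input_tokens, prompt, capture_prompt=False):
    """Build a dict of span attributes using the gen_ai.* names cited above."""
    attrs = {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
    }
    # Assumption: full prompt text is only recorded when explicitly enabled.
    if capture_prompt:
        attrs["gen_ai.prompt"] = prompt
    return attrs


default_attrs = to_span_attributes("example-model", 42, "What is OTLP?")
full_attrs = to_span_attributes("example-model", 42, "What is OTLP?", capture_prompt=True)
print(default_attrs)
print(full_attrs)
```

With the OpenTelemetry Python API, each key and value would typically be passed to `span.set_attribute(key, value)` on the active span.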
Layer 2 — Execution Telemetry
The execution layer captures what happens during inference: latency at each stage, token counts, retry events, timeouts, and error codes. This is where you detect cost anomalies, latency regressions, and reliability issues before they affect evaluation validity.
A common mistake is treating execution metrics as secondary to evaluation scores. A latency spike that causes evaluation timeouts will manifest as a sudden accuracy drop — and without execution telemetry, you will spend hours investigating a phantom model regression that is actually an infrastructure issue.
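The phantom-regression scenario can be sketched by reporting timeouts separately from wrong answers. The `RunRecord` fields and the sample batch below are hypothetical; the point is only that raw accuracy and accuracy on completed runs diverge when timeouts are present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical execution record for one evaluation run."""
    run_id: str
    latency_s: float
    timed_out: bool
    correct: Optional[bool]  # None when the run produced no answer


def summarize(runs):
    """Separate infrastructure failures (timeouts) from model errors."""
    total = len(runs)
    completed = [r for r in runs if not r.timed_out]
    correct = sum(1 for r in completed if r.correct)
    return {
        "raw_accuracy": correct / total if total else 0.0,
        "timeout_rate": (total - len(completed)) / total if total else 0.0,
        "accuracy_on_completed": correct / len(completed) if completed else 0.0,
    }


# Illustrative batch: 7 correct, 1 wrong, 2 timed out.
runs = (
    [RunRecord(f"ok-{i}", 3.0, False, True) for i in range(7)]
    + [RunRecord("wrong-0", 3.5, False, False)]
    + [RunRecord(f"slow-{i}", 120.0, True, None) for i in range(2)]
)
summary = summarize(runs)
print(summary)
```

Here raw accuracy is 0.7, but the model answered 0.875 of completed runs correctly, which points at the 0.2 timeout rate rather than at the model.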
Layer 3 — Judge Telemetry
Judge telemetry is the highest-value observability layer, and the one most frequently absent. It captures not just the final score but the reasoning process: what criteria did the judge apply, what evidence did it cite, and where did it flag ambiguity? This transforms a score from an opaque number into an auditable claim with supporting evidence, which is essential for compliance workflows and for calibrating the judge itself over time.
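A judge trace along these lines might be recorded as follows. This is a sketch under stated assumptions: the `CriterionResult` and `JudgeTrace` classes, their field names, and the unweighted mean used for the final score are all hypothetical, and a real judge may aggregate criteria differently.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CriterionResult:
    """One criterion the judge applied, with the evidence it cited."""
    name: str
    score: float
    evidence: str
    ambiguous: bool = False


@dataclass
class JudgeTrace:
    """Hypothetical auditable record of a single judge decision."""
    run_id: str
    judge_model: str
    criteria: list = field(default_factory=list)

    @property
    def final_score(self) -> float:
        # Assumption: the final score is the unweighted mean of criterion scores.
        if not self.criteria:
            return 0.0
        return sum(c.score for c in self.criteria) / len(self.criteria)

    def ambiguous_criteria(self):
        return [c.name for c in self.criteria if c.ambiguous]

    def to_json(self) -> str:
        record = asdict(self)
        record["final_score"] = self.final_score
        return json.dumps(record, sort_keys=True)


trace = JudgeTrace(
    run_id="run-001",
    judge_model="example-judge",
    criteria=[
        CriterionResult("factuality", 1.0, "Answer matches the retrieved passage."),
        CriterionResult("completeness", 0.5, "Second sub-question unanswered.", ambiguous=True),
    ],
)
print(trace.final_score, trace.ambiguous_criteria())
print(trace.to_json())
```

Keeping per-criterion evidence alongside the score is what lets an auditor, or a later calibration pass, check why the judge decided as it did.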
SLO-Based Alerting for Evaluation Infrastructure
Threshold-based alerting — "alert when accuracy drops below 0.80" — generates false positives and fosters alert fatigue. The SRE approach, described in Google's Site Reliability Engineering book, uses Service Level Objectives defined in terms of error budgets: you are permitted a certain rate of failures over a rolling window, and alerting fires when you are consuming that budget faster than sustainable.[3]
You cannot improve what you cannot observe. But you also cannot act on what you cannot understand. The goal is not more metrics — it is causal clarity.
Applied to LLM evaluation infrastructure, an SLO might read: "Over a 28-day rolling window, 95% of evaluation runs should complete within 120 seconds and produce a final score." That leaves an error budget of 5% of runs in the window. An alert fires not when a single run fails but when failures consume that budget at an unsustainable rate, which filters out most transient false positives while still catching meaningful degradation.
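The burn-rate idea can be sketched in a few lines. The 95% target and 120-second limit come from the example SLO above; the function names, the two-window check, the threshold of 2.0, and the sample counts are assumptions chosen for illustration, not recommended values.

```python
SLO_TARGET = 0.95       # from the example SLO above
MAX_DURATION_S = 120.0  # from the example SLO above


def is_good_run(duration_s, final_score):
    """A run counts as good if it finished in time and produced a score."""
    return duration_s <= MAX_DURATION_S and final_score is not None


def burn_rate(failed, total, slo_target=SLO_TARGET):
    """Observed failure rate divided by the failure rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly as fast
    as the SLO permits; higher values mean it will run out early.
    """
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (failed / total) / budget


def should_alert(short_window, long_window, threshold=2.0):
    """Alert only when both windows burn budget faster than the threshold.

    Each window is a (failed, total) pair. Requiring both windows to agree
    is an assumed policy for suppressing short transient blips.
    """
    return (
        burn_rate(*short_window) >= threshold
        and burn_rate(*long_window) >= threshold
    )


# Hypothetical counts: a single failure in the short window is not enough.
print(should_alert(short_window=(1, 50), long_window=(20, 1000)))
# Sustained failures in both windows trigger the alert.
print(should_alert(short_window=(10, 50), long_window=(150, 1000)))
```

Because the alert is driven by the rate of budget consumption rather than by any single run, one timed-out evaluation does not page anyone, while a sustained failure rate several times above the budget does.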
References
- [1] Bommasani, R. et al. "On the Opportunities and Risks of Foundation Models." Stanford Center for Research on Foundation Models (CRFM), 2022.
- [2] OpenTelemetry. "Semantic Conventions for Generative AI, v0.3.0." CNCF OpenTelemetry Project, 2024.
- [3] Beyer, B. et al. "Site Reliability Engineering: How Google Runs Production Systems." O'Reilly Media, 2016.
- [4] Arize AI. "Phoenix: Open-Source AI Observability." Arize AI, 2024.