
LLM Observability: Monitoring What Your AI Is Actually Doing

The complete guide to LLM observability in production. Covers the 12 metrics you need, distributed tracing for AI, anomaly detection, cost dashboards, and the alerting rules that catch problems before users do.

Updated Mar 11, 2026

Why Traditional APM Fails for LLM Applications

Your existing APM tools (Datadog, New Relic, Grafana) tell you that an HTTP request took 2.3 seconds, returned 200 OK, and consumed 45MB of memory. For an LLM-powered endpoint, this tells you almost nothing useful.

The request succeeded — but did the LLM hallucinate an answer? Did it follow the system prompt? Did it cost $0.002 or $0.20? Did it use the right model? Was the response actually helpful to the user?

LLM observability requires a fundamentally different approach. You need to track not just infrastructure metrics, but content quality, cost efficiency, and model behavior. Here's how.

The 12 Metrics Every LLM System Needs

Reliability Metrics

1. Latency (p50, p95, p99) per endpoint

Track time to first token (TTFT) and total response time separately. TTFT matters for streaming UX. Total time matters for billing and throughput.

Healthy targets: TTFT < 500ms (p95), total < 3s (p95) for chat features. Alert if p95 exceeds 5s.
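Measuring the two latencies separately means wrapping the provider's streaming iterator. A minimal, provider-agnostic sketch — `chunks` stands in for whatever streaming iterator your client library returns:

```python
import time

def timed_stream(chunks):
    """Wrap a streaming response iterator, recording time to first
    token (TTFT) and total response time separately. Results are
    stashed on the function for illustration; in production, emit
    them as two histogram metrics instead."""
    start = time.monotonic()
    ttft = None
    for chunk in chunks:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        yield chunk
    timed_stream.last = {
        "ttft_s": ttft,
        "total_s": time.monotonic() - start,
    }
```

Because the wrapper yields chunks through unchanged, it can sit between any streaming client and your response handler without altering the UX.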

2. Error rate by error type

Segment errors into: rate limits (429), timeouts, context length exceeded, content filtered, server errors (500), and parsing failures. Each type has different root causes and remediation.

Healthy target: < 1% total error rate. Alert if any single error type exceeds 0.5%.
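Segmentation only works if every failure maps to exactly one bucket. A sketch of that mapping, assuming hypothetical `status_code`, `kind`, and `message` fields on your failure object — adapt the string matching to the actual error messages your provider returns:

```python
def classify_error(status_code=None, kind=None, message=""):
    """Bucket a failed LLM call into one of the six error types,
    checked in priority order so each failure lands in one bucket."""
    msg = message.lower()
    if kind == "timeout":
        return "timeout"
    if status_code == 429:
        return "rate_limit"
    if "context length" in msg or "maximum context" in msg:
        return "context_length_exceeded"
    if "content filter" in msg or "safety" in msg:
        return "content_filtered"
    if status_code is not None and status_code >= 500:
        return "server_error"
    if kind == "parse":
        return "parsing_failure"
    return "unknown"
```

Tag each request metric with the returned bucket so the per-type 0.5% alert can fire independently of the overall rate.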

3. Timeout rate

Track separately from errors because timeouts are a reliability signal, not a correctness signal. A spike in timeouts usually means the provider is degraded.

Healthy target: < 0.5%. Alert if it exceeds 2% in a 5-minute window.

4. Fallback activation rate

How often are you hitting your secondary model or cached responses? Some fallback usage is expected (0.5-2%). A spike indicates primary provider issues.

Cost Metrics

5. Cost per request

Calculate in real-time: (input_tokens × input_price) + (output_tokens × output_price). Track per endpoint, per model, per user.
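The formula above in code. The prices are illustrative placeholders, not real rates — per-token pricing varies by model and changes often, so load it from config rather than hardcoding:

```python
# Illustrative (input, output) USD prices per 1M tokens.
# Real prices vary by model and change over time.
PRICES_PER_M = {
    "model-small": (0.25, 1.25),
    "model-large": (3.00, 15.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Cost = (input_tokens x input_price) + (output_tokens x output_price)."""
    in_price, out_price = PRICES_PER_M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

Attach the result to the request's trace span and emit it as a metric tagged by endpoint, model, and user, so the same number feeds all three breakdowns.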

6. Cost per user per day

Aggregate cost at the user level. This feeds into per-user budget enforcement and pricing decisions. Track the distribution — most users will be cheap, but your top 5% will drive 50%+ of cost.

7. Token usage (input vs output)

Track input and output tokens separately. Input tokens are your biggest optimization lever (context management). Output tokens indicate response verbosity (which you can control via prompts and max_tokens).

8. Cache hit rate

Target 20-40% for conversational features, 50%+ for search/FAQ. Track per cache layer (exact match, semantic, prefix). If overall hit rate is below 10%, your caching strategy needs work.
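Per-layer tracking needs nothing more than two counters per layer. A minimal sketch:

```python
from collections import Counter

class CacheStats:
    """Track hit rate per cache layer (exact match, semantic, prefix)."""

    def __init__(self):
        self.hits = Counter()
        self.lookups = Counter()

    def record(self, layer: str, hit: bool):
        self.lookups[layer] += 1
        if hit:
            self.hits[layer] += 1

    def hit_rate(self, layer: str) -> float:
        n = self.lookups[layer]
        return self.hits[layer] / n if n else 0.0
```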

Quality Metrics

9. User feedback score

Thumbs up/down on AI responses. Track the ratio over time. A sustained drop (e.g., from 85% positive to 70% positive) indicates quality degradation — possibly from model updates, prompt drift, or data issues.

10. Hallucination rate (sampled)

Run automated hallucination detection on a sample (5-10%) of responses. Use an LLM-as-judge approach: pass the response and the source context to a judge model, ask if the response is faithful to the sources. Track weekly trends.
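Two pieces make this cheap and reproducible: deterministic sampling, so the same request is always in or out of the sample no matter which worker sees it, and a fixed judge prompt. A sketch — the judge call itself is omitted because it is provider-specific, and the prompt wording is an illustration, not a benchmarked template:

```python
import hashlib

def should_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministic sampling: hash the request id into 10,000 buckets
    so sampling decisions are stable across workers and retries."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

def judge_prompt(response: str, sources: str) -> str:
    """Build the faithfulness prompt sent to the judge model."""
    return (
        "You are grading an AI answer for faithfulness.\n"
        f"Source context:\n{sources}\n\n"
        f"Answer to grade:\n{response}\n\n"
        "Is every claim in the answer supported by the source context? "
        "Reply with FAITHFUL or UNFAITHFUL and one sentence of reasoning."
    )
```

Using a hash instead of `random.random()` also means you can replay historical traffic through the pipeline and get the same sample.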

11. Format compliance rate

If your prompts request structured output (JSON, specific format), track how often the response actually matches. Non-compliant responses cause downstream parsing errors. Target 99%+.
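For JSON output, the compliance check can be as simple as parsing plus key validation. A sketch — the required keys below are illustrative stand-ins for whatever your prompt actually requests:

```python
import json

def is_compliant(raw: str, required_keys=("answer", "confidence")) -> bool:
    """Return True if the response parses as a JSON object containing
    every key the prompt asked for (keys here are illustrative)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)
```

Run this on every structured response, not a sample — it is nearly free, and the per-endpoint compliance rate is your early warning for downstream parsing errors.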

12. Regeneration rate

How often do users click "regenerate" or "try again"? This is a direct signal that the first response was unsatisfactory. Track per feature. Rising regeneration rate = falling quality.

Distributed Tracing for AI

Standard distributed tracing tracks request → service → database. AI tracing needs additional spans:

[User Request]
  └── [AI Gateway]
       ├── [Cache Check] (hit/miss, latency)
       ├── [Model Router] (selected model, reason)
       ├── [LLM Call] (model, tokens_in, tokens_out, latency, cost)
       │    └── [Tool Calls] (tool_name, parameters, result, latency)
       ├── [Response Processing] (parsing, validation, filtering)
       └── [Cache Write] (key, TTL)

Each span should include AI-specific attributes:

  • llm.model: which model was used
  • llm.tokens.input: input token count
  • llm.tokens.output: output token count
  • llm.cost: calculated cost in USD
  • llm.cache_hit: boolean
  • llm.fallback: boolean (was a fallback model used?)

This trace structure lets you answer questions like: "Why did this request cost $0.15?" (Answer: cache miss, routed to Opus, 3 tool calls, 8K output tokens).
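In an OpenTelemetry-style setup you would set these attributes on the LLM-call span. A sketch that assembles them as a plain dict — the attribute names follow the list above, not any specific tracing SDK's conventions:

```python
def llm_span_attributes(model, tokens_in, tokens_out, cost_usd,
                        cache_hit=False, fallback=False):
    """Assemble the AI-specific span attributes listed above, ready to
    attach to a trace span (e.g. via a span's set_attribute calls)."""
    return {
        "llm.model": model,
        "llm.tokens.input": tokens_in,
        "llm.tokens.output": tokens_out,
        "llm.cost": round(cost_usd, 6),
        "llm.cache_hit": cache_hit,
        "llm.fallback": fallback,
    }
```

Keeping the attribute set in one helper also guarantees every LLM span carries the same fields, which is what makes trace-level queries like the $0.15 example answerable.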

Cost Dashboards

Build three cost views:

Real-time view: Cost per minute/hour for the last 24 hours. This is your smoke alarm. A spike means a runaway prompt, a cache failure, or a traffic burst.

Daily rollup view: Cost per day, broken down by model, feature, and top users. This feeds into budgeting and optimization planning.

Monthly projection view: Current month spend with linear projection to month-end. Compare against budget. Alert at 80% of monthly budget.
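The projection and the 80% alert are a few lines each. A sketch using linear extrapolation from month-to-date spend:

```python
import calendar
from datetime import date

def projected_month_spend(spend_to_date: float, today: date) -> float:
    """Linear projection: average daily spend so far, times the
    number of days in the current month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month

def budget_alert(spend_to_date: float, monthly_budget: float) -> bool:
    """Fire once month-to-date spend crosses 80% of the budget."""
    return spend_to_date >= 0.8 * monthly_budget
```

Linear projection overshoots early in the month if your traffic is weekday-heavy; a day-of-week-weighted average is a reasonable refinement once you have a few months of history.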

Quality Monitoring: Detecting Output Degradation

Model providers silently update models. What worked last Tuesday might not work this Tuesday. You need automated quality monitoring to catch degradation.

The Quality Pipeline

  • Sample 5-10% of production requests
  • Evaluate using LLM-as-judge (a separate model scores the response on a 1-5 scale for relevance, accuracy, and format compliance)
  • Aggregate scores into a daily quality score per feature
  • Alert if the 7-day rolling average drops by more than 10%
  • Investigate by comparing recent responses to historical baselines
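The alerting step above compares the latest 7-day rolling average of judge scores against the previous window. A sketch, assuming one aggregated score per day:

```python
def quality_degraded(daily_scores, window=7, drop_threshold=0.10) -> bool:
    """Alert when the latest rolling-window average of daily quality
    scores drops more than `drop_threshold` (relative) below the
    previous window's average."""
    if len(daily_scores) < 2 * window:
        return False  # not enough history for a baseline yet
    recent = sum(daily_scores[-window:]) / window
    baseline = sum(daily_scores[-2 * window:-window]) / window
    return recent < baseline * (1 - drop_threshold)
```

Comparing window against window, rather than against a fixed historical number, keeps the alert robust to slow seasonal drift while still catching step changes from a silent model update.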

Cost of this pipeline: approximately 5-10% of your primary LLM spend. Worth it.

Anomaly Detection

Beyond quality scores, watch for statistical anomalies:

  • Response length anomalies: If average response length changes by more than 20%, the model behavior has shifted.
  • Token usage anomalies: Sudden increase in input or output tokens per request.
  • Refusal rate anomalies: The model refusing to answer queries it previously handled.
  • Latency anomalies: p95 latency increasing without traffic increase.
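A simple relative-change check covers the first two anomaly types directly (response length, tokens per request); latency and refusal rates usually want per-window baselines, but the core test is the same:

```python
def pct_change_anomaly(current: float, baseline: float,
                       threshold: float = 0.20) -> bool:
    """Flag when a metric (e.g. average response length) moves more
    than `threshold` relative to its baseline, in either direction."""
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > threshold
```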

Tool Comparison

LangSmith (LangChain)

Best for teams already using LangChain. Strong tracing and evaluation features. Less useful if you're not in the LangChain ecosystem.

Helicone

Best for cost monitoring and gateway-level observability. Easy to integrate (proxy-based). Good dashboards. Limited custom evaluation.

Custom (recommended for production)

Build on your existing observability stack (Datadog, Grafana, Prometheus). Add AI-specific metrics as custom metrics. Most flexible, most work. Recommended for teams with strong observability culture.

Our Recommendation

Start with Helicone for immediate visibility (15-minute setup). Build custom metrics in your existing observability stack for the long term. Use LangSmith only if you're already on LangChain.

Building Your LLM Observability Stack

Week 1: Instrument the 4 reliability metrics (latency, errors, timeouts, fallback). Ship a basic cost dashboard.

Week 2: Add quality metrics (user feedback, format compliance). Set up sampling for input/output logging.

Week 3: Build the quality evaluation pipeline. Implement anomaly detection on key metrics.

Week 4: Create alerting rules. Document runbooks for common alert scenarios.

Total effort: 2-3 engineering weeks. This is not optional for production AI — it's the difference between knowing your AI is working and hoping it is.

About the author

Written by Rafael Danieli, founder of StoAI. Systems engineer specializing in production AI for SaaS companies. Background in distributed systems, reliability engineering, and integration architecture.