Why You Need a Checklist (Even If You're Experienced)
We've shipped AI features into 15 SaaS products. Every time, the checklist catches something. Not because we're careless — because AI systems have failure modes that traditional software doesn't. The LLM can be up, returning 200s, and still producing garbage. Your metrics can look green while users are having a terrible experience.
This checklist covers the 23 items you must verify before shipping an AI feature to production. We use this internally for every engagement.
Reliability (Items 1-5)
1. Fallback behavior is defined and tested
What happens when the primary LLM provider is down, slow, or returning errors? You need a concrete answer, not "we'll figure it out." Options: fallback to a secondary model, return a cached response, disable the AI feature gracefully, or show a human-readable error.
Test it: Kill the LLM connection and verify the fallback activates within your timeout window.
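A minimal sketch of the fallback pattern, assuming `primary` and `fallback` are callables wrapping your real LLM clients (names are illustrative, not a specific SDK):

```python
from concurrent.futures import ThreadPoolExecutor

def call_with_fallback(prompt, primary, fallback, timeout_s=5.0):
    """Call the primary LLM client; on timeout or any error, use the fallback."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, prompt)
    try:
        return future.result(timeout=timeout_s)
    except Exception:              # timeout, provider error, network failure
        return fallback(prompt)
    finally:
        pool.shutdown(wait=False)  # don't block waiting on a hung primary call
```

The same shape works whether `fallback` is a secondary model, a cached response, or a canned "AI is unavailable" message.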
2. Timeouts are configured appropriately
Real-time features: 5 seconds max. Background processing: 30 seconds max. Streaming: first token within 2 seconds.
Common mistake: Using the HTTP client default timeout (often 30-60 seconds). A user staring at a spinner for 30 seconds is a user who doesn't come back.
3. Retry logic handles LLM-specific errors
LLM APIs return specific errors that require different retry strategies. Rate limit errors (429) should use exponential backoff. Server errors (500/503) should retry once after 1 second. Context length errors (400) should never be retried as-is — truncate the input and resend with a shorter prompt.
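These per-error strategies can be encoded in one small decision function — a sketch, with `attempt` zero-based and the backoff cap chosen arbitrarily:

```python
def retry_plan(status_code, attempt):
    """Return the delay in seconds before the next attempt, or None to give up."""
    if status_code == 429:            # rate limited: exponential backoff, capped
        return min(2 ** attempt, 60) if attempt < 5 else None
    if status_code in (500, 503):     # server error: a single retry after 1 second
        return 1.0 if attempt == 0 else None
    return None                       # 400 (context length) and everything else: don't retry
```

The caller truncates the prompt on 400s before resubmitting; that path is a new request, not a retry of the old one.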
4. Circuit breaker prevents cascade failures
If the LLM provider starts timing out, you don't want every request in your system waiting 5 seconds to fail. Implement a circuit breaker that trips after 5 consecutive failures and returns the fallback immediately for 30 seconds before trying again.
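A bare-bones breaker implementing exactly those numbers (5 failures, 30-second cooldown); the injectable `clock` is just for testability:

```python
import time

class CircuitBreaker:
    """Trips after `threshold` consecutive failures; stays open for `cooldown_s`."""
    def __init__(self, threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """True if the next request may go to the provider."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            self.opened_at = None   # half-open: let one request probe the provider
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

When `allow()` returns False, serve the fallback immediately instead of waiting out another timeout.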
5. Graceful degradation is user-visible
When AI is degraded, tell the user. "AI features are temporarily limited" is better than silently returning bad results. If you're falling back to a simpler model, the output quality may change — users should know.
Observability (Items 6-10)
6. Latency is tracked per-endpoint (p50, p95, p99)
Average latency lies. Track percentiles. Your p50 might be 800ms (fine), but your p99 might be 12 seconds (not fine). Track latency for each AI-powered endpoint separately.
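If your metrics stack doesn't already compute percentiles, a nearest-rank calculation over collected samples is enough to start — a sketch, not a streaming implementation:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in (0, 100]) over a list of latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

At production volume you'd use your metrics backend's histogram support rather than sorting raw samples per query.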
7. Token usage is tracked per-request
You need to know input tokens and output tokens for every request. This feeds into cost tracking, budget enforcement, and optimization. Log both the token count and the model used.
8. Error rates are segmented by error type
"2% error rate" tells you nothing. You need: rate limit errors, timeout errors, content filter errors, context length errors, model errors, and parsing errors — each tracked separately. Different errors require different fixes.
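Segmentation only requires tagging each error with its type at record time — a minimal sketch using an in-memory counter (a real system would emit these as labeled metrics):

```python
from collections import Counter

class ErrorTracker:
    """Counts errors per (endpoint, error_type) so rates can be segmented."""
    def __init__(self):
        self.counts = Counter()

    def record(self, endpoint, error_type):
        self.counts[(endpoint, error_type)] += 1

    def rate(self, endpoint, error_type, total_requests):
        return self.counts[(endpoint, error_type)] / total_requests
```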
9. Cost per request is calculated in real-time
cost = (input_tokens × input_price) + (output_tokens × output_price)
Track this per request, aggregate per user, per tenant, and per feature. Alert if daily cost exceeds 2x the 7-day average.
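The formula and the alert rule above, in code (prices here are quoted per 1K tokens, which is one common convention — adjust to how your provider quotes them):

```python
def request_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Dollar cost of one request: tokens times per-1K-token prices."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

def should_alert(today_cost, trailing_7_day_costs, factor=2.0):
    """True when today's spend exceeds `factor` times the trailing average."""
    average = sum(trailing_7_day_costs) / len(trailing_7_day_costs)
    return today_cost > factor * average
```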
10. Input/output sampling is enabled for debugging
Log a random sample (5-10%) of full inputs and outputs for debugging and quality analysis. Redact PII before logging. Store with a 30-day retention. This is how you debug "the AI gave a weird answer" reports.
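A sketch of the sampling-plus-redaction step; the email pattern is only one example of PII — your redaction list depends on your data policy, and the injectable `rng` exists for testability:

```python
import random
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def maybe_log(prompt, response, rate=0.05, rng=random.random):
    """Return a redacted log record for roughly `rate` of requests, else None."""
    if rng() >= rate:
        return None
    redact = lambda text: EMAIL.sub("[email]", text)
    return {"prompt": redact(prompt), "response": redact(response)}
```

Redact before the record leaves the process, so unredacted PII never reaches your logging pipeline.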
Cost Controls (Items 11-14)
11. Per-user token budgets are enforced
Set a daily or monthly token budget per user. When exceeded, either degrade to a cheaper model, show a "limit reached" message, or throttle requests. Without this, a single power user can blow your monthly AI budget.
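The three responses above (serve, degrade, block) map onto a simple budget check — an in-memory sketch; production would back this with your datastore and reset `used` daily:

```python
class TokenBudget:
    """Tracks per-user token usage against a daily budget."""
    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.used = {}

    def check(self, user_id, tokens_requested):
        """Return 'ok', 'degrade', or 'block' for the next request."""
        used = self.used.get(user_id, 0)
        if used + tokens_requested <= self.daily_limit:
            return "ok"
        if used < self.daily_limit:
            return "degrade"   # budget nearly spent: route to a cheaper model
        return "block"         # budget exhausted: show the limit-reached message

    def record(self, user_id, tokens):
        self.used[user_id] = self.used.get(user_id, 0) + tokens
```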
12. Per-tenant cost alerts are configured
If you're B2B, track AI cost per tenant. Alert when a tenant's AI usage spikes above 3x their average. This catches both abuse and legitimate usage that needs a pricing conversation.
13. Model routing optimizes cost vs quality
Not every request needs the most expensive model. Route simple tasks (classification, extraction) to cheaper/faster models. Reserve expensive models (Claude Opus, GPT-4o) for complex reasoning. A good routing strategy saves 20-40% on API costs.
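The simplest routing strategy is a static map from task type to model — model names below are placeholders for whatever you actually deploy:

```python
# Placeholder model identifiers — substitute your real model names.
ROUTES = {
    "classification": "small-fast-model",
    "extraction": "small-fast-model",
    "reasoning": "large-expensive-model",
}

def route_model(task_type, default="large-expensive-model"):
    """Send simple task types to cheap models; default to the strong model."""
    return ROUTES.get(task_type, default)
```

Defaulting unknown task types to the expensive model keeps quality safe while you classify more traffic.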
14. Response caching is implemented
Cache identical or near-identical requests. For FAQ-style queries, hit rate can reach 30-50%. Even a simple exact-match cache with a 1-hour TTL provides significant savings on high-traffic features.
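An exact-match TTL cache is a few lines — a sketch keyed on (model, prompt), with an injectable clock for testing; a real deployment would use Redis or similar rather than a process-local dict:

```python
import hashlib
import time

class ResponseCache:
    """Exact-match response cache keyed on (model, prompt) with a TTL."""
    def __init__(self, ttl_s=3600.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.store = {}

    def _key(self, model, prompt):
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model, prompt):
        entry = self.store.get(self._key(model, prompt))
        if entry is None:
            return None
        response, stored_at = entry
        if self.clock() - stored_at > self.ttl_s:
            return None        # expired
        return response

    def put(self, model, prompt, response):
        self.store[self._key(model, prompt)] = (response, self.clock())
```

Include the model name in the key so a routing change never serves a cached answer from a different model.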
Security and Privacy (Items 15-18)
15. Input sanitization prevents prompt injection
Users will try to override your system prompt. Implement basic guards: separate system and user messages, validate that the model's output matches expected formats, and never execute raw model output as code or commands.
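One concrete guard from the list above — validating that the model's output matches an expected format — can be sketched as a strict JSON check (the required-keys contract is illustrative):

```python
import json

def validate_output(raw, required_keys):
    """Accept model output only if it is JSON with exactly the expected fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(required_keys):
        return None
    return data
```

Injected instructions that push the model off-format then fail validation instead of flowing downstream.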
16. PII handling follows your data policy
Know exactly what user data is sent to the LLM provider. If you're sending names, emails, or account data, ensure it's covered by your privacy policy and the provider's DPA. Consider anonymizing identifiers before sending.
17. Output filtering catches harmful content
Even with good prompts, models occasionally generate inappropriate content. Implement output filtering for your specific context — a B2B SaaS product should flag profanity; a healthcare product should flag medical advice.
18. Access controls are applied to AI features
AI features should respect your existing permission model. If a user doesn't have access to certain data, the AI shouldn't be able to surface that data in responses. This is especially critical in multi-tenant systems.
User Experience (Items 19-21)
19. Loading states are implemented for AI responses
Never leave the user staring at a blank screen. Show a typing indicator, a progress message, or stream tokens as they arrive. For operations longer than 2 seconds, show estimated wait time.
20. User feedback mechanism is available
Add thumbs up/down, or a simple "Was this helpful?" for every AI-generated response. This is your cheapest quality signal. Track feedback rate and sentiment over time. Alert if negative feedback spikes.
21. AI attribution is clear
Users should know when content is AI-generated. This isn't just ethical — it sets expectations. Users evaluate AI-generated content differently from human-written content, and that's appropriate.
Deployment (Items 22-23)
22. Feature flag controls AI rollout
Never ship AI to 100% of users on day one. Use a feature flag for percentage-based rollout. Start at 5%, monitor for 24 hours, then ramp to 25%, 50%, 100%. Have a kill switch that disables the feature instantly.
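If your feature-flag service doesn't provide percentage rollout, a deterministic hash bucket gives each user a stable assignment as you ramp — a sketch, with the salt keeping buckets independent across features:

```python
import hashlib

def in_rollout(user_id, percent, salt="ai-feature"):
    """Deterministic percentage rollout: the same user always gets the same answer."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent
```

Raising `percent` from 5 to 25 keeps the original 5% enrolled, so users never flip back and forth between ramp steps. The kill switch is a separate boolean flag checked before this.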
23. Rollback procedure is documented and tested
If the AI feature causes issues in production, can you roll it back in under 5 minutes? Document the procedure. Test it in staging. Make sure the on-call team knows how to execute it.
Using This Checklist
Print this. Go through it for every AI feature before launch. If you can't check an item, either fix it or make a conscious decision to accept the risk. Most items take less than a day to implement. The cost of skipping them is measured in incidents, not hours.
The items most commonly skipped (and most commonly regretted): #4 (circuit breaker), #11 (per-user budgets), #13 (model routing), and #22 (feature flags). Don't skip these.