
AI Integration Checklist: 23 Things to Verify Before Going to Production

The production readiness checklist for AI features. Covers reliability, observability, cost controls, security, and user experience — everything your team forgets until something breaks.

10 min read · Updated Mar 11, 2026

Why You Need a Checklist (Even If You're Experienced)

We've shipped AI features into 15 SaaS products. Every time, the checklist catches something. Not because we're careless — because AI systems have failure modes that traditional software doesn't. The LLM can be up, returning 200s, and still producing garbage. Your metrics can look green while users are having a terrible experience.

This checklist covers the 23 items you must verify before shipping an AI feature to production. We use this internally for every engagement.

Reliability (Items 1-5)

1. Fallback behavior is defined and tested

What happens when the primary LLM provider is down, slow, or returning errors? You need a concrete answer, not "we'll figure it out." Options: fallback to a secondary model, return a cached response, disable the AI feature gracefully, or show a human-readable error.

Test it: Kill the LLM connection and verify the fallback activates within your timeout window.
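As a sketch of the fallback chain (the `primary`, `secondary`, and `cached_response` names here are illustrative, not a specific SDK's API):

```python
def call_with_fallback(primary, secondary, cached_response):
    """Try the primary provider, then a secondary, then a cached answer.

    `primary` and `secondary` are hypothetical zero-arg callables that
    raise on failure; `cached_response` is the last-resort static answer.
    """
    for provider in (primary, secondary):
        try:
            return provider()
        except Exception:
            continue  # provider down, slow, or erroring: fall through
    return cached_response
```

Testing it is as simple as passing a callable that raises and verifying the next tier answers instead.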

2. Timeouts are configured appropriately

Real-time features: 5 seconds max. Background processing: 30 seconds max. Streaming: first token within 2 seconds.

Common mistake: Using the HTTP client default timeout (often 30-60 seconds). A user staring at a spinner for 30 seconds is a user who doesn't come back.

3. Retry logic handles LLM-specific errors

LLM APIs return specific errors that require different retry strategies. Rate limit errors (429) should use exponential backoff. Server errors (500/503) should retry once after 1 second. Context length errors (400) should never be retried as-is: truncate the input first, then retry with the shorter prompt.
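A minimal sketch of that per-error policy, assuming a hypothetical `LLMError` that carries the provider's HTTP status code (real SDKs expose this differently, e.g. as typed exception classes):

```python
import time

class LLMError(Exception):
    """Hypothetical error type carrying the provider's HTTP status code."""
    def __init__(self, status):
        super().__init__(f"LLM error {status}")
        self.status = status

def call_with_retries(call, prompt, shorten, sleep=time.sleep):
    """429 backs off exponentially, 500/503 gets one retry after 1 s,
    and 400 (context length) truncates the prompt once via `shorten`
    before the single retry -- never a blind resend of the same input."""
    backoffs = 0
    server_retry_used = False
    truncate_used = False
    while True:
        try:
            return call(prompt)
        except LLMError as e:
            if e.status == 429 and backoffs < 4:
                sleep(min(2 ** backoffs, 30))  # exponential backoff, capped
                backoffs += 1
            elif e.status in (500, 503) and not server_retry_used:
                sleep(1)
                server_retry_used = True
            elif e.status == 400 and not truncate_used:
                prompt = shorten(prompt)       # shorter input, one retry
                truncate_used = True
            else:
                raise
```

Injecting `sleep` keeps the policy testable without real waits.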

4. Circuit breaker prevents cascade failures

If the LLM provider starts timing out, you don't want every request in your system waiting 5 seconds to fail. Implement a circuit breaker that trips after 5 consecutive failures and returns the fallback immediately for 30 seconds before trying again.
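The trip-after-5, cool-down-30-seconds behavior fits in a few lines. This is an in-memory single-process sketch; a real deployment would share breaker state across workers:

```python
import time

class CircuitBreaker:
    """Trips after `threshold` consecutive failures; while open, calls
    return `fallback` immediately until `cooldown` seconds have passed."""
    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return fallback          # open: fail fast, no waiting
            self.opened_at = None        # cooldown elapsed: probe again
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0                # success resets the streak
        return result
```

The injected `clock` makes the 30-second window easy to unit-test.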

5. Graceful degradation is user-visible

When AI is degraded, tell the user. "AI features are temporarily limited" is better than silently returning bad results. If you're falling back to a simpler model, the output quality may change — users should know.

Observability (Items 6-10)

6. Latency is tracked per-endpoint (p50, p95, p99)

Average latency lies. Track percentiles. Your p50 might be 800ms (fine), but your p99 might be 12 seconds (not fine). Track latency for each AI-powered endpoint separately.
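If you're not ready to wire up a metrics backend, percentiles from raw samples are one stdlib call away (a sketch; in production these numbers come from your observability stack, not in-process lists):

```python
import statistics

def latency_percentiles(samples_ms):
    """p50/p95/p99 from raw per-request latency samples in milliseconds."""
    q = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "p99": q[98]}
```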

7. Token usage is tracked per-request

You need to know input tokens and output tokens for every request. This feeds into cost tracking, budget enforcement, and optimization. Log both the token count and the model used.

8. Error rates are segmented by error type

"2% error rate" tells you nothing. You need: rate limit errors, timeout errors, content filter errors, context length errors, model errors, and parsing errors — each tracked separately. Different errors require different fixes.

9. Cost per request is calculated in real-time

cost = (input_tokens × input_price) + (output_tokens × output_price)

Track this per request, aggregate per user, per tenant, and per feature. Alert if daily cost exceeds 2x the 7-day average.
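The formula above in code. The prices here are placeholders quoted per million tokens; substitute your provider's actual published rates:

```python
# Illustrative (input_price, output_price) per million tokens.
# These numbers are assumptions, not any provider's real pricing.
PRICES = {"small-model": (0.25, 1.25), "large-model": (3.00, 15.00)}

def request_cost(model, input_tokens, output_tokens):
    """cost = input_tokens * input_price + output_tokens * output_price."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

Log the result alongside user, tenant, and feature IDs so the aggregations and the 2x-average alert fall out of your existing metrics pipeline.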

10. Input/output sampling is enabled for debugging

Log a random sample (5-10%) of full inputs and outputs for debugging and quality analysis. Redact PII before logging. Store with a 30-day retention. This is how you debug "the AI gave a weird answer" reports.

Cost Controls (Items 11-14)

11. Per-user token budgets are enforced

Set a daily or monthly token budget per user. When exceeded, either degrade to a cheaper model, show a "limit reached" message, or throttle requests. Without this, a single power user can blow your monthly AI budget.
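A minimal enforcement sketch, in-memory for clarity (production would back this with Redis or your usage store, and reset counters daily):

```python
from collections import defaultdict

class TokenBudget:
    """Daily per-user token budget with a degrade-don't-break policy."""
    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.used = defaultdict(int)

    def record(self, user_id, tokens):
        self.used[user_id] += tokens

    def action_for(self, user_id):
        # Over budget: degrade to a cheaper model, throttle,
        # or show a "limit reached" message -- your call.
        if self.used[user_id] >= self.daily_limit:
            return "degrade"
        return "allow"
```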

12. Per-tenant cost alerts are configured

If you're B2B, track AI cost per tenant. Alert when a tenant's AI usage spikes above 3x their average. This catches both abuse and legitimate usage that needs a pricing conversation.

13. Model routing optimizes cost vs quality

Not every request needs the most expensive model. Route simple tasks (classification, extraction) to cheaper/faster models. Reserve expensive models (Claude Opus, GPT-4o) for complex reasoning. A good routing strategy saves 20-40% on API costs.
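The routing itself can start as a simple task-type lookup (model names below are placeholders, not real model IDs):

```python
def route_model(task_type):
    """Send simple tasks to a cheap/fast model; reserve the expensive
    model for complex reasoning. Task categories are illustrative."""
    cheap_tasks = {"classification", "extraction", "summarization"}
    if task_type in cheap_tasks:
        return "cheap-fast-model"
    return "expensive-reasoning-model"
```

More sophisticated routers classify the request itself, but even this static mapping captures most of the savings.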

14. Response caching is implemented

Cache identical or near-identical requests. For FAQ-style queries, hit rate can reach 30-50%. Even a simple exact-match cache with a 1-hour TTL provides significant savings on high-traffic features.
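An exact-match TTL cache is about fifteen lines (in-memory sketch; production would use Redis or memcached so the cache survives restarts and is shared across workers):

```python
import hashlib
import time

class ResponseCache:
    """Exact-match cache keyed on (model, prompt) with a TTL."""
    def __init__(self, ttl_s=3600.0, clock=time.monotonic):
        self.ttl_s, self.clock = ttl_s, clock
        self.store = {}

    def _key(self, model, prompt):
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model, prompt):
        entry = self.store.get(self._key(model, prompt))
        if entry and self.clock() - entry[0] < self.ttl_s:
            return entry[1]
        return None  # miss or expired

    def put(self, model, prompt, response):
        self.store[self._key(model, prompt)] = (self.clock(), response)
```

Hashing the key keeps arbitrary-length prompts from bloating your key space.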

Security and Privacy (Items 15-18)

15. Input sanitization prevents prompt injection

Users will try to override your system prompt. Implement basic guards: separate system and user messages, validate that the model's output matches expected formats, and never execute raw model output as code or commands.
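The output-validation guard might look like this. The `{"label", "confidence"}` schema is an illustrative assumption; the point is that anything failing validation is rejected, never trusted or executed:

```python
import json

def parse_structured_output(raw):
    """Accept the model's output only if it matches the expected JSON
    shape; return None for anything else (including injection attempts
    that produced free-form text instead of the schema)."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict) or set(data) != {"label", "confidence"}:
        return None
    if not isinstance(data["label"], str):
        return None
    if not isinstance(data["confidence"], (int, float)):
        return None
    return data
```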

16. PII handling follows your data policy

Know exactly what user data is sent to the LLM provider. If you're sending names, emails, or account data, ensure it's covered by your privacy policy and the provider's DPA. Consider anonymizing identifiers before sending.

17. Output filtering catches harmful content

Even with good prompts, models occasionally generate inappropriate content. Implement output filtering for your specific context — a B2B SaaS product should flag profanity, a healthcare product should flag medical advice.

18. Access controls are applied to AI features

AI features should respect your existing permission model. If a user doesn't have access to certain data, the AI shouldn't be able to surface that data in responses. This is especially critical in multi-tenant systems.

User Experience (Items 19-21)

19. Loading states are implemented for AI responses

Never leave the user staring at a blank screen. Show a typing indicator, a progress message, or stream tokens as they arrive. For operations longer than 2 seconds, show estimated wait time.

20. User feedback mechanism is available

Add thumbs up/down, or a simple "Was this helpful?" for every AI-generated response. This is your cheapest quality signal. Track feedback rate and sentiment over time. Alert if negative feedback spikes.

21. AI attribution is clear

Users should know when content is AI-generated. This isn't just ethical — it sets expectations. Users evaluate AI-generated content differently from human-written content, and that's appropriate.

Deployment (Items 22-23)

22. Feature flag controls AI rollout

Never ship AI to 100% of users on day one. Use a feature flag for percentage-based rollout. Start at 5%, monitor for 24 hours, then ramp to 25%, 50%, 100%. Have a kill switch that disables the feature instantly.
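If you don't already have a feature-flag service, deterministic percentage bucketing is the core mechanism (the `"ai-assist"` feature name is a placeholder):

```python
import hashlib

def in_rollout(user_id, percentage, feature="ai-assist"):
    """Hash the user into one of 100 stable buckets. The same user stays
    in (or out) as you ramp 5 -> 25 -> 50 -> 100, and setting the
    percentage to 0 acts as the kill switch."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage
```

Because buckets are stable, ramping up only ever adds users; it never flips someone who already had the feature back off.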

23. Rollback procedure is documented and tested

If the AI feature causes issues in production, can you roll it back in under 5 minutes? Document the procedure. Test it in staging. Make sure the on-call team knows how to execute it.

Using This Checklist

Print this. Go through it for every AI feature before launch. If you can't check an item, either fix it or make a conscious decision to accept the risk. Most items take less than a day to implement. The cost of skipping them is measured in incidents, not hours.

The items most commonly skipped (and most commonly regretted): #4 (circuit breaker), #11 (per-user budgets), #13 (model routing), and #22 (feature flags). Don't skip these.

About the author

Written by Rafael Danieli, founder of StoAI. Systems engineer specializing in production AI for SaaS companies. Background in distributed systems, reliability engineering, and integration architecture.