LLM Cost Control Checklist
Free checklist from our LLM cost control guide
A step-by-step checklist for implementing rate limiting, token budgets, caching, and cost visibility for your LLM API usage.
Instructions
Complete this checklist when implementing cost controls for LLM API usage. Work through each section in order. Every item should be addressed before considering your cost control framework production-ready. If any section has unfinished items, document the gap and assign an owner.
1. API Key Management
- API keys issued per tenant (not shared across users or teams)
- Keys stored as cryptographic hashes (SHA-256 or stronger), never in plaintext
- Key format uses a distinctive prefix for identification in logs (e.g., lbp_)
- Key revocation process defined and tested
- No spoofable headers (e.g., X-User-Id) used for identity; the API key itself is the identity
- Admin key separated from tenant keys with distinct permissions
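The hashing and prefix items above can be sketched as follows. This is a minimal stdlib-only illustration (function names are ours, not from any particular library), using the example `lbp_` prefix from the checklist:

```python
import hashlib
import secrets

def issue_key(prefix="lbp_"):
    """Generate a tenant API key and the SHA-256 hash to store.

    Only the hash is persisted; the plaintext key is shown to the
    tenant once. The prefix makes keys recognizable in logs without
    revealing the secret portion.
    """
    plaintext = prefix + secrets.token_urlsafe(32)
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, digest

def verify_key(presented, stored_hash):
    # Hash the presented key and compare in constant time.
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(candidate, stored_hash)
```

Because only the hash is stored, a database leak does not expose usable keys; revocation is simply deleting the stored hash.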
2. Rate Limiting
- RPM (requests per minute) limits configured per key
- TPM (tokens per minute) limits configured per key
- Rate limit algorithm chosen (sliding window recommended for LLM traffic)
- 429 responses include Retry-After header
- X-RateLimit-* headers returned on every response for client visibility
- Rate limit overrides available for specific keys or patterns
- Provider-side rate limits documented and understood
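A sliding-window limiter covering both RPM and TPM can be sketched in a few lines. This is an in-memory, single-instance sketch (class and method names are illustrative); a shared store is needed once you run multiple proxy instances:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Per-key sliding-window limiter for requests and tokens per minute."""

    def __init__(self, rpm, tpm, window=60.0):
        self.rpm, self.tpm, self.window = rpm, tpm, window
        self.events = deque()  # (timestamp, tokens) pairs inside the window

    def allow(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        # Drop events that have slid out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()
        if len(self.events) + 1 > self.rpm:
            return False  # RPM exceeded -> 429 with Retry-After
        if sum(t for _, t in self.events) + tokens > self.tpm:
            return False  # TPM exceeded
        self.events.append((now, tokens))
        return True
```

On a `False` result the proxy should return 429 with `Retry-After` and the `X-RateLimit-*` headers from the items above.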
3. Budget Configuration
- Daily budgets set per key or team
- Monthly budgets set per key or team
- Soft limit thresholds defined (e.g., warn at 80%)
- Hard limit action defined (block at 100% or configurable ceiling)
- Budget reset schedule documented (daily at midnight UTC, monthly on first)
- Budget status visible via API and dashboard
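The soft/hard threshold logic above reduces to a small classification step. A minimal sketch, with the 80%/100% defaults from the checklist (names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class BudgetStatus:
    action: str       # "allow", "warn", or "block"
    spent_pct: float

def check_budget(spent_usd, budget_usd, soft_pct=0.80, hard_pct=1.00):
    """Classify spend against a daily or monthly budget.

    Warn at the soft limit, block at the hard ceiling; both
    thresholds are configurable per key or team.
    """
    pct = spent_usd / budget_usd if budget_usd else 1.0
    if pct >= hard_pct:
        return BudgetStatus("block", pct)
    if pct >= soft_pct:
        return BudgetStatus("warn", pct)
    return BudgetStatus("allow", pct)
```

The "warn" result is what feeds the alerting webhook in section 9; "block" is the hard-limit action.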
4. Cost Estimation
- Token counting library integrated (e.g., tiktoken for OpenAI models)
- Pricing manifest versioned and stored as a config file (not hardcoded)
- Pre-request cost estimation uses worst-case output ceiling (max_tokens × output price)
- Cost estimation documented as estimated, not exact
- Actual cost recorded from response token counts after completion
- Cost recorded per request in usage database
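The worst-case pre-request estimate combines prompt tokens at the input price with `max_tokens` at the output price. A sketch with a placeholder pricing manifest and a crude chars/4 token heuristic (swap in tiktoken for real counts; the prices below are illustrative, not current provider pricing):

```python
# In production this would be loaded from a versioned config file.
PRICING = {
    "gpt-4o": {"input_per_1k": 0.005, "output_per_1k": 0.015},
}

def rough_token_count(text):
    # Crude ~4-chars-per-token heuristic; use tiktoken for accuracy.
    return max(1, len(text) // 4)

def estimate_worst_case_usd(model, prompt, max_tokens):
    """Pre-request ceiling: prompt tokens at the input price plus
    max_tokens at the output price. An estimate, never exact."""
    p = PRICING[model]
    input_cost = rough_token_count(prompt) / 1000 * p["input_per_1k"]
    output_cost = max_tokens / 1000 * p["output_per_1k"]
    return input_cost + output_cost
```

The estimate gates the request against the budget; the actual cost recorded afterward comes from the response's token counts.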
5. Caching
- Exact-match cache implemented (request body hash → cached response)
- Request body hashing uses deterministic key ordering
- TTL configured per use case (shorter for dynamic, longer for stable prompts)
- Cache eviction policy defined (TTL expiry + max entries)
- Cache bypass mechanism available (e.g., header or query parameter)
- Cached requests tracked in usage data with zero cost
- X-Cache header returned (HIT or MISS)
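The deterministic-hash and TTL items can be sketched together. The key detail is `sort_keys=True`, which makes the hash independent of JSON key order (class and method names are ours):

```python
import hashlib
import json
import time

class ExactMatchCache:
    """Exact-match response cache keyed on a deterministic body hash."""

    def __init__(self, ttl=300.0, max_entries=10_000):
        self.ttl, self.max_entries = ttl, max_entries
        self._store = {}  # hash -> (stored_at, response)

    @staticmethod
    def key(body):
        # Canonical JSON: sorted keys, no whitespace -> stable hash.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, body, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(self.key(body))
        if entry and now - entry[0] < self.ttl:
            return entry[1]          # X-Cache: HIT, recorded at zero cost
        return None                  # X-Cache: MISS

    def put(self, body, response, now=None):
        now = time.monotonic() if now is None else now
        if len(self._store) >= self.max_entries:
            # Simplest eviction: drop the oldest entry.
            oldest = min(self._store, key=lambda k: self._store[k][0])
            del self._store[oldest]
        self._store[self.key(body)] = (now, response)
```

A bypass header (e.g., `X-Cache-Bypass: true`) would simply skip the `get` call.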
6. Streaming Support
- SSE (Server-Sent Events) chunks forwarded correctly to client
- stream_options.include_usage injected to get actual token counts
- End-of-stream token counts captured from final SSE chunk
- Partial failure accounting implemented (tokens consumed before an error are still recorded)
- Client cancellation aborts upstream request via AbortController
- Streaming requests recorded in usage data after completion
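With `stream_options.include_usage` set, OpenAI-style streams deliver a usage object in the final data chunk before `[DONE]` (earlier chunks carry `usage: null`). A sketch of extracting it from the raw SSE lines:

```python
import json

def extract_usage(sse_lines):
    """Pull token counts from an OpenAI-style SSE stream.

    Returns the usage dict from the final chunk, or None if the
    stream ended before usage arrived (partial-failure case).
    """
    usage = None
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alives, etc.
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage
```

A `None` result is the partial-failure case from the checklist: the stream died early, so cost must be estimated from the tokens already forwarded.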
7. Model Downgrade
- Model downgrade disabled by default (opt-in only)
- Downgrade rules config-driven with explicit from/to mapping
- Downgrade triggered only at configurable budget threshold
- X-Model-Downgraded header returned when downgrade occurs
- X-Original-Model header shows what model was requested
- Downgraded requests tracked in usage data
- Capability differences between models documented for users
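A config-driven downgrade rule can be expressed as an explicit from/to map plus a threshold, disabled by default. The rule shape and model names below are illustrative assumptions, not a required schema:

```python
# Illustrative config shape; field names and models are assumptions.
DOWNGRADE_RULES = {
    "enabled": False,                   # opt-in only, per the checklist
    "threshold_pct": 0.90,              # budget fraction that triggers it
    "map": {"gpt-4o": "gpt-4o-mini"},   # explicit from -> to mapping
}

def maybe_downgrade(model, budget_used_pct, rules=DOWNGRADE_RULES):
    """Return (model_to_use, extra_headers) honoring the opt-in rules."""
    if (rules["enabled"]
            and budget_used_pct >= rules["threshold_pct"]
            and model in rules["map"]):
        target = rules["map"][model]
        return target, {"X-Model-Downgraded": "true",
                        "X-Original-Model": model}
    return model, {}
```

Returning the headers alongside the model keeps the substitution visible to clients, which is the point of the `X-Model-Downgraded` and `X-Original-Model` items above.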
8. Dashboard and Visibility
- Cost by key or team visible in dashboard
- Cost over time trend visible (hourly and daily granularity)
- Budget remaining visible per key
- Recent requests queryable with model, tokens, cost, and flags
- Dashboard access protected by admin authentication
- Usage data exportable via API
9. Alerting
- Webhook URL configured for budget notifications
- Budget warning fires at configurable threshold (e.g., 80%)
- Budget exceeded fires at 100%
- Alert debounce in place (same event + key fires at most once per hour)
- Webhook payload includes key name, team, budget details, and timestamp
- Webhook failures logged but do not block proxy requests
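The once-per-hour debounce rule is a small piece of state keyed on (event, key). A minimal in-memory sketch (names are illustrative):

```python
import time

class AlertDebouncer:
    """Suppress duplicates: same (event, key) fires at most once per window."""

    def __init__(self, window=3600.0):
        self.window = window
        self._last_sent = {}  # (event, key) -> timestamp of last send

    def should_send(self, event, key, now=None):
        now = time.monotonic() if now is None else now
        last = self._last_sent.get((event, key))
        if last is not None and now - last < self.window:
            return False  # already alerted within the window
        self._last_sent[(event, key)] = now
        return True
```

The webhook call itself should be wrapped in a try/except that logs failures without blocking the proxy request, per the last item above.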
10. Deployment and Operations
- Single-instance SQLite limitation documented
- Database persistence configured (volume mount for Docker)
- Health check endpoint available (/health)
- Graceful shutdown drains in-flight requests
- Config validation runs at startup (fail fast with clear errors)
- Upgrade path to Redis/Postgres documented for multi-instance
- Pricing manifest update process documented
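Fail-fast config validation works best when it collects every error before exiting, so the operator fixes them in one pass. A sketch under an assumed config shape (the field names here are examples, not a required schema):

```python
def validate_config(cfg):
    """Collect all startup config errors instead of stopping at the first."""
    errors = []
    for field in ("db_path", "pricing_manifest", "keys"):
        if field not in cfg:
            errors.append(f"missing required field: {field}")
    for name, key_cfg in cfg.get("keys", {}).items():
        if key_cfg.get("rpm", 1) <= 0:
            errors.append(f"key {name}: rpm must be positive")
        if key_cfg.get("daily_budget_usd", 1) <= 0:
            errors.append(f"key {name}: daily_budget_usd must be positive")
    return errors

def load_or_die(cfg):
    errs = validate_config(cfg)
    if errs:
        # Fail fast with every problem listed, not just the first.
        raise SystemExit("config errors:\n  " + "\n  ".join(errs))
    return cfg
```

Running this before the server binds its port means a bad deploy dies immediately with a clear message instead of failing on the first request.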
Review Decision
Decision:
Gaps remaining:
Reviewer:
Date:
Next review:
Found this useful? Read the full article:
Read: LLM API Rate Limiting and Cost Control →