LLM Cost Control Checklist
Free checklist from our LLM cost control guide
A step-by-step checklist for implementing rate limiting, token budgets, caching, and cost visibility for your LLM API usage.
Instructions
Complete this checklist when implementing cost controls for LLM API usage. Work through each section in order. Every item should be addressed before considering your cost control framework production-ready. If any section has unfinished items, document the gap and assign an owner.
1. API Key Management
- API keys issued per tenant (not shared across users or teams)
- Keys stored as cryptographic hashes (SHA-256 or stronger), never in plaintext
- Key format uses a distinctive prefix for identification in logs (e.g., lbp_)
- Key revocation process defined and tested
- No spoofable headers (e.g., X-User-Id) used for identity; the API key itself is the identity
- Admin key separated from tenant keys with distinct permissions
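The hashing and prefix items above can be sketched as follows. This is a minimal stdlib-only illustration (function names are ours, not from any particular library), using the example `lbp_` prefix from the checklist:

```python
import hashlib
import secrets

def issue_key(prefix="lbp_"):
    """Generate a tenant API key and the SHA-256 hash to store.

    Only the hash is persisted; the plaintext key is shown to the
    tenant once. The prefix makes keys recognizable in logs without
    revealing the secret portion.
    """
    plaintext = prefix + secrets.token_urlsafe(32)
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, digest

def verify_key(presented, stored_hash):
    # Hash the presented key and compare in constant time.
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(candidate, stored_hash)
```

Because only the hash is stored, a database leak does not expose usable keys; revocation is simply deleting the stored hash.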
2. Rate Limiting
- RPM (requests per minute) limits configured per key
- TPM (tokens per minute) limits configured per key
- Rate limit algorithm chosen (sliding window recommended for LLM traffic)
- 429 responses include Retry-After header
- X-RateLimit-* headers returned on every response for client visibility
- Rate limit overrides available for specific keys or patterns
- Provider-side rate limits documented and understood
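A sliding-window limiter covering both RPM and TPM can be sketched in a few lines. This is an in-memory, single-instance sketch (class and method names are illustrative); a shared store is needed once you run multiple proxy instances:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Per-key sliding-window limiter for requests and tokens per minute."""

    def __init__(self, rpm, tpm, window=60.0):
        self.rpm, self.tpm, self.window = rpm, tpm, window
        self.events = deque()  # (timestamp, tokens) pairs inside the window

    def allow(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        # Drop events that have slid out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()
        if len(self.events) + 1 > self.rpm:
            return False  # RPM exceeded -> 429 with Retry-After
        if sum(t for _, t in self.events) + tokens > self.tpm:
            return False  # TPM exceeded
        self.events.append((now, tokens))
        return True
```

On a `False` result the proxy should return 429 with `Retry-After` and the `X-RateLimit-*` headers from the items above.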
3. Budget Configuration
- Daily budgets set per key or team
- Monthly budgets set per key or team
- Soft limit thresholds defined (e.g., warn at 80%)
- Hard limit action defined (block at 100% or configurable ceiling)
- Budget reset schedule documented (daily at midnight UTC, monthly on first)
- Budget status visible via API and dashboard
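The soft/hard threshold logic above reduces to a small classification step. A minimal sketch, with the 80%/100% defaults from the checklist (names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class BudgetStatus:
    action: str       # "allow", "warn", or "block"
    spent_pct: float

def check_budget(spent_usd, budget_usd, soft_pct=0.80, hard_pct=1.00):
    """Classify spend against a daily or monthly budget.

    Warn at the soft limit, block at the hard ceiling; both
    thresholds are configurable per key or team.
    """
    pct = spent_usd / budget_usd if budget_usd else 1.0
    if pct >= hard_pct:
        return BudgetStatus("block", pct)
    if pct >= soft_pct:
        return BudgetStatus("warn", pct)
    return BudgetStatus("allow", pct)
```

The "warn" result is what feeds the alerting webhook in section 9; "block" is the hard-limit action.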
4. Cost Estimation
- Token counting library integrated (e.g., tiktoken for OpenAI models)
- Pricing manifest versioned and stored as a config file (not hardcoded)
- Pre-request cost estimation uses worst-case output ceiling (max_tokens × output price)
- Cost estimation documented as estimated, not exact
- Actual cost recorded from response token counts after completion
- Cost recorded per request in usage database
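The worst-case pre-request estimate combines prompt tokens at the input price with `max_tokens` at the output price. A sketch with a placeholder pricing manifest and a crude chars/4 token heuristic (swap in tiktoken for real counts; the prices below are illustrative, not current provider pricing):

```python
# In production this would be loaded from a versioned config file.
PRICING = {
    "gpt-4o": {"input_per_1k": 0.005, "output_per_1k": 0.015},
}

def rough_token_count(text):
    # Crude ~4-chars-per-token heuristic; use tiktoken for accuracy.
    return max(1, len(text) // 4)

def estimate_worst_case_usd(model, prompt, max_tokens):
    """Pre-request ceiling: prompt tokens at the input price plus
    max_tokens at the output price. An estimate, never exact."""
    p = PRICING[model]
    input_cost = rough_token_count(prompt) / 1000 * p["input_per_1k"]
    output_cost = max_tokens / 1000 * p["output_per_1k"]
    return input_cost + output_cost
```

The estimate gates the request against the budget; the actual cost recorded afterward comes from the response's token counts.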
5. Caching
- Exact-match cache implemented (request body hash → cached response)
- Request body hashing uses deterministic key ordering
- TTL configured per use case (shorter for dynamic, longer for stable prompts)
- Cache eviction policy defined (TTL expiry + max entries)
- Cache bypass mechanism available (e.g., header or query parameter)
- Cached requests tracked in usage data with zero cost
- X-Cache header returned (HIT or MISS)
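The deterministic-hash and TTL items can be sketched together. The key detail is `sort_keys=True`, which makes the hash independent of JSON key order (class and method names are ours):

```python
import hashlib
import json
import time

class ExactMatchCache:
    """Exact-match response cache keyed on a deterministic body hash."""

    def __init__(self, ttl=300.0, max_entries=10_000):
        self.ttl, self.max_entries = ttl, max_entries
        self._store = {}  # hash -> (stored_at, response)

    @staticmethod
    def key(body):
        # Canonical JSON: sorted keys, no whitespace -> stable hash.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, body, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(self.key(body))
        if entry and now - entry[0] < self.ttl:
            return entry[1]          # X-Cache: HIT, recorded at zero cost
        return None                  # X-Cache: MISS

    def put(self, body, response, now=None):
        now = time.monotonic() if now is None else now
        if len(self._store) >= self.max_entries:
            # Simplest eviction: drop the oldest entry.
            oldest = min(self._store, key=lambda k: self._store[k][0])
            del self._store[oldest]
        self._store[self.key(body)] = (now, response)
```

A bypass header (e.g., `X-Cache-Bypass: true`) would simply skip the `get` call.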
6. Streaming Support
- SSE (Server-Sent Events) chunks forwarded correctly to client
- stream_options.include_usage injected to get actual token counts
- End-of-stream token counts captured from final SSE chunk
- Partial failure accounting implemented (tokens consumed before an error are still recorded)
- Client cancellation aborts upstream request via AbortController
- Streaming requests recorded in usage data after completion
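With `stream_options.include_usage` set, OpenAI-style streams deliver a usage object in the final data chunk before `[DONE]` (earlier chunks carry `usage: null`). A sketch of extracting it from the raw SSE lines:

```python
import json

def extract_usage(sse_lines):
    """Pull token counts from an OpenAI-style SSE stream.

    Returns the usage dict from the final chunk, or None if the
    stream ended before usage arrived (partial-failure case).
    """
    usage = None
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alives, etc.
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage
```

A `None` result is the partial-failure case from the checklist: the stream died early, so cost must be estimated from the tokens already forwarded.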
7. Model Downgrade
- Model downgrade disabled by default (opt-in only)
- Downgrade rules config-driven with explicit from/to mapping
- Downgrade triggered only at configurable budget threshold
- X-Model-Downgraded header returned when downgrade occurs
- X-Original-Model header shows what model was requested
- Downgraded requests tracked in usage data
- Capability differences between models documented for users
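A config-driven downgrade rule can be expressed as an explicit from/to map plus a threshold, disabled by default. The rule shape and model names below are illustrative assumptions, not a required schema:

```python
# Illustrative config shape; field names and models are assumptions.
DOWNGRADE_RULES = {
    "enabled": False,                   # opt-in only, per the checklist
    "threshold_pct": 0.90,              # budget fraction that triggers it
    "map": {"gpt-4o": "gpt-4o-mini"},   # explicit from -> to mapping
}

def maybe_downgrade(model, budget_used_pct, rules=DOWNGRADE_RULES):
    """Return (model_to_use, extra_headers) honoring the opt-in rules."""
    if (rules["enabled"]
            and budget_used_pct >= rules["threshold_pct"]
            and model in rules["map"]):
        target = rules["map"][model]
        return target, {"X-Model-Downgraded": "true",
                        "X-Original-Model": model}
    return model, {}
```

Returning the headers alongside the model keeps the substitution visible to clients, which is the point of the `X-Model-Downgraded` and `X-Original-Model` items above.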
8. Dashboard and Visibility
- Cost by key or team visible in dashboard
- Cost over time trend visible (hourly and daily granularity)
- Budget remaining visible per key
- Recent requests queryable with model, tokens, cost, and flags
- Dashboard access protected by admin authentication
- Usage data exportable via API
9. Alerting
- Webhook URL configured for budget notifications
- Budget warning fires at configurable threshold (e.g., 80%)
- Budget exceeded fires at 100%
- Alert debounce in place (same event + key fires at most once per hour)
- Webhook payload includes key name, team, budget details, and timestamp
- Webhook failures logged but do not block proxy requests
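The once-per-hour debounce rule is a small piece of state keyed on (event, key). A minimal in-memory sketch (names are illustrative):

```python
import time

class AlertDebouncer:
    """Suppress duplicates: same (event, key) fires at most once per window."""

    def __init__(self, window=3600.0):
        self.window = window
        self._last_sent = {}  # (event, key) -> timestamp of last send

    def should_send(self, event, key, now=None):
        now = time.monotonic() if now is None else now
        last = self._last_sent.get((event, key))
        if last is not None and now - last < self.window:
            return False  # already alerted within the window
        self._last_sent[(event, key)] = now
        return True
```

The webhook call itself should be wrapped in a try/except that logs failures without blocking the proxy request, per the last item above.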
10. Deployment and Operations
- Single-instance SQLite limitation documented
- Database persistence configured (volume mount for Docker)
- Health check endpoint available (/health)
- Graceful shutdown drains in-flight requests
- Config validation runs at startup (fail fast with clear errors)
- Upgrade path to Redis/Postgres documented for multi-instance
- Pricing manifest update process documented
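Fail-fast config validation works best when it collects every error before exiting, so the operator fixes them in one pass. A sketch under an assumed config shape (the field names here are examples, not a required schema):

```python
def validate_config(cfg):
    """Collect all startup config errors instead of stopping at the first."""
    errors = []
    for field in ("db_path", "pricing_manifest", "keys"):
        if field not in cfg:
            errors.append(f"missing required field: {field}")
    for name, key_cfg in cfg.get("keys", {}).items():
        if key_cfg.get("rpm", 1) <= 0:
            errors.append(f"key {name}: rpm must be positive")
        if key_cfg.get("daily_budget_usd", 1) <= 0:
            errors.append(f"key {name}: daily_budget_usd must be positive")
    return errors

def load_or_die(cfg):
    errs = validate_config(cfg)
    if errs:
        # Fail fast with every problem listed, not just the first.
        raise SystemExit("config errors:\n  " + "\n  ".join(errs))
    return cfg
```

Running this before the server binds its port means a bad deploy dies immediately with a clear message instead of failing on the first request.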
Review Decision
Decision:
Gaps remaining:
Reviewer:
Date:
Next review:
Found this useful? Read the full article:
Read: LLM API Rate Limiting and Cost Control →