LLM API Rate Limiting and Cost Control: Manage Token Budgets, Per-Key Throttling, and Cost Dashboards
LLM API costs behave differently from almost every other API cost your team has managed before. Traditional API calls have roughly predictable per-request costs. A database query, a storage operation, or a webhook delivery costs more or less the same regardless of what the user sends. LLM APIs break that assumption completely because cost scales with both input and output token count, and those counts can vary by orders of magnitude between requests.
A single agentic workflow can burn through fifty dollars or more in tokens in minutes. When an agent loops, retries on ambiguous results, or fans out across multiple tool calls, the token consumption compounds in ways that are hard to predict and harder to stop after the fact. A developer testing a new prompt against a long context window can unknowingly spend more in one session than their team budgeted for the entire day. These are not edge cases. They are normal operating conditions for teams building with large language models.
The principle that matters most here is simple: you cannot control what you do not measure. If your team is calling OpenAI, Anthropic, or any model provider without per-key tracking, per-request cost estimation, and budget enforcement, you are flying blind. The bill arrives at the end of the month and the conversation shifts from engineering to accounting.
Rate limiting is also not just a cost control. It is a security control. A compromised API key without rate limits gives an attacker unlimited access to an expensive resource. Per-key throttling limits the blast radius of a leaked credential. Per-key budgets cap the financial damage. These are the same defense-in-depth principles that apply to API security for AI apps and SaaS integrations, applied specifically to LLM consumption.
Anatomy of LLM API costs
Understanding LLM pricing requires understanding how token-based billing works, because it is fundamentally different from request-based billing.
Every LLM API call has three cost components: input tokens, output tokens, and in some cases cached input tokens. Input tokens are what you send to the model, including the system prompt, conversation history, and any retrieved context. Output tokens are what the model generates in response. Cached input tokens apply when the provider recognizes repeated prefixes and charges a discounted rate for them.
The pricing differences between models are dramatic. Using approximate figures from a recent pricing snapshot: gpt-4o costs about $2.50 per million input tokens and $10.00 per million output tokens. gpt-4o-mini drops to $0.15 per million input tokens and $0.60 per million output tokens. gpt-3.5-turbo sits at $0.50 and $1.50 respectively. The reasoning models like o1 are significantly more expensive at $15.00 per million input and $60.00 per million output.
These prices change. That is why a production cost control system should treat pricing as a versioned manifest rather than hardcoded constants. When OpenAI adjusts pricing, you update a configuration file, not application code.
The math is instructive. At gpt-4o pricing, a budget of $1,000 per month buys roughly 400 million input tokens or 100 million output tokens. That sounds like a lot until you consider that a single agentic workflow with a 128k context window can consume 100k+ tokens per turn. At gpt-4o-mini pricing, that same $1,000 buys roughly 6.7 billion input tokens, which is why model selection is the single largest lever for cost optimization.
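That arithmetic is worth encoding once rather than redoing on a whiteboard. A minimal sketch in TypeScript, using the snapshot prices quoted above (treat them as illustrative; real deployments should load them from a versioned manifest):

```typescript
// Snapshot prices in USD per million tokens -- illustrative only; prices drift.
const PRICING: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

// Cost of a single call, given actual token counts.
function costUsd(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICING[model];
  if (!p) throw new Error(`no pricing entry for ${model}`);
  return (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
}

// How many input tokens a budget buys if spent on input alone.
function inputTokensForBudget(model: string, budgetUsd: number): number {
  return (budgetUsd / PRICING[model].input) * 1e6;
}
```

A single 100k-input / 4k-output gpt-4o call comes out to about $0.29, which is exactly why per-request ceilings matter as much as monthly totals.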
The hidden costs are often more dangerous than the visible ones. Retries on transient errors double or triple the token spend for a single logical request. Long context windows mean every message in a conversation carries the full history forward. Agent loops that retry until they get a satisfactory answer can run up unbounded costs. Embedding calls for RAG pipelines add a separate, quieter cost stream that accumulates over time.
Three pillars of LLM cost control
Effective LLM cost management rests on three pillars: rate limiting, budget enforcement, and cost visibility. Each solves a different part of the problem, and you need all three.
Pillar 1: Rate limiting
Rate limiting for LLM APIs needs to work on two dimensions simultaneously. Requests per minute (RPM) prevents burst abuse and protects against runaway automation that fires too many calls too quickly. Tokens per minute (TPM) prevents expensive queries from monopolizing capacity even at low request rates. A single request with a 100k-token context window and a 4k-token response can cost more than a thousand small requests.
The choice of rate limiting algorithm matters. Fixed window rate limiting is simple but suffers from the burst-at-boundary problem: a client can send double the intended rate by timing requests at the end of one window and the start of the next. Sliding window rate limiting avoids this by smoothing the count across time. Token bucket algorithms allow controlled bursts while maintaining a long-term average. For LLM traffic, sliding window is generally the best choice because it provides predictable, smooth throttling without the boundary exploitation that fixed windows allow.
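To make the two dimensions concrete, here is a minimal sliding-log limiter that enforces RPM and TPM together. This is a sketch, not production code: a real proxy would persist per-key state in its backing store rather than in memory.

```typescript
// Sliding-log limiter tracking both request count and token volume per window.
type Entry = { at: number; tokens: number };

class SlidingWindowLimiter {
  private entries: Entry[] = [];
  constructor(
    private maxRequests: number, // RPM cap
    private maxTokens: number,   // TPM cap
    private windowMs: number = 60_000,
  ) {}

  // Returns true and records the request only if both dimensions allow it.
  allow(tokens: number, now: number = Date.now()): boolean {
    // Drop entries that have aged out of the window.
    this.entries = this.entries.filter((e) => now - e.at < this.windowMs);
    const used = this.entries.reduce((sum, e) => sum + e.tokens, 0);
    if (this.entries.length >= this.maxRequests) return false;
    if (used + tokens > this.maxTokens) return false;
    this.entries.push({ at: now, tokens });
    return true;
  }
}
```

Note how a single huge request can be rejected by the token dimension while the request count is nowhere near its limit, which is the failure mode pure RPM limiting misses.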
Provider-side rate limits and application-side rate limits serve different purposes. OpenAI imposes its own rate limits based on your account tier, but those limits apply to your entire organization. Application-side rate limiting lets you subdivide that capacity across teams, projects, and individual API keys. You need both layers.
Pillar 2: Budget enforcement
Budget enforcement translates rate limits into financial controls. Per-key daily and monthly token budgets give each consumer a clear allocation and a hard boundary.
The most useful pattern is tiered enforcement rather than a single hard cutoff. Warn at 80% of budget consumed so the consumer can adjust their usage. At 95%, optionally downgrade the model from a more expensive tier to a cheaper one. At 100%, block further requests entirely. This gives consumers time to react before they hit a wall.
Model downgrade deserves careful handling. Automatically switching a request from gpt-4o to gpt-4o-mini can save significant money, but the capability differences between models are real. A task that requires strong reasoning or precise instruction following may fail or produce lower quality results on a smaller model. That is why model downgrade should never be a default behavior. It must be explicitly opted into per key, configured in a policy file, and clearly signaled to the client via response headers so the application can handle the capability change appropriately.
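A sketch of the tiered decision, with downgrade strictly opt-in per key. The threshold values, the `allowDowngrade` flag, and the fallback model name are illustrative policy knobs, not a fixed API:

```typescript
// Tiered budget check: warn at 80%, optional downgrade at 95%, block at 100%.
type BudgetDecision =
  | { action: "allow"; warn: boolean }
  | { action: "downgrade"; toModel: string }
  | { action: "block" };

function checkBudget(
  spentUsd: number,
  budgetUsd: number,
  allowDowngrade: boolean,          // must be explicitly enabled per key
  fallbackModel = "gpt-4o-mini",
): BudgetDecision {
  const used = spentUsd / budgetUsd;
  if (used >= 1.0) return { action: "block" };
  if (used >= 0.95 && allowDowngrade) {
    return { action: "downgrade", toModel: fallbackModel };
  }
  return { action: "allow", warn: used >= 0.8 };
}
```

When the decision is `downgrade`, the proxy should also attach a response header (something like a hypothetical `x-budget-downgraded: gpt-4o-mini`) so the client can detect the capability change rather than silently receiving weaker output.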
Pillar 3: Cost visibility
You cannot manage costs without visibility into where the money is going. Real-time usage tracking by key, team, and model is the foundation.
Cost estimation before a request is sent is valuable but inherently imprecise. You can estimate input cost by counting tokens with a tokenizer like tiktoken, and you can calculate a worst-case output ceiling based on the max_tokens parameter. But actual output cost depends on how many tokens the model generates at runtime, which varies per request. The right approach is to use the estimate for pre-flight budget checks and record the actual cost after the response completes.
Anomaly detection adds a layer of protection that static budgets cannot provide. If a key that normally spends $5 per day suddenly spends $50 in an hour, that pattern should trigger an alert regardless of whether the budget has been exceeded. The baseline can be as simple as a rolling average with a multiplier threshold.
For dashboards, three visualizations cover most needs: cost by key as a bar chart for identifying top consumers, cost over time as a line chart for spotting trends and anomalies, and budget remaining as a doughnut chart for at-a-glance status. These are operational dashboards, not vanity metrics. They should be the first thing a platform team checks when something feels off.
Architecture: the reverse proxy pattern
The most effective architecture for LLM cost control is a reverse proxy that sits between your application and the model provider. Every request flows through the proxy, which applies controls before forwarding to the upstream API.
The reverse proxy pattern has several advantages over client-side SDK integration. Central control means one enforcement point instead of updating every client. It works with any client that can make HTTP requests, whether that is a Python script, a TypeScript application, or a curl command. It provides a single point of visibility for all LLM traffic. And as long as the upstream provider keys live only in the proxy, it cannot be bypassed by a developer who forgets to use the SDK or decides to call the provider directly.
The middleware chain in a well-designed proxy follows a clear sequence: authenticate the request using API keys, check rate limits, verify budget availability, check the cache for an exact match, proxy the request to the upstream provider, and record the actual token usage and cost after the response completes.
API key management deserves attention. Keys should use a recognizable prefix like lbp_ so they are easy to identify in logs and configuration files. They should be stored as SHA-256 hashes, not plaintext, and shown to the user exactly once at creation time. This is the same pattern used by GitHub, Stripe, and other platforms with mature key management. Real API keys that are validated server-side are far more secure than spoofable headers or client-asserted identity.
For a single-instance deployment, SQLite is a pragmatic choice for the backing store. It is fast, requires no separate server, and handles the concurrency needs of a single-node proxy well. The upgrade path to Redis or Postgres exists for teams that need multi-instance deployments, but starting with SQLite keeps the operational footprint minimal and the deployment simple.
Exact-match caching
Caching LLM responses can dramatically reduce costs for workloads with repetitive queries. The simplest effective approach is exact-match caching: hash the full request body with sorted keys for deterministic hashing, and return the cached response if a match exists.
This is deliberately not semantic caching. Semantic caching, where you find “similar enough” queries and return a previous response, requires an embedding model, a vector search index, similarity thresholds, and cache invalidation rules that are themselves a significant engineering challenge. It is a separate system with its own failure modes. Exact-match caching is simple, predictable, and correct by construction. If the request is identical, the cached response is valid.
TTL strategies should vary by use case. Conversational queries where the answer might change with new information deserve shorter TTLs, perhaps 5 to 15 minutes. Embedding generation calls that produce deterministic outputs for the same input can use much longer TTLs, potentially hours or even days. System prompts that are identical across requests are excellent cache candidates.
The real savings show up in workloads with natural repetition: embedding generation for document ingestion where the same chunks appear across runs, repeated system prompts in multi-turn conversations, batch classification tasks where many inputs share the same structure, and development workflows where engineers iterate on prompts against the same test inputs. For these patterns, cache hit rates of 30-60% are common, which translates directly into cost savings.
Streaming and accounting
Most production LLM integrations use streaming responses via Server-Sent Events (SSE). Streaming improves perceived latency because the client sees tokens as they are generated rather than waiting for the complete response. But streaming complicates cost accounting significantly.
The key technique is injecting stream_options.include_usage = true into the request body before forwarding to OpenAI. This tells the API to include actual token counts in the final SSE chunk. Without this flag, streaming responses do not report usage, and the proxy would have to estimate token counts by parsing the streamed content, which is error-prone.
// Inject usage reporting into streaming requests
if (!body.stream_options || typeof body.stream_options !== "object") {
  body.stream_options = {};
}
(body.stream_options as Record<string, unknown>).include_usage = true;
Partial failure handling is where streaming gets genuinely difficult. If the upstream connection drops mid-stream, the proxy needs to account for the tokens that were consumed up to the failure point. The final usage chunk never arrives in this case, so the proxy must fall back to counting the tokens in the chunks it did receive. This is an estimate, but it is better than recording zero usage for a request that actually consumed tokens and cost money.
Client cancellation is the mirror problem. If the client disconnects before the response is complete, the proxy should abort the upstream request to stop further token generation and account for partial usage. Simply ignoring the cancellation wastes tokens on a response nobody will read.
This is harder than it sounds because errors can arrive after a 200 status code on streaming responses. The HTTP connection is established and the initial status is sent before the model starts generating tokens. A failure mid-generation looks like a successful connection that stops producing data, not a clean error response. The proxy needs to handle both the happy path and these messy partial-failure scenarios.
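The accounting logic for both failure modes reduces to one rule: prefer the provider-reported usage from the final chunk, and fall back to an estimate over whatever content arrived. A simplified sketch (chunk parsing and the chars/4 heuristic are both stand-ins for real SSE parsing and tokenization):

```typescript
// Account for a streamed response, whether or not the usage chunk arrived.
type StreamResult = { outputTokens: number; exact: boolean };

function accountStream(
  chunks: Array<{ content?: string; usage?: { completion_tokens: number } }>,
  completed: boolean, // did the stream finish cleanly?
): StreamResult {
  const final = chunks[chunks.length - 1];
  if (completed && final?.usage) {
    return { outputTokens: final.usage.completion_tokens, exact: true };
  }
  // Stream dropped (or client cancelled) before the usage chunk:
  // estimate from received content so the spend is not recorded as zero.
  const text = chunks.map((c) => c.content ?? "").join("");
  return { outputTokens: Math.ceil(text.length / 4), exact: false };
}
```

On client cancellation, the proxy would first abort the upstream request (for example via the AbortController signal passed to fetch) to stop further generation, then run this same fallback accounting over the chunks received so far.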
The llm-budget-proxy reference implementation
The llm-budget-proxy is an open-source reference implementation that puts these concepts into a deployable system. It is a single-container reverse proxy built with Fastify and SQLite that provides rate limiting, budget enforcement, exact-match caching, and cost tracking for OpenAI-compatible APIs.
The architecture follows the middleware chain described above. Configuration is driven by two YAML files: a main config file that defines rate limits, budget thresholds, model downgrade rules, cache settings, and alert webhooks, and a pricing manifest that maps model names to per-token costs with a version date so you know when prices were last updated.
The proxy is OpenAI-compatible only in its current form. Anthropic’s Messages API uses a different request and response schema, including different field names for token counts, different streaming formats, and different error structures. Supporting Anthropic would require either separate route handlers or a normalization layer that translates between schemas. That is a documented future extension, not something the MVP attempts to do. Keeping the scope narrow keeps the implementation understandable and the codebase small enough that a single engineer can read the entire thing in an afternoon.
Deployment takes about five minutes with Docker:
# Clone and deploy
git clone https://github.com/InkByteStudio/llm-budget-proxy
cd llm-budget-proxy
cp .env.example .env   # add your OpenAI key and admin key
docker compose up -d
The admin API provides endpoints for creating and managing API keys, viewing usage statistics, and checking budget status. Each key gets its own rate limits and budget allocations, which can override the defaults defined in the config file.
Build vs buy: honest comparison
The LLM proxy space is not empty. LiteLLM has roughly 39,000 GitHub stars and supports over 100 provider integrations. It offers virtual keys, per-key budgets, a spend management dashboard, and is backed by Postgres and Redis. It is a mature, proven platform that many production teams rely on. Helicone and Portkey offer managed solutions with additional analytics, prompt management, and compliance features.
The differentiator for llm-budget-proxy is operational simplicity. It runs as a single container with SQLite. There is no Postgres to provision, no Redis to manage, no admin UI to deploy separately. Configuration is a YAML file. Deployment is docker compose up. The entire codebase is small enough to audit in a single sitting. That matters when you need to understand exactly what the proxy does with your API keys and request data.
When to use llm-budget-proxy: you are a small team, you are running in dev or staging environments, you use a single provider, and you want to understand the internals of LLM cost control rather than treat it as a black box. It is also a good fit for teams that want a reference implementation to learn from before evaluating larger platforms.
When to use LiteLLM, Helicone, or Portkey: you need multi-provider support across OpenAI, Anthropic, Google, and others. You need enterprise scale with multiple proxy instances. You need vendor support, SLAs, or compliance certifications. You need features like prompt management, A/B testing, or advanced analytics that go beyond cost control.
A hybrid approach works well for many organizations: use llm-budget-proxy for development and staging environments where simplicity and cost matter most, and a vendor solution for production where scale, support, and multi-provider routing justify the operational overhead.
What is next
If you want to implement these controls hands-on, follow the companion tutorial: Implement LLM Rate Limiting and Cost Controls. It walks through deploying the proxy, creating API keys, configuring budgets and rate limits, and validating each control with real OpenAI API calls.
For more on the topics covered in this guide:
- API Security for AI Apps and Modern SaaS Integrations covers rate limiting as a security control and API key management best practices.
- Kubernetes Networking for AI Workloads addresses infrastructure-level cost management for AI serving platforms.
- How to Secure Agentic AI Applications: The 2026 Playbook covers runaway agent protection and guardrails that prevent unbounded tool use.
- Harden Your CI/CD Pipeline with Sigstore, SLSA, and SBOMs provides complementary CI/CD hardening techniques.
Get the free LLM Cost Control Checklist →
Frequently asked questions
How does LLM rate limiting differ from traditional API rate limiting?
LLM rate limiting tracks both requests per minute (RPM) and tokens per minute (TPM), because cost scales with input and output token count rather than request count alone. A single LLM request can consume thousands of tokens and cost several dollars, making token-based limits essential alongside request-based limits.
Can you calculate exact LLM API cost before sending a request?
Not exactly. You can estimate input cost by counting tokens with tiktoken and calculate a worst-case output ceiling using the max_tokens parameter, but actual output cost depends on how many tokens the model generates at runtime. Accurate cost is recorded after the response completes.
When should you build your own LLM proxy vs using a managed solution?
Build your own when you need a lightweight, single-provider proxy for dev/staging, want to understand the internals, or need full control over a simple deployment. Use a managed solution like LiteLLM, Helicone, or Portkey when you need multi-provider support, enterprise scale, vendor support, or compliance certifications.