
Implement LLM API Rate Limiting and Cost Controls: Token Budgets, Per-Key Throttling, and Usage Dashboards

Intermediate · 55 min · 20 min read · Byte Smith

Before you begin

  • Node.js and TypeScript basics
  • Docker basics (building images, running containers)
  • An OpenAI API key

What you'll learn

  • Set up a Fastify reverse proxy that forwards requests to OpenAI
  • Count tokens with tiktoken and estimate request cost before sending
  • Implement per-key API authentication with hashed keys in SQLite
  • Build sliding-window rate limiting for RPM and TPM
  • Enforce per-key daily and monthly token budgets with graceful degradation
  • Add exact-match request caching to reduce redundant API calls
  • Handle streaming SSE responses with end-of-stream accounting
  • Build a usage dashboard with Chart.js and deploy with Docker

If your team uses LLM APIs directly, you have already experienced the problem: one runaway integration, one misconfigured retry loop, or one enthusiastic developer can burn through hundreds of dollars in minutes. Provider-side rate limits protect the provider, not your budget. You need controls on your side.

This tutorial walks through building an LLM API proxy that sits between your applications and OpenAI (or any compatible provider) and enforces per-key rate limits, token budgets, caching, and cost tracking. It is the hands-on companion to LLM API Rate Limiting and Cost Control. The full source is on GitHub at llm-budget-proxy.

Before you start, clone the repo and install dependencies:

git clone https://github.com/InkByteStudio/llm-budget-proxy.git
cd llm-budget-proxy
npm install

Step 1: Set up the proxy server (8 min)

The proxy is a Fastify server that accepts requests on POST /v1/chat/completions, runs them through a middleware chain (auth, rate limit, budget check), and forwards them to OpenAI. Start with the server entry point and the config loader.

Configuration with YAML and environment variables

The config lives in config/config.yml and supports ${ENV_VAR} substitution so you never put secrets in the file itself:

server:
  port: 3000
  host: "0.0.0.0"
  adminKey: "${ADMIN_API_KEY}"

provider:
  name: openai
  baseUrl: "https://api.openai.com"
  apiKey: "${OPENAI_API_KEY}"

rateLimits:
  default:
    rpm: 60
    tpm: 100000
  overrides: []

budgets:
  defaultDaily: 10.00
  defaultMonthly: 100.00
  alertThresholds:
    - percent: 80
      action: warn
    - percent: 95
      action: downgrade
    - percent: 100
      action: block

cache:
  enabled: true
  defaultTtlSeconds: 3600
  maxEntries: 10000

database:
  path: "./data/llm-budget-proxy.db"

The loader reads this file, substitutes environment variables, and validates the result with Zod:

File: src/config/loader.ts

import { readFileSync, existsSync } from "node:fs";
import { resolve } from "node:path";
import { parse as parseYaml } from "yaml";
import { configSchema, type Config } from "./schema.js";

function substituteEnvVars(obj: unknown): unknown {
  if (typeof obj === "string") {
    return obj.replace(/\$\{(\w+)\}/g, (_, varName) => {
      return process.env[varName] ?? "";
    });
  }
  if (Array.isArray(obj)) {
    return obj.map(substituteEnvVars);
  }
  if (obj !== null && typeof obj === "object") {
    const result: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(obj)) {
      result[key] = substituteEnvVars(value);
    }
    return result;
  }
  return obj;
}

export function loadConfig(configPath?: string): Config {
  const resolvedPath = configPath ?? resolve("config", "config.yml");

  if (!existsSync(resolvedPath)) {
    throw new Error(`Config file not found: ${resolvedPath}`);
  }

  const raw = readFileSync(resolvedPath, "utf-8");
  const parsed = parseYaml(raw);
  const substituted = substituteEnvVars(parsed);

  const result = configSchema.safeParse(substituted);
  if (!result.success) {
    const errors = result.error.issues
      .map((i) => `  - ${i.path.join(".")}: ${i.message}`)
      .join("\n");
    throw new Error(`Invalid configuration:\n${errors}`);
  }

  return result.data;
}
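A quick, self-contained check of the `${ENV_VAR}` substitution behavior. The `substitute` helper below mirrors the loader's regex for plain strings; it is illustrative, not code from the repo:

```typescript
// Mirrors the loader's regex: ${NAME} is replaced with the variable's
// value, and unset variables collapse to the empty string.
function substitute(value: string, env: Record<string, string | undefined>): string {
  return value.replace(/\$\{(\w+)\}/g, (_, name) => env[name] ?? "");
}

const env = { ADMIN_API_KEY: "admin-dev-key" };

console.log(substitute("${ADMIN_API_KEY}", env));         // admin-dev-key
console.log(substitute("prefix-${MISSING}-suffix", env)); // prefix--suffix
```

Note that an unset variable silently becomes `""` rather than failing, which is why the Zod validation pass afterward matters: it catches the resulting empty or malformed values.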

The server entry point

File: src/server.ts

import Fastify from "fastify";
import { loadConfig } from "./config/loader.js";
import { loadPricingManifest } from "./pricing/manifest.js";
import { getDb, closeDb } from "./storage/db.js";
import { createAuthMiddleware } from "./middleware/auth.js";
import { createRateLimiter } from "./middleware/rate-limiter.js";
import { createBudgetChecker } from "./middleware/budget-checker.js";
import { createProxyHandler } from "./proxy/handler.js";
import { registerDashboardRoutes } from "./dashboard/api.js";

async function main(): Promise<void> {
  const config = loadConfig();
  loadPricingManifest();
  const db = getDb(config.database.path);

  const app = Fastify({ logger: true, bodyLimit: 10 * 1024 * 1024 });

  app.get("/health", async () => ({
    status: "ok",
    uptime: process.uptime(),
    version: "1.0.0",
  }));

  registerDashboardRoutes(app, db, config);

  const authMiddleware = createAuthMiddleware(db);
  const rateLimiter = createRateLimiter(config);
  const budgetChecker = createBudgetChecker(db, config);
  const proxyHandler = createProxyHandler(db, config);

  app.post("/v1/chat/completions", {
    preHandler: [authMiddleware, rateLimiter, budgetChecker],
  }, proxyHandler);

  const shutdown = async (): Promise<void> => {
    await app.close();
    closeDb();
    process.exit(0);
  };

  process.on("SIGTERM", shutdown);
  process.on("SIGINT", shutdown);

  await app.listen({ port: config.server.port, host: config.server.host });
}

main().catch((err) => {
  console.error("Failed to start:", err);
  process.exit(1);
});

The middleware chain runs in order: authenticate the key, check rate limits, check budget, then forward the request. Each middleware can short-circuit the request with an error response. The proxy handler itself also validates the request body before forwarding: if the model field is missing or not a string, it returns a 400 immediately. Upstream JSON responses are parsed inside a try/catch so that malformed responses result in a clean 502 instead of crashing the process.
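A minimal sketch of that pre-forward validation, with an illustrative helper name (`validateChatBody` is not from the repo):

```typescript
// Reject before forwarding if the body is not an object or `model`
// is missing / not a non-empty string, per the handler behavior above.
interface ValidationError { code: number; error: string; message: string; }

function validateChatBody(body: unknown): ValidationError | null {
  if (body === null || typeof body !== "object") {
    return { code: 400, error: "invalid_body", message: "JSON object required" };
  }
  const model = (body as Record<string, unknown>).model;
  if (typeof model !== "string" || model.length === 0) {
    return { code: 400, error: "invalid_model", message: "model must be a non-empty string" };
  }
  return null; // valid: safe to forward upstream
}

console.log(validateChatBody({ model: "gpt-4o-mini", messages: [] })); // null
console.log(validateChatBody({ messages: [] })?.code);                 // 400
```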

Verify

Start the server and confirm it responds:

export OPENAI_API_KEY=sk-your-key-here
export ADMIN_API_KEY=admin-dev-key
npm run dev
curl http://localhost:3000/health
# {"status":"ok","uptime":1.23,"version":"1.0.0"}

The proxy is running but will return 401 on /v1/chat/completions until you create an API key (Step 3).

Step 2: Add token counting and cost estimation (7 min)

Before forwarding a request, the proxy counts input tokens and estimates the worst-case cost. This estimate drives rate limiting and budget enforcement.

Token counting with tiktoken

File: src/proxy/token-counter.ts

import { encoding_for_model, type TiktokenModel } from "tiktoken";

const MODEL_ENCODING_MAP: Record<string, TiktokenModel> = {
  "gpt-4o": "gpt-4o",
  "gpt-4o-mini": "gpt-4o-mini",
  "gpt-4-turbo": "gpt-4-turbo",
  "gpt-4": "gpt-4",
  "gpt-3.5-turbo": "gpt-3.5-turbo",
};

const DEFAULT_ENCODING: TiktokenModel = "gpt-4o";

export function countTokens(messages: ChatMessage[], model: string): number {
  const tiktokenModel = MODEL_ENCODING_MAP[model] ?? DEFAULT_ENCODING;

  let enc;
  try {
    enc = encoding_for_model(tiktokenModel);
  } catch {
    enc = encoding_for_model(DEFAULT_ENCODING);
  }

  try {
    let tokenCount = 0;

    for (const message of messages) {
      tokenCount += 4; // message overhead
      if (message.role) {
        tokenCount += enc.encode(message.role).length;
      }
      if (typeof message.content === "string") {
        tokenCount += enc.encode(message.content).length;
      } else if (Array.isArray(message.content)) {
        for (const part of message.content) {
          if (part.type === "text" && part.text) {
            tokenCount += enc.encode(part.text).length;
          }
        }
      }
    }

    tokenCount += 2; // reply priming
    return tokenCount;
  } finally {
    enc.free();
  }
}

export interface ChatMessage {
  role: string;
  content: string | ContentPart[];
  name?: string;
}

interface ContentPart {
  type: string;
  text?: string;
}

The function adds 4 tokens per message for the ChatML overhead and 2 tokens at the end for reply priming, matching OpenAI’s documented token-counting rules for chat models.
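The accounting can be illustrated with a stub encoder that counts one token per whitespace-separated word (real counts require tiktoken's BPE encodings, so these numbers are for illustration only):

```typescript
// Stub encoder: one "token" per word. Real tokenization is tiktoken's job.
const encode = (s: string): string[] => s.split(/\s+/).filter(Boolean);

function countTokensStub(messages: { role: string; content: string }[]): number {
  let total = 0;
  for (const m of messages) {
    total += 4;                        // per-message overhead
    total += encode(m.role).length;    // role tokens
    total += encode(m.content).length; // content tokens
  }
  return total + 2;                    // reply priming
}

// 4 (overhead) + 1 (role) + 2 (content) + 2 (priming) = 9
console.log(countTokensStub([{ role: "user", content: "hello world" }]));
```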

Cost estimation

File: src/proxy/cost-estimator.ts

import { getModelPricing, type PricingModel } from "../pricing/manifest.js";

export interface CostEstimate {
  inputTokens: number;
  maxOutputTokens: number;
  inputCost: number;
  worstCaseOutputCost: number;
  worstCaseTotalCost: number;
  model: string;
  pricing: PricingModel;
}

export function estimateCost(
  model: string,
  inputTokens: number,
  maxTokens?: number,
): CostEstimate | null {
  const pricing = getModelPricing(model);
  if (!pricing) return null;

  const maxOutputTokens = maxTokens ?? pricing.maxOutputTokens;
  const inputCost = (inputTokens / 1000) * pricing.inputPer1k;
  const worstCaseOutputCost = (maxOutputTokens / 1000) * pricing.outputPer1k;

  return {
    inputTokens,
    maxOutputTokens,
    inputCost,
    worstCaseOutputCost,
    worstCaseTotalCost: inputCost + worstCaseOutputCost,
    model,
    pricing,
  };
}

export function calculateActualCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const pricing = getModelPricing(model);
  if (!pricing) return 0;

  return (inputTokens / 1000) * pricing.inputPer1k
       + (outputTokens / 1000) * pricing.outputPer1k;
}

The estimate uses worst-case math: (input tokens * input price) + (max_tokens * output price). This is intentionally conservative. After the response arrives, calculateActualCost computes the real cost from the usage object OpenAI returns.
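Plugging in illustrative numbers, 500 input tokens with max_tokens of 1000 at gpt-4o-mini's manifest prices:

```typescript
// Worst-case estimate for gpt-4o-mini at the manifest prices:
// $0.00015 per 1k input tokens, $0.0006 per 1k output tokens.
const inputPer1k = 0.00015;
const outputPer1k = 0.0006;

const inputTokens = 500;
const maxTokens = 1000;

const inputCost = (inputTokens / 1000) * inputPer1k;          // 0.000075
const worstCaseOutputCost = (maxTokens / 1000) * outputPer1k; // 0.0006
const worstCaseTotal = inputCost + worstCaseOutputCost;

console.log(worstCaseTotal.toFixed(6)); // 0.000675
```

Note how the output side dominates: capping `max_tokens` in client requests is the single biggest lever for tightening the worst-case estimate.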

The pricing manifest

Model prices live in config/pricing.yml so you can update them without changing code:

version: "2026-03-14"
provider: openai
models:
  gpt-4o:
    inputPer1k: 0.0025
    outputPer1k: 0.01
    maxOutputTokens: 16384
  gpt-4o-mini:
    inputPer1k: 0.00015
    outputPer1k: 0.0006
    maxOutputTokens: 16384
  gpt-4-turbo:
    inputPer1k: 0.01
    outputPer1k: 0.03
    maxOutputTokens: 4096

Tip

Keep pricing.yml in version control and update it when provider pricing changes. The proxy will refuse requests for models not listed in this manifest, which prevents surprise costs from unknown models.

Verify

You can verify token counting with a quick unit test or by checking the X-Input-Tokens and X-Estimated-Cost headers on any proxied response (visible after Step 3).

Step 3: Implement API key authentication (7 min)

Every request must include a proxy-issued API key. This is not the OpenAI key (which the proxy holds server-side). Proxy keys are hashed with SHA-256 before storage, so the database never contains plaintext keys.

SQLite schema

The database schema is created automatically by the migrations module. The relevant table:

CREATE TABLE IF NOT EXISTS api_keys (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  name          TEXT NOT NULL,
  key_hash      TEXT NOT NULL UNIQUE,
  key_prefix    TEXT NOT NULL,
  team          TEXT,
  budget_id     INTEGER REFERENCES budgets(id),
  created_at    TEXT NOT NULL DEFAULT (datetime('now')),
  revoked_at    TEXT,
  metadata      TEXT
);

Key generation and lookup

File: src/storage/keys.ts

import { createHash, randomBytes } from "node:crypto";
import type Database from "better-sqlite3";

export function hashKey(plaintext: string): string {
  return createHash("sha256").update(plaintext).digest("hex");
}

export function generateKey(): string {
  return `lbp_${randomBytes(16).toString("hex")}`;
}

export function createKey(
  db: Database.Database,
  name: string,
  team: string | null,
  budgetId: number | null,
): CreateKeyResult {
  const key = generateKey();
  const keyHash = hashKey(key);
  const keyPrefix = key.slice(0, 12);

  const stmt = db.prepare(`
    INSERT INTO api_keys (name, key_hash, key_prefix, team, budget_id)
    VALUES (?, ?, ?, ?, ?)
  `);
  const result = stmt.run(name, keyHash, keyPrefix, team, budgetId);

  return { id: result.lastInsertRowid as number, name, key, keyPrefix };
}

export function lookupKey(
  db: Database.Database,
  plaintext: string,
): KeyRecord | null {
  const keyHash = hashKey(plaintext);
  const row = db.prepare(`
    SELECT id, name, key_prefix, team, budget_id, created_at, revoked_at
    FROM api_keys WHERE key_hash = ?
  `).get(keyHash) as KeyRecord | undefined;

  return row ?? null;
}
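The generate-then-hash round trip can be checked standalone with Node's built-in crypto:

```typescript
import { createHash, randomBytes } from "node:crypto";

// Round trip of the key scheme above: lbp_ plus 32 hex characters,
// stored only as a 64-character SHA-256 hex digest.
const key = `lbp_${randomBytes(16).toString("hex")}`;
const hash = createHash("sha256").update(key).digest("hex");

console.log(/^lbp_[0-9a-f]{32}$/.test(key)); // true
console.log(hash.length);                    // 64
```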

Keys use the format lbp_ followed by 32 hex characters. On each request, the middleware hashes the provided token with SHA-256 and looks it up:

File: src/middleware/auth.ts

import type { FastifyRequest, FastifyReply } from "fastify";
import type Database from "better-sqlite3";
import { lookupKey } from "../storage/keys.js";

export function createAuthMiddleware(db: Database.Database) {
  return async function authMiddleware(
    request: FastifyRequest,
    reply: FastifyReply,
  ): Promise<void> {
    const authHeader = request.headers.authorization;
    if (!authHeader) {
      reply.code(401).send({
        error: "missing_api_key",
        message: "Authorization header required",
      });
      return;
    }

    const match = authHeader.match(/^Bearer\s+(.+)$/i);
    if (!match) {
      reply.code(401).send({
        error: "invalid_auth_format",
        message: "Expected: Authorization: Bearer <key>",
      });
      return;
    }

    const token = match[1];
    if (!token.startsWith("lbp_")) {
      reply.code(401).send({
        error: "invalid_key_format",
        message: "API key must start with lbp_",
      });
      return;
    }

    const keyRecord = lookupKey(db, token);
    if (!keyRecord) {
      reply.code(401).send({
        error: "invalid_api_key",
        message: "API key not found",
      });
      return;
    }

    if (keyRecord.revoked_at) {
      reply.code(401).send({
        error: "key_revoked",
        message: "API key has been revoked",
      });
      return;
    }

    request.keyRecord = keyRecord;
  };
}

Create your first key

Use the admin API to create a key:

curl -X POST http://localhost:3000/api/keys \
  -H "Authorization: Bearer admin-dev-key" \
  -H "Content-Type: application/json" \
  -d '{"name": "dev-test", "team": "engineering"}'

The response includes the plaintext key exactly once. Save it. The proxy only stores the hash.

Verify

# Without a key — 401
curl -s -o /dev/null -w "%{http_code}" \
  -X POST http://localhost:3000/v1/chat/completions

# With a valid key — 200 (forwarded to OpenAI)
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer lbp_your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Say hello"}]}'

Step 4: Add per-key rate limiting (7 min)

The rate limiter enforces two sliding-window limits per key: requests per minute (RPM) and tokens per minute (TPM). Both are configurable per key pattern.

File: src/middleware/rate-limiter.ts

import type { FastifyRequest, FastifyReply } from "fastify";
import type { Config } from "../config/schema.js";

interface WindowEntry {
  timestamp: number;
  tokens: number;
}

const windows = new Map<string, WindowEntry[]>();
const WINDOW_MS = 60_000;

export function createRateLimiter(config: Config) {
  return async function rateLimiterMiddleware(
    request: FastifyRequest,
    reply: FastifyReply,
  ): Promise<void> {
    const keyRecord = request.keyRecord;
    if (!keyRecord) return;

    const limits = resolveRateLimits(config, keyRecord.name);
    const windowKey = `key:${keyRecord.id}`;
    const now = Date.now();

    const entries = windows.get(windowKey) ?? [];
    const validEntries = entries.filter((e) => now - e.timestamp < WINDOW_MS);
    windows.set(windowKey, validEntries);

    const currentRpm = validEntries.length;
    const currentTpm = validEntries.reduce((sum, e) => sum + e.tokens, 0);

    reply.header("X-RateLimit-Limit-RPM", limits.rpm);
    reply.header("X-RateLimit-Remaining-RPM", Math.max(0, limits.rpm - currentRpm));
    reply.header("X-RateLimit-Limit-TPM", limits.tpm);
    reply.header("X-RateLimit-Remaining-TPM", Math.max(0, limits.tpm - currentTpm));

    if (currentRpm >= limits.rpm) {
      const oldestValid = validEntries[0];
      const retryAfter = oldestValid
        ? Math.ceil((oldestValid.timestamp + WINDOW_MS - now) / 1000)
        : 60;

      reply.code(429).header("Retry-After", retryAfter).send({
        error: "rate_limit_exceeded",
        message: `RPM limit exceeded (${limits.rpm}/min)`,
        retryAfter,
      });
      return;
    }

    if (currentTpm >= limits.tpm) {
      const oldestValid = validEntries[0];
      const retryAfter = oldestValid
        ? Math.ceil((oldestValid.timestamp + WINDOW_MS - now) / 1000)
        : 60;

      reply.code(429).header("Retry-After", retryAfter).send({
        error: "token_rate_limit_exceeded",
        message: `TPM limit exceeded (${limits.tpm}/min)`,
        retryAfter,
      });
      return;
    }
  };
}

The implementation uses an in-memory Map of WindowEntry arrays keyed by API key ID. On each request, entries older than 60 seconds are pruned. After the request completes, a recordRequest function pushes a new entry with the token count.

Every response includes four rate-limit headers so callers can implement client-side backoff: X-RateLimit-Limit-RPM, X-RateLimit-Remaining-RPM, X-RateLimit-Limit-TPM, and X-RateLimit-Remaining-TPM.
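The sliding-window RPM logic can be exercised in isolation with an injected clock (a simplified sketch of the pruning and check above, not the repo's exact code):

```typescript
// Simplified sliding window: prune entries older than 60s, then
// compare the surviving count against the RPM limit.
interface Entry { timestamp: number; tokens: number; }
const WINDOW_MS = 60_000;

function allow(entries: Entry[], now: number, rpm: number): boolean {
  const valid = entries.filter((e) => now - e.timestamp < WINDOW_MS);
  entries.length = 0;      // prune in place
  entries.push(...valid);
  return valid.length < rpm;
}

const entries: Entry[] = [];
const results: boolean[] = [];
for (let i = 0; i < 5; i++) {
  const now = i * 1000;    // one request per second
  const ok = allow(entries, now, 3);
  if (ok) entries.push({ timestamp: now, tokens: 10 });
  results.push(ok);
}
console.log(results); // [ true, true, true, false, false ]
```

With rpm set to 3, the first three requests pass and the fourth and fifth are rejected; once the window slides past the oldest entry, capacity frees up again.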

Note

The in-memory sliding window works well for single-instance deployments. If you run multiple proxy instances behind a load balancer, move the window state to Redis or a shared SQLite WAL database.

Verify

Set a low RPM limit in config.yml (e.g., rpm: 3) and burst requests:

for i in $(seq 1 5); do
  curl -s -o /dev/null -w "Request $i: %{http_code}\n" \
    -X POST http://localhost:3000/v1/chat/completions \
    -H "Authorization: Bearer lbp_your-key-here" \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"hi"}]}'
done

The first three requests return 200. The fourth and fifth return 429 with a Retry-After header.

Step 5: Enforce budget limits (8 min)

Rate limits control velocity. Budgets control total spend. The budget checker runs after the rate limiter and compares the key’s accumulated cost against configurable thresholds.

Budget schema and status query

CREATE TABLE IF NOT EXISTS budgets (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  name          TEXT NOT NULL,
  period        TEXT NOT NULL CHECK (period IN ('daily', 'monthly')),
  limit_dollars REAL NOT NULL,
  reset_at      TEXT NOT NULL,
  created_at    TEXT NOT NULL DEFAULT (datetime('now'))
);

The getBudgetStatus function sums cost_dollars from usage_records for the current period and computes consumed, remaining, and percent used:

export function getBudgetStatus(
  db: Database.Database,
  budgetId: number,
): BudgetStatus | null {
  const budget = getBudget(db, budgetId);
  if (!budget) return null;

  maybeResetBudget(db, budget);

  const periodStart = computePeriodStart(budget.period);
  const row = db.prepare(`
    SELECT COALESCE(SUM(cost_dollars), 0) as consumed
    FROM usage_records
    WHERE key_id IN (SELECT id FROM api_keys WHERE budget_id = ?)
      AND created_at >= ?
  `).get(budgetId, periodStart) as { consumed: number };

  const consumed = row.consumed;
  const remaining = Math.max(0, budget.limit_dollars - consumed);
  const percentUsed = budget.limit_dollars > 0
    ? (consumed / budget.limit_dollars) * 100
    : 0;

  return { budget, consumed_dollars: consumed, remaining_dollars: remaining, percent_used: percentUsed };
}

The budget checker middleware

File: src/middleware/budget-checker.ts

The middleware reads the alertThresholds from config, sorted in descending order of percent, and applies the first threshold the key’s usage has crossed. There are three possible actions:

  • warn: Sets an X-Budget-Warning: approaching_limit header. The request proceeds.
  • downgrade: Rewrites the model field to a cheaper model (e.g., gpt-4o to gpt-4o-mini). Disabled by default in config.
  • block: Returns 402 Payment Required with budget details.

for (const threshold of thresholds) {
  if (status.percent_used >= threshold.percent) {
    if (threshold.action === "block") {
      reply.code(402).send({
        error: "budget_exceeded",
        message: `${status.budget.period} budget exhausted`,
        budget: {
          period: status.budget.period,
          limit: status.budget.limit_dollars,
          consumed: status.consumed_dollars,
          remaining: status.remaining_dollars,
        },
      });
      return;
    }

    if (threshold.action === "downgrade" && config.modelDowngrade.enabled) {
      const currentModel = body?.model as string | undefined;
      if (currentModel) {
        const rule = config.modelDowngrade.rules.find((r) => r.from === currentModel);
        if (rule) {
          (body as Record<string, unknown>).model = rule.to;
          reply.header("X-Model-Downgraded", "true");
          reply.header("X-Original-Model", currentModel);
        }
      }
    }

    if (threshold.action === "warn") {
      reply.header("X-Budget-Warning", "approaching_limit");
    }

    break;
  }
}
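Threshold selection reduces to picking the highest threshold the key has crossed. A standalone sketch (the `resolveAction` helper and the explicit sort are illustrative, not from the repo):

```typescript
// With thresholds sorted descending by percent, the first crossed
// threshold wins; under every threshold, no action applies.
type Action = "warn" | "downgrade" | "block";
interface Threshold { percent: number; action: Action; }

function resolveAction(percentUsed: number, thresholds: Threshold[]): Action | null {
  const sorted = [...thresholds].sort((a, b) => b.percent - a.percent);
  for (const t of sorted) {
    if (percentUsed >= t.percent) return t.action;
  }
  return null;
}

const thresholds: Threshold[] = [
  { percent: 80, action: "warn" },
  { percent: 95, action: "downgrade" },
  { percent: 100, action: "block" },
];

console.log(resolveAction(50, thresholds));  // null
console.log(resolveAction(85, thresholds));  // warn
console.log(resolveAction(97, thresholds));  // downgrade
console.log(resolveAction(100, thresholds)); // block
```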
Warning

Model downgrade is disabled by default (modelDowngrade.enabled: false in config). Enable it only when your callers can tolerate receiving responses from a cheaper model. Some applications break when the model changes unexpectedly.

Every response includes X-Budget-Limit, X-Budget-Remaining, and X-Budget-Period headers so callers always know their budget status.

Verify

Create a key with a tiny budget and send requests until the budget is exhausted:

# Create a key with a $0.01 daily budget
curl -X POST http://localhost:3000/api/keys \
  -H "Authorization: Bearer admin-dev-key" \
  -H "Content-Type: application/json" \
  -d '{"name":"budget-test","budgetPeriod":"daily","budgetLimit":0.01}'

# Send requests until you get a 402
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer lbp_the-new-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Write a paragraph about budgets."}]}'

After one or two gpt-4o requests, you will see a 402 response with the budget breakdown.

Step 6: Add caching and streaming support (8 min)

Caching eliminates redundant API calls. Streaming requires special handling because usage data only arrives in the final SSE chunk.

Exact-match caching

The cache hashes the entire request body (with keys sorted for determinism) and stores the response in SQLite with a TTL:

File: src/storage/cache.ts

import { createHash } from "node:crypto";
import type Database from "better-sqlite3";

export function computeRequestHash(body: Record<string, unknown>): string {
  const normalized = JSON.stringify(sortKeys(body));
  return createHash("sha256").update(normalized).digest("hex");
}

function sortKeys(obj: unknown): unknown {
  if (obj === null || typeof obj !== "object") return obj;
  if (Array.isArray(obj)) return obj.map(sortKeys);
  const sorted: Record<string, unknown> = {};
  for (const key of Object.keys(obj as Record<string, unknown>).sort()) {
    sorted[key] = sortKeys((obj as Record<string, unknown>)[key]);
  }
  return sorted;
}

export function getCachedResponse(
  db: Database.Database,
  requestHash: string,
): CacheEntry | null {
  const row = db.prepare(`
    SELECT * FROM request_cache
    WHERE request_hash = ? AND expires_at > datetime('now')
  `).get(requestHash) as CacheEntry | undefined;

  return row ?? null;
}

export function setCachedResponse(
  db: Database.Database,
  requestHash: string,
  model: string,
  responseBody: string,
  inputTokens: number,
  outputTokens: number,
  ttlSeconds: number,
): void {
  db.prepare(`
    INSERT OR REPLACE INTO request_cache
      (request_hash, model, response_body, input_tokens, output_tokens, expires_at)
    VALUES (?, ?, ?, ?, ?, datetime('now', '+' || ? || ' seconds'))
  `).run(requestHash, model, responseBody, inputTokens, outputTokens, ttlSeconds);
}

Cache hits return the stored response with X-Cache: HIT and record zero cost in the usage table. Cache hits also include X-Budget-Limit, X-Budget-Remaining, and X-Budget-Period headers so callers always have up-to-date budget status, even on cached responses. A background timer evicts expired entries and enforces the maxEntries cap every five minutes; the eviction interval is cleaned up on shutdown to avoid resource leaks.
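The order-insensitivity of the cache key is easy to verify standalone, using the same sort-then-hash approach as above:

```typescript
import { createHash } from "node:crypto";

// Recursively sort object keys so JSON.stringify is deterministic,
// then hash: two bodies that differ only in key order collide on purpose.
function sortKeys(obj: unknown): unknown {
  if (obj === null || typeof obj !== "object") return obj;
  if (Array.isArray(obj)) return obj.map(sortKeys);
  const sorted: Record<string, unknown> = {};
  for (const k of Object.keys(obj as Record<string, unknown>).sort()) {
    sorted[k] = sortKeys((obj as Record<string, unknown>)[k]);
  }
  return sorted;
}

const hash = (body: Record<string, unknown>): string =>
  createHash("sha256").update(JSON.stringify(sortKeys(body))).digest("hex");

const a = hash({ model: "gpt-4o-mini", messages: [{ role: "user", content: "hi" }] });
const b = hash({ messages: [{ role: "user", content: "hi" }], model: "gpt-4o-mini" });

console.log(a === b); // true
```

Arrays are deliberately not sorted: message order is semantically meaningful, so reordered messages produce a different cache key.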

Streaming SSE support

When the client sends "stream": true, the proxy injects stream_options.include_usage = true so OpenAI returns a final chunk with token counts. The proxy forwards each SSE chunk to the client in real time and extracts usage data at the end:

// Inject stream_options so OpenAI returns usage in the final chunk
if (!body.stream_options || typeof body.stream_options !== "object") {
  body.stream_options = {};
}
(body.stream_options as Record<string, unknown>).include_usage = true;

// ... forward chunks to client ...

// Parse each SSE line looking for the usage object
const lines = chunk.split("\n");
for (const line of lines) {
  if (!line.startsWith("data: ") || line === "data: [DONE]") continue;
  try {
    const data = JSON.parse(line.slice(6));
    if (data.usage) {
      finalUsage = data.usage;
    }
  } catch {
    // skip unparseable chunks
  }
}

The proxy also handles client disconnects by wiring an AbortController to the raw request’s close event. If the client disconnects mid-stream, the proxy cancels the upstream request and records partial usage.
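The end-of-stream accounting can be tested against synthetic SSE data (`extractUsage` is an illustrative distillation of the parsing loop above):

```typescript
// Only the final chunk — sent because stream_options.include_usage
// is set — carries the usage object; everything else is deltas.
interface Usage { prompt_tokens: number; completion_tokens: number; }

function extractUsage(sse: string): Usage | null {
  let usage: Usage | null = null;
  for (const line of sse.split("\n")) {
    if (!line.startsWith("data: ") || line === "data: [DONE]") continue;
    try {
      const data = JSON.parse(line.slice(6));
      if (data.usage) usage = data.usage;
    } catch {
      // skip unparseable lines, as the proxy does
    }
  }
  return usage;
}

const stream = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  'data: {"choices":[{"delta":{"content":"lo"}}]}',
  'data: {"choices":[],"usage":{"prompt_tokens":9,"completion_tokens":2}}',
  "data: [DONE]",
].join("\n");

console.log(extractUsage(stream)); // { prompt_tokens: 9, completion_tokens: 2 }
```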

Tip

Streaming responses are not cached. The cache only stores non-streaming responses. If you need caching for streaming workloads, consider a semantic cache at a higher layer.

Verify

Send the same non-streaming request twice and check the cache header:

# First request — X-Cache: MISS
curl -s -D - -X POST http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer lbp_your-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"What is 2+2?"}]}' \
  | grep X-Cache

# Second identical request — X-Cache: HIT, zero cost
curl -s -D - -X POST http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer lbp_your-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"What is 2+2?"}]}' \
  | grep X-Cache

The second request returns instantly with X-Cache: HIT and X-Request-Cost: 0.000000.

Step 7: Build the dashboard and alerting (5 min)

The proxy serves a single-page dashboard at /dashboard and exposes JSON API endpoints for usage data. Both are protected by the ADMIN_API_KEY.

Dashboard API endpoints

File: src/dashboard/api.ts

The admin API provides four endpoints:

  • GET /api/usage — raw usage records with optional filters (key_id, start, end, limit)
  • GET /api/usage/summary — aggregated cost, tokens, and request counts per key for the current period
  • GET /api/usage/timeseries — time-bucketed cost and request data for charting (hourly or daily granularity)
  • GET /api/budgets — current budget status for all keys

All endpoints require Authorization: Bearer <ADMIN_API_KEY> or ?admin_key=<key> as a query parameter. The admin key comparison uses crypto.timingSafeEqual to prevent timing attacks that could leak the key value through response-time analysis.
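A minimal sketch of that comparison (`safeCompare` is illustrative; note that `timingSafeEqual` throws on buffers of unequal length, so a length guard comes first — the length itself is not secret here):

```typescript
import { timingSafeEqual } from "node:crypto";

// Constant-time equality for secrets: avoids early-exit string
// comparison, which leaks the position of the first mismatch.
function safeCompare(a: string, b: string): boolean {
  const bufA = Buffer.from(a);
  const bufB = Buffer.from(b);
  if (bufA.length !== bufB.length) return false;
  return timingSafeEqual(bufA, bufB);
}

console.log(safeCompare("admin-dev-key", "admin-dev-key")); // true
console.log(safeCompare("admin-dev-key", "admin-dev-kex")); // false
console.log(safeCompare("admin-dev-key", "short"));         // false
```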

The dashboard UI

The dashboard is a single index.html file served at /dashboard. It loads Chart.js from a CDN and renders:

  • Summary cards showing total cost, request count, cache hit rate, and active keys
  • A line chart of cost over time
  • A table of per-key usage with cost, token counts, and cache stats

The dashboard fetches data from the admin API endpoints using the admin key entered in a form field. No build step is required.

Webhook alerting

When a budget threshold is crossed, the proxy fires a POST request to the configured webhookUrl with a JSON payload:

export interface WebhookPayload {
  event: "budgetWarning" | "budgetExceeded" | "anomaly";
  timestamp: string;
  keyName: string;
  team: string | null;
  details: {
    budgetName: string;
    period: "daily" | "monthly";
    limitDollars: number;
    consumedDollars: number;
    percentUsed: number;
  };
}

Alerts are debounced per event type and key name with a one-hour cooldown. This prevents flooding your Slack channel or PagerDuty when a key is hovering near a threshold.
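The debounce reduces to a map of last-fired timestamps keyed by event type and key name. An illustrative simplification with an injected clock (`shouldFire` is not the repo's exact code):

```typescript
// Fire at most once per (event, keyName) pair per hour.
const COOLDOWN_MS = 60 * 60 * 1000;
const lastFired = new Map<string, number>();

function shouldFire(event: string, keyName: string, now: number): boolean {
  const k = `${event}:${keyName}`;
  const last = lastFired.get(k);
  if (last !== undefined && now - last < COOLDOWN_MS) return false;
  lastFired.set(k, now);
  return true;
}

console.log(shouldFire("budgetWarning", "dev-test", 0));           // true
console.log(shouldFire("budgetWarning", "dev-test", 10 * 60_000)); // false (in cooldown)
console.log(shouldFire("budgetExceeded", "dev-test", 10 * 60_000)); // true (different event)
console.log(shouldFire("budgetWarning", "dev-test", 61 * 60_000)); // true (cooldown elapsed)
```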

Verify

Open http://localhost:3000/dashboard in your browser, enter your admin key, and confirm the charts render with data from your test requests. If you configured a WEBHOOK_URL, check that budget warnings appear when a key crosses the 80% threshold.

Step 8: Containerize and deploy (5 min)

The project includes a multi-stage Dockerfile and a docker-compose.yml for production deployment.

Dockerfile

# Build stage
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json* ./
RUN npm ci --ignore-scripts
COPY tsconfig.json ./
COPY src/ ./src/
RUN npm run build

# Production stage
FROM node:20-alpine
RUN addgroup -g 1001 -S proxyuser && \
    adduser -S proxyuser -u 1001
WORKDIR /app
COPY package.json package-lock.json* ./
RUN npm ci --omit=dev --ignore-scripts && npm cache clean --force
COPY --from=builder /app/dist ./dist
COPY config/ ./config/
COPY src/dashboard/index.html ./dist/dashboard/index.html
RUN mkdir -p /app/data && chown proxyuser:proxyuser /app/data
USER proxyuser
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]

The build stage compiles TypeScript. The production stage runs as a non-root proxyuser and includes a health check that hits the /health endpoint.

docker-compose.yml

services:
  proxy:
    build: .
    ports:
      - "3000:3000"
    volumes:
      - proxy-data:/app/data
      - ./config:/app/config:ro
    environment:
      OPENAI_API_KEY: "${OPENAI_API_KEY}"
      ADMIN_API_KEY: "${ADMIN_API_KEY:-admin-dev-key}"
      WEBHOOK_URL: "${WEBHOOK_URL:-}"
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3

volumes:
  proxy-data:

The proxy-data volume persists the SQLite database across container restarts. The config directory is mounted read-only.

End-to-end verification

Create a .env file from the example and bring the stack up:

cp .env.example .env
# Edit .env with your real OPENAI_API_KEY

docker compose up --build -d

Wait for the health check to pass, then create a key and send a request:

# Create a key
curl -X POST http://localhost:3000/api/keys \
  -H "Authorization: Bearer admin-dev-key" \
  -H "Content-Type: application/json" \
  -d '{"name":"docker-test","team":"ops","budgetPeriod":"daily","budgetLimit":5.00}'

# Send a request through the proxy
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer lbp_the-returned-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello from Docker"}]}'

# Check the dashboard
open http://localhost:3000/dashboard

You should see the request in the dashboard with cost, token counts, and latency.

Wrap-Up

You now have a working LLM API proxy that enforces per-key authentication, sliding-window rate limits, configurable budget thresholds with warn/downgrade/block actions, exact-match caching, streaming support with end-of-stream accounting, a usage dashboard, and webhook alerting. The whole stack runs in a single Docker container with SQLite for persistence.

The most common next steps are adding support for additional providers beyond OpenAI, implementing semantic caching for higher hit rates, adding per-team budget rollups, and integrating the webhook alerts with your existing incident response tooling.

For the full architecture rationale and cost modeling approach, see LLM API Rate Limiting and Cost Control. To download a printable checklist for deploying LLM cost controls, see the LLM Cost Control Checklist.