Build a rate-limit and cost-tracking middleware layer for API calls
Produces a middleware layer that enforces client-side rate limits, tracks token and request spend against a budget, and rejects calls before they exceed a quota, with typed errors, a maintainable price table, and an observability hook for metrics and alerts.
You are a senior backend engineer who controls cost and rate-limit risk on API spend. Build a rate-limit and cost-tracking middleware layer that wraps API calls.
Context:
- API: [OpenAI / Anthropic / generic; note the pricing basis: per-token / per-request]
- Language: [TypeScript / Python]
- Concurrency model: [single process / multi-process / distributed with Redis]
- Budget: [e.g. 'USD per day', 'requests per minute', 'tokens per hour']
- Needs streaming cost: [yes, count streamed tokens / no]
Build a middleware or guard layer with:
1. A client-side rate limiter (token bucket or sliding window) that throttles calls before they hit the network; for distributed mode, back it with Redis with a documented lock strategy. Expose rate config as knobs.
2. A budget tracker that estimates cost per call: for per-token APIs, count prompt plus completion tokens (including streamed tokens) and multiply by the price table; for per-request APIs, count requests. Keep a running total per budget window.
3. Short-circuit logic: before a call, check the rate limiter AND the remaining budget; if either is exhausted, reject with a typed BudgetExceededError or RateLimitedError and do NOT fire the request.
4. A price table as data (model -> input/output price per unit) with the date and currency noted, and a clear warning that prices change and must be maintained.
5. An observability hook: every decision (allowed, throttled, budget-exceeded) emits a structured event (call id, model, tokens, est. cost, decision) so metrics and alerts can be wired in.
6. Graceful behavior when the API returns its own usage headers: prefer the provider's reported usage over the estimate and reconcile.
Requirements:
- Never let a call fire if it would exceed the budget; the guard is the source of truth.
- Streaming cost counting must not block the stream; count asynchronously or after completion.
- All cost figures are estimates labeled as such; do not present them as exact billing.
Output, in this exact order:
1. A design overview (limiter strategy, budget model, short-circuit order, observability).
2. The full middleware module with typed interfaces.
3. A usage example wrapping a client call and showing a budget-exceeded rejection.
4. The price-table data structure with a maintenance note.
5. A test checklist: throttle under burst, reject at budget, accurate token count on a stream, distributed-mode correctness.
Success signal: the output is good only if a call is rejected before firing when it would breach the rate limit or budget, streamed tokens are counted toward cost without blocking the stream, and every cost number is clearly labeled an estimate tied to a maintainable price table.
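For orientation only, and not part of the prompt text: a minimal Python sketch of the kind of single-process guard the prompt asks for, covering the token bucket, the estimated-cost budget check, the short-circuit before sending, and the event hook. Every name here (`Guard`, `TokenBucket`, `estimate_cost_usd`, `PRICE_TABLE`, the `send` callback shape) is hypothetical, and the prices are placeholders rather than any provider's real rates. It assumes `send` returns the response together with the completion-token count the provider reported; distributed mode and streaming are left out.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class RateLimitedError(Exception):
    """The client-side limiter has no capacity; the request was not sent."""


class BudgetExceededError(Exception):
    """The estimated cost would breach the budget; the request was not sent."""


# Hypothetical price table: USD per 1,000 tokens. Placeholder figures, not real
# provider prices. A real table needs a date, a currency, and regular upkeep.
PRICE_TABLE: Dict[str, Dict[str, float]] = {
    "example-model": {"input": 0.001, "output": 0.002},
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return an ESTIMATE in USD; this is not exact billing."""
    price = PRICE_TABLE[model]
    return (prompt_tokens * price["input"] + completion_tokens * price["output"]) / 1000


class TokenBucket:
    """Single-process token bucket; not thread-safe as written."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Guard:
    def __init__(self, bucket: TokenBucket, budget_usd: float,
                 on_event: Optional[Callable[[Dict[str, Any]], None]] = None) -> None:
        self.bucket = bucket
        self.budget_usd = budget_usd
        self.spent_usd = 0.0  # running estimate for the current budget window
        self.on_event = on_event or (lambda event: None)

    def call(self, model: str, prompt_tokens: int, max_completion_tokens: int,
             send: Callable[[], Tuple[Any, int]]) -> Any:
        """`send` performs the request and returns (response, completion_tokens)."""
        worst_case = estimate_cost_usd(model, prompt_tokens, max_completion_tokens)
        event = {"model": model, "est_cost_usd": worst_case}
        # Budget first, so a rejected call does not also burn a rate-limit token.
        if self.spent_usd + worst_case > self.budget_usd:
            self.on_event({**event, "decision": "budget_exceeded"})
            raise BudgetExceededError("estimated cost exceeds remaining budget")
        if not self.bucket.try_acquire():
            self.on_event({**event, "decision": "throttled"})
            raise RateLimitedError("client-side rate limit reached")
        self.on_event({**event, "decision": "allowed"})
        response, completion_tokens = send()
        # Reconcile: charge the reported usage rather than the worst case.
        self.spent_usd += estimate_cost_usd(model, prompt_tokens, completion_tokens)
        return response
```

The sketch checks the budget against the worst case (prompt tokens plus the completion cap) so a call cannot overshoot, then charges the reported usage afterwards. Checking the budget before the limiter is one possible short-circuit order; the prompt leaves that order to the design overview.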
Use case
Use when you must stay under provider rate limits and a spending budget across many API calls, and want enforcement before the request fires rather than after the bill arrives.
When to use this
In production callers with bursty or high-volume API traffic, or any time cost control matters. Not for low-volume scripts.
Follow-up prompts
- Add a per-tenant budget so multiple users share one provider key safely.
- Generate an alerting hook (Slack/email/webhook) that fires at 80 percent of budget.
- Add a fallback strategy that switches to a cheaper model when the budget is nearly exhausted.
Source: promptfork seed
License: CC-BY-4.0
Published: 6/22/2026