Production AI Guide

AI Cost Modeling: Tokens, Model Selection, and Budget Control

Token costs are predictable and controllable — but only if you model them before you ship. This guide covers how to estimate costs for common AI patterns, choose the right model for each task, use prompt caching effectively, and set budget controls that prevent surprises in production.

Last reviewed: May 26 2026


How Token Pricing Works

Most AI APIs charge separately for input tokens (what you send) and output tokens (what the model returns). Output tokens are typically more expensive — on Claude, roughly 5× the input rate. This matters for how you design prompts.

A rough token rule of thumb: 1 token ≈ 4 characters of English text. A 500-word system prompt is around 700 tokens. A typical code file with 200 lines is around 1,000–1,500 tokens depending on line length and language.
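The rule of thumb above can be turned into a quick estimator for planning purposes. This is a minimal sketch: `estimateTokens` and `CHARS_PER_TOKEN` are illustrative names, not part of any SDK, and the 4-characters-per-token ratio is only the heuristic stated above. For billing-grade numbers, rely on the token counts the API reports.

```typescript
// Heuristic from the text: roughly 4 characters of English per token.
// Real tokenizers vary with language and content, so treat this as a planning estimate.
const CHARS_PER_TOKEN = 4;

// Hypothetical helper: estimate the token count of a piece of text from its length.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// A 52-character instruction estimates to 13 tokens.
console.log(estimateTokens("Summarize the following document in three sentences.")); // 13
```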

Always-On Cost Drivers

Every request pays for: (1) your system prompt, (2) any conversation history you include, (3) the user's input. Output cost varies by how much the model writes. The system prompt and history are usually the largest cost drivers in a multi-turn product feature.
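To see why history dominates in multi-turn features, it may help to model how input tokens grow when the full conversation is resent on every turn. The sketch below assumes fixed per-turn sizes, which real conversations will not have; `TurnSizes`, `inputTokensForTurn`, and `totalInputTokens` are hypothetical names used only for illustration.

```typescript
interface TurnSizes {
  systemPromptTokens: number;
  userTokensPerTurn: number;
  assistantTokensPerTurn: number;
}

// Input tokens billed on turn n (1-based) when the full history is resent each time.
function inputTokensForTurn(n: number, s: TurnSizes): number {
  const priorTurns = n - 1;
  const history = priorTurns * (s.userTokensPerTurn + s.assistantTokensPerTurn);
  return s.systemPromptTokens + history + s.userTokensPerTurn;
}

// Total input tokens billed across a conversation of `turns` turns.
function totalInputTokens(turns: number, s: TurnSizes): number {
  let total = 0;
  for (let n = 1; n <= turns; n++) total += inputTokensForTurn(n, s);
  return total;
}

const sizes: TurnSizes = { systemPromptTokens: 700, userTokensPerTurn: 50, assistantTokensPerTurn: 200 };
console.log(inputTokensForTurn(1, sizes));  // 750
console.log(inputTokensForTurn(10, sizes)); // 3000
console.log(totalInputTokens(10, sizes));   // 18750
```

Under these assumed sizes, the tenth turn sends four times the input of the first, which is why trimming or summarizing history is often the first optimization worth trying.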

Check current pricing at anthropic.com/pricing. Prices change; the patterns in this guide do not.


Choosing the Right Model

Claude's model family (as of mid-2026) has three tiers. Using the wrong tier is the most common source of avoidable AI cost.

Haiku — High volume, structured tasks

Use Haiku for tasks where the output is short, structured, or classification-based:

Sonnet — The production default

Sonnet is the right choice for most production AI features. Use it for:

Opus — Complex reasoning, research tasks

Opus costs significantly more than Sonnet. Justify it only when:

Test Before Committing

Before assuming you need Opus, run 20 representative examples through Sonnet and evaluate the output quality against criteria you define up front. Sonnet is often adequate for typical product use cases, and the cost difference makes the test worth doing.
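One way to make that comparison concrete is to score the collected outputs with a simple pass/fail grader. The sketch below works on outputs you have already gathered; `EvalResult`, `passRate`, and `isAcceptableSummary` are hypothetical names, and the grading rule is only a placeholder for whatever quality bar your feature needs.

```typescript
interface EvalResult {
  input: string;
  output: string; // what the model returned for this input
}

// Fraction of results the grader accepts, from 0 to 1.
function passRate(results: EvalResult[], grade: (r: EvalResult) => boolean): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => grade(r)).length;
  return passed / results.length;
}

// Placeholder grader: a summary passes if it is non-empty and at most 400 characters.
const isAcceptableSummary = (r: EvalResult): boolean =>
  r.output.trim().length > 0 && r.output.length <= 400;

const sonnetResults: EvalResult[] = [
  { input: "doc A", output: "Short summary of A." },
  { input: "doc B", output: "" },
  { input: "doc C", output: "Short summary of C." },
  { input: "doc D", output: "Short summary of D." },
];
console.log(passRate(sonnetResults, isAcceptableSummary)); // 0.75
```

Running the same examples and grader against each candidate model gives a like-for-like number to weigh against the price gap.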


Prompt Caching

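Claude's prompt caching lets you cache the beginning of your prompt across requests. If your system prompt is 2,000 tokens and you make 1,000 requests per hour, you pay the cache-write rate for those 2,000 tokens on the first request and a deeply discounted cache-read rate on the requests that follow, for as long as traffic keeps the cache warm. Only after a gap longer than the cache lifetime (5 minutes by default) do you pay for another write.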

When Caching Makes Sense

How to Enable It

// Mark the stable portion of your system prompt for caching
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: yourLongSystemPrompt,
      cache_control: { type: 'ephemeral' }   // cache this block
    }
  ],
  messages: [{ role: 'user', content: userMessage }]
});

The ephemeral cache lasts 5 minutes by default, and the timer is refreshed each time the cached content is read, so a steady stream of requests keeps it alive. For longer-lived caches, check the current API documentation; cache TTL options have expanded over time.

Cache Write Cost

Cache writes are billed at a premium over the normal input rate, and cache reads at a steep discount. Caching starts paying off once the same cached prefix is reused within the TTL window, which means roughly 2+ calls in total. For high-traffic features, the savings are substantial.
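The break-even reasoning can be checked with a few lines of arithmetic. The sketch below assumes a cache write costs 1.25× the base input rate and a cache read 0.1×; treat both multipliers and the $3.00 rate as assumptions to verify against current pricing. The function names are illustrative.

```typescript
// Assumed multipliers relative to the base input rate; verify against current pricing.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Cost in USD of sending the same prefix `calls` times without caching.
function uncachedPrefixCost(prefixTokens: number, calls: number, inputRatePer1M: number): number {
  return (prefixTokens / 1_000_000) * inputRatePer1M * calls;
}

// Cost with caching: one write, then reads, assuming every later call hits the cache.
function cachedPrefixCost(prefixTokens: number, calls: number, inputRatePer1M: number): number {
  if (calls <= 0) return 0;
  const perCallBase = (prefixTokens / 1_000_000) * inputRatePer1M;
  return perCallBase * (CACHE_WRITE_MULTIPLIER + (calls - 1) * CACHE_READ_MULTIPLIER);
}

// 2,000-token prefix, 1,000 calls, $3.00 per million input tokens (example rate)
console.log(uncachedPrefixCost(2000, 1000, 3.0).toFixed(2)); // 6.00
console.log(cachedPrefixCost(2000, 1000, 3.0).toFixed(2));   // 0.61
```

With these assumed multipliers, a single call is slightly more expensive with caching than without, and the second call already tips the balance in caching's favor.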


Controlling Output Length

Output tokens cost more than input. The simplest control is max_tokens — it caps how much the model can write in a single response.

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 512,   // hard cap — model stops here even mid-sentence
  messages: [...]
});

Set max_tokens to a value that fits your use case, not the maximum allowed. For a feature that generates a one-paragraph summary, 256 tokens is plenty. Leaving it at 8192 means an accidental long output can cost up to 32× more (8192 / 256) than it needs to.
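A quick way to reason about a max_tokens setting is to price the worst case, where the model writes all the way to the cap. The helper below is a hypothetical sketch, and the $15.00 output rate is the example figure used elsewhere in this guide, not a quoted price.

```typescript
// Worst-case output cost in USD for one call that runs all the way to max_tokens.
function worstCaseOutputCost(maxTokens: number, outputRatePer1M: number): number {
  return (maxTokens / 1_000_000) * outputRatePer1M;
}

const exampleOutputRate = 15.0; // USD per million output tokens (example rate; verify)

console.log(worstCaseOutputCost(256, exampleOutputRate).toFixed(5));  // 0.00384
console.log(worstCaseOutputCost(8192, exampleOutputRate).toFixed(5)); // 0.12288
```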

Prompt-Level Length Control

You can also constrain output length in the prompt itself:

// In your system prompt:
"Respond in 2-3 sentences. Do not include explanations or caveats."

// Or for structured output:
"Return JSON only. No prose before or after the JSON object."

Combine both: use max_tokens as the hard ceiling, and use prompt instructions to guide the model toward appropriate length within that ceiling.
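When the hard ceiling does cut a response off, it is worth detecting so you can tune the prompt or the cap. The Messages API response carries a `stop_reason` field, and a value of `"max_tokens"` indicates the output hit the limit. The sketch below uses a reduced response shape (`MessageLike`) and a hypothetical helper name for illustration.

```typescript
// Reduced shape of the response fields used here; the real response object has more.
interface MessageLike {
  stop_reason: string | null;
}

// True when the model stopped because it ran into the max_tokens ceiling.
function wasTruncated(response: MessageLike): boolean {
  return response.stop_reason === "max_tokens";
}

// Stubbed responses standing in for real API results.
console.log(wasTruncated({ stop_reason: "end_turn" }));   // false
console.log(wasTruncated({ stop_reason: "max_tokens" })); // true
```

Counting truncations per feature shows whether the ceiling is too tight or the prompt-level length guidance is being ignored.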


Estimating Costs Before You Ship

Before a feature goes to production, estimate the cost per call and the daily cost at expected volume.

// Rough cost estimation formula (verify current rates at anthropic.com/pricing)
const inputTokens = systemPromptTokens + avgHistoryTokens + avgUserInputTokens;
const outputTokens = avgResponseTokens;

// Example rates (Sonnet, mid-2026 — always verify)
const inputCostPer1M = 3.00;    // USD per million input tokens
const outputCostPer1M = 15.00;  // USD per million output tokens

const costPerCall =
  (inputTokens / 1_000_000) * inputCostPer1M +
  (outputTokens / 1_000_000) * outputCostPer1M;

const dailyCostAt1000Calls = costPerCall * 1000;

console.log(`Cost per call: $${costPerCall.toFixed(5)}`);
console.log(`Daily cost at 1,000 calls: $${dailyCostAt1000Calls.toFixed(2)}`);

Run this estimate against three scenarios: p50 (typical), p95 (busy), and p99 (spike). AI cost surprises almost always come from unexpected input length growth — a user uploads a 10,000-word document when you designed for 500-word inputs.
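The three-scenario comparison could be sketched as follows. The token counts are invented placeholders to be replaced with your own measurements, the rates are the example figures from this guide, and `Scenario`, `Rates`, and `scenarioCost` are illustrative names.

```typescript
interface Scenario {
  name: string;
  inputTokens: number;
  outputTokens: number;
}

interface Rates {
  inputPer1M: number;  // USD per million input tokens
  outputPer1M: number; // USD per million output tokens
}

// Cost in USD of a single call for the given scenario and rates.
function scenarioCost(s: Scenario, r: Rates): number {
  return (s.inputTokens / 1_000_000) * r.inputPer1M +
         (s.outputTokens / 1_000_000) * r.outputPer1M;
}

const exampleRates: Rates = { inputPer1M: 3.0, outputPer1M: 15.0 }; // example rates; verify

// Placeholder sizes; substitute percentiles measured from your own traffic.
const scenarios: Scenario[] = [
  { name: "p50", inputTokens: 1_500, outputTokens: 300 },
  { name: "p95", inputTokens: 6_000, outputTokens: 600 },
  { name: "p99", inputTokens: 20_000, outputTokens: 1_000 },
];

for (const s of scenarios) {
  console.log(`${s.name}: $${scenarioCost(s, exampleRates).toFixed(5)} per call`);
}
```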


Monitoring and Budget Alerts

Track Usage Per Feature

Log token counts from the usage field of every API response. Aggregate by feature, user tier, or endpoint. Without this, you can't diagnose cost regressions — a prompt change that adds 200 tokens to every request compounds invisibly.

const response = await anthropic.messages.create({ ... });

// Log usage for every call
logger.info('ai_usage', {
  feature: 'document_summary',
  model: response.model,
  input_tokens: response.usage.input_tokens,
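  cache_read_input_tokens: response.usage.cache_read_input_tokens ?? 0,
  cache_creation_input_tokens: response.usage.cache_creation_input_tokens ?? 0,  // cache writes are billed separately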
  output_tokens: response.usage.output_tokens,
  user_id: ctx.userId,
});
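Once usage is logged per call, rolling it up by feature is a small reduction over the records. This sketch assumes the log entries have been loaded into memory with the field names used above; `UsageRecord`, `FeatureTotals`, and `aggregateByFeature` are hypothetical names, and in practice the aggregation would more likely live in your logging or analytics pipeline.

```typescript
interface UsageRecord {
  feature: string;
  input_tokens: number;
  cache_read_input_tokens: number;
  output_tokens: number;
}

interface FeatureTotals {
  calls: number;
  inputTokens: number;
  cacheReadTokens: number;
  outputTokens: number;
}

// Sum token counts per feature so cost regressions show up feature by feature.
function aggregateByFeature(records: UsageRecord[]): Record<string, FeatureTotals> {
  const totals: Record<string, FeatureTotals> = {};
  for (const r of records) {
    const t = totals[r.feature] ??
      { calls: 0, inputTokens: 0, cacheReadTokens: 0, outputTokens: 0 };
    t.calls += 1;
    t.inputTokens += r.input_tokens;
    t.cacheReadTokens += r.cache_read_input_tokens;
    t.outputTokens += r.output_tokens;
    totals[r.feature] = t;
  }
  return totals;
}

// Invented sample records for illustration.
const sampleRecords: UsageRecord[] = [
  { feature: "document_summary", input_tokens: 1200, cache_read_input_tokens: 0, output_tokens: 300 },
  { feature: "document_summary", input_tokens: 200, cache_read_input_tokens: 1000, output_tokens: 280 },
  { feature: "chat", input_tokens: 500, cache_read_input_tokens: 0, output_tokens: 150 },
];
console.log(aggregateByFeature(sampleRecords));
```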

Set Spending Limits

Anthropic's API console lets you set monthly spending limits and receive email alerts. Set a hard limit at 2× your expected monthly spend and an alert at 80% of expected. This catches runaway loops, prompt injection attacks that inflate context, or simply faster-than-expected growth.
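The same thresholds can be mirrored inside your own application as a second line of defense, for example to disable non-essential AI features before the console limit trips. The sketch below is an application-side check only and does not call any Anthropic API; `budgetStatus` and its status names are hypothetical.

```typescript
type BudgetStatus = "ok" | "alert" | "hard_limit";

// Thresholds from the text: alert at 80% of expected spend, hard limit at 2x expected.
function budgetStatus(monthToDateSpendUsd: number, expectedMonthlySpendUsd: number): BudgetStatus {
  if (monthToDateSpendUsd >= 2 * expectedMonthlySpendUsd) return "hard_limit";
  if (monthToDateSpendUsd >= 0.8 * expectedMonthlySpendUsd) return "alert";
  return "ok";
}

console.log(budgetStatus(100, 500));  // ok
console.log(budgetStatus(450, 500));  // alert
console.log(budgetStatus(1200, 500)); // hard_limit
```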

Per-User Rate Limits

If your product gives users access to AI features, implement per-user or per-tier rate limits at the application layer. Without them, a single power user (or attacker) can exhaust your budget.

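// Simple per-user rate limiter using Redis (fixed one-hour windows)
const FREE_TIER_HOURLY_LIMIT = 20; // example value; tune per tier

// Index of the current hour since the Unix epoch, so each hour gets its own counter key
const hourSlot = (): number => Math.floor(Date.now() / 3_600_000);

// `redis` is assumed to be a connected client exposing incr/expire (for example, ioredis)
async function checkAiRateLimit(userId: string): Promise<boolean> {
  const key = `ai_calls:${userId}:${hourSlot()}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, 3600); // first call in the window sets the TTL
  return count <= FREE_TIER_HOURLY_LIMIT;
}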

Cost Model Checklist

Related Guides

Claude API for Developers

System prompt design, streaming, tool use, prompt caching deep-dive, and production patterns.

AI Evals in Production

Quality gates, regression testing, observability metrics, and incident playbooks for AI features.
