AI Cost Modeling: Tokens, Model Selection, and Budget Control

How Token Pricing Works

Most AI APIs charge separately for input tokens (what you send) and output tokens (what the model returns). Output tokens are typically more expensive; on Claude, roughly 5× the input rate. This matters for prompt design: token for token, trimming a verbose response saves more than trimming the prompt.

A rough token rule of thumb: 1 token ≈ 4 characters of English text. A 500-word system prompt is around 700 tokens. A typical code file with 200 lines is around 1,000–1,500 tokens depending on line length and language.
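
The rule of thumb above can be turned into a quick estimator for planning. This is a rough sketch only: `estimateTokens` is a hypothetical helper, and the 4-characters-per-token ratio is the heuristic from the text, not a real tokenizer. Actual counts vary with language and content, so rely on the usage numbers the API reports for anything billing-related.

```typescript
// Heuristic from the text: roughly 4 characters of English per token.
const CHARS_PER_TOKEN = 4;

// Hypothetical helper: a rough estimate for planning, not for billing.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

const draftPrompt = "Summarize the following support ticket in two sentences.";
console.log(`~${estimateTokens(draftPrompt)} tokens`);
```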

Always-On Cost Drivers

Every request pays for: (1) your system prompt, (2) any conversation history you include, (3) the user's input. Output cost varies by how much the model writes. The system prompt and history are usually the largest cost drivers in a multi-turn product feature.
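
To see why history tends to dominate, it may help to model a conversation where every turn re-sends the system prompt plus all earlier exchanges. The sketch below is illustrative: the function names and the token sizes in the example are assumptions, not measurements.

```typescript
// Input tokens billed on a given turn (1-based) when the full history is re-sent.
function inputTokensOnTurn(
  systemTokens: number,
  tokensPerPriorExchange: number, // one earlier user message plus its reply
  userTokens: number,
  turn: number,
): number {
  const historyTokens = (turn - 1) * tokensPerPriorExchange;
  return systemTokens + historyTokens + userTokens;
}

// Total input tokens billed across a whole conversation.
function totalInputTokens(
  systemTokens: number,
  tokensPerPriorExchange: number,
  userTokens: number,
  turns: number,
): number {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    total += inputTokensOnTurn(systemTokens, tokensPerPriorExchange, userTokens, turn);
  }
  return total;
}

// Assumed sizes: 700-token system prompt, 400 tokens per exchange, 100-token user input.
console.log(inputTokensOnTurn(700, 400, 100, 1));  // first turn
console.log(inputTokensOnTurn(700, 400, 100, 10)); // tenth turn
console.log(totalInputTokens(700, 400, 100, 10));  // whole conversation
```

Per-turn input grows linearly with conversation length, so the total across a conversation grows roughly with the square of the number of turns unless history is trimmed or summarized.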

Check current pricing at anthropic.com/pricing. Prices change; the patterns in this guide do not.

Choosing the Right Model

For routine cost planning, Claude's model family (as of Jun 2026) has three everyday tiers: Haiku, Sonnet, and Opus. Fable 5 is a higher-capability option when available, but its availability has changed since launch, so verify current status before routing production traffic to it. Using the wrong tier is the most common source of avoidable AI cost.

Haiku — High volume, structured tasks

Use Haiku for tasks where the output is short, structured, or classification-based:

Classifying user intent into a fixed set of categories
Extracting structured fields from short text (entity extraction)
Generating short labels, tags, or summaries
Moderation checks — is this content acceptable?
Autocomplete suggestions where speed matters more than depth

Sonnet — The production default

Sonnet is the right choice for most production AI features. Use it for:

Drafting and editing prose, emails, or documentation
Code generation where quality matters
Multi-step reasoning over moderate-length inputs
Question answering with context retrieval (RAG)
Most user-facing features where quality directly affects product experience

Opus — Complex reasoning, research tasks

Opus costs significantly more than Sonnet. Justify it only when:

The task requires deep multi-step reasoning over long, complex inputs
Output quality directly translates to high-value outcomes (e.g., contract analysis, security audits)
You've confirmed Sonnet produces noticeably worse results for this specific task
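
One way to keep tier choices explicit is a small routing table in code. The sketch below is a hypothetical example: the task names, the `pickTier` helper, and the `sonnetFellShortInEval` flag are all assumptions made for illustration, and the mapping simply mirrors the three lists above.

```typescript
type Tier = "haiku" | "sonnet" | "opus";

// Hypothetical task categories, mirroring the lists above.
type TaskKind =
  | "classification" | "extraction" | "tagging" | "moderation" | "autocomplete"
  | "drafting" | "codegen" | "rag_qa"
  | "deep_analysis";

const DEFAULT_TIER: Record<TaskKind, Tier> = {
  classification: "haiku",
  extraction: "haiku",
  tagging: "haiku",
  moderation: "haiku",
  autocomplete: "haiku",
  drafting: "sonnet",
  codegen: "sonnet",
  rag_qa: "sonnet",
  deep_analysis: "opus",
};

// Opus is returned only when an evaluation has shown Sonnet falling short.
function pickTier(task: TaskKind, sonnetFellShortInEval = false): Tier {
  const tier = DEFAULT_TIER[task];
  if (tier === "opus" && !sonnetFellShortInEval) return "sonnet";
  return tier;
}

console.log(pickTier("moderation"));
console.log(pickTier("deep_analysis"));
console.log(pickTier("deep_analysis", true));
```

Keeping the table in one place makes a tier change a one-line diff that can be reviewed alongside its cost impact.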

Test Before Committing

Before assuming you need Opus, run 20 representative examples through Sonnet and evaluate the output quality. Most product teams find Sonnet adequate for their use case. The cost difference makes the test worth doing.

Prompt Caching

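Claude's prompt caching lets you cache the beginning of your prompt across requests. If your system prompt is 2,000 tokens and you make 1,000 requests per hour, you pay the cache-write rate for those 2,000 tokens on the first request and a deeply discounted cache-read rate on the requests that follow. At that request rate each read refreshes the 5-minute lifetime, so the cache stays warm and the prefix rarely needs to be rewritten.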

When Caching Makes Sense

Long system prompts that repeat across every request (instructions, persona, context)
Documents or codebases injected into every call in a RAG-like pattern
Multi-turn conversations where earlier turns are stable

How to Enable It

// Mark the stable portion of your system prompt for caching
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: yourLongSystemPrompt,
      cache_control: { type: 'ephemeral' }   // cache this block
    }
  ],
  messages: [{ role: 'user', content: userMessage }]
});

The ephemeral cache lasts 5 minutes by default, and the timer is refreshed each time the cached content is read. For longer-lived caches, check the current API documentation; cache TTL options have expanded over time.

Cache Write Cost

Caching has a write cost (higher than a normal input token read) and a read cost (much lower). The break-even point is roughly when you make the same cached call 2+ times within the TTL window. For high-traffic features, the savings are substantial.
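
The break-even arithmetic can be sketched as follows. The two multipliers are illustrative assumptions (a write premium of 1.25× and a read rate of 0.1× the base input price); check the pricing page for the actual values. The sketch also assumes every call after the first lands within the TTL, so only the first call pays the write cost.

```typescript
// ASSUMED multipliers relative to the base input price; verify before relying on them.
const CACHE_WRITE_MULTIPLIER = 1.25; // first request writes the cache
const CACHE_READ_MULTIPLIER = 0.1;   // later requests within the TTL read it

// Cost of the cached prefix over `calls` requests, measured in units of
// "one uncached read of the prefix".
function prefixCostWithCache(calls: number): number {
  if (calls <= 0) return 0;
  return CACHE_WRITE_MULTIPLIER + (calls - 1) * CACHE_READ_MULTIPLIER;
}

// Without caching, every call pays the full input price for the prefix.
function prefixCostWithoutCache(calls: number): number {
  return Math.max(calls, 0);
}

for (const calls of [1, 2, 10]) {
  const cached = prefixCostWithCache(calls).toFixed(2);
  const uncached = prefixCostWithoutCache(calls).toFixed(2);
  console.log(`${calls} call(s): cached ${cached} vs uncached ${uncached}`);
}
```

Under these assumptions a single call is slightly more expensive with caching, and the second call within the window already puts caching ahead.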

Controlling Output Length

Output tokens cost more than input. The simplest control is max_tokens — it caps how much the model can write in a single response.

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 512,   // hard cap — model stops here even mid-sentence
  messages: [...]
});

Set max_tokens to a value that fits your use case, not the maximum allowed. For a feature that generates a one-paragraph summary, 256 tokens is plenty. Leaving it at 8192 means an accidental long output can cost up to 32× more than it needs to.
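
The worst-case output cost of a single call is bounded by the cap, which makes the effect of the setting easy to compute. A minimal sketch, assuming the example output rate of $15 per million tokens used elsewhere in this guide; `maxOutputCostUsd` is a hypothetical helper.

```typescript
// Upper bound on output cost for one call, given the cap and the output rate.
function maxOutputCostUsd(maxTokens: number, outputUsdPer1M: number): number {
  return (maxTokens / 1_000_000) * outputUsdPer1M;
}

const EXAMPLE_OUTPUT_USD_PER_1M = 15.0; // example rate; always verify

for (const cap of [256, 512, 8192]) {
  const worstCase = maxOutputCostUsd(cap, EXAMPLE_OUTPUT_USD_PER_1M);
  console.log(`max_tokens=${cap}: at most $${worstCase.toFixed(5)} of output per call`);
}
```

The ratio between two caps is just the ratio of the limits, independent of the rate.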

Prompt-Level Length Control

You can also constrain output length in the prompt itself:

// In your system prompt:
"Respond in 2-3 sentences. Do not include explanations or caveats."

// Or for structured output:
"Return JSON only. No prose before or after the JSON object."

Combine both: use max_tokens as the hard ceiling, and use prompt instructions to guide the model toward appropriate length within that ceiling.
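
When the hard ceiling does cut a response short, the API reports it through the response's stop_reason field, which is set to "max_tokens". The sketch below shows one way to guard against using a truncated reply. `MessageLike` is a locally declared subset of the response shape so the example is self-contained, and the helper names are hypothetical.

```typescript
// Minimal subset of the response fields used here, declared locally.
interface MessageLike {
  stop_reason: string | null;
  content: Array<{ type: string; text?: string }>;
}

// True when the model hit the max_tokens ceiling before finishing.
function wasTruncated(response: MessageLike): boolean {
  return response.stop_reason === "max_tokens";
}

// Returns parsed JSON, or null when the reply was cut off or is not valid JSON.
function parseJsonReply(response: MessageLike): unknown {
  if (wasTruncated(response)) return null;
  const text = response.content
    .filter((block) => block.type === "text")
    .map((block) => block.text ?? "")
    .join("");
  try {
    return JSON.parse(text);
  } catch {
    return null;
  }
}
```

Logging how often `wasTruncated` returns true is a cheap signal that either the cap is too low or the prompt's length guidance is not being followed.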

Estimating Costs Before You Ship

Before a feature goes to production, estimate the cost per call and the daily cost at expected volume.

// Rough cost estimation formula (verify current rates at anthropic.com/pricing)
const inputTokens = systemPromptTokens + avgHistoryTokens + avgUserInputTokens;
const outputTokens = avgResponseTokens;

// Example rates (Sonnet, mid-2026 — always verify)
const inputCostPer1M = 3.00;    // USD per million input tokens
const outputCostPer1M = 15.00;  // USD per million output tokens

const costPerCall =
  (inputTokens / 1_000_000) * inputCostPer1M +
  (outputTokens / 1_000_000) * outputCostPer1M;

const dailyCostAt1000Calls = costPerCall * 1000;

console.log(`Cost per call: $${costPerCall.toFixed(5)}`);
console.log(`Daily cost at 1,000 calls: $${dailyCostAt1000Calls.toFixed(2)}`);

Run this estimate against three scenarios: p50 (typical), p95 (busy), and p99 (spike). AI cost surprises almost always come from unexpected input length growth — a user uploads a 10,000-word document when you designed for 500-word inputs.
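
The three scenarios can be derived from a sample of observed or expected input sizes. The sketch below uses a nearest-rank percentile and the example Sonnet rates from the formula above. The sample data, the average output size, and the helper names are all hypothetical.

```typescript
// Example rates from this guide (USD per million tokens); always verify.
const INPUT_USD_PER_1M = 3.0;
const OUTPUT_USD_PER_1M = 15.0;

function costPerCallUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * INPUT_USD_PER_1M +
         (outputTokens / 1_000_000) * OUTPUT_USD_PER_1M;
}

// Nearest-rank percentile over a non-empty sample.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(Math.max(rank, 1), sorted.length) - 1];
}

// Hypothetical input sizes in tokens, including one very large upload.
const observedInputTokens = [300, 400, 500, 600, 800, 1000, 1500, 2500, 6000, 13000];
const assumedAvgOutputTokens = 300;

for (const p of [50, 95, 99]) {
  const inputTokens = percentile(observedInputTokens, p);
  const cost = costPerCallUsd(inputTokens, assumedAvgOutputTokens);
  console.log(`p${p}: ${inputTokens} input tokens, $${cost.toFixed(5)} per call`);
}
```

With a small sample the upper percentiles collapse onto the largest observation, which is a reasonable reminder to collect more data before trusting a tail estimate.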

Monitoring and Budget Alerts

Track Usage Per Feature

Log token counts from the usage field of every API response. Aggregate by feature, user tier, or endpoint. Without this, you can't diagnose cost regressions — a prompt change that adds 200 tokens to every request compounds invisibly.

const response = await anthropic.messages.create({ ... });

// Log usage for every call
logger.info('ai_usage', {
  feature: 'document_summary',
  model: response.model,
  input_tokens: response.usage.input_tokens,
  cache_creation_input_tokens: response.usage.cache_creation_input_tokens ?? 0,
  cache_read_input_tokens: response.usage.cache_read_input_tokens ?? 0,
  output_tokens: response.usage.output_tokens,
  user_id: ctx.userId,
});
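
Once usage is logged, aggregation can be as simple as summing records by feature. This is a minimal in-memory sketch with hypothetical names (`UsageRecord`, `totalsByFeature`); in practice the same grouping would more likely run as a query in your logging or analytics store.

```typescript
interface UsageRecord {
  feature: string;
  input_tokens: number;
  output_tokens: number;
}

interface FeatureTotals {
  calls: number;
  input_tokens: number;
  output_tokens: number;
}

// Sum token usage per feature so regressions show up as a change in the averages.
function totalsByFeature(records: UsageRecord[]): Map<string, FeatureTotals> {
  const totals = new Map<string, FeatureTotals>();
  for (const record of records) {
    const current = totals.get(record.feature) ?? { calls: 0, input_tokens: 0, output_tokens: 0 };
    current.calls += 1;
    current.input_tokens += record.input_tokens;
    current.output_tokens += record.output_tokens;
    totals.set(record.feature, current);
  }
  return totals;
}

// Average input tokens per call for one feature, or 0 if it has no records.
function avgInputTokens(totals: Map<string, FeatureTotals>, feature: string): number {
  const t = totals.get(feature);
  return t && t.calls > 0 ? t.input_tokens / t.calls : 0;
}
```

Comparing `avgInputTokens` for a feature before and after a prompt change makes a 200-token regression visible immediately.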

Set Spending Limits

Anthropic's API console lets you set monthly spending limits and receive email alerts. Set a hard limit at 2× your expected monthly spend and an alert at 80% of expected. This catches runaway loops, prompt injection attacks that inflate context, or simply faster-than-expected growth.
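
The same thresholds can be mirrored in your own dashboards or billing jobs. The sketch below is an application-side check with hypothetical names; it does not call or replace the console's limits, and it assumes you already track month-to-date spend yourself.

```typescript
type BudgetStatus = "ok" | "alert" | "hard_limit";

const ALERT_FRACTION = 0.8;      // alert at 80% of expected monthly spend
const HARD_LIMIT_MULTIPLIER = 2; // hard limit at 2x expected monthly spend

// Classify month-to-date spend against the expected monthly budget.
function budgetStatus(spendUsd: number, expectedMonthlyUsd: number): BudgetStatus {
  if (spendUsd >= HARD_LIMIT_MULTIPLIER * expectedMonthlyUsd) return "hard_limit";
  if (spendUsd >= ALERT_FRACTION * expectedMonthlyUsd) return "alert";
  return "ok";
}

console.log(budgetStatus(100, 1000));
console.log(budgetStatus(850, 1000));
console.log(budgetStatus(2100, 1000));
```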

Per-User Rate Limits

If your product gives users access to AI features, implement per-user or per-tier rate limits at the application layer. Without them, a single power user (or attacker) can exhaust your budget.

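// Simple per-user rate limiter using Redis (fixed one-hour window).
// Assumes `redis` is an already-connected client; the two definitions below
// are illustrative placeholders for values your app would supply.
const FREE_TIER_HOURLY_LIMIT = 20; // example value, tune per tier
const hourSlot = (): number => Math.floor(Date.now() / 3_600_000); // hours since epoch

async function checkAiRateLimit(userId: string): Promise<boolean> {
  const key = `ai_calls:${userId}:${hourSlot()}`;
  const count = await redis.incr(key);
  // First call in this window: let the counter expire after an hour
  if (count === 1) await redis.expire(key, 3600);
  return count <= FREE_TIER_HOURLY_LIMIT;
}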

Cost Model Checklist

Model selection — Haiku for classification and short structured output; Sonnet for most production features; Opus only when you've confirmed Sonnet falls short.
Prompt caching — cache your system prompt if it's long and repeated across requests.
max_tokens — set it to the task ceiling, not the API maximum.
Pre-ship estimate — cost per call × expected daily volume, at p50, p95, and p99 input sizes.
Log usage — input and output tokens per call, per feature. Aggregate to detect regressions.
Spending alert — set one in the API console before launch, not after your first surprise bill.
