How Token Pricing Works
Most AI APIs charge separately for input tokens (what you send) and output tokens (what the model returns). Output tokens are typically more expensive — on Claude, roughly 5× the input rate. This matters for how you design prompts.
A rough token rule of thumb: 1 token ≈ 4 characters of English text. A 500-word system prompt is around 700 tokens. A typical code file with 200 lines is around 1,000–1,500 tokens depending on line length and language.
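The rule of thumb above can be turned into a quick budgeting helper. This is a minimal sketch: `estimateTokens` and `CHARS_PER_TOKEN` are hypothetical names, and the result is only an approximation. Real tokenizers vary with language and content, so use the API's token counting for anything billing-sensitive.

```typescript
// Rough heuristic only: about 4 characters of English text per token.
// Suitable for back-of-envelope budgeting, not for billing.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// A 400-character string is estimated at 100 tokens.
console.log(estimateTokens("a".repeat(400)));
```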
Every request pays for: (1) your system prompt, (2) any conversation history you include, (3) the user's input. Output cost varies by how much the model writes. The system prompt and history are usually the largest cost drivers in a multi-turn product feature.
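To see why history dominates in multi-turn features, it helps to add up the input tokens billed over a whole conversation. The sketch below assumes every request resends the system prompt plus all earlier turns, with no caching or truncation; `totalInputTokens` and its parameters are hypothetical names for illustration.

```typescript
// Total input tokens billed across a conversation when each request
// resends the system prompt and the full history of earlier turns.
function totalInputTokens(
  systemTokens: number,
  userTokensPerTurn: number,
  assistantTokensPerTurn: number,
  turns: number,
): number {
  let total = 0;
  let history = 0; // tokens from earlier turns carried into the next request
  for (let turn = 0; turn < turns; turn++) {
    total += systemTokens + history + userTokensPerTurn;
    history += userTokensPerTurn + assistantTokensPerTurn;
  }
  return total;
}

// 700-token system prompt, 100-token user turns, 300-token replies, 10 turns
console.log(totalInputTokens(700, 100, 300, 10)); // 26000
```

The first request costs 800 input tokens, but ten turns cost 26,000 in total rather than 8,000, because the resent history grows with every turn.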
Check current pricing at anthropic.com/pricing. Prices change; the patterns in this guide do not.
Choosing the Right Model
Claude's model family (as of mid-2026) has three tiers. Using the wrong tier is the most common source of avoidable AI cost.
Haiku — High volume, structured tasks
Use Haiku for tasks where the output is short, structured, or classification-based:
- Classifying user intent into a fixed set of categories
- Extracting structured fields from short text (entity extraction)
- Generating short labels, tags, or summaries
- Moderation checks — is this content acceptable?
- Autocomplete suggestions where speed matters more than depth
Sonnet — The production default
Sonnet is the right choice for most production AI features. Use it for:
- Drafting and editing prose, emails, or documentation
- Code generation where quality matters
- Multi-step reasoning over moderate-length inputs
- Question answering with context retrieval (RAG)
- Most user-facing features where quality directly affects product experience
Opus — Complex reasoning, research tasks
Opus costs significantly more than Sonnet. Justify it only when:
- The task requires deep multi-step reasoning over long, complex inputs
- Output quality directly translates to high-value outcomes (e.g., contract analysis, security audits)
- You've confirmed Sonnet produces noticeably worse results for this specific task
Before assuming you need Opus, run 20 representative examples through Sonnet and evaluate the output quality. Most product teams find Sonnet adequate for their use case. The cost difference makes the test worth doing.
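One way to keep tier choices deliberate is to encode them in a single lookup instead of scattering model names through the codebase. The sketch below is an illustration of the guidance above, not a prescribed mapping: the task names, `pickTier`, and the `sonnetFellShortInEval` flag are all hypothetical, and it returns tier names rather than concrete model IDs.

```typescript
type Tier = "haiku" | "sonnet" | "opus";

type TaskKind =
  | "classification"
  | "extraction"
  | "short_label"
  | "moderation"
  | "autocomplete"
  | "drafting"
  | "code_generation"
  | "rag_qa"
  | "deep_analysis";

// Default tier per task, following the lists above.
const DEFAULT_TIER: Record<TaskKind, Tier> = {
  classification: "haiku",
  extraction: "haiku",
  short_label: "haiku",
  moderation: "haiku",
  autocomplete: "haiku",
  drafting: "sonnet",
  code_generation: "sonnet",
  rag_qa: "sonnet",
  deep_analysis: "sonnet", // start on Sonnet; escalate only after evaluation
};

// Opus is chosen only for deep analysis, and only once an evaluation on
// representative examples has shown that Sonnet falls short.
function pickTier(task: TaskKind, sonnetFellShortInEval = false): Tier {
  if (task === "deep_analysis" && sonnetFellShortInEval) return "opus";
  return DEFAULT_TIER[task];
}

console.log(pickTier("moderation"), pickTier("deep_analysis", true));
```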
Prompt Caching
Claude's prompt caching lets you cache the beginning of your prompt across requests. If your system prompt is 2,000 tokens and you make 1,000 requests per hour, the first request pays a cache-write rate (somewhat above the normal input price) for those 2,000 tokens, and every later request that arrives while the cache is still warm pays a deeply discounted cache-read rate instead.
When Caching Makes Sense
- Long system prompts that repeat across every request (instructions, persona, context)
- Documents or codebases injected into every call in a RAG-like pattern
- Multi-turn conversations where earlier turns are stable
How to Enable It
// Mark the stable portion of your system prompt for caching
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: yourLongSystemPrompt,
      cache_control: { type: 'ephemeral' } // cache this block
    }
  ],
  messages: [{ role: 'user', content: userMessage }]
});
The ephemeral cache lasts 5 minutes, and the timer is refreshed each time the cached content is read, so steady traffic keeps it warm. For longer-lived caches, check the current API documentation, since cache TTL options have expanded over time.
Caching has a write cost (higher than a normal input token read) and a read cost (much lower). The break-even point is roughly when you make the same cached call 2+ times within the TTL window. For high-traffic features, the savings are substantial.
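The break-even claim can be checked with a little arithmetic. The sketch below makes several assumptions, so verify each against the pricing page: a Sonnet-style input rate of $3.00 per million tokens, a cache-write multiplier of 1.25×, a cache-read multiplier of 0.1×, and a cache that stays warm for every call after the first. `prefixCost` is a hypothetical helper that prices only the cached prefix, not the rest of the request.

```typescript
// Assumed rates and multipliers; verify before relying on them.
const INPUT_RATE_PER_1M = 3.0; // USD per million input tokens
const CACHE_WRITE_MULTIPLIER = 1.25; // assumed premium for writing the cache
const CACHE_READ_MULTIPLIER = 0.1; // assumed discount for reading it

// Cost of sending the same prompt prefix `calls` times.
function prefixCost(prefixTokens: number, calls: number, cached: boolean): number {
  const base = (prefixTokens / 1_000_000) * INPUT_RATE_PER_1M;
  if (!cached || calls === 0) return base * calls;
  // One cache write, then cache reads for the remaining calls.
  return base * CACHE_WRITE_MULTIPLIER + base * CACHE_READ_MULTIPLIER * (calls - 1);
}

for (const calls of [1, 2, 1000]) {
  const plain = prefixCost(2000, calls, false).toFixed(4);
  const withCache = prefixCost(2000, calls, true).toFixed(4);
  console.log(`${calls} calls: uncached $${plain}, cached $${withCache}`);
}
```

Under these assumptions a single call is slightly more expensive with caching, two calls are already cheaper, and at 1,000 calls the cached prefix costs roughly a tenth of the uncached one.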
Controlling Output Length
Output tokens cost more than input. The simplest control is max_tokens — it caps how much the model can write in a single response.
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 512, // hard cap: the model stops here, even mid-sentence
  messages: [...]
});
Set max_tokens to a value that fits your use case, not the maximum allowed. For a feature that generates a one-paragraph summary, 256 tokens is plenty. Leaving it at 8192 means an accidental long output can cost up to 32× what a 256-token cap would allow.
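To make the ceiling concrete, you can price the worst case for a given cap. This is a small sketch that assumes an example output rate of $15.00 per million tokens (verify the current rate); `maxOutputCost` is a hypothetical helper.

```typescript
// Assumed example rate; verify against current pricing.
const OUTPUT_RATE_PER_1M = 15.0; // USD per million output tokens

// Worst-case output cost of one call that runs all the way to the cap.
function maxOutputCost(maxTokens: number): number {
  return (maxTokens / 1_000_000) * OUTPUT_RATE_PER_1M;
}

for (const cap of [256, 512, 8192]) {
  console.log(`max_tokens=${cap}: up to $${maxOutputCost(cap).toFixed(5)} per call`);
}
```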
Prompt-Level Length Control
You can also constrain output length in the prompt itself:
// In your system prompt:
"Respond in 2-3 sentences. Do not include explanations or caveats."
// Or for structured output:
"Return JSON only. No prose before or after the JSON object."
Combine both: use max_tokens as the hard ceiling, and use prompt instructions to guide the model toward appropriate length within that ceiling.
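A hard ceiling can silently cut responses short, so it is worth checking how often that happens. The Messages API reports a `stop_reason` on each response, and the value `max_tokens` indicates the cap was hit. The sketch below uses a minimal structural type instead of the SDK's own types; `MessageLike`, `hitLengthCap`, and `truncationRate` are hypothetical names.

```typescript
// Minimal structural type covering only the field this sketch reads.
interface MessageLike {
  stop_reason: string | null;
}

// True when the response ended because it reached max_tokens.
function hitLengthCap(response: MessageLike): boolean {
  return response.stop_reason === "max_tokens";
}

// Fraction of responses that were cut off by the cap (0 when there are none).
function truncationRate(responses: MessageLike[]): number {
  if (responses.length === 0) return 0;
  return responses.filter(hitLengthCap).length / responses.length;
}

const sample: MessageLike[] = [
  { stop_reason: "end_turn" },
  { stop_reason: "max_tokens" },
  { stop_reason: "end_turn" },
  { stop_reason: "end_turn" },
];
console.log(truncationRate(sample)); // 0.25
```

A truncation rate that climbs after a prompt change suggests the prompt instructions are no longer keeping output within the ceiling.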
Estimating Costs Before You Ship
Before a feature goes to production, estimate the cost per call and the daily cost at expected volume.
// Rough cost estimation formula (verify current rates at anthropic.com/pricing)
const inputTokens = systemPromptTokens + avgHistoryTokens + avgUserInputTokens;
const outputTokens = avgResponseTokens;
// Example rates (Sonnet, mid-2026 — always verify)
const inputCostPer1M = 3.00; // USD per million input tokens
const outputCostPer1M = 15.00; // USD per million output tokens
const costPerCall =
  (inputTokens / 1_000_000) * inputCostPer1M +
  (outputTokens / 1_000_000) * outputCostPer1M;
const dailyCostAt1000Calls = costPerCall * 1000;
console.log(`Cost per call: $${costPerCall.toFixed(5)}`);
console.log(`Daily cost at 1,000 calls: $${dailyCostAt1000Calls.toFixed(2)}`);
Run this estimate against three scenarios: p50 (typical), p95 (busy), and p99 (spike). AI cost surprises almost always come from unexpected input length growth — a user uploads a 10,000-word document when you designed for 500-word inputs.
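The three scenarios can be derived from a sample of observed input sizes. The sketch below uses a nearest-rank percentile and the example Sonnet rates from the formula above (verify them before use); `percentile`, `costPerCall`, and `scenarioCosts` are hypothetical helpers, and output length is held at its average for simplicity.

```typescript
// Example rates; always verify against current pricing.
const INPUT_RATE_PER_1M = 3.0; // USD per million input tokens
const OUTPUT_RATE_PER_1M = 15.0; // USD per million output tokens

// Nearest-rank percentile of a list of samples (p between 0 and 100).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function costPerCall(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * INPUT_RATE_PER_1M +
    (outputTokens / 1_000_000) * OUTPUT_RATE_PER_1M
  );
}

// Cost per call at typical, busy, and spike input sizes.
function scenarioCosts(inputTokenSamples: number[], avgOutputTokens: number) {
  return {
    p50: costPerCall(percentile(inputTokenSamples, 50), avgOutputTokens),
    p95: costPerCall(percentile(inputTokenSamples, 95), avgOutputTokens),
    p99: costPerCall(percentile(inputTokenSamples, 99), avgOutputTokens),
  };
}

// Mostly small inputs, plus one very large upload
const observed = [700, 800, 750, 900, 820, 760, 780, 810, 790, 14000];
console.log(scenarioCosts(observed, 300));
```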
Monitoring and Budget Alerts
Track Usage Per Feature
Log token counts from the usage field of every API response. Aggregate by feature, user tier, or endpoint. Without this, you can't diagnose cost regressions — a prompt change that adds 200 tokens to every request compounds invisibly.
const response = await anthropic.messages.create({ ... });
// Log usage for every call
logger.info('ai_usage', {
  feature: 'document_summary',
  model: response.model,
  input_tokens: response.usage.input_tokens,
  // Cache writes and reads are billed at different rates, so log them separately
  cache_creation_input_tokens: response.usage.cache_creation_input_tokens ?? 0,
  cache_read_input_tokens: response.usage.cache_read_input_tokens ?? 0,
  output_tokens: response.usage.output_tokens,
  user_id: ctx.userId,
});
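Once usage is logged, aggregating it by feature is what makes regressions visible. The sketch below works on plain records shaped like the log entry above; `UsageRecord`, `FeatureTotals`, and `aggregateByFeature` are hypothetical names, and a real pipeline would more likely run this as a query in your logging or analytics store.

```typescript
interface UsageRecord {
  feature: string;
  input_tokens: number;
  output_tokens: number;
}

interface FeatureTotals {
  calls: number;
  inputTokens: number;
  outputTokens: number;
}

// Sum calls and token counts per feature.
function aggregateByFeature(records: UsageRecord[]): Map<string, FeatureTotals> {
  const totals = new Map<string, FeatureTotals>();
  for (const record of records) {
    const current = totals.get(record.feature) ?? { calls: 0, inputTokens: 0, outputTokens: 0 };
    current.calls += 1;
    current.inputTokens += record.input_tokens;
    current.outputTokens += record.output_tokens;
    totals.set(record.feature, current);
  }
  return totals;
}

const logs: UsageRecord[] = [
  { feature: "document_summary", input_tokens: 1200, output_tokens: 200 },
  { feature: "document_summary", input_tokens: 1400, output_tokens: 250 },
  { feature: "intent_classifier", input_tokens: 300, output_tokens: 10 },
];
console.log(aggregateByFeature(logs));
```

Dividing `inputTokens` by `calls` per feature gives an average that can be compared release to release, which is how a 200-token prompt increase would show up.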
Set Spending Limits
Anthropic's API console lets you set monthly spending limits and receive email alerts. Set a hard limit at 2× your expected monthly spend and an alert at 80% of expected. This catches runaway loops, prompt injection attacks that inflate context, or simply faster-than-expected growth.
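If you also track spend in your own dashboards, the same thresholds can be mirrored in application code. This is a minimal sketch of the 80% alert and 2× hard-limit rule described above; `budgetStatus` is a hypothetical helper and does not replace the limits configured in the console.

```typescript
type BudgetStatus = "ok" | "alert" | "over_limit";

// Classify month-to-date spend against the expected monthly spend:
// alert at 80% of expected, hard limit at 2x expected.
function budgetStatus(spendToDate: number, expectedMonthly: number): BudgetStatus {
  if (spendToDate >= 2 * expectedMonthly) return "over_limit";
  if (spendToDate >= 0.8 * expectedMonthly) return "alert";
  return "ok";
}

console.log(budgetStatus(500, 1000), budgetStatus(850, 1000), budgetStatus(2100, 1000));
```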
Per-User Rate Limits
If your product gives users access to AI features, implement per-user or per-tier rate limits at the application layer. Without them, a single power user (or attacker) can exhaust your budget.
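// Simple per-user rate limiter using Redis.
// Assumes `redis` is an already-connected client with ioredis-style incr/expire.
const FREE_TIER_HOURLY_LIMIT = 20; // hypothetical value; tune per tier

// Index of the current one-hour window since the Unix epoch
const hourSlot = (): number => Math.floor(Date.now() / 3_600_000);

async function checkAiRateLimit(userId: string): Promise<boolean> {
  const key = `ai_calls:${userId}:${hourSlot()}`;
  const count = await redis.incr(key);
  // First call in this window: let the counter expire after an hour
  if (count === 1) await redis.expire(key, 3600);
  return count <= FREE_TIER_HOURLY_LIMIT;
}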
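For unit tests or local development without Redis, the same fixed-window idea can be sketched in memory. `FixedWindowLimiter` is a hypothetical class, not part of any library; it keeps counts in a `Map`, never evicts old windows, and only works within a single process, so it is not a production substitute for the Redis version.

```typescript
// In-memory fixed-window rate limiter (single process only).
class FixedWindowLimiter {
  private readonly counts = new Map<string, number>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number = 3_600_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Records one call and returns true while the user is within the limit.
  // `now` is injectable so tests can control time.
  allow(userId: string, now: number = Date.now()): boolean {
    const slot = Math.floor(now / this.windowMs);
    const key = `${userId}:${slot}`;
    const count = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, count); // note: old windows are never evicted
    return count <= this.limit;
  }
}

const limiter = new FixedWindowLimiter(2);
console.log(limiter.allow("demo", 0), limiter.allow("demo", 0), limiter.allow("demo", 0));
```

The third call in the same window is rejected, and the count starts again once the next hour-long window begins.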
Cost Model Checklist
- Model selection — Haiku for classification and short structured output; Sonnet for most production features; Opus only when you've confirmed Sonnet falls short.
- Prompt caching — cache your system prompt if it's long and repeated across requests.
- max_tokens — set it to the task ceiling, not the API maximum.
- Pre-ship estimate — cost per call × expected daily volume, at p50, p95, and p99 input sizes.
- Log usage — input and output tokens per call, per feature. Aggregate to detect regressions.
- Spending alert — set one in the API console before launch, not after your first surprise bill.
Related Guides
Claude API for Developers
System prompt design, streaming, tool use, prompt caching deep-dive, and production patterns.
AI Evals in Production
Quality gates, regression testing, observability metrics, and incident playbooks for AI features.