AI Cost · 6 min read

Why your LLM cost model is wrong — and how to fix it before your next pricing review.

Hybrid FinOps Editorial Apr 14, 2026
Abstract visualization of a large language model generating responses across a terminal interface
TL;DR

LLM API prices fell roughly 80% from early 2025 to early 2026, and the cost of equivalent GPT-4 performance has dropped a thousand-fold in three years. But token consumption per completed task has climbed faster than price has fallen. Most teams still forecast cost in tokens. They should forecast in completed tasks, price against task value, and model reasoning, tool-use, and cache behavior explicitly.

Key takeaways
  • The right unit is a completed task, not a token.
  • Reasoning models produce unbounded thinking tokens billed as output; old forecasts underprice them 3–10x.
  • Prompt caching saves 50–90% on stable prefixes; most teams overestimate their hit rate.
  • Batch endpoints and off-peak scheduling routinely cut costs 50% for asynchronous workloads.
  • A $20/month subscription can consume $18–$25 of inference on heavy-reasoning workloads; unit pricing must account for that distribution, not the mean.

The cost curve you can no longer trust

Three numbers explain why every LLM cost model built before 2025 is obsolete.

First, prices collapsed. Andreessen Horowitz's LLMflation tracking shows that model capability which cost roughly $60 per million tokens in late 2021 now costs as little as $0.06–$0.40. According to Price Per Token, major-model input pricing fell about 80% in the twelve months ending early 2026.

Second, consumption exploded. Reasoning models, tool use, and retrieval-heavy prompts multiplied tokens-per-task by 10–100x for the workloads that benefit most from them. The cost of the model dropped; the appetite of the application did not.

Third, the denominator moved. A forecast in tokens per user fails when one user's session contains a 50-step agentic loop and another's contains a one-shot completion. The token is a supplier-side unit. Your P&L is not a supplier-side document.
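To make the arithmetic concrete, here is a sketch with purely illustrative numbers (neither the rates nor the token counts come from any provider's price sheet): an 80% price cut is swamped by a 20x increase in tokens per task.

```python
# Illustrative numbers only: price per token falls 80%, but an agentic
# rewrite multiplies tokens per completed task by 20x.
old_price_per_mtok = 10.00      # $/M tokens (hypothetical blended rate)
new_price_per_mtok = 2.00       # 80% cheaper
old_tokens_per_task = 5_000     # one-shot completion
new_tokens_per_task = 100_000   # multi-step agentic loop with tool calls

old_cost = old_price_per_mtok * old_tokens_per_task / 1_000_000   # $0.05
new_cost = new_price_per_mtok * new_tokens_per_task / 1_000_000   # $0.20

print(f"cost per task: ${old_cost:.2f} -> ${new_cost:.2f}")  # 4x more expensive per task
```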

The task-level unit and why it survives

The unit you need is a completed task whose value you can name. A task unit survives contact with production when three things hold:

  • It has an observable cost distribution: a p50, a p95, and a long tail.
  • It has a business value that can be stated in dollars, even crudely.
  • It maps onto a customer-facing unit — a seat, a request, a subscription tier — that a CFO can reconcile against revenue.

The modeling discipline is to instrument your product such that every inference call is tagged with the task it contributed to, then aggregate cost per task and attach a value hypothesis. This is the LLM-era equivalent of the query-level ledger for data platforms — same pattern, different substrate.

Developer screen showing code and terminal output used for cost instrumentation
Instrument every inference call with task ID, caller, and model before you forecast anything.
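A minimal sketch of that instrumentation, assuming a simple in-process log; the record schema, the RATES table, and the task and caller names are placeholders, not any particular provider's billing fields.

```python
from collections import defaultdict

# Hypothetical per-model rates in $ per million tokens; real rates vary by provider.
RATES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

calls = []  # every inference call logs one record

def log_call(task_id: str, caller: str, model: str, input_tokens: int, output_tokens: int):
    """Tag each call with the task it contributed to, plus caller and model."""
    rate = RATES[model]
    cost = (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
    calls.append({"task_id": task_id, "caller": caller, "model": model, "cost": cost})

def cost_per_task() -> dict:
    """Aggregate call-level cost up to the task level."""
    totals = defaultdict(float)
    for c in calls:
        totals[c["task_id"]] += c["cost"]
    return dict(totals)

# One task may span many calls: retrieval, tool use, final answer.
log_call("ticket-1042", "triage-agent", "gpt-4o-mini", 12_000, 800)
log_call("ticket-1042", "drafting-agent", "gpt-4o-mini", 30_000, 2_500)
print(cost_per_task())
```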

Reasoning models: where forecasts fall apart

Reasoning models emit internal thinking tokens that bill like output tokens and can run for thousands of tokens on hard problems. Two behaviors break old forecasts.

One: unbounded output per call. A traditional chat call had a bounded output (user asks, model answers, user reads). A reasoning call can think for as long as the problem is hard. There is no stable input-to-output ratio, which means every forecast that multiplied an input-token estimate by a fixed ratio is wrong.

Two: adversarial distribution. The hardest queries are the ones that get routed to the reasoning tier, and the hardest queries produce the longest, most expensive thinking traces. Costs concentrate at the p95 and p99; a blended average hides the tail. Introl's cost-per-million-tokens analysis shows per-task costs can vary 10x between easy and hard prompts in the same feature.

Model the distribution, not the mean. A pricing tier that breaks even on the mean loses money on the top decile, which is exactly where the most engaged customers live.
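A sketch of what "model the distribution" means in practice, using Python's standard library and a hypothetical week of logged per-task costs:

```python
import statistics

# Hypothetical per-task costs (dollars) logged for one feature; note the long tail.
task_costs = [0.02, 0.03, 0.03, 0.05, 0.08, 0.12, 0.40, 0.75, 1.90, 3.10]

cuts = statistics.quantiles(task_costs, n=100)
mean, p50, p95 = statistics.mean(task_costs), cuts[49], cuts[94]
print(f"mean ${mean:.2f}  p50 ${p50:.2f}  p95 ${p95:.2f}")
# mean $0.65  p50 $0.10  p95 $2.44 -- a tier priced to break even on the
# mean pays roughly 4x that price on the p95 task, which is exactly where
# the most engaged users live.
```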

Prompt caching: real savings, imagined hit rates

Prompt caching is the largest single optimization available to most teams. When a prompt has a stable leading context reused across many calls — a system prompt, a document, a tool schema — providers can cache the key/value state and charge a small fraction for cached segments on subsequent calls. Savings of 50–90% on cached tokens are realistic.

The failure mode is subtle. A single dynamic field at the front of the prompt — a timestamp, a request ID, a user-specific greeting — invalidates the cache for that call. Teams that insert such fields early often believe they have a cache when they do not. The audit is mechanical: log the cache-hit field the provider returns and plot its distribution. You will find the gap between believed and actual hit rate is large.

Cache design should be an explicit architectural concern: order the prompt with the most-stable context first, push all dynamic fields to the end, and measure.
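A sketch of the audit and the prompt ordering. The cached-token field name varies by provider (for example, OpenAI reports usage.prompt_tokens_details.cached_tokens and Anthropic reports usage.cache_read_input_tokens), so the flat cached_tokens key below stands in for whatever your call logs capture.

```python
def cache_hit_rate(call_logs: list[dict]) -> float:
    """Fraction of prompt tokens actually served from the provider's cache."""
    prompt_tokens = sum(c["prompt_tokens"] for c in call_logs)
    cached_tokens = sum(c.get("cached_tokens", 0) for c in call_logs)
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0

def build_prompt(system_prompt: str, tool_schemas: str, user_query: str, request_id: str) -> str:
    # Most-stable context first so the cached prefix survives across calls;
    # anything per-request (IDs, timestamps, the user's query) goes last.
    return "\n\n".join([system_prompt, tool_schemas, user_query, f"request: {request_id}"])

# Plot cache_hit_rate over time; the gap between believed and measured is the finding.
```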

Batch endpoints and the 90% discount most teams ignore

For any non-user-facing workload — overnight summarization, bulk scoring, training-data synthesis, backfills — batch endpoints and off-peak scheduling offer 50–90% savings over synchronous inference. The engineering cost of using them is small. The reason they are underused is organizational: the team that owns the workload does not own the bill, and the team that owns the bill has no authority to force the scheduling change.

FinOps earns its keep here by naming batch-eligible workloads during design review, not during quarterly optimization. A workload designed as synchronous at inception is three times harder to move to batch than one designed batch-first.
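As one concrete shape of the change, here is a sketch of moving an overnight summarization backfill onto OpenAI's Batch API (other providers offer equivalents). The document set and model choice are placeholders, and the request fields are abbreviated, so check the current provider docs before relying on them.

```python
import json
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()

# Hypothetical backfill: overnight summarization of a small document set.
documents = {"doc-001": "first document text", "doc-002": "second document text"}

# One request per JSONL line, keyed by custom_id so results can be joined back.
with open("nightly_requests.jsonl", "w") as f:
    for doc_id, text in documents.items():
        f.write(json.dumps({
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": f"Summarize:\n{text}"}],
            },
        }) + "\n")

batch_file = client.files.create(file=open("nightly_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",   # results within 24 hours, at the batch discount
)
print(batch.id)  # poll later with client.batches.retrieve(batch.id)
```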

The pricing review

With a task unit, a distributional cost model, and a caching/batch baseline, a pricing review has concrete inputs:

  1. Define each customer-facing feature in terms of tasks completed per subscription tier.
  2. Model p50 and p95 cost-per-task with caching and batch applied where architecturally feasible.
  3. Attach a gross-margin target per tier.
  4. Stress-test by shifting the task-difficulty distribution toward the power user; confirm the tier still meets margin (a numeric sketch follows this list).
  5. Price the tier accordingly; if the math does not work, redesign the feature — do not hope cost falls further.
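The stress test in step 4 is plain arithmetic once cost-per-task is instrumented; a minimal sketch with hypothetical tier numbers:

```python
# Hypothetical tier: $30/month, a typical user completes 40 tasks per month.
price_per_month = 30.00
tasks_per_month = 40
margin_target = 0.70                     # 70% gross-margin target

cost_per_task_p50 = 0.08                 # from the instrumented distribution
cost_per_task_p95 = 0.45

def gross_margin(cost_per_task: float) -> float:
    inference_cost = cost_per_task * tasks_per_month
    return (price_per_month - inference_cost) / price_per_month

print(f"margin at p50 cost: {gross_margin(cost_per_task_p50):.0%}")   # ~89%, meets target
print(f"margin at p95 cost: {gross_margin(cost_per_task_p95):.0%}")   # 40%, misses target
# If the heavy-user case misses the target, redesign the feature (cheaper
# model tier, caching, batch) or reprice; do not hope cost falls further.
```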

Cost will keep falling. It has not stopped yet. But pricing a product on expected future cost reductions has a name, and the name is "hoping." A task-level model lets you make defensible pricing decisions on today's cost curve and reprice later when the distribution actually moves.

Frequently asked questions

What's wrong with per-token cost modeling?

It uses a supplier-side unit. Your product's behavior and your customer's subscription are not priced in tokens; they are priced in outcomes. Forecast in outcomes.

How do reasoning models break cost forecasts?

They emit unbounded billed thinking tokens per call, and the hardest user queries route to them. Costs concentrate at the p95/p99; a mean-based forecast underprices heavy users by 3–10x.

How should you price user-facing LLM features?

Price against a task, model p50 and p95 cost, apply caching and batch where architecturally real, and set the price to hit a gross-margin target at the heavy-user distribution — not the mean.

Does prompt caching actually save money?

Yes, 50–90% on cached tokens when the prompt has a stable prefix. The common failure mode is a dynamic field at the front of the prompt that silently invalidates the cache. Audit the provider's cache-hit field directly.

Sources

  1. Andreessen Horowitz — LLMflation: LLM inference cost is going down fast
  2. LLM API Pricing 2026 — comparison across 300+ models
  3. Introl — Inference unit economics: the true cost per million tokens
  4. NVIDIA — LLM Inference Benchmarking
  5. Silicon Data — Understanding LLM Cost Per Token: A 2026 Practical Guide
  6. Featherless — LLM API Pricing Comparison 2026