The scary thing about LLM costs is not the size of the bill. It is the shape of it. A normal web app has a cost curve you can reason about: more users, more database rows, a bigger Postgres plan, fine. An LLM feature has a cost curve that bends on input you don't control, because a single user pasting a 40,000-token document into your prompt costs more than a thousand users sending one-line questions. I learned this on adatepe.dev the cheap way, by watching a logging line, not a billing alert, but the lesson stuck: with AI features, the unit of cost is the token, and you have to think about tokens the way you used to think about megabytes.
This is the playbook I actually run as a solo developer paying for these calls out of my own pocket. It is not about squeezing the last cent. It is about making the cost curve predictable so a feature can't quietly bankrupt itself.
Measure first, because intuition lies about tokens
Before optimising anything I added per-call cost logging, because I was guessing wrong about where the money went. I assumed the output tokens dominated. They didn't: my prompts were stuffed with context I'd pasted in during development and never trimmed, and the input side was the expensive half. You cannot find that by reasoning. You find it by logging.
Every model call on my stack emits one structured log line, the same one-line JSON pattern I use for everything else, with the token counts and a computed cost.
log({
  event: "llm.call",
  model,
  inputTokens: usage.promptTokenCount,
  outputTokens: usage.candidatesTokenCount,
  estCostUsd: estimateCost(model, usage),
});
With that in place I could filter logs by event:llm.* and actually see the distribution: which features were expensive, which inputs were outliers, where a retry loop was silently doubling spend. The single highest-leverage thing an indie developer can do for LLM cost optimization is make every call's cost visible. Everything after this section is only useful because I could measure whether it worked.
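The estimateCost helper in that log line is not shown. A minimal sketch of what such a helper could look like, assuming a hand-maintained price table: the PRICES_PER_MILLION name, the model keys, and the dollar figures are all hypothetical placeholders, not real provider rates.

```typescript
// Token counts, using the same field names as the log line above.
interface Usage {
  promptTokenCount: number;
  candidatesTokenCount: number;
}

// Hypothetical price table in USD per million tokens.
// These numbers are placeholders, NOT real provider rates: fill in your own.
const PRICES_PER_MILLION: Record<string, { input: number; output: number }> = {
  "cheap-model": { input: 0.5, output: 2 },
  "strong-model": { input: 5, output: 20 },
};

function estimateCost(model: string, usage: Usage): number {
  const price = PRICES_PER_MILLION[model];
  // Unknown model: fail loudly rather than log a misleading zero.
  if (!price) throw new Error(`no price configured for model ${model}`);
  return (
    (usage.promptTokenCount * price.input +
      usage.candidatesTokenCount * price.output) /
    1e6
  );
}
```

Input and output are priced separately on purpose, since that split is what exposes an input-heavy prompt as the expensive half.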
Route to the cheapest model that passes the eval
The biggest structural lever is not squeezing one model; it is not sending every request to your most expensive model in the first place. The price gap between a small fast model and a frontier model is large, often roughly an order of magnitude per token, and a huge share of real requests do not need the expensive one.
So I route. The rule is deliberately dumb, because dumb rules are debuggable:
- Short input, simple classification, structured extraction → cheap fast model.
- Long input, reasoning, anything user-facing and nuanced → stronger model.
- The cheap model returns low confidence → escalate to the stronger one and pay once for the retry.
That last point is the one that makes routing safe. You are not betting the whole request on the cheap model. You try cheap, and you only pay for the expensive model on the fraction of requests that genuinely need it. The classification work on my site runs almost entirely on the cheap tier; only the Proof Oracle's longer generations reach for the bigger model. The routing function lives in the same single server module that owns the provider client, so the whole policy is one file I can reason about.
function pickModel(input: string): string {
  // Character length is only a rough stand-in for token count.
  if (input.length < 2000) return "gemini-2.5-flash";
  return "gemini-2.5-pro";
}
I'd hedge on exact numbers, since providers reprice constantly and your mix is your own, but the principle is durable: default cheap, escalate on demand.
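The escalation rule from the list above is not part of pickModel. One way it could be sketched is below, assuming the model reply carries some confidence score; the Reply shape, the CallModel type, and the 0.7 threshold are hypothetical and would need tuning against a real eval.

```typescript
// Shape of a model reply in this sketch. `confidence` is an assumption:
// it could be a score the prompt asks the model to report.
interface Reply {
  text: string;
  confidence: number; // 0..1
}

// The provider call is injected so the policy stays testable.
type CallModel = (model: string, input: string) => Promise<Reply>;

const CHEAP = "gemini-2.5-flash";
const STRONG = "gemini-2.5-pro";

async function routeWithEscalation(
  input: string,
  call: CallModel,
  minConfidence = 0.7 // hypothetical threshold
): Promise<Reply & { model: string }> {
  // Long inputs skip the cheap tier entirely, mirroring the length rule.
  if (input.length >= 2000) {
    return { ...(await call(STRONG, input)), model: STRONG };
  }
  const cheap = await call(CHEAP, input);
  if (cheap.confidence >= minConfidence) return { ...cheap, model: CHEAP };
  // Low confidence: pay once for a single retry on the stronger model.
  return { ...(await call(STRONG, input)), model: STRONG };
}
```

Returning the model name alongside the reply keeps the per-call log honest about which tier actually answered.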
The structural difference is easy to see in the unrouted version:
// every request → the expensive frontier model
const out = await call("gemini-2.5-pro", input);
// a one-line classification and a 40k-token
// reasoning task pay the SAME per-token rate.
// the cheap 90% subsidises nothing, you just
// overpay on every short request, forever.
The gap between tiers is often ~10x per token. Routing means the boring 90% of requests stop paying frontier prices.
That gap is abstract until you put your own numbers on it: take your monthly request volume and compare sending everything to the frontier model with routing the boring 90% to a cheap one.
The exact per-token rates shift constantly, so treat any figures as illustrative, but the shape of that gap is real and durable. One dumb routing rule is usually the single biggest line-item win available to a solo developer.
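The comparison can be worked through as plain arithmetic. The sketch below assumes blended prices of 10 and 1 USD per million tokens (placeholders chosen only to match the roughly 10x gap) and, as a simplification, that every request uses the same number of tokens regardless of tier.

```typescript
interface Mix {
  requestsPerMonth: number;
  tokensPerRequest: number; // input + output, treated as one blended count
  cheapSharePct: number; // percent of requests routed to the cheap tier
}

// Illustrative blended prices in USD per million tokens (placeholders).
const FRONTIER_PER_MILLION = 10;
const CHEAP_PER_MILLION = 1;

function monthlyCost(mix: Mix): { allFrontier: number; routed: number } {
  const totalMillions = (mix.requestsPerMonth * mix.tokensPerRequest) / 1e6;
  const cheapMillions = (totalMillions * mix.cheapSharePct) / 100;
  const frontierMillions = totalMillions - cheapMillions;
  return {
    allFrontier: totalMillions * FRONTIER_PER_MILLION,
    routed:
      cheapMillions * CHEAP_PER_MILLION +
      frontierMillions * FRONTIER_PER_MILLION,
  };
}

const example = monthlyCost({
  requestsPerMonth: 100000,
  tokensPerRequest: 1000,
  cheapSharePct: 90,
});
```

With these placeholder numbers, 100,000 requests of 1,000 tokens come to 1,000 USD unrouted against 190 USD routed. In a real mix the escalated requests tend to be the long ones, so the saving in token terms will be smaller than the request split suggests.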
People ask me which lever to pull first, and the honest answer depends on the shape of their workload, so here is the decision I actually walk through before touching code:
Which cost lever should you pull first? Pick the one that fits your feature's shape, not the one that sounds clever. The tree starts with one question: are you logging token counts per call yet?
Whichever branch you land on, the others still matter eventually; the tree just tells you where the first dollar of savings is hiding for your specific feature.
Cache the things that don't change
A surprising amount of LLM spend is paying repeatedly for the same answer. Two kinds of caching matter, and they are not the same thing.
Response caching is just memoisation. If the input is identical and the task is deterministic enough, store the result keyed by a hash of the input and skip the model entirely on a repeat. For anything that gets the same input twice (a shared link, a re-render, a retry after a network blip), this turns a paid call into a free key lookup. It will not help a chatbot where every message is unique, but it shreds the cost of features with repeated inputs.
Prompt caching is the provider-side feature where a large, stable chunk of your prompt (a long system instruction, a fixed context document) is cached on their end and billed at a steep discount on subsequent calls. If you have a 3,000-token system prompt that prefixes every request, you are paying full price for those same 3,000 tokens every single call unless you use prompt caching. Restructuring the prompt so the stable part comes first and the variable part comes last is the small change that unlocks it. Order matters more than people expect.
The combination is the win: cache identical responses outright, and for everything else make sure the unchanging prefix is billed at the cached rate.
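Both ideas fit in a few lines. The sketch below is an illustration under assumptions, not a specific provider's API: Generate stands in for whatever function makes the model call, and the in-memory Map is keyed by the full prompt, where a persistent store would more likely key by a digest of it.

```typescript
// Stand-in for the function that actually calls the model.
type Generate = (prompt: string) => Promise<string>;

// Stable parts first, variable part last, so a provider-side prefix
// cache (where one is offered) can match the unchanged beginning.
function assemblePrompt(systemPrefix: string, userInput: string): string {
  return `${systemPrefix}\n\n${userInput}`;
}

// Exact-match response cache wrapped around any Generate function.
function withResponseCache(generate: Generate): Generate {
  const cache = new Map<string, string>();
  return async (prompt) => {
    const hit = cache.get(prompt);
    if (hit !== undefined) return hit; // repeat: no model call, no cost
    const fresh = await generate(prompt);
    cache.set(prompt, fresh);
    return fresh;
  };
}
```

Wrapping the call rather than editing it means the cache can be switched off per feature, which matters for output that should not be reused.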
Budget tokens like a resource, not an afterthought
The input you don't control is the one that hurts, so I cap it. Every feature has a hard token budget on the input, enforced in code before the call goes out.
- Truncate or reject oversized input. A user pasting a novel into a one-line tool gets truncated with a clear message, not a silent $2 request. The cap is a product decision, made on purpose, not an accident discovered on the invoice.
- Bound the output too. maxOutputTokens is a cost control as much as a UX one. An unbounded generation is an unbounded bill, and most features have a length past which more output is worse, not better.
- Kill retry storms. A naive retry-on-failure loop around a model call is a way to multiply your bill during an outage. Cap retries, add backoff, and remember that some providers return a "successful" response with an empty body; retrying that forever is paying to fail.
None of these are clever. They are the boring guardrails that keep the cost curve from bending the wrong way when an input you never imagined shows up.
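A minimal sketch of the input cap, assuming a rough rule of thumb of about four characters per token instead of a real tokenizer; the function names and that ratio are illustrative, and a production version would count tokens with the provider's own tooling.

```typescript
// Rough token estimate: about 4 characters per token is a common rule of
// thumb for English text. This is an assumption, not a real tokenizer.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

interface BudgetResult {
  text: string;
  truncated: boolean;
}

// Enforce a hard input budget before the call goes out.
function enforceInputBudget(text: string, maxInputTokens: number): BudgetResult {
  if (estimateTokens(text) <= maxInputTokens) {
    return { text, truncated: false };
  }
  // Cut at the character count that corresponds to the budget.
  return {
    text: text.slice(0, maxInputTokens * CHARS_PER_TOKEN),
    truncated: true,
  };
}
```

The truncated flag is what lets the UI show a clear message instead of silently dropping part of the input.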
The retry-storm one is the gotcha that actually bit me, and it is worth the extra paragraph because it does not look like a cost bug at all. One of my background enrichment jobs wrapped each model call in a retry-on-empty loop, the reasonable-sounding kind: if the response body is empty, try again. The provider, under load one evening, started returning a 200 with an empty candidates array instead of an error. My loop read that as "not done yet" and retried, and retried, with no cap, because I had only guarded against thrown exceptions and a "successful" empty response throws nothing. A single record burned roughly 40 calls before the job moved on, and across the backfill that was a few dollars for output that was, in the end, empty. The fix was three lines: a hard cap of 3 attempts, exponential backoff, and treating an empty body as a terminal failure rather than a transient one. The deeper lesson is that an empty success is the worst case for cost, because it satisfies no error handler while still costing full input tokens every time, and a 40,000-token prompt retried blindly is a 40,000-token charge per attempt. I now budget retries explicitly the same way I budget input tokens, with a number I chose on purpose. This is the same boundary discipline I lean on for keeping AI output trustworthy in TypeScript guardrails for AI-generated code: the model's response is untrusted input, and an empty or malformed one has to be a deliberate, bounded branch, not an accident you find on the invoice.
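The three-part fix described above (a cap of 3 attempts, exponential backoff, empty body as terminal) could be sketched as follows. The ModelResponse shape, the EmptyResponseError name, and the 500 ms base delay are assumptions for illustration, not the actual job code.

```typescript
interface ModelResponse {
  candidates: string[]; // stand-in for the provider's candidates array
}

class EmptyResponseError extends Error {}

async function callWithRetry(
  call: () => Promise<ModelResponse>,
  maxAttempts = 3,
  baseDelayMs = 500, // hypothetical starting delay
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<ModelResponse> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let res: ModelResponse;
    try {
      res = await call();
    } catch (err) {
      lastError = err;
      // Thrown errors are treated as transient: back off, then try again.
      if (attempt < maxAttempts) await sleep(baseDelayMs * 2 ** (attempt - 1));
      continue;
    }
    // A "successful" empty reply is terminal: retrying only pays to fail.
    if (res.candidates.length === 0) {
      throw new EmptyResponseError("empty candidates array");
    }
    return res;
  }
  throw lastError;
}
```

The empty-reply check sits outside the try block on purpose, so the terminal error is never caught and retried by the same loop.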
Watch it after you ship, because the inputs evolve
Cost control is not a one-time pass. The reason is that your inputs drift: users find new ways to use the feature, an input that was rare becomes common, a model reprices. The logging from the first section is what makes this ongoing rather than a single heroic optimisation. I keep a rough daily eye on aggregate spend per feature, and the thing I'm watching for is not the average, it's the tail. The p99 request is where the surprises live, because the average request is cheap and boring and the expensive outlier is the one that scales into a problem.
When a feature's tail cost starts creeping, I have the per-call logs to find the specific inputs driving it, and usually the fix is one of the levers above: a tighter budget, a routing tweak, a cache I hadn't added. The point is that I can act on data instead of panicking at a bill.
Between shipping BMW-scale systems and the cost-modelling habits I picked up during my M.Sc. CS at LMU, I have learned to treat the log line as the contract: if a call is not logged the way below, I cannot defend its spend. Here is the exact shape that earns every other lever its keep.
const res = await model.generate(prompt);
log({
  event: "llm.call",
  model: model.id,
  inputTokens: res.usage.input,
  outputTokens: res.usage.output,
});
One line of JSON per request. You cannot cut a bill you cannot see, and per-call token counts are what turn a scary total into a fixable line item.
That structured line is also what powers the watching I do daily: it is the same data, just read at two different time scales. The logs only earn their keep when something reads them, which is the next problem worth solving.
Separate the work the user is watching from the work they aren't
The lever I reach for once the obvious ones are in place is timing. Not every model call is sitting between a user and a spinner, and the calls that aren't have completely different cost economics that most people leave on the table. If a generation is interactive, with the user staring at the screen waiting, you pay for low latency whether you want to or not. But a surprising amount of LLM work in a real product is not interactive: enriching a record after the fact, summarising something for a digest, classifying a backlog, regenerating a cache overnight. That work has no human waiting on it, and treating it the same as an interactive request is overpaying for speed nobody benefits from.
So I split the two explicitly. Interactive calls take the fast, more expensive path because the UX demands it. Everything else gets deferred, queued and processed when it's convenient, in bigger chunks, on whatever cheaper tier the provider offers for non-urgent work. The mental model is the same one I use for side effects in a web request: if the user isn't watching the result, it has no business blocking, and it has no business paying the premium that "blocking" implies. A digest that goes out tomorrow morning does not need its summaries generated at interactive latency tonight.
The second-order win is that batching this deferred work also makes it easier to cap and reason about. When the non-urgent calls all flow through one queue, that queue is a single place to enforce a rate, apply a budget, and absorb a provider hiccup with a retry that doesn't touch anything a user can see. The interactive path stays lean and fast; the background path stays cheap and patient. Conflating them is how you end up paying interactive prices for work that could have waited, which is the same mistake as the unrouted model call, just hiding in the time dimension instead of the model-choice one.
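The split can be as small as one function that decides "now" or "later". The sketch below uses an in-memory queue purely for illustration; the DeferredQueue class, the Job shape, and the submit function are hypothetical names, and a real deployment would more likely sit on a durable queue.

```typescript
// Minimal in-memory queue for non-urgent model work. It only illustrates
// the single choke point; it does not survive a process restart.
class DeferredQueue<T> {
  private jobs: T[] = [];

  enqueue(job: T): void {
    this.jobs.push(job);
  }

  // Take up to `maxBatch` jobs for one processing pass, oldest first.
  takeBatch(maxBatch: number): T[] {
    return this.jobs.splice(0, maxBatch);
  }

  size(): number {
    return this.jobs.length;
  }
}

type Job = { kind: "summarise" | "classify"; recordId: string };

function submit(
  job: Job,
  interactive: boolean,
  queue: DeferredQueue<Job>
): "now" | "later" {
  if (interactive) return "now"; // the caller runs it on the fast path
  queue.enqueue(job); // nobody is waiting: defer it
  return "later";
}
```

Because every deferred call passes through takeBatch, the batch size doubles as a rate limit and a budget knob for the background path.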
The dashboard I actually look at
Logging the data is one thing. Looking at it is another, and for a long time I wasn't really doing the second part. The logs were there, I just had to remember to go grepping for them, which meant I only looked after something already felt wrong. So I built a small admin view into adatepe.dev, nothing fancy, just a single page behind auth that reads the same event:llm.* lines and renders two things: cost per feature over time, and a list of the worst recent requests with their token counts and computed cost. That's it. No funnels, no cohorts, no session replay.
I deliberately resisted bolting a real product-analytics tool onto this. Those are built to answer questions about people, and the question I have here is about money, specifically about one ugly request, not about an aggregate of thousands. Wiring up a heavy suite would have been a weekend I didn't have, plus another vendor and another bill, to get a fuzzier version of a number I can compute exactly from logs I'm already emitting.
The one number I check is not the average. The average is boringly cheap and stays that way, which is exactly why it lies to you. What I watch is the tail: the p99 cost per feature. The expensive outlier is the thing that scales into a problem. One feature can have a perfectly comfortable mean while a few requests (a pasted-in wall of context, a retry loop, a prompt that ballooned) cost ten or twenty times the typical call. Those are invisible in the average and obvious in the p99.
Seeing the tail early is the whole point. When the p99 on some feature starts creeping up, I can go find the offending requests in the worst-recent list and fix the cause before it compounds into a bill that surprises me at month end. The average tells me what already happened. The tail tells me what's about to.
In practice the whole investigation is a few grep commands against the same one-line JSON I emit, and it takes under a minute to go from a suspicious total to the exact request driving it.
A p99 of 0.18 against a mean nearer a tenth of a cent is the tail I keep talking about, and it is the single 42k-token paste, not the 18,000 cheap calls, that I go fix.
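The same tail can be computed in code rather than by grep. A minimal sketch, assuming each log line also carries a feature field (the log shape shown earlier does not include one) and using a nearest-rank percentile; the function names are illustrative.

```typescript
interface LlmLog {
  event: string;
  feature: string; // assumed extra field, not part of the log line shown earlier
  estCostUsd: number;
}

// Nearest-rank percentile over an unsorted list of numbers.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no values");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Group llm.* log lines by feature and report each feature's p99 cost.
function p99ByFeature(jsonLines: string[]): Map<string, number> {
  const costs = new Map<string, number[]>();
  for (const line of jsonLines) {
    const entry = JSON.parse(line) as LlmLog;
    if (!entry.event.startsWith("llm.")) continue;
    const list = costs.get(entry.feature) ?? [];
    list.push(entry.estCostUsd);
    costs.set(entry.feature, list);
  }
  const result = new Map<string, number>();
  costs.forEach((list, feature) => result.set(feature, percentile(list, 99)));
  return result;
}
```

Nearest-rank always returns a cost that actually occurred, so the p99 value can be matched straight back to a specific request in the logs.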
Run the same checklist every time, because surprises are repeats
The reason I keep a fixed checklist instead of trusting myself to remember is that the bills that scared me were never exotic. They were the same handful of mistakes wearing new clothes. So I run the same five items before every AI feature ships: log the tokens, cache the repeated prompts, route the easy requests cheap, set a hard budget alert, and trim the context to the lines that matter. It takes a few minutes and it has caught something almost every time.
If I had to name the one item that catches the most runaway spend, it is caching the repeated prompts. Most of the genuinely scary bills I have seen, mine and other people's, trace back to a single un-cached prompt fired inside a loop. A big stable system message that gets billed at full price on every iteration, multiplied by a backfill job or a retry storm, is how a feature quietly costs ten times what you budgeted. Cache that prefix once and the rest of the list becomes tuning rather than damage control. The checklist exists so I never have to discover that on an invoice again.
Caching is the biggest lever, and the easiest to skip
I keep coming back to caching because it is the lever with the best ratio of saved money to effort, and yet it is the one I see skipped most often, including by me early on. The reason it gets skipped is that it feels like premature optimisation: the feature works, the bill is small, why add complexity. Then the feature gets used the way features get used, and the same prompt starts firing again and again, and the small bill is not small anymore.
There are two flavours worth keeping straight. Response caching is plain memoisation: hash the input, store the output, and on a repeat you do a key lookup instead of a paid call. It is perfect for deterministic-enough tasks with inputs that recur (a shared link reopened, a re-render, a retry after a flaky network). Semantic caching goes further, matching inputs that are close in meaning rather than byte-identical, so two differently worded but equivalent questions hit the same cached answer. The tradeoff is real: you trade a cheap exact-match guarantee for an embedding lookup and a similarity threshold you have to tune. Set it too loose and you serve a confidently wrong cached answer to a question that was not actually the same. I reach for exact-match first and only add semantic matching where the input space is genuinely fuzzy and the cost of a near-miss is low.
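To make the threshold tradeoff concrete, here is a minimal sketch of a semantic lookup. It assumes the embeddings have already been computed elsewhere and are plain number arrays of equal length; the CachedAnswer shape and the function names are illustrative.

```typescript
interface CachedAnswer {
  embedding: number[];
  answer: string;
}

// Cosine similarity of two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the closest cached answer if it clears the threshold, else null.
function semanticLookup(
  query: number[],
  cache: CachedAnswer[],
  threshold: number
): string | null {
  let best: CachedAnswer | null = null;
  let bestScore = -Infinity;
  for (const entry of cache) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score > bestScore) {
      bestScore = score;
      best = entry;
    }
  }
  return best !== null && bestScore >= threshold ? best.answer : null;
}
```

The threshold is the whole risk surface: lowering it raises the hit rate and the chance of answering a question nobody asked.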
The classic surprise bill is almost always the same shape: an un-cached prompt sitting inside a loop. A big stable system message billed at full price on every iteration, multiplied by a backfill or a retry storm, is how a feature quietly costs ten times what you budgeted. If you are wiring this into a server layer, I walk through where it sits in building AI features in the App Router.
Caching is not always safe, though. Personalised output, where the answer depends on who is asking, must not leak across users through a shared key, so the user identity has to be part of the cache key or the cache has to be off. Time-sensitive output, anything that depends on the current state of the world, goes stale the moment you store it. For those, a short time-to-live or no cache at all is the honest choice. Caching is the biggest lever precisely because it is free money on repeated work, but only on the work that genuinely repeats.
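Both safety rules, user-scoped keys and a time-to-live, can be sketched in one small class. The ScopedTtlCache name and its shape are hypothetical; the clock is injectable only so that expiry can be exercised without waiting.

```typescript
interface Entry {
  value: string;
  expiresAt: number; // epoch milliseconds
}

class ScopedTtlCache {
  private store = new Map<string, Entry>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // The user id is part of the key, so one user's personalised answer
  // is never returned for another user's identical prompt.
  private key(userId: string, prompt: string): string {
    return JSON.stringify([userId, prompt]);
  }

  get(userId: string, prompt: string): string | undefined {
    const k = this.key(userId, prompt);
    const entry = this.store.get(k);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(k); // stale: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(userId: string, prompt: string, value: string): void {
    this.store.set(this.key(userId, prompt), {
      value,
      expiresAt: this.now() + this.ttlMs,
    });
  }
}
```

Building the key with JSON.stringify over a pair avoids the collision that plain string concatenation allows when a user id and a prompt share a boundary.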
The short version
Log every call's tokens and cost before you optimise anything. Route requests to the cheapest model that passes, and escalate only when needed. Cache identical responses and use provider prompt caching for stable prefixes. Cap input and output tokens as a deliberate product decision. Then keep watching the tail, because the inputs you don't control will keep evolving. For a solo developer, this is the difference between a feature that scales and one that becomes a liability. The features I've built this way are on /#projects, and I document the cost lessons as I hit them on /blog, usually right after something surprises me.
Put together, these levers are not separate tricks; they are stages on the path a single request actually walks before it ever reaches the model.
A request lands carrying input I do not control, which is exactly the part that can blow the budget. Before anything else I cap and trim it, because a 40,000-token paste into a one-line tool is a product decision I want to make on purpose, not discover on the invoice.
Each stage is a guardrail against an input you never imagined showing up.
Every lever here came from watching my own invoice as a student, so if you want to see the budgeting hold up in production, my work is right below.