When we launched dayBrain Volt internally, the first version worked. The quotes it generated were accurate, the tone was right, and the output was clean enough to send to clients without editing. We were pleased with it for about 48 hours.

Then we ran the numbers.

At roughly $1.00 per quote generated, the economics were not viable for a product we intended to scale. That was not a rounding error or a rogue API call. It was the actual cost of running Claude Sonnet 4 against our original prompt architecture. At even modest volume, that figure compounds into a serious problem fast.

This post is a detailed account of how we brought that cost down to $0.14 per quote — an 86% reduction — without sacrificing output quality. We will cover the model selection decision, the prompt re-engineering process, what we measured, what we changed, and the framework we now use when making these decisions on any LLM-powered product. If you are building with large language models and cost is a variable you need to control, this should be useful to you.

Why LLM Cost Compounds Faster Than You Expect

Most developers who have built a prototype with a capable model like GPT-4 or Claude Sonnet have experienced the same thing: it works brilliantly in testing, costs are negligible at low volume, and then somewhere between proof-of-concept and production, someone runs a spreadsheet and the room goes quiet.

LLM costs compound along three dimensions simultaneously. Token count per request determines your baseline. Request volume determines how many times you pay that baseline. And model tier determines the multiplier applied to both. When all three are left unoptimised, which is the default state of a first-pass implementation, you are paying the maximum possible price for every inference call you make.
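To make the three dimensions concrete, here is a minimal sketch of the cost model. The prices, token counts, and volume are placeholder assumptions chosen for easy arithmetic, not figures from our billing or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD per million tokens. The values used below are placeholders."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_request(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Token count sets the baseline; the model tier sets the multiplier."""
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


def monthly_cost(price: ModelPrice, input_tokens: int, output_tokens: int,
                 requests_per_month: int) -> float:
    """Request volume scales the per-request cost linearly."""
    return cost_per_request(price, input_tokens, output_tokens) * requests_per_month


# Hypothetical tiers and token counts, for illustration only.
higher_tier = ModelPrice(input_per_mtok=10.0, output_per_mtok=30.0)
lower_tier = ModelPrice(input_per_mtok=2.0, output_per_mtok=6.0)

for name, tier in [("higher tier", higher_tier), ("lower tier", lower_tier)]:
    total = monthly_cost(tier, input_tokens=4_000, output_tokens=1_000,
                         requests_per_month=10_000)
    print(f"{name}: ${total:,.2f} per month")
```

Because the three factors multiply, trimming any one of them lowers the whole bill, and trimming two at once compounds.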

For dayBrain Volt, the original architecture had all three problems. Our prompts were verbose, carrying context that the model did not need on every call. We were using Sonnet 4 because it was the model we had been testing with during development and it produced excellent output. And we had not yet stress-tested what volume looked like in practice.

The $1.00 figure was not surprising in retrospect. It was the predictable result of building for quality first and ignoring cost entirely. That is the right order to do things — you need to prove the product works before you optimise it — but you need to actually run the optimisation pass before you ship at scale.

The Difference Between Prototype Economics and Production Economics

This distinction matters more than most technical teams acknowledge. A prototype that costs $1.00 per operation can still be the right call: it proves the concept, it generates the demos, it wins internal buy-in. But prototype economics and production economics are different problems.

In production, you are not running 50 test calls a day. You are running thousands. The cost profile of your LLM architecture is not a footnote in a technical spec — it is a core element of your unit economics, and it needs to be treated as such from the moment you start planning for scale.

We see this pattern repeatedly when working on AI products through Daybrain Digital. Teams build something excellent, then discover that excellence at scale costs ten times what they budgeted. The fix is almost always available — it just requires a structured approach rather than a panic switch to the cheapest available model.

Understanding the Model Selection Trade-Off

The first instinct when facing high LLM costs is to switch to a cheaper model. That instinct is not wrong, but it is incomplete. Model selection is a trade-off decision, not a binary switch. The question is never simply 'which model is cheapest?' It is 'which model produces acceptable output at the lowest cost for this specific task?'

That distinction matters because different tasks have radically different capability requirements. A task that demands nuanced reasoning, complex multi-step logic, or highly calibrated tone might genuinely require a frontier model. A task that requires structured extraction, templated generation, or pattern-matching probably does not. Applying a frontier model to a structured generation task is like hiring a senior architect to measure rooms — technically capable, but wildly over-specified for the work.

Sonnet 4 vs Haiku 4.5: What We Were Actually Comparing

Claude Sonnet 4 and Claude Haiku 4.5 sit at different points on Anthropic's capability and cost spectrum. Sonnet 4 is a strong mid-tier model — capable of nuanced reasoning, extended context handling, and high-quality long-form generation. Haiku 4.5 is Anthropic's fast, cost-efficient model — significantly cheaper per token, lower latency, and optimised for tasks where speed and cost matter more than deep reasoning.

The pricing differential between these two models is substantial. Without quoting exact figures that will date this post, Haiku 4.5 sits at roughly one-third of Sonnet 4's per-token cost at the time of writing. That gap alone explains most of the cost reduction we achieved. But switching models without changing anything else would have degraded output quality, which is why the prompt re-engineering work was equally important.

The Accuracy Question

The legitimate concern when downgrading model tier is accuracy. For dayBrain Volt, accuracy means several things: the factual correctness of the quote content, the appropriateness of tone and register, the structural completeness of the output, and the consistency of formatting across generations.

Our testing showed that Haiku 4.5 with a well-engineered prompt consistently matched or exceeded Sonnet 4 with our original loose prompt on all four dimensions. That finding is not universally true: for genuinely complex reasoning tasks, Sonnet would likely have maintained an advantage. But for structured generation with clear constraints and examples, Haiku with better prompting was at least as good as Sonnet with worse prompting in every comparison we ran.

This is a finding worth sitting with. The quality of your prompt architecture often matters more than the model tier you are running against, for tasks in the structured generation category. Investing in prompt engineering before investing in model upgrades is almost always the right sequence.

The Prompt Re-Engineering Process

Our original dayBrain Volt prompt was written the way most first-pass LLM prompts are written: iteratively, during development, with additions made every time the output was not quite right. The result was a prompt that worked but carried significant bloat — redundant instructions, overlapping constraints, context that was included 'just in case', and formatting guidance that restated the same rule in three different ways.

Prompt bloat is expensive. Every token in your prompt is a token you pay for on every single call. A prompt that is 2,000 tokens longer than it needs to be, run 10,000 times per month, is 20 million unnecessary tokens per month. At any model's pricing, that is real money disappearing into instructions the model does not need.

Step One: Token Audit

The first step in our re-engineering process was a full token audit of the original prompt. We broke it into sections and asked a single question about each section: is this instruction necessary for correct output, or is it defensive padding?

Defensive padding is a real phenomenon. When a prompt produces bad output, the reflex is to add more instructions. Over time, prompts accumulate layers of corrective instruction that address problems the current version no longer has, or problems that were caused by earlier prompt versions rather than the model. These additions are rarely removed because removing them feels risky. The result is a prompt that is far longer than it needs to be.

In dayBrain Volt's original prompt, approximately 35% of the token count was defensive padding. Instructions that referenced output problems we had fixed weeks earlier. Formatting constraints written in four different places. Role-setting preamble that was longer than the actual task description.
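A token audit of this kind can be sketched in a few lines. The version below is an illustration rather than the tooling we used: the section texts are invented, the `necessary` flag stands in for a human reviewer's verdict, and `estimate_tokens` uses a crude four-characters-per-token assumption in place of a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class PromptSection:
    name: str
    text: str
    necessary: bool  # the reviewer's verdict for this section


def estimate_tokens(text: str) -> int:
    """Crude estimate assuming about four characters per token."""
    return max(1, len(text) // 4)


def padding_share(sections: list[PromptSection]) -> float:
    """Fraction of estimated prompt tokens in sections marked unnecessary."""
    total = sum(estimate_tokens(s.text) for s in sections)
    if total == 0:
        return 0.0
    padding = sum(estimate_tokens(s.text) for s in sections if not s.necessary)
    return padding / total


# Hypothetical sections; real ones would be pasted from the prompt under review.
sections = [
    PromptSection("role", "You write client quotes for a digital agency.", True),
    PromptSection("format", "Return the quote as the fields: summary, items, total.", True),
    PromptSection("old fix", "Never repeat the client name twice in the heading.", False),
]

for s in sections:
    flag = "keep" if s.necessary else "padding?"
    print(f"{s.name:10} {estimate_tokens(s.text):4d} tokens  {flag}")
print(f"padding share: {padding_share(sections):.0%}")
```

For a real audit, swap the estimator for your provider's token counter, since character-based estimates drift on code, numbers, and non-English text.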

Step Two: Structural Compression

Once we had identified what could be removed, we turned to structural compression — rewriting remaining instructions to say the same thing in fewer tokens. This is not about dumbing down your instructions. It is about writing precisely.

Natural language is inherently redundant. We use qualifiers, hedges, repetition, and elaboration in everyday communication because it aids human comprehension. LLMs do not need that scaffolding in the same way. A clear, direct instruction in 15 tokens is processed just as reliably as the same instruction padded to 45 tokens with 'please ensure that' and 'it is important that you' and 'make absolutely certain'.

We rewrote every instruction section with a target of 50% token reduction. In practice, we achieved around 42% compression on average across the prompt — meaningful but not extreme.

Step Three: Example Optimisation

Few-shot examples are one of the most effective tools in prompt engineering, but they are also one of the most expensive. A well-chosen example can replace hundreds of tokens of abstract instruction by showing the model exactly what good output looks like. A poorly chosen example, or too many examples, just adds cost without adding signal.

In our original prompt, we had three full examples. After analysis, we found that one example was covering 80% of the cases we cared about, one was covering an edge case that appeared in maybe 5% of requests, and one was largely redundant with the first. We reduced to two examples — one primary, one covering the most important structural variant — and rewrote both to be more concise without losing the demonstrative value.
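One way to run that analysis is to tally logged request types against the types each example demonstrates. The sketch below assumes you already tag requests by type; the type names, the example names, and the 80/15/5 split are invented for illustration.

```python
from collections import Counter


def coverage_by_example(request_types: list[str],
                        example_covers: dict[str, set[str]]) -> dict[str, float]:
    """Share of logged requests whose type each example demonstrates."""
    counts = Counter(request_types)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in example_covers}
    return {
        name: sum(counts[t] for t in types) / total
        for name, types in example_covers.items()
    }


# Hypothetical request log and example-to-type mapping.
request_log = ["standard"] * 80 + ["bundle"] * 15 + ["rush"] * 5
examples = {
    "example_a": {"standard"},
    "example_b": {"rush"},
    "example_c": {"standard"},  # overlaps entirely with example_a
}

for name, share in coverage_by_example(request_log, examples).items():
    print(f"{name}: covers {share:.0%} of requests")
```

Two examples with identical coverage are a signal that one of them is a candidate for removal, subject to checking output quality without it.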

Step Four: Dynamic Context Injection

The highest-impact single change we made was moving from a static prompt to a dynamic context injection model. In the original architecture, every call included the full context block for all possible quote types, all product categories, and all tone variants. Most of that context was irrelevant to any individual request.

By restructuring the prompt to inject only the context relevant to the specific request, we reduced average prompt token counts by a further 28%. The system prompt stayed lean and stable. The user-turn prompt carried only what that specific call needed. This is a straightforward architectural change, but it requires thinking about your prompt as code — with functions and parameters — rather than as a document.
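Treating the prompt as code can look something like the sketch below. It is a simplified illustration, not our production code: the context library, tone notes, and the `build_prompt` function are hypothetical, and the return value is plain data rather than a call to any particular SDK.

```python
SYSTEM_PROMPT = "You write client quotes. Follow the output format exactly."

# Hypothetical context library keyed by (quote_type, product_category).
CONTEXT_BLOCKS = {
    ("fixed_price", "web_design"): "Fixed-price web design quotes list deliverables and one total.",
    ("retainer", "web_design"): "Retainer quotes state monthly hours and a rollover policy.",
    ("fixed_price", "hosting"): "Hosting quotes state the plan tier and renewal terms.",
}

TONE_NOTES = {
    "formal": "Use a formal register.",
    "friendly": "Use a warm, plain register.",
}


def build_prompt(quote_type: str, product_category: str, tone: str,
                 request_details: str) -> dict[str, str]:
    """Return the stable system prompt plus a user prompt holding only relevant context."""
    context = CONTEXT_BLOCKS[(quote_type, product_category)]
    user_prompt = "\n\n".join([context, TONE_NOTES[tone], f"Request:\n{request_details}"])
    return {"system": SYSTEM_PROMPT, "user": user_prompt}


prompt = build_prompt("retainer", "web_design", "formal",
                      "Client: Acme. 20 hours per month.")
print(prompt["user"])
```

The static version would have concatenated every entry in `CONTEXT_BLOCKS` and `TONE_NOTES` into every call. Here each call pays only for the one block and one tone note it uses.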

The Architecture After Re-Engineering

After the full re-engineering process, our dayBrain Volt prompt architecture had four layers: a concise system prompt establishing role and output format constraints; a dynamic context block injected per-request based on quote type and product category; a structured input block carrying the specific request parameters; and a single primary example in the user turn, with a second example included conditionally for less common request types.

Total token count per average request dropped from approximately 3,800 tokens to approximately 1,100 tokens on the input side. Combined with the switch from Sonnet 4 to Haiku 4.5, this produced the cost reduction from $1.00 to $0.14 per quote.
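As a rough sanity check, the step-by-step percentages can be chained together. The sketch assumes each reduction applies to whatever remained after the previous step, which is a simplification: the 42% compression applied to instruction sections rather than the whole prompt, and the example changes are not counted at all.

```python
def apply_reductions(start_tokens: float, reductions: list[float]) -> float:
    """Apply each fractional reduction to what remains after the previous one."""
    remaining = start_tokens
    for fraction in reductions:
        remaining *= 1.0 - fraction
    return remaining


# Figures from the steps above: padding removed, compression, dynamic injection.
estimate = apply_reductions(3_800, [0.35, 0.42, 0.28])
print(f"estimated input tokens: {estimate:.0f}")

# Overall reduction implied by the measured before and after figures.
overall = 1 - 1_100 / 3_800
print(f"measured reduction: {overall:.0%}")
```

The chained estimate lands a little over 1,000 tokens, in the same range as the measured figure of approximately 1,100, and the measured figures imply a reduction of about 71% on the input side.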

Output quality was validated through a blind evaluation process: we generated 200 quotes using the old architecture and 200 using the new architecture, then had evaluators rate them across the four accuracy dimensions without knowing which architecture produced which output. The new architecture scored marginally higher on formatting consistency and equivalently on all other dimensions.
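The mechanics of blinding are simple to sketch. The version below is an illustration under assumptions, not our evaluation harness: it uses a single score per item rather than four dimensions, and the function names and ID format are made up.

```python
import random
from statistics import mean


def blind(old_outputs: list[str], new_outputs: list[str], seed: int = 0):
    """Shuffle outputs from both architectures and hide their origin behind opaque IDs."""
    labelled = [("old", text) for text in old_outputs]
    labelled += [("new", text) for text in new_outputs]
    random.Random(seed).shuffle(labelled)
    items = {f"item-{i:03d}": text for i, (_, text) in enumerate(labelled)}
    key = {f"item-{i:03d}": arch for i, (arch, _) in enumerate(labelled)}
    return items, key  # evaluators see items only; key stays with the organiser


def mean_scores(ratings: dict[str, float], key: dict[str, str]) -> dict[str, float]:
    """Unblind the ratings and average them per architecture."""
    by_arch: dict[str, list[float]] = {}
    for item_id, score in ratings.items():
        by_arch.setdefault(key[item_id], []).append(score)
    return {arch: mean(scores) for arch, scores in by_arch.items()}


items, key = blind(["old quote 1", "old quote 2"], ["new quote 1", "new quote 2"])
ratings = {item_id: 4.0 for item_id in items}  # placeholder scores from evaluators
print(mean_scores(ratings, key))
```

Keeping `key` away from the evaluators until every rating is in is the part that matters. The shuffle only prevents position from leaking the answer.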

What the Numbers Actually Mean

A $0.86 per-quote saving sounds modest. Scale it and it is not. At 10,000 quotes per month, that is $8,600 in monthly savings — over $100,000 annually. At 100,000 quotes per month, it is $86,000 per month. The re-engineering work took approximately three weeks of focused effort. The payback period at meaningful volume is measured in days.
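The arithmetic is easy to reproduce for your own figures. In the sketch below the per-quote saving comes from this post, while `EFFORT_COST` is a purely hypothetical price for three weeks of engineering time, so substitute your own.

```python
def monthly_saving(saving_per_quote: float, quotes_per_month: int) -> float:
    return saving_per_quote * quotes_per_month


def payback_days(effort_cost: float, saving_per_quote: float,
                 quotes_per_month: int, days_per_month: int = 30) -> float:
    """Days of savings needed to recover a one-off engineering cost."""
    daily_saving = monthly_saving(saving_per_quote, quotes_per_month) / days_per_month
    return effort_cost / daily_saving


SAVING_PER_QUOTE = 1.00 - 0.14  # from the figures in this post
EFFORT_COST = 30_000  # hypothetical cost of three weeks of engineering time

for volume in (10_000, 100_000):
    monthly = monthly_saving(SAVING_PER_QUOTE, volume)
    days = payback_days(EFFORT_COST, SAVING_PER_QUOTE, volume)
    print(f"{volume:>7,} quotes/month: ${monthly:,.0f}/month, "
          f"${monthly * 12:,.0f}/year, payback in {days:.0f} days")
```

The payback period scales inversely with volume, so the same work that takes months to pay for itself at low volume pays for itself in days at high volume.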

This is the correct way to think about LLM optimisation work: not as a technical nicety, but as an investment with a quantifiable return. If you know your per-call cost and your projected volume, you can calculate exactly what a given percentage reduction is worth in annual terms. That number almost always justifies the work.

A Framework for LLM Cost Decisions

Based on the dayBrain Volt process and the broader pattern of AI product work we do at Daybrain Digital, here is the framework we now use when evaluating LLM cost architecture. It is applicable to any product that makes LLM calls at scale.

The Task Classification Test

Before choosing a model, classify your task. This is a five-point scale:

Level 1 — Structured extraction: Pulling specific fields from defined input. Pattern-matching, classification, templated output. Haiku-class models handle this well.
Level 2 — Constrained generation: Generating content within tight structural and tonal constraints, with examples. dayBrain Volt sits here. Haiku-class with strong prompting handles this well.
Level 3 — Guided reasoning: Multi-step analysis with defined outputs. Some ambiguity in the task. Mid-tier models like Sonnet are the right call here.
Level 4 — Open reasoning: Complex analysis, nuanced judgement, tasks where the path to the answer is not pre-specified. Frontier models earn their cost at this level.
Level 5 — Novel problem-solving: Tasks requiring genuine synthesis, creative reasoning across domains, or handling of truly novel inputs. Use your best available model.

Most production SaaS features that use LLMs sit at Level 1 or Level 2. Most teams are running them on Level 3 or Level 4 models because that is what they tested with. The mismatch is where cost goes wrong.
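The scale is simple enough to encode and run against an inventory of features. This sketch is one possible encoding: the tier labels are shorthand rather than provider model names, and Level 5's "best available model" is folded into the frontier tier for simplicity.

```python
from enum import IntEnum


class TaskLevel(IntEnum):
    STRUCTURED_EXTRACTION = 1
    CONSTRAINED_GENERATION = 2
    GUIDED_REASONING = 3
    OPEN_REASONING = 4
    NOVEL_PROBLEM_SOLVING = 5


# Tier labels are shorthand for this sketch, not provider model names.
TIER_FOR_LEVEL = {
    TaskLevel.STRUCTURED_EXTRACTION: "small",
    TaskLevel.CONSTRAINED_GENERATION: "small",
    TaskLevel.GUIDED_REASONING: "mid",
    TaskLevel.OPEN_REASONING: "frontier",
    TaskLevel.NOVEL_PROBLEM_SOLVING: "frontier",
}
TIER_RANK = {"small": 0, "mid": 1, "frontier": 2}


def is_over_specified(level: TaskLevel, current_tier: str) -> bool:
    """True when the tier in use ranks above what the task level calls for."""
    return TIER_RANK[current_tier] > TIER_RANK[TIER_FOR_LEVEL[level]]


# A Level 2 feature running on a mid-tier model is flagged for review.
print(is_over_specified(TaskLevel.CONSTRAINED_GENERATION, "mid"))
```

A flag from this check is a prompt to run a comparison test, not a verdict. The quality validation still has to be done on real outputs.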

The LLM Cost Optimisation Checklist

Before declaring your LLM architecture production-ready, run through this checklist:

Prompt audit
☐ Have you removed all defensive padding added during development?
☐ Have you checked for duplicate or redundant instructions?
☐ Have you compressed verbose instructions to their minimum effective form?
☐ Have you reviewed your few-shot examples for necessity and conciseness?

Architecture audit
☐ Are you injecting only the context relevant to each specific request?
☐ Are you caching system prompts where your provider supports it?
☐ Are you batching requests where latency allows?
☐ Are you logging and monitoring token counts per request in production?

Model selection audit
☐ Have you classified your task using a framework like the one above?
☐ Have you tested the task on a lower-tier model with an optimised prompt?
☐ Have you defined measurable quality criteria before running comparison tests?
☐ Have you calculated the annual cost difference between model options at projected volume?

Monitoring and iteration
☐ Do you have alerting on cost-per-request in production?
☐ Do you have a process for periodic prompt review as your use case evolves?
☐ Do you have a clear quality regression test you can run after prompt changes?

This checklist is not exhaustive, but running through it before you ship will catch the most common and most expensive architectural errors.
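For the monitoring items, a rolling cost-per-request check is a reasonable starting point. The class below is a minimal sketch with made-up prices and thresholds. In practice the token counts would come from the usage data your provider returns with each response, and the alert would go to your existing alerting system rather than a return value.

```python
from collections import deque
from statistics import mean


class CostMonitor:
    """Tracks cost per request over a rolling window and flags threshold breaches."""

    def __init__(self, input_price_per_mtok: float, output_price_per_mtok: float,
                 alert_threshold_usd: float, window: int = 100):
        self.input_price = input_price_per_mtok
        self.output_price = output_price_per_mtok
        self.alert_threshold_usd = alert_threshold_usd
        self.recent_costs: deque[float] = deque(maxlen=window)

    def record(self, input_tokens: int, output_tokens: int) -> bool:
        """Record one request; return True if the rolling mean cost exceeds the threshold."""
        cost = (input_tokens * self.input_price
                + output_tokens * self.output_price) / 1_000_000
        self.recent_costs.append(cost)
        return mean(self.recent_costs) > self.alert_threshold_usd


# Placeholder prices and threshold, for illustration only.
monitor = CostMonitor(input_price_per_mtok=2.0, output_price_per_mtok=6.0,
                      alert_threshold_usd=0.01, window=3)
for input_tokens, output_tokens in [(1_000, 500), (10_000, 2_000)]:
    if monitor.record(input_tokens, output_tokens):
        print(f"alert: rolling cost per request above ${monitor.alert_threshold_usd}")
```

Averaging over a window rather than alerting on single requests avoids noise from the occasional long input while still catching a prompt change that quietly inflates every call.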

What We Did Not Compromise On

It is worth being direct about the decisions we did not make, because they matter as much as the ones we did.

We did not implement aggressive output truncation. Some teams reduce output token counts by instructing the model to produce shorter responses than the task warrants. For a task like dayBrain Volt, where the output is the product, shortening the output to save tokens would have degraded quality in a way that was immediately visible to users. We did not do that.

We did not cache outputs and serve stale responses. Caching is a legitimate cost reduction tool for some use cases, but not for a product where every quote needs to be generated fresh from the specific input parameters. We evaluated this and ruled it out cleanly.

We did not switch to a non-Anthropic model purely for cost reasons. We compared Haiku 4.5 against comparable offerings from other providers and concluded that for our quality requirements, Haiku was the right choice. Cost optimisation should always be constrained by quality requirements, not the other way around.

These boundaries matter. Optimisation without constraints is just degradation with extra steps. The goal is the lowest cost at acceptable quality — not the lowest cost, full stop.

The Broader Pattern: Friction in AI Architecture

The dayBrain Volt optimisation is a specific instance of a more general pattern. When AI features are expensive to run, the root cause is almost always architectural debt accumulated during development — the same kind of friction that accumulates in any system built quickly and iterated on without structural review.

We wrote about this dynamic in the context of business systems more broadly in 'Identifying Business Friction Before It Costs You Growth', and the principle applies directly to AI architecture. Silent, cumulative inefficiency that compounds with scale is exactly what an unreviewed LLM prompt stack becomes over time.

The same argument applies to the broader question of when to re-engineer versus when to leave a working system alone. A prompt that costs too much is not a broken system — it is a working system with a fixable inefficiency. The decision framework in 'Legacy Systems: When to Modernise and When to Leave Well Alone' is relevant here: the question is always whether the cost of the change is justified by the return, not whether the current state is technically imperfect.

In the dayBrain Volt case, the return was unambiguous. At scale, a three-week re-engineering effort saves six figures annually. That is not a close call.

Applying This to Your Architecture

If you are building an LLM-powered product and you have not yet run a structured cost audit, the first thing to do is establish a baseline. Log your actual per-request token counts — input and output separately. Calculate your actual cost per operation at current volume. Then project that cost at the volume you are planning for at six months and twelve months.

If the number is comfortable, you may not need to act immediately. If the number is uncomfortable, or if it becomes uncomfortable at modest scale, you have a problem worth solving now rather than later. Retrofitting cost optimisation into a live production system under volume pressure is significantly harder than doing it before you scale.

The three levers are always the same: prompt efficiency, model selection, and architecture (how and when you make calls). In our experience, prompt efficiency delivers the fastest return with the lowest risk. Model selection delivers the largest savings but requires quality validation. Architecture changes — batching, caching, dynamic injection — deliver meaningful gains but require more engineering work.

Start with prompts. Always start with prompts.

It is also worth noting that this class of problem — where the right technical answer is available but requires structured analysis to find — is exactly what the technology audit process is designed to surface. An AI cost problem buried in your architecture looks different from the outside than it does once you have mapped it systematically. External perspective helps here, particularly when your team is close to the implementation.

What This Means for AI Product Economics

There is a broader point worth making about where AI product economics are heading. Model providers are competing on price as well as capability, and the cost of inference has fallen substantially over the past two years. Haiku 4.5 today is more capable than models that cost twice as much eighteen months ago.

This trend will continue. But the teams that build cost discipline into their AI architecture from the start will be positioned better than the teams that rely on model pricing falling to rescue them from bloated prompt stacks. Market conditions change; architectural debt does not self-correct.

Building with cost as a first-class constraint — alongside quality and latency — is the right engineering posture for any team that expects to run LLM calls at scale. It does not mean being cheap. It means being deliberate.

dayBrain Volt went from a product with broken unit economics to a product with strong ones. The output is at least as good. The cost is 86% lower. The architecture is cleaner and easier to maintain. None of those outcomes required a fundamental rethink of what dayBrain Volt is or what it does. They required a structured, honest look at how it was built and a willingness to do the re-engineering work properly.

That work is almost always available to you. The only question is whether you go looking for it.


If you are building an AI-powered product and want a structured review of your LLM architecture and cost profile, the team at Daybrain Digital can help. We bring the same approach we applied to dayBrain Volt to your codebase — finding the inefficiencies, mapping the trade-offs, and implementing the changes. Book a conversation using our calendar link at co.daybra.in.