AiCal takes a forwarded email and turns it into a calendar event. That is the whole product. You forward "Dinner with Sarah next Thursday at 8 at Trattoria Mio," and it produces a structured event with a title, a start time, a location, and nothing else. The first version I built worked beautifully on the three emails I tested it with and then fell apart the moment I fed it a real inbox, where a "confirmation" email might be a flight, a dentist appointment, a Calendly link, or a newsletter pretending to be an invitation. That gap, between the demo and the inbox, is the entire subject of this post.

Prompt engineering for demos is writing a clever sentence. Prompt engineering for products is designing an interface that a probabilistic system has to honour every time, including on the inputs you never imagined. The trick people miss is that the prompt is not the feature. The prompt plus the schema plus the eval suite plus the fallback is the feature.


The prompt is an interface, so give it a type

The single best decision I made on AiCal was to stop asking the model for "the event details" and start asking it for an object that matches a schema I control. Gemini's structured-output mode lets you hand it a response schema and get JSON back that conforms to it. That moves an enormous amount of work out of the prompt and into a type definition, which is a much better place for it because a type is checkable and a paragraph of instructions is not.

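import { z } from "zod";

// The response schema the model has to fill in; the prompt no longer describes the shape.
const EventSchema = z.object({
  isEvent: z.boolean(), // false means "this email is not an invitation"
  title: z.string(),
  start: z.string().datetime(), // ISO 8601 datetime string
  end: z.string().datetime().nullable(),
  location: z.string().nullable(),
  confidence: z.enum(["high", "medium", "low"]), // the model's own certainty
});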

Two fields there are doing quiet, heavy lifting. isEvent is the model's escape hatch: it says "this email is not actually an invitation," which is the right answer for most of a real inbox. confidence is the model rating its own certainty, which I use downstream to decide whether to create the event silently or ask the user to confirm. Without those two fields, the model has no way to express doubt, so it does the worst possible thing: it confidently invents an event for a newsletter.
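A minimal sketch of how those two fields can drive the downstream branch. The `ExtractedEvent` type mirrors the schema above as a plain type; `routeDecision` and the action names are hypothetical, and sending "medium" down the same path as "low" is an assumption, since only the high and low cases are spelled out in this post.

```typescript
type Confidence = "high" | "medium" | "low";

// Plain-type mirror of EventSchema, so the sketch has no dependencies.
interface ExtractedEvent {
  isEvent: boolean;
  title: string;
  start: string;
  end: string | null;
  location: string | null;
  confidence: Confidence;
}

// Hypothetical action names for the three outcomes.
type Action = "ignore" | "create" | "confirm";

function routeDecision(event: ExtractedEvent): Action {
  // The model abstained: this email is not an invitation.
  if (!event.isEvent) return "ignore";
  // High confidence creates the event silently.
  if (event.confidence === "high") return "create";
  // Assumption: anything below "high" is shown to the user as a draft.
  return "confirm";
}
```

The point of the shape is that doubt becomes a value the rest of the product can branch on, instead of something the model has to hide.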

When the prompt is an interface, the prompt text shrinks. Mine is mostly negative space now (what not to do, how to handle the absence of information) because the shape of the answer is already pinned down by the schema. I stopped writing "return valid JSON with the following fields" entirely; the structured-output mode takes care of that. The prose is free to focus on judgement.

When I teach this to other engineers, I find it lands fastest if I label the prompt the way I would annotate a piece of code in a review. Here is the skeleton I keep coming back to, with the parts that actually carry weight called out.

Anatomy of a prompt that survives production

[ROLE]    You are a strict JSON validator.
[INPUT]   {{userText}}
[RULES]   Return only valid JSON. No prose.
[SCHEMA]  { "ok": boolean, "errors": string[] }
[FALLBACK] If unsure, set ok=false.

[ROLE] is one line that tells the model what it is doing. A narrow role (strict validator) constrains the output more than a vague one (helpful assistant) ever could.

That skeleton is deliberately generic, but the same moves carry over to AiCal directly. The naive prompt and the schema-backed one diverge hardest on the email that isn't an event, a newsletter pretending to be an invitation. Here is the naive version:

const prompt = `Extract the event from this email.
Return the title, start time and location.`;

const text = await model.generate(prompt + email);
// newsletter comes in →
// model has no way to say "not an event",
// so it INVENTS one. confidently.
// you parse prose by hand. it drifts. it breaks.

With the schema in place, isEvent lets the model abstain and confidence routes doubt to a human. The schema does the work a paragraph of instructions never could.


You cannot improve what you cannot replay

The thing that separates a real prompt from a lucky one is an eval harness, and almost nobody building a side project has one because it feels like overhead until the first time a "small" prompt tweak silently breaks a case that used to work. That happened to me. I improved the handling of timezones in a prompt edit and quietly destroyed the handling of all-day events, and I only noticed three days later when an all-day "Holiday" became a midnight-to-midnight nightmare on my own calendar.

After that I built a regression suite. Nothing fancy: a folder of real (anonymised) emails paired with the event I expect each to produce, and a script that runs them all and diffs.

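// `fixtures` and `classify` come from the project; `test` and `expect` are Jest-style globals.
test("fixtures still produce the expected decisions", async () => {
  for (const { email, expected } of fixtures) {
    const result = await classify(email);
    expect(result.isEvent).toBe(expected.isEvent);
    // Only compare the start time when the fixture really is an event.
    if (expected.isEvent) {
      expect(result.start).toBe(expected.start);
    }
  }
});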

The important detail is that I do not assert on the model's exact wording. I assert on the decisions: did it correctly decide this is an event, did it get the start time right, did it refuse the newsletter. Prompt evals that demand exact string matches are brittle and useless because the model's phrasing legitimately varies. Eval the structured fields, not the prose. When a fixture I care about regresses, the suite fails before I deploy, and I have a reproducible case to debug instead of a vague feeling that "it got worse."
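For a version that does not depend on a test runner, here is a rough sketch of the same idea: compare decisions field by field and report which fixtures regressed. The `Decision` and `Fixture` shapes and both function names are assumptions for illustration, not AiCal's actual code.

```typescript
// Hypothetical shapes: a fixture pairs an email with the decision expected from it.
interface Decision {
  isEvent: boolean;
  start: string | null;
}

interface Fixture {
  name: string;
  email: string;
  expected: Decision;
}

// Returns the names of the structured fields that differ. Wording is never compared.
function mismatches(expected: Decision, actual: Decision): string[] {
  const fields: string[] = [];
  if (actual.isEvent !== expected.isEvent) fields.push("isEvent");
  // The start time only matters when the fixture really is an event.
  if (expected.isEvent && actual.start !== expected.start) fields.push("start");
  return fields;
}

// Replays every fixture through a classifier and collects the regressions.
async function runEvals(
  fixtures: Fixture[],
  classify: (email: string) => Promise<Decision>,
): Promise<{ name: string; fields: string[] }[]> {
  const failures: { name: string; fields: string[] }[] = [];
  for (const fixture of fixtures) {
    const fields = mismatches(fixture.expected, await classify(fixture.email));
    if (fields.length > 0) failures.push({ name: fixture.name, fields });
  }
  return failures;
}
```

An empty result means every decision held; anything else names the fixture and the field to debug.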

The eval set is a living asset

The mistake I made early on was treating the regression suite as a one-time chore: write a dozen fixtures, watch them go green, move on. But an eval set assembled in a single afternoon only covers the inputs I could imagine at my desk, and the inputs I can imagine are exactly the ones that never break anything. The interesting failures come from reality, so the suite has to keep growing from the place where real inputs arrive: production.

The loop is simple, and I run it every time AiCal does something wrong. A real forwarded email produces a wrong decision: maybe it reads a flight confirmation as an all-day event, maybe it invents a meeting from a marketing blast. I take that exact email, strip anything sensitive out of it, write down the decision I actually wanted, and drop the pair straight into the fixtures folder. Now the thing that broke once can never break silently again. The next prompt edit that would have reintroduced that failure trips the suite before it reaches my calendar.
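The harvesting step can be sketched in a few lines. Everything here is hypothetical: the `HarvestedFixture` shape, the function names, and above all the redaction rule, which only masks email addresses as a stand-in for whatever "sensitive" means for a given inbox.

```typescript
interface HarvestedFixture {
  email: string;
  expected: { isEvent: boolean; start: string | null };
}

// Assumption: masking email addresses is enough for this illustration.
// A real redactor would need to cover names, phone numbers, booking codes and more.
function redact(text: string): string {
  return text.replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "<email>");
}

// Pairs the redacted email with the decision that was actually wanted.
function harvestFixture(
  rawEmail: string,
  wanted: HarvestedFixture["expected"],
): HarvestedFixture {
  return { email: redact(rawEmail), expected: wanted };
}
```

The redaction deliberately leaves dates and times alone, because those are exactly the parts the fixture exists to test.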

Fixtures harvested this way beat invented ones for a reason that took me a while to appreciate. When I write a test case from scratch, I am encoding my mental model of what a tricky email looks like, and that model is wrong in the same blind spots as my prompt. A real failure has no such bias: it is concrete proof that the world produces shapes I did not account for. Every one I capture makes the suite a little more representative of the emails people actually forward, instead of the tidy distribution I pictured.

Over months, the eval set turns into something more valuable than a test file. It becomes the institutional memory of every weird input the feature has ever choked on: the German date format, the empty subject line, the calendar invite quoted inside a reply. I cannot hold all of those in my head, and I do not have to. The suite remembers them for me, long after I have forgotten the incident that added each one.

Design for the inputs you didn't imagine

Every prompt I have shipped to production has been broken by an input I didn't anticipate, and that is not a failure of imagination, it is the nature of letting real people send real text into a system. The job is not to imagine every input. The job is to make the unimagined input fail safely.

A few patterns that survived contact with real users:

Always provide an "I don't know" path. The isEvent: false field on AiCal exists so the model never has to fabricate. A model with no way to abstain will hallucinate to fill the gap.
Treat low confidence as a UX branch, not an error. High confidence creates the event. Low confidence shows the user a draft and asks. The model is allowed to be unsure; the product just routes uncertainty to a human.
Pin the date context explicitly. "Next Thursday" is meaningless without today's date in the prompt. I inject the current timestamp and the user's timezone every call. Forgetting this is the single most common bug in any scheduling-adjacent LLM feature.
Re-validate the output even though the schema "guarantees" it. Structured-output modes are very good, not perfect. A safeParse at the boundary costs nothing and catches the rare malformed response before it reaches my database.
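Two of those patterns are easy to show in code. This is a dependency-free sketch, so the hand-written `safeParseEvent` stands in for Zod's safeParse on the real schema and checks only three fields; `dateContext`, the exact wording of the injected lines, and the result shape are all assumptions.

```typescript
type Confidence = "high" | "medium" | "low";

interface CheckedEvent {
  isEvent: boolean;
  start: string;
  confidence: Confidence;
}

type ParseResult =
  | { success: true; data: CheckedEvent }
  | { success: false; error: string };

// Pins "today" so that relative phrases like "next Thursday" can be resolved.
function dateContext(now: Date, timeZone: string): string {
  return `Current time (UTC): ${now.toISOString()}\nUser timezone: ${timeZone}`;
}

function isConfidence(value: unknown): value is Confidence {
  return value === "high" || value === "medium" || value === "low";
}

// Re-validates the model's raw response at the boundary instead of trusting it.
function safeParseEvent(raw: string): ParseResult {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return { success: false, error: "response is not JSON" };
  }
  if (typeof value !== "object" || value === null) {
    return { success: false, error: "response is not an object" };
  }
  const { isEvent, start, confidence } = value as Record<string, unknown>;
  if (typeof isEvent !== "boolean") return { success: false, error: "isEvent must be a boolean" };
  if (typeof start !== "string" || Number.isNaN(Date.parse(start))) {
    return { success: false, error: "start must be a parseable datetime" };
  }
  if (!isConfidence(confidence)) return { success: false, error: "confidence is not a known level" };
  return { success: true, data: { isEvent, start, confidence } };
}
```

A failed parse is treated like any other doubtful result: nothing is written, and the email can go to the user or be retried.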

Treat the prompt like code, because it is code

The mistake that kept biting me early was treating the prompt as a string I could tweak freely, separate from the rest of the system. It is not. The prompt, the schema it targets, and the eval fixtures that guard it are one unit, and they drift apart the instant you let them. So I version them together, in the repo, in the same commit. When I change the prompt, the schema change and the new or updated fixture go in alongside it, reviewed as one change. A prompt edit with no corresponding eval update is, to me, the same smell as a behaviour change with no test: a thing that might be fine and might be a silent regression, and I cannot tell which.

Keeping the prompt in the codebase rather than in some external dashboard is part of this. I have watched people manage prompts in a separate tool, editing them live in production, and it always ends the same way: nobody can reproduce why the output changed last Tuesday because the prompt that produced it no longer exists. A prompt that lives in version control has a history. I can git blame a line of instruction and find the commit, the eval that justified it, and the reasoning in the message. That traceability is worth far more than the convenience of editing a prompt without a deploy.

The other half is observability once it ships. I log enough about each call (the model used, the structured decision it returned, the confidence) to reconstruct what happened without storing anything sensitive. When a user reports that the feature did something strange, I do not want to guess. I want to look at the actual decision the model made and compare it to what the schema allowed, because the bug is almost always in the gap between those two. A prompt you cannot replay and a decision you cannot inspect are how a probabilistic feature becomes unmaintainable. Version the prompt, guard it with evals, log what it decides, and the model stops being a black box you pray to and becomes a component you can actually debug.
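One possible shape for such a log record, as a sketch only. The field names, the `promptVersion` idea (for example the commit hash of the prompt), and the choice to treat title and location as sensitive are assumptions, not a description of what AiCal actually stores.

```typescript
type Confidence = "high" | "medium" | "low";

// The full decision as the model returned it.
interface ModelDecision {
  isEvent: boolean;
  confidence: Confidence;
  title: string;
  location: string | null;
}

// What gets logged: enough to reconstruct the decision, nothing user-written.
interface CallLog {
  at: string;            // ISO timestamp of the call
  model: string;         // model identifier used for the call
  promptVersion: string; // assumption: e.g. the commit hash of the prompt
  isEvent: boolean;
  confidence: Confidence;
}

function toCallLog(
  model: string,
  promptVersion: string,
  decision: ModelDecision,
  at: Date,
): CallLog {
  // Title and location are deliberately left out of the record.
  return {
    at: at.toISOString(),
    model,
    promptVersion,
    isEvent: decision.isEvent,
    confidence: decision.confidence,
  };
}
```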


Most few-shot examples are paying rent for nothing

Few-shot examples feel free because they live in the prompt, but you pay for every one of them on every single call. On AiCal I started with eight hand-picked examples of tricky emails, and for months I assumed all eight were earning their place. They were not. When I pulled them out one at a time and re-ran the eval suite, the first two or three did the real work of anchoring the format and the abstain behaviour, and the rest changed almost nothing except the bill. The model had already learned the pattern by example three. The way to trim is mechanical and worth the afternoon: remove one example, run the evals, and keep it out if the decisions hold. Most few-shot prompts I have seen carry ten examples and get the value of three. Cut the ones that do not move a single fixture, and your prompt gets cheaper and faster on every call without losing a thing.
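The trimming procedure described above is a leave-one-out loop. A minimal sketch, assuming a hypothetical `passes` callback that runs the eval suite with a given set of examples in the prompt and returns how many fixtures pass; it is synchronous here for brevity, whereas a real run would be asynchronous and would cost model calls.

```typescript
// Removes few-shot examples one at a time and keeps each one out
// whenever the eval score does not drop.
function trimExamples<T>(
  examples: T[],
  passes: (examples: T[]) => number,
): T[] {
  let kept = [...examples];
  let baseline = passes(kept);
  // Walk from the end so the earliest examples are tried last.
  for (let i = kept.length - 1; i >= 0; i--) {
    const candidate = kept.filter((_, index) => index !== i);
    const score = passes(candidate);
    if (score >= baseline) {
      kept = candidate; // the decisions held, so the example stays out
      baseline = score;
    }
  }
  return kept;
}
```

Model output varies between runs, so a single pass of this loop is a starting point rather than proof; repeating the eval a few times before dropping an example is the cautious version.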

The arithmetic is brutal once you multiply it across every request, so here is the cost of carrying examples you never measured.

Few-shot examples are not free. With 4 examples pasted into every prompt, at about 80 tokens per example, that is 320 extra tokens added per call.

Trim to the examples that earn their place, and the rest of the savings come from the deterministic fallbacks below.
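The per-call arithmetic above is simple enough to write down. In this sketch the 80 tokens per example is the rough figure from this post, while `callsPerDay` and the 30-day month are made-up inputs for illustration.

```typescript
const TOKENS_PER_EXAMPLE = 80; // rough figure from this post

// Extra prompt tokens carried on every single call.
function extraTokensPerCall(
  exampleCount: number,
  tokensPerExample: number = TOKENS_PER_EXAMPLE,
): number {
  return exampleCount * tokensPerExample;
}

// Extra prompt tokens over a month of traffic (hypothetical volume).
function extraTokensPerMonth(
  exampleCount: number,
  callsPerDay: number,
  days: number = 30,
): number {
  return extraTokensPerCall(exampleCount) * callsPerDay * days;
}

// 4 examples add 4 * 80 = 320 tokens per call.
const fourExamples = extraTokensPerCall(4);
// Going from ten examples to three saves 800 - 240 = 560 tokens per call.
const savedPerCall = extraTokensPerCall(10) - extraTokensPerCall(3);
```

At a hypothetical 1,000 calls a day, that 560-token difference is 16.8 million prompt tokens a month that buy nothing.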

Cheap fallbacks beat clever prompts

There is a temptation to solve every edge case with a longer, cleverer prompt. I have learned to resist it. A prompt that has grown three paragraphs to handle one weird email is a prompt that is now slower, more expensive, and more likely to regress on the common case. Past a point, prompt complexity is technical debt with extra steps.

The better move is usually a deterministic fallback outside the model. AiCal does a cheap pre-check: if an email contains an .ics attachment, I parse it directly and never call the model at all, because a calendar attachment is already structured data and asking a language model to re-derive it is paying tokens to make something worse. Some inputs do not need intelligence. They need a parser. Knowing which is which is most of the engineering.

The numbers convinced me this was not premature optimisation. When I added the .ics fast path, it absorbed close to a fifth of all incoming emails on its own, because most calendar invites that get forwarded already carry the attachment that automated systems generate. That is a fifth of my volume that now returns in single-digit milliseconds with zero token cost and, more importantly, zero chance of a hallucinated time, since the parser reads the exact DTSTART the sender's calendar wrote. The model path, by contrast, was averaging somewhere north of a second per call once you counted the round trip and the retry budget. Cutting a fifth of traffic off that path is a latency and bill win you simply cannot prompt your way to, and I go through the arithmetic of trimming model spend like this in the LLM cost control playbook.

The gotcha I hit, and the reason I now safeParse even the parser's output, is that "structured" does not mean "correct": I saw .ics files in the wild with a DTSTART but no DTEND, and one provider that emitted timezones as a raw offset string my library refused to read. So the fast path is not a blind trust of the attachment. It is a cheap, validated first attempt that falls through to the model only when the deterministic read fails its own schema check. Cheap, then clever, in that order.
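A deliberately small sketch of that fast path, not AiCal's parser. It assumes a hand-rolled reader that understands only the UTC form of DTSTART and DTEND, ignores line folding and VTIMEZONE blocks, and returns null for anything else, which is the signal to fall through to the model. All names and the "(no title)" default are hypothetical.

```typescript
interface IcsEvent {
  title: string;
  start: string;      // ISO 8601, UTC
  end: string | null; // DTEND can be missing in real files
}

// Converts the UTC basic form (e.g. 20250306T190000Z) to ISO 8601.
// TZID parameters, date-only values and raw offsets are not handled.
function icsUtcToIso(value: string): string | null {
  const m = /^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})Z$/.exec(value.trim());
  if (!m) return null;
  const iso = `${m[1]}-${m[2]}-${m[3]}T${m[4]}:${m[5]}:${m[6]}Z`;
  return Number.isNaN(Date.parse(iso)) ? null : iso;
}

// Returns the value of the first line that is exactly "NAME:value".
function icsProperty(ics: string, name: string): string | null {
  for (const line of ics.split(/\r?\n/)) {
    if (line.startsWith(name + ":")) return line.slice(name.length + 1);
  }
  return null;
}

// null means "the deterministic read failed, ask the model instead".
function tryIcsFastPath(ics: string): IcsEvent | null {
  const startRaw = icsProperty(ics, "DTSTART");
  const start = startRaw === null ? null : icsUtcToIso(startRaw);
  if (start === null) return null;

  const endRaw = icsProperty(ics, "DTEND");
  const end = endRaw === null ? null : icsUtcToIso(endRaw);
  // A DTEND that exists but cannot be read is a failed check, not a missing end.
  if (endRaw !== null && end === null) return null;

  return { title: icsProperty(ics, "SUMMARY") ?? "(no title)", start, end };
}
```

The caller tries `tryIcsFastPath` first and only builds a prompt when it gets null back, so an unreadable attachment costs a few microseconds rather than a wrong event.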

The same logic applies to refusals. When the model returns low confidence on a genuinely ambiguous email, I do not retry with a more elaborate prompt and hope. I surface the ambiguity to the user, because the user knows whether "lunch?" from their friend is a real plan, and the model never will. A product that admits uncertainty gracefully feels more trustworthy than one that guesses confidently and is sometimes wrong.

Deciding whether a given edge case belongs in the prompt or in deterministic code is the call I make most often, so here is the way I actually route it in my head.

Prompt, parser, or human? Where should this edge case actually be handled?

The first question is whether the input is already structured (an .ics file, a JSON payload, a known template).
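Written as code, that routing might look like the sketch below. Only the first question comes from this section; the second branch is a reading of the refusal paragraph above, and the type and function names are hypothetical.

```typescript
type Handler = "parser" | "human" | "prompt";

interface EdgeCase {
  // An .ics file, a JSON payload, a known template.
  alreadyStructured: boolean;
  // Assumption: the answer depends on something only the user knows,
  // such as whether "lunch?" from a friend is a real plan.
  onlyUserCanKnow: boolean;
}

function whereToHandle(edgeCase: EdgeCase): Handler {
  // Structured input needs a parser, not intelligence.
  if (edgeCase.alreadyStructured) return "parser";
  // Genuine ambiguity is surfaced to the user instead of retried.
  if (edgeCase.onlyUserCanKnow) return "human";
  // Everything else is a judgement call the prompt has to carry.
  return "prompt";
}
```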

Why I version prompts like code, in PRs

I spend my days as a working student at BMW and my evenings on an M.Sc. in Computer Science at LMU Munich, and the habit that carried over from both into AiCal is mundane: prompts belong in the repository, reviewed in pull requests, never in a dashboard. The first time I managed a prompt in an external tool I learned the lesson the hard way. Someone edits the live prompt, the output shifts, and a week later nobody can say what the text used to be or why it changed. There is no commit, no diff, no author, no reasoning. The prompt that produced last week's behaviour has simply been overwritten and is gone. A prompt in the repo has all of that for free, and it costs nothing I was not already paying for the rest of the code.

Once the prompt lives in version control, the eval harness becomes the thing that makes editing it safe. I keep a tiny regression suite next to the prompt, a folder of real anonymised emails paired with the decision I expect, and a script that replays them and diffs the structured output. It is not sophisticated and it does not need to be. Its whole job is to fail loudly when a prompt tweak silently breaks a case that used to work, which is the exact failure mode that makes people afraid to touch a prompt at all. With the harness in place, I can edit an instruction, run the fixtures, and see in seconds whether I improved one case while quietly wrecking three others. If you want to see where these prompts actually sit in the stack, I wrote up building AI features in the App Router separately.

The payoff lands in review. When I change a prompt, the diff shows up in the pull request alongside the schema change and the new or updated fixture, and a reviewer can read all three as one unit. A prompt edit with no eval change reads, to me, exactly like a behaviour change with no test: maybe fine, maybe a silent regression, and impossible to tell which from the diff alone. Reviewing prompt diffs the same way I review any other code is what turns a probabilistic feature from something I nervously poke at into something a second person can reason about and approve.

To make that concrete, here is the actual lifecycle a single prompt change goes through on AiCal, from the failing email to the merged commit.

The life of one prompt change

A forwarded message produces a wrong decision in production. Maybe a flight confirmation reads as an all-day event, maybe a marketing blast invents a meeting. I notice it because the structured decision I logged does not match what the schema should have allowed.

What "production-grade" actually means here

It means the prompt has a schema, the schema has a Zod guard, the behaviour has an eval suite, the uncertainty has a UX branch, and the easy cases have a non-AI fast path. None of those are about writing better sentences. They are about building a system around the model so that the model's inevitable mistakes are contained instead of catastrophic.

Here is the actual path a forwarded email travels through AiCal, from raw intent to a safe outcome.

The production prompt pipeline, end to end

A forwarded email arrives carrying a fuzzy human goal: turn this into a calendar event. The input is free-form text I do not control, so I treat it as hostile by default. The tip here is to assume the worst case is the normal case; most of a real inbox is not actually an invitation.

So here's the literal definition: a "clever prompt" has none of those pieces; a product has all of them.

AiCal stores nothing: it reads the email, makes the event, and forgets. That design forced a kind of discipline I am grateful for, because every classification has to be right in one pass with no history to lean on. The constraint is what taught me most of this. If you want to see the structured-output approach running in a real product, AiCal and the rest are on /#projects, and I write up the failures as they happen on /blog. The demos are easy. The inbox is the test.


Come see how these prompts behave in real products on adatepe.dev. I'm always happy to trade hard-won lessons.

Built by Alperen. I ship LLM features that survive a real inbox, not demos. Full-stack engineer, M.Sc. CS at LMU Munich.