The first AI feature I shipped on adatepe.dev leaked my Gemini API key into the client bundle. Not in a dramatic way, no exploit, no bill spike, but I opened the network tab one evening, saw the model call firing straight from the browser, and realised the key was sitting right there in a NEXT_PUBLIC_ variable because that was the fastest way to get a demo working. That was the moment App Router stopped being a routing convenience for me and became a security boundary I had to think about on every feature.

This post is about the boundaries. Next.js App Router gives you Server Components, Server Actions, Route Handlers, and Client Components, and the entire difficulty of building AI features cleanly comes down to putting each piece of the work on the correct side of the network. Get that wrong and you leak keys, blow your latency budget, or render half-parsed JSON to a user. Get it right and the rest is just plumbing.

Before I get to the rules, a confession from my own first attempt is worth pausing on, because the mistake is so easy to make that you should see exactly how it happens before you trust yourself not to repeat it. My very first AI feature read its credentials from a NEXT_PUBLIC_ variable, the fastest way to get a demo working, and that meant the API key itself was sitting in the browser bundle.

The server is the only place the model exists

The rule I follow now is blunt: the model provider SDK is imported in exactly one kind of file, and that file never ships to the browser. In App Router that means a Server Component, a Server Action, or a Route Handler. The client knows the feature exists. It does not know which model, which key, or which prompt produced the answer.

For adatepe.dev I keep a single server-only module that owns the provider client. The server-only package is the cheap insurance here: if anything ever imports that module from a Client Component, the build fails instead of silently bundling secrets.

import "server-only";
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

export function getModel(name = "gemini-2.5-flash") {
  return genAI.getGenerativeModel({ model: name });
}

Everything AI-shaped flows through that module. The benefit is not aesthetic. It means there is exactly one place to add rate limiting, one place to swap models, one place to attach logging, and zero chance the key ends up in a .js chunk a crawler can read. When I later added model routing (a cheap model for short inputs, a stronger one for long ones), I changed one function and every feature inherited it.
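A routing rule of that kind can be very small. The following is a sketch under stated assumptions, not the site's actual code: the `pickModelName` helper, the 2,000-character threshold, and the choice of `gemini-2.5-pro` as the stronger model are all hypothetical.

```typescript
// Hypothetical routing rule: short inputs go to the cheap model,
// long inputs go to a stronger one.
const SHORT_INPUT_MAX_CHARS = 2_000; // assumed threshold, tune against real traffic

export function pickModelName(input: string): string {
  return input.length <= SHORT_INPUT_MAX_CHARS
    ? "gemini-2.5-flash" // the default used by getModel()
    : "gemini-2.5-pro"; // assumed "stronger" model name
}
```

A feature would then call something like `getModel(pickModelName(userInput))`, so the routing decision stays inside the one server-only module.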

Server Actions versus Route Handlers: pick by shape, not habit

People ask which one to use for AI, as if it were a style preference. It isn't. They have different shapes and the shape should follow the request.

Server Actions fit the request-then-render case: a form submits, the model produces something, the page re-renders with the result. No streaming, no fancy progress, just a mutation. My contact-intent classifier on the site is a Server Action: it takes a message, runs a quick classification, writes a row, and returns. The user never sees tokens dribble in and would not benefit if they did.

Route Handlers are what you reach for the moment you want streaming, because you build the Response yourself and can hand back a ReadableStream as its body. The "Proof Oracle" feature on adatepe.dev, which streams a structured, sectioned answer about my work, lives in a Route Handler precisely because I want the user watching the response build rather than staring at a spinner for eight seconds. A streaming response tends to feel faster than a blocking one even when total time is identical, and with LLMs the total time is genuinely long, so you get both the perception win and a real one: the first section is readable long before the last one has been generated.

The first time I wired one of these up at university, the shape surprised me by how ordinary it was, so let me pull a minimal streaming handler apart line by line. There is less magic here than the framework marketing implies.

A streaming route handler, decoded

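import { streamText } from "ai"; // Vercel AI SDK

// `model` is a provider model instance configured in a server-only module.
export async function POST(req: Request) {
  const { prompt } = await req.json();
  const result = streamText({ model, prompt }); // starts streaming right away
  return result.toTextStreamResponse(); // plain-text streaming Response
}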

No special runtime, no magic. The App Router route handler reads the request body the same way any API route does. The only difference is what you return.

That whole handler is four lines and none of them are exotic. The interesting decisions are not in the plumbing; they are in choosing the right tool for each shape of work.

A rough decision rule that has held up for me:

One-shot result that updates the page → Server Action.
Streaming, token-by-token, or a long-running structured generation → Route Handler with a stream.
Need it callable by something that isn't your own UI → Route Handler, because it is just an HTTP endpoint.
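The same rule fits in a few lines of code, which is a useful check that it really is a rule and not a mood. This is only a sketch: the `FeatureShape` fields and the `pickPrimitive` name are made up for illustration.

```typescript
type Primitive = "server-action" | "route-handler";

// Hypothetical description of a feature's shape.
interface FeatureShape {
  streams: boolean; // token-by-token or long-running structured output
  calledOutsideOwnUi: boolean; // webhooks, other services, scripts
}

export function pickPrimitive(shape: FeatureShape): Primitive {
  // Streaming and external callers both need a plain HTTP endpoint.
  if (shape.streams || shape.calledOutsideOwnUi) return "route-handler";
  // One-shot result that updates the page.
  return "server-action";
}
```

Under this sketch the contact-intent classifier maps to `"server-action"` and the Proof Oracle to `"route-handler"`.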

With the routing settled, the next problem is what you stream once you have chosen a handler.

Streaming structured output is where it actually gets hard

The naive streaming demo streams plain prose, appends each chunk to a string, and dumps it in a <div>. Fine for a chatbot toy. The Proof Oracle is not prose: it emits structured sections, and I want those rendered as they arrive, not after the whole thing finishes. That is the genuinely hard part of AI in App Router, and nobody's "build a chatbot in 5 minutes" tutorial covers it.

The failure mode is obvious once you hit it: a JSON object that is 60% streamed is not valid JSON. You cannot JSON.parse a half-finished object. So you either wait for the full payload, losing the entire point of streaming, or you parse partial structures. I went with partial parsing on the client and a strict re-validation on the server once the stream closes.

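// Assumed import paths: getModel is the server-only module shown earlier,
// buildPrompt is a local prompt helper.
import { getModel } from "@/lib/ai/model";
import { buildPrompt } from "@/lib/ai/prompts";

export async function POST(req: Request) {
  const { subject } = await req.json();
  const result = await getModel().generateContentStream(buildPrompt(subject));

  const stream = new ReadableStream({
    async start(controller) {
      const encoder = new TextEncoder();
      try {
        // Forward each chunk as soon as the provider yields it.
        for await (const chunk of result.stream) {
          controller.enqueue(encoder.encode(chunk.text()));
        }
        controller.close();
      } catch (err) {
        // A mid-stream provider failure errors the stream instead of hanging it.
        controller.error(err);
      }
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}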

On the client I accumulate the buffer and run a tolerant parser that extracts whatever complete fields exist so far. When the stream ends, the server validates the full object against a schema before anything trusts it; the client's partial parse is for display only, never for anything that writes to a database. This split matters: optimistic rendering on the client, authoritative validation on the server. Treat anything the client parsed mid-stream as a guess.
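One way to build such a tolerant parser is to scan the buffer and pull out only the objects that have fully arrived. The sketch below assumes the payload is a JSON array of section objects; that shape and the `completeObjects` name are assumptions for illustration, and a real parser may want to expose partially received fields too.

```typescript
// Extract every fully received object from a partially streamed JSON array.
// Display-only: nothing returned here has been validated.
export function completeObjects(buffer: string): unknown[] {
  const out: unknown[] = [];
  let depth = 0; // nesting depth of {} and []
  let inString = false; // currently inside a JSON string literal
  let escaped = false; // previous character was a backslash inside a string
  let start = -1; // index where the current array element began

  for (let i = 0; i < buffer.length; i++) {
    const ch = buffer[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === "{" || ch === "[") {
      if (ch === "{" && depth === 1) start = i; // object directly inside the outer array
      depth++;
    } else if (ch === "}" || ch === "]") {
      depth--;
      if (ch === "}" && depth === 1 && start !== -1) {
        try {
          out.push(JSON.parse(buffer.slice(start, i + 1)));
        } catch {
          // Malformed element: skip it rather than break rendering.
        }
        start = -1;
      }
    }
  }
  return out;
}
```

Calling this on every new chunk re-scans the whole buffer, which is fine for short answers; a longer stream would want to remember the last consumed index.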

It helps to see the whole journey laid out, because every box in it is a place a key, a latency budget, or a half-parsed payload can leak. Here is the path a single Oracle request actually takes from the user's keypress to rendered text:

The lifecycle of one AI request

The Client Component captures the subject and POSTs it. It knows the feature exists but never which model, key, or prompt is behind it. Keep the client dumb on purpose; the less it knows, the less there is to leak into a bundle a crawler can read.

Each stage there is a decision I made deliberately, and most of the rest of this post is just zooming into one box at a time.

Before you decide streaming is a nice-to-have you can bolt on later, picture the version where you skip it entirely. You add an LLM call to a Next.js route handler and await the full response before returning. The model takes 9 seconds. For those nine seconds the user sees nothing at all: a blank screen.

That blank screen is exactly why I treat streaming as the default for any generation that takes real time, so it is worth being precise about what the App Router actually does for you here.

Streaming versus blocking, and why the App Router defaults to motion

The difference between blocking and streaming is the difference between a feature that feels broken and one that feels alive. When you await a full completion in a Route Handler before returning, the response does not exist until the last token lands, so the user gets nothing, no bytes, no paint, for the entire generation. At nine seconds that reads as a hang, and people reload or leave. I have watched my own session replays do exactly that on an early version of the Oracle before I streamed it.

Streaming flips the perceived speed even when the total time is identical. The moment the first token arrives the screen starts moving, and motion reads as progress in a way a spinner never does. The practical pattern in the App Router is the one I showed above: a Route Handler that returns a ReadableStream, enqueuing each chunk as the provider yields it instead of buffering the whole thing. On the rendering side you can lean on Suspense and streamed Server Components for the surrounding shell, so the page frame paints instantly and only the generated region fills in progressively.

Put the two route handlers side by side and the gap stops being abstract:

export async function POST(req: Request) {
  const { prompt } = await req.json();
  const completion = await model.generate(prompt); // 9s wall
  return Response.json({ text: completion.text });
}

Same model, same latency. The streaming version just refuses to make the user stare at nothing.

What strikes me looking at that pair is how little code separates the two outcomes.

What makes this comfortable in App Router specifically is that streaming is not a bolted-on trick. The framework was built around progressive rendering, so handing back a stream from a Route Handler or suspending a Server Component are first-class paths, not workarounds. You opt out of streaming by awaiting; you opt in by yielding. Once that clicked for me, blocking a long generation started to feel like the unusual choice, and streaming became the obvious one for anything the user waits on.

The two-line difference that buys the cheapest UX win

Look closely at the comparison above and the whole change is two lines. The blocking handler does await model.generate(prompt) and returns JSON, so nothing reaches the browser until the final token lands. The streaming handler asks for model.stream(prompt) and hands the readable stream straight back to the client. Same model, same key, same total seconds of compute. The user experience could not be further apart.
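To make the two-line difference concrete, here are both handlers against a stand-in model. This is a sketch: `model.generate` and `model.stream` are the hypothetical names used in the prose, not a real SDK surface, and the fake model simply echoes the prompt word by word.

```typescript
// Stand-in for a provider client (hypothetical API, echoes the prompt).
const model = {
  async generate(prompt: string): Promise<{ text: string }> {
    return { text: `echo: ${prompt}` };
  },
  stream(prompt: string): ReadableStream<Uint8Array> {
    const encoder = new TextEncoder();
    const words = `echo: ${prompt}`.split(" ");
    return new ReadableStream<Uint8Array>({
      pull(controller) {
        const next = words.shift();
        if (next === undefined) controller.close();
        else controller.enqueue(encoder.encode(next + " "));
      },
    });
  },
};

// Blocking: nothing reaches the browser until the whole completion exists.
export async function blockingPOST(req: Request): Promise<Response> {
  const { prompt } = await req.json();
  const completion = await model.generate(prompt);
  return Response.json({ text: completion.text });
}

// Streaming: the two changed lines. Ask for a stream, hand it straight back.
export async function streamingPOST(req: Request): Promise<Response> {
  const { prompt } = await req.json();
  const stream = model.stream(prompt);
  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

Only the last two statements of each handler differ; everything the user feels comes from that difference.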

That is why I reach for streaming first on adatepe.dev. It costs me almost nothing to write: I do not have to tune the model, batch differently, or pay for faster inference. I just stop buffering the answer and let the bytes flow as they are produced. The perceived speed goes from a frozen tab to a screen that is visibly working, and I changed two lines to get it. Most UX wins on an AI feature ask for real effort. This one asks for restraint: you simply refuse to await the thing you could be streaming. When I weigh effort against payoff across everything I have shipped, nothing else comes close.

Latency is a budget, and the model spends most of it

When I instrument an AI route, the model call dominates everything else by orders of magnitude. The database query is single-digit milliseconds. The model is seconds. That changes how you think about the request.

A few things that earned their place:

Run independent work in parallel. If you need a DB lookup and a model call and they don't depend on each other, Promise.all them. Sequencing them is just adding the model latency to something that was free.
Don't await what the user doesn't see. Logging the request, writing analytics, firing a notification, none of that should block the response. App Router lets you stream the answer back and let the side effects finish after.
Set timeouts and own the failure. Provider calls hang. A request that never resolves is worse than one that fails fast with a fallback. I wrap model calls with an abort timeout and a degraded path.
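The first of those points is the easiest to show. In this sketch `loadProfile` and `callModel` are hypothetical stand-ins that just sleep, so the only thing being illustrated is the difference between sequencing and `Promise.all`.

```typescript
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-ins for a DB lookup and a model call.
async function loadProfile(): Promise<string> {
  await sleep(10);
  return "profile";
}
async function callModel(): Promise<string> {
  await sleep(40);
  return "answer";
}

// Sequential: total wait is roughly the sum of both calls.
export async function sequential() {
  const profile = await loadProfile();
  const answer = await callModel();
  return { profile, answer };
}

// Parallel: total wait is roughly the slower call, which is the model.
export async function parallel() {
  const [profile, answer] = await Promise.all([loadProfile(), callModel()]);
  return { profile, answer };
}
```

Both functions return the same data; the parallel one simply stops charging the user for the database lookup on top of the model.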

The degraded path is not optional. The Resend incident in our own runbook taught me that some SDKs do not throw on failure; they hand you back an error field you have to inspect. Model providers are similar: rate limits and content-policy blocks can come back as a "successful" response with an empty or refused body. If your code assumes a non-empty completion, that assumption is a future 500.
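A small guard makes that assumption explicit. The `ModelResult` shape below is a simplified, hypothetical one (real SDK responses differ per provider), and the `"SAFETY"` label is assumed here as the marker for a policy refusal.

```typescript
// Hypothetical, simplified result shape.
interface ModelResult {
  text?: string;
  finishReason?: string; // assumed labels such as "STOP" or "SAFETY"
}

// Returns usable text, or null when the "successful" response is not usable.
export function usableText(result: ModelResult): string | null {
  if (result.finishReason === "SAFETY") return null; // refused by content policy
  const text = result.text?.trim() ?? "";
  return text.length > 0 ? text : null; // empty body counts as a failure
}
```

A `null` here should route to the same fallback as a thrown error, so there is one degraded path rather than two.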

Cancellation, because users leave

The case nobody writes a tutorial about is the one where the user navigates away mid-generation, and on a streaming AI feature that case is not rare, it is constant. Model calls are slow, and a slow response is exactly the kind a user abandons. If your route handler keeps the generation running after the client is gone, you are paying a provider for tokens nobody will ever read, and on a serverless platform you may be holding an invocation open for no reason. Both cost money, and the cost scales with exactly the impatience that long generations produce.

So I wire the request's abort signal through to the model call. App Router gives you the request, the request carries a signal, and that signal fires when the client disconnects. Passing it down means a user closing the tab actually stops the work, instead of leaving an orphaned generation burning tokens against a dead connection. It is a small piece of plumbing that most demos skip because demos never have a user who leaves, and production has nothing but users who leave.
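Here is roughly what that plumbing looks like, using a fake generator in place of a provider call. `fakeGenerate` is a hypothetical stand-in; with a real SDK the signal would go into whatever abort option that SDK accepts.

```typescript
// Hypothetical provider stand-in: yields tokens until done or until aborted.
async function* fakeGenerate(signal: AbortSignal): AsyncGenerator<string> {
  for (let i = 0; i < 1000; i++) {
    if (signal.aborted) return; // stop producing (and paying for) tokens
    await new Promise((resolve) => setTimeout(resolve, 5));
    yield `token${i} `;
  }
}

export async function POST(req: Request): Promise<Response> {
  const encoder = new TextEncoder();
  // The request's own signal is passed straight down to the generation.
  const tokens = fakeGenerate(req.signal);

  const stream = new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { value, done } = await tokens.next();
      if (done) {
        controller.close();
        return;
      }
      controller.enqueue(encoder.encode(value));
    },
    async cancel() {
      // The consumer went away: end the generator so no more work is scheduled.
      await tokens.return(undefined);
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

The important line is the one that passes `req.signal` down; the `cancel` hook covers the case where the stream's reader is dropped without the signal firing.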

The same discipline applies to the timeout I mentioned earlier, and the two compose. A generation should end when the user gives up, when the provider hangs, or when it finishes, whichever comes first, and every one of those three needs an owner in your code. Treating "it completed normally" as the only exit is how you discover, on a billing statement, that a meaningful slice of your traffic was abandoned mid-stream and you paid for all of it. The interactive nature of these features is not just a UX concern. It is a cost-control concern, because the user's attention and your token budget run out at the same unpredictable moment.
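The three exits can share one signal. This is a sketch of one way to compose them by hand; the `withDeadline` name is hypothetical. Newer runtimes also offer `AbortSignal.timeout` and `AbortSignal.any`, which can express the same idea where they are available.

```typescript
// Combine "user left" and "took too long" into one signal for the model call.
export function withDeadline(
  parent: AbortSignal,
  ms: number,
): { signal: AbortSignal; dispose: () => void } {
  const controller = new AbortController();
  const onParentAbort = () => controller.abort(parent.reason);

  // Exit 1: the provider hangs past the ceiling.
  const timer = setTimeout(
    () => controller.abort(new Error(`deadline of ${ms}ms exceeded`)),
    ms,
  );

  // Exit 2: the user gives up.
  if (parent.aborted) onParentAbort();
  else parent.addEventListener("abort", onParentAbort, { once: true });

  return {
    signal: controller.signal,
    // Exit 3: normal completion. Call this so nothing fires afterwards.
    dispose() {
      clearTimeout(timer);
      parent.removeEventListener("abort", onParentAbort);
    },
  };
}
```

In a handler this would wrap `req.signal`, with `dispose()` called in a `finally` once the stream closes.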

Caching the answer, when the answer is stable

The previous two sections were about generations that run too long or get abandoned. The cheapest generation is the one you never make. Some AI outputs are worth caching, and on App Router that is easy to forget, because the model call feels like the whole feature. But if the same input deterministically produces the same answer, storing it keyed by a hash of that input and skipping the model on a repeat is both a latency win and a cost win. The second caller gets an instant response, and you do not pay a provider for a completion you already have.

The question is when this is safe. It is safe when the output is deterministic, fully derived from the input, and not personalized: summarizing a fixed document, classifying a snippet, generating a description for a product that does not change. For those, I hash the input, look it up, and only call the model on a miss. The hard rule is that the cache key must include everything that changes the output. The prompt text, the model name, the temperature, the system instructions, the schema version: if any of those would change the answer, they belong in the key. Leave one out and you serve a stale answer the moment you tweak a prompt, which is a bug that hides for weeks because it only shows on cache hits.
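A minimal sketch of that key, assuming a Node runtime and an in-memory `Map` standing in for a real store. The `GenerationParams` fields mirror the list above; the names themselves are hypothetical.

```typescript
import { createHash } from "node:crypto";

interface GenerationParams {
  model: string;
  temperature: number;
  system: string;
  schemaVersion: string;
  prompt: string;
}

// Everything that can change the output goes into the key, in a fixed order.
export function cacheKey(p: GenerationParams): string {
  const material = JSON.stringify([
    p.model,
    p.temperature,
    p.system,
    p.schemaVersion,
    p.prompt,
  ]);
  return createHash("sha256").update(material).digest("hex");
}

const cache = new Map<string, string>(); // stand-in for a real cache store

export async function cachedGenerate(
  p: GenerationParams,
  generate: (p: GenerationParams) => Promise<string>,
): Promise<string> {
  const key = cacheKey(p);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // repeat caller: no model call, no cost
  const fresh = await generate(p);
  cache.set(key, fresh);
  return fresh;
}
```

Serializing the fields as an array keeps the ordering stable, so two logically identical requests cannot hash differently because of object key order.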

Caching is dangerous the moment the output depends on something the key does not capture. Anything user-specific is off the table unless the user is part of the key, and even then I am cautious, because caching personalized AI output is one leaky key away from showing one person another person's answer. Time-sensitive output is the other trap: if the right answer depends on "now," a cache hands you yesterday's. When in doubt I treat the response as uncacheable. A stale answer that looks confident is worse than a slow one.

Validate at the boundary, every time

The model output is untrusted input. I cannot say this strongly enough. It does not matter that you wrote a careful prompt: the boundary between "model said" and "my app does" is exactly as dangerous as the boundary between "user typed" and "my app does," and it deserves the same Zod schema treatment.

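import { z } from "zod";

// One section of the Oracle's answer: the contract the prompt must satisfy.
const ProofSection = z.object({
  heading: z.string().min(1),
  claim: z.string().min(1),
  evidence: z.array(z.string()).min(1),
});

// Inside the route handler, once the full payload has arrived.
// modelJson is the JSON-parsed model output and is still untrusted here.
const parsed = ProofSection.array().safeParse(modelJson);
if (!parsed.success) {
  return fallbackResponse(); // degraded path instead of rendering garbage
}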

If the parse fails, you do not render garbage and you do not throw an unhandled error into the user's face. You fall back. The schema is also documentation: it is the contract the prompt is trying to satisfy, and when I change the prompt I change the schema in the same commit. The two drift apart the instant you let them.

Before you ship an AI feature in App Router, walk it through this. Each unticked box is a leaked key, a blown latency budget, or half-parsed JSON in a user's face waiting to happen:

Is your AI feature actually production-shaped?

Handling failure: timeouts, retries, and graceful degradation

Model calls fail in ways ordinary HTTP calls rarely do. They time out, they stall halfway through a stream, they come back rate-limited, or they hang so long the user has given up before the first token lands. On adatepe.dev, around my BMW work and my M.Sc. studies at LMU Munich, the failures I lost the most time to were not the loud ones. They were the silent stalls, where the request just sat there resolving nothing while the user stared at a spinner.

The first defense is a timeout on every model call. A provider that hangs forever is worse than one that errors fast, because at least an error gives you something to act on. I wrap each call in an abort signal with a hard ceiling, so a stuck generation gets cut loose instead of holding an invocation open and a user hostage.
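A wrapper along these lines is one way to get that ceiling. It is a sketch, not the site's code: `withTimeout` is a hypothetical helper, and it assumes the wrapped task accepts an `AbortSignal` and that a fallback value is acceptable on any failure.

```typescript
// Run a signal-aware task under a hard ceiling; fall back on timeout or error.
export async function withTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  ms: number,
  fallback: () => T,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const ceiling = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // tell the task to stop working
      reject(new Error(`model call exceeded ${ms}ms`));
    }, ms);
  });

  try {
    // Whichever settles first wins, even if the task ignores the signal.
    return await Promise.race([task(controller.signal), ceiling]);
  } catch (err) {
    console.warn("model call failed, using fallback:", err); // log it, never swallow it
    return fallback();
  } finally {
    clearTimeout(timer);
  }
}
```

Racing against the ceiling matters because some calls never check the signal; aborting alone would not free the request.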

Retries help with transient failures, but only if you are disciplined about them. I retry on the errors that are actually worth retrying, the timeouts and the rate limits, not on a content-policy refusal that will fail identically every time. I back off between attempts so I am not hammering a provider that is already struggling, and I cap the retries hard. Two attempts, maybe three. An unbounded retry loop on a slow model call is just a slow stall wearing a different hat, and it multiplies the latency budget I wrote about earlier. If you care about what those retries cost you, controlling LLM cost is the other half of this conversation, because every retry is a completion you pay for.
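Those three constraints (retry only what is retryable, back off, cap hard) fit in one small helper. The `ModelError` class and its `kind` labels are hypothetical; real providers signal these conditions in their own ways, so the classification step would need adapting.

```typescript
// Hypothetical error classification for illustration.
type FailureKind = "timeout" | "rate_limit" | "policy_refusal";

export class ModelError extends Error {
  kind: FailureKind;
  constructor(kind: FailureKind) {
    super(kind);
    this.kind = kind;
  }
}

// A policy refusal fails identically every time, so it is not in this set.
const RETRYABLE = new Set<FailureKind>(["timeout", "rate_limit"]);

export async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const retryable = err instanceof ModelError && RETRYABLE.has(err.kind);
      // Hard cap: give up on non-retryable errors or after the last attempt.
      if (!retryable || attempt >= maxAttempts) throw err;
      // Exponential backoff: baseDelayMs, then double it each attempt.
      const delay = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Whatever catches the final throw is where the fallback belongs, so retries and degradation stay separate concerns.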

When the retries are spent, the user gets a real fallback, not a spinner that never stops. A cached previous answer, a simpler non-AI path, a plain "this could not generate right now, here is what I can show you instead." Anything that respects the fact that a person is waiting.

Counting matters more than it sounds. On adatepe.dev every model call goes through the one-line JSON logger our runbook standardizes on, so a single grep over the Vercel logs tells me the real failure mix instead of my guess at it. The numbers surprised me. When I first instrumented the Oracle I assumed rate limits were my main pain, but over a representative week the breakdown was roughly seventy percent clean completions, eighteen percent abort signals from users who navigated away mid-stream, nine percent timeouts I had set too aggressively at four seconds, and barely three percent actual provider rate limits. That changed my priorities entirely. I had been about to add an expensive retry-with-backoff layer to fight rate limits that were almost a rounding error, when the real money was leaking through my own four-second ceiling cutting off generations that would have finished at six. I raised the timeout to ten seconds, wired the abort signal through properly so the eighteen percent stopped billing me, and the retry work dropped down the list. The lesson is the same one the cost playbook keeps hammering: you cannot tune what you have not measured, and AI failures lie about their own shape until you log them. Here is the exact command I run when I want that breakdown:

Triaging AI route failures from the Vercel logs

Read that last line carefully: when every timeout lands on your ceiling to the millisecond, the provider was not hanging, your ceiling was too low. That is the rule under all of this: never swallow the error silently. Log it, surface it, count it. A failure you hide is a failure you cannot fix, and with model providers the quiet ones are the expensive ones.
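Counting does not need much tooling. As an illustration of the idea rather than the command referred to above, here is a small tally over one-line JSON log entries. The `outcome` field name is an assumption; a real logger may call it something else.

```typescript
// Count log lines by outcome. Assumes one JSON object per line with a
// string `outcome` field (hypothetical field name).
export function tallyOutcomes(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    let outcome = "unknown";
    try {
      const entry = JSON.parse(line);
      if (typeof entry?.outcome === "string") outcome = entry.outcome;
    } catch {
      // Malformed lines are counted as "unknown" rather than dropped.
    }
    counts.set(outcome, (counts.get(outcome) ?? 0) + 1);
  }
  return counts;
}
```

Keeping an "unknown" bucket is deliberate: a line that cannot be classified is still a failure mode worth seeing.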

What I would tell myself a year ago

Keep the provider on the server behind one module. Choose Server Actions for mutations and Route Handlers for streams, by shape not by habit. Assume the model output is hostile and validate it with a schema at the boundary. Budget latency around the model, not around your database. And put a real fallback on every model call, because providers fail in ways that don't look like failures.

None of this is exotic. It is the same discipline you would apply to any untrusted external dependency, which is exactly what a language model is. If you want to see the patterns running rather than described, the AI features are live across /#projects, and I keep adding write-ups as I learn more on /blog. The architecture has held up under real traffic, which is the only review that counts.

I've been through all of this shipping these AI features on adatepe.dev around my BMW work and LMU studies, and I'm glad to talk shop.

Built by Alperen, full-stack engineer, M.Sc. CS at LMU Munich. These AI features run in production, under real traffic.