The first time I really trusted Claude Code, it was 1 a.m. and I was watching it walk through a failing test in my portfolio repo, adatepe.dev, a Next.js 16 / React 19 app with a Bun test suite that has grown to roughly 1,280 tests. It hadn't just patched the symptom. It read the assertion, traced back into the i18n config, noticed that one of my 14 locale files was missing a key the typed contract required, added it, re-ran the uniformity check, and only then told me it was done. I didn't write a clever prompt to get that. I'd spent a few weeks building the scaffolding around it so that "done" had a precise, mechanical meaning.

That is the whole game. People talk about Claude Code prompting as if the magic is in the wording. In my experience the wording barely matters past a certain baseline. What matters is the environment you drop the model into: what it can read, what it must run, and what counts as finished. This is a writeup of how I actually use it day to day, on a real repo, not a list of prompt tricks.

pollWhat do you think makes an AI coding agent reliable?

The whole approach only works because "done" is mechanical, not a feeling. Here is what that gate guards on this site:

0tests that must pass

0locale files kept uniform

0ordered checks in the gate

0the model gets to skip

What a "skill" actually buys you

A skill, in the Claude Code sense, is a piece of reusable context, a markdown file with instructions and sometimes scripts, that the model loads when it's relevant. The naive read is that it's a fancy prompt snippet. The useful read is that it's a way to stop re-explaining your repo every single session.

That framing only makes sense if you map it to how often the task actually repeats, so play with the numbers before you commit:

try itWhy a skill is worth writing once

Times you hit the same task per month: 12 times

Minutes saved per month (about 3 each once it is a skill)36
One-time minutes to write the skill0

36saved at this volume

A skill pays for itself the third time you would have retyped the same prompt. The repeated tasks are exactly the ones worth encoding. I build these workflows.

See how

The slider makes the obvious point concrete, but it leaves the harder question untouched, which is when writing a skill is a mistake at all.

I have a CLAUDE.md at the root of adatepe.dev that is brutally specific. The single most valuable line in it is not advice, it's a command:

bun run verify

That runs, in order: prettier --write → tsc --noEmit → eslint → prettier --check. The file then says, in plain language, that all four must exit clean and that the model must fix every issue the pipeline reports, not just the ones related to its change. No // eslint-disable, no @ts-ignore, no --no-verify. That paragraph has saved me more cleanup time than any prompt I've ever written, because it converts a fuzzy social expectation ("write good code") into a binary the model can check itself against.

The reason this works is boring: the model is far better at satisfying a checkable contract than at guessing your taste. Give it taste to guess and you get plausible-looking output. Give it tsc --noEmit and you get code that compiles.

A quick gut check on trusting an agent:

your guessThe model reports: I have implemented the feature and everything looks good. Meanwhile three files do not typecheck. Was it lying to you?

The validation gate is the product

If you take one thing from this, take this: your validation gate is not a chore you bolt on afterward. It is your Claude Code workflow. Everything else is decoration.

Here's why it matters more with an agent than with a human. A human who breaks the build feels the friction immediately, red squiggles, a failed CI run, a teammate's Slack message. The model feels nothing. It will happily report "I've implemented the feature and everything looks good" while three files don't typecheck, because "looks good" is a narrative it generated, not a fact it measured. The fix is to make completion depend on something it cannot talk its way around.

My gate has a deliberate order, and the order is not cosmetic:

Formatter first. Prettier rewriting files before typecheck means I never waste a tsc run on a diff that's about to change anyway.
Typecheck second. This catches the largest category of agent mistakes, a renamed prop, a wrong import, a function called with the old signature.
Lint third. Once it compiles, lint catches the stylistic and correctness-adjacent stuff.
Format check last. A final prettier --check proves nothing drifted.

For anything that touches behavior, the test suite is part of "done" too. With ~1,280 tests, a green run is a genuinely strong signal, and the model knows it. When I tell it "behavior changed but no test changed means the work isn't finished," it stops shipping confidence and starts shipping coverage.

The honest version of "trust the AI" is "trust the gate the AI has to pass." I don't trust the model. I trust bun run verify.

The order of the gate is not cosmetic. Step through what each stage is for:

compareThe gate, in order

Prettier rewrites files first. Running the formatter before the typecheck means I never waste a tsc run on a diff that is about to change anyway.

It also keeps formatting noise out of the real diff, so the change I am reviewing is the change, not a reflow.

Reading about the gate is one thing; watching it run is another. Here is what a clean pass actually looks like in my terminal when the model declares a task done and I make it prove it:

What bun run verify prints on a clean pass

Note the silence in the middle: tsc --noEmit and eslint print nothing on success, and that absence of output is the whole signal. If any stage spoke up, the task would not be done, and the model knows it cannot narrate its way past an exit code.

How I decompose work so the model doesn't drift

The failure mode I see most often isn't bad code, it's too much code. Ask for something vague and Claude Code will enthusiastically refactor four files you didn't mention. So I scope tasks the same way every time, and I keep them small.

A task I hand it looks roughly like this:

Outcome: one user-visible result, one sentence.
Files: the exact paths I expect to change.
Validation: the specific checks that must pass.
Constraints: what must not change.

A real one from adatepe.dev:

Add the /blog index route with SEO metadata. Touch app/blog/page.tsx, app/lib/blog.ts, app/sitemap.ts. Done means: the route renders, metadata includes a canonical URL, the sitemap test passes, and bun run verify is clean. Do not touch the nav components.

The "do not touch" line does real work. Without it the model will wander into the header to "improve" the navigation while it's nearby. With it, the diff stays reviewable, and a reviewable diff is the only kind I'll actually merge. You can see the result of this discipline across the projects on /#projects, the commit histories are boring in the good way.

Why small tasks beat clever prompts

Small tasks give the gate something tight to bite on. When a 40-line change fails typecheck, the error points almost directly at the cause. When a 600-line change fails, the model has to guess which of its many decisions broke things, and its guesses compound. I'd rather run the loop ten times on ten small tasks than once on a monster.

Skill files rot, so I treat them like code

The thing nobody tells you about skill files is that they go stale, and a stale skill file is worse than no skill file because it lies to you with the full authority of something written down. I had a skill doc that described my i18n setup, and at some point the setup changed and the doc did not. For a while the model kept following the old instructions confidently, producing changes that matched a reality that no longer existed. The doc was not neutral when it was wrong. It actively steered the work in the wrong direction, and it did so more persuasively than a vague prompt would have, precisely because it was specific.

Before I can keep a skill file honest, it helps to be precise about what one actually is. As a working student at BMW and an M.Sc. CS student at LMU Munich, I have read enough half-baked skill docs to know the structure carries most of the weight: a skill is just frontmatter plus a body, and each part has one job.

annotatedA skill file, decoded

---
name: changelog-writer
description: Use when the user asks
  to draft release notes from commits.
---

Read the git log since the last tag,
group by type, write user-facing notes.

Short, kebab-case, unique. This is what the assistant matches against when it decides a skill applies, so it should read like the task it does.

With that anatomy in view, the failure mode is obvious: the frontmatter promises one thing and the body slowly describes another. That gap is exactly the rot I keep running into. So I stopped thinking of skill files as personal notes and started treating them as code. They live in the repo, not in my head or a private scratchpad. They get reviewed when they change, in a pull request, the same as any source file. And when reality drifts from what a doc says, fixing the doc is part of the work that caused the drift, not a someday chore. A skill file that nobody keeps current is technical debt with a friendly face, and the interest it charges is paid in confidently wrong changes that pass review because the reviewer trusted the doc too.

The deeper point is that a skill file is a contract between me and the model about how my repo works, and a contract that silently diverges from reality is how trust quietly breaks. The whole value of writing the context down is that I stop re-explaining it every session. That value inverts the moment the written version is wrong, because now I am not saving explanation, I am injecting a confident falsehood into every session that loads it. Keep the docs honest by reviewing them like code, or do not keep them at all. There is no safe middle where a doc is mostly right and nobody is responsible for the part that is not.

Browser QA, because green tests lie about UI

Unit tests are green and the page looks broken. This happens constantly with anything animated, themed, or layout-sensitive. My CSS theme tokens (--bg, --text, --border-card) can be perfectly valid TypeScript and still render an unreadable dark-mode card.

So for UI tasks I add a second gate: the model drives a real browser, loads the changed page, checks that the elements it touched are visible and interactive, flips dark/light, and reads the console for errors. It's slower, and I only do it for visual work, but it catches the exact class of bug the test suite is structurally blind to. If a feature has route transitions or responsive behavior, I treat browser QA as non-optional.

Recovery: what I do when a branch goes bad

It will go bad. The difference between a productive session and a frustrating one is how fast you bail out of a bad branch. My instinct used to be "ask for a full rewrite." That's almost always wrong, a rewrite throws away the parts that worked and reintroduces variance everywhere.

What I do instead:

Stop. No more edits on top of broken state.
Paste the exact failing command output back in.
Re-scope to the single failing area.
Ask for a root-cause explanation before any patch, if it can't explain why it broke, it can't reliably fix it.
Apply a minimal fix.
Re-run the full gate.

The "explain before patching" step is the one people skip and the one that pays off most. A model that's forced to articulate the cause stops shotgunning random changes at the symptom.

The number that convinced me to be ruthless about bailing early: on one bad branch I let the model keep patching on top of a broken state for what turned out to be eleven follow-up edits before I stopped it. When I finally read the diff, three of those edits were fixes for problems the previous edits had introduced, not the original bug at all. The original failure was a single wrong import path. Eleven edits to fix one wrong import, because each patch was reasoning against a tree that was already poisoned by the last one. Once I reset and pasted only the first failing command output back in, it found the import in one pass. The lesson stuck: edits compound their own errors, so the cost of a bad branch is not linear, it grows with every turn you stay on it.

There is a specific gotcha here that bit me more than once. When you paste failing output back in, paste the first error, not the last. A long failure trace often ends with a cascade of secondary errors that are just fallout from the first one, and if you hand the model the tail of that trace it will earnestly fix the symptom at the bottom while the real cause sits untouched at the top. This is the same discipline I lean on when the failure is a type error rather than a runtime one, which is why I keep the type boundaries strict enough to fail loud and early; more on that in my writeup on TypeScript guardrails for AI-generated code. A loud, early, single error is the cheapest thing you can hand a recovering agent.

Things I've stopped doing

A few habits I've dropped after enough pain:

Letting it format the whole repo every task. The churn buries your actual change in the diff. Scope formatting to touched files.
Trusting "I ran the tests." If I didn't see the output, it didn't happen. I make it paste the run.
Writing one giant universal prompt. Task-scoped skill files keep context local and outputs consistent. A CLAUDE.md plus a couple of targeted skill docs beats a 2,000-word system prompt every time.
Treating skill docs as personal notes. They live in the repo, get reviewed like code, and get updated by PR when they drift from reality. A skill nobody follows is worse than none, because it manufactures false confidence at review time.

Run your own Claude Code setup against this before you blame the model for sloppy output. Most "the AI writes bad code" complaints are really "my environment never defined done":

checklistIs your environment set up to get good work out of an agent?0/6

When a skill is not worth writing

The economics of a skill are simple. You pay a fixed cost once to write it down, and you collect a small saving every time the model loads it instead of you re-explaining the same thing. So the tasks worth encoding are the boring, repeated ones: the verify pipeline, the i18n contract, the way my repo handles theme tokens. I hit those constantly, so the cost amortizes fast and the doc earns its keep.

The trap is writing a skill for work you will do once. If I am exploring an unfamiliar API or wiring up a one-off migration script, the time spent generalizing the steps into a reusable doc is time I will never get back, because there is no second run to collect the saving on. Worse, a skill written for a one-off tends to be vague, since I did not yet understand the problem well enough to encode it cleanly. So my rule is plain. If I have explained the same thing to the model three times, I write it down. Until then, I just explain it.

Skills vs CLAUDE.md vs subagents: what goes where

People conflate these three, and the confusion costs them. They are not interchangeable. They sit at different points on one axis: how often the context applies, and whether the model should reach for it automatically or on demand. Once I started placing each piece of context by that question, my setup stopped fighting itself.

CLAUDE.md is for always-on project rules. It loads every session, unconditionally, so it should hold only the things that are true for every task in the repo. On adatepe.dev that is the bun run verify gate, the bans on @ts-ignore and --no-verify, the i18n contract that all 14 locale files stay uniform, and the operational note that database migrations are never applied automatically by deploys. These are not procedures the model chooses to run. They are the physics of the repo. If I put a niche, rarely-relevant procedure here, I am paying context budget on every single task for something that fires twice a month.

Skills are for on-demand procedures the model invokes when relevant. A skill stays dormant until its description trigger matches the task at hand, which is exactly why the trigger should read "Use when the user asks to draft release notes," not a vague summary. A changelog writer, a runbook for adding a new locale, a recipe for wiring up a new blog route with SEO metadata: each is a procedure I want available but not loaded into every conversation. The cost of a skill is near zero until it activates, so I can keep many of them without bloating the baseline.

Subagents are for parallel or isolated work. When I have two independent tasks with no shared state, or a noisy investigation whose intermediate output I do not want polluting my main thread, I hand it to a subagent with its own context window. That isolation is the same instinct behind my nightshift autonomous loop, where isolated runs clear a backlog without stepping on each other. The rule of thumb: always-on rule goes in CLAUDE.md, repeated procedure becomes a skill, independent or context-heavy work goes to a subagent.

If you are still unsure where a given piece of context belongs, walk it through the same questions I ask myself before I write anything down:

find your answer

Where does this context belong?

Route a rule, a procedure, or a task to the right home before you encode it.

Is it true for every task in the repo, like the verify gate or the type bans?

Where this leaves me

Strong Claude Code usage is mostly unglamorous infrastructure. Write down your contracts. Make completion mechanical. Keep tasks small enough that the gate's feedback is sharp. Add browser QA where static checks go blind. Bail out of bad branches fast and make the model explain itself before it patches.

I built most of this scaffolding while working on my own projects and on internal tooling as a working student at BMW, and the same pattern held in both places: the teams that get value from agents are the ones who invested in making "done" unambiguous. The model is the cheap part. The gate is the expensive, valuable part.

If you want to see what this discipline produces over time, the rest of my writing lives at /blog, and the work itself is on /cv.

your move

Ready to push the discipline further?

Pick where you want to go next.

Whichever path you pick, skills only matter once they survive the gate, and that gate is exactly how I ship at BMW and on adatepe.dev.

built by alperenI ship real products with this workflow, not prompt tricks.Working student at BMW, M.Sc. CS at LMU Munich. See the work, or get in touch.Explore my work