I'm a working student, not a YouTuber, so I don't get paid to be excited about tools. I'm doing Computer Science at LMU Munich on a BMW Fastlane scholarship while shipping internal tooling at BMW Group, so I get paid, and graded, to make things that work. That's the lens here. This isn't a roundup of every agentic coding tool with a star rating. It's what I actually reach for as one person trying to ship real software, and where each thing earns its keep or doesn't.
A quick disclaimer before anyone treats this as gospel: the space moves fast enough that any specific version comparison is stale within months. What doesn't go stale is the way you should think about choosing. So I'll spend more time on the criteria than on the leaderboard, because the criteria are what survive.
I do not grade a tool on its demo. I grade it on whether it can satisfy the gate my own work has to pass.
The only question that matters: what does "done" mean?
Every agentic coding tool is good at producing code that looks right. That's table stakes now and it's also a trap. The thing that separates a tool I keep from a tool I uninstall is whether it respects a definition of "done" that I control, or whether it substitutes its own narrative confidence for that definition.
Concretely: my repos have a fixed gate. On adatepe.dev it's prettier → tsc --noEmit → eslint → prettier --check, plus a Bun suite of roughly 1,280 tests, plus a check that all 14 i18n locale files match a typed config. A tool that lets the agent declare victory before that gate is green is worse than useless to me, because it manufactures false progress I then have to clean up. So my number-one evaluation criterion isn't "how smart is the model," it's "how easily can I make this tool refuse to finish until my checks pass."
If you remember nothing else: pick tools by how well they integrate with your gate, not by how impressive their demos are.
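The locale check is the least standard part of that gate, so here is a minimal sketch of what such a check could look like. It is not the actual adatepe.dev script: the `messageKeys` list, the `LocaleFile` type, and the three sample locales are all assumptions made up for illustration.

```typescript
// Hypothetical typed config: this key list is the single source of truth.
const messageKeys = ["nav.home", "nav.blog", "cta.contact"] as const;

// A parsed locale file; its shape is untrusted until checked.
type LocaleFile = Record<string, string>;

interface LocaleProblem {
  locale: string;
  missing: string[]; // keys in the config but absent from the file
  extra: string[]; // keys in the file that the config does not know
}

// Compare every locale against the config and report any drift.
function checkLocales(locales: Record<string, LocaleFile>): LocaleProblem[] {
  const expected = new Set<string>(messageKeys);
  const problems: LocaleProblem[] = [];
  for (const [locale, messages] of Object.entries(locales)) {
    const present = Object.keys(messages);
    const missing = messageKeys.filter((key) => !present.includes(key));
    const extra = present.filter((key) => !expected.has(key));
    if (missing.length > 0 || extra.length > 0) {
      problems.push({ locale, missing, extra });
    }
  }
  return problems;
}

// Example data (invented): "de" lost a key, "tr" kept a stale one.
const localeProblems = checkLocales({
  en: { "nav.home": "Home", "nav.blog": "Blog", "cta.contact": "Contact" },
  de: { "nav.home": "Start", "nav.blog": "Blog" },
  tr: { "nav.home": "Ana sayfa", "nav.blog": "Blog", "cta.contact": "İletişim", "nav.old": "Eski" },
});

// A gate step would exit non-zero whenever this list is non-empty.
const localesGreen = localeProblems.length === 0;
```

The point of returning a list of problems rather than a boolean is that the agent gets a precise failure to react to, which is what makes the gate's feedback usable inside a loop.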
What actually makes a tool agentic
People throw the word "agentic" around loosely, so it helps to pin down what it means in practice. A chat assistant and an agentic tool can run the exact same model underneath. The difference is not intelligence; it is the loop. A chat assistant talks about your code. An agentic tool reads your files, runs your tests, sees the failure, edits, and re-runs until the check passes. That closed loop is the part that saves me time, because it removes me as the manual courier moving text between a model and a terminal.
When I was the glue, every cycle cost me a context switch: paste code, read the reply, copy it back, run it, paste the error, repeat. Each hop is a chance for me to lose focus or transcribe something wrong. The moment a tool can run the command itself and react to real output, the work stops being a conversation and starts being progress. That is also why the loop matters more than raw model quality for everyday work. A slightly weaker model that closes the loop against my gate beats a smarter one that just hands me confident text I still have to verify by hand.
The chat side of that difference looks like this:
// You paste code in, read the reply,
// copy it back, run it, paste the error,
// repeat. The model never touches your repo.
Same model underneath. The difference is whether it can act on your codebase or just talk about it. That loop is the whole reason agentic tools feel different.
That loop is the lens I use to sort the tools into categories below.
Between my work at BMW and the systems courses in my M.Sc. CS at LMU, I keep coming back to the same mental model: the loop is just a control system, and the model is one component inside it. Stripped down to code, here is what that control flow actually looks like.
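let context = [];
let done = false;
while (!done) {
  const action = model.next(context); // one small next step
  const result = run(action); // edit, test, run
  context = context.concat(result); // feed the real output back in
  done = result.testsPass; // the gate decides, not the model
}
Each call to model.next returns not a wall of code but a single next step given the current context: read a file, run a command, edit a line. Small steps are what make it correctable.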
Once you see it as a loop, the categories below stop being marketing buckets and start being structural choices. Each one decides how much of that while block the tool is allowed to run on its own.
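To show that the control flow above really is all there is, here is a runnable version with a scripted stand-in for the model. Treat it as a sketch under assumptions: `Action`, `StepResult`, `Model`, `Runner`, and the `maxSteps` cap are hypothetical names of mine, not the API of any real agent tool.

```typescript
// Hypothetical shapes; none of these names come from a real agent SDK.
type Action =
  | { kind: "read"; path: string }
  | { kind: "edit"; path: string; patch: string }
  | { kind: "runGate" };

interface StepResult {
  action: Action;
  output: string;
  testsPass: boolean; // only a green gate run sets this to true
}

interface Model {
  next(context: StepResult[]): Action;
}

type Runner = (action: Action) => StepResult;

// The loop itself: ask for one step, run it, feed the result back in.
// maxSteps is an added safety cap so a confused model cannot spin forever.
function agentLoop(model: Model, run: Runner, maxSteps: number) {
  let context: StepResult[] = [];
  let done = false;
  for (let step = 0; step < maxSteps && !done; step++) {
    const action = model.next(context);
    const result = run(action);
    context = context.concat(result);
    done = result.testsPass;
  }
  return { done, context };
}

// Scripted stand-in for a model: read, see red, edit, see green.
const scripted: Action[] = [
  { kind: "read", path: "src/route.ts" },
  { kind: "runGate" },
  { kind: "edit", path: "src/route.ts", patch: "fix signature" },
  { kind: "runGate" },
];
const stubModel: Model = {
  next: (context) => scripted[context.length] ?? { kind: "runGate" },
};

// Fake environment: the gate only goes green after an edit has happened.
let edited = false;
const stubRun: Runner = (action) => {
  if (action.kind === "edit") edited = true;
  const testsPass = action.kind === "runGate" && edited;
  return { action, output: action.kind, testsPass };
};

const outcome = agentLoop(stubModel, stubRun, 10);
```

The first gate run comes back red and the loop simply continues. Nothing in the loop trusts the model's opinion of its own work: `done` is only ever set from `result.testsPass`.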
The categories, honestly
I find it more useful to think in categories than brands, because tools shuffle between categories constantly.
Inline assistants
The autocomplete-plus tools. They're genuinely good for the local loop: filling in a function body, suggesting the next line, writing the obvious test. Where they fall down is anything cross-file. They have a narrow keyhole view, so they confidently produce a call that doesn't match a signature two files away. I keep one around for typing speed and expect nothing architectural from it.
Terminal agents
This is where I live. A CLI agent that can read the whole repo, run commands, and iterate against output is the format that actually fits how I work, because it can close the loop itself. It writes a change, runs my gate, reads the failure, and fixes it, with no copy-pasting of errors back and forth. I use Claude Code daily in exactly this mode. The reason terminal agents beat inline ones for real work is simple: they can run things. An agent that can execute tsc --noEmit and read the output is operating on facts. An agent that can only suggest text is guessing.
Autonomous loops
The "go do the backlog" category. I'm biased here because I built one, Nightshift, which decomposes a task list, runs each item against a hard gate, does browser QA, and commits atomically overnight. The honest take is that this category is incredible for a specific shape of work (many small, well-specified, low-risk tasks) and dangerous for everything else. Pointed at vague tickets or risky surfaces it produces a lot of confident wrong. The tools in this category are only as good as the validation gate you wrap them in.
How I actually pick, in order
When I evaluate a new agentic coding tool, I run it through the same questions every time:
- Can I make it run my exact validation gate, in order, and treat failure as not-done? If not, it's a toy.
- Can I constrain its scope? A tool that wanders into files I didn't mention produces unreviewable diffs. I want a diff I can read in five minutes, not a sprawling refactor I have to forensically audit.
- Can it see runtime, not just code? For UI work, can it drive a browser, read the console, check dark mode? Static checks are blind to half the bugs that matter.
- Does it leave a trail? Commits, logs, screenshots. If I can't reconstruct what it did, I can't trust it on anything unattended.
- How does it behave when it's wrong? This is the real tell. Good tools fail loudly and stop. Bad tools fail quietly and keep going.
Notice that "how clever is the model" isn't on that list. Model quality is real, but it's the most fungible part: everyone's model gets better every few months. The integration properties are what differ, and they're what you'll actually fight with.
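The scope question is the easiest of the five to turn into a mechanical check. A minimal sketch, assuming the task names the paths it is allowed to touch and that the list of changed files comes from something like `git diff --name-only`; the `TaskSpec` shape and every file name below are hypothetical.

```typescript
// Hypothetical task description: what to do, where, and what "done" means.
interface TaskSpec {
  description: string;
  allowedPaths: string[]; // exact files, or directories ending in "/"
  doneWhen: string;
}

// Return every changed file that falls outside the task's declared scope.
function outOfScope(changedFiles: string[], spec: TaskSpec): string[] {
  return changedFiles.filter((file) => {
    const inScope = spec.allowedPaths.some((allowed) => {
      const dirPrefix = allowed.endsWith("/") ? allowed : allowed + "/";
      return file === allowed || file.startsWith(dirPrefix);
    });
    return !inScope;
  });
}

// Invented example task and diff.
const blogIndexTask: TaskSpec = {
  description: "Add the /blog index route",
  allowedPaths: ["src/routes/blog/", "src/sitemap.ts", "tests/sitemap.test.ts"],
  doneWhen: "the sitemap test passes",
};

const strays = outOfScope(
  ["src/routes/blog/index.tsx", "src/sitemap.ts", "src/components/Header.tsx"],
  blogIndexTask,
);
// strays holds only the file the task never mentioned.
```

A non-empty result does not have to mean the change is wrong, only that the diff is no longer the five-minute read the task promised, which is reason enough to stop and look.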
If you're staring at the category list and not sure which one your next task wants, match the format to the blast radius you're comfortable with, not to the leaderboard. The first question to ask is whether the task spans more than one file.
The mistakes I made so you don't have to
A few honest ones from my own usage:
I trusted output over evidence. Early on I'd read a confident "all tests pass" and believe it. Now I don't consider anything done unless I've seen the command output. The model saying it ran the tests and the tests actually being green are different events.
I gave tasks that were too big. "Add the blog system" produces a giant diff where, when it breaks, root cause is buried under twenty decisions. "Add the /blog index route, touching these three files, done when the sitemap test passes" produces a diff I can actually reason about. Small tasks make the gate's feedback sharp.
I optimized before I had evidence. I tuned prompts and config for problems I hadn't proven were real. The fix is to start from a measurable user-visible outcome and only spend effort where something actually bottlenecks it.
I let perfect-looking formatting hide thin work. A tool can produce beautifully formatted code that does the wrong thing. Formatting is not correctness. The gate is correctness.
Where each category actually fits my week
It's easy to read the above as "terminal agents win, everything else loses." That's not it. They coexist, and the skill is knowing which one to reach for.
When I'm deep in a single file working out an algorithm or wiring up a component, the inline assistant is the right tool: fast, local, low-ceremony, and I'm reviewing every line as it appears anyway. The blast radius is one function, so the keyhole view doesn't hurt me.
When I'm implementing a feature that spans a few files (a new route, or a schema change with the API and tests that follow from it), the terminal agent earns its place, because the loop of "change, run the gate, read the failure, fix" is the entire job and it can run that loop itself. This is the bulk of my real work, including a lot of the internal tooling I build as a working student, and it's where the close-the-loop property pays for itself many times over.
When I have a pile of small, independent, low-risk tasks and no energy, the autonomous loop is the answer. Not for anything I'd be nervous to wake up to, but for the long tail of fixes and chores that otherwise never get done. The danger is using this category for work that doesn't fit its shape: point it at something ambiguous or risky and the confident-wrong rate spikes.
So the real comparison isn't "which tool is best." It's "which tool matches the shape of this task and the blast radius I'm comfortable with." A tool used outside its shape will disappoint you, and you'll wrongly blame the tool.
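The routing described above is simple enough to write down as a function. This is a rough sketch of the rule of thumb, not a policy: the `TaskShape` fields and the exact thresholds are assumptions chosen for illustration.

```typescript
type Format = "inline assistant" | "terminal agent" | "autonomous loop";

// Hypothetical description of a task's shape and blast radius.
interface TaskShape {
  filesTouched: number;
  independentTasks: number; // 1 for a single task, more for a backlog
  wellSpecified: boolean;
  lowRisk: boolean;
}

function pickFormat(task: TaskShape): Format {
  // A pile of small tasks only goes unattended if it is clear and safe.
  const isBacklog = task.independentTasks > 1;
  if (isBacklog && task.wellSpecified && task.lowRisk) return "autonomous loop";
  // Single-file work keeps the blast radius at one function.
  if (task.filesTouched <= 1) return "inline assistant";
  // Everything cross-file runs supervised, against the gate.
  return "terminal agent";
}

const singleFile = pickFormat({ filesTouched: 1, independentTasks: 1, wellSpecified: true, lowRisk: true });
const crossFile = pickFormat({ filesTouched: 4, independentTasks: 1, wellSpecified: true, lowRisk: true });
const choreBacklog = pickFormat({ filesTouched: 2, independentTasks: 12, wellSpecified: true, lowRisk: true });
const vagueBacklog = pickFormat({ filesTouched: 3, independentTasks: 12, wellSpecified: false, lowRisk: true });
```

Note that the vague backlog falls through to the terminal agent rather than the autonomous loop: ambiguity keeps a human in the room.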
The trap of optimizing the tool instead of the gate
There is a failure mode I fell into hard, and I see it constantly in other developers, so it is worth naming directly. You spend your energy tuning the tool, the prompts, the config, the model selection, the elaborate setup, and almost none of it on the thing the tool has to satisfy. It feels productive because you are doing something measurable, and it is almost entirely wasted, because a better-tuned tool pointed at a weak definition of done just produces plausible-wrong output faster.
I spent a real amount of time early on optimizing prompts for problems I had not proven were real. I tuned phrasing, swapped models, tweaked context windows, all in pursuit of output that looked better. What actually moved the needle was the opposite direction: making my gate stricter and my tasks smaller, so that any competent tool was forced into producing work I could trust. The day I stopped shopping for a smarter model and started hardening the wall it had to pass, the quality of what I shipped went up regardless of which tool I happened to be using that week.
I can put a rough number on it. Over a stretch of a few weeks where I logged this, somewhere around four in ten "done" patches from an unattended agent had at least one problem the model never mentioned: a null it didn't narrow, a locale file left out of the 14, a test it claimed to run but hadn't. Before I trusted the gate, every one of those was an evening I lost re-reading diffs by hand. After I made the gate non-negotiable, the same class of mistake still happened at roughly the same rate, but it now cost me nothing, because tsc and the 1,280-test suite caught it in seconds instead of in production. The mistakes didn't disappear; my exposure to them did. That is the entire return on hardening the gate, and it is why I treat the type checker as the real reviewer. I wrote up the exact setup in my TypeScript guardrails for AI-generated code, because once you've felt a strict gate catch a quiet failure for you, tuning prompts to avoid that failure stops feeling like the productive move it pretends to be. The same instinct shows up across the projects I ship this way: the gate is the part I invest in, the tool is the part I swap.
This is why I am almost bored by the question of which agentic tool is best. The honest answer for a solo developer is that it matters far less than the question implies, because the leverage does not live in the tool. It lives in the environment you drop the tool into. A strict gate makes a mediocre tool useful. A weak gate makes the best tool on the market dangerous, because now you have a very capable system producing confident output with nothing reliable standing between it and your main branch. The tool is the cheap, swappable part. The gate is the expensive part, and it is the part that is actually yours.
A note on lock-in
One more practical thing: don't build your whole process around one vendor's proprietary features. The models and the wrappers churn. What I keep stable is the gate: plain shell commands, a normal test runner, ordinary git. Every tool I use plugs into that. When something better shows up next quarter, switching costs me an afternoon, not a rewrite, because the thing I actually depend on lives in my repo, not in any vendor's product. That portability is itself a reason to favor tools that respect your existing scripts over ones that want you inside their walled workflow.
Next time you're evaluating an agentic tool, run it through the five questions above instead of watching its demo reel. A tool that can't tick most of them is a toy no matter how smart its model is.
How I evaluate a new agentic tool in an afternoon
I don't trust demos and I don't read the changelog hype. When a new agentic tool shows up, I give it a single, repeatable afternoon test, and I run that test against a repo I already know cold. Usually it's one of my own side projects, because I need to feel in my gut whether a diff is right or wrong without reading every line. Between coursework for the M.Sc. at LMU and the working-student hours at BMW, an afternoon is genuinely all the time I have, so the protocol has to be tight.
I start by handing it a real bug I've already fixed once, then reverted, so I know the correct answer and the dead ends around it. The first thing I watch is whether it reads files before it writes them. A tool that starts editing before it has opened the failing module is guessing, and guessing tools waste my evenings. Next I check whether it runs the tests on its own initiative, or whether it declares victory and waits for me to discover the failure. The ones I keep treat a red test suite as not-done, the same way I do.
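Both of those observations can be checked against a trail after the fact instead of by watching the screen. A minimal sketch, assuming the tool's actions can be exported as a flat event list; the `TrailEvent` format is hypothetical, since real tools record their actions in their own formats.

```typescript
// Hypothetical event log of what a tool did during one task.
type TrailEvent =
  | { kind: "read"; path: string }
  | { kind: "write"; path: string }
  | { kind: "gate"; exitCode: number }
  | { kind: "claimDone" };

interface Audit {
  blindWrites: string[]; // files edited before they were ever read
  doneBackedByGate: boolean; // "done" followed a green gate with no edits in between
}

function auditTrail(events: TrailEvent[]): Audit {
  const seen = new Set<string>();
  const blindWrites: string[] = [];
  let greenSinceLastWrite = false;
  let doneBackedByGate = false;
  for (const event of events) {
    switch (event.kind) {
      case "read":
        seen.add(event.path);
        break;
      case "write":
        if (!seen.has(event.path) && !blindWrites.includes(event.path)) {
          blindWrites.push(event.path);
        }
        greenSinceLastWrite = false; // any edit invalidates earlier evidence
        break;
      case "gate":
        greenSinceLastWrite = event.exitCode === 0;
        break;
      case "claimDone":
        doneBackedByGate = greenSinceLastWrite;
        break;
    }
  }
  return { blindWrites, doneBackedByGate };
}

// Invented trails: one careful run, one sloppy run.
const careful = auditTrail([
  { kind: "read", path: "src/parser.ts" },
  { kind: "gate", exitCode: 1 },
  { kind: "write", path: "src/parser.ts" },
  { kind: "gate", exitCode: 0 },
  { kind: "claimDone" },
]);
const sloppy = auditTrail([
  { kind: "write", path: "src/parser.ts" },
  { kind: "gate", exitCode: 0 },
  { kind: "write", path: "src/parser.ts" },
  { kind: "claimDone" },
]);
```

The sloppy run fails both checks even though it saw a green gate once: it edited a file it never opened, and it edited again after the green run without re-checking.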
The most revealing moment is failure. I want to see it hit a broken build or a failing assertion and then recover, reading the actual error instead of reshuffling code at random. Just as important is how it handles being wrong: does it stop and say so, or does it quietly paper over the problem and keep going? Loud failure beats confident nonsense every time. This is also where my TypeScript guardrails for AI-generated code earn their keep, because the type checker catches the quiet mistakes the tool won't admit to.
To make this concrete, here is roughly what the afternoon test looks like when I run it against one of my own repos. The exact commands are nothing exotic, just my plain gate driven by hand so I can watch the tool react to real output instead of its own narrative.
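As a stand-in for a real session, here is a minimal sketch of the gate as an ordered list of steps that stops at the first red one. The step order follows the gate described earlier; the exact arguments (`prettier --write .`, `eslint .`, `bun test`) and the `Exec` callback are assumptions, and the executor is faked so the example runs anywhere.

```typescript
interface GateStep {
  name: string;
  command: string;
}

// Runs one command and returns its exit code. In a real setup this would
// wrap a process spawner; here it is injected so the sketch stays portable.
type Exec = (command: string) => number;

// Assumed invocations; adjust the arguments to match the repo.
const gate: GateStep[] = [
  { name: "format", command: "prettier --write ." },
  { name: "types", command: "tsc --noEmit" },
  { name: "lint", command: "eslint ." },
  { name: "format check", command: "prettier --check ." },
  { name: "tests", command: "bun test" },
];

interface GateResult {
  green: boolean;
  failedAt: string | null; // name of the first red step, if any
  ran: string[]; // steps that actually executed, in order
}

// Run the steps in order and stop at the first failure: red means not-done.
function runGate(steps: GateStep[], exec: Exec): GateResult {
  const ran: string[] = [];
  for (const step of steps) {
    ran.push(step.name);
    if (exec(step.command) !== 0) {
      return { green: false, failedAt: step.name, ran };
    }
  }
  return { green: true, failedAt: null, ran };
}

// Fake executor: pretend the linter is the only thing that fails.
const lintFails: Exec = (command) => (command.startsWith("eslint") ? 1 : 0);
const redRun = runGate(gate, lintFails);

// Fake executor where everything passes.
const greenRun = runGate(gate, () => 0);
```

Stopping at the first failure is deliberate: one sharp error is easier for a tool (or me) to act on than five overlapping ones.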
What I'm watching during those runs is not the model's prose; it's whether the tool re-runs the gate after each edit and treats a red result as not-done. Finally I judge the diff itself. Is it scoped to the bug, or did it reformat half the file and rename three things I didn't ask about? A tool that produces a five-minute-reviewable patch survives. Everything else gets uninstalled before dinner.
What I'd tell another indie dev
Stop shopping for the smartest tool and start building the strictest gate. The leverage for a solo developer isn't in finding a model that's 10% better at reasoning. It's in setting up an environment where any competent agent is forced to produce work that compiles, passes tests, and doesn't break the page, and where you can tell at a glance when it didn't.
Once you have that, the tools become interchangeable in the best way. I switch between them depending on the task, and it barely matters, because the gate is the same and the gate is what I trust. That's the quiet truth of this whole space: the best agentic coding tool is the one you've wrapped in the most uncompromising definition of done.
The projects I've built this way are on /#projects, more of my thinking is on /blog, and if you want to know who's making these claims, that's /cv.
Tools change every few months, but the habit of trusting nothing past the gate is what keeps mine shipping, and I'd love to show you what it shipped.