I built Nightshift because I kept falling asleep with a backlog open. The pattern was always the same: I'd have eight or nine small, well-specified tasks for adatepe.dev (a copy fix, a missing test, a new locale key, a refactor I'd been avoiding) and zero energy to do any of them. Each one was the kind of work that's tedious for a human and almost insultingly mechanical for a model with a good validation gate. So Nightshift is the thing that runs that backlog while I sleep and leaves me a stack of atomic commits to review with coffee.

What it is not is a "build my startup overnight" button. I want to be precise about that because the genre is full of demos that fall apart the moment you ask them to ship something real. Nightshift is narrow on purpose. It does a small number of things and refuses to mark anything done until that thing provably works.


The one rule that makes it safe

Everything in Nightshift hangs off a single non-negotiable: a task is not complete until a hard validation gate passes, and the gate is not the model's opinion.

An unconstrained overnight agent is genuinely dangerous, not because it does nothing, but because it produces output that looks finished. It'll write a feature, generate a confident summary, and move on while three files don't typecheck. The next morning you don't have eight features, you have eight plausible-looking diffs and a debugging session. The fear people have about autonomous coding is correct. The mistake is concluding that the answer is less autonomy. The answer is a harder gate.

In my repo that gate is the same one I use by hand:

prettier --write . && tsc --noEmit && eslint . && prettier --check .

Nightshift runs that after every task attempt, plus the relevant slice of my Bun test suite (around 1,280 tests). If any step exits non-zero, the task is not done, full stop. The model doesn't get to vote.

The exit-code logic is the whole signal. When a task attempt fails any of the four steps, the runner reads $?, sees it is not zero, and the task stays open. Only a clean sweep flips it to committed.
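As a rough sketch of that exit-code logic (not Nightshift's actual runner), the gate can be modelled as an ordered list of commands where the first non-zero exit stops everything. The `GateStep` and `GateResult` types, the step names, and the injected `run` callback are assumptions made so the example runs without spawning real processes.

```typescript
// Hypothetical sketch: type names and step names are illustrative only.
type GateStep = { name: string; command: string };
type GateResult =
  | { passed: true }
  | { passed: false; failedStep: string; exitCode: number };

// The four steps of the gate command, in the order they are chained.
const GATE_STEPS: GateStep[] = [
  { name: "format", command: "prettier --write ." },
  { name: "typecheck", command: "tsc --noEmit" },
  { name: "lint", command: "eslint ." },
  { name: "format-check", command: "prettier --check ." },
];

// `run` is injected so the sketch stays testable; a real runner would spawn
// the command and hand back its exit code.
function runGate(steps: GateStep[], run: (command: string) => number): GateResult {
  for (const step of steps) {
    const exitCode = run(step.command);
    if (exitCode !== 0) {
      // Mirror `&&` semantics: stop at the first failing step.
      return { passed: false, failedStep: step.name, exitCode };
    }
  }
  return { passed: true };
}

// Example: tsc exits 2, so eslint and prettier --check never run.
const executed: string[] = [];
const failing = runGate(GATE_STEPS, (command) => {
  executed.push(command);
  return command.indexOf("tsc") === 0 ? 2 : 0;
});
const clean = runGate(GATE_STEPS, () => 0);
```

The only thing the caller needs is `passed`. The failing step and exit code exist for the log, not for the decision.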

How the loop actually runs

The architecture is deliberately boring. Boring is what survives contact with a six-hour unattended run.

Backlog decomposition

I write tasks in a structured markdown file; each one has an outcome, the files it's allowed to touch, and explicit acceptance criteria. Nightshift parses that into a queue. Loose tasks produce loose code, so I keep each unit to something a focused developer would finish in 10 to 20 minutes. A task that says "improve the blog" is rejected by me before it ever reaches the runner.
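To make the task contract concrete, here is a minimal sketch of parsing such a file into a queue. The markdown layout (`## title`, `outcome:`, `files:`, `accept:`) is a hypothetical format invented for this example, since the real file format isn't shown here, and the `missingFields` check only catches empty fields. It does not replace reading the task for vagueness by hand.

```typescript
// Hypothetical task shape and file format, for illustration only.
type Task = {
  title: string;
  outcome: string;
  files: string[];
  acceptance: string[];
};

// Returns the text after `prefix`, or null when the line does not start with it.
function afterPrefix(line: string, prefix: string): string | null {
  return line.slice(0, prefix.length) === prefix
    ? line.slice(prefix.length).trim()
    : null;
}

function parseBacklog(markdown: string): Task[] {
  const tasks: Task[] = [];
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    const title = afterPrefix(line, "## ");
    if (title !== null) {
      tasks.push({ title, outcome: "", files: [], acceptance: [] });
      continue;
    }
    const current = tasks[tasks.length - 1];
    if (current === undefined) continue; // ignore text before the first task
    const outcome = afterPrefix(line, "outcome:");
    const files = afterPrefix(line, "files:");
    const accept = afterPrefix(line, "accept:");
    if (outcome !== null) current.outcome = outcome;
    if (files !== null) {
      current.files = files.split(",").map((f) => f.trim()).filter((f) => f !== "");
    }
    if (accept !== null) current.acceptance.push(accept);
  }
  return tasks;
}

// A task with any empty field never reaches the queue.
function missingFields(task: Task): string[] {
  const missing: string[] = [];
  if (task.outcome === "") missing.push("outcome");
  if (task.files.length === 0) missing.push("files");
  if (task.acceptance.length === 0) missing.push("acceptance");
  return missing;
}

const sample = [
  "## Add aria-label to icon button",
  "outcome: The icon button exposes an accessible name",
  "files: src/components/IconButton.tsx",
  "accept: The button has an aria-label",
  "",
  "## Improve the blog",
].join("\n");

const parsed = parseBacklog(sample);
const queue = parsed.filter((t) => missingFields(t).length === 0);
const rejected = parsed.filter((t) => missingFields(t).length > 0);
```

In this sample the second entry has no outcome, files, or acceptance criteria, so it lands in `rejected` instead of the queue.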

Bounded attempts per task

Each task gets a budget: a maximum number of code-fix attempts and, for UI work, a separate cap on browser-fix attempts. When a task burns its budget, Nightshift doesn't spin forever. It marks the task needs-review, writes down why, and moves to the next one. One stubborn task is not allowed to eat the whole night's run budget. That's the difference between waking up to seven finished tasks and one flagged, versus waking up to nothing because attempt #94 on task two is still going.
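A bounded attempt loop with separate caps might look like the sketch below. The `Budget`, `CheckFailure`, and `Outcome` types and the `attempt` callback are hypothetical. This sketch also assumes a budget counts fix attempts after the initial try, which may not match how the real runner counts.

```typescript
// Hypothetical sketch of per-task budgets; names are illustrative only.
type Budget = { codeFixAttempts: number; browserFixAttempts: number };
// Which check failed on an attempt; null means every check passed.
type CheckFailure = "code" | "browser" | null;
type Outcome =
  | { status: "committed"; attemptsUsed: number }
  | { status: "needs-review"; reason: string };

function runTask(budget: Budget, attempt: (n: number) => CheckFailure): Outcome {
  let codeLeft = budget.codeFixAttempts;
  let browserLeft = budget.browserFixAttempts;
  // Terminates because every failure either spends a budget unit or returns.
  for (let n = 1; ; n++) {
    const failure = attempt(n);
    if (failure === null) return { status: "committed", attemptsUsed: n };
    if (failure === "code") {
      if (codeLeft === 0) {
        return { status: "needs-review", reason: `code-fix budget exhausted after attempt ${n}` };
      }
      codeLeft--;
    } else {
      if (browserLeft === 0) {
        return { status: "needs-review", reason: `browser-fix budget exhausted after attempt ${n}` };
      }
      browserLeft--;
    }
  }
}

// A task that never passes the code gate: initial try plus two fixes, then flagged.
let stuckCalls = 0;
const stuck = runTask({ codeFixAttempts: 2, browserFixAttempts: 1 }, () => {
  stuckCalls++;
  return "code";
});

// A task that fails once, then passes on the second attempt.
const flaky = runTask({ codeFixAttempts: 2, browserFixAttempts: 1 }, (n) =>
  n === 1 ? "code" : null,
);

// A UI task whose visual check keeps failing only spends the browser budget.
const uiStuck = runTask({ codeFixAttempts: 5, browserFixAttempts: 1 }, () => "browser");
```

Keeping the two counters independent is the point. In the last example the task is flagged after two attempts even though five code-fix attempts remain unspent.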

Atomic commit per task

Every task that passes the gate gets its own commit. This is the feature I'd give up last. It means my morning review is a clean walk through discrete, self-contained changes, and if I dislike one, I drop that single commit without unpicking everything around it. Rollback is git revert, not surgery.

State and event logs

Nightshift writes an append-only event log and periodic state snapshots. When something weird happens overnight, and it will, I can reconstruct exactly what the runner saw and decided. Without that trail, debugging an unattended run is archaeology.
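One way to picture the trail is an append-only list of JSON lines that can be replayed into per-task state. The event names, the `EventLog` class, and the in-memory storage below are assumptions for illustration. The real log presumably lives on disk and records more detail.

```typescript
// Hypothetical event shapes; the real log format is not shown in this post.
type RunEvent =
  | { type: "task_started"; taskId: string }
  | { type: "gate_failed"; taskId: string; step: string }
  | { type: "task_committed"; taskId: string }
  | { type: "task_flagged"; taskId: string; reason: string };

type TaskStatus = "in-progress" | "committed" | "needs-review";

// Append-only: there is deliberately no method to edit or remove a line.
class EventLog {
  private lines: string[] = [];
  append(event: RunEvent): void {
    this.lines.push(JSON.stringify(event));
  }
  serialize(): string {
    return this.lines.join("\n");
  }
}

// Rebuild per-task status by replaying the log from the top.
function replay(serialized: string): Record<string, TaskStatus> {
  const state: Record<string, TaskStatus> = {};
  for (const line of serialized.split("\n")) {
    if (line.trim() === "") continue;
    const event = JSON.parse(line) as RunEvent;
    switch (event.type) {
      case "task_started":
        state[event.taskId] = "in-progress";
        break;
      case "task_committed":
        state[event.taskId] = "committed";
        break;
      case "task_flagged":
        state[event.taskId] = "needs-review";
        break;
      case "gate_failed":
        break; // recorded for debugging; status is unchanged
    }
  }
  return state;
}

// Tasks a resumed run still has to work on.
function pendingTasks(queue: string[], state: Record<string, TaskStatus>): string[] {
  return queue.filter((id) => state[id] !== "committed" && state[id] !== "needs-review");
}

const log = new EventLog();
log.append({ type: "task_started", taskId: "t1" });
log.append({ type: "gate_failed", taskId: "t1", step: "typecheck" });
log.append({ type: "task_committed", taskId: "t1" });
log.append({ type: "task_started", taskId: "t2" });
log.append({ type: "task_flagged", taskId: "t2", reason: "fix budget exhausted" });
log.append({ type: "task_started", taskId: "t3" }); // run interrupted here

const recovered = replay(log.serialize());
const pending = pendingTasks(["t1", "t2", "t3", "t4"], recovered);
```

After the simulated interruption, replay leaves `t3` in progress and `t4` untouched, so a resumed run would pick up from those two.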

People always assume the wiring is exotic, but coming from the kind of disciplined tooling I work with at BMW and study during my M.Sc. CS at LMU, I deliberately kept the harness embarrassingly small. The literal skeleton of an overnight run is a shell loop you could read in ten seconds, and every line in it is load-bearing for a reason.

The overnight loop, decoded

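# Skeleton only: no gate, no budgets, no per-task commits.
while true; do
  # Re-read task.md on every iteration so edits take effect on the next run.
  claude -p "$(cat task.md)" \
    --max-turns 40
  # Snapshot whatever changed; the commit fails harmlessly when nothing did.
  git add -A && git commit -m "wip"
  sleep 120  # pause two minutes between iterations
done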

Reading task.md means you can edit the goal between runs without touching the script. The loop always picks up the latest instructions.

That skeleton is the honest core, but the production runner wraps it in the gates and budgets described above so a single bad iteration can't poison the night.

Those four pieces are the parts; the value is how a single task travels through them from contract to commit. Each task enters as a structured entry with an outcome, the files it may touch, and explicit acceptance criteria. I write these by hand and reject anything vague before it reaches the runner. The runner then attempts the task within its budget, runs the gate, runs browser QA if it is a UI task, and either commits the result atomically or flags it needs-review with the reason.

Every stage above is a place the run can stop honestly rather than fake progress, which is the whole point of the design. With the path clear, the browser step deserves its own section because it guards a failure the gate structurally cannot.

Browser QA, because the gate is blind to pixels

A green build tells you the code compiles and the assertions pass. It tells you nothing about whether the page is actually usable. I learned this the hard way: a dark-mode card on adatepe.dev once rendered with near-invisible text while every test stayed green, because nothing tested the computed contrast.

So for any UI task, Nightshift drives a real Chrome session, the same loop I use manually. It loads the changed route, confirms the elements it touched are present and interactive, toggles dark and light, and reads the console for errors it might have introduced. Screenshots get saved as evidence per task. This is the single most valuable addition I made after the first version, because it closes the exact gap that static validation structurally cannot see. If your product has animation, route transitions, or responsive layout, treating browser QA as optional is how you ship broken UI with a green checkmark.
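The dark-mode card bug is the kind of thing a computed-contrast check would catch. The sketch below implements the standard WCAG contrast-ratio formula on RGB values. It is an illustration of what such a check could look like, not a description of what Nightshift's browser step actually measures, and it assumes the text and background colours have already been read from the page (for example via computed styles).

```typescript
// WCAG 2.x contrast ratio between two sRGB colours (channels 0-255).
type Rgb = { r: number; g: number; b: number };

// Convert one 8-bit sRGB channel to its linear-light value.
function channelToLinear(value: number): number {
  const c = value / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function relativeLuminance(color: Rgb): number {
  return (
    0.2126 * channelToLinear(color.r) +
    0.7152 * channelToLinear(color.g) +
    0.0722 * channelToLinear(color.b)
  );
}

// Ratio ranges from 1 (identical colours) to 21 (black on white).
function contrastRatio(a: Rgb, b: Rgb): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  const lighter = Math.max(la, lb);
  const darker = Math.min(la, lb);
  return (lighter + 0.05) / (darker + 0.05);
}

// WCAG AA threshold for normal-size body text.
const AA_NORMAL_TEXT = 4.5;

function passesAA(text: Rgb, background: Rgb): boolean {
  return contrastRatio(text, background) >= AA_NORMAL_TEXT;
}

// Hypothetical colours resembling near-invisible text on a dark card.
const darkCard: Rgb = { r: 26, g: 26, b: 26 };
const dimText: Rgb = { r: 34, g: 34, b: 34 };
const white: Rgb = { r: 255, g: 255, b: 255 };
const black: Rgb = { r: 0, g: 0, b: 0 };

const dimTextOk = passesAA(dimText, darkCard); // false: far below 4.5
const whiteTextOk = passesAA(white, darkCard); // true
const maxRatio = contrastRatio(white, black);
```

A check like this turns "the text is hard to read" into a number the loop can fail on, which is the same move the gate makes for types and lint.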

The gotcha I did not see coming is how much the browser step costs in wall-clock time. A pure code task, gate and all, settles in under a minute on my machine. A UI task that has to cold-start Chrome, load the route, toggle both themes, and re-screenshot after each fix attempt runs closer to three to four minutes per attempt, and with a browser-fix budget of three that single task can eat fifteen minutes of a night's runtime on its own. That is why I split the budget: code-fix attempts and browser-fix attempts are counted and capped separately, so a UI task that keeps failing its visual check cannot quietly starve the five text tasks queued behind it.

The other thing I learned to do is reuse one Chrome session across an entire night instead of spawning a fresh one per task. Early on each task booted its own browser, and the cumulative startup overhead across seven UI tasks was longer than the actual fixing.

The contrast bug I mentioned was the trigger for all of this: the fix itself took one attempt, but proving it stayed fixed across light and dark is what the browser loop exists for. If you want the manual version of this exact discipline, the habits I lean on day to day are in my Claude Code guide, and the gate that backs it is the TypeScript guardrails stack.

Failure modes I had to design around

The first version of Nightshift was naive in ways that are obvious in hindsight. A few of the anti-patterns I had to engineer out:

Optimistic completion. Early on it marked tasks done after generating the implementation, before the gate ran. That's how you get false progress and a painful morning. Now completion is strictly downstream of green checks.
No resumability. Long runs hit rate limits and transient API hiccups. If the loop can't resume from its last good state, your uptime looks fine and your delivery is garbage. State snapshots fixed this.
No evidence. A run with no logs, no screenshots, and no event trace is unfalsifiable. You can't debug what you can't see. Everything is persisted now.
Hard-failing on missing GitHub auth. If gh auth status fails, an early version aborted the whole run. Now it degrades gracefully: it keeps doing local work and atomic commits, and just skips the PR step. Wasting a night's runtime over a missing token is unforgivable.
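The graceful-degradation fix in the last item can be sketched as a plan that is decided once, up front, from the auth check. The `PublishStep` names and `planPublish` function are hypothetical. The sketch also assumes the push is skipped along with the PR when auth is missing, which the description above does not confirm.

```typescript
// Hypothetical sketch: decide what the end of a run may do, based on auth.
type PublishStep = "commit-locally" | "push" | "open-pr";
type PublishPlan = {
  steps: PublishStep[];
  skipped: PublishStep[];
  note: string | null;
};

// `ghAuthOk` stands in for whether `gh auth status` exited zero.
function planPublish(ghAuthOk: boolean): PublishPlan {
  if (ghAuthOk) {
    return { steps: ["commit-locally", "push", "open-pr"], skipped: [], note: null };
  }
  // Missing auth must never abort the run: local commits still happen.
  return {
    steps: ["commit-locally"],
    skipped: ["push", "open-pr"],
    note: "gh auth unavailable: kept local commits, skipped push and PR",
  };
}

const withAuth = planPublish(true);
const withoutAuth = planPublish(false);
```

Note that neither plan contains a merge or deploy step. The type simply has no value for it.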

What it does with GitHub

When auth is present, a finished run pushes its commits, opens a PR with a summary of what it did and what it flagged, and stops. It does not merge. I review, and if there are comments, I can hand those back for targeted patches, each of which re-runs the full gate before anything is pushed again. The model never gets a path to production that skips a human, and that's intentional.

What I let it touch, and what I don't

I scope tasks by risk before they ever enter the queue. Content edits, copy, and self-contained UI changes are fair game for unattended runs. Internal flows I'll let it draft, but I read those carefully. Anything near auth, payments, or destructive data operations does not go in an overnight backlog; that includes database migrations, which in my setup are never applied automatically anyway. Some things earn a human checkpoint regardless of how good the gate is, and pretending otherwise is how you get an incident.

If you are sizing up your own backlog and trying to decide what is safe to hand an overnight loop, route each item to the right lane before it ever reaches the runner. Start with one question: does the task touch auth, secrets, payments, or a database migration? If it does, it stays out of the overnight backlog.

That single triage question, reversible or not, is what keeps the dangerous moves out of the backlog and the boring ones flowing through it.
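The risk lanes described in this section can be written down as a small routing function. The tag vocabulary and lane names below are hypothetical labels for the categories named above. The real triage is a judgment made by hand, so treat this as a sketch of the rule and not a replacement for reading the task.

```typescript
// Hypothetical labels for the risk categories discussed above.
type RiskTag =
  | "content"
  | "copy"
  | "ui"
  | "internal-flow"
  | "auth"
  | "secrets"
  | "payments"
  | "migration"
  | "destructive-data";

type Lane = "unattended" | "draft-then-careful-review" | "human-only";

// Anything in this list never enters an overnight backlog.
const HUMAN_ONLY: RiskTag[] = ["auth", "secrets", "payments", "migration", "destructive-data"];

function triage(tags: RiskTag[]): Lane {
  // The most restrictive lane wins when a task carries several tags.
  if (tags.some((tag) => HUMAN_ONLY.indexOf(tag) !== -1)) return "human-only";
  if (tags.indexOf("internal-flow") !== -1) return "draft-then-careful-review";
  return "unattended";
}

const copyFix = triage(["copy"]);
const internalDraft = triage(["ui", "internal-flow"]);
const schemaChange = triage(["ui", "migration"]);
```

Checking the human-only tags first means a mostly harmless task with one risky edge still stays out of the queue.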

A worked example from one night

To make this concrete, here's the shape of a real run. The backlog had seven items: add a missing test for the blog word-count helper, fix a typo in the German locale file, add an aria-label to an icon button, tighten a Drizzle query that was over-fetching, add a canonical URL to one route's metadata, refactor a duplicated date formatter into a shared util, and update an excerpt that read badly.

None of those are glamorous. All of them are exactly the kind of thing that sits in a backlog for weeks because I'd rather build the next feature. Nightshift worked through them in order. Five passed the gate on the first or second attempt and got committed. The Drizzle query task hit its fix budget: it kept producing a query that typechecked but failed a test asserting the returned column set, so it got flagged needs-review with the failing output attached, and the loop moved on. The locale typo fix, the sixth commit, tripped my i18n uniformity check the first time because the model "fixed" the typo by rephrasing in a way that drifted from the English source's meaning; the second attempt corrected just the typo and passed.

In the morning I had six clean commits, one flagged task with a clear explanation, and an event log I could skim in two minutes. The flagged task took me four minutes to finish by hand once I read what it had been struggling with. That's the realistic win: not magic, just a tireless junior that never marks its own homework as done.


What I want to stress is how unremarkable each individual commit was. There was no clever architecture, no big feature, nothing I'd put in a demo. That's the point. The leverage of an autonomous loop isn't in the heroic task (those still want my full attention). It's in clearing the long tail of small, well-specified work that otherwise rots in a list.


The morning review ritual

The first thing I do over coffee is open the repo and read the night's commits top to bottom. There's usually a small stack waiting, one per task, and the whole point of the review is to confirm each one before I let any of it near a deploy. I don't read the diffs in isolation: I read them against the event log first, because the log tells me the story and the diff tells me whether the story is true.

What makes a commit easy to review is that it does exactly one thing. Because every task lands as its own atomic commit, the diff has a single subject, the message says what it was supposed to accomplish, and I can hold the whole change in my head at once. I scan it, check the obvious edges, and move on. A two-minute review per task feels almost lazy, but that's the design working: small surface area means few places to be wrong.

What makes a commit hard is when it sprawls. If a change touched five files for reasons the log doesn't justify, or if the diff and the stated task disagree, that's friction, and friction is the signal I actually care about. The skimmable event log is what keeps the fast cases fast: I'm not reconstructing intent from code, I'm reading a timeline and only dropping into the diff when something looks off.
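The "diff and stated task disagree" signal can be partly mechanised by comparing the files a commit touched against the files the task was allowed to touch. The function name and file paths below are hypothetical, and this is a sketch of a review aid under that assumption, not a feature the post claims Nightshift has.

```typescript
// Files the commit touched that the task contract did not allow.
function outOfScopeFiles(allowed: string[], touched: string[]): string[] {
  return touched.filter((file) => allowed.indexOf(file) === -1);
}

// Hypothetical task: move a duplicated date formatter into a shared util.
const allowedFiles = ["src/lib/date.ts", "src/lib/date.test.ts"];

// A commit that stayed inside its contract.
const tidyCommit = outOfScopeFiles(allowedFiles, ["src/lib/date.ts"]);

// A commit that sprawled into a file the task never mentioned.
const sprawlingCommit = outOfScopeFiles(allowedFiles, [
  "src/lib/date.ts",
  "src/lib/date.test.ts",
  "src/app/layout.tsx",
]);

// Any non-empty result is a reason to drop into the diff and read closely.
const needsCloseRead = sprawlingCommit.length > 0;
```

An empty result does not prove the change is right. It only tells you the commit stayed where it said it would, which is what keeps the fast reviews fast.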

Flagged tasks get their own lane. When Nightshift can't finish something cleanly, it stops, leaves the work uncommitted, and writes why. I triage those by hand: sometimes the ticket was underspecified and I rewrite it for the next run, sometimes the agent was right to bail and I finish it myself during the day. Either way the flag is a feature. A night that flags three tasks and ships four honestly is worth more to me than one that quietly shipped seven.

What I will not let the loop do unattended

Autonomy is only safe because of the things I refuse to delegate, and I want to be explicit about where the line sits. The loop never deploys to production. Full stop. It commits, it pushes, it opens a PR, and there it stops, because a deploy is the one action you cannot cleanly take back, and I will not hand that to a process I am not awake to supervise. The same logic covers secrets and database migrations: the runner has no path to rotate a key, touch an environment variable, or apply a schema change, and migrations in my setup are never applied automatically anyway. Those are the exact moves where a confident-but-wrong agent does lasting damage, so they earn a human checkpoint regardless of how green the gate looks.

Branch isolation is the second guardrail. Every night's work happens off main, on its own branch, so a bad run can never corrupt the trunk. The worst case is a branch I delete, not a history I have to surgically rewrite. Because each task lands as its own atomic commit, even within a bad branch I can keep the good commits and drop the rest with a git revert instead of unpicking a tangle.

Then there is the morning review gate, which is the real backstop. Nothing the loop produced counts as done until I have read it over coffee, commit by commit, against the event log. I am the merge button. If I read three commits and dislike the shape of the work, none of it ships, and I have lost nothing but a branch. This is also why I scope tasks by risk before they ever enter the queue, and why the day-to-day discipline behind it is worth reading up on if you want the same safety: I lean on the same habits I describe in my notes on Claude Code skills.

Containing a bad run, then, is not one clever mechanism. It is layered: bounded attempts so a single task cannot burn the night, an isolated branch so the blast radius is one deletable ref, atomic commits so rollback is trivial, hard exclusions on deploys and secrets so the irreversible moves never happen, and a human gate so nothing reaches anyone but me without my say-so. Strip any layer and the loop stops being something I would leave running while I sleep.

Was it worth building?

Honestly, the value isn't raw speed, though it does clear small backlogs faster than I would. The real value is that it forced me to make "done" mechanical, and that discipline leaked back into how I work by hand during the day. Once you've written a backlog precise enough for an unattended agent to execute it without supervision, writing a sloppy ticket for yourself starts to feel like a bug.

Before you reach for a per-hour figure, look at what the loop actually buys you across a week of nights. At 3 nights a week and about 6 hours a night, that is roughly 18 autonomous hours per week, against 0 hours you would have spent babysitting it by hand.

The point is not that the machine works while you sleep. It is that you wake up to commits you can review instead of a blank repo.

Those autonomous hours only count because of the second number staying at zero, and that is worth saying plainly.

Why the review step is the whole point

The number I actually track is not hours, it is reviewable commits waiting for me at breakfast. An overnight loop that produces six atomic commits I can read in two minutes each is worth far more than one that "worked all night" and hands me a tangled branch I have to reverse-engineer. Raw hours are a vanity metric. What I am buying is a morning where the repo has moved forward in a shape I can verify, not a pile of plausible diffs I have to trust on faith.

That is exactly why the review step is what makes the whole thing safe rather than reckless. The agent never gets a path to a deploy that skips me. It commits, it pushes, it opens a PR, and then it stops. I read each commit against the event log, confirm it does the one thing it claimed, and only then does any of it count. The autonomy is real, but it terminates at a human who can say no per commit. Strip out that review gate and you do not have a faster developer, you have a faster way to ship mistakes you have not read yet.

Nightshift sits alongside my other projects on /#projects, and it's the most-used thing I've built that nobody but me will ever run. If you want the broader thinking behind how I approach agent-assisted shipping, the rest is on /blog and the work history is on /cv.


Wherever you start, the same idea carries it: hand the agent a gate it cannot fake, then let it run while you sleep. Come say hi.

Built by Alperen. I build the tools I write about; Nightshift runs nightly. Working student at BMW, M.Sc. CS at LMU Munich, BMW Fastlane Scholar.