The Goatfied agent loop: how we ship code that compiles first try

Inside Goatfied's autonomous agent loop — plan, constrain, edit, validate, retry — and why a compile-first gate is what makes AI-written code trustworthy in real repositories.

The difference between an AI coding tool that's a party trick and one you'll actually let near `main` comes down to a single question: _does the change it hands you compile before a human looks at it?_ Goatfied is built around making the answer yes. This is how the agent loop works, and why "compile-first" is a design principle rather than a nice-to-have.

The loop, in five steps

Goatfied's agent runs a closed loop for every task. It's deliberately unglamorous:

1. **Plan.** Turn the task into an explicit, ordered set of edits with named acceptance checks. No edit happens until there's a plan that says what "done" means.

2. **Constrain.** Scope retrieval tightly to the files the plan touches and their direct dependencies. Big context windows tempt a model into rewriting architecture it doesn't understand; narrow, precise context keeps it honest.

3. **Edit.** Produce a small, reviewable diff — not a speculative rewrite. The loop prefers change sets that can be rolled back in minutes.

4. **Validate.** Run compile, lint, and the task's targeted tests. This gate is not optional. A change that fails the gate never reaches you as "done."

5. **Retry with narrower scope.** If validation fails, the loop doesn't dump the error on you — it shrinks the scope, uses the failure as new context, and tries again. Dead-end broad edits are killed before they hit CI.

Every pass produces a branch, a transcript, and a command log, so you can always trace _why_ a change passed or failed.
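The five steps above can be sketched as a small driver function. This is a minimal illustration, not Goatfied's actual implementation: `Plan`, `LoopDeps`, `runLoop`, the `maxAttempts` limit, and the narrowing rule (keep only the planned files implicated in the failure) are all hypothetical names and assumptions introduced here.

```typescript
// Hypothetical types for illustration; none of these names come from Goatfied.
interface Plan {
  files: string[];            // files the plan is allowed to touch
  acceptanceChecks: string[]; // named checks that define "done"
}

interface ValidationResult {
  ok: boolean;
  failingFiles: string[]; // files implicated by compile, lint, or test failures
  log: string[];          // commands that were run, kept for the audit trail
}

interface LoopDeps {
  edit: (scope: string[], failureNotes: string[]) => string; // returns a diff
  validate: (diff: string) => ValidationResult;
}

interface LoopOutcome {
  done: boolean;
  attempts: number;
  diff: string | null;
  commandLog: string[];
}

function runLoop(plan: Plan, deps: LoopDeps, maxAttempts = 3): LoopOutcome {
  // Plan: no edit happens until the plan says what "done" means.
  if (plan.acceptanceChecks.length === 0) {
    throw new Error("refusing to edit: plan has no acceptance checks");
  }
  let scope = [...plan.files]; // Constrain: start from the planned files only
  const failureNotes: string[] = [];
  const commandLog: string[] = [];

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const diff = deps.edit(scope, failureNotes); // Edit
    const result = deps.validate(diff);          // Validate
    commandLog.push(...result.log);
    if (result.ok) {
      return { done: true, attempts: attempt, diff, commandLog };
    }
    // Retry with narrower scope: keep only the in-scope files that failed
    // (an assumed narrowing rule), and carry the failure forward as context.
    const narrowed = scope.filter((f) => result.failingFiles.indexOf(f) >= 0);
    if (narrowed.length > 0) scope = narrowed;
    failureNotes.push(`attempt ${attempt} failed in: ${result.failingFiles.join(", ")}`);
  }
  // A change that never passed the gate is never reported as done.
  return { done: false, attempts: maxAttempts, diff: null, commandLog };
}
```

The edit and validate steps are injected so the control flow can be read (and tested) without a model or a build system attached.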

Why compile-first matters more than model quality

It's tempting to think a better base model solves reliability. It doesn't, on its own. Two things dominate real-world outcomes:

- **Constraint discipline.** When the loop must satisfy explicit compile and test gates, error rates fall and review churn drops across every stack we've tested. The gate turns a probabilistic model into a system with a hard floor: nothing that fails to compile, lint, or pass its targeted tests is reported as done.
- **Reversibility.** Small diffs that roll back in minutes build the trust required to let automation run at all. A tool that occasionally makes a brilliant 400-line change is worse than one that reliably makes correct 40-line changes, because you can't safely delegate to the first.

Compile-first is where these meet. The gate guarantees a floor; small diffs keep the cost of a miss low, because a bad change can be rolled back in minutes.
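To make the idea of a gate concrete, here is a minimal sketch of one: run each command in order, stop at the first non-zero exit code, and keep a log of what ran. The `Runner` type, `runGate`, and `GateReport` are hypothetical names for illustration, and the runner is injected rather than tied to any particular process API.

```typescript
// Hypothetical shapes for illustration; not Goatfied's real interfaces.
interface CommandResult {
  command: string;
  exitCode: number;
}

// Assumed contract: run a shell command and return its exit code.
type Runner = (command: string) => number;

interface GateReport {
  passed: boolean;
  log: CommandResult[]; // every command actually run, in order
}

// Run each gate command in order and stop at the first non-zero exit code.
function runGate(commands: string[], run: Runner): GateReport {
  const log: CommandResult[] = [];
  for (const command of commands) {
    const exitCode = run(command);
    log.push({ command, exitCode });
    if (exitCode !== 0) {
      return { passed: false, log };
    }
  }
  // An empty gate proves nothing, so it does not count as a pass.
  return { passed: commands.length > 0, log };
}
```

With a gate like `["make compile", "make lint", "make test TASK=scheduler-retryable"]`, a lint failure stops the run before the tests and leaves a two-entry log showing exactly where it stopped.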

What the loop refuses to do

Guardrails are as important as capabilities:

- It won't mark a change "done" if it hasn't compiled, linted, and passed the task's tests.
- It won't expand scope to "fix" something outside the plan without surfacing it as a separate, explicit step.
- It won't hide its work. The transcript and command log are the audit trail; there is no black-box "trust me."
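The first two refusals can be read as a pair of checks on a finished attempt. The sketch below is one assumed way to express them; `checkScope`, `canMarkDone`, and the `ScopeCheck` shape are hypothetical and only illustrate the rule, not how Goatfied enforces it.

```typescript
// Hypothetical guard functions for illustration only.
interface ScopeCheck {
  inScope: string[];
  outOfScope: string[]; // must be surfaced as a separate, explicit step
}

// Split the files a diff touched into those the plan named and those it did not.
function checkScope(plannedFiles: string[], touchedFiles: string[]): ScopeCheck {
  const inScope = touchedFiles.filter((f) => plannedFiles.indexOf(f) >= 0);
  const outOfScope = touchedFiles.filter((f) => plannedFiles.indexOf(f) < 0);
  return { inScope, outOfScope };
}

// "Done" requires a passing gate and no unplanned files in the diff.
function canMarkDone(gatePassed: boolean, scope: ScopeCheck): boolean {
  return gatePassed && scope.outOfScope.length === 0;
}
```

A diff that touches `config.ts` when the plan only named `scheduler.ts` and `worker.ts` would be held back even with a green gate, and the extra file reported as its own step.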

A concrete pass

Take a task like _"add a `retryable` flag to the job scheduler and honor it in the worker."_ The loop:

1. plans the edit across `scheduler.ts`, `worker.ts`, and the shared `Job` type;
2. retrieves just those files plus the two call sites that construct jobs;
3. writes a diff that adds the field, defaults it safely, and threads it through;
4. runs `make compile && make lint && make test TASK=scheduler-retryable`;
5. on a type error in an old call site, narrows to that file, fixes the constructor, and re-validates, then opens a PR with the passing checks attached as evidence.

You review a small, green diff with a transcript. That's the whole point.
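For a sense of what such a diff might contain, here is a hedged sketch. The article names the `Job` type but not its fields, so the `id` field, `createJob`, `runJob`, and the choice of `false` as the safe default are all assumptions made for illustration.

```typescript
// Hypothetical Job shape; only the `retryable` flag comes from the task itself.
interface Job {
  id: string;
  retryable: boolean;
}

// Older call sites may omit the flag, so construction defaults it to false:
// a job is never retried unless a caller explicitly opts in (assumed default).
function createJob(input: { id: string; retryable?: boolean }): Job {
  return { id: input.id, retryable: input.retryable ?? false };
}

type JobOutcome = "succeeded" | "retry" | "failed";

// Worker side: honor the flag when a job's handler throws.
function runJob(job: Job, handler: (job: Job) => void): JobOutcome {
  try {
    handler(job);
    return "succeeded";
  } catch {
    return job.retryable ? "retry" : "failed";
  }
}
```

Because `retryable` is a required field on `Job`, any old call site that builds a `Job` object literal without it stops compiling. That is the kind of type error the validate step catches and the retry step fixes by routing the call site through the constructor.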

Designing your own tasks for the loop

You get the most out of a compile-first agent by meeting it halfway:

- **Write acceptance checks before prompting.** "Done" should be a test, not a vibe.
- **Keep tickets small and reversible.** If a change can't roll back in minutes, split it.
- **Require compile, lint, and targeted tests** as the gate, the same gate a human PR would face.

Do this and the agent loop compounds: each failure note feeds the next run, so the process gets more reliable over time instead of drifting randomly.
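One way to hold yourself to these three rules is to lint the task before handing it over. The sketch below is an assumption-heavy illustration: the `TaskSpec` shape, `lintTask`, the default limit of five files, and the substring match on gate commands are all hypothetical choices, not part of Goatfied.

```typescript
// Hypothetical task description for illustration only.
interface TaskSpec {
  title: string;
  acceptanceChecks: string[];    // test names or commands that define "done"
  gate: string[];                // commands the change must pass
  estimatedFilesTouched: number; // rough size signal for reversibility
}

// Return a list of problems; an empty list means the task is ready to hand off.
function lintTask(task: TaskSpec, maxFiles = 5): string[] {
  const problems: string[] = [];
  if (task.acceptanceChecks.length === 0) {
    problems.push("no acceptance checks: write the test that defines done first");
  }
  if (task.estimatedFilesTouched > maxFiles) {
    problems.push(`touches more than ${maxFiles} files: split the ticket`);
  }
  // Crude substring check (an assumption): each required step must appear in some gate command.
  for (const required of ["compile", "lint", "test"]) {
    if (!task.gate.some((cmd) => cmd.indexOf(required) >= 0)) {
      problems.push(`gate is missing a ${required} step`);
    }
  }
  return problems;
}
```

The file-count threshold is arbitrary; the point is that "small" and "has a gate" become checkable properties of a ticket rather than intentions.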

The takeaway

Autonomous coding is only useful if you can trust the output enough to delegate. Goatfied earns that trust structurally — plan, constrain, edit, validate, retry — with a compile-first gate that no change is allowed to skip. It's less exciting than a one-shot demo and far more useful on a real repository.

[Goatfied vs Cursor vs GitHub Copilot: how to benchmark them on real PR tasks](/blog/goatfied-vs-cursor-vs-github-copilot-benchmark-50-pr-tasks)
[Multi-file refactors with Goatfied: case studies from real codebases](/blog/multi-file-refactors-with-goatfied-case-studies)