How the improvement loop works
What each stage does, what it needs from you, and where the engineering time comes back.
Archal, in plain terms
Your agent does real work in real tools. It answers the ticket, refunds the charge in Stripe, merges the branch. So when a run goes wrong, the mistake is not a bad answer on a screen: it is a wrong refund in a real ledger, and an engineer loses an afternoon in the traces working out why.
Archal turns that afternoon into a loop. The bad run gets caught, rebuilt in a safe copy of your tools, fixed by a coding agent, proven clean, and kept as a test, so the same mistake cannot ship twice. Seven terms carry the rest of this page:
Stateful agent
An agent that carries memory across steps and acts on real systems as it works. Step nine depends on step two, and the results live in your actual tools. A regular agent that goes wrong gives a bad answer; a stateful agent that goes wrong gives a bad refund.
Trace
The full record of one run: every tool call, its arguments, and the state it touched.
Harness
Everything around the model: the prompts, the tool wiring, the glue code your team maintains.
Clone
A working copy of a real service that holds state, resets instantly, and is safe to fail against.
Scenario
One situation, written down: a starting state, a task, and what counts as success.
Eval
A stored test that passes or fails every future version of your agent.
Benchmark
The growing pack of scenarios one agent must keep passing; every fixed failure joins it.
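The last three terms nest: a scenario is written down once, checked as an eval, and collected into a benchmark. The sketch below shows one way that shape could look in Python. Every name in it (Scenario, Benchmark, run, the dict used for state) is hypothetical and chosen for illustration; none of it is Archal's actual API.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable

State = dict  # hypothetical: a snapshot of tool state, e.g. {"refunds": [...]}


@dataclass
class Scenario:
    name: str
    starting_state: State
    task: str
    success: Callable[[State], bool]  # what counts as success, checked on the final state


@dataclass
class Benchmark:
    scenarios: list[Scenario] = field(default_factory=list)

    def add(self, scenario: Scenario) -> None:
        self.scenarios.append(scenario)

    def run(self, agent: Callable[[State, str], State]) -> dict[str, bool]:
        # Each scenario runs against a fresh copy of its starting state,
        # so one run cannot leak state into the next.
        results = {}
        for s in self.scenarios:
            final = agent(copy.deepcopy(s.starting_state), s.task)
            results[s.name] = s.success(final)
        return results
```

Here an eval is simply one scenario's pass or fail result for a given agent version, and the benchmark is the list that grows by one entry per fixed failure.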
The loop, step by step
Five stages and the return edge that makes it a loop.
✗Catch
Every run your agent takes lands in a trace store. Archal connects to the one you already have (Langfuse, OTLP, Braintrust, or your own database) and grades every trace that arrives. A run that looks fine in the transcript but left the wrong state behind still gets flagged, with the reason attached.
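To make "looks fine in the transcript but left the wrong state behind" concrete, here is a minimal sketch of grading a trace on its end state rather than on what the agent said. The Trace and ToolCall shapes, the grade function, and the refund amounts are all assumptions made up for this example, not Archal's grader.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    args: dict


@dataclass
class Trace:
    transcript: str
    calls: list[ToolCall]
    final_state: dict


def grade(trace: Trace, expected_refund_cents: int) -> tuple[bool, str]:
    """Grade on the state left behind, not on the agent's own summary."""
    refunded = sum(r["amount"] for r in trace.final_state.get("refunds", []))
    if refunded != expected_refund_cents:
        return False, f"refunded {refunded} cents, expected {expected_refund_cents}"
    return True, "final state matches"


# The transcript claims one $20.00 refund, but the refund tool was called twice.
trace = Trace(
    transcript="I have refunded the customer $20.00.",
    calls=[ToolCall("refund", {"amount": 2000}), ToolCall("refund", {"amount": 2000})],
    final_state={"refunds": [{"amount": 2000}, {"amount": 2000}]},
)
ok, reason = grade(trace, expected_refund_cents=2000)
```

In this example the run is flagged because the ledger shows 4000 cents refunded, and the reason string travels with the flag.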
↻Recreate
A flagged failure is rebuilt in clones: stateful working copies of the services your agent touched, with the same APIs and the same error responses. Clones start in milliseconds and reset in microseconds, so Archal can replay the failure twenty times in a row and confirm it happens the same way every time.
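The reset-and-replay idea can be sketched with a toy in-memory clone. This is an illustration under stated assumptions: the Clone class, its refund method, and the error message are hypothetical stand-ins, far simpler than a real service clone, and the twenty runs mirror the figure in the paragraph above.

```python
import copy


class Clone:
    """Hypothetical in-memory stand-in for a stateful service clone."""

    def __init__(self, seed_state: dict):
        self._seed = copy.deepcopy(seed_state)
        self.state = copy.deepcopy(seed_state)

    def reset(self) -> None:
        # Return to the seeded starting state before each replay.
        self.state = copy.deepcopy(self._seed)

    def refund(self, charge_id: str, amount: int) -> None:
        charge = self.state["charges"][charge_id]
        if amount > charge["amount"] - charge["refunded"]:
            raise ValueError("refund exceeds remaining charge")
        charge["refunded"] += amount


def replay(clone: Clone, calls: list[tuple[str, int]], runs: int = 20) -> list[str]:
    """Replay the same tool calls from a clean state and record each outcome."""
    outcomes = []
    for _ in range(runs):
        clone.reset()
        try:
            for charge_id, amount in calls:
                clone.refund(charge_id, amount)
            outcomes.append("ok")
        except ValueError as exc:
            outcomes.append(f"error: {exc}")
    return outcomes


clone = Clone({"charges": {"ch_1": {"amount": 2000, "refunded": 0}}})
outcomes = replay(clone, [("ch_1", 2000), ("ch_1", 2000)])
reproducible = len(set(outcomes)) == 1  # same outcome on every run
```

Because every run starts from the same seeded state, a failure that reproduces identically across all runs can be treated as a real bug rather than a flaky one.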
±Fix
A coding agent writes the fix, and it can be yours. Archal connects to Claude Code, Codex, or Cursor and hands it the reproduction, the trace, and the failing criteria. The fix lands in the harness: the prompts, the tool wiring, the retrieval, the glue code. It arrives as an ordinary pull request.
✓Prove
No fix ships on faith. Your patched agent runs again on the exact environment that failed, and the replay has to come back clean, with the passing runs counted. The pull request carries its own evidence, a patch without a regression test is rejected, and nothing merges without your review.
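The three conditions in this stage (a regression test, a clean counted replay, and human review) amount to a merge gate. A minimal sketch follows, assuming a hypothetical PullRequest record and can_merge function; the field names and messages are invented for illustration and do not describe how Archal implements the check.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    replay_results: list[bool]  # one entry per replay of the failing scenario
    has_regression_test: bool
    human_approved: bool


def can_merge(pr: PullRequest) -> tuple[bool, str]:
    # A patch without a regression test is rejected outright.
    if not pr.has_regression_test:
        return False, "rejected: no regression test"
    # The replay must be clean, and the count is part of the evidence.
    passed = sum(pr.replay_results)
    total = len(pr.replay_results)
    if total == 0 or passed != total:
        return False, f"replay not clean: {passed}/{total} passed"
    # Nothing merges without a person signing off.
    if not pr.human_approved:
        return False, "waiting for review"
    return True, f"clean replay: {passed}/{total} passed"
```

The order of the checks is a design choice in this sketch: the cheapest rejection comes first, and human review comes last so a reviewer only sees patches that already carry passing evidence.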
+Remember
When the fix merges, the failure becomes a stored eval in a growing suite for that agent. The suite runs on every future change, including before you deploy, so each incident makes your agent permanently harder to break. That is where the engineering time comes back.