The run loop

When you execute archal run scenario.md, five things happen:

1. Parse the scenario

The runner reads your markdown file and extracts these sections:
  • Setup - natural language describing the initial twin state
  • Prompt (optional) - the task given to the agent. If omitted, the setup is used as the task.
  • Expected behavior - what the agent should do (used only for evaluation, never shown to the agent)
  • Success criteria - evaluable statements tagged as [D] (deterministic) or [P] (probabilistic)
  • Config - which twins to use, timeout, number of runs
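Putting those sections together, a minimal scenario file might look like the sketch below. The heading syntax, criterion wording, and config keys are illustrative, not a verbatim template:

```markdown
## Setup
A storefront twin with 3 open orders, one of which is overdue.

## Prompt
Find the overdue order and refund it.

## Expected behavior
The agent locates the overdue order and refunds it without
modifying the other two orders.

## Success criteria
- [D] The overdue order has status "refunded"
- [D] Exactly 2 orders remain open
- [P] The agent explains why that order was refunded

## Config
twins: storefront
timeout: 300
runs: 3
```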

2. Provision cloud twins

Archal requests a hosted session for the required twins. Each twin is pre-loaded with state generated from the scenario’s setup section. The hosted twins expose MCP/API endpoints that are reachable by your configured execution engine.

3. Run the engine

Archal executes your agent against the hosted twins. The mode is inferred from which flags you provide:
  • API mode (--engine-endpoint): sends the scenario task to a remote /v1/responses endpoint (e.g. an OpenClaw gateway or your own agent API). The engine receives the task plus twin endpoint URLs.
  • Harness mode (--harness-dir): spawns a local agent command from a directory, optionally configured with an archal-harness.json manifest.
The run continues until completion or timeout, and every tool call is recorded in a trace.
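As a usage sketch, the two modes differ only in the flag you pass (the endpoint URL and directory below are placeholders):

```shell
# API mode: the task is sent to a remote /v1/responses endpoint,
# along with the twin endpoint URLs.
archal run scenario.md --engine-endpoint https://agent.example.com/v1/responses

# Harness mode: a local agent command is spawned from a directory.
archal run scenario.md --harness-dir ./my-agent
```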

4. Evaluate

After each run, the evaluator checks every success criterion:
  • Deterministic criteria [D] - checked against the twin’s final state. Numeric comparisons, existence checks, count assertions. Free and instant.
  • Probabilistic criteria [P] - assessed by an LLM (Claude) using the trace, final state, and expected behavior description
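To make the deterministic side concrete, here is a minimal sketch of what a [D] check amounts to: pure assertions against the twin's final state, with no LLM involved. The state shape and helper names are hypothetical, not Archal's actual API.

```python
import operator

# Hypothetical final twin state after a run (illustrative shape).
final_state = {
    "orders": [{"id": 1, "status": "refunded"}, {"id": 2, "status": "open"}],
    "balance": 42.0,
}

def check_count(state, key, expected):
    # Count/existence assertion: does the collection have the expected size?
    return len(state.get(key, [])) == expected

def check_numeric(state, key, op, value):
    # Numeric comparison, e.g. balance >= 40.
    ops = {"==": operator.eq, ">=": operator.ge, "<=": operator.le}
    return ops[op](state[key], value)

print(check_count(final_state, "orders", 2))            # True
print(check_numeric(final_state, "balance", ">=", 40))  # True
```

Because these checks are plain comparisons over recorded state, they cost nothing and run instantly, which is why only [P] criteria need an LLM judge.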

5. Score

The scenario runs N times (default 1, increase with -n). Each run gets a per-criterion pass/fail. The satisfaction score is the percentage of criterion-run pairs that passed across all runs.
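The arithmetic behind the satisfaction score can be sketched as follows (the result matrix is made up for illustration):

```python
# Satisfaction score: passed (criterion, run) pairs / total pairs.
# Here: 2 runs x 3 criteria = 6 pairs, 3 of which passed.
results = [
    [True, True, False],   # run 1: criteria 1-3
    [True, False, False],  # run 2: criteria 1-3
]
passed = sum(run.count(True) for run in results)
total = sum(len(run) for run in results)
score = 100 * passed / total
print(f"{score:.1f}%")  # 50.0%
```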