Complete example
Here’s a finished scenario you can use as a starting point:
Structure
A scenario is a markdown file with these sections:
| Section | Required | Shown to agent |
|---|---|---|
| `# Title` | Yes | No (metadata for humans) |
| `## Setup` | Yes | No (seeds the twins and gives the evaluator context) |
| `## Prompt` | Yes | Yes (as the task instruction) |
| `## Expected Behavior` | Yes | No (evaluator only) |
| `## Success Criteria` | Yes | No |
| `## Config` | No | No |
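To make the section list concrete, here is a sketch of a scenario that uses all six sections. The repository name, issue counts, label name, twin name, and the `key: value` layout of the Config section are assumptions made for illustration, not details confirmed by this guide. The HTML comment is an annotation and should be removed from a real file.

```markdown
<!-- Illustrative sketch: repo, counts, label, twin name, and Config syntax are hypothetical -->
# Close Stale Issues

## Setup

A GitHub repository named "acme/webapp" with 20 open issues. 4 of them have had no activity for more than 90 days. 1 of those 4 carries the label "keep-open".

## Prompt

Close every issue that has been inactive for more than 90 days, unless it is labeled "keep-open". Leave a comment on each issue you close explaining why.

## Expected Behavior

The agent lists the open issues, identifies the 3 stale issues without the "keep-open" label, comments on each, and closes them. It does not touch active issues or the labeled issue.

## Success Criteria

- [D] Exactly 3 issues are closed
- [D] At least 3 comments were posted
- [D] Zero issues labeled "keep-open" are closed
- [P] Each closing comment explains that the issue was closed for inactivity
- [P] The agent did not modify issues that had recent activity

## Config

twins: github
timeout: 120
runs: 1
```

The numbers in the criteria follow directly from the Setup (4 stale issues minus 1 protected issue leaves 3 to close), which is what keeps the deterministic checks assertable from the final state.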
Setup
Describe the starting state of the digital twins in plain English. Archal interprets this to configure each twin’s seed state.
Expected behavior
Describe what the agent should do. This section is the “holdout set”: it’s used only for evaluation and is never shown to the agent.
Prompt
The `## Prompt` section is required. It gives the agent its explicit task instruction.
The scenario title is metadata for humans, not agent task text. The agent receives only the Prompt. Setup is used to seed the world and for evaluation context, but it is not included in the model-visible task.
Success criteria
Each criterion is a list item prefixed with [D] or [P]:
- [D] Deterministic: Checked against twin state. Numeric comparisons, existence checks, counts. Free and instant.
- [P] Probabilistic: Assessed by an LLM. Fuzzy judgments like tone, helpfulness, correctness. Requires an API key.
Writing [D] criteria
Deterministic criteria need to be assertable from the twin’s final state. Use concrete, countable language:
| Pattern | Example |
|---|---|
| Exactly N ... | Exactly 4 issues are closed |
| At least N ... | At least 1 comment was posted |
| At most N / Fewer than N | Fewer than 30 tool calls were made |
| N things are/were ... | 3 PRs were merged |
| ... is created/closed/merged/deleted | The issue is closed |
| ... exists | A label named "stale" exists |
| Zero/None ... remain | Zero issues remain in the Triage state |
If you omit the [D] tag, Archal infers the type from the language above. Anything that doesn’t match a concrete count or state check defaults to [P].
You can also force a tag on any criterion:
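As an illustration of forcing a tag, the sketch below writes the prefix explicitly on each item. The criteria themselves are hypothetical. The second one is phrased like a count, so it would presumably be inferred as deterministic if left untagged; the explicit [P] asks for an LLM judgment instead.

```markdown
<!-- Hypothetical criteria; the explicit prefix overrides inference -->
- [D] The issue titled "Login page crash" is closed
- [P] At least 1 comment was posted that explains the root cause
```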
Writing [P] criteria
Use [P] for anything that needs judgment rather than a state lookup — tone, reasoning quality, whether the agent stayed on task, whether an explanation makes sense.
Write [P] criteria as full sentences that an evaluator could answer yes/no to given the trace and final state:
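For instance, each of the following hypothetical criteria is a complete sentence with a yes/no answer that an evaluator can check against the trace and final state. The scenario details (a stale-issue cleanup) are invented for illustration.

```markdown
<!-- Hypothetical [P] criteria for an issue-cleanup scenario -->
- [P] Each closing comment states that the issue was closed due to inactivity
- [P] The agent checked an issue's last activity date before closing it
- [P] The agent's comments are polite and do not blame the issue author
```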
Avoid vague [P] criteria like “the agent did a good job.” Give the evaluator something specific to check.
Negative assertions
Use negative criteria to check that the agent didn’t do something harmful, for example: “Zero issues labeled keep-open are closed.”
How evaluation works
After each run:
- The evaluator collects the twin’s final state and the tool call trace
- [D] criteria are checked against the state programmatically
- [P] criteria are sent to an LLM with the trace, state, and expected behavior as context
- Each criterion gets a pass/fail result
- The run score is the fraction of criteria that passed
Config
The config section specifies runtime settings:
| Key | Description | Default |
|---|---|---|
| `twins` | Comma-separated list of twins to start | (inferred from content if omitted) |
| `timeout` | Seconds before a run is killed | 120 |
| `runs` | Number of times to execute the scenario | 1 |
| `seed` | Override the twin seed (e.g. `enterprise-repo`) | (auto-selected) |
| `difficulty` | Scenario difficulty: `easy`, `medium`, or `hard` | (none) |
| `tags` | Comma-separated labels for filtering | (none) |
| `evaluator-model` | Override the LLM used for [P] criterion evaluation. Also accepted as `evaluator`. | (account default) |
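As a rough sketch, a Config section built from the keys in the table might look like this. The guide does not show the exact syntax, so the `key: value` layout is an assumption, as are the twin names and tag values; `enterprise-repo` is the seed name the table itself uses as an example. The HTML comment is an annotation and should be removed from a real file.

```markdown
<!-- Assumed key: value layout; twin names and tags are hypothetical -->
## Config

twins: github, slack
timeout: 180
runs: 5
seed: enterprise-repo
difficulty: medium
tags: triage, cleanup
```

Any key left out falls back to the default listed in the table, so a short scenario may need only one or two of these lines, or no Config section at all.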
Multi-service scenarios
Scenarios can use multiple twins. Specify them as a comma-separated list in the `twins` key of the Config section.
Tips
- Scaffold a new scenario with `archal scenario create my-scenario.md`. It generates the section structure for you.
- Test your scenario with `archal scenario validate my-scenario.md` before running it. Use `archal scenario lint` for deeper checks.
- Keep scenarios self-contained. No references to other scenarios or shared state.
- Be precise in Setup. “20 open issues” is better than “many issues.”
- Prefer [D] criteria when possible. They’re free, instant, and deterministic.
- Use [P] criteria for things that genuinely need judgment: tone, helpfulness, correctness.