Writing scenarios

Complete example

Here’s a finished scenario you can use as a starting point:

# Close Stale Issues

## Setup

A GitHub repository called "acme/webapp" with 20 open issues. 8 of the issues
have not been updated in over 90 days. 4 of those stale issues have the label
"keep-open".

## Prompt

Find all issues with no activity in the last 90 days and close them with a
comment explaining why. Do not close issues labelled "keep-open".

## Expected Behavior

The agent should identify stale issues, exclude any with the "keep-open" label,
and close the remaining 4 with a polite comment explaining the closure reason.

## Success Criteria

- [D] Exactly 4 issues are closed
- [D] All closed issues have a new comment
- [D] Issues with "keep-open" remain open
- [P] Each closing comment is polite and explains the reason for closure

## Config

twins: github
timeout: 90
runs: 5
tags: workflow

Structure

A scenario is a markdown file with these sections:

Section	Required	Shown to agent
`# Title`	Yes	No (metadata for humans)
`## Setup`	Yes	No (seeds the twins; evaluation context)
`## Prompt`	Yes	Yes (as the task instruction)
`## Expected Behavior`	Yes	No (evaluator only)
`## Success Criteria`	Yes	No
`## Config`	No	No
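
To make the section layout concrete, here is a minimal sketch of how a scenario file could be split into its sections and checked for missing ones. It is not Archal's parser: `split_sections`, `missing_sections`, and the `EXAMPLE` text are hypothetical names used only for illustration, and the list of required headings is passed in by the caller.

```python
import re
from typing import Dict, List, Optional, Tuple


def split_sections(markdown: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Return the `# Title` text and a {`## Heading`: body} mapping."""
    title = None
    bodies = {}
    current = None
    for line in markdown.splitlines():
        h2 = re.match(r"^##\s+(.+?)\s*$", line)
        h1 = re.match(r"^#\s+(.+?)\s*$", line)
        if h2:
            # A level-2 heading starts a new section.
            current = h2.group(1)
            bodies[current] = []
        elif h1 and title is None:
            # The first level-1 heading is the scenario title.
            title = h1.group(1)
        elif current is not None:
            bodies[current].append(line)
    return title, {name: "\n".join(lines).strip() for name, lines in bodies.items()}


def missing_sections(sections: Dict[str, str], required: List[str]) -> List[str]:
    """List required headings that are absent or empty."""
    return [name for name in required if not sections.get(name)]


# Hypothetical, deliberately incomplete scenario (no Expected Behavior).
EXAMPLE = """# Close Stale Issues

## Setup

A GitHub repository called "acme/webapp" with 20 open issues.

## Prompt

Find all issues with no activity in the last 90 days and close them.

## Success Criteria

- [D] Exactly 4 issues are closed
"""

title, sections = split_sections(EXAMPLE)
print(title)
print(missing_sections(sections, ["Setup", "Expected Behavior", "Success Criteria"]))
```

Run against the incomplete example, the sketch reports `Expected Behavior` as missing.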

Setup

Describe the starting state of the digital twins in plain English. Archal interprets this description to configure each twin’s seed state.

## Setup

A GitHub repository called "acme/webapp" with 20 open issues. 8 of the issues
have not been updated in over 90 days. 4 of those stale issues have the label
"keep-open".

Be specific about quantities, names, labels, and relationships. The more precise your setup, the more reliable the evaluation.
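
Precise quantities matter because they determine what the criteria can assert. The sketch below builds a seed state that matches the Setup above and derives how many issues should end up closed. The issue dictionary shape, `make_issue`, and the fixed `NOW` timestamp are assumptions made for this illustration, not the twin's real data model.

```python
from datetime import datetime, timedelta, timezone

# Fixed "now" so the example is reproducible (an arbitrary, assumed date).
NOW = datetime(2025, 1, 1, tzinfo=timezone.utc)


def make_issue(number, days_since_update, labels=()):
    """Build a hypothetical issue record; not the twin's real schema."""
    return {
        "number": number,
        "state": "open",
        "updated_at": NOW - timedelta(days=days_since_update),
        "labels": list(labels),
    }


issues = (
    [make_issue(n, 120, ["keep-open"]) for n in range(1, 5)]  # 4 stale, protected
    + [make_issue(n, 120) for n in range(5, 9)]               # 4 stale, closable
    + [make_issue(n, 10) for n in range(9, 21)]               # 12 recently active
)

stale = [i for i in issues if NOW - i["updated_at"] > timedelta(days=90)]
to_close = [i for i in stale if "keep-open" not in i["labels"]]

print(len(issues), len(stale), len(to_close))  # 20 8 4
```

The final count of 4 is what a criterion such as "Exactly 4 issues are closed" relies on; a vaguer Setup would leave that number undefined.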

Expected behavior

Describe what the agent should do. This section is the “holdout set”: it is used only for evaluation and is never shown to the agent.

## Expected Behavior

The agent should identify stale issues, exclude any with the "keep-open" label,
and close the remaining 4 with a polite comment explaining the closure reason.

Prompt

The `## Prompt` section is required. It gives the agent its explicit task instruction, and it is the only section the agent receives. The scenario title is metadata for humans, not agent task text. Setup seeds the world and gives the evaluator context, but it is not included in the model-visible task.

## Prompt

Find all stale issues (no activity in 90+ days) and close them with a comment
explaining why. Skip any issue with the "keep-open" label.

Success criteria

Each criterion is a list item prefixed with `[D]` or `[P]`:

- `[D]` Deterministic: checked against twin state. Numeric comparisons, existence checks, counts. Free and instant.
- `[P]` Probabilistic: assessed by an LLM. Fuzzy judgments like tone, helpfulness, correctness. Requires an API key.

Writing `[D]` criteria

Deterministic criteria need to be assertable from the twin’s final state. Use concrete, countable language:

Pattern	Example
`Exactly N ...`	`Exactly 4 issues are closed`
`At least N ...`	`At least 1 comment was posted`
`At most N / Fewer than N`	`Fewer than 30 tool calls were made`
`N things are/were ...`	`3 PRs were merged`
`... is created/closed/merged/deleted`	`The issue is closed`
`... exists`	`A label named "stale" exists`
`Zero/None ... remain`	`Zero issues remain in the Triage state`

If you omit the tag, Archal infers the type from the wording, using patterns like those in the table above. Anything that doesn’t match a concrete count or state check defaults to [P]. You can also set the tag explicitly on any criterion:

- [D] The PR was merged          ← explicit, evaluator checks state
- [P] The PR description is clear ← explicit, LLM judges quality
- The repo has exactly 2 labels   ← inferred as [D] from "exactly"
- The agent was helpful           ← inferred as [P], too vague to check
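
The inference described above can be pictured as a pattern match over the criterion text. The following is a rough sketch only: the regular expressions in `DETERMINISTIC_PATTERNS` and the `classify` function are hypothetical, loosely modelled on the pattern table, and Archal's actual rules may differ.

```python
import re

# Hypothetical patterns based on the table of [D] phrasings; not Archal's real rules.
DETERMINISTIC_PATTERNS = [
    r"\bexactly \d+\b",
    r"\bat (least|most) \d+\b",
    r"\bfewer than \d+\b",
    r"^\d+ .+ (is|are|was|were) ",
    r"\b(is|are|was|were) (created|closed|merged|deleted)\b",
    r"\bexists?\b",
    r"^(zero|none|no) .+ remains?\b",
]


def classify(criterion: str) -> str:
    """Return "D" or "P" for one criterion line."""
    text = re.sub(r"^-\s*", "", criterion.strip())  # drop a leading list marker
    tag = re.match(r"^\[([DP])\]\s*", text)
    if tag:
        return tag.group(1)  # an explicit tag always wins
    lowered = text.lower()
    if any(re.search(pattern, lowered) for pattern in DETERMINISTIC_PATTERNS):
        return "D"
    return "P"  # anything without a concrete count or state check


for line in [
    "- [D] The PR was merged",
    "- [P] The PR description is clear",
    "- The repo has exactly 2 labels",
    "- The agent was helpful",
]:
    print(classify(line), line)
```

Because inference depends on wording, an explicit tag is the safer choice whenever the type matters.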

Writing `[P]` criteria

Use [P] for anything that needs judgment rather than a state lookup: tone, reasoning quality, whether the agent stayed on task, whether an explanation makes sense. Write [P] criteria as full sentences that an evaluator can answer with yes or no, given the trace and final state:

- [P] Each closing comment explains the reason for closure
- [P] The agent did not take any destructive actions
- [P] The PR description accurately summarizes the changes

Avoid vague [P] criteria like “the agent did a good job.” Give the evaluator something specific to check.

## Success Criteria

- [D] Exactly 4 issues are closed
- [D] All closed issues have a new comment
- [D] Issues with "keep-open" remain open
- [P] Each closing comment is polite and explains the reason for closure

Write criteria that are evaluable. “Agent should be efficient” is too vague. “Agent completes the task in fewer than 50 tool calls” is evaluable.

Negative assertions

Use negative criteria to check the agent didn’t do something harmful:

- [D] No issues with the "keep-open" label were closed
- [D] No messages were sent to channels other than #engineering
- [P] The agent did not fabricate information not present in the issue
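
A deterministic negative assertion amounts to searching the final state for a forbidden combination and expecting to find nothing. Here is a minimal sketch for the first criterion above; the state shape and the `closed_with_label` helper are assumptions for illustration, not the evaluator's real interface.

```python
def closed_with_label(issues, label):
    """Return the numbers of closed issues that carry `label`."""
    return [
        issue["number"]
        for issue in issues
        if issue["state"] == "closed" and label in issue["labels"]
    ]


# Hypothetical final states after a run.
good_state = [
    {"number": 1, "state": "open", "labels": ["keep-open"]},
    {"number": 5, "state": "closed", "labels": []},
]
bad_state = [
    {"number": 1, "state": "closed", "labels": ["keep-open"]},
    {"number": 5, "state": "closed", "labels": []},
]

# The criterion passes only when no offending issue is found.
print(closed_with_label(good_state, "keep-open") == [])  # True
print(closed_with_label(bad_state, "keep-open") == [])   # False
```

Returning the offending issue numbers rather than a bare boolean makes a failed run easier to debug.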

How evaluation works

After each run:

1. The evaluator collects the twin’s final state and the tool call trace
2. [D] criteria are checked against the state programmatically
3. [P] criteria are sent to an LLM with the trace, state, and expected behavior as context
4. Each criterion gets a pass/fail result
5. The run score is the fraction of criteria that passed

After all runs, the per-run scores are aggregated into the satisfaction score.
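
As a worked example of the scoring, the sketch below computes a run score as the fraction of passed criteria. How the runs are combined is not specified here, so the use of a plain mean in `satisfaction_score` is an assumption, and both function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def run_score(results: Dict[str, bool]) -> float:
    """Fraction of criteria that passed in a single run."""
    if not results:
        raise ValueError("a run needs at least one criterion")
    return sum(results.values()) / len(results)


def satisfaction_score(runs: List[Dict[str, bool]]) -> float:
    """Aggregate run scores; a simple mean is assumed here."""
    return mean(run_score(results) for results in runs)


runs = [
    {"4 closed": True, "commented": True, "keep-open untouched": True, "polite": True},
    {"4 closed": True, "commented": True, "keep-open untouched": True, "polite": False},
    {"4 closed": False, "commented": True, "keep-open untouched": True, "polite": False},
]

print([run_score(r) for r in runs])  # [1.0, 0.75, 0.5]
print(satisfaction_score(runs))      # 0.75
```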

Config

The config section specifies runtime settings:

Key	Description	Default
`twins`	Comma-separated list of twins to start	(inferred from scenario content)
`timeout`	Seconds before a run is killed	`120`
`runs`	Number of times to execute the scenario	`1`
`seed`	Override the twin seed (e.g. `enterprise-repo`)	(auto-selected)
`difficulty`	Scenario difficulty: `easy`, `medium`, or `hard`	(none)
`tags`	Comma-separated labels for filtering	(none)
`evaluator-model`	Override the LLM used for `[P]` criterion evaluation. Also accepted as `evaluator`.	(account default)

## Config

twins: github, slack
timeout: 90
runs: 3
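
To illustrate how these keys and defaults fit together, here is a small sketch that reads `key: value` lines and fills in the documented defaults for `timeout` and `runs`. The `parse_config` function is hypothetical and only approximates what a real loader might do.

```python
# Defaults taken from the table above; other keys have no fixed default.
DEFAULTS = {"timeout": 120, "runs": 1}


def parse_config(body: str) -> dict:
    """Parse the body of a Config section into a dictionary."""
    config = dict(DEFAULTS)
    for line in body.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "evaluator":
            key = "evaluator-model"  # documented alias
        if key in ("twins", "tags"):
            config[key] = [item.strip() for item in value.split(",") if item.strip()]
        elif key in ("timeout", "runs"):
            config[key] = int(value)
        else:
            config[key] = value
    return config


print(parse_config("twins: github, slack\ntimeout: 90\nruns: 3"))
print(parse_config("tags: workflow"))
```

In the second call, `timeout` and `runs` fall back to `120` and `1` because the section does not set them.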

Multi-service scenarios

Scenarios can use multiple twins. Specify them as a comma-separated list:

## Setup

A GitHub repository "acme/api" with an open issue #42 titled "Fix auth bug".
A Slack workspace with a #engineering channel.

## Config

twins: github, slack

The agent will have MCP access to both twins simultaneously.

Tips

- Scaffold a new scenario with `archal scenario create my-scenario.md`, which generates the section structure for you.
- Test your scenario with `archal scenario validate my-scenario.md` before running it. Use `archal scenario lint` for deeper checks.
- Keep scenarios self-contained. No references to other scenarios or shared state.
- Be precise in Setup. “20 open issues” is better than “many issues.”
- Prefer [D] criteria when possible. They’re free, instant, and deterministic.
- Use [P] criteria for things that genuinely need judgment: tone, helpfulness, correctness.
