# Adaptive brand voice discovery
Skill: use-smart-humanize-text · Scenario: adaptive-brand-voice-discovery
Harness: 3 arms (smart_skill / flat_skill / no_skill) × 6 sessions
Executor: cursor-agent · Grader: claude-opus-4-5
Run: 2026-04-20
## What we're testing
A fictional Toronto fintech delivers ten idiosyncratic brand-voice rules across six sessions of in-prompt feedback. The agent must draft copy that satisfies every rule, with later sessions deliberately withholding feedback so retention is the only path to passing.
## Why we're testing it
Three arms share identical prompts. smart_skill carries learned rules forward in patterns.md / decisions.md / log.md; flat_skill runs the same skill machinery but with the brain reset between sessions; no_skill runs without the skill at all. Because smart_skill and flat_skill differ only in persistent memory, any smart > flat delta on the no-feedback sessions isolates the contribution of memory.
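The three-arm design can be sketched as a small config table (an illustrative sketch; the flag names are descriptive, not the harness's actual schema):

```python
# Every arm shares the same prompts; only skill/memory settings vary.
ARMS = {
    "smart_skill": {"skill_loaded": True,  "persist_memory": True},   # patterns.md / decisions.md / log.md carried forward
    "flat_skill":  {"skill_loaded": True,  "persist_memory": False},  # brain reset between sessions
    "no_skill":    {"skill_loaded": False, "persist_memory": False},  # skill machinery absent entirely
}
SESSIONS = 6  # later sessions deliberately withhold in-prompt feedback

def memory_isolates(a: str, b: str) -> bool:
    """True when two arms differ only in memory persistence, so a pass-rate
    delta between them can be attributed to memory alone."""
    return (ARMS[a]["skill_loaded"] == ARMS[b]["skill_loaded"]
            and ARMS[a]["persist_memory"] != ARMS[b]["persist_memory"])

print(memory_isolates("smart_skill", "flat_skill"))  # memory is the only difference
print(memory_isolates("smart_skill", "no_skill"))    # skill machinery also differs
```

This is why the smart vs flat comparison is the clean one: smart vs no_skill confounds memory with the presence of the skill itself.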
## Headline numbers
| Arm | Pass rate | Mean tokens / session |
|---|---|---|
| smart_skill | 0.935 | 63,519 |
| flat_skill | 0.679 | 72,266 |
| no_skill | 0.740 | 193,785 |
## Verifiable results comparing the skills
- smart vs flat: +0.256 pass rate (+37.7%), −12.1% tokens — same prompts, same scaffolding; brain is the only difference
- smart vs no_skill: +0.194 pass rate (+26.3%), −67.2% tokens — skill machinery wins on quality and cost
- Session 5 (pure retention, no in-prompt feedback): smart 0 violations · flat 10 · no_skill 9 — the decisive comparison
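The deltas above can be rechecked from the headline table's rounded values with a few lines (arm names taken from the table; small rounding differences against the unrounded run data are possible):

```python
# Pass rates and mean tokens/session, copied from the headline table.
table = {
    "smart_skill": {"pass_rate": 0.935, "tokens": 63_519},
    "flat_skill":  {"pass_rate": 0.679, "tokens": 72_266},
    "no_skill":    {"pass_rate": 0.740, "tokens": 193_785},
}

def compare(arm: str, baseline: str) -> dict:
    """Deltas of `arm` relative to `baseline`: absolute pass-rate delta,
    relative pass-rate delta (%), and relative token delta (%)."""
    pa, pb = table[arm]["pass_rate"], table[baseline]["pass_rate"]
    ta, tb = table[arm]["tokens"], table[baseline]["tokens"]
    return {
        "abs_pass_delta": round(pa - pb, 3),
        "rel_pass_delta_pct": round(100 * (pa - pb) / pb, 1),
        "rel_token_delta_pct": round(100 * (ta - tb) / tb, 1),
    }

print(compare("smart_skill", "flat_skill"))
print(compare("smart_skill", "no_skill"))
```

Running this reproduces the smart vs flat bullet exactly (+0.256, +37.7%, −12.1%); the smart vs no_skill bullet matches on tokens (−67.2%), with the pass-rate figures off in the last digit because the bullets were computed from unrounded per-session data.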