Indie-launch copy iteration¤
Skill: use-smart-humanize-text · Scenario: indie-launch-copy-iteration
Harness: 4 arms (smart_skill / flat_skill_wiki / flat_skill / no_skill) × 3 runs × 6 sessions = 72 sessions
Executor: cursor-agent · Grader: claude-opus-4-5
Run: 2026-04-27
What we're testing¤
Indie-maker Liana Koval ships 13 voice rules — 9 banned clichés (powerful, seamless, leverage, …) and 4 required moves (named audience, concrete number, first-person, named limitation) — across six sessions of in-prompt feedback. Sessions 5 and 6 deliberately withhold all feedback so retention (S5) and generalization to a new 3-post-thread format (S6) are the only path to passing.
Why we're testing it¤
The 4-arm ablation isolates two effects that conflate in simpler harnesses. flat_skill_wiki keeps SKILL.md + the static wiki/ knowledge but resets the brain (patterns.md / decisions.md / log.md) every session — so smart vs flat_skill_wiki measures persistent memory alone. flat_skill strips the wiki too, so flat_skill_wiki vs flat_skill measures whether static knowledge helps when memory is absent. no_skill sets the floor. The decisive sessions are S5 and S6: any arm above the others there is winning on retention, not on rules-in-prompt.
Headline numbers (mean over 3 runs × 6 sessions = 18 sessions per arm)¤
| Arm | Pass rate | Mean tokens / session | Mean wall time / session |
|---|---|---|---|
smart_skill |
0.873 | 53,435 | 57.2 s |
no_skill |
0.851 | 55,387 | 42.6 s |
flat_skill_wiki |
0.842 | 64,923 | 51.0 s |
flat_skill |
0.832 | 75,296 | 58.1 s |
Verifiable results comparing the skills¤
The aggregate table buries the real signal because S1–S4 carry rules in-prompt and tie across arms. The story is the no-feedback tail (S5, S6 — mean over 3 runs):
| Session | smart | flat_wiki | flat | no_skill | What it probes |
|---|---|---|---|---|---|
| S5 | 0.922 | 0.843 | 0.902 | 0.843 | pure retention — no in-prompt feedback |
| S6 | 0.944 | 0.833 | 0.852 | 0.889 | generalization to a new 3-post format |
- smart vs flat_skill_wiki (the brain-only delta): +9.4% on S5, +13.3% on S6. Same
SKILL.md, samewiki/; the only difference is whetherpatterns.md/decisions.md/log.mdpersisted. This is the cleanest measurement of memory value. - smart vs flat_skill: +5% pass rate, −29% tokens, −1.4% wall time. Memory is cheaper than re-reading
SKILL.mdagainst an empty brain. - smart vs no_skill: +2.6% pass rate aggregate; +9.4% on S5, +6.2% on S6. The full-skill apparatus pays off precisely on the sessions that test what was learned.
- Smart is the only arm above 0.9 on both S5 and S6. Every other arm fails at least one of the no-feedback probes.
→ Open the full interactive report