Skip to content

Indie-launch copy iteration¤

Skill: use-smart-humanize-text · Scenario: indie-launch-copy-iteration Harness: 4 arms (smart_skill / flat_skill_wiki / flat_skill / no_skill) × 3 runs × 6 sessions = 72 sessions Executor: cursor-agent · Grader: claude-opus-4-5 Run: 2026-04-27

What we're testing¤

Indie-maker Liana Koval ships 13 voice rules — 9 banned clichés (powerful, seamless, leverage, …) and 4 required moves (named audience, concrete number, first-person, named limitation) — across six sessions of in-prompt feedback. Sessions 5 and 6 deliberately withhold all feedback so retention (S5) and generalization to a new 3-post-thread format (S6) are the only path to passing.

Why we're testing it¤

The 4-arm ablation isolates two effects that conflate in simpler harnesses. flat_skill_wiki keeps SKILL.md + the static wiki/ knowledge but resets the brain (patterns.md / decisions.md / log.md) every session — so smart vs flat_skill_wiki measures persistent memory alone. flat_skill strips the wiki too, so flat_skill_wiki vs flat_skill measures whether static knowledge helps when memory is absent. no_skill sets the floor. The decisive sessions are S5 and S6: any arm above the others there is winning on retention, not on rules-in-prompt.

Headline numbers (mean over 3 runs × 6 sessions = 18 sessions per arm)¤

Arm Pass rate Mean tokens / session Mean wall time / session
smart_skill 0.873 53,435 57.2 s
no_skill 0.851 55,387 42.6 s
flat_skill_wiki 0.842 64,923 51.0 s
flat_skill 0.832 75,296 58.1 s

Verifiable results comparing the skills¤

The aggregate table buries the real signal because S1–S4 carry rules in-prompt and tie across arms. The story is the no-feedback tail (S5, S6 — mean over 3 runs):

Session smart flat_wiki flat no_skill What it probes
S5 0.922 0.843 0.902 0.843 pure retention — no in-prompt feedback
S6 0.944 0.833 0.852 0.889 generalization to a new 3-post format
  • smart vs flat_skill_wiki (the brain-only delta): +9.4% on S5, +13.3% on S6. Same SKILL.md, same wiki/; the only difference is whether patterns.md / decisions.md / log.md persisted. This is the cleanest measurement of memory value.
  • smart vs flat_skill: +5% pass rate, −29% tokens, −1.4% wall time. Memory is cheaper than re-reading SKILL.md against an empty brain.
  • smart vs no_skill: +2.6% pass rate aggregate; +9.4% on S5, +6.2% on S6. The full-skill apparatus pays off precisely on the sessions that test what was learned.
  • Smart is the only arm above 0.9 on both S5 and S6. Every other arm fails at least one of the no-feedback probes.

→ Open the full interactive report

Live preview¤

Reproduction¤

humblskills eval run use-smart-humanize-text \
  --scenario indie-launch-copy-iteration \
  --runner cursor-agent