Skip to content

Eval overview¤

humblskills eval runs a three-arm benchmark for any skill:

  • no_skill - baseline without the skill
  • flat_skill - skill injected without smart multi-session state
  • smart_skill - full smart skill; sessions run in order so brain state (patterns, decisions, log, wiki) carries across runs

Outputs are graded and summarized in a single-file HTML dashboard (plus JSON/Markdown mirrors). For smart skills, you get a trajectory that shows compounding across sessions.

Runners¤

Pick the runner that matches how you authenticate today, or use the mock runner in CI.

Runner Auth Notes
claudecode Claude Code login Wraps claude -p --output-format stream-json
cursor-agent Cursor login Wraps cursor-agent headless CLI
codex Codex login Wraps the OpenAI codex CLI
anthropic-api ANTHROPIC_API_KEY / keyring Pure-Go Read/Write/Bash/Glob/Grep tool loop
openai-api OPENAI_API_KEY / keyring Pure-Go tool loop
mock none Deterministic, zero tokens (CI and dev)

Next: Eval quickstart, Artifacts, Scenarios.