Three arms (no skill, caveman, ponytail), three models, five everyday tasks, 10 runs per cell, median reported. Code LOC is counted from fenced code blocks; tokens, cost, and latency come straight from the API.

    cp ../.env.example ../.env   # add your ANTHROPIC_API_KEY
    npx promptfoo@latest eval -c promptfooconfig.yaml --repeat 10
    npx promptfoo@latest view

Tasks: email validator, JS debounce, CSV sum, React countdown, FastAPI rate-limit (see promptfooconfig.yaml). Single-shot completions, default temperature.
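
The LOC metric described above can be approximated offline. The sketch below is one possible reading of "counted from fenced code blocks", not the scoring code used in this repo: it assumes triple-backtick fences, skips blank lines (an assumption, since the source does not say how blanks are treated), and takes the median over the repeated runs of one cell. `count_code_lines` and `median_loc` are hypothetical names.

```python
import re
from statistics import median

FENCE = "`" * 3  # triple backtick, built this way so it does not end this block

# An opening fence (with optional language tag), the body, then a closing fence.
FENCE_RE = re.compile(
    rf"^{FENCE}[^\n]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def count_code_lines(completion: str) -> int:
    """Count non-blank lines inside every fenced block of one completion.

    Assumption: blank lines are not counted; the real metric may differ.
    """
    return sum(
        1
        for body in FENCE_RE.findall(completion)
        for line in body.splitlines()
        if line.strip()
    )


def median_loc(completions: list[str]) -> float:
    """Median code-line count across the repeated runs of one cell."""
    return median(count_code_lines(c) for c in completions)


if __name__ == "__main__":
    sample = f"Here you go:\n{FENCE}python\nimport re\n\nprint(re.__name__)\n{FENCE}\nDone."
    print(count_code_lines(sample))
```

Prose outside the fences is ignored, which is why a prose-compression arm and a code-compression arm can score very differently on this metric.
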

Code (lines)

| arm | Haiku | Sonnet | Opus |
|---|---|---|---|
| baseline (no skill) | 518 | 693 | 256 |
| caveman | 116 | 120 | 67 |
| ponytail | 39 | 44 | 51 |

Cost (USD, 5 tasks)

| arm | Haiku | Sonnet | Opus |
|---|---|---|---|
| baseline (no skill) | 0.032 | 0.141 | 0.135 |
| caveman | 0.014 | 0.045 | 0.075 |
| ponytail | 0.010 | 0.032 | 0.071 |

Latency (seconds, 5 tasks)

| arm | Haiku | Sonnet | Opus |
|---|---|---|---|
| baseline (no skill) | 37.7 | 124.1 | 58.7 |
| caveman | 14.9 | 34.7 | 23.1 |
| ponytail | 9.9 | 20.1 | 18.0 |

Versus baseline, ponytail writes 80-94% less code, costs 47-77% less, and runs 3-6x faster across all three models.

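
The ranges in the summary line can be rechecked from the tables. This is a small sketch, not part of the benchmark harness: the numbers are copied from the baseline and ponytail rows above, and `percent_less` and `speedup` are hypothetical helper names.

```python
# Medians copied from the tables above as (baseline, ponytail) pairs per model.
CODE_LINES = {"Haiku": (518, 39), "Sonnet": (693, 44), "Opus": (256, 51)}
COST_USD = {"Haiku": (0.032, 0.010), "Sonnet": (0.141, 0.032), "Opus": (0.135, 0.071)}
LATENCY_S = {"Haiku": (37.7, 9.9), "Sonnet": (124.1, 20.1), "Opus": (58.7, 18.0)}


def percent_less(baseline: float, arm: float) -> float:
    """Reduction relative to baseline, in percent."""
    return 100.0 * (1.0 - arm / baseline)


def speedup(baseline: float, arm: float) -> float:
    """How many times faster the arm is than baseline."""
    return baseline / arm


if __name__ == "__main__":
    for model in CODE_LINES:
        print(
            f"{model}: "
            f"{percent_less(*CODE_LINES[model]):.1f}% less code, "
            f"{percent_less(*COST_USD[model]):.1f}% lower cost, "
            f"{speedup(*LATENCY_S[model]):.1f}x faster"
        )
```

On all three metrics Opus sits at the low end of the range and Sonnet at the high end.
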
- Caveman is a prose-compression skill (it leaves code "normal"), so it lands between baseline and ponytail on code size and wins mainly on prose tokens.
- Cost reflects single-shot calls that re-send the skill every time. In real sessions the skill is injected once and prompt-cached, so the cost gap widens further in ponytail's favor.
- These are everyday tasks. For production-grade specs, where an unconstrained agent bloats much harder, see the writeups in results/.
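
The prompt-caching note in the list above can be illustrated with a rough model of the skill's input-token overhead. Everything here is an assumption for illustration: the token count, the price, and the cache multipliers (a 1.25x write premium and a 0.10x read price) are placeholders to check against current pricing, the cache is assumed to stay warm for the whole session, and `skill_input_cost` is a hypothetical helper, not something this repo ships.

```python
def skill_input_cost(
    skill_tokens: int,
    calls: int,
    usd_per_mtok: float,
    cached: bool,
    write_mult: float = 1.25,  # assumed premium for the first, cache-writing call
    read_mult: float = 0.10,   # assumed discount for later, cache-reading calls
) -> float:
    """Input cost (USD) attributable to the skill text alone over `calls` calls."""
    if calls < 1:
        return 0.0
    per_token = usd_per_mtok / 1_000_000
    if not cached:
        # Single-shot style: the skill is re-sent at full price on every call.
        return calls * skill_tokens * per_token
    # Session style: one cache write, then cache reads for the remaining calls.
    return skill_tokens * per_token * (write_mult + (calls - 1) * read_mult)


if __name__ == "__main__":
    # Placeholder numbers: a 1,000-token skill, 10 calls, $3 per million input tokens.
    print(skill_input_cost(1000, 10, 3.0, cached=False))
    print(skill_input_cost(1000, 10, 3.0, cached=True))
```

Under these assumptions a single call costs slightly more with caching (the write premium), and the saving only appears over a multi-call session, which this single-shot benchmark does not measure.
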