Standard benchmarks use pp512/tg128, which understates real-world
agentic coding workloads, where responses run 500-2000 tokens. Both
sizes are now configurable:
  --pp N    Prompt processing tokens (default: 512)
  --tg N    Token generation count (default: 128)
Examples:
  benchmark run --tag realistic --tg 1024 --pp 2048 --category moe
  benchmark run --tag full-response --tg 2048 --category moe --reps 3
Log filenames include pp/tg when non-default (e.g., model__backend__fa1__pp2048_tg1024.log)
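
The naming rule could look roughly like the sketch below. It is illustrative only, not the code in this commit: the function name `log_filename` and its parameters are hypothetical, and it assumes both values are appended whenever either one differs from its default, which the example filename above does not confirm.

```python
# Hypothetical sketch of the log naming rule; not the actual implementation.
DEFAULT_PP = 512  # default prompt processing tokens (from --pp)
DEFAULT_TG = 128  # default token generation count (from --tg)


def log_filename(model: str, backend: str, fa: int,
                 pp: int = DEFAULT_PP, tg: int = DEFAULT_TG) -> str:
    """Build a log filename, adding a pp/tg suffix only for non-default sizes."""
    name = "__".join([model, backend, f"fa{fa}"])
    # Assumption: include both values if either one is non-default.
    if pp != DEFAULT_PP or tg != DEFAULT_TG:
        name += f"__pp{pp}_tg{tg}"
    return name + ".log"
```

Under that assumption, a run with only --tg 2048 would still record pp512 in the name, so logs from different runs stay unambiguous.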
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>