Standard benchmarks use pp512/tg128, which understates real-world agentic
coding workloads, where responses are 500-2000 tokens.

Now configurable:
  --pp N    Prompt processing tokens (default: 512)
  --tg N    Token generation count (default: 128)

Examples:
  benchmark run --tag realistic --tg 1024 --pp 2048 --category moe
  benchmark run --tag full-response --tg 2048 --category moe --reps 3

Log filenames include pp/tg when non-default
(e.g., model__backend__fa1__pp2048_tg1024.log)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
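
A minimal sketch of the log filename rule described in this message, for illustration only. The helper name `log_filename` and its parameters are hypothetical, not taken from the tool itself, and the sketch assumes both values are appended as soon as either one differs from its default (the message only shows the case where both differ).

```python
DEFAULT_PP = 512   # default prompt-processing tokens, per the commit message
DEFAULT_TG = 128   # default token-generation count, per the commit message


def log_filename(model: str, backend: str, fa: int,
                 pp: int = DEFAULT_PP, tg: int = DEFAULT_TG) -> str:
    """Hypothetical helper: build a log filename, adding pp/tg only when non-default."""
    name = f"{model}__{backend}__fa{fa}"
    # Assumption: both values are appended as soon as either one is non-default.
    if pp != DEFAULT_PP or tg != DEFAULT_TG:
        name += f"__pp{pp}_tg{tg}"
    return name + ".log"


if __name__ == "__main__":
    print(log_filename("model", "backend", 1))
    print(log_filename("model", "backend", 1, pp=2048, tg=1024))
```

Keeping default runs unsuffixed would leave existing pp512/tg128 log names unchanged, so older results stay comparable by filename.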