Add a batch-size override (-b) to the benchmark scripts. Testing -b 256
against the default 2048 on Vulkan RADV shows no meaningful difference
for MoE pp2048 (826 vs 843 t/s, within noise). The community-reported
+70% improvement does not reproduce on this backend.
KV cache quantization adds type_k/type_v columns to llama-bench output,
shifting the test and t/s columns to different indices. Parse fields from
the end of the row instead of relying on hardcoded positions. Also change
the KV suffix separator from underscore to dash to avoid regex ambiguity
with type names like q8_0.
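End-anchored parsing can be sketched as below (the sample row is illustrative, not actual output): with -ctk/-ctv set, llama-bench's markdown table gains type_k/type_v columns, but "test" and "t/s" remain the last two populated fields, so indexing from the end stays stable either way.

```shell
row='| qwen3 moe | 17 GiB | 30 B | Vulkan | 99 | q8_0 | q8_0 | pp2048 | 826.41 ± 3.10 |'
# With '|' as the separator the edge fields are empty, so the "test"
# column is $(NF-2) and the "t/s" column is $(NF-1).
test_name=$(printf '%s\n' "$row" | awk -F'|' '{gsub(/ /, "", $(NF-2)); print $(NF-2)}')
tps=$(printf '%s\n' "$row" | awk -F'|' '{split($(NF-1), a, " "); print a[1]}')
echo "$test_name $tps"
```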
Add 5-phase optimization guide, optimization log for tracking results,
and research docs on llama.cpp and inference landscape optimizations.
Standard benchmarks use pp512/tg128, which underestimates real-world
agentic coding, where responses run 500-2000 tokens. Both are now configurable:
--pp N Prompt processing tokens (default: 512)
--tg N Token generation count (default: 128)
Examples:
benchmark run --tag realistic --tg 1024 --pp 2048 --category moe
benchmark run --tag full-response --tg 2048 --category moe --reps 3
Log filenames include pp/tg when non-default (e.g., model__backend__fa1__pp2048_tg1024.log)
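A hypothetical sketch of the --pp/--tg handling (variable names and the surrounding loop are assumptions): defaults match the documented 512/128, and the log filename only gains a suffix when either value is non-default.

```shell
set -- --pp 2048 --tg 1024          # demo arguments
PP=512 TG=128
while [ $# -gt 0 ]; do
  case "$1" in
    --pp) PP=$2; shift 2 ;;
    --tg) TG=$2; shift 2 ;;
    *) shift ;;
  esac
done
suffix=""
if [ "$PP" -ne 512 ] || [ "$TG" -ne 128 ]; then
  suffix="__pp${PP}_tg${TG}"        # non-default: encode in filename
fi
logfile="model__backend__fa1${suffix}.log"
echo "$logfile"
```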
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both run-baseline.sh and run-suite.sh now accept --context N to set
the long-context depth (default: 32768). Prompt tokens auto-scale to
~1/16 of context depth for larger windows.
Examples:
benchmark run --tag ctx64k --context 65536 --category moe
benchmark run --tag ctx128k --context 131072 --category moe --reps 3
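The auto-scaling can be sketched as below (function name hypothetical; the exact integer divisor is an assumption based on the ~1/16 ratio described above):

```shell
# Prompt tokens grow with the requested context depth: depth / 16.
prompt_tokens_for() {
  echo $(( $1 / 16 ))
}
prompt_tokens_for 32768     # default depth
prompt_tokens_for 131072    # --context 131072
```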
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The metric logger is killed via SIGTERM on benchmark completion, producing
exit code 143 (128+15), which propagated through set -e and aborted the
run. Added an explicit return 0 / trailing true to the cleanup traps.
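A minimal reproduction sketch (process names assumed): killing a background job with SIGTERM makes `wait` report 143 (128+15); under set -e that status would abort the whole script, so the cleanup path masks it and returns 0 explicitly.

```shell
set -e
sleep 30 &                        # stand-in for the metric logger
logger_pid=$!
cleanup() {
  kill -TERM "$logger_pid" 2>/dev/null || true
  wait "$logger_pid" || true      # wait yields 143 here; || true masks it
  return 0                        # explicit success keeps set -e satisfied
}
cleanup
status=$?
echo "cleanup exit: $status"
```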
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Toolbox containers mount the host / at /run/host/, but only /home is
directly accessible. Models under /data/models/ need the /run/host/ prefix
when passed to llama-bench inside the container.
Both run-baseline.sh and run-suite.sh now resolve model paths with realpath
and prepend /run/host/ for non-home paths. Paths under /home/ are passed
as-is (already mounted directly).
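The resolution logic can be sketched as below (helper name hypothetical; realpath -m is GNU coreutils, used here so the demo paths need not exist):

```shell
resolve_model_path() {
  local p
  p=$(realpath -m "$1")
  case "$p" in
    /home/*) printf '%s\n' "$p" ;;           # /home is mounted directly
    *)       printf '%s\n' "/run/host$p" ;;  # everything else via /run/host/
  esac
}
resolve_model_path /data/models/model.gguf
resolve_model_path /home/user/models/model.gguf
```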
Verified with smoke test: Qwen3.5-0.8B-Q8_0 → 8900 t/s pp512, 177 t/s tg128.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both run-baseline.sh and run-suite.sh now support:
- --max-size GB: skip models larger than N GB (prevents OOM)
- --category LIST: filter by catalog category (smoke,dense,moe)
- --skip-longctx: skip 32K context tests (saves time + memory)
- --reps N: configure repetition count
- --help: show usage with examples
Safe pre-optimization run: benchmark baseline --max-size 20 --skip-longctx
Full post-optimization run: benchmark baseline (no filters: all models + longctx)
Also: 4 new BATS tests for flag parsing (98 total, all passing)
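A hypothetical sketch of the shared flag parsing (variable names and the default rep count are assumptions; the flags mirror the list above):

```shell
set -- --max-size 20 --skip-longctx --category moe    # demo arguments
MAX_SIZE_GB="" CATEGORY="" SKIP_LONGCTX=0 REPS=5
while [ $# -gt 0 ]; do
  case "$1" in
    --max-size)     MAX_SIZE_GB=$2; shift 2 ;;
    --category)     CATEGORY=$2; shift 2 ;;
    --skip-longctx) SKIP_LONGCTX=1; shift ;;
    --reps)         REPS=$2; shift 2 ;;
    --help)         echo "usage: run-baseline.sh [--max-size GB] ..."; exit 0 ;;
    *)              echo "unknown flag: $1" >&2; exit 2 ;;
  esac
done
echo "$MAX_SIZE_GB $CATEGORY $SKIP_LONGCTX"
```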
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace GNU \b with portable word-boundary sed patterns in kernel-params
- Warn on unknown CLI arguments instead of silently swallowing
- Add floor check on recommended_gttsize_mib to prevent negative values
- Fix Python operator precedence in benchmark log parser
- Add root checks to tuned-profile.sh and rollback.sh
- Remove redundant sudo calls (scripts already require root at entry)
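The word-boundary change can be sketched as below (the parameter value is illustrative): GNU sed's \b is not in POSIX, so anchor on explicit boundary classes instead; sed -E is supported by both GNU and BSD sed.

```shell
line="quiet amdgpu.gttsize=8192 splash"
new=$(printf '%s\n' "$line" |
  sed -E 's/(^|[[:space:]])amdgpu\.gttsize=[0-9]+([[:space:]]|$)/\1amdgpu.gttsize=16384\2/')
echo "$new"
```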
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>