Commit Graph

7 Commits

Felipe Cardoso
3686783f4d feat: add --context flag for configurable long-context benchmarks
Both run-baseline.sh and run-suite.sh now accept --context N to set
the long-context depth (default: 32768). Prompt tokens auto-scale to
~1/16 of context depth for larger windows.

Examples:
  benchmark run --tag ctx64k --context 65536 --category moe
  benchmark run --tag ctx128k --context 131072 --category moe --reps 3
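
A minimal sketch of the scaling rule above, assuming it lives in the benchmark scripts (variable names are illustrative, not the actual script internals):

  # Hypothetical sketch: prompt tokens scale to ~1/16 of the requested context depth.
  context_depth="${context_depth:-32768}"   # set by --context N, default 32768
  prompt_tokens=$(( context_depth / 16 ))   # 65536 -> 4096, 131072 -> 8192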

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:46:16 +01:00
Felipe Cardoso
1b5b193e81 fix: suppress exit code 143 from metric logger cleanup
The metric logger is killed via SIGTERM on benchmark completion, producing
exit code 143 (128+15), which then propagated through set -e. Added an explicit
return 0 / trailing true to the cleanup traps.
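
A minimal sketch of that pattern, assuming the logger runs as a background job (names are illustrative):

  # Hypothetical sketch: keep the logger's SIGTERM exit status (128+15 = 143)
  # from tripping set -e when the benchmark finishes.
  cleanup() {
    kill "$logger_pid" 2>/dev/null || true    # SIGTERM -> logger exits 143
    wait "$logger_pid" 2>/dev/null || true    # discard the non-zero status
    return 0
  }
  trap cleanup EXIT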

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:38:48 +01:00
Felipe Cardoso
7c8be55bfe fix: resolve model paths for toolbox container access
Toolbox containers mount the host's / at /run/host/, but only /home is directly
accessible. Models under /data/models/ need the /run/host/ prefix when passed
to llama-bench inside the container.

Both run-baseline.sh and run-suite.sh now resolve model paths with realpath
and prepend /run/host/ for non-home paths. Paths under /home/ are passed
as-is (already mounted directly).
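
A rough sketch of that mapping (function name hypothetical):

  # Hypothetical sketch: /home paths are visible as-is inside the toolbox;
  # everything else is reached through the /run/host/ mount.
  resolve_model_path() {
    local p
    p="$(realpath "$1")"
    case "$p" in
      /home/*) printf '%s\n' "$p" ;;           # mounted directly
      *)       printf '/run/host%s\n' "$p" ;;  # /data/models/... -> /run/host/data/models/...
    esac
  }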

Verified with smoke test: Qwen3.5-0.8B-Q8_0 → 8900 t/s pp512, 177 t/s tg128.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:17:16 +01:00
Felipe Cardoso
cb25fa3f6f feat: add benchmark filtering (--max-size, --category, --skip-longctx)
Both run-baseline.sh and run-suite.sh now support:
- --max-size N: skip models larger than N GB (prevents OOM)
- --category LIST: filter by catalog category (smoke,dense,moe)
- --skip-longctx: skip the 32K-context tests (saves time + memory)
- --reps N: configure the repetition count
- --help: show usage with examples
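
A minimal sketch of how the flags above can be parsed (variable and helper names are illustrative, not the scripts' actual code):

  # Hypothetical sketch of the flag-parsing loop.
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --max-size)     max_size_gb="$2"; shift 2 ;;
      --category)     categories="$2";  shift 2 ;;
      --skip-longctx) skip_longctx=1;   shift ;;
      --reps)         reps="$2";        shift 2 ;;
      --help)         print_usage; exit 0 ;;    # print_usage: hypothetical helper
      *)              echo "warning: unknown argument: $1" >&2; shift ;;
    esac
  done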

Safe pre-optimization run: benchmark baseline --max-size 20 --skip-longctx
Full post-optimization run: benchmark baseline (no filters, all models + longctx)

Also: 4 new BATS tests for flag parsing (98 total, all passing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:07:24 +01:00
Felipe Cardoso
eb52ea52ce fix: follow symlinks in model discovery, update model catalog
- Add -L flag to find in benchmark scripts (follows symlinks to /data/models/llms/)
- Exclude mmproj-*.gguf (vision projection files, not LLM models)
- Update configs/models.conf: remove Qwen3-Coder (user prefers Qwen3.5-35B-A3B),
  add Qwen3.5-27B-Q4_K_M and its Q8_0 variant, and reflect the models actually downloaded
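
A rough sketch of the discovery command described in the first two points (directory variable hypothetical):

  # Hypothetical sketch: -L follows the symlink into /data/models/llms/;
  # vision projection files (mmproj-*.gguf) are excluded.
  find -L "$model_dir" -type f -name '*.gguf' ! -name 'mmproj-*.gguf'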

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 09:44:16 +01:00
Felipe Cardoso
af0515d05d fix: address code review findings (HIGH + MEDIUM)
- Replace GNU \b with portable word-boundary sed patterns in kernel-params
- Warn on unknown CLI arguments instead of silently swallowing them
- Add floor check on recommended_gttsize_mib to prevent negative values
- Fix Python operator precedence in benchmark log parser
- Add root checks to tuned-profile.sh and rollback.sh
- Remove redundant sudo calls (scripts already require root at entry)
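
For the first point, a minimal illustration of the substitution (the key and file name here are placeholders, not the script's actual kernel parameter):

  # Hypothetical sketch: GNU-only \b word boundaries replaced with explicit groups.
  # GNU sed only:
  #   sed 's/\bfoo=[0-9]*\b/foo=42/' cmdline
  # Portable equivalent using explicit boundary groups:
  sed -E 's/(^|[[:space:]])foo=[0-9]+([[:space:]]|$)/\1foo=42\2/' cmdline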

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:19:44 +01:00
Felipe Cardoso
c596e38e9e Initial commit 2026-03-25 20:13:15 +01:00