Commit Graph

18 Commits

Felipe Cardoso
6ab08537ca fix: address code review findings — batch args, venv path, serve flags
- Fix missing BATCH_ARGS in long-context commands (both benchmark scripts)
- Fix CLAUDE.md stale venv path (data/venv → .venv) and add serve/power docs
- Add -b/--batch to bin/benchmark help text
- Add --no-think flag to serve script (--reasoning-budget 0)
- Sanitize model names in eval run directories
- Simplify agentic setup to use requirements.txt
- Add serve --help test, batch flag assertions to existing tests
- Add requirements.txt for reproducible venv setup (Python 3.13)
2026-03-31 10:10:48 +02:00
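The model-name sanitization mentioned above could look like the following minimal sketch (the function name and character class are assumptions, not the repo's actual code): anything outside a filesystem-safe set is replaced before the name becomes an eval run directory.

```shell
# Hypothetical sketch: map a model name to a safe run-directory name by
# replacing every character outside [A-Za-z0-9._-] with an underscore.
sanitize_name() {
  printf '%s\n' "$1" | sed 's/[^A-Za-z0-9._-]/_/g'
}
sanitize_name "Qwen3.5-35B-A3B (Q4_K_M)"   # spaces and parens become underscores
```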
Felipe Cardoso
dd403a907c feat(serve): add optimized llama-server launcher with n-gram speculation
Add `make serve` and `make serve-ngram` for launching llama-server with
baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention,
no-mmap, full GPU offload). N-gram speculative decoding gives 1.1-1.4x
tg speedup on repetitive content without upstream PR dependencies.
Update Phase 5 status: MTP is months away (4 unmerged PRs, no MoE
support), draft-model speculation stalled on ROCm buffer crashes.
2026-03-30 21:12:30 +02:00
Felipe Cardoso
ba24091791 feat(benchmark): add -b/--batch flag, test MoE batch size impact
Add batch size override to benchmark scripts. Testing -b 256 vs default
2048 on Vulkan RADV shows no meaningful difference for MoE pp2048
(826 vs 843 t/s, within noise). Community-reported +70% improvement
does not reproduce on this backend.
2026-03-30 20:01:24 +02:00
Felipe Cardoso
1549bc27c0 feat(optimize): add Phase 2 power profile and system tuning
Add `make optimize-power` (ryzenadj 85W, sysctl, THP, RADV nogttspill)
with systemd services for boot/resume persistence. Integrate into
`make optimize --all` as Phase 2. Update optimization log with RyzenAdj
results (+46% tg at 70W sustained), KV sweep data, and quant shootout.
Add Qwen3-Coder-30B and Nemotron-Cascade-2 to model catalog.
2026-03-30 18:53:52 +02:00
Felipe Cardoso
f92b710492 fix(benchmark): parse llama-bench output with variable column count
KV cache quantization adds type_k/type_v columns to llama-bench output,
shifting test and t/s to different indices. Parse from end of row instead
of hardcoded positions. Also fix KV suffix separator (underscore to dash)
to avoid regex ambiguity with type names like q8_0.

Add 5-phase optimization guide, optimization log for tracking results,
and research docs on llama.cpp and inference landscape optimizations.
2026-03-27 14:54:19 +01:00
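The end-anchored parsing above can be sketched as follows (the sample row is fabricated, not real llama-bench output): with the extra type_k/type_v columns present, `t/s` is still the second field from the end and `test` the third, so counting from the end is stable regardless of how many columns precede them.

```shell
# Made-up markdown row with type_k/type_v columns inserted mid-table.
row='| qwen3 30B Q4_K | 17.28 GiB | 30.53 B | Vulkan | 99 | q4_0 | q4_0 | pp2048 | 826.43 |'
# Split on '|'; the trailing '|' yields an empty last field, so the t/s
# value sits at $(NF-1) and the test name at $(NF-2).
test_name=$(printf '%s\n' "$row" | awk -F'|' '{t=$(NF-2); gsub(/ /,"",t); print t}')
tps=$(printf '%s\n' "$row" | awk -F'|' '{v=$(NF-1); gsub(/ /,"",v); print v}')
echo "$test_name = $tps t/s"
```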
Felipe Cardoso
7531f6fa74 feat(benchmark): add --kv-types flag for KV cache quantization sweep 2026-03-27 12:29:19 +01:00
Felipe Cardoso
38daf953bf feat: add --pp and --tg flags for realistic benchmark workloads
Standard benchmarks use pp512/tg128, which underestimates real-world
agentic coding workloads, where responses run 500-2000 tokens. Both are now configurable:

  --pp N    Prompt processing tokens (default: 512)
  --tg N    Token generation count (default: 128)

Examples:
  benchmark run --tag realistic --tg 1024 --pp 2048 --category moe
  benchmark run --tag full-response --tg 2048 --category moe --reps 3

Log filenames include pp/tg when non-default (e.g., model__backend__fa1__pp2048_tg1024.log)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:48:32 +01:00
Felipe Cardoso
3686783f4d feat: add --context flag for configurable long-context benchmarks
Both run-baseline.sh and run-suite.sh now accept --context N to set
the long-context depth (default: 32768). Prompt tokens auto-scale to
~1/16 of context depth for larger windows.

Examples:
  benchmark run --tag ctx64k --context 65536 --category moe
  benchmark run --tag ctx128k --context 131072 --category moe --reps 3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:46:16 +01:00
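The ~1/16 auto-scaling rule is simple integer arithmetic; a sketch (variable names assumed, not taken from the scripts):

```shell
# Prompt tokens scale to ~1/16 of the requested long-context depth.
context=65536
pp_tokens=$(( context / 16 ))
echo "ctx=$context pp=$pp_tokens"
```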
Felipe Cardoso
1b5b193e81 fix: suppress exit code 143 from metric logger cleanup
The metric logger is killed via SIGTERM on benchmark completion, producing
exit code 143 (128+15), which propagated through `set -e`. Added explicit
`return 0` / trailing `true` to the cleanup traps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:38:48 +01:00
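A minimal sketch of that pattern (the background `sleep` stands in for the metric logger; names are hypothetical): the cleanup trap kills the logger but swallows its SIGTERM exit status so it cannot propagate through `set -e`.

```shell
set -e
sleep 60 &                                 # stand-in for the metric logger
logger_pid=$!
cleanup() {
  kill "$logger_pid" 2>/dev/null || true   # ignore failures if already gone
  wait "$logger_pid" 2>/dev/null || true   # wait returns 143; `|| true` eats it
  return 0                                 # trap itself exits cleanly
}
trap cleanup EXIT
echo "benchmark done"
```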
Felipe Cardoso
fb1e57f1bf feat: make llama-rocm-7.2 a required toolbox in benchmark setup
ROCm 7.2 is now created alongside vulkan-radv during setup, enabling a
Vulkan vs ROCm comparison in baseline and post-optimization benchmarks.

Smoke test: ROCm 7.2 on Qwen3.5-0.8B → 8090 t/s pp512, 161 t/s tg128
(vs Vulkan: 8900 t/s pp512, 177 t/s tg128)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:23:03 +01:00
Felipe Cardoso
7c8be55bfe fix: resolve model paths for toolbox container access
Toolbox containers mount host / at /run/host/ but only /home is directly
accessible. Models on /data/models/ need the /run/host/ prefix when passed
to llama-bench inside the container.

Both run-baseline.sh and run-suite.sh now resolve model paths with realpath
and prepend /run/host/ for non-home paths. Paths under /home/ are passed
as-is (already mounted directly).

Verified with smoke test: Qwen3.5-0.8B-Q8_0 → 8900 t/s pp512, 177 t/s tg128.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:17:16 +01:00
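The mapping described above can be sketched as a small function (the function name is hypothetical; the `/home` passthrough and `/run/host/` prefix follow the commit message):

```shell
resolve_model_path() {
  local p
  p=$(realpath -m "$1")              # canonicalize; -m tolerates missing paths
  case "$p" in
    /home/*) printf '%s\n' "$p" ;;              # /home is mounted directly
    *)       printf '%s\n' "/run/host$p" ;;     # everything else via host mount
  esac
}
resolve_model_path /data/models/llms/Qwen3.5-0.8B-Q8_0.gguf
```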
Felipe Cardoso
d22c062ca7 fix: model catalog shows download status, GPU detection in toolbox
- Catalog * indicator now searches recursively (finds models in subdirs)
- GPU verification suppresses toolbox crun stderr (directory not found noise)
- Matches on "radeon" and "available devices" for Vulkan output

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:14:31 +01:00
Felipe Cardoso
cb25fa3f6f feat: add benchmark filtering (--max-size, --category, --skip-longctx)
Both run-baseline.sh and run-suite.sh now support:
- --max-size N: skip models larger than N GB (prevents OOM)
- --category LIST: filter by catalog category (smoke,dense,moe)
- --skip-longctx: skip 32K context tests (saves time + memory)
- --reps N: configure repetition count
- --help: shows usage with examples

Safe pre-optimization run: benchmark baseline --max-size 20 --skip-longctx
Full post-optimization: benchmark baseline (no filters, all models + longctx)

Also: 4 new BATS tests for flag parsing (98 total, all passing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:07:24 +01:00
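The --max-size check above amounts to a byte comparison before loading a model; a hypothetical sketch (sparse stand-in files, no real models involved):

```shell
max_size_gb=20
exceeds_max_size() {
  local bytes
  bytes=$(stat -c%s "$1")                          # file size in bytes
  [ $(( bytes / 1024 / 1024 / 1024 )) -gt "$max_size_gb" ]
}
truncate -s 25G /tmp/big-model.gguf                # sparse files cost no disk
truncate -s 5G  /tmp/small-model.gguf
exceeds_max_size /tmp/big-model.gguf   && echo "skip big-model"
exceeds_max_size /tmp/small-model.gguf || echo "keep small-model"
```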
Felipe Cardoso
eb52ea52ce fix: follow symlinks in model discovery, update model catalog
- Add -L flag to find in benchmark scripts (follows symlinks to /data/models/llms/)
- Exclude mmproj-*.gguf (vision projection files, not LLM models)
- Update configs/models.conf: remove Qwen3-Coder (user prefers Qwen3.5-35B-A3B),
  add Qwen3.5-27B-Q4_K_M and Q8_0 variant, reflect actual downloaded models

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 09:44:16 +01:00
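Both parts of the discovery fix can be demonstrated with throwaway paths (the /tmp layout is fabricated): `-L` makes find descend through the symlinked model directory, and the extra `! -name` predicate drops mmproj-* vision projectors.

```shell
mkdir -p /tmp/real-models
touch /tmp/real-models/qwen.gguf /tmp/real-models/mmproj-qwen.gguf
ln -sfn /tmp/real-models /tmp/models-link          # symlink, like /data/models/llms/
find -L /tmp/models-link -name '*.gguf' ! -name 'mmproj-*.gguf'
```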
Felipe Cardoso
58124cd657 feat: add Qwen3.5 model catalog and agentic evaluation framework
Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
  Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
  in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
  (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
  (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
  recommendations for agentic use

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 00:20:23 +01:00
Felipe Cardoso
e9cb5c491f fix+test: improve test suite, fix 2 bugs found by tests
Bugs fixed in production code:
- compare.sh: Python truthiness on 0.0 — `if b_val` was False for 0.0 t/s,
  displaying it as a dash instead of "0.0". Fixed with `is not None` checks.
- compare.sh: ZeroDivisionError when computing delta % with 0.0 baseline.

Test improvements (review findings):
- detect.bats: kernel param tests now use real detect_kernel_param logic
  pattern (not a separate reimplementation). Added non-GiB-aligned RAM test,
  device ID without 0x prefix, empty firmware version, llama-bench detection,
  detect_total_physical_ram_kb tests.
- benchmark_compare.bats: assert delta percentages (+20.0%, -25.0%, 0.0%),
  test 0.0 t/s edge case, test per-directory error messages, test config
  change detection with specific field assertions.
- log_metrics.bats: add assert_success, --help test, timestamp format
  validation. Remove unused mock sysfs setup.
- common.bats: fix data_dir test, remove redundant assertion, add cleanup.
- test_helper.sh: remove unused FIXTURES_DIR.
- Remove empty tests/fixtures/ directory.

94 tests, all passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 22:22:41 +01:00
Felipe Cardoso
af0515d05d fix: address code review findings (HIGH + MEDIUM)
- Replace GNU \b with portable word-boundary sed patterns in kernel-params
- Warn on unknown CLI arguments instead of silently swallowing
- Add floor check on recommended_gttsize_mib to prevent negative values
- Fix Python operator precedence in benchmark log parser
- Add root checks to tuned-profile.sh and rollback.sh
- Remove redundant sudo calls (scripts already require root at entry)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:19:44 +01:00
Felipe Cardoso
c596e38e9e Initial commit 2026-03-25 20:13:15 +01:00