- Fix missing BATCH_ARGS in long-context commands (both benchmark scripts)
- Fix CLAUDE.md stale venv path (data/venv → .venv) and add serve/power docs
- Add -b/--batch to bin/benchmark help text
- Add --no-think flag to serve script (--reasoning-budget 0)
- Sanitize model names in eval run directories
- Simplify agentic setup to use requirements.txt
- Add serve --help test, batch flag assertions to existing tests
- Add requirements.txt for reproducible venv setup (Python 3.13)
Add batch-size override to benchmark scripts. Testing -b 256 vs the
default 2048 on Vulkan RADV shows no meaningful difference for MoE pp2048
(826 vs 843 t/s, within noise). The community-reported +70% improvement
does not reproduce on this backend.
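A minimal sketch of how the override might be threaded through to llama-bench; the -b/--batch flag name and the BATCH_ARGS array come from the messages above, while the parsing helper is illustrative:

```shell
# Hypothetical sketch: collect a -b/--batch override into BATCH_ARGS
# and forward it verbatim to llama-bench.
BATCH_ARGS=()

parse_batch_flag() {
  case "$1" in
    -b|--batch) BATCH_ARGS=(-b "$2") ;;  # pass through unchanged
  esac
}

parse_batch_flag --batch 256
echo "llama-bench -m model.gguf ${BATCH_ARGS[*]}"
```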
Replace estimated values with clpeak measurements: DRAM 216-233 GB/s,
GPU clocks confirmed at 2900 MHz under load (ROCm #5750 is a sysfs
reporting issue only). Correct the backend recommendation to Vulkan RADV
(2.7x faster tg than ROCm at 131K context). Update the KV cache
recommendation to q4_0. Add Nemotron-Cascade-2 to the coder shootout
results. Remove Nemotron-3-Nano from the catalog (replaced by Cascade-2).
Update the Q4_K_L entry to Q4_K_XL.
Add `make optimize-power` (ryzenadj 85W, sysctl, THP, RADV nogttspill)
with systemd services for boot/resume persistence. Integrate into
`make optimize --all` as Phase 2. Update optimization log with RyzenAdj
results (+46% tg at 70W sustained), KV sweep data, and quant shootout.
Add Qwen3-Coder-30B and Nemotron-Cascade-2 to model catalog.
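The power-tuning phase can be sketched roughly as follows; the exact limits, sysctl keys, and systemd units live in the `make optimize-power` target, so treat these commands as illustrative rather than the actual recipe:

```shell
# Illustrative sketch only -- the real steps are in `make optimize-power`.
# Cap APU package power with ryzenadj (limits are in milliwatts):
sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000
# Enable transparent hugepages for large model mappings:
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# Ask RADV not to spill VRAM allocations to GTT:
export RADV_DEBUG=nogttspill
```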
KV cache quantization adds type_k/type_v columns to llama-bench output,
shifting the test and t/s columns to different indices. Parse fields from
the end of the row instead of from hardcoded positions. Also fix the KV
suffix separator (underscore to dash) to avoid regex ambiguity with type
names like q8_0.
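A sketch of the positional fix, assuming llama-bench's markdown table keeps test and t/s as the last two columns no matter how many type_k/type_v columns precede them (the row contents here are made up):

```shell
# Pull the test name and t/s from the END of a llama-bench markdown row,
# so extra columns (type_k/type_v) cannot shift the indices we read.
parse_bench_row() {
  awk -F'|' '{
    ts = $(NF-1); test = $(NF-2)          # $NF is empty (trailing pipe)
    gsub(/^ +| +$/, "", ts); gsub(/^ +| +$/, "", test)
    print test, ts
  }' <<<"$1"
}

row='| qwen3 30B | 18 GiB | 30.5 B | Vulkan | 99 | q8_0 | q8_0 | pp2048 | 826.41 |'
parse_bench_row "$row"   # → pp2048 826.41
```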
Add 5-phase optimization guide, optimization log for tracking results,
and research docs on llama.cpp and inference landscape optimizations.
Standard benchmarks use pp512/tg128, which underestimates real-world
agentic coding, where responses run 500-2000 tokens. Both values are now
configurable:
--pp N Prompt processing tokens (default: 512)
--tg N Token generation count (default: 128)
Examples:
benchmark run --tag realistic --tg 1024 --pp 2048 --category moe
benchmark run --tag full-response --tg 2048 --category moe --reps 3
Log filenames include pp/tg when non-default (e.g., model__backend__fa1__pp2048_tg1024.log)
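The filename convention above could be implemented along these lines; the helper name is illustrative, and the 512/128 defaults match those stated above:

```shell
# Append a pp/tg suffix to the log name only when either value is non-default.
log_suffix() {
  local pp=$1 tg=$2
  if [ "$pp" -ne 512 ] || [ "$tg" -ne 128 ]; then
    printf '__pp%s_tg%s' "$pp" "$tg"
  fi
}

echo "model__backend__fa1$(log_suffix 2048 1024).log"   # → model__backend__fa1__pp2048_tg1024.log
echo "model__backend__fa1$(log_suffix 512 128).log"     # → model__backend__fa1.log
```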
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both run-baseline.sh and run-suite.sh now accept --context N to set
the long-context depth (default: 32768). Prompt tokens auto-scale to
~1/16 of context depth for larger windows.
Examples:
benchmark run --tag ctx64k --context 65536 --category moe
benchmark run --tag ctx128k --context 131072 --category moe --reps 3
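The ~1/16 auto-scaling is plain integer arithmetic; the variable names here are illustrative:

```shell
# Prompt tokens scale to ~1/16 of the requested context depth.
context=131072
pp=$(( context / 16 ))
echo "$pp"   # → 8192 prompt tokens at 128K context
```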
The metric logger is killed via SIGTERM on benchmark completion, producing
exit code 143 (128+15), which propagated through set -e. Added an explicit
return 0 / trailing true to the cleanup traps.
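A minimal sketch of the trap fix; the logger process and function names are stand-ins, not the scripts' actual identifiers:

```shell
#!/usr/bin/env bash
set -euo pipefail

cleanup() {
  # SIGTERM gives the logger exit status 143 (128+15); without the
  # `|| true` and `return 0`, set -e would turn that into a failed run.
  kill "$LOGGER_PID" 2>/dev/null || true
  wait "$LOGGER_PID" 2>/dev/null || true
  return 0
}

sleep 60 &            # stand-in for the background metric logger
LOGGER_PID=$!
trap cleanup EXIT

echo "benchmark complete"
```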
A ROCm 7.2 container is now created alongside vulkan-radv during setup,
enabling Vulkan vs ROCm comparisons in baseline and post-optimization
benchmarks.
Smoke test: ROCm 7.2 on Qwen3.5-0.8B → 8090 t/s pp512, 161 t/s tg128
(vs Vulkan: 8900 t/s pp512, 177 t/s tg128)
Toolbox containers mount host / at /run/host/ but only /home is directly
accessible. Models on /data/models/ need the /run/host/ prefix when passed
to llama-bench inside the container.
Both run-baseline.sh and run-suite.sh now resolve model paths with realpath
and prepend /run/host/ for non-home paths. Paths under /home/ are passed
as-is (already mounted directly).
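The mapping can be sketched like this, assuming the mount layout described above (the function name is illustrative):

```shell
# Resolve a model path for use inside the toolbox container: /home/* is
# mounted directly; everything else is visible under /run/host/.
resolve_container_path() {
  local p
  p=$(realpath -m "$1")   # -m: canonicalize without requiring existence
  case "$p" in
    /home/*) printf '%s\n' "$p" ;;
    *)       printf '/run/host%s\n' "$p" ;;
  esac
}

resolve_container_path /data/models/qwen.gguf   # → /run/host/data/models/qwen.gguf
resolve_container_path /home/user/m.gguf        # → /home/user/m.gguf
```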
Verified with smoke test: Qwen3.5-0.8B-Q8_0 → 8900 t/s pp512, 177 t/s tg128.
- Catalog * indicator now searches recursively (finds models in subdirectories)
- GPU verification suppresses toolbox crun stderr ("directory not found" noise)
- GPU verification now matches on "radeon" and "available devices" in Vulkan output
Both run-baseline.sh and run-suite.sh now support:
- --max-size GB: skip models larger than N GB (prevents OOM)
- --category LIST: filter by catalog category (smoke,dense,moe)
- --skip-longctx: skip 32K context tests (saves time + memory)
- --reps N: configure repetition count
- --help: shows usage with examples
Safe pre-optimization run: benchmark baseline --max-size 20 --skip-longctx
Full post-optimization: benchmark baseline (no filters, all models + longctx)
Also: 4 new BATS tests for flag parsing (98 total, all passing)
Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide
Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
(ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
(EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
recommendations for agentic use
Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md
Bugs fixed in production code:
- compare.sh: Python truthiness on 0.0 — `if b_val` was False for 0.0 t/s,
displaying it as a dash instead of "0.0". Fixed with `is not None` checks.
- compare.sh: ZeroDivisionError when computing delta % with 0.0 baseline.
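Both fixes boil down to a few lines of the embedded Python; this standalone rendition is illustrative, not compare.sh's actual code:

```shell
python3 - <<'PY'
# 0.0 t/s is falsy, so `if b_val` rendered a real zero as a dash;
# `is not None` distinguishes "missing" from "zero".
def fmt(val):
    return f"{val:.1f}" if val is not None else "-"

# Guard the delta computation against a 0.0 baseline (ZeroDivisionError).
def delta_pct(base, new):
    if base is None or base == 0.0:
        return None
    return (new - base) / base * 100.0

print(fmt(0.0))              # 0.0, not "-"
print(fmt(None))             # -
print(delta_pct(0.0, 5.0))   # None, not a crash
PY
```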
Test improvements (review findings):
- detect.bats: kernel param tests now use the real detect_kernel_param
  logic (not a separate reimplementation). Added a non-GiB-aligned RAM test,
device ID without 0x prefix, empty firmware version, llama-bench detection,
detect_total_physical_ram_kb tests.
- benchmark_compare.bats: assert delta percentages (+20.0%, -25.0%, 0.0%),
test 0.0 t/s edge case, test per-directory error messages, test config
change detection with specific field assertions.
- log_metrics.bats: add assert_success, --help test, timestamp format
validation. Remove unused mock sysfs setup.
- common.bats: fix data_dir test, remove redundant assertion, add cleanup.
- test_helper.sh: remove unused FIXTURES_DIR.
- Remove empty tests/fixtures/ directory.
94 tests, all passing.
- architecture.md: fix kernel param math to match actual computed values,
use cardN placeholder in sysfs paths, clarify system_ram_kb is OS-visible
- benchmarking.md: normalize flags to -ngl 99 / -mmp 0 (matching code),
add llama-rocm7-nightlies backend
- CLAUDE.md: clarify HSA_OVERRIDE_GFX_VERSION is set in containers not
scripts, fix lib sourcing description, specify which scripts need root
- detect.sh: document detect_cpu_cores returns threads not cores
- troubleshooting.md: add link to references.md
- README.md: remove unsupported Fedora 42 claim, describe configs/ content
- Replace GNU \b with portable word-boundary sed patterns in kernel-params
- Warn on unknown CLI arguments instead of silently swallowing them
- Add floor check on recommended_gttsize_mib to prevent negative values
- Fix Python operator precedence in benchmark log parser
- Add root checks to tuned-profile.sh and rollback.sh
- Remove redundant sudo calls (scripts already require root at entry)
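For the first item, a portable word boundary can be built from explicit non-word character classes instead of GNU sed's \b; this toy example is illustrative, not the kernel-params code itself:

```shell
# Replace "foo" only as a whole word; works with POSIX ERE, no GNU \b.
echo "foo foobar foo" \
  | sed -E 's/(^|[^[:alnum:]_])foo([^[:alnum:]_]|$)/\1bar\2/g'
# → bar foobar bar   ("foobar" is untouched)
```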