
Optimization Log

Living document tracking which optimizations were applied, how they were tested, and the actual results. Each entry records the change, benchmark evidence, and verdict.

Verdicts: KEEP (applied permanently), REVERTED (tested, didn't help), PENDING (not yet tested), BLOCKED (can't test yet).


Phase 1: Core System

1.1 Tuned Profile: accelerator-performance

  • Date: 2026-03-26
  • Change: sudo tuned-adm profile accelerator-performance
  • Benchmark: data/benchmarks/after-tuned-*
  • Result: +5-8% pp (prompt processing) improvement, +2-3% tg (token generation) improvement
  • Verdict: KEEP
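The change can be verified with tuned-adm's own subcommands (a sketch; requires root):

```shell
# Apply the profile and confirm it is active
sudo tuned-adm profile accelerator-performance
tuned-adm active    # prints the currently active profile
tuned-adm verify    # re-checks that the profile's settings are in effect
```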

1.2 Kernel Boot Parameters

  • Date: 2026-03-26
  • Change: iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496
  • Benchmark: data/benchmarks/full-opt-all-models-*
  • Result: Combined with BIOS VRAM change. Large models now fit in GTT. Peak usage 38.8/59 GiB.
  • Verdict: KEEP
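One way to persist these parameters — assuming a GRUB-based Fedora-style system with grubby available — is:

```shell
# Append the parameters to every installed kernel's cmdline
sudo grubby --update-kernel=ALL \
    --args="iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496"
# After a reboot, confirm they took effect
cat /proc/cmdline
```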

1.3 BIOS VRAM Reduction (512 MB)

  • Date: 2026-03-26
  • Change: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
  • Benchmark: data/benchmarks/full-opt-all-models-*
  • Result: 31.5 GB freed for OS/GTT. Small models ~3-8% slower (GTT indirection vs dedicated VRAM), but system gained ability to run 37 GB+ models at 32K+ context. Net positive.
  • Trade-off: Small model regression is acceptable given the massive capability gain.
  • Verdict: KEEP

Phase 2: System Tuning

2.1 RyzenAdj 85W PPT

  • Date: PENDING
  • Change: sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000
  • Expected: +12-19% CPU/GPU throughput (community data from Strix Halo Wiki)
  • Benchmark: Not yet run
  • Notes: HP ZBook ships at 60W. 85W is the community-recommended sweet spot.
  • Verdict: PENDING
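When this gets tested, the limits can be applied and read back in one go (sketch; requires root and the ryzenadj binary):

```shell
# Raise STAPM/fast/slow limits to 85 W, then dump the current values
sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000
sudo ryzenadj --info
```

Note that ryzenadj settings reset on reboot and often on suspend/resume, so keeping them would need a systemd unit or timer re-applying the command.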

2.2 VM Sysctl Tuning

  • Date: PENDING
  • Change: vm.swappiness=1, vm.dirty_ratio=40, vm.max_map_count=500000
  • Expected: Prevent model weight eviction, reduce I/O disruption
  • Benchmark: Not yet run
  • Verdict: PENDING
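A sketch of how the values could be persisted via a sysctl drop-in (the filename is arbitrary):

```shell
# Write the drop-in, then reload all sysctl configuration files
sudo tee /etc/sysctl.d/99-llm-tuning.conf <<'EOF' >/dev/null
vm.swappiness = 1
vm.dirty_ratio = 40
vm.max_map_count = 500000
EOF
sudo sysctl --system
```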

2.3 Transparent Huge Pages

  • Date: PENDING
  • Change: transparent_hugepage=always
  • Expected: Faster model load time, possible 1-5% tg improvement from reduced TLB misses
  • Benchmark: Not yet run
  • Verdict: PENDING
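THP can also be toggled at runtime before committing to the boot parameter (sketch; requires root):

```shell
# Current mode is shown in brackets, e.g. "always [madvise] never"
cat /sys/kernel/mm/transparent_hugepage/enabled
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```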

2.4 RADV_PERFTEST=nogttspill

  • Date: PENDING
  • Change: export RADV_PERFTEST=nogttspill
  • Expected: Fix pp degradation on Vulkan RADV (community-reported fix for Strix Halo)
  • Benchmark: Not yet run — needs Vulkan-specific benchmark comparison
  • Verdict: PENDING
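The variable only affects the Vulkan (RADV) backend, so the comparison would be the same benchmark with and without it (model path is a placeholder):

```shell
# Baseline, then with the GTT-spill workaround
llama-bench -m model.gguf -p 4096 -n 256
RADV_PERFTEST=nogttspill llama-bench -m model.gguf -p 4096 -n 256
```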

2.5 amdgpu.noretry=0

  • Date: PENDING
  • Change: Kernel cmdline amdgpu.noretry=0
  • Expected: Improved stability under memory pressure
  • Notes: Only apply if experiencing GPU page faults or crashes during large model loading
  • Verdict: PENDING
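Before applying, the kernel log can be checked for the faults this is meant to address (sketch):

```shell
# Look for GPU page-fault messages from amdgpu
sudo dmesg | grep -iE 'amdgpu.*page fault'
```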

Phase 3: Runtime Flags

3.1 KV Cache Quantization

  • Date: PENDING (sweep running)
  • Change: -ctk q8_0 -ctv q8_0 / -ctk q4_0 -ctv q4_0
  • Benchmark: data/benchmarks/kv-sweep-128k-* (in progress)
  • Expected: Q8_0: ~50% less KV memory, negligible quality loss. Q4_0: ~75% less, noticeable quality impact.
  • Verdict: PENDING
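A single sweep point might look like this (model path is a placeholder; quantizing the V cache requires flash attention, hence -fa 1):

```shell
llama-bench -m model.gguf -fa 1 -ctk q8_0 -ctv q8_0 -p 8192 -n 1024
```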

3.2 MoE Batch Size -b 256

  • Date: PENDING
  • Change: Add -b 256 to MoE benchmark runs
  • Expected: Up to +70% pp improvement for MoE models (community benchmarks)
  • Benchmark: Not yet run
  • Verdict: PENDING
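llama-bench accepts comma-separated values per flag, so the reduced and default batch sizes can be compared in one run (model path is a placeholder):

```shell
# Produces one result row per batch size
llama-bench -m moe-model.gguf -p 4096 -n 256 -b 256,2048
```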

Phase 4: Build Optimizations

4.1 rocWMMA Flash Attention

  • Date: PENDING
  • Change: Rebuild ROCm toolbox with -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON
  • Expected: +96% long-context performance (65K+)
  • Notes: Need to check if Donato's toolboxes already include this
  • Verdict: PENDING
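A build sketch, assuming a standard llama.cpp HIP build targeting gfx1151 (only the two -DGGML_HIP_* flags come from this entry; the rest is the usual CMake invocation):

```shell
cmake -S . -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DGGML_HIP_UMA=ON
cmake --build build --config Release -j
```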

4.2 rocWMMA Tuned Patch (PR #16827)

  • Date: PENDING
  • Notes: Fixes long-context regression. Check Donato's latest toolbox builds.
  • Verdict: PENDING

Phase 5: Future / Blocked

5.1 Speculative Decoding

  • Status: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE fix)
  • Draft model: Downloaded Qwen3.5-0.8B-Q8_0.gguf (812 MB) on 2026-03-27
  • Last checked: 2026-03-27 — PR open since 2026-03-03, has ROCm buffer issues
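Once the PR lands, the downloaded draft model would be wired up roughly like this (flag values are illustrative, not tuned):

```shell
llama-server -m main-model.gguf \
    -md Qwen3.5-0.8B-Q8_0.gguf \
    --draft-max 16 --draft-min 1
```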

5.2 Native MTP (Multi-Token Prediction)

  • Status: BLOCKED — llama.cpp PR #20700
  • Last checked: 2026-03-27 — WIP, not expected to merge soon

5.3 GPU Clock Fix

  • Status: BLOCKED — ROCm issue #5750
  • Notes: GPU may be stuck at 885 MHz instead of 2900 MHz on gfx1151
  • Last checked: 2026-03-27

Context Window Benchmarks

64K Context (pp4096/tg1024, MoE models)

  • Date: 2026-03-26
  • Benchmark: data/benchmarks/ctx64k-*
  • Results: (check logs)

128K Context (pp8192/tg1024, MoE models)

  • Date: 2026-03-26
  • Benchmark: data/benchmarks/ctx128k-realistic-*
  • Results: (check logs)

256K Context (pp16384/tg1024, MoE models)

  • Date: 2026-03-27
  • Benchmark: data/benchmarks/ctx256k-*
  • Results: (check logs)

How to Add Entries

When testing a new optimization:

  1. Record the date and exact change
  2. Run a benchmark: make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."
  3. Compare: make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new
  4. Update this log with results and verdict
  5. If KEEP: document in optimization.md with the measured numbers
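A template entry matching the format above:

X.Y Short Description

  • Date: YYYY-MM-DD
  • Change: exact command or setting
  • Benchmark: data/benchmarks/TAG-*
  • Result: measured pp/tg deltas vs. baseline
  • Verdict: KEEP / REVERTED / PENDING / BLOCKED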