strix-halo-optimizations/docs/optimization-log.md
Felipe Cardoso dd403a907c feat(serve): add optimized llama-server launcher with n-gram speculation
Add `make serve` and `make serve-ngram` for launching llama-server with
baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention,
no-mmap, full GPU offload). N-gram speculative decoding gives 1.1-1.4x
tg speedup on repetitive content without upstream PR dependencies.
Update Phase 5 status: MTP is months away (4 unmerged PRs, no MoE
support), draft-model speculation stalled on ROCm buffer crashes.
2026-03-30 21:12:30 +02:00

Optimization Log

A living document that tracks which optimizations were applied and tested, and what they actually achieved. Each entry records the change, benchmark evidence, and verdict.

Verdicts: KEEP (applied permanently), REVERTED (tested, didn't help), PENDING (not yet tested), BLOCKED (can't test yet).


Phase 1: Core System

1.1 Tuned Profile: accelerator-performance

  • Date: 2026-03-26
  • Change: sudo tuned-adm profile accelerator-performance
  • Benchmark: data/benchmarks/after-tuned-*
  • Result: +5-8% prompt processing (pp), +2-3% text generation (tg)
  • Verdict: KEEP

1.2 Kernel Boot Parameters

  • Date: 2026-03-26
  • Change: iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496
  • Benchmark: data/benchmarks/full-opt-all-models-*
  • Result: Benchmarked together with the BIOS VRAM change (1.3). Large models now fit in GTT. Peak usage 38.8/59 GiB.
  • Verdict: KEEP
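
As a sanity check on the two sizing parameters above, the following sketch converts both to GiB. It assumes `amdgpu.gttsize` is expressed in MiB and `ttm.pages_limit` in 4 KiB pages (the usual x86-64 base page size); the variable names are illustrative only.

```python
# Sanity check: both kernel parameters should describe the same GTT budget.
# Assumptions: amdgpu.gttsize is in MiB, ttm.pages_limit counts 4 KiB pages.
GTTSIZE_MIB = 60416
PAGES_LIMIT = 15466496
PAGE_SIZE_BYTES = 4096  # assumed x86-64 base page size

gtt_gib = GTTSIZE_MIB / 1024
ttm_gib = PAGES_LIMIT * PAGE_SIZE_BYTES / 2**30

print(f"amdgpu.gttsize  -> {gtt_gib:.1f} GiB")
print(f"ttm.pages_limit -> {ttm_gib:.1f} GiB")
```

Under those assumptions both values come out at 59 GiB, which matches the 59 GiB ceiling quoted in the result line.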

1.3 BIOS VRAM Reduction (512 MB)

  • Date: 2026-03-26
  • Change: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
  • Benchmark: data/benchmarks/full-opt-all-models-*
  • Result: 31.5 GB freed for OS/GTT. Small models ~3-8% slower (GTT indirection vs dedicated VRAM), but system gained ability to run 37 GB+ models at 32K+ context. Net positive.
  • Trade-off: Small model regression is acceptable given the massive capability gain.
  • Verdict: KEEP

Phase 2: System Tuning

2.1 RyzenAdj PPT Increase

  • Date: 2026-03-30
  • Change: sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000 --apu-slow-limit=85000
  • Result: Requested 85 W on all four limits. STAPM rose from 59 W to 81 W and PPT Fast to 81 W, but PPT Slow and APU Slow stayed at 70 W because the HP ZBook BIOS/EC overrides them. Effective sustained power: ~70 W (was ~59 W).
  • Benchmark: data/benchmarks/qwen35-shootout-v2-* (Vulkan, q4_0 KV, pp2048/tg1024)
    • UD-Q4_K_L: 57.0 t/s tg (was ~39 t/s before RyzenAdj, +46%)
    • UD-Q4_K_XL: 56.4 t/s tg
    • Q8_0: 51.4 t/s tg (was ~39-41 t/s before, +25% to +32%)
  • Thermals: 70-73C under load, 30C headroom. Cooling handles it easily.
  • Notes: Settings are volatile (reset on reboot/sleep). Use sudo make optimize-power or install a systemd service for persistence. HP firmware hard-caps slow PPT at 70 W regardless.
  • Verdict: KEEP — significant real-world improvement despite HP firmware limit
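
The percentage gains in this entry can be reproduced with a few lines of arithmetic. This is a minimal sketch; `pct_gain` is a hypothetical helper, and the baselines are the approximate pre-RyzenAdj figures quoted above rather than separately logged measurements.

```python
def pct_gain(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after / before - 1.0) * 100.0

# tg figures (t/s) from this entry; baselines are the approximate
# pre-RyzenAdj values quoted in the benchmark bullets.
print(f"UD-Q4_K_L vs 39 t/s: {pct_gain(39.0, 57.0):+.0f}%")
print(f"Q8_0 vs 41 t/s:      {pct_gain(41.0, 51.4):+.0f}%")
print(f"Q8_0 vs 39 t/s:      {pct_gain(39.0, 51.4):+.0f}%")
print(f"Sustained power 59 W -> 70 W: {pct_gain(59.0, 70.0):+.0f}%")
```

The Q8_0 gain depends on which end of the 39-41 t/s baseline is used, so it is better read as a range than as a single number.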

2.2 VM Sysctl Tuning

  • Date: 2026-03-30
  • Change: vm.swappiness=1, vm.dirty_ratio=40, vm.dirty_background_ratio=10, vm.max_map_count=500000, vm.zone_reclaim_mode=0
  • Applied via: sudo make optimize-power (persists to /etc/sysctl.d/99-llm-inference.conf)
  • Notes: Hard to isolate impact because it was applied together with other Phase 2 changes. Intended to prevent model weight eviction and I/O disruption.
  • Verdict: KEEP — low risk, persists across reboots
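
One way to confirm that these values are actually live after a reboot is to read them back from `/proc/sys`. The sketch below is only an illustration: `check_sysctls` and its `proc_root` parameter are hypothetical names, and the expected values are simply the ones listed in this entry.

```python
from pathlib import Path

# Values recorded in this entry.
EXPECTED = {
    "vm.swappiness": "1",
    "vm.dirty_ratio": "40",
    "vm.dirty_background_ratio": "10",
    "vm.max_map_count": "500000",
    "vm.zone_reclaim_mode": "0",
}

def check_sysctls(expected, proc_root="/proc/sys"):
    """Return {key: (expected, actual)} for every setting that differs.

    A key such as vm.swappiness maps to <proc_root>/vm/swappiness.
    Missing or unreadable files are reported with actual=None.
    """
    mismatches = {}
    for key, want in expected.items():
        path = Path(proc_root, *key.split("."))
        try:
            got = path.read_text().strip()
        except OSError:
            got = None
        if got != want:
            mismatches[key] = (want, got)
    return mismatches

if __name__ == "__main__":
    for key, (want, got) in check_sysctls(EXPECTED).items():
        print(f"{key}: expected {want}, found {got}")
```

An empty result means every setting matches; anything printed points at a value that did not persist or was overridden elsewhere.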

2.3 Transparent Huge Pages

  • Date: 2026-03-30
  • Change: echo always > /sys/kernel/mm/transparent_hugepage/enabled
  • Applied via: sudo make optimize-power (volatile — add transparent_hugepage=always to kernel cmdline for persistence)
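  • Notes: Intended to reduce TLB misses for large model buffers. THP mainly covers anonymous memory, so the benefit is most likely with --no-mmap loading rather than with mmap'd model files. Hard to isolate impact.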
  • Verdict: KEEP — low risk
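
Because this setting is volatile, it is worth checking which mode is active after a reboot. The sysfs file lists all modes and marks the active one in square brackets (for example `always [madvise] never`); the sketch below parses that format. `active_thp_mode` is a hypothetical helper name.

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")

def active_thp_mode(text):
    """Extract the bracketed mode from e.g. 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None

if __name__ == "__main__":
    if THP_ENABLED.exists():
        print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
    else:
        print("THP sysfs entry not found")
```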

2.4 RADV_PERFTEST=nogttspill

  • Date: 2026-03-30
  • Change: RADV_PERFTEST=nogttspill persisted to /etc/environment.d/radv-llm.conf
  • Applied via: sudo make optimize-power
  • Notes: Prevents GTT spill management overhead on unified memory Vulkan. Takes effect on next login. For current session: export RADV_PERFTEST=nogttspill
  • Verdict: KEEP — persists across reboots

2.5 amdgpu.noretry=0

  • Date: PENDING
  • Change: Kernel cmdline amdgpu.noretry=0
  • Expected: Improved stability under memory pressure
  • Notes: Only apply if experiencing GPU page faults or crashes during large model loading
  • Verdict: PENDING

Phase 3: Runtime Flags

3.1 KV Cache Quantization

  • Date: 2026-03-27
  • Change: --kv-types f16,q8_0,q4_0 sweep
  • Benchmark: data/benchmarks/kv-sweep-256k-*
  • Result (Vulkan RADV, Qwen3.5-35B-A3B Q8, pp2048/tg1024):
    • f16: 456 pp, 39.8 tg
    • q8_0: 418 pp, 38.5 tg (slight Vulkan regression — unexpected)
    • q4_0: 460 pp, 41.1 tg (fastest overall, +3% tg over f16)
  • Result (ROCm, same model):
    • f16: 445 pp, 21.5 tg
    • q8_0: 495 pp, 21.7 tg (+11% pp, same tg)
    • q4_0: 494 pp, 21.8 tg (+11% pp, same tg)
  • Conclusion: q4_0 is the sweet spot on Vulkan (fastest tg, roughly 72% less KV memory than f16). On ROCm, KV quantization helps pp but not tg.
  • Verdict: KEEP — use q4_0 KV as default for serving
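
The memory side of this trade-off can be estimated from the per-element storage cost of each KV type. The sketch assumes the cache uses the standard ggml block formats, in which q8_0 and q4_0 pack 32 elements per block together with one 2-byte f16 scale; `saving_vs_f16` is a hypothetical helper.

```python
# Storage cost per cached element for the three KV types in the sweep.
# Assumed ggml block layouts: q8_0 and q4_0 hold 32 elements per block
# plus one 2-byte f16 scale.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": (2 + 32) / 32,  # scale + 32 int8 values
    "q4_0": (2 + 16) / 32,  # scale + 32 packed 4-bit values
}

def saving_vs_f16(kv_type):
    """Fraction of KV cache memory saved relative to f16."""
    return 1.0 - BYTES_PER_ELEMENT[kv_type] / BYTES_PER_ELEMENT["f16"]

for kv_type in ("q8_0", "q4_0"):
    print(f"{kv_type}: {saving_vs_f16(kv_type):.1%} smaller than f16")
```

Because each block carries its own scale, the q4_0 saving lands slightly below the nominal 75% that a pure 4-bit versus 16-bit comparison would suggest.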

3.2 MoE Batch Size -b 256

  • Date: 2026-03-30
  • Change: -b 256 vs default (2048)
  • Benchmark: data/benchmarks/batch-default-* vs data/benchmarks/batch-256-*
  • Result (Vulkan RADV, Qwen3.5-35B-A3B UD-Q4_K_XL, q4_0 KV):
    • Default: 826 pp, 55.9 tg
    • b=256: 843 pp, 55.5 tg (within noise)
  • Notes: Community-reported +70% improvement does not reproduce on Vulkan RADV. May only apply to ROCm or CPU backends, or to longer prompts (pp8192+).
  • Verdict: REVERTED. No measurable impact on Vulkan; not recommended.
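
The "within noise" call can be made explicit with a simple threshold test. This is a sketch only: the 3% threshold is an assumption chosen for illustration, not a figure measured for this benchmark setup, and `within_noise` is a hypothetical helper.

```python
def rel_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before

def within_noise(before, after, threshold=0.03):
    """True if the relative change is smaller than the assumed noise threshold."""
    return abs(rel_change(before, after)) < threshold

# Default batch vs -b 256, using the numbers from the result above.
for metric, before, after in (("pp", 826, 843), ("tg", 55.9, 55.5)):
    label = "within noise" if within_noise(before, after) else "significant"
    print(f"{metric}: {rel_change(before, after):+.1%} ({label})")
```

With only two repetitions per run, a tighter threshold would need more samples to be meaningful.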

Phase 4: Build Optimizations

4.1 rocWMMA Flash Attention

  • Date: PENDING
  • Change: Rebuild ROCm toolbox with -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON
  • Expected: +96% long-context performance (65K+)
  • Notes: Need to check if Donato's toolboxes already include this
  • Verdict: PENDING

4.2 rocWMMA Tuned Patch (PR #16827)

  • Date: PENDING
  • Notes: Fixes long-context regression. Check Donato's latest toolbox builds.
  • Verdict: PENDING

Phase 5: Future / Blocked

5.1 Speculative Decoding (draft model)

  • Status: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE checkpoint/restore)
  • Draft model: Downloaded Qwen3.5-0.8B-Q8_0.gguf (812 MB) on 2026-03-27
  • Last checked: 2026-03-30 — PR stalled since Mar 5. ROCm buffer crashes in copy_cell(). Works on Metal/CUDA but not AMD. Months away from landing.

5.2 Native MTP (Multi-Token Prediction)

  • Status: BLOCKED — multiple dependencies unmerged
  • Last checked: 2026-03-30
  • Details: 4 separate PRs in flight, none merged:
    • PR #18886: MTP API framework (DRAFT since Feb 6) — foundation for all MTP work
    • PR #20700: MTP for Qwen3.5 dense only (WIP, author says "not expected to merge soon")
    • PR #15225: GLM-style MTP (open since Aug 2025, "slower than baseline")
    • PR #18039: EAGLE3 speculative (open since Dec 2025)
  • Key gap: No MTP implementation exists for MoE models. PR #20700 only covers dense Qwen3.5 (0.8B-27B), not the 35B-A3B MoE.
  • Timeline estimate: MTP API (#18886) must merge first, then model-specific implementations adapted. Months, not weeks.

5.2a N-gram Speculative Decoding (AVAILABLE NOW)

  • Status: WORKS TODAY — no upstream PRs needed
  • How: llama-server --spec-type ngram-simple --draft-max 64 --draft-min 4
  • Expected: 1.1-1.4x tg speedup on repetitive content (code, structured output)
  • Added to: make serve-ngram ARGS="-m MODEL.gguf" and bin/serve --ngram
  • Notes: Pattern-matches from token history, no draft model needed. Best for code generation where patterns repeat. No quality impact.
  • Verdict: AVAILABLE — use --ngram flag when serving
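
To illustrate the idea behind n-gram speculation, here is a minimal Python sketch of drafting from token history. It is not llama.cpp's `ngram-simple` implementation; `propose_draft`, `accept_prefix`, and their parameters are hypothetical names used only for this example. The second function shows why output quality is unaffected: only draft tokens that the main model itself would have produced are kept.

```python
def propose_draft(history, ngram_size=3, draft_max=8):
    """Propose draft tokens by matching the last n-gram against earlier history.

    Returns up to draft_max tokens that followed the most recent earlier
    occurrence of the final ngram_size tokens, or [] if there is none.
    """
    if len(history) <= ngram_size:
        return []
    key = history[-ngram_size:]
    # Walk backwards so the most recent earlier match wins.
    for start in range(len(history) - ngram_size - 1, -1, -1):
        if history[start:start + ngram_size] == key:
            follow = start + ngram_size
            return history[follow:follow + draft_max]
    return []


def accept_prefix(draft, model_tokens):
    """Keep only the leading draft tokens the main model agrees with."""
    accepted = []
    for drafted, actual in zip(draft, model_tokens):
        if drafted != actual:
            break
        accepted.append(drafted)
    return accepted


# Toy "token" stream: whitespace-separated words stand in for token IDs.
history = "for i in range ( n ) : total += i ; for i in".split()
draft = propose_draft(history, ngram_size=3, draft_max=4)
print(draft)
```

On repetitive text such as code, long drafts are often accepted in full, which is where the speedup would come from; on novel text the draft is usually rejected early and generation falls back to normal speed.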

5.3 GPU Clock Reporting

  • Status: NOT A REAL ISSUE — sysfs reporting is broken, actual clocks are fine
  • Measured: clpeak (2026-03-30) confirms GPU reaches 2900 MHz under compute load
  • Notes: ROCm issue #5750 is about sysfs pp_dpm_sclk reporting, not actual performance. No action needed.
  • Verdict: CLOSED — no performance impact

Context Window Benchmarks

64K Context (pp4096/tg1024, MoE models)

  • Date: 2026-03-26
  • Benchmark: data/benchmarks/ctx64k-*
  • Results: (check logs)

128K Context (pp8192/tg1024, MoE models)

  • Date: 2026-03-26
  • Benchmark: data/benchmarks/ctx128k-realistic-*
  • Results: (check logs)

256K Context (pp16384/tg1024, MoE models)

  • Date: 2026-03-27
  • Benchmark: data/benchmarks/ctx256k-*
  • Results: (check logs)

Model Quant Shootout

Qwen3.5-35B-A3B — Q4_K_L vs Q4_K_XL vs Q8 (2026-03-30)

  • Benchmark: data/benchmarks/qwen35-shootout-v2-*
  • Config: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
  • RyzenAdj: STAPM=81W (sustained ~70W due to HP firmware cap)
| Quant | File Size | pp2048 (t/s) | tg1024 (t/s) | Recommendation |
|---|---|---|---|---|
| UD-Q4_K_L | 18.8 GB | 825 | 57.0 | Fastest. Good quality. |
| UD-Q4_K_XL | 20.7 GB | 835 | 56.4 | Daily driver: best quality/speed. |
| Q8_0 | 34.4 GB | 850 | 51.4 | Best quality, 10% slower tg. |

Decision: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). Q4_K_L deleted: Q4_K_XL offers better quality at nearly the same speed for only about +2 GB.

Coder Model Shootout (2026-03-30)

  • Benchmark: data/benchmarks/coder-shootout-*
  • Config: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
  • RyzenAdj: STAPM=81W (sustained ~70W)
| Model | Architecture | File Size | pp2048 (t/s) | tg1024 (t/s) |
|---|---|---|---|---|
| Qwen3-Coder-30B UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | 61.0 |
| Qwen3.5-35B-A3B UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | 821 | 54.9 |
| Nemotron-Cascade-2 Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| Qwen3-Coder-Next UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |

Analysis:

  • tg speed generally falls as model size grows (decode is bandwidth-bound at ~215 GB/s), though not strictly: the 24.5 GB Qwen3-Coder-30B is faster than the 20.7 GB Qwen3.5
  • Pure Transformer (Qwen3-Coder-30B) has lowest overhead per token
  • DeltaNet hybrid (Qwen3.5) has best pp — DeltaNet layers are efficient for prefill
  • Qwen3-Coder-Next (80B at 3-bit) is about 23% slower in tg than the 30B (46.8 vs 61.0 t/s) but has >70% SWE-bench vs ~50% for the 30B
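
The bandwidth-bound reading can be checked with a back-of-the-envelope calculation: if decode were limited only by memory bandwidth, tokens per second would be roughly bandwidth divided by the data read per token. The sketch below inverts that to get the implied per-token traffic from the table's tg numbers. It assumes the ~215 GB/s figure is the achievable bandwidth, so the results are upper bounds; `implied_gb_per_token` is a hypothetical helper.

```python
BANDWIDTH_GB_S = 215.0  # approximate memory bandwidth quoted in the analysis

# (file size in GB, tg1024 in t/s) from the coder shootout table.
RESULTS = {
    "Qwen3-Coder-30B UD-Q6_K_XL": (24.5, 61.0),
    "Qwen3.5-35B-A3B UD-Q4_K_XL": (20.7, 54.9),
    "Nemotron-Cascade-2 Q8_0": (31.3, 52.8),
    "Qwen3-Coder-Next UD-Q3_K_XL": (33.8, 46.8),
}

def implied_gb_per_token(tg_tokens_per_s, bandwidth_gb_s=BANDWIDTH_GB_S):
    """Data read per token if decode were limited only by bandwidth."""
    return bandwidth_gb_s / tg_tokens_per_s

for name, (file_gb, tg) in RESULTS.items():
    per_token = implied_gb_per_token(tg)
    print(f"{name}: {per_token:.2f} GB/token "
          f"({per_token / file_gb:.0%} of the file)")
```

On these numbers the implied traffic is well under a fifth of each file, which would fit models that read only part of their weights per token. That interpretation is an inference from the arithmetic, not something the benchmarks measure directly.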

Recommended roles:

  • Qwen3-Coder-30B: Interactive tool-use / function-calling loops (fastest tg, purpose-built)
  • Qwen3.5-35B-A3B: General tasks, long prompt processing (best pp, best all-rounder)
  • Qwen3-Coder-Next: Complex multi-file coding tasks where quality > speed

How to Add Entries

When testing a new optimization:

  1. Record the date and exact change
  2. Run a benchmark: make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."
  3. Compare: make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new
  4. Update this log with results and verdict
  5. If KEEP: document in optimization.md with the measured numbers