Felipe Cardoso ea70687cd2 docs: update optimization guide with measured hardware data
Replace estimated values with clpeak measurements: DRAM 216-233 GB/s,
GPU clocks confirmed 2900 MHz under load (ROCm #5750 is sysfs reporting
only). Correct backend recommendation to Vulkan RADV (2.7x faster tg
than ROCm at 131K). Update KV cache recommendation to q4_0. Add
Nemotron-Cascade-2 to coder shootout results. Remove Nemotron-3-Nano
from catalog (replaced by Cascade-2). Update Q4_K_L to Q4_K_XL entry.
2026-03-30 19:56:18 +02:00


Optimization Guide

Complete walkthrough for optimizing AMD Strix Halo for LLM inference workloads. Organized in phases from essential to experimental. Each phase builds on the previous.

Prerequisites: Run make audit first to see your current state. Run make benchmark-baseline to capture pre-optimization performance numbers.

Track results in optimization-log.md as you apply each change.


Phase 1: Core System (automated scripts)

These are the foundational optimizations handled by this repo's scripts. Apply in order.

1.1 Tuned Profile (no reboot)

sudo make optimize-tuned

Switches to accelerator-performance: restricts deep CPU idle states (C-states) that add wakeup latency, and sets the CPU governor to performance.

Measured: +5-8% pp, +2-3% tg.

1.2 Kernel Boot Parameters (reboot required)

sudo make optimize-kernel

| Parameter | Value (64 GB) | Purpose |
|---|---|---|
| `iommu=pt` | -- | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | 60416 | Max GPU-addressable system RAM in MiB (~59 GiB) |
| `ttm.pages_limit` | 15466496 | Max pinnable 4K pages for GPU memory |

Values computed dynamically from total physical RAM. See architecture.md for the math.
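The derivation can be sketched in shell (a sketch only: `gtt_params` is a hypothetical helper, and the ~5 GiB host reservation is inferred from the 64 GB values above; architecture.md remains the authoritative source):

```shell
# Compute amdgpu.gttsize (MiB) and ttm.pages_limit from total RAM in MiB.
# Assumes a ~5 GiB host reservation, which reproduces the 64 GB row above.
gtt_params() {
    local total_mib=$1
    local gtt_mib=$(( total_mib - 5120 ))   # leave ~5 GiB for the host OS
    local pages=$(( gtt_mib * 256 ))        # 256 four-KiB pages per MiB
    echo "amdgpu.gttsize=${gtt_mib} ttm.pages_limit=${pages}"
}

gtt_params 65536   # 64 GiB system -> amdgpu.gttsize=60416 ttm.pages_limit=15466496
```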

1.3 BIOS VRAM Reduction (reboot + BIOS access)

make optimize-vram   # Prints guidance — cannot modify BIOS directly

Reduce UMA Frame Buffer Size from 32 GB to 512 MB. Frees 31.5 GB for GTT. See bios-vram-guide.md (HP ZBook: F10 at boot).

Combine 1.2 and 1.3 into a single reboot.

1.4 Verify

make verify          # 9-point checklist, target: 9/9
make audit           # Single-screen system status with scores

Phase 1 Measured Impact

| Optimization | pp | tg |
|---|---|---|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | Enables 37 GB+ models | +5-15% |
| Phase 1 combined | +15-25% | +8-18% |

Trade-off: Small models (<5 GB) are ~3-8% slower due to GTT indirection vs dedicated VRAM. Acceptable given the massive capability gain.


Phase 2: System Tuning

All Phase 2 optimizations are applied with a single command:

sudo make optimize-power

This applies RyzenAdj, sysctl, THP, and RADV nogttspill. Sysctl and nogttspill persist across reboots. RyzenAdj and THP are volatile.

For RyzenAdj persistence across reboots:

```shell
sudo cp configs/ryzenadj-llm.service configs/ryzenadj-resume.service /etc/systemd/system/
sudo systemctl enable --now ryzenadj-llm.service
sudo systemctl enable ryzenadj-resume.service
```

2.1 RyzenAdj PPT Increase

The HP ZBook Ultra G1a ships at 59W sustained. RyzenAdj raises this.

Measured on HP ZBook: STAPM raised to 81W, but HP firmware hard-caps PPT SLOW at 70W. Effective sustained power: ~70W (was ~59W). Cannot be overridden without modded BIOS.

Measured impact: Qwen3.5-35B-A3B tg1024 went from ~39 t/s to ~57 t/s (+46%). This was the single largest improvement in the entire optimization journey.

Thermals: 70-73°C under sustained load, leaving ~30°C of headroom. The cooling handles it easily.

2.2 VM / Sysctl Tuning

Persisted to /etc/sysctl.d/99-llm-inference.conf:

| Parameter | Default | Set To | Why |
|---|---|---|---|
| `vm.swappiness` | 60 | 1 | Prevent model weight eviction |
| `vm.dirty_ratio` | 20 | 40 | Reduce I/O flush storms during inference |
| `vm.max_map_count` | 65530 | 500000 | Large models need many memory mappings |
| `vm.zone_reclaim_mode` | 0 | 0 | Don't aggressively reclaim memory zones |
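The table corresponds to a drop-in sysctl file along these lines (a sketch of the expected contents; the repo's script writes the authoritative version):

```
# /etc/sysctl.d/99-llm-inference.conf
vm.swappiness = 1
vm.dirty_ratio = 40
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
```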

2.3 Transparent Huge Pages

Set to always. Reduces TLB misses for mmap'd model files. Volatile — add transparent_hugepage=always to kernel cmdline for persistence.

2.4 RADV_PERFTEST=nogttspill

Persisted to /etc/environment.d/radv-llm.conf. Prevents GTT spill management overhead on unified memory. Vulkan RADV only.
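The persisted file is a one-line environment.d fragment (a sketch, assuming the variable named in the heading above):

```
# /etc/environment.d/radv-llm.conf
RADV_PERFTEST=nogttspill
```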


Phase 3: Runtime Flags (per-invocation)

Use these flags whenever you run llama-bench, llama-server, or llama-cli. They are already set in this repo's benchmark scripts.

3.1 -mmp 0 (no mmap) — mandatory

On unified memory, mmap adds a double-copy penalty. Always disable.

3.2 KV Cache Quantization — use Q4_0

Measured (Vulkan RADV, Qwen3.5-35B-A3B):

| KV Type | pp2048 | tg1024 | Memory Savings |
|---|---|---|---|
| f16 | 456 | 39.8 | Baseline |
| q8_0 | 418 | 38.5 | ~50% (slightly slower on Vulkan) |
| q4_0 | 460 | 41.1 | 75% (fastest on Vulkan) |

Q4_0 is faster because less memory bandwidth is spent reading KV cache. Use as default for serving. Quality impact is noticeable only on reasoning-heavy tasks.

3.3 Flash Attention (-fa 1) — always enable

+24% pp on ROCm. Modest improvement on Vulkan (CoopMat1). Already in benchmark scripts.

3.4 ROCBLAS_USE_HIPBLASLT=1 (ROCm only) — mandatory

Without this, ROCm pp is 2-7x slower on gfx1151. Already in benchmark scripts.
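On the recommended Vulkan backend, the per-invocation flags from 3.1-3.3 combine into a serving command along these lines (the model path, context size, and `-ngl` value are placeholders, not taken from the repo's scripts):

```shell
# --no-mmap (3.1), q4_0 KV cache (3.2), flash attention (3.3)
llama-server -m /models/qwen3.5-35b-a3b-ud-q4_k_xl.gguf \
    -c 32768 -ngl 99 --no-mmap -fa 1 -ctk q4_0 -ctv q4_0
```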

3.5 Backend Selection

Measured (Qwen3.5-35B-A3B Q8, 128K context):

| Workload | Vulkan RADV | ROCm 7.2 | Winner |
|---|---|---|---|
| pp2048 | 456 | 445 | Vulkan (+2%) |
| tg1024 | 39.8 | 21.5 | Vulkan (1.9x) |
| pp8192 @ 131K | 130 | 84 | Vulkan (1.5x) |
| tg32 @ 131K | 22.0 | 8.1 | Vulkan (2.7x) |

Vulkan RADV dominates across all workloads on this hardware. ROCm is significantly slower for token generation, especially at long context. Never use AMDVLK.

3.6 MoE Batch Size -b 256 — NOT YET TESTED

Community reports up to +70% pp improvement for MoE models. Default batch (2048) may be too large. Needs benchmarking.
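One way to run that benchmark is a llama-bench batch-size sweep (comma-separated values are swept; the model path is a placeholder):

```shell
# Sweep batch sizes and compare pp2048 across -b values
llama-bench -m /models/qwen3.5-35b-a3b-ud-q4_k_xl.gguf \
    -b 256,512,1024,2048 -p 2048 -n 0 -fa 1 -mmp 0
```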


Phase 4: Build Optimizations (not yet tested)

These require rebuilding the llama.cpp toolbox containers. Given that Vulkan RADV already outperforms ROCm significantly, the ROI of these ROCm-specific optimizations is unclear.

4.1 ROCm Build Flags

```shell
cmake -B build \
    -DGGML_HIP=ON \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DGGML_HIP_UMA=ON \
    -DAMDGPU_TARGETS=gfx1151
```

GGML_HIP_ROCWMMA_FATTN enables GPU-accelerated flash attention via WMMA. Could close the gap between ROCm and Vulkan for long-context workloads. Check if Donato Capitella's ROCm toolboxes already include this.

4.2 Vulkan Cooperative Matrices

RADV supports VK_KHR_cooperative_matrix for RDNA 3+. Already used by llama.cpp for matrix operations. The current Vulkan toolbox shows matrix cores: KHR_coopmat — this is likely already active.


Phase 5: Future / Currently Blocked

5.1 Speculative Decoding

Expected 1.8-2.5x tg speedup. Draft model (Qwen3.5-0.8B-Q8_0.gguf) downloaded.

Blocked: llama.cpp PR #20075 — hybrid SSM/MoE speculative rollback.

5.2 Native Multi-Token Prediction

Qwen3.5 has built-in MTP heads. No draft model needed.

Blocked: llama.cpp PR #20700 — MTP for Qwen3.5.

5.3 GPU Clock Reporting Fix

ROCm issue #5750 reports GPU clocks appearing stuck at ~885 MHz in sysfs. However, clpeak confirms clocks reach 2900 MHz under compute load (measured 2026-03-30). The issue is likely broken sysfs reporting, not actual clock throttling. No performance impact.

Tracking: ROCm issue #5750 (sysfs reporting only, not a real blocker).

5.4 Other Future Items

  • SageAttention: 2-5x over FlashAttention. No AMD port yet.
  • XDNA NPU (50 TOPS): Linux support coming in kernel 7.1. Could run draft model for speculative decoding.
  • TurboQuant 3-bit KV (ICLR 2026): 4.9x KV compression. Being integrated into llama.cpp.
  • LLMLingua-2: 20x prompt compression for agentic/RAG workloads.

Hardware Limits (measured via clpeak on this system, 2026-03-30)

| Resource | Value | Notes |
|---|---|---|
| DRAM bandwidth | 216-233 GB/s | clpeak float: 215.7, float16: 232.6. Theoretical max: 256 GB/s. |
| Infinity Cache | 32 MB, ~1 TB/s | MoE effective throughput exceeds DRAM BW due to cache hits |
| GPU clocks | 2900 MHz confirmed | clpeak shows clocks reach 2900 MHz under load. ROCm #5750 sysfs reporting may be broken but actual clocks are correct. |
| FP16 compute | 21.9 TFLOPS | 20 CUs x 2900 MHz |
| FP32 compute | 12.3 TFLOPS | |
| LPDDR5X-8000 | 8000 MT/s, 256-bit | Soldered, no overclocking |
| Infinity Fabric | 2 GHz FCLK | Fixed |
| HP ZBook sustained power | 70W (firmware cap) | RyzenAdj can set 85W but HP overrides to 70W |

Note on MoE bandwidth: MoE models (3B active params) read ~13.5 GB per token at Q4. At 55 t/s this implies ~740 GB/s effective throughput — well above the 216 GB/s DRAM ceiling. The 32 MB Infinity Cache (~1 TB/s) is actively boosting throughput for repeated KV cache and activation reads.
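The effective-throughput figure can be reproduced directly from the numbers quoted in the note:

```shell
# 13.5 GB read per token x 55 t/s = effective memory throughput in GB/s
awk 'BEGIN { printf "%.1f GB/s\n", 13.5 * 55 }'   # 742.5 GB/s
```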


Performance Summary (all measured, Vulkan RADV, q4_0 KV)

Model Shootout (pp2048/tg1024, Phase 1+2 applied)

| Model | Arch | Size | pp2048 | tg1024 |
|---|---|---|---|---|
| Qwen3-Coder-30B UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | 61.0 |
| Qwen3.5-35B-A3B UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | 821 | 54.9 |
| Nemotron-Cascade-2 Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| Qwen3-Coder-Next UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |

tg speed is bandwidth-bound: smaller model = faster tokens.

Optimization Journey (Qwen3.5-35B-A3B, tg on Vulkan)

| Stage | tg (t/s) | Improvement |
|---|---|---|
| Pre-optimization (stock) | ~33 | Baseline |
| After Phase 1 (tuned + kernel + BIOS) | ~39 | +18% |
| After Phase 2 (RyzenAdj + sysctl + THP) | ~57 | +46% |

Rollback

sudo make rollback   # Restores GRUB backup and previous tuned profile

Phase 2 rollback:

  • RyzenAdj: sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=60000
  • Sysctl: sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system
  • THP: echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
  • nogttspill: sudo rm /etc/environment.d/radv-llm.conf
  • BIOS VRAM: revert manually (F10 at boot)

Further Reading