Optimization Guide
Complete walkthrough for optimizing AMD Strix Halo for LLM inference workloads. Organized in phases from essential to experimental. Each phase builds on the previous.
Prerequisites: Run make audit first to see your current state. Run make benchmark-baseline to capture pre-optimization performance numbers.
Track results in optimization-log.md as you apply each change.
Phase 1: Core System (automated scripts)
These are the foundational optimizations handled by this repo's scripts. Apply in order.
1.1 Tuned Profile (no reboot)
sudo make optimize-tuned
Switches to the accelerator-performance profile: disables higher-latency CPU idle states (deep C-states) and sets the CPU governor to performance.
Measured: +5-8% pp, +2-3% tg.
1.2 Kernel Boot Parameters (reboot required)
sudo make optimize-kernel
| Parameter | Value (64 GB) | Purpose |
|---|---|---|
| iommu=pt | -- | IOMMU passthrough, reduces memory access latency |
| amdgpu.gttsize | 60416 | Max GPU-addressable system RAM in MiB (~59 GiB) |
| ttm.pages_limit | 15466496 | Max pinnable 4K pages for GPU memory |
Values computed dynamically from total physical RAM. See architecture.md for the math.
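The 64 GB values can be reproduced with simple arithmetic. A sketch (the ~5 GiB OS reserve is an assumption inferred from the numbers in the table; see architecture.md for the authoritative formula):

```shell
total_mib=65536                  # 64 GB of physical RAM, in MiB
gtt_mib=$(( total_mib - 5120 ))  # assumed ~5 GiB reserved for the OS -> 60416
pages=$(( gtt_mib * 256 ))       # 256 x 4 KiB pages per MiB -> 15466496
echo "amdgpu.gttsize=$gtt_mib ttm.pages_limit=$pages"
```

Note that ttm.pages_limit is just gttsize re-expressed in 4 KiB pages, so the two values must be kept in sync.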
1.3 BIOS VRAM Reduction (reboot + BIOS access)
make optimize-vram # Prints guidance — cannot modify BIOS directly
Reduce UMA Frame Buffer Size from 32 GB to 512 MB. Frees 31.5 GB for GTT. See bios-vram-guide.md (HP ZBook: F10 at boot).
Combine 1.2 and 1.3 into a single reboot.
1.4 Verify
make verify # 9-point checklist, target: 9/9
make audit # Single-screen system status with scores
Phase 1 Measured Impact
| Optimization | pp | tg |
|---|---|---|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | Enables 37 GB+ models | +5-15% |
| Phase 1 combined | +15-25% | +8-18% |
Trade-off: Small models (<5 GB) are ~3-8% slower due to GTT indirection vs dedicated VRAM. Acceptable given the massive capability gain.
Phase 2: System Tuning
All Phase 2 optimizations are applied with a single command:
sudo make optimize-power
This applies RyzenAdj, sysctl, THP, and RADV nogttspill. Sysctl and nogttspill persist across reboots. RyzenAdj and THP are volatile.
For RyzenAdj persistence across reboots:
sudo cp configs/ryzenadj-llm.service configs/ryzenadj-resume.service /etc/systemd/system/
sudo systemctl enable --now ryzenadj-llm.service
sudo systemctl enable ryzenadj-resume.service
2.1 RyzenAdj PPT Increase
The HP ZBook Ultra G1a ships at 59W sustained. RyzenAdj raises this.
Measured on HP ZBook: STAPM raised to 81W, but HP firmware hard-caps PPT SLOW at 70W. Effective sustained power: ~70W (was ~59W). Cannot be overridden without modded BIOS.
Measured impact: Qwen3.5-35B-A3B tg1024 went from ~39 t/s → 57 t/s (+46%). This was the single largest improvement in the entire optimization journey.
Thermals: 70-73C under sustained load. 30C headroom. Cooling handles it easily.
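RyzenAdj takes its limits in milliwatts. A hypothetical invocation requesting 85 W on all three limits (printed rather than executed here; HP firmware will still clamp sustained power to ~70 W):

```shell
target_w=85
limit_mw=$(( target_w * 1000 ))  # ryzenadj limits are specified in mW
echo "sudo ryzenadj --stapm-limit=$limit_mw --fast-limit=$limit_mw --slow-limit=$limit_mw"
```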
2.2 VM / Sysctl Tuning
Persisted to /etc/sysctl.d/99-llm-inference.conf:
| Parameter | Default | Set To | Why |
|---|---|---|---|
| vm.swappiness | 60 | 1 | Prevent model weight eviction |
| vm.dirty_ratio | 20 | 40 | Reduce I/O flush storms during inference |
| vm.max_map_count | 65530 | 500000 | Large models need many memory mappings |
| vm.zone_reclaim_mode | 0 | 0 | Don't aggressively reclaim memory zones |
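Applying the table, the persisted file would read roughly as follows (a sketch; the script may add its own comments or ordering):

```
# /etc/sysctl.d/99-llm-inference.conf
vm.swappiness = 1
vm.dirty_ratio = 40
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
```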
2.3 Transparent Huge Pages
Set to always. Reduces TLB misses for mmap'd model files. Volatile — add transparent_hugepage=always to kernel cmdline for persistence.
2.4 RADV_PERFTEST=nogttspill
Persisted to /etc/environment.d/radv-llm.conf. Prevents GTT spill management overhead on unified memory. Vulkan RADV only.
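The persisted environment file is a single line (sketch, matching the path named above):

```
# /etc/environment.d/radv-llm.conf
RADV_PERFTEST=nogttspill
```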
Phase 3: Runtime Flags (per-invocation)
These flags should be used when running llama-bench, llama-server, or llama-cli. Already set in this repo's benchmark scripts.
3.1 -mmp 0 (no mmap) — mandatory
On unified memory, mmap adds a double-copy penalty. Always disable it (-mmp 0 for llama-bench; --no-mmap for llama-server and llama-cli).
3.2 KV Cache Quantization — use Q4_0
Measured (Vulkan RADV, Qwen3.5-35B-A3B):
| KV Type | pp2048 | tg1024 | Memory Savings |
|---|---|---|---|
| f16 | 456 | 39.8 | Baseline |
| q8_0 | 418 | 38.5 | ~50% (slightly slower on Vulkan) |
| q4_0 | 460 | 41.1 | 75% (fastest on Vulkan) |
Q4_0 is faster because less memory bandwidth is spent reading KV cache. Use as default for serving. Quality impact is noticeable only on reasoning-heavy tasks.
3.3 Flash Attention (-fa 1) — always enable
+24% pp on ROCm. Modest improvement on Vulkan (CoopMat1). Already in benchmark scripts.
3.4 ROCBLAS_USE_HIPBLASLT=1 (ROCm only) — mandatory
Without this, ROCm pp is 2-7x slower on gfx1151. Already in benchmark scripts.
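Sections 3.1-3.4 combine into a single invocation. A sketch, printed rather than executed (model.gguf is a placeholder path, and the ROCBLAS variable matters only on the ROCm backend):

```shell
# no-mmap, flash attention, q4_0 KV cache for both K and V
flags="-mmp 0 -fa 1 -ctk q4_0 -ctv q4_0"
echo "ROCBLAS_USE_HIPBLASLT=1 llama-bench -m model.gguf $flags"
```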
3.5 Backend Selection
Measured (Qwen3.5-35B-A3B Q8, 128K context):
| Workload | Vulkan RADV | ROCm 7.2 | Winner |
|---|---|---|---|
| pp2048 | 456 | 445 | Vulkan (+2%) |
| tg1024 | 39.8 | 21.5 | Vulkan (1.9x) |
| pp8192 @ 131K | 130 | 84 | Vulkan (1.5x) |
| tg32 @ 131K | 22.0 | 8.1 | Vulkan (2.7x) |
Vulkan RADV dominates across all workloads on this hardware. ROCm is significantly slower for token generation, especially at long context. Never use AMDVLK.
3.6 MoE Batch Size -b 256 — NOT YET TESTED
Community reports up to +70% pp improvement for MoE models. Default batch (2048) may be too large. Needs benchmarking.
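A batch-size sweep would settle it. An untested sketch that prints the candidate commands (model.gguf is a placeholder):

```shell
# Sweep batch sizes from the community-suggested 256 up to the default 2048
for b in 256 512 1024 2048; do
  echo "llama-bench -m model.gguf -mmp 0 -fa 1 -b $b -p 2048"
done
```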
Phase 4: Build Optimizations (not yet tested)
These require rebuilding the llama.cpp toolbox containers. Given that Vulkan RADV already outperforms ROCm significantly, the ROI of these ROCm-specific optimizations is unclear.
4.1 ROCm Build Flags
cmake -B build \
-DGGML_HIP=ON \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DGGML_HIP_UMA=ON \
-DAMDGPU_TARGETS=gfx1151
GGML_HIP_ROCWMMA_FATTN enables GPU-accelerated flash attention via WMMA. Could close the gap between ROCm and Vulkan for long-context workloads. Check if Donato Capitella's ROCm toolboxes already include this.
4.2 Vulkan Cooperative Matrices
RADV supports VK_KHR_cooperative_matrix for RDNA 3+. Already used by llama.cpp for matrix operations. The current Vulkan toolbox shows matrix cores: KHR_coopmat — this is likely already active.
Phase 5: Future / Currently Blocked
5.1 Speculative Decoding
Expected 1.8-2.5x tg speedup. Draft model (Qwen3.5-0.8B-Q8_0.gguf) downloaded.
Blocked: llama.cpp PR #20075 — hybrid SSM/MoE speculative rollback.
5.2 Native Multi-Token Prediction
Qwen3.5 has built-in MTP heads. No draft model needed.
Blocked: llama.cpp PR #20700 — MTP for Qwen3.5.
5.3 GPU Clock Reporting Fix
ROCm issue #5750 reports GPU clocks appearing stuck at ~885 MHz in sysfs. However, clpeak confirms clocks reach 2900 MHz under compute load (measured 2026-03-30). The issue is likely broken sysfs reporting, not actual clock throttling. No performance impact.
Tracking: ROCm issue #5750 (sysfs reporting only, not a real blocker).
5.4 Other Future Items
- SageAttention: 2-5x over FlashAttention. No AMD port yet.
- XDNA NPU (50 TOPS): Linux support coming in kernel 7.1. Could run draft model for speculative decoding.
- TurboQuant 3-bit KV (ICLR 2026): 4.9x KV compression. Being integrated into llama.cpp.
- LLMLingua-2: 20x prompt compression for agentic/RAG workloads.
Hardware Limits (measured via clpeak on this system, 2026-03-30)
| Resource | Value | Notes |
|---|---|---|
| DRAM bandwidth | 216-233 GB/s | clpeak float: 215.7, float16: 232.6. Theoretical max: 256 GB/s. |
| Infinity Cache | 32 MB, ~1 TB/s | MoE effective throughput exceeds DRAM BW due to cache hits |
| GPU clocks | 2900 MHz confirmed | clpeak shows clocks reach 2900 MHz under load. ROCm #5750 sysfs reporting may be broken but actual clocks are correct. |
| FP16 compute | 21.9 TFLOPS | 20 CUs x 2900 MHz |
| FP32 compute | 12.3 TFLOPS | |
| LPDDR5X-8000 | 8000 MT/s, 256-bit | Soldered, no overclocking |
| Infinity Fabric | 2 GHz FCLK | Fixed |
| HP ZBook sustained power | 70W (firmware cap) | RyzenAdj can set 85W but HP overrides to 70W |
Note on MoE bandwidth: MoE models (3B active params) read ~13.5 GB per token at Q4. At 55 t/s this implies ~740 GB/s effective throughput — well above the 216 GB/s DRAM ceiling. The 32 MB Infinity Cache (~1 TB/s) is actively boosting throughput for repeated KV cache and activation reads.
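The effective-throughput figure quoted above is straightforward to sanity-check (13.5 GB/token and 55 t/s are the document's own numbers):

```shell
# bytes read per token x tokens per second = effective read throughput
eff_gbps=$(awk 'BEGIN { print 13.5 * 55 }')
echo "effective throughput: $eff_gbps GB/s"   # 742.5, i.e. ~740 GB/s
```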
Performance Summary (all measured, Vulkan RADV, q4_0 KV)
Model Shootout (pp2048/tg1024, Phase 1+2 applied)
| Model | Arch | Size | pp2048 | tg1024 |
|---|---|---|---|---|
| Qwen3-Coder-30B UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | 61.0 |
| Qwen3.5-35B-A3B UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | 821 | 54.9 |
| Nemotron-Cascade-2 Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| Qwen3-Coder-Next UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
tg speed is bandwidth-bound: the fewer bytes read per generated token, the faster the model runs (for MoE models this tracks active-parameter size, not file size).
Optimization Journey (Qwen3.5-35B-A3B, tg on Vulkan)
| Stage | tg (t/s) | Improvement |
|---|---|---|
| Pre-optimization (stock) | ~33 | Baseline |
| After Phase 1 (tuned + kernel + BIOS) | ~39 | +18% |
| After Phase 2 (RyzenAdj + sysctl + THP) | ~57 | +46% |
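End to end, the journey above works out to roughly +73% tg over stock:

```shell
# cumulative gain from ~33 t/s (stock) to ~57 t/s (Phase 1 + 2)
gain_pct=$(awk 'BEGIN { printf "%.0f", (57/33 - 1) * 100 }')
echo "cumulative tg improvement: +${gain_pct}%"
```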
Rollback
sudo make rollback # Restores GRUB backup and previous tuned profile
Phase 2 rollback:
- RyzenAdj: sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=60000
- Sysctl: sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system
- THP: echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
- nogttspill: sudo rm /etc/environment.d/radv-llm.conf
- BIOS VRAM: revert manually (F10 at boot)
Further Reading
- Optimization log — Detailed test results and verdicts
- Hardware analysis — llama.cpp flags, backends, quantization deep dive
- Inference landscape — Broader survey of engines and techniques
- Benchmarking guide — Methodology and result interpretation
- References — All external links