Optimization Guide
Complete walkthrough for optimizing AMD Strix Halo for LLM inference workloads. Organized in phases from essential to experimental. Each phase builds on the previous.
Prerequisites: Run make audit first to see your current state. Run make benchmark-baseline to capture pre-optimization performance numbers.
Track results in optimization-log.md as you apply each change.
Phase 1: Core System (automated scripts)
These are the foundational optimizations handled by this repo's scripts. Apply in order.
1.1 Tuned Profile (no reboot)
sudo make optimize-tuned
Switches to the accelerator-performance profile: disables higher-latency CPU idle states (deep C-states) and sets the CPU governor to performance.
Measured: +5-8% pp, +2-3% tg.
1.2 Kernel Boot Parameters (reboot required)
sudo make optimize-kernel
| Parameter | Value (64 GB) | Purpose |
|---|---|---|
| iommu=pt | -- | IOMMU passthrough, reduces memory access latency |
| amdgpu.gttsize | 60416 | Max GPU-addressable system RAM in MiB (~59 GiB) |
| ttm.pages_limit | 15466496 | Max pinnable 4K pages for GPU memory |
Values computed dynamically from total physical RAM. See architecture.md for the math.
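The 64 GB values can be reproduced with simple arithmetic. A sketch (the ~5 GiB OS reserve is an assumption inferred from the numbers in the table; see architecture.md for the authoritative formula):

```shell
total_mib=65536                  # 64 GB of physical RAM, in MiB
gtt_mib=$(( total_mib - 5120 ))  # assumed ~5 GiB reserved for the OS -> 60416
pages=$(( gtt_mib * 256 ))       # 256 x 4 KiB pages per MiB -> 15466496
echo "amdgpu.gttsize=$gtt_mib ttm.pages_limit=$pages"
```

Note that ttm.pages_limit is just gttsize re-expressed in 4 KiB pages, so the two values must be kept in sync.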
1.3 BIOS VRAM Reduction (reboot + BIOS access)
make optimize-vram # Prints guidance — cannot modify BIOS directly
Reduce UMA Frame Buffer Size from 32 GB to 512 MB. Frees 31.5 GB for GTT. See bios-vram-guide.md (HP ZBook: F10 at boot).
Combine 1.2 and 1.3 into a single reboot.
1.4 Verify
make verify # 9-point checklist, target: 9/9
make audit # Single-screen system status with scores
Phase 1 Measured Impact
| Optimization | pp | tg |
|---|---|---|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | Enables 37 GB+ models | +5-15% |
| Phase 1 combined | +15-25% | +8-18% |
Trade-off: Small models (<5 GB) are ~3-8% slower due to GTT indirection vs dedicated VRAM. Acceptable given the massive capability gain.
Phase 2: System Tuning
All Phase 2 optimizations are applied with a single command:
sudo make optimize-power
This applies RyzenAdj, sysctl, THP, and RADV nogttspill. Sysctl and nogttspill persist across reboots. RyzenAdj and THP are volatile.
For RyzenAdj persistence across reboots:
sudo cp configs/ryzenadj-llm.service configs/ryzenadj-resume.service /etc/systemd/system/
sudo systemctl enable --now ryzenadj-llm.service
sudo systemctl enable ryzenadj-resume.service
2.1 RyzenAdj PPT Increase
The HP ZBook Ultra G1a ships at 59W sustained. RyzenAdj raises this.
Measured on HP ZBook: STAPM raised to 81W, but HP firmware hard-caps PPT SLOW at 70W. Effective sustained power: ~70W (was ~59W). Cannot be overridden without modded BIOS.
Measured impact: Qwen3.5-35B-A3B tg1024 went from ~39 t/s → 57 t/s (+46%). This was the single largest improvement in the entire optimization journey.
Thermals: 70-73C under sustained load. 30C headroom. Cooling handles it easily.
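RyzenAdj takes its limits in milliwatts. A hypothetical invocation requesting 85 W on all three limits (printed rather than executed here; HP firmware will still clamp sustained power to ~70 W):

```shell
target_w=85
limit_mw=$(( target_w * 1000 ))  # ryzenadj limits are specified in mW
echo "sudo ryzenadj --stapm-limit=$limit_mw --fast-limit=$limit_mw --slow-limit=$limit_mw"
```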
2.2 VM / Sysctl Tuning
Persisted to /etc/sysctl.d/99-llm-inference.conf:
| Parameter | Default | Set To | Why |
|---|---|---|---|
| vm.swappiness | 60 | 1 | Prevent model weight eviction |
| vm.dirty_ratio | 20 | 40 | Reduce I/O flush storms during inference |
| vm.max_map_count | 65530 | 500000 | Large models need many memory mappings |
| vm.zone_reclaim_mode | 0 | 0 | Don't aggressively reclaim memory zones |
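Applying the table, the persisted file would read roughly as follows (a sketch; the script may add its own comments or ordering):

```
# /etc/sysctl.d/99-llm-inference.conf
vm.swappiness = 1
vm.dirty_ratio = 40
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
```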
2.3 Transparent Huge Pages
Set to always. Reduces TLB misses for mmap'd model files. Volatile — add transparent_hugepage=always to kernel cmdline for persistence.
2.4 RADV_PERFTEST=nogttspill
Persisted to /etc/environment.d/radv-llm.conf. Prevents GTT spill management overhead on unified memory. Vulkan RADV only.
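The persisted environment file is a single line (sketch, matching the path named above):

```
# /etc/environment.d/radv-llm.conf
RADV_PERFTEST=nogttspill
```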
Phase 3: Runtime Flags (per-invocation)
These flags should be used when running llama-bench, llama-server, or llama-cli. Already set in this repo's benchmark scripts.
3.1 -mmp 0 (no mmap) — mandatory
On unified memory, mmap adds a double-copy penalty. Always disable it (-mmp 0 for llama-bench; --no-mmap for llama-server and llama-cli).
3.2 KV Cache Quantization — use Q4_0
Measured (Vulkan RADV, Qwen3.5-35B-A3B):
| KV Type | pp2048 | tg1024 | Memory Savings |
|---|---|---|---|
| f16 | 456 | 39.8 | Baseline |
| q8_0 | 418 | 38.5 | ~50% (slightly slower on Vulkan) |
| q4_0 | 460 | 41.1 | 75% (fastest on Vulkan) |
Q4_0 is faster because less memory bandwidth is spent reading KV cache. Use as default for serving. Quality impact is noticeable only on reasoning-heavy tasks.
3.3 Flash Attention (-fa 1) — always enable
+24% pp on ROCm. Modest improvement on Vulkan (CoopMat1). Already in benchmark scripts.
3.4 ROCBLAS_USE_HIPBLASLT=1 (ROCm only) — mandatory
Without this, ROCm pp is 2-7x slower on gfx1151. Already in benchmark scripts.
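Sections 3.1-3.4 combine into a single invocation. A sketch, printed rather than executed (model.gguf is a placeholder path, and the ROCBLAS variable matters only on the ROCm backend):

```shell
# no-mmap, flash attention, q4_0 KV cache for both K and V
flags="-mmp 0 -fa 1 -ctk q4_0 -ctv q4_0"
echo "ROCBLAS_USE_HIPBLASLT=1 llama-bench -m model.gguf $flags"
```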
3.5 Backend Selection
Measured (Qwen3.5-35B-A3B Q8, 128K context):
| Workload | Vulkan RADV | ROCm 7.2 | Winner |
|---|---|---|---|
| pp2048 | 456 | 445 | Vulkan (+2%) |
| tg1024 | 39.8 | 21.5 | Vulkan (1.9x) |
| pp8192 @ 131K | 130 | 84 | Vulkan (1.5x) |
| tg32 @ 131K | 22.0 | 8.1 | Vulkan (2.7x) |
Vulkan RADV dominates across all workloads on this hardware. ROCm is significantly slower for token generation, especially at long context. Never use AMDVLK.
3.6 MoE Batch Size -b 256 — NOT YET TESTED
Community reports up to +70% pp improvement for MoE models. Default batch (2048) may be too large. Needs benchmarking.
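A batch-size sweep would settle it. An untested sketch that prints the candidate commands (model.gguf is a placeholder):

```shell
# Sweep batch sizes from the community-suggested 256 up to the default 2048
for b in 256 512 1024 2048; do
  echo "llama-bench -m model.gguf -mmp 0 -fa 1 -b $b -p 2048"
done
```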
Phase 4: Build Optimizations (not yet tested)
These require rebuilding the llama.cpp toolbox containers. Given that Vulkan RADV already outperforms ROCm significantly, the ROI of these ROCm-specific optimizations is unclear.
4.1 ROCm Build Flags
cmake -B build \
-DGGML_HIP=ON \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DGGML_HIP_UMA=ON \
-DAMDGPU_TARGETS=gfx1151
GGML_HIP_ROCWMMA_FATTN enables GPU-accelerated flash attention via WMMA. Could close the gap between ROCm and Vulkan for long-context workloads. Check if Donato Capitella's ROCm toolboxes already include this.
4.2 Vulkan Cooperative Matrices
RADV supports VK_KHR_cooperative_matrix for RDNA 3+. Already used by llama.cpp for matrix operations. The current Vulkan toolbox shows matrix cores: KHR_coopmat — this is likely already active.
Phase 5: Future / Currently Blocked
5.1 Speculative Decoding
Expected 1.8-2.5x tg speedup. Draft model (Qwen3.5-0.8B-Q8_0.gguf) downloaded.
Blocked: llama.cpp PR #20075 — hybrid SSM/MoE speculative rollback.
5.2 Native Multi-Token Prediction
Qwen3.5 has built-in MTP heads. No draft model needed.
Blocked: llama.cpp PR #20700 — MTP for Qwen3.5.
5.3 GPU Clock Reporting Fix
ROCm issue #5750 reports GPU clocks appearing stuck at ~885 MHz in sysfs. However, clpeak confirms clocks reach 2900 MHz under compute load (measured 2026-03-30). The issue is likely broken sysfs reporting, not actual clock throttling. No performance impact.
Tracking: ROCm issue #5750 (sysfs reporting only, not a real blocker).
5.4 Other Future Items
- SageAttention: 2-5x over FlashAttention. No AMD port yet.
- XDNA NPU (50 TOPS): Linux support coming in kernel 7.1. Could run draft model for speculative decoding.
- TurboQuant 3-bit KV (ICLR 2026): 4.9x KV compression. Being integrated into llama.cpp.
- LLMLingua-2: 20x prompt compression for agentic/RAG workloads.
Hardware Limits (measured via clpeak on this system, 2026-03-30)
| Resource | Value | Notes |
|---|---|---|
| DRAM bandwidth | 216-233 GB/s | clpeak float: 215.7, float16: 232.6. Theoretical max: 256 GB/s. |
| Infinity Cache | 32 MB, ~1 TB/s | MoE effective throughput exceeds DRAM BW due to cache hits |
| GPU clocks | 2900 MHz confirmed | clpeak shows clocks reach 2900 MHz under load. ROCm #5750 sysfs reporting may be broken but actual clocks are correct. |
| FP16 compute | 21.9 TFLOPS | 20 CUs x 2900 MHz |
| FP32 compute | 12.3 TFLOPS | |
| LPDDR5X-8000 | 8000 MT/s, 256-bit | Soldered, no overclocking |
| Infinity Fabric | 2 GHz FCLK | Fixed |
| HP ZBook sustained power | 70W (firmware cap) | RyzenAdj can set 85W but HP overrides to 70W |
Note on MoE bandwidth: MoE models (3B active params) read ~13.5 GB per token at Q4. At 55 t/s this implies ~740 GB/s effective throughput — well above the 216 GB/s DRAM ceiling. The 32 MB Infinity Cache (~1 TB/s) is actively boosting throughput for repeated KV cache and activation reads.
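The effective-throughput figure quoted above is straightforward to sanity-check (13.5 GB/token and 55 t/s are the document's own numbers):

```shell
# bytes read per token x tokens per second = effective read throughput
eff_gbps=$(awk 'BEGIN { print 13.5 * 55 }')
echo "effective throughput: $eff_gbps GB/s"   # 742.5, i.e. ~740 GB/s
```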
Performance Summary (all measured, Vulkan RADV, q4_0 KV)
Model Shootout (pp2048/tg1024, Phase 1+2 applied)
| Model | Arch | Size | pp2048 | tg1024 |
|---|---|---|---|---|
| Qwen3-Coder-30B UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | 61.0 |
| Qwen3.5-35B-A3B UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | 821 | 54.9 |
| Nemotron-Cascade-2 Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| Qwen3-Coder-Next UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
tg speed is bandwidth-bound: the fewer bytes read per generated token, the faster the model runs (for MoE models this tracks active-parameter size, not file size).
Optimization Journey (Qwen3.5-35B-A3B, tg on Vulkan)
| Stage | tg (t/s) | Improvement |
|---|---|---|
| Pre-optimization (stock) | ~33 | Baseline |
| After Phase 1 (tuned + kernel + BIOS) | ~39 | +18% |
| After Phase 2 (RyzenAdj + sysctl + THP) | ~57 | +46% |
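End to end, the journey above works out to roughly +73% tg over stock:

```shell
# cumulative gain from ~33 t/s (stock) to ~57 t/s (Phase 1 + 2)
gain_pct=$(awk 'BEGIN { printf "%.0f", (57/33 - 1) * 100 }')
echo "cumulative tg improvement: +${gain_pct}%"
```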
Rollback
sudo make rollback # Restores GRUB backup and previous tuned profile
Phase 2 rollback:
- RyzenAdj: sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=60000
- Sysctl: sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system
- THP: echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
- nogttspill: sudo rm /etc/environment.d/radv-llm.conf
- BIOS VRAM: revert manually (F10 at boot)
Further Reading
- Optimization log — Detailed test results and verdicts
- Hardware analysis — llama.cpp flags, backends, quantization deep dive
- Inference landscape — Broader survey of engines and techniques
- Benchmarking guide — Methodology and result interpretation
- References — All external links