Optimization Guide
Complete walkthrough for optimizing AMD Strix Halo for LLM inference workloads. Organized in phases from essential to experimental. Each phase builds on the previous.
Prerequisites: Run make audit first to see your current state. Run make benchmark-baseline to capture pre-optimization performance numbers.
Track results in optimization-log.md as you apply each change.
Phase 1: Core System (automated scripts)
These are the foundational optimizations handled by this repo's scripts. Apply in order.
1.1 Tuned Profile (no reboot)
sudo make optimize-tuned
Switches the tuned profile from throughput-performance to accelerator-performance, which disables higher-latency CPU C-states and sets the CPU governor to performance.
Takes effect immediately. Previous profile is saved for rollback.
| Expected Impact | pp512 | tg128 |
|---|---|---|
| Tuned profile | +5-8% | +2-3% |
1.2 Kernel Boot Parameters (reboot required)
sudo make optimize-kernel
Adds three parameters to GRUB:
| Parameter | Value (64 GB) | Purpose |
|---|---|---|
| `iommu=pt` | -- | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | 60416 | Max GPU-addressable system RAM in MiB (~59 GiB) |
| `ttm.pages_limit` | 15466496 | Max pinnable 4K pages for GPU memory |
Values are computed dynamically based on your system's total physical RAM. The script backs up /etc/default/grub before modifying it. See architecture.md for the math.
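The arithmetic behind the 64 GB defaults can be sketched as follows. The 5 GiB OS reserve is an assumption inferred from the 60416 figure; architecture.md has the authoritative formula.

```shell
# Sketch of the GTT sizing math for a 64 GB machine.
# Assumption: ~5 GiB is held back for the OS (see architecture.md).
total_mib=65536                      # 64 GB of physical RAM, in MiB
gtt_mib=$(( total_mib - 5120 ))      # hold back 5 GiB: 65536 - 5120 = 60416
pages=$(( gtt_mib * 256 ))           # MiB to 4 KiB pages: 60416 * 256 = 15466496
echo "amdgpu.gttsize=${gtt_mib} ttm.pages_limit=${pages}"
```

Note that ttm.pages_limit is just gttsize expressed in 4 KiB pages, so the two values must move together if you ever edit either by hand.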
1.3 BIOS VRAM Reduction (reboot + BIOS access)
make optimize-vram # Prints guidance — cannot modify BIOS directly
Reduce dedicated VRAM (UMA Frame Buffer Size) from 32 GB to 512 MB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.
See bios-vram-guide.md for the full BIOS walkthrough (HP ZBook: F10 at boot).
Combine 1.2 and 1.3 into a single reboot: apply kernel params, then reboot into BIOS to change VRAM, then boot normally.
1.4 Verify
make verify # 9-point checklist, target: 9/9
make audit # Single-screen system status with scores
Phase 1 Expected Impact (combined)
| Optimization | pp512 | tg128 |
|---|---|---|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | +10-20% | +5-15% |
| Phase 1 combined | +15-25% | +8-18% |
Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion.
Phase 2: System Tuning (manual, no reboot unless noted)
These require root but are safe to apply and revert.
2.1 Power Budget Increase via RyzenAdj
The HP ZBook Ultra G1a ships with a conservative 60W power limit. The Strix Halo chip supports 120W. Community testing shows 85W is the sweet spot: +12-19% over 60W, with manageable thermals.
# Install ryzenadj (Fedora)
sudo dnf install ryzenadj # or build from https://github.com/FlyGoat/RyzenAdj
# Apply 85W limits (milliwatts)
sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000
# Verify
sudo ryzenadj -i | grep -E 'STAPM|PPT'
| Setting | HP Default | Recommended | Max (risky) |
|---|---|---|---|
| STAPM | 60W | 85W | 120W |
| PPT Fast | 60W | 85W | 120W |
| PPT Slow | 20W | 85W | 120W |
Notes:
- Settings are volatile — reset on reboot/sleep. Create a systemd service for persistence.
- Going above 85W yields only +2-3% more (LLM inference is memory-bandwidth-bound at ~215 GB/s).
- Monitor thermals with `sensors` or `amdgpu_top`. Throttling starts around 100C junction temp.
- HP firmware may periodically reset limits. Verify after wake from sleep.
- The 140W USB-C charger limits total system draw. At 100W+ APU, battery will drain even while plugged in.
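The volatility note above can be handled with a oneshot systemd unit. This is a sketch: the unit name is made up, the ryzenadj path may differ on your install, and the suspend.target hook (a common systemd pattern) reapplies the limits after resume.

```ini
# /etc/systemd/system/ryzenadj-85w.service (hypothetical name)
[Unit]
Description=Apply 85W power limits via RyzenAdj
After=multi-user.target suspend.target

[Service]
Type=oneshot
ExecStart=/usr/bin/ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000

[Install]
WantedBy=multi-user.target suspend.target
```

Enable with `sudo systemctl daemon-reload && sudo systemctl enable --now ryzenadj-85w.service`, then re-check `sudo ryzenadj -i` after the next resume.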
2.2 VM / Sysctl Tuning
# Apply immediately
sudo sysctl -w vm.swappiness=1
sudo sysctl -w vm.dirty_ratio=40
sudo sysctl -w vm.dirty_background_ratio=10
sudo sysctl -w vm.max_map_count=500000
sudo sysctl -w vm.zone_reclaim_mode=0
# Persist across reboots
sudo tee /etc/sysctl.d/99-llm-inference.conf << 'EOF'
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
EOF
| Parameter | Default | Recommended | Why |
|---|---|---|---|
| `vm.swappiness` | 60 | 1 | Prevent model weights from being swapped out |
| `vm.dirty_ratio` | 20 | 40 | Reduce I/O flush storms during inference |
| `vm.dirty_background_ratio` | 10 | 10 | Keep background writeback at default |
| `vm.max_map_count` | 65530 | 500000 | Large models need many memory mappings |
| `vm.zone_reclaim_mode` | 0 | 0 | Don't aggressively reclaim memory zones |
2.3 Transparent Huge Pages
THP reduces TLB misses for mmap'd model files (~55 GB model = 14M page table entries at 4KB vs 28K at 2MB).
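The page-count figures in parentheses come straight from dividing the model size by the page size (a sketch; 55 GiB is used as a stand-in for a large model file):

```shell
# Page-table entries needed to map a ~55 GiB model at each page size
model_bytes=$(( 55 * 1024 * 1024 * 1024 ))
echo "$(( model_bytes / 4096 )) entries at 4 KiB"      # 14417920 (~14M)
echo "$(( model_bytes / 2097152 )) entries at 2 MiB"   # 28160 (~28K)
```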
# Apply immediately
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# Persist via kernel cmdline (add to GRUB):
# transparent_hugepage=always
# Verify THP is being used
grep -i huge /proc/meminfo
grep thp /proc/vmstat
Trade-off: always may cause rare latency spikes during memory compaction. Use madvise if you need predictable latency, but note that llama.cpp does not call madvise(MADV_HUGEPAGE) so always is needed.
2.4 RADV_PERFTEST=nogttspill (Vulkan backend)
Prevents unnecessary GTT spill management on unified memory. Fixes prompt processing degradation with the Vulkan RADV backend.
# Per-session
export RADV_PERFTEST=nogttspill
# Persist system-wide
echo 'RADV_PERFTEST=nogttspill' | sudo tee /etc/environment.d/radv.conf
Only affects the Vulkan RADV backend. No effect on ROCm.
2.5 Additional Kernel Parameters (reboot required)
These can be added to the GRUB cmdline alongside Phase 1 params:
| Parameter | Value | Purpose | Priority |
|---|---|---|---|
| `amdgpu.noretry` | 0 | Enable GPU page fault retry, improves stability | Medium — add if seeing GPU crashes |
| `transparent_hugepage` | always | Persist THP setting | Medium |
| `preempt` | voluntary | Reduce context switch overhead | Low — only for batch inference |
| `processor.max_cstate` | 1 | Disable deep C-states | Low — tuned profile handles this |
Do NOT add: amdgpu.ppfeaturemask=0xffffffff — OverDrive is non-functional on gfx1151 (ROCm issue #5750).
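One way to apply these is to append them to the kernel command line in /etc/default/grub alongside the Phase 1 parameters. A sketch (your existing line and the config regeneration path depend on the distro; the path shown is Fedora's):

```shell
# In /etc/default/grub, append the optional params to the existing line:
GRUB_CMDLINE_LINUX="... iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496 transparent_hugepage=always amdgpu.noretry=0"
# Then regenerate the config and reboot:
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```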
Phase 3: Runtime Flags (per-invocation, no system changes)
These are llama-bench / llama-server flags that affect performance without changing the system.
3.1 Always Use -mmp 0 (no mmap)
On unified memory, mmap adds a double-copy penalty. The --no-mmap / -mmp 0 flag loads weights directly. Already set in this repo's benchmark scripts.
3.2 Batch Size for MoE Models (-b 256)
Default batch size (2048) is too large for MoE on this hardware. Reducing to 256 can improve pp512 throughput by up to 70% on MoE models.
# In llama-bench
llama-bench -m model.gguf -b 256 -ngl 99 -fa 1
# In llama-server
llama-server -m model.gguf -b 256 -ngl 99 -fa 1
3.3 KV Cache Quantization
Q8_0 KV cache halves KV memory usage with negligible quality loss. Recommended as default for all serving.
# llama-server
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0
# Benchmark sweep
make benchmark ARGS="--tag kv-sweep --kv-types f16,q8_0,q4_0 --context 131072 --models MODEL.gguf --reps 3"
| KV Type | Memory Savings | Quality Impact | Recommendation |
|---|---|---|---|
| f16 | Baseline | None | Default for benchmarks |
| q8_0 | ~50% | Negligible | Default for serving |
| q4_0 | ~75% | Noticeable on reasoning | Only for max context |
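The savings matter most at long context. A back-of-envelope KV size with hypothetical model dimensions (48 layers, 8 KV heads, head dim 128; illustrative, not a specific model):

```shell
# KV bytes = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes_per_element
layers=48 kv_heads=8 head_dim=128 ctx=131072
f16_bytes=$(( 2 * layers * kv_heads * head_dim * ctx * 2 ))   # f16 = 2 bytes/element
echo "f16:  $(( f16_bytes / 1024 / 1024 / 1024 )) GiB"        # 24 GiB at 128K context
echo "q8_0: $(( f16_bytes / 2 / 1024 / 1024 / 1024 )) GiB"    # roughly half (~1 byte/element)
```

At 128K context the f16 cache alone rivals the model weights in size, which is why q8_0 is the serving default here.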
3.4 Flash Attention (-fa 1)
Always enable on ROCm (+24% pp improvement). On Vulkan, FA uses CoopMat1 (modest improvement). Already set in benchmark scripts.
3.5 ROCBLAS_USE_HIPBLASLT=1 (ROCm only)
Without this, ROCm pp on gfx1151 is 2-7x slower. Already set in benchmark scripts.
3.6 Backend Selection
Neither ROCm nor Vulkan is universally faster:
| Workload | Best Backend | Why |
|---|---|---|
| Short context tg | Vulkan RADV | Lower per-token overhead |
| Long context (8K-130K) | ROCm + rocWMMA | True HW flash attention |
| General stability | Vulkan RADV | More mature on gfx1151 |
Never use AMDVLK — RADV scales 3.6x better at extreme context depths.
Phase 4: Build Optimizations (requires rebuilding containers)
These require rebuilding the llama.cpp toolbox containers with specific flags.
4.1 ROCm Build Flags
cmake -B build \
  -DGGML_HIP=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGGML_HIP_UMA=ON \
  -DAMDGPU_TARGETS=gfx1151
# GGML_HIP_ROCWMMA_FATTN: GPU-accelerated flash attention via WMMA
# GGML_HIP_UMA: unified-memory-aware allocation
GGML_HIP_ROCWMMA_FATTN is the only path to true GPU-accelerated flash attention on AMD (96% speedup at 65K context). The Vulkan CoopMat1 path is a software fallback.
4.2 rocWMMA Tuned Patch (PR #16827)
Fixes a long-context regression in rocWMMA. Implements adaptive KQ stride, better launch bounds, and selective WMMA (prefill only; decode reverts to VEC/TILE). Check if Donato Capitella's ROCm toolboxes include this.
4.3 Vulkan Cooperative Matrices
RADV supports VK_KHR_cooperative_matrix for RDNA 3+. Building llama.cpp with cooperative matrix support could enable WMMA-like speedups without ROCm dependency.
Phase 5: Future / Currently Blocked
These optimizations are not available today but are worth tracking.
5.1 Speculative Decoding (blocked: llama.cpp PR #20075)
Expected 1.8-2.5x tg speedup for coding tasks. Draft model (Qwen3.5-0.8B-Q8_0.gguf, 812 MB) already downloaded. Blocked because Qwen3.5 MoE uses hybrid GatedDeltaNet architecture that breaks llama.cpp's speculative rollback mechanism.
Track: llama.cpp PR #20075 — fix for hybrid SSM/MoE speculative decoding.
5.2 Native Multi-Token Prediction (blocked: llama.cpp PR #20700)
Qwen3.5 was trained with built-in MTP heads. No separate draft model needed. Works in vLLM/SGLang today but not llama.cpp.
Track: llama.cpp PR #20700 — MTP for Qwen3.5 with FastMTP vocabulary trimming.
5.3 GPU Clock Fix (blocked: ROCm #5750)
GPU clocks on gfx1151 may be stuck at ~885 MHz instead of 2900 MHz. power_dpm_force_performance_level and OverDrive are non-functional. If fixed, this could unlock significant additional throughput.
Track: ROCm issue #5750 — Strix Halo stuck in low power clocks.
5.4 SageAttention
2-5x speedup over FlashAttention via quantized attention computation. No AMD port exists yet.
5.5 AMD XDNA NPU (50 TOPS)
Not viable for LLM inference today. Linux support coming in kernel 7.1. Future potential: running a draft model on the NPU for speculative decoding while the GPU runs the main model.
5.6 TurboQuant 3-bit KV Cache (ICLR 2026)
4.9x KV cache compression with minimal quality loss. Being integrated into llama.cpp.
5.7 LLMLingua-2 Prompt Compression
20x prompt compression for agentic/RAG workloads. Reduces pp time by compressing input before inference. Applicable to the agentic eval pipeline.
Hardware Limits (cannot be changed)
Understanding what is fixed helps avoid wasted effort.
| Resource | Value | Notes |
|---|---|---|
| Memory bandwidth | ~215 GB/s (measured) | 84% of 256 GB/s theoretical. Hard ceiling for tg speed. |
| LPDDR5X-8000 | 8000 MT/s, 256-bit | Soldered, no XMP/EXPO, no overclocking |
| Infinity Fabric | 2 GHz FCLK | Fixed, not tunable on Strix Halo |
| Infinity Cache | 32 MB | ~1 TB/s hit bandwidth. Per-layer weights exceed it. |
| GPU clocks | Up to 2900 MHz | Currently broken in driver (see 5.3) |
| Max power | 120W APU | HP ZBook charger is 140W total system |
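The bandwidth row is the one that bounds tg directly: every generated token streams the active weights through memory once, so the ceiling is roughly bandwidth divided by active-weight bytes. A sketch with illustrative (hypothetical) model sizes:

```shell
bw_gbs=215                               # measured memory bandwidth, GB/s
for gb in 8 16 32 55; do                 # hypothetical active-weight sizes in GB
  echo "~${gb} GB active: tg ceiling ~$(( bw_gbs / gb )) tok/s"
done
```

For MoE models only the active experts count toward this, which is why they can run far faster than dense models of the same total size.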
Rollback
sudo make rollback # Restores GRUB backup and previous tuned profile
BIOS VRAM must be reverted manually (F10 at boot, restore previous UMA Frame Buffer Size).
Phase 2 changes can be reverted individually:
- RyzenAdj: `sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=20000` (HP defaults per the table in 2.1)
- Sysctl: `sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system`
- THP: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`
- nogttspill: `sudo rm /etc/environment.d/radv.conf`
Troubleshooting
If anything goes wrong, see troubleshooting.md.
Further Reading
- Hardware analysis — Deep dive into llama.cpp flags, backends, quantization
- Inference landscape — Broader survey of engines, techniques, and future directions
- Benchmarking guide — Methodology and result interpretation
- References — All external links