Optimization Guide
Complete walkthrough for optimizing AMD Strix Halo for LLM inference workloads. Organized in phases from essential to experimental. Each phase builds on the previous.
Prerequisites: Run make audit first to see your current state. Run make benchmark-baseline to capture pre-optimization performance numbers.
Track results in optimization-log.md as you apply each change.
Phase 1: Core System (automated scripts)
These are the foundational optimizations handled by this repo's scripts. Apply in order.
1.1 Tuned Profile (no reboot)
sudo make optimize-tuned
Switches the tuned profile from throughput-performance to accelerator-performance, which disables higher-latency CPU C-states and sets the CPU governor to performance.
Takes effect immediately. Previous profile is saved for rollback.
| Expected Impact | pp512 | tg128 |
|---|---|---|
| Tuned profile | +5-8% | +2-3% |
1.2 Kernel Boot Parameters (reboot required)
sudo make optimize-kernel
Adds three parameters to GRUB:
| Parameter | Value (64 GB) | Purpose |
|---|---|---|
| `iommu=pt` | -- | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | 60416 | Max GPU-addressable system RAM in MiB (~59 GiB) |
| `ttm.pages_limit` | 15466496 | Max pinnable 4K pages for GPU memory |
Values are computed dynamically based on your system's total physical RAM. The script backs up /etc/default/grub before modifying it. See architecture.md for the math.
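The arithmetic behind the 64 GB defaults can be sketched as follows. The 5 GiB OS reserve is an assumption inferred from the 60416 figure; architecture.md has the authoritative formula.

```shell
# Sketch of the GTT sizing math for a 64 GB machine.
# Assumption: ~5 GiB is held back for the OS (see architecture.md).
total_mib=65536                      # 64 GB of physical RAM, in MiB
gtt_mib=$(( total_mib - 5120 ))      # hold back 5 GiB: 65536 - 5120 = 60416
pages=$(( gtt_mib * 256 ))           # MiB to 4 KiB pages: 60416 * 256 = 15466496
echo "amdgpu.gttsize=${gtt_mib} ttm.pages_limit=${pages}"
```

Note that ttm.pages_limit is just gttsize expressed in 4 KiB pages, so the two values must move together if you ever edit either by hand.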
1.3 BIOS VRAM Reduction (reboot + BIOS access)
make optimize-vram # Prints guidance — cannot modify BIOS directly
Reduce dedicated VRAM (UMA Frame Buffer Size) from 32 GB to 512 MB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.
See bios-vram-guide.md for the full BIOS walkthrough (HP ZBook: F10 at boot).
Combine 1.2 and 1.3 into a single reboot: apply kernel params, then reboot into BIOS to change VRAM, then boot normally.
1.4 Verify
make verify # 9-point checklist, target: 9/9
make audit # Single-screen system status with scores
Phase 1 Expected Impact (combined)
| Optimization | pp512 | tg128 |
|---|---|---|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | +10-20% | +5-15% |
| Phase 1 combined | +15-25% | +8-18% |
Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion.
Phase 2: System Tuning (manual, no reboot unless noted)
These require root but are safe to apply and revert.
2.1 Power Budget Increase via RyzenAdj
The HP ZBook Ultra G1a ships with a conservative 60W power limit. The Strix Halo chip supports 120W. Community testing shows 85W is the sweet spot: +12-19% over 60W, with manageable thermals.
# Install ryzenadj (Fedora)
sudo dnf install ryzenadj # or build from https://github.com/FlyGoat/RyzenAdj
# Apply 85W limits (milliwatts)
sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000
# Verify
sudo ryzenadj -i | grep -E 'STAPM|PPT'
| Setting | HP Default | Recommended | Max (risky) |
|---|---|---|---|
| STAPM | 60W | 85W | 120W |
| PPT Fast | 60W | 85W | 120W |
| PPT Slow | 20W | 85W | 120W |
Notes:
- Settings are volatile — reset on reboot/sleep. Create a systemd service for persistence.
- Going above 85W yields only +2-3% more (LLM inference is memory-bandwidth-bound at ~215 GB/s).
- Monitor thermals with `sensors` or `amdgpu_top`. Throttling starts around 100C junction temp.
- HP firmware may periodically reset limits. Verify after wake from sleep.
- The 140W USB-C charger limits total system draw. At 100W+ APU, battery will drain even while plugged in.
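The volatility note above can be handled with a oneshot systemd unit. This is a sketch: the unit name is made up, the ryzenadj path may differ on your install, and the suspend.target hook (a common systemd pattern) reapplies the limits after resume.

```ini
# /etc/systemd/system/ryzenadj-85w.service (hypothetical name)
[Unit]
Description=Apply 85W power limits via RyzenAdj
After=multi-user.target suspend.target

[Service]
Type=oneshot
ExecStart=/usr/bin/ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000

[Install]
WantedBy=multi-user.target suspend.target
```

Enable with `sudo systemctl daemon-reload && sudo systemctl enable --now ryzenadj-85w.service`, then re-check `sudo ryzenadj -i` after the next resume.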
2.2 VM / Sysctl Tuning
# Apply immediately
sudo sysctl -w vm.swappiness=1
sudo sysctl -w vm.dirty_ratio=40
sudo sysctl -w vm.dirty_background_ratio=10
sudo sysctl -w vm.max_map_count=500000
sudo sysctl -w vm.zone_reclaim_mode=0
# Persist across reboots
sudo tee /etc/sysctl.d/99-llm-inference.conf << 'EOF'
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
EOF
| Parameter | Default | Recommended | Why |
|---|---|---|---|
| `vm.swappiness` | 60 | 1 | Prevent model weights from being swapped out |
| `vm.dirty_ratio` | 20 | 40 | Reduce I/O flush storms during inference |
| `vm.dirty_background_ratio` | 10 | 10 | Keep background writeback at default |
| `vm.max_map_count` | 65530 | 500000 | Large models need many memory mappings |
| `vm.zone_reclaim_mode` | 0 | 0 | Don't aggressively reclaim memory zones |
2.3 Transparent Huge Pages
THP reduces TLB misses for mmap'd model files (~55 GB model = 14M page table entries at 4KB vs 28K at 2MB).
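The page-count figures in parentheses come straight from dividing the model size by the page size (a sketch; 55 GiB is used as a stand-in for a large model file):

```shell
# Page-table entries needed to map a ~55 GiB model at each page size
model_bytes=$(( 55 * 1024 * 1024 * 1024 ))
echo "$(( model_bytes / 4096 )) entries at 4 KiB"      # 14417920 (~14M)
echo "$(( model_bytes / 2097152 )) entries at 2 MiB"   # 28160 (~28K)
```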
# Apply immediately
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# Persist via kernel cmdline (add to GRUB):
# transparent_hugepage=always
# Verify THP is being used
grep -i huge /proc/meminfo
grep thp /proc/vmstat
Trade-off: always may cause rare latency spikes during memory compaction. Use madvise if you need predictable latency, but note that llama.cpp does not call madvise(MADV_HUGEPAGE) so always is needed.
2.4 RADV_PERFTEST=nogttspill (Vulkan backend)
Prevents unnecessary GTT spill management on unified memory. Fixes prompt processing degradation with the Vulkan RADV backend.
# Per-session
export RADV_PERFTEST=nogttspill
# Persist system-wide
echo 'RADV_PERFTEST=nogttspill' | sudo tee /etc/environment.d/radv.conf
Only affects the Vulkan RADV backend. No effect on ROCm.
2.5 Additional Kernel Parameters (reboot required)
These can be added to the GRUB cmdline alongside Phase 1 params:
| Parameter | Value | Purpose | Priority |
|---|---|---|---|
| `amdgpu.noretry` | 0 | Enable GPU page fault retry, improves stability | Medium — add if seeing GPU crashes |
| `transparent_hugepage` | always | Persist THP setting | Medium |
| `preempt` | voluntary | Reduce context switch overhead | Low — only for batch inference |
| `processor.max_cstate` | 1 | Disable deep C-states | Low — tuned profile handles this |
Do NOT add: amdgpu.ppfeaturemask=0xffffffff — OverDrive is non-functional on gfx1151 (ROCm issue #5750).
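One way to apply these is to append them to the kernel command line in /etc/default/grub alongside the Phase 1 parameters. A sketch (your existing line and the config regeneration path depend on the distro; the path shown is Fedora's):

```shell
# In /etc/default/grub, append the optional params to the existing line:
GRUB_CMDLINE_LINUX="... iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496 transparent_hugepage=always amdgpu.noretry=0"
# Then regenerate the config and reboot:
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```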
Phase 3: Runtime Flags (per-invocation, no system changes)
These are llama-bench / llama-server flags that affect performance without changing the system.
3.1 Always Use -mmp 0 (no mmap)
On unified memory, mmap adds a double-copy penalty. The --no-mmap / -mmp 0 flag loads weights directly. Already set in this repo's benchmark scripts.
3.2 Batch Size for MoE Models (-b 256)
Default batch size (2048) is too large for MoE on this hardware. Reducing to 256 can improve pp512 throughput by up to 70% on MoE models.
# In llama-bench
llama-bench -m model.gguf -b 256 -ngl 99 -fa 1
# In llama-server
llama-server -m model.gguf -b 256 -ngl 99 -fa 1
3.3 KV Cache Quantization
Q8_0 KV cache halves KV memory usage with negligible quality loss. Recommended as default for all serving.
# llama-server
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0
# Benchmark sweep
make benchmark ARGS="--tag kv-sweep --kv-types f16,q8_0,q4_0 --context 131072 --models MODEL.gguf --reps 3"
| KV Type | Memory Savings | Quality Impact | Recommendation |
|---|---|---|---|
| f16 | Baseline | None | Default for benchmarks |
| q8_0 | ~50% | Negligible | Default for serving |
| q4_0 | ~75% | Noticeable on reasoning | Only for max context |
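The savings matter most at long context. A back-of-envelope KV size with hypothetical model dimensions (48 layers, 8 KV heads, head dim 128; illustrative, not a specific model):

```shell
# KV bytes = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes_per_element
layers=48 kv_heads=8 head_dim=128 ctx=131072
f16_bytes=$(( 2 * layers * kv_heads * head_dim * ctx * 2 ))   # f16 = 2 bytes/element
echo "f16:  $(( f16_bytes / 1024 / 1024 / 1024 )) GiB"        # 24 GiB at 128K context
echo "q8_0: $(( f16_bytes / 2 / 1024 / 1024 / 1024 )) GiB"    # roughly half (~1 byte/element)
```

At 128K context the f16 cache alone rivals the model weights in size, which is why q8_0 is the serving default here.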
3.4 Flash Attention (-fa 1)
Always enable on ROCm (+24% pp improvement). On Vulkan, FA uses CoopMat1 (modest improvement). Already set in benchmark scripts.
3.5 ROCBLAS_USE_HIPBLASLT=1 (ROCm only)
Without this, ROCm pp on gfx1151 is 2-7x slower. Already set in benchmark scripts.
3.6 Backend Selection
Neither ROCm nor Vulkan is universally faster:
| Workload | Best Backend | Why |
|---|---|---|
| Short context tg | Vulkan RADV | Lower per-token overhead |
| Long context (8K-130K) | ROCm + rocWMMA | True HW flash attention |
| General stability | Vulkan RADV | More mature on gfx1151 |
Never use AMDVLK — RADV scales 3.6x better at extreme context depths.
Phase 4: Build Optimizations (requires rebuilding containers)
These require rebuilding the llama.cpp toolbox containers with specific flags.
4.1 ROCm Build Flags
cmake -B build \
  -DGGML_HIP=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGGML_HIP_UMA=ON \
  -DAMDGPU_TARGETS=gfx1151
# GGML_HIP_ROCWMMA_FATTN: GPU-accelerated flash attention via WMMA
# GGML_HIP_UMA: unified-memory-aware allocation
GGML_HIP_ROCWMMA_FATTN is the only path to true GPU-accelerated flash attention on AMD (96% speedup at 65K context). The Vulkan CoopMat1 path is a software fallback.
4.2 rocWMMA Tuned Patch (PR #16827)
Fixes a long-context regression in rocWMMA. Implements adaptive KQ stride, better launch bounds, and selective WMMA (prefill only; decode reverts to VEC/TILE). Check if Donato Capitella's ROCm toolboxes include this.
4.3 Vulkan Cooperative Matrices
RADV supports VK_KHR_cooperative_matrix for RDNA 3+. Building llama.cpp with cooperative matrix support could enable WMMA-like speedups without ROCm dependency.
Phase 5: Future / Currently Blocked
These optimizations are not available today but are worth tracking.
5.1 Speculative Decoding (blocked: llama.cpp PR #20075)
Expected 1.8-2.5x tg speedup for coding tasks. Draft model (Qwen3.5-0.8B-Q8_0.gguf, 812 MB) already downloaded. Blocked because Qwen3.5 MoE uses hybrid GatedDeltaNet architecture that breaks llama.cpp's speculative rollback mechanism.
Track: llama.cpp PR #20075 — fix for hybrid SSM/MoE speculative decoding.
5.2 Native Multi-Token Prediction (blocked: llama.cpp PR #20700)
Qwen3.5 was trained with built-in MTP heads. No separate draft model needed. Works in vLLM/SGLang today but not llama.cpp.
Track: llama.cpp PR #20700 — MTP for Qwen3.5 with FastMTP vocabulary trimming.
5.3 GPU Clock Fix (blocked: ROCm #5750)
GPU clocks on gfx1151 may be stuck at ~885 MHz instead of 2900 MHz. power_dpm_force_performance_level and OverDrive are non-functional. If fixed, this could unlock significant additional throughput.
Track: ROCm issue #5750 — Strix Halo stuck in low power clocks.
5.4 SageAttention
2-5x speedup over FlashAttention via quantized attention computation. No AMD port exists yet.
5.5 AMD XDNA NPU (50 TOPS)
Not viable for LLM inference today. Linux support coming in kernel 7.1. Future potential: running a draft model on the NPU for speculative decoding while the GPU runs the main model.
5.6 TurboQuant 3-bit KV Cache (ICLR 2026)
4.9x KV cache compression with minimal quality loss. Being integrated into llama.cpp.
5.7 LLMLingua-2 Prompt Compression
20x prompt compression for agentic/RAG workloads. Reduces pp time by compressing input before inference. Applicable to the agentic eval pipeline.
Hardware Limits (cannot be changed)
Understanding what is fixed helps avoid wasted effort.
| Resource | Value | Notes |
|---|---|---|
| Memory bandwidth | ~215 GB/s (measured) | 84% of 256 GB/s theoretical. Hard ceiling for tg speed. |
| LPDDR5X-8000 | 8000 MT/s, 256-bit | Soldered, no XMP/EXPO, no overclocking |
| Infinity Fabric | 2 GHz FCLK | Fixed, not tunable on Strix Halo |
| Infinity Cache | 32 MB | ~1 TB/s hit bandwidth. Per-layer weights exceed it. |
| GPU clocks | Up to 2900 MHz | Currently broken in driver (see 5.3) |
| Max power | 120W APU | HP ZBook charger is 140W total system |
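The bandwidth row is the one that bounds tg directly: every generated token streams the active weights through memory once, so the ceiling is roughly bandwidth divided by active-weight bytes. A sketch with illustrative (hypothetical) model sizes:

```shell
bw_gbs=215                               # measured memory bandwidth, GB/s
for gb in 8 16 32 55; do                 # hypothetical active-weight sizes in GB
  echo "~${gb} GB active: tg ceiling ~$(( bw_gbs / gb )) tok/s"
done
```

For MoE models only the active experts count toward this, which is why they can run far faster than dense models of the same total size.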
Rollback
sudo make rollback # Restores GRUB backup and previous tuned profile
BIOS VRAM must be reverted manually (F10 at boot, restore previous UMA Frame Buffer Size).
Phase 2 changes can be reverted individually:
- RyzenAdj: `sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=20000` (HP defaults per the table in 2.1)
- Sysctl: `sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system`
- THP: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`
- nogttspill: `sudo rm /etc/environment.d/radv.conf`
Troubleshooting
If anything goes wrong, see troubleshooting.md.
Further Reading
- Hardware analysis — Deep dive into llama.cpp flags, backends, quantization
- Inference landscape — Broader survey of engines, techniques, and future directions
- Benchmarking guide — Methodology and result interpretation
- References — All external links