fix(benchmark): parse llama-bench output with variable column count

KV cache quantization adds type_k/type_v columns to llama-bench output,
shifting test and t/s to different indices. Parse from end of row instead
of hardcoded positions. Also fix KV suffix separator (underscore to dash)
to avoid regex ambiguity with type names like q8_0.

Add 5-phase optimization guide, optimization log for tracking results,
and research docs on llama.cpp and inference landscape optimizations.
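The end-anchored parse can be sketched in shell (the sample rows and the `parse_row` helper are illustrative, not the repo's actual script):

```shell
# Hypothetical llama-bench markdown rows, without and with KV-type columns.
row_plain="| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | pp512 | 1022.4 |"
row_kv="| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | q8_0 | q8_0 | pp512 | 618.2 |"

# Parse from the end of the row: t/s is always the last cell and test the
# second-to-last, so extra type_k/type_v columns cannot shift the indices.
parse_row() {
  echo "$1" | awk -F'|' '{
    # The trailing "|" yields an empty final field, so t/s is $(NF-1).
    gsub(/^ +| +$/, "", $(NF-1)); gsub(/^ +| +$/, "", $(NF-2))
    print $(NF-2), $(NF-1)
  }'
}

parse_row "$row_plain"   # -> pp512 1022.4
parse_row "$row_kv"      # -> pp512 618.2
```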
Felipe Cardoso
2026-03-27 14:54:19 +01:00
parent 7531f6fa74
commit f92b710492
7 changed files with 2148 additions and 52 deletions


@@ -1,20 +1,32 @@
# Optimization Guide
Complete walkthrough for optimizing AMD Strix Halo for LLM inference workloads. Organized in phases from essential to experimental. Each phase builds on the previous.
**Prerequisites**: Run `make audit` first to see your current state. Run `make benchmark-baseline` to capture pre-optimization performance numbers.
## Step 1: Tuned Profile (no reboot)
Track results in [optimization-log.md](optimization-log.md) as you apply each change.
---
## Phase 1: Core System (automated scripts)
These are the foundational optimizations handled by this repo's scripts. Apply in order.
### 1.1 Tuned Profile (no reboot)
```bash
sudo make optimize-tuned
```
Switches from `throughput-performance` to `accelerator-performance`, which disables deep CPU idle states (higher-latency C-states) and sets the CPU governor to performance.
Takes effect immediately. Previous profile is saved for rollback.
| Expected Impact | pp512 | tg128 |
|----------------|-------|-------|
| Tuned profile | +5-8% | +2-3% |
### 1.2 Kernel Boot Parameters (reboot required)
```bash
sudo make optimize-kernel
@@ -24,61 +36,315 @@ Adds three parameters to GRUB:
| Parameter | Value (64 GB) | Purpose |
|-----------|--------------|---------|
| `iommu=pt` | -- | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB (~59 GiB) |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |
Values are computed dynamically based on your system's total physical RAM. The script backs up `/etc/default/grub` before modifying it. See [architecture.md](architecture.md) for the math.
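For a 64 GB machine the two values relate as follows (the 5 GiB OS reserve below is an assumption for illustration; architecture.md has the script's actual formula):

```shell
total_mib=65536                          # 64 GiB of physical RAM
reserve_mib=5120                         # assumed headroom left to the OS
gtt_mib=$(( total_mib - reserve_mib ))   # amdgpu.gttsize -> 60416
pages_limit=$(( gtt_mib * 1024 / 4 ))    # same span in 4 KiB pages -> 15466496
echo "amdgpu.gttsize=$gtt_mib ttm.pages_limit=$pages_limit"
```

Note that `ttm.pages_limit` is just `amdgpu.gttsize` re-expressed in 4 KiB pages; the two must stay in sync.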
### 1.3 BIOS VRAM Reduction (reboot + BIOS access)
```bash
make optimize-vram # Prints guidance — cannot modify BIOS directly
```
Reduce dedicated VRAM (UMA Frame Buffer Size) from 32 GB to 512 MB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.
See [bios-vram-guide.md](bios-vram-guide.md) for the full BIOS walkthrough (HP ZBook: F10 at boot).
**Combine 1.2 and 1.3 into a single reboot**: apply kernel params, then reboot into BIOS to change VRAM, then boot normally.
### 1.4 Verify
```bash
make verify # 9-point checklist, target: 9/9
make audit # Single-screen system status with scores
```
### Phase 1 Expected Impact (combined)
| Optimization | pp512 | tg128 |
|-------------|-------|-------|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | +10-20% | +5-15% |
| **Phase 1 combined** | **+15-25%** | **+8-18%** |
Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion.
---
## Phase 2: System Tuning (manual, no reboot unless noted)
These require root but are safe to apply and revert.
### 2.1 Power Budget Increase via RyzenAdj
The HP ZBook Ultra G1a ships with a conservative 60W power limit. The Strix Halo chip supports 120W. Community testing shows **85W is the sweet spot**: +12-19% over 60W, with manageable thermals.
```bash
# Install ryzenadj (Fedora)
sudo dnf install ryzenadj # or build from https://github.com/FlyGoat/RyzenAdj
# Apply 85W limits (milliwatts)
sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000
# Verify
sudo ryzenadj -i | grep -E 'STAPM|PPT'
```
| Setting | HP Default | Recommended | Max (risky) |
|---------|-----------|-------------|-------------|
| STAPM | 60W | **85W** | 120W |
| PPT Fast | 60W | **85W** | 120W |
| PPT Slow | 20W | **85W** | 120W |
**Notes**:
- Settings are volatile — reset on reboot/sleep. Create a systemd service for persistence.
- Going above 85W yields only +2-3% more (LLM inference is memory-bandwidth-bound at ~215 GB/s).
- Monitor thermals: `sensors` or `amdgpu_top`. Throttling starts around 100 °C junction temperature.
- HP firmware may periodically reset limits. Verify after wake from sleep.
- The 140W USB-C charger limits total system draw. At 100W+ APU, battery will drain even while plugged in.
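One way to persist the limits across reboot and resume is a oneshot unit; a sketch, assuming ryzenadj is installed at `/usr/bin/ryzenadj` (the unit name and paths are not from this repo):

```shell
sudo tee /etc/systemd/system/ryzenadj-85w.service << 'EOF'
[Unit]
Description=Apply 85W APU power limits at boot and after resume
After=multi-user.target suspend.target

[Service]
Type=oneshot
ExecStart=/usr/bin/ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000

[Install]
WantedBy=multi-user.target suspend.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now ryzenadj-85w.service
```

Because HP firmware can also reset the limits on its own, still re-check `sudo ryzenadj -i` after the first wake from sleep.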
### 2.2 VM / Sysctl Tuning
```bash
# Apply immediately
sudo sysctl -w vm.swappiness=1
sudo sysctl -w vm.dirty_ratio=40
sudo sysctl -w vm.dirty_background_ratio=10
sudo sysctl -w vm.max_map_count=500000
sudo sysctl -w vm.zone_reclaim_mode=0
# Persist across reboots
sudo tee /etc/sysctl.d/99-llm-inference.conf << 'EOF'
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
EOF
```
| Parameter | Default | Recommended | Why |
|-----------|---------|-------------|-----|
| `vm.swappiness` | 60 | **1** | Prevent model weights from being swapped out |
| `vm.dirty_ratio` | 20 | **40** | Reduce I/O flush storms during inference |
| `vm.dirty_background_ratio` | 10 | **10** | Keep background writeback at default |
| `vm.max_map_count` | 65530 | **500000** | Large models need many memory mappings |
| `vm.zone_reclaim_mode` | 0 | **0** | Don't aggressively reclaim memory zones |
### 2.3 Transparent Huge Pages
THP reduces TLB misses for mmap'd model files (~55 GB model = 14M page table entries at 4KB vs 28K at 2MB).
```bash
# Apply immediately
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# Persist via kernel cmdline (add to GRUB):
# transparent_hugepage=always
# Verify THP is being used
grep -i huge /proc/meminfo
grep thp /proc/vmstat
```
**Trade-off**: `always` may cause rare latency spikes during memory compaction. `madvise` avoids those spikes, but llama.cpp never calls `madvise(MADV_HUGEPAGE)`, so under `madvise` it gets no huge pages at all; `always` is required to see the benefit.
### 2.4 RADV_PERFTEST=nogttspill (Vulkan backend)
Prevents unnecessary GTT spill management on unified memory. Fixes prompt processing degradation with the Vulkan RADV backend.
```bash
# Per-session
export RADV_PERFTEST=nogttspill
# Persist system-wide
echo 'RADV_PERFTEST=nogttspill' | sudo tee /etc/environment.d/radv.conf
```
Only affects the Vulkan RADV backend. No effect on ROCm.
### 2.5 Additional Kernel Parameters (reboot required)
These can be added to the GRUB cmdline alongside Phase 1 params:
| Parameter | Value | Purpose | Priority |
|-----------|-------|---------|----------|
| `amdgpu.noretry=0` | 0 | Enable GPU page fault retry, improves stability | Medium — add if seeing GPU crashes |
| `transparent_hugepage=always` | -- | Persist THP setting | Medium |
| `preempt=voluntary` | -- | Reduce context switch overhead | Low — only for batch inference |
| `processor.max_cstate=1` | 1 | Disable deep C-states | Low — tuned profile handles this |
**Do NOT add**: `amdgpu.ppfeaturemask=0xffffffff` — OverDrive is non-functional on gfx1151 (ROCm issue #5750).
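A sketch of the manual GRUB edit, rehearsed on a scratch copy first (the sed expression and Fedora's `grub2-mkconfig` output path are the assumptions here):

```shell
# Rehearse the edit on a scratch file before touching /etc/default/grub.
echo 'GRUB_CMDLINE_LINUX="quiet iommu=pt"' > /tmp/grub.test
sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"$/\1 transparent_hugepage=always amdgpu.noretry=0"/' /tmp/grub.test
cat /tmp/grub.test

# Then repeat against the real file and regenerate the config:
# sudo sed -i '<same expression>' /etc/default/grub
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```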
---
## Phase 3: Runtime Flags (per-invocation, no system changes)
These are llama-bench / llama-server flags that affect performance without changing the system.
### 3.1 Always Use `-mmp 0` (no mmap)
On unified memory, mmap'd loading adds a double-copy penalty. The `--no-mmap` flag (`-mmp 0` in llama-bench) loads weights directly instead. Already set in this repo's benchmark scripts.
### 3.2 Batch Size for MoE Models (`-b 256`)
Default batch size (2048) is too large for MoE on this hardware. Reducing to 256 can improve pp512 throughput by up to 70% on MoE models.
```bash
# In llama-bench
llama-bench -m model.gguf -b 256 -ngl 99 -fa 1
# In llama-server
llama-server -m model.gguf -b 256 -ngl 99 -fa 1
```
### 3.3 KV Cache Quantization
Q8_0 KV cache halves KV memory usage with negligible quality loss. Recommended as default for all serving.
```bash
# llama-server
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0
# Benchmark sweep
make benchmark ARGS="--tag kv-sweep --kv-types f16,q8_0,q4_0 --context 131072 --models MODEL.gguf --reps 3"
```
| KV Type | Memory Savings | Quality Impact | Recommendation |
|---------|---------------|----------------|----------------|
| f16 | Baseline | None | Default for benchmarks |
| **q8_0** | **~50%** | **Negligible** | **Default for serving** |
| q4_0 | ~75% | Noticeable on reasoning | Only for max context |
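The ~50% figure follows from bytes per element; a back-of-envelope check with assumed model dimensions (32 layers, 8 KV heads, head dim 128, 32K context; not any specific model):

```shell
layers=32 kv_heads=8 head_dim=128 ctx=32768
elts=$(( 2 * layers * ctx * kv_heads * head_dim ))  # K and V elements
f16_bytes=$(( elts * 2 ))                           # 2 bytes/element
q8_bytes=$(( elts * 34 / 32 ))                      # 1 byte/element + block scales
echo "f16:  $(( f16_bytes >> 30 )) GiB"             # -> 4 GiB
echo "q8_0: $(( q8_bytes >> 20 )) MiB"              # -> 2176 MiB, ~53% of f16
```

The q8_0 figure is slightly above half because each 32-element block carries a 2-byte fp16 scale.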
### 3.4 Flash Attention (`-fa 1`)
Always enable on ROCm (+24% pp improvement). On Vulkan, FA uses CoopMat1 (modest improvement). Already set in benchmark scripts.
### 3.5 ROCBLAS_USE_HIPBLASLT=1 (ROCm only)
Without this, ROCm pp on gfx1151 is 2-7x slower. Already set in benchmark scripts.
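For manual runs outside those scripts, export it in the same environment as the server process (the model path is a placeholder):

```shell
export ROCBLAS_USE_HIPBLASLT=1
llama-server -m model.gguf -ngl 99 -fa 1
```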
### 3.6 Backend Selection
Neither ROCm nor Vulkan is universally faster:
| Workload | Best Backend | Why |
|----------|-------------|-----|
| Short context tg | Vulkan RADV | Lower per-token overhead |
| Long context (8K-130K) | ROCm + rocWMMA | True HW flash attention |
| General stability | Vulkan RADV | More mature on gfx1151 |
Never use AMDVLK — RADV scales 3.6x better at extreme context depths.
---
## Phase 4: Build Optimizations (requires rebuilding containers)
These require rebuilding the llama.cpp toolbox containers with specific flags.
### 4.1 ROCm Build Flags
```bash
# GGML_HIP_ROCWMMA_FATTN: GPU-accelerated flash attention via WMMA
# GGML_HIP_UMA: unified-memory-aware allocation
cmake -B build \
    -DGGML_HIP=ON \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DGGML_HIP_UMA=ON \
    -DAMDGPU_TARGETS=gfx1151
```
`GGML_HIP_ROCWMMA_FATTN` is the only path to true GPU-accelerated flash attention on AMD (96% speedup at 65K context). The Vulkan CoopMat1 path is a software fallback.
### 4.2 rocWMMA Tuned Patch (PR #16827)
Fixes a long-context regression in rocWMMA. Implements adaptive KQ stride, better launch bounds, and selective WMMA (prefill only; decode reverts to VEC/TILE). Check if Donato Capitella's ROCm toolboxes include this.
### 4.3 Vulkan Cooperative Matrices
RADV supports `VK_KHR_cooperative_matrix` for RDNA 3+. Building llama.cpp with cooperative matrix support could enable WMMA-like speedups without ROCm dependency.
---
## Phase 5: Future / Currently Blocked
These optimizations are not available today but are worth tracking.
### 5.1 Speculative Decoding (blocked: llama.cpp PR #20075)
Expected 1.8-2.5x tg speedup for coding tasks. Draft model (`Qwen3.5-0.8B-Q8_0.gguf`, 812 MB) already downloaded. Blocked because Qwen3.5 MoE uses hybrid GatedDeltaNet architecture that breaks llama.cpp's speculative rollback mechanism.
**Track**: [llama.cpp PR #20075](https://github.com/ggml-org/llama.cpp/pull/20075) — fix for hybrid SSM/MoE speculative decoding.
### 5.2 Native Multi-Token Prediction (blocked: llama.cpp PR #20700)
Qwen3.5 was trained with built-in MTP heads. No separate draft model needed. Works in vLLM/SGLang today but not llama.cpp.
**Track**: [llama.cpp PR #20700](https://github.com/ggml-org/llama.cpp/pull/20700) — MTP for Qwen3.5 with FastMTP vocabulary trimming.
### 5.3 GPU Clock Fix (blocked: ROCm #5750)
GPU clocks on gfx1151 may be stuck at ~885 MHz instead of 2900 MHz. `power_dpm_force_performance_level` and OverDrive are non-functional. If fixed, this could unlock significant additional throughput.
**Track**: [ROCm issue #5750](https://github.com/ROCm/ROCm/issues/5750) — Strix Halo stuck in low power clocks.
### 5.4 SageAttention
2-5x speedup over FlashAttention via quantized attention computation. No AMD port exists yet.
### 5.5 AMD XDNA NPU (50 TOPS)
Not viable for LLM inference today. Linux support coming in kernel 7.1. Future potential: running a draft model on the NPU for speculative decoding while the GPU runs the main model.
### 5.6 TurboQuant 3-bit KV Cache (ICLR 2026)
4.9x KV cache compression with minimal quality loss. Being integrated into llama.cpp.
### 5.7 LLMLingua-2 Prompt Compression
20x prompt compression for agentic/RAG workloads. Reduces pp time by compressing input before inference. Applicable to the agentic eval pipeline.
---
## Hardware Limits (cannot be changed)
Understanding what is fixed helps avoid wasted effort.
| Resource | Value | Notes |
|----------|-------|-------|
| Memory bandwidth | **~215 GB/s** (measured) | 84% of 256 GB/s theoretical. Hard ceiling for tg speed. |
| LPDDR5X-8000 | **8000 MT/s, 256-bit** | Soldered, no XMP/EXPO, no overclocking |
| Infinity Fabric | **2 GHz FCLK** | Fixed, not tunable on Strix Halo |
| Infinity Cache | **32 MB** | ~1 TB/s hit bandwidth. Per-layer weights exceed it. |
| GPU clocks | **Up to 2900 MHz** | Currently broken in driver (see 5.3) |
| Max power | **120W APU** | HP ZBook charger is 140W total system |
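The bandwidth ceiling translates directly into a tg upper bound, since every generated token streams the full weight set once; a rough check with illustrative model sizes:

```shell
# 8000 MT/s x 256-bit bus / 8 bits = 256 GB/s theoretical; ~215 GB/s measured.
awk 'BEGIN {
  bw = 215                       # GB/s, measured
  split("4 18 55", size_gb)      # illustrative on-disk model sizes
  for (i = 1; i <= 3; i++)
    printf "%2d GB model: tg ceiling ~ %.0f t/s\n", size_gb[i], bw / size_gb[i]
}'
```

This is why tg gains from clock or power tuning flatten out: the ceiling is the memory bus, not compute.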
---
## Rollback
```bash
sudo make rollback # Restores GRUB backup and previous tuned profile
```
BIOS VRAM must be reverted manually (F10 at boot, restore previous UMA Frame Buffer Size).
Phase 2 changes can be reverted individually:
- RyzenAdj: `sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=60000`
- Sysctl: `sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system`
- THP: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`
- nogttspill: `sudo rm /etc/environment.d/radv.conf`
---
## Troubleshooting
If anything goes wrong, see [troubleshooting.md](troubleshooting.md).
## Further Reading
- [Hardware analysis](llama-cpp-optimization-research.md) — Deep dive into llama.cpp flags, backends, quantization
- [Inference landscape](inference-optimization-landscape.md) — Broader survey of engines, techniques, and future directions
- [Benchmarking guide](benchmarking.md) — Methodology and result interpretation
- [References](references.md) — All external links