fix(benchmark): parse llama-bench output with variable column count

KV cache quantization adds type_k/type_v columns to llama-bench output,
shifting test and t/s to different indices. Parse from end of row instead
of hardcoded positions. Also fix KV suffix separator (underscore to dash)
to avoid regex ambiguity with type names like q8_0.

Add 5-phase optimization guide, optimization log for tracking results,
and research docs on llama.cpp and inference landscape optimizations.
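The end-anchored parse can be sketched in shell (the sample rows and the `parse_row` helper are illustrative, not the repo's actual script):

```shell
# Hypothetical llama-bench markdown rows, without and with KV-type columns.
row_plain="| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | pp512 | 1022.4 |"
row_kv="| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | q8_0 | q8_0 | pp512 | 618.2 |"

# Parse from the end of the row: t/s is always the last cell and test the
# second-to-last, so extra type_k/type_v columns cannot shift the indices.
parse_row() {
  echo "$1" | awk -F'|' '{
    # The trailing "|" yields an empty final field, so t/s is $(NF-1).
    gsub(/^ +| +$/, "", $(NF-1)); gsub(/^ +| +$/, "", $(NF-2))
    print $(NF-2), $(NF-1)
  }'
}

parse_row "$row_plain"   # -> pp512 1022.4
parse_row "$row_kv"      # -> pp512 618.2
```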
Felipe Cardoso
2026-03-27 14:54:19 +01:00
parent 7531f6fa74
commit f92b710492
7 changed files with 2148 additions and 52 deletions


@@ -1,20 +1,32 @@
# Optimization Guide
Complete walkthrough for optimizing AMD Strix Halo for LLM inference workloads. Organized in phases from essential to experimental. Each phase builds on the previous.
**Prerequisites**: Run `make audit` first to see your current state. Run `make benchmark-baseline` to capture pre-optimization performance numbers.
## Step 1: Tuned Profile (no reboot)
Track results in [optimization-log.md](optimization-log.md) as you apply each change.
---
## Phase 1: Core System (automated scripts)
These are the foundational optimizations handled by this repo's scripts. Apply in order.
### 1.1 Tuned Profile (no reboot)
```bash
sudo make optimize-tuned
```
Switches from `throughput-performance` to `accelerator-performance`, which disables deep CPU idle states (higher-latency C-states) and sets the CPU governor to performance.
Takes effect immediately. Previous profile is saved for rollback.
| Expected Impact | pp512 | tg128 |
|----------------|-------|-------|
| Tuned profile | +5-8% | +2-3% |
### 1.2 Kernel Boot Parameters (reboot required)
```bash
sudo make optimize-kernel
@@ -24,61 +36,315 @@ Adds three parameters to GRUB:
| Parameter | Value (64 GB) | Purpose |
|-----------|--------------|---------|
| `iommu=pt` | -- | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB (~59 GiB) |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |
Values are computed dynamically based on your system's total physical RAM. The script backs up `/etc/default/grub` before modifying it. See [architecture.md](architecture.md) for the math.
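For a 64 GB machine the two values relate as follows (the 5 GiB OS reserve below is an assumption for illustration; architecture.md has the script's actual formula):

```shell
total_mib=65536                          # 64 GiB of physical RAM
reserve_mib=5120                         # assumed headroom left to the OS
gtt_mib=$(( total_mib - reserve_mib ))   # amdgpu.gttsize -> 60416
pages_limit=$(( gtt_mib * 1024 / 4 ))    # same span in 4 KiB pages -> 15466496
echo "amdgpu.gttsize=$gtt_mib ttm.pages_limit=$pages_limit"
```

Note that `ttm.pages_limit` is just `amdgpu.gttsize` re-expressed in 4 KiB pages; the two must stay in sync.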
### 1.3 BIOS VRAM Reduction (reboot + BIOS access)
```bash
make optimize-vram # Prints guidance — cannot modify BIOS directly
```
Reduce dedicated VRAM (UMA Frame Buffer Size) from 32 GB to 512 MB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.
See [bios-vram-guide.md](bios-vram-guide.md) for the full BIOS walkthrough (HP ZBook: F10 at boot).
**Combine 1.2 and 1.3 into a single reboot**: apply kernel params, then reboot into BIOS to change VRAM, then boot normally.
### 1.4 Verify
```bash
make verify # 9-point checklist, target: 9/9
make audit # Single-screen system status with scores
```
### Phase 1 Expected Impact (combined)
| Optimization | pp512 | tg128 |
|-------------|-------|-------|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | +10-20% | +5-15% |
| **Phase 1 combined** | **+15-25%** | **+8-18%** |
Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion.
---
## Phase 2: System Tuning (manual, no reboot unless noted)
These require root but are safe to apply and revert.
### 2.1 Power Budget Increase via RyzenAdj
The HP ZBook Ultra G1a ships with a conservative 60W power limit. The Strix Halo chip supports 120W. Community testing shows **85W is the sweet spot**: +12-19% over 60W, with manageable thermals.
```bash
# Install ryzenadj (Fedora)
sudo dnf install ryzenadj # or build from https://github.com/FlyGoat/RyzenAdj
# Apply 85W limits (milliwatts)
sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000
# Verify
sudo ryzenadj -i | grep -E 'STAPM|PPT'
```
| Setting | HP Default | Recommended | Max (risky) |
|---------|-----------|-------------|-------------|
| STAPM | 60W | **85W** | 120W |
| PPT Fast | 60W | **85W** | 120W |
| PPT Slow | 20W | **85W** | 120W |
**Notes**:
- Settings are volatile — reset on reboot/sleep. Create a systemd service for persistence.
- Going above 85W yields only +2-3% more (LLM inference is memory-bandwidth-bound at ~215 GB/s).
- Monitor thermals: `sensors` or `amdgpu_top`. Throttling starts around 100 °C junction temperature.
- HP firmware may periodically reset limits. Verify after wake from sleep.
- The 140W USB-C charger limits total system draw. At 100W+ APU, battery will drain even while plugged in.
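One way to persist the limits across reboot and resume is a oneshot unit; a sketch, assuming ryzenadj is installed at `/usr/bin/ryzenadj` (the unit name and paths are not from this repo):

```shell
sudo tee /etc/systemd/system/ryzenadj-85w.service << 'EOF'
[Unit]
Description=Apply 85W APU power limits at boot and after resume
After=multi-user.target suspend.target

[Service]
Type=oneshot
ExecStart=/usr/bin/ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000

[Install]
WantedBy=multi-user.target suspend.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now ryzenadj-85w.service
```

Because HP firmware can also reset the limits on its own, still re-check `sudo ryzenadj -i` after the first wake from sleep.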
### 2.2 VM / Sysctl Tuning
```bash
# Apply immediately
sudo sysctl -w vm.swappiness=1
sudo sysctl -w vm.dirty_ratio=40
sudo sysctl -w vm.dirty_background_ratio=10
sudo sysctl -w vm.max_map_count=500000
sudo sysctl -w vm.zone_reclaim_mode=0
# Persist across reboots
sudo tee /etc/sysctl.d/99-llm-inference.conf << 'EOF'
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
EOF
```
| Parameter | Default | Recommended | Why |
|-----------|---------|-------------|-----|
| `vm.swappiness` | 60 | **1** | Prevent model weights from being swapped out |
| `vm.dirty_ratio` | 20 | **40** | Reduce I/O flush storms during inference |
| `vm.dirty_background_ratio` | 10 | **10** | Keep background writeback at default |
| `vm.max_map_count` | 65530 | **500000** | Large models need many memory mappings |
| `vm.zone_reclaim_mode` | 0 | **0** | Don't aggressively reclaim memory zones |
### 2.3 Transparent Huge Pages
THP reduces TLB misses for mmap'd model files (~55 GB model = 14M page table entries at 4KB vs 28K at 2MB).
```bash
# Apply immediately
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# Persist via kernel cmdline (add to GRUB):
# transparent_hugepage=always
# Verify THP is being used
grep -i huge /proc/meminfo
grep thp /proc/vmstat
```
**Trade-off**: `always` may cause rare latency spikes during memory compaction. `madvise` avoids those spikes, but llama.cpp never calls `madvise(MADV_HUGEPAGE)`, so under `madvise` it gets no huge pages at all; `always` is required to see the benefit.
### 2.4 RADV_PERFTEST=nogttspill (Vulkan backend)
Prevents unnecessary GTT spill management on unified memory. Fixes prompt processing degradation with the Vulkan RADV backend.
```bash
# Per-session
export RADV_PERFTEST=nogttspill
# Persist system-wide
echo 'RADV_PERFTEST=nogttspill' | sudo tee /etc/environment.d/radv.conf
```
Only affects the Vulkan RADV backend. No effect on ROCm.
### 2.5 Additional Kernel Parameters (reboot required)
These can be added to the GRUB cmdline alongside Phase 1 params:
| Parameter | Value | Purpose | Priority |
|-----------|-------|---------|----------|
| `amdgpu.noretry=0` | 0 | Enable GPU page fault retry, improves stability | Medium — add if seeing GPU crashes |
| `transparent_hugepage=always` | -- | Persist THP setting | Medium |
| `preempt=voluntary` | -- | Reduce context switch overhead | Low — only for batch inference |
| `processor.max_cstate=1` | 1 | Disable deep C-states | Low — tuned profile handles this |
**Do NOT add**: `amdgpu.ppfeaturemask=0xffffffff` — OverDrive is non-functional on gfx1151 (ROCm issue #5750).
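A sketch of the manual GRUB edit, rehearsed on a scratch copy first (the sed expression and Fedora's `grub2-mkconfig` output path are the assumptions here):

```shell
# Rehearse the edit on a scratch file before touching /etc/default/grub.
echo 'GRUB_CMDLINE_LINUX="quiet iommu=pt"' > /tmp/grub.test
sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"$/\1 transparent_hugepage=always amdgpu.noretry=0"/' /tmp/grub.test
cat /tmp/grub.test

# Then repeat against the real file and regenerate the config:
# sudo sed -i '<same expression>' /etc/default/grub
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```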
---
## Phase 3: Runtime Flags (per-invocation, no system changes)
These are llama-bench / llama-server flags that affect performance without changing the system.
### 3.1 Always Use `-mmp 0` (no mmap)
On unified memory, mmap'd loading adds a double-copy penalty. The `--no-mmap` flag (`-mmp 0` in llama-bench) loads weights directly instead. Already set in this repo's benchmark scripts.
### 3.2 Batch Size for MoE Models (`-b 256`)
Default batch size (2048) is too large for MoE on this hardware. Reducing to 256 can improve pp512 throughput by up to 70% on MoE models.
```bash
# In llama-bench
llama-bench -m model.gguf -b 256 -ngl 99 -fa 1
# In llama-server
llama-server -m model.gguf -b 256 -ngl 99 -fa 1
```
### 3.3 KV Cache Quantization
Q8_0 KV cache halves KV memory usage with negligible quality loss. Recommended as default for all serving.
```bash
# llama-server
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0
# Benchmark sweep
make benchmark ARGS="--tag kv-sweep --kv-types f16,q8_0,q4_0 --context 131072 --models MODEL.gguf --reps 3"
```
| KV Type | Memory Savings | Quality Impact | Recommendation |
|---------|---------------|----------------|----------------|
| f16 | Baseline | None | Default for benchmarks |
| **q8_0** | **~50%** | **Negligible** | **Default for serving** |
| q4_0 | ~75% | Noticeable on reasoning | Only for max context |
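The ~50% figure follows from bytes per element; a back-of-envelope check with assumed model dimensions (32 layers, 8 KV heads, head dim 128, 32K context; not any specific model):

```shell
layers=32 kv_heads=8 head_dim=128 ctx=32768
elts=$(( 2 * layers * ctx * kv_heads * head_dim ))  # K and V elements
f16_bytes=$(( elts * 2 ))                           # 2 bytes/element
q8_bytes=$(( elts * 34 / 32 ))                      # 1 byte/element + block scales
echo "f16:  $(( f16_bytes >> 30 )) GiB"             # -> 4 GiB
echo "q8_0: $(( q8_bytes >> 20 )) MiB"              # -> 2176 MiB, ~53% of f16
```

The q8_0 figure is slightly above half because each 32-element block carries a 2-byte fp16 scale.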
### 3.4 Flash Attention (`-fa 1`)
Always enable on ROCm (+24% pp improvement). On Vulkan, FA uses CoopMat1 (modest improvement). Already set in benchmark scripts.
### 3.5 ROCBLAS_USE_HIPBLASLT=1 (ROCm only)
Without this, ROCm pp on gfx1151 is 2-7x slower. Already set in benchmark scripts.
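For manual runs outside those scripts, export it in the same environment as the server process (the model path is a placeholder):

```shell
export ROCBLAS_USE_HIPBLASLT=1
llama-server -m model.gguf -ngl 99 -fa 1
```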
### 3.6 Backend Selection
Neither ROCm nor Vulkan is universally faster:
| Workload | Best Backend | Why |
|----------|-------------|-----|
| Short context tg | Vulkan RADV | Lower per-token overhead |
| Long context (8K-130K) | ROCm + rocWMMA | True HW flash attention |
| General stability | Vulkan RADV | More mature on gfx1151 |
Never use AMDVLK — RADV scales 3.6x better at extreme context depths.
---
## Phase 4: Build Optimizations (requires rebuilding containers)
These require rebuilding the llama.cpp toolbox containers with specific flags.
### 4.1 ROCm Build Flags
```bash
# GGML_HIP_ROCWMMA_FATTN: GPU-accelerated flash attention via WMMA
# GGML_HIP_UMA: unified-memory-aware allocation
cmake -B build \
    -DGGML_HIP=ON \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DGGML_HIP_UMA=ON \
    -DAMDGPU_TARGETS=gfx1151
```
`GGML_HIP_ROCWMMA_FATTN` is the only path to true GPU-accelerated flash attention on AMD (96% speedup at 65K context). The Vulkan CoopMat1 path is a software fallback.
### 4.2 rocWMMA Tuned Patch (PR #16827)
Fixes a long-context regression in rocWMMA. Implements adaptive KQ stride, better launch bounds, and selective WMMA (prefill only; decode reverts to VEC/TILE). Check if Donato Capitella's ROCm toolboxes include this.
### 4.3 Vulkan Cooperative Matrices
RADV supports `VK_KHR_cooperative_matrix` for RDNA 3+. Building llama.cpp with cooperative matrix support could enable WMMA-like speedups without ROCm dependency.
---
## Phase 5: Future / Currently Blocked
These optimizations are not available today but are worth tracking.
### 5.1 Speculative Decoding (blocked: llama.cpp PR #20075)
Expected 1.8-2.5x tg speedup for coding tasks. Draft model (`Qwen3.5-0.8B-Q8_0.gguf`, 812 MB) already downloaded. Blocked because Qwen3.5 MoE uses hybrid GatedDeltaNet architecture that breaks llama.cpp's speculative rollback mechanism.
**Track**: [llama.cpp PR #20075](https://github.com/ggml-org/llama.cpp/pull/20075) — fix for hybrid SSM/MoE speculative decoding.
### 5.2 Native Multi-Token Prediction (blocked: llama.cpp PR #20700)
Qwen3.5 was trained with built-in MTP heads. No separate draft model needed. Works in vLLM/SGLang today but not llama.cpp.
**Track**: [llama.cpp PR #20700](https://github.com/ggml-org/llama.cpp/pull/20700) — MTP for Qwen3.5 with FastMTP vocabulary trimming.
### 5.3 GPU Clock Fix (blocked: ROCm #5750)
GPU clocks on gfx1151 may be stuck at ~885 MHz instead of 2900 MHz. `power_dpm_force_performance_level` and OverDrive are non-functional. If fixed, this could unlock significant additional throughput.
**Track**: [ROCm issue #5750](https://github.com/ROCm/ROCm/issues/5750) — Strix Halo stuck in low power clocks.
### 5.4 SageAttention
2-5x speedup over FlashAttention via quantized attention computation. No AMD port exists yet.
### 5.5 AMD XDNA NPU (50 TOPS)
Not viable for LLM inference today. Linux support coming in kernel 7.1. Future potential: running a draft model on the NPU for speculative decoding while the GPU runs the main model.
### 5.6 TurboQuant 3-bit KV Cache (ICLR 2026)
4.9x KV cache compression with minimal quality loss. Being integrated into llama.cpp.
### 5.7 LLMLingua-2 Prompt Compression
20x prompt compression for agentic/RAG workloads. Reduces pp time by compressing input before inference. Applicable to the agentic eval pipeline.
---
## Hardware Limits (cannot be changed)
Understanding what is fixed helps avoid wasted effort.
| Resource | Value | Notes |
|----------|-------|-------|
| Memory bandwidth | **~215 GB/s** (measured) | 84% of 256 GB/s theoretical. Hard ceiling for tg speed. |
| LPDDR5X-8000 | **8000 MT/s, 256-bit** | Soldered, no XMP/EXPO, no overclocking |
| Infinity Fabric | **2 GHz FCLK** | Fixed, not tunable on Strix Halo |
| Infinity Cache | **32 MB** | ~1 TB/s hit bandwidth. Per-layer weights exceed it. |
| GPU clocks | **Up to 2900 MHz** | Currently broken in driver (see 5.3) |
| Max power | **120W APU** | HP ZBook charger is 140W total system |
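The bandwidth ceiling translates directly into a tg upper bound, since every generated token streams the full weight set once; a rough check with illustrative model sizes:

```shell
# 8000 MT/s x 256-bit bus / 8 bits = 256 GB/s theoretical; ~215 GB/s measured.
awk 'BEGIN {
  bw = 215                       # GB/s, measured
  split("4 18 55", size_gb)      # illustrative on-disk model sizes
  for (i = 1; i <= 3; i++)
    printf "%2d GB model: tg ceiling ~ %.0f t/s\n", size_gb[i], bw / size_gb[i]
}'
```

This is why tg gains from clock or power tuning flatten out: the ceiling is the memory bus, not compute.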
---
## Rollback
```bash
sudo make rollback # Restores GRUB backup and previous tuned profile
```
BIOS VRAM must be reverted manually (F10 at boot, restore previous UMA Frame Buffer Size).
Phase 2 changes can be reverted individually:
- RyzenAdj: `sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=60000`
- Sysctl: `sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system`
- THP: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`
- nogttspill: `sudo rm /etc/environment.d/radv.conf`
---
## Troubleshooting
If anything goes wrong, see [troubleshooting.md](troubleshooting.md).
## Further Reading
- [Hardware analysis](llama-cpp-optimization-research.md) — Deep dive into llama.cpp flags, backends, quantization
- [Inference landscape](inference-optimization-landscape.md) — Broader survey of engines, techniques, and future directions
- [Benchmarking guide](benchmarking.md) — Methodology and result interpretation
- [References](references.md) — All external links