fix(benchmark): parse llama-bench output with variable column count
KV cache quantization adds type_k/type_v columns to llama-bench output, shifting test and t/s to different indices. Parse from the end of the row instead of from hardcoded positions. Also fix the KV suffix separator (underscore to dash) to avoid regex ambiguity with type names like q8_0.

Also add a 5-phase optimization guide, an optimization log for tracking results, and research docs on llama.cpp and inference-landscape optimizations.
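The positional fix can be sketched as an awk helper that indexes fields from the end of a llama-bench markdown row; with leading and trailing `|` separators, t/s is always the second-to-last field and the test name the one before it, regardless of how many type_k/type_v columns appear. The helper name and sample rows are illustrative, not the repository's actual benchmark code:

```shell
# Extract "test" and "t/s" from a llama-bench markdown table row.
# Fixed column indices break when type_k/type_v columns are present,
# so index from the end of the row instead.
parse_row() {
  printf '%s\n' "$1" | awk -F'|' '{
    # With leading and trailing "|", $1 and $NF are empty,
    # so t/s is $(NF-1) and the test name is $(NF-2).
    test_col = $(NF-2); tps = $(NF-1)
    gsub(/^ +| +$/, "", test_col); gsub(/^ +| +$/, "", tps)
    print test_col, tps
  }'
}

# Row with the extra type_k/type_v columns present:
parse_row "| llama 7B Q4_K_M | 4.07 GiB | 6.74 B | ROCm | 99 | q8_0 | q8_0 | pp512 | 651.93 |"
# -> pp512 651.93
```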
806 docs/llama-cpp-optimization-research.md (new file)
@@ -0,0 +1,806 @@
# llama.cpp Runtime and Compilation Optimization for AMD RDNA 3.5 (gfx1151)

Comprehensive research into maximizing inference performance on AMD Strix Halo
(Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified LPDDR5x-8000).
Researched March 2026.

---

## Scope

This document covers every known compilation flag, runtime parameter, environment
variable, and architectural optimization for llama.cpp targeting gfx1151 (RDNA 3.5)
with both ROCm/HIP and Vulkan backends on Fedora. It does not cover vLLM, ollama
internals, or non-llama.cpp inference engines except where their findings inform
llama.cpp optimization.

---

## Table of Contents

1. [Compilation Flags and Build Optimizations](#1-compilation-flags-and-build-optimizations)
2. [Runtime Flags and Environment Variables](#2-runtime-flags-and-environment-variables)
3. [Flash Attention and Attention Backends](#3-flash-attention-and-attention-backends)
4. [Quantization Strategies for Speed](#4-quantization-strategies-for-speed)
5. [Memory Layout and Caching](#5-memory-layout-and-caching)
6. [llama-server Specific Optimizations](#6-llama-server-specific-optimizations)
7. [Upcoming llama.cpp Features (2026)](#7-upcoming-llamacpp-features-2026)
8. [Recommended Configurations](#8-recommended-configurations)
9. [Sources](#9-sources)

---
## 1. Compilation Flags and Build Optimizations

### 1.1 GGML_HIP (ROCm) vs GGML_VULKAN: Which Backend to Build

Both backends are worth building. Neither is universally faster on gfx1151:

| Workload | Winner | Rationale |
|----------|--------|-----------|
| Token generation (short ctx) | Vulkan RADV | Lower driver overhead, mature kernel paths |
| Token generation (long ctx, 8K+) | ROCm + rocWMMA + FA | Maintains speed as context grows; uses less memory |
| Prompt processing (short ctx) | Mixed -- model-dependent | AMDVLK or ROCm hipBLASLt win on some shapes |
| Prompt processing (long ctx) | ROCm + rocWMMA-tuned | 96% speedup over untuned rocWMMA at 65K ctx |
| Memory efficiency at long ctx | ROCm + FA | Less memory than Vulkan equivalent |

Benchmark data (Qwen3-30B-A3B UD-Q4_K_XL, gfx1151, flash attention on):

| Backend | pp512 t/s | tg128 t/s | pp512@130K t/s | tg128@130K t/s |
|---------|-----------|-----------|----------------|----------------|
| Vulkan RADV | 755.14 | 85.11 | 17.24 | 12.54 |
| Vulkan AMDVLK | 741.60 | 81.79 | 10.75 | 3.51 |
| ROCm hipBLASLt | 651.93 | 63.95 | 40.35 | 4.97 |
| ROCm rocWMMA-tuned | 659.07 | 67.66 | 51.12 | 13.33 |

Key insight: RADV scales significantly better than AMDVLK for long contexts (3.6x
faster tg at 130K depth). ROCm with tuned rocWMMA provides the best long-context
prompt processing (3x faster than RADV at 130K).
### 1.2 ROCm Build Flags

**Minimal build:**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

**Optimized build (recommended):**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

#### Critical ROCm build flags:

| Flag | Effect | Recommendation |
|------|--------|----------------|
| `-DGGML_HIP=ON` | Enable HIP/ROCm backend | Required |
| `-DAMDGPU_TARGETS="gfx1151"` | Target Strix Halo GPU | Required -- do not use gfx1100 |
| `-DGGML_HIP_ROCWMMA_FATTN=ON` | Enable rocWMMA flash attention | Strongly recommended for pp |
| `-DGGML_HIP_GRAPHS=ON` | HIP graph kernel scheduling | Test -- may help reduce launch overhead |
| `-DGGML_HIP_NO_VMM=OFF` | Re-enable Virtual Memory Management | Default is disabled; test if needed |

#### Flags to be aware of but NOT set by default:

| Flag | Notes |
|------|-------|
| `-DGGML_HIP_UMA=ON` | Uses hipMallocManaged for UMA. **Avoid on Strix Halo** -- it uses fine-grained memory that is significantly slower. Standard hipMalloc + GTT expansion via kernel params is faster. |
| `-DGGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` | Forces FP16 compute in hipBLAS. Documented for RDNA4 -- may help pp performance on gfx1151. **Test before deploying.** |
| `-DGGML_CUDA_FA_ALL_QUANTS=ON` | Compiles all KV cache quant type combinations for FA. **Works for CUDA kernels, which HIP reuses via hipify.** Increases compilation time substantially. Enable if you need quantized KV cache with flash attention. |

#### ROCm version considerations:

- **ROCm 7.2**: Known rocWMMA compilation issue (ambiguous template specializations in `mfma_impl.hpp`). Fixed in later point releases or by disabling rocWMMA.
- **ROCm 7.0 RC / 7.1**: Generally work well with gfx1151.
- **ROCm 6.4.4**: Some users report better performance than 7.x for certain workloads. The gfx1151 rocBLAS kernel regression means hipBLASLt is essential.
- **Recommendation**: Use ROCm 7.2+ with rocWMMA patches, or pre-built toolbox containers that have been validated for gfx1151.
### 1.3 Vulkan Build Flags

```bash
cmake -B build -S . \
  -DGGML_VULKAN=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

The Vulkan build auto-detects cooperative matrix support (KHR_coopmat) at runtime.
gfx1151 with RADV reports `matrix cores: KHR_coopmat` in llama-bench logs.

#### Vulkan-specific considerations:

| Topic | Detail |
|-------|--------|
| **RADV vs AMDVLK** | RADV (Mesa) is recommended for gfx1151. Better long-context scaling, no 2GB buffer allocation limit. |
| **AMDVLK buffer limit** | AMDVLK caps single Vulkan allocations at ~2 GiB (`VkPhysicalDeviceLimits::maxMemoryAllocationSize`). RADV allows ~4 GiB. This causes OOM for models with large compute buffers. |
| **CoopMat1 vs CoopMat2** | gfx1151 supports KHR_coopmat (CoopMat1). CoopMat2 (`VK_NV_cooperative_matrix2`) is NVIDIA-only. Vulkan flash attention on AMD is therefore limited to the less efficient CoopMat1 path -- use ROCm for the fastest GPU-accelerated FA. |
| **Shader compilation** | Building from source with `glslc` available enables cooperative matrix shader variants. Pre-built binaries may omit them. |
### 1.4 LTO and PGO

llama.cpp does not enable LTO or PGO by default. You can enable LTO via the
standard CMake flag:

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DCMAKE_BUILD_TYPE=Release
```

Expected benefit: 2-5% improvement in CPU-bound paths. The GPU kernels are
compiled by the HIP/ROCm compiler and are not affected by host LTO.

PGO would require a two-pass build (instrument, profile, rebuild) and is not
commonly done for llama.cpp. The dominant bottleneck is GPU kernel performance
and memory bandwidth, not host-side code paths.

### 1.5 Compiler Tuning for ROCm

A known LLVM regression affects loop unrolling on RDNA. The following flag has
been reported to help:

```bash
-DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
```

This increases the unrolling threshold for local memory operations, which can
improve kernel performance for flash attention and matrix multiplication.

---
## 2. Runtime Flags and Environment Variables

### 2.1 ROCm Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `ROCBLAS_USE_HIPBLASLT=1` | **Critical** | Switches from rocBLAS Tensile kernels to hipBLASLt. On gfx1151, default rocBLAS achieves only 5.76 TFLOPS (<9% efficiency); hipBLASLt achieves >60% efficiency. **This is a 2-7x improvement for prompt processing.** |
| `HSA_OVERRIDE_GFX_VERSION=11.5.1` | Set inside toolbox containers | Required for ROCm to recognize gfx1151. Set in the container, not by host scripts. |
| `HSA_ENABLE_SDMA=0` | Optional | Disables the SDMA engine. May help on some configurations, but generally not needed on Strix Halo with recent kernels. |
| `HIP_VISIBLE_DEVICES=0` | Optional | Select a specific GPU device. Useful in multi-GPU or container setups. |
| `GPU_MAX_HEAP_SIZE=100` | Optional | Allow 100% of GPU memory for the heap. The default may be lower. |
| `GPU_MAX_ALLOC_PERCENT=100` | Optional | Allow a single allocation up to 100% of GPU memory. |
| `ROCR_VISIBLE_DEVICES=0` | Optional | HSA-level device visibility control. |
| `AMD_LOG_LEVEL=0` | Optional | Suppress AMD driver logging noise. |

**The single most impactful environment variable is `ROCBLAS_USE_HIPBLASLT=1`.**
Without it, ROCm pp512 on Llama-2-7B drops from 882 t/s to 348 t/s (~2.5x slower).

### 2.2 Vulkan Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `AMD_VULKAN_ICD=RADV` | Recommended | Force the RADV driver (skip AMDVLK). |
| `RADV_PERFTEST=nogttspill` | **Important** | Fixes GTT memory spilling issues on RADV. Can resolve significant pp performance drops (especially with FA off). |
| `GGML_VK_VISIBLE_DEVICES=0` | Optional | Select the Vulkan device index. |
| `GGML_VULKAN_DISABLE_F16=1` | Debugging | Force FP32 compute. Slower but useful for debugging precision issues. |
| `GGML_LOG_LEVEL=2` | Debugging | Verbose logging to verify coopmat detection. |

### 2.3 Thread Count (`-t` flag)

For GPU-dominant inference (all layers offloaded), the thread count has minimal
impact on throughput. The recommendation:

- **Single-user inference**: `-t 4` to `-t 8` (enough for tokenization/sampling overhead)
- **Server with parallel slots**: `-t` equal to the physical big-core count (12 Zen 5 cores on the Ryzen AI MAX+ 395)
- **Hybrid CPU+GPU (partial offload)**: `-t` equal to the number of physical cores

The Ryzen AI MAX+ 395 has 16 cores (12 Zen 5 + 4 Zen 5c). For llama.cpp, using
all 12 big cores (`-t 12`) is optimal for CPU-involved workloads.

### 2.4 Batch Size Tuning (`-b` and `-ub`)

| Flag | Default | Role |
|------|---------|------|
| `-b` / `--batch-size` | 2048 | Logical batch size (application level) |
| `-ub` / `--ubatch-size` | 512 | Physical batch size (device level) |

Tuning guidance for gfx1151:

- **MoE models**: `-b 256` significantly improves pp512 (a reported 70% improvement on Qwen3-30B-A3B)
- **Dense models**: The default `-b 2048` is generally fine
- **Long context**: `-ub 2048` can improve performance, but test against OOM
- **Ultra-long context**: Reduce `-ub` if memory allocation fails

The Vulkan backend blog post for Strix Halo recommends `-c 32768 -b 4096 -ub 256`
for a good balance of performance and memory.
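Because the best batch size is model-dependent, it is worth sweeping a few values empirically. A small generator like the following prints one llama-bench invocation per candidate `-b` value (review the output, then pipe it to `sh` to run); the model path is a placeholder:

```shell
# Print a llama-bench sweep over candidate logical batch sizes.
# Review the generated commands, then pipe to `sh` to execute them.
gen_bench_sweep() {
  model=$1
  for b in 256 512 1024 2048; do
    echo "llama-bench -m $model -ngl 99 -mmp 0 -fa 1 -b $b -p 512 -n 128 -r 3"
  done
}

gen_bench_sweep /path/to/model.gguf
```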
### 2.5 Memory-Mapped Loading (`-mmp` / `--no-mmap`)

**Critical finding for unified memory APUs:**

> When you load large models to the GPU, memory mapping can make loading moderately
> slower for Vulkan, and **catastrophically slower for ROCm**. You should always set
> `--mmap 0` or `--no-mmap` to improve model loading times on Strix Halo.

For `llama-bench`, use `-mmp 0`. For `llama-server`/`llama-cli`, use `--no-mmap`.

On Strix Halo, both "GPU memory" and "CPU memory" share the same physical LPDDR5x.
The difference is which pages are mapped for GPU access. GPU-mapped pages have full
bandwidth (~215 GB/s). CPU-accessed pages get approximately half (~84 GB/s for
CPU-to-GPU copies).

**Always use `-ngl 99` (or higher) to ensure all layers are on GPU memory.**
Even on a unified memory system, GPU memory paths provide 2x the bandwidth.

### 2.6 GPU Layer Offloading (`-ngl`)

For Strix Halo with 64GB unified memory:

- **Models < 50GB**: `-ngl 99` offloads everything. No tuning needed.
- **Models 50-60GB**: `-ngl 99` should still work with GTT expanded via kernel params.
- **Models > 60GB**: May need partial offload. Use `-ngl <N>` where N is tuned to keep GPU memory under the GTT limit. Remaining layers run on CPU at ~1/2 bandwidth.
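For reference, GTT expansion on Strix Halo is commonly done via kernel command-line parameters along these lines. The exact parameter names and values depend on kernel version and installed RAM, so treat this fragment as an assumption to verify against your distribution's documentation, not a drop-in setting:

```
# Example kernel command-line additions (verify for your kernel and RAM):
# ttm.pages_limit: maximum GTT pages (4 KiB each); 16777216 pages = 64 GiB
# amdgpu.gttsize:  GTT size in MiB (older interface, deprecated on newer kernels)
ttm.pages_limit=16777216 ttm.page_pool_size=16777216
```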
**Never let GPU spill to system RAM paths** -- performance will be worse than pure CPU.

---
## 3. Flash Attention and Attention Backends

### 3.1 When to Enable Flash Attention

**Rule of thumb for gfx1151:**

| Backend | Flash Attention | Recommendation |
|---------|----------------|----------------|
| ROCm + rocWMMA | `-fa 1` | **Always enable.** 24% pp improvement, maintains tg speed, uses less memory. |
| ROCm without rocWMMA | `-fa 1` | Enable, but the improvement is smaller. |
| Vulkan RADV | `-fa 1` | **Enable for short contexts.** Minor improvement at pp512/tg128. At long contexts, Vulkan FA may degrade performance. |
| Vulkan AMDVLK | `-fa 1` | Similar to RADV. |

**Key caveat**: Vulkan flash attention on AMD uses CoopMat1 (KHR_coopmat), not
the more efficient CoopMat2 (NVIDIA-only). For AMD, ROCm + rocWMMA is the superior
FA path.

### 3.2 rocWMMA Flash Attention Performance

Benchmark on gfx1151 (Llama2-7B Q4_K_M):

| Configuration | pp512 t/s | tg128 t/s |
|---------------|-----------|-----------|
| HIP standard | 592.28 | 40.40 |
| HIP + hipBLASLt | 548.72 | 40.43 |
| HIP + rocWMMA + hipBLASLt | 1006.80 | 39.46 |
| HIP + rocWMMA (no hipBLASLt) | 899.73 | 39.45 |

rocWMMA provides ~70% improvement in prompt processing with flash attention.
Token generation is slightly slower (~2%) due to WMMA overhead at small batch sizes.

### 3.3 The rocWMMA Long-Context Regression and Fix

The standard rocWMMA implementation has a **long-context decode regression**:
at 65K context, tg degrades by up to 57% compared to the HIP-only baseline.

**The fix** (PR #16827, "rocm-wmma-tune" branch) implements:

1. **`__launch_bounds__(256, 2)`**: Guarantees at least 2 blocks per compute unit (CU), improving occupancy
2. **Adaptive KQ stride**: Uses stride 128 when head dimension <= 128, reducing LDS footprint
3. **Selective WMMA usage**: WMMA only for prefill; decode reverts to VEC/TILE kernels

Results after the fix (Llama 3.2 1B Q4_K_M on gfx1151):

- pp512 at 65K context: **96% faster** than untuned rocWMMA
- tg128 at 65K context: Matches the HIP baseline (previously 57% degraded)

**Status**: This patch is available in `-rocwmma-improved` toolbox builds. It may
not be merged into upstream llama.cpp. Check Donato Capitella's toolboxes.
### 3.4 Vulkan Flash Attention Limitations on AMD

The Vulkan backend supports three FA paths:

| Path | Extension | AMD Support |
|------|-----------|-------------|
| FA_SCALAR | None | Yes (CPU fallback) |
| FA_COOPMAT1 | KHR_cooperative_matrix | Yes (gfx1151 reports support) |
| FA_COOPMAT2 | NV_cooperative_matrix2 | **No** (NVIDIA-only) |

FA_COOPMAT1 supports f16, q4_0, q8_0, and f32 KV cache types.
FA_COOPMAT2 additionally supports all quant types.

When Vulkan FA is enabled on AMD with RADV, it uses CoopMat1 for matrix operations.
This provides a modest improvement over scalar FA but is significantly less
efficient than ROCm + rocWMMA.

### 3.5 New Attention Models (GatedDeltaNet)

Models using the GatedDeltaNet architecture (Qwen3.5-27B, Qwen3.5-35B-A3B) have
severe performance problems on gfx1151:

- **Vulkan**: No GATED_DELTA_NET compute shader exists; ops fall back to CPU
- **ROCm/HIP**: The kernel cross-compiles but suffers from register spilling (`float s[S_v]` allocates up to 512 bytes per thread) and a hipMemcpyWithStream bottleneck (92-95% of decode time on models >15GB)

Result: Qwen3.5-27B runs at ~12 t/s on gfx1151 vs an expected 50-80 t/s.
**Avoid GatedDeltaNet models on gfx1151 until kernel optimization lands.**

---

## 4. Quantization Strategies for Speed

### 4.1 Quantization Speed on RDNA 3.5

Token generation speed is dominated by memory bandwidth, not compute. Smaller
quantizations are faster because they reduce bytes-per-weight, allowing more
tokens per second within the ~215 GB/s bandwidth envelope.

Approximate throughput formula for decode (bandwidth-bound):

```
tg_tokens_per_s ≈ effective_bandwidth_GBps / model_size_GB
```

For a 7B Q4_K_M model (~4.1 GB):

```
215 / 4.1 ≈ 52 t/s (theoretical max; practical ~50 t/s on gfx1151)
```
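The same arithmetic, applied to a few common model sizes, gives a quick ceiling estimate (it ignores KV cache reads and compute overhead, so real numbers land somewhat below it):

```shell
# Theoretical bandwidth-bound decode ceiling: bandwidth (GB/s) / model size (GB)
for size_gb in 4.1 8.2 17.3 32.8; do
  awk -v bw=215 -v s="$size_gb" \
    'BEGIN { printf "%s GB model -> ~%.0f t/s ceiling\n", s, bw / s }'
done
```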
### 4.2 Quantization Type Comparison

| Quant | Bits/Weight | Quality | Speed (relative) | Notes |
|-------|-------------|---------|------------------|-------|
| Q4_0 | 4.0 | Low | Fastest | Legacy. Simple dequant. |
| Q4_K_M | 4.83 | Good | Very fast | K-quant with hierarchical blocks. Recommended default. |
| IQ4_XS | 4.25 | Good | Fast | Importance-weighted. Better quality/bit than Q4_K_M. |
| Q5_K_M | 5.69 | Very good | Fast | Sweet spot for quality-sensitive use. |
| Q6_K | 6.56 | Excellent | Moderate | Near-lossless quality. |
| Q8_0 | 8.0 | Near-perfect | Slower | ~1.7x the bytes of Q4_K_M, with proportionally slower tg. |
| F16 | 16.0 | Perfect | Slowest | Reference baseline. |

**For RDNA 3.5 specifically**:

- **Q4_K_M** is the best general-purpose quantization. The K-quant family uses hierarchical super-blocks (256 values) with per-sub-block scales, providing better quality than Q4_0 at a marginally higher dequant cost that is invisible at the GPU level.

- **Q4_0** has the simplest dequant kernels and is marginally faster than Q4_K_M on some GPU backends. However, the quality loss is significant. Use it only for smoke tests or when every t/s matters more than quality.

- **IQ4_XS** (importance-matrix quantized) offers better quality per bit than Q4_K_M. Speed is similar. Requires an importance matrix file during quantization. **Recommended over Q4_K_M when you control the quantization process.**

- **Q8_0** does NOT have special hardware-accelerated dequant on RDNA 3.5. RDNA 3.5 lacks INT8 tensor core equivalents. Q8_0 performance relies on the same FP16 compute paths, just with more memory bandwidth consumed.
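Since decode speed tracks file size, bits-per-weight translates directly into throughput. A quick size estimator, using the bits-per-weight figures from the table above (real GGUF files add a small overhead for metadata, embeddings, and mixed-precision tensors, so treat the result as a lower bound):

```shell
# Approximate model file size: params * bits_per_weight / 8 bytes
est_size_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 / 1e9 }'
}

est_size_gb 7e9 4.83   # Q4_K_M, 7B -> 4.2
est_size_gb 7e9 8.0    # Q8_0, 7B   -> 7.0
```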
### 4.3 Importance Matrix (imatrix) Quantization

imatrix quantization records how much each weight affects output quality, then
allocates more precision bits to important weights. This is essential for
sub-4-bit quantizations (IQ2_XS, IQ3_XXS, IQ4_XS), where standard K-quants
show measurable degradation.

```bash
# Generate importance matrix (GPU-accelerated)
llama-imatrix -m model-f16.gguf -f calibration_data.txt -ngl 99 -o imatrix.dat

# Quantize with imatrix
llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4_xs.gguf IQ4_XS
```

**Speed impact**: None. imatrix affects quantization quality, not inference speed.
The dequantization kernels are identical regardless of whether imatrix was used.

### 4.4 Unsloth Dynamic (UD) Quantizations

Unsloth Dynamic 2.0 selectively quantizes different layers at different bit widths,
choosing the optimal quantization per layer based on sensitivity analysis.

**Speed impact**: Minimal to none. UD quants use the same dequant kernels as
standard GGUF quantizations. A UD-Q4_K_XL file runs at the same speed as a
standard Q4_K_M of the same total size.

**Quality impact**: Significantly better. UD consistently outperforms standard
quantizations in 5-shot MMLU and KL divergence metrics at the same total file size.

**Recommendation**: Prefer UD quants (e.g., `UD-Q4_K_XL`, `UD-Q4_K_M`) from
Unsloth when available. They are a free quality upgrade with no speed penalty.

---

## 5. Memory Layout and Caching

### 5.1 KV Cache Quantization

KV cache quantization reduces the memory footprint of the attention cache,
allowing larger context windows within the same memory budget.

| Cache Type | Memory vs F16 | Quality Impact | Recommendation |
|------------|--------------|----------------|----------------|
| f16 (default) | 1.0x | None | Baseline |
| q8_0 | 0.5x | Negligible (+0.002-0.05 ppl) | **Recommended for production** |
| q4_0 | 0.33x | Noticeable (+0.2-0.25 ppl) | Use when memory-constrained |
| q4_1 | 0.33x | Slightly better than q4_0 | Alternative to q4_0 |
| iq4_nl | 0.33x | Better than q4_0 | Best 4-bit KV option |

Usage:

```bash
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 ...
# or for llama-bench:
llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 ...
```

**Performance impact**: Quantizing the K cache slightly **improves** throughput
(less memory to read). Quantizing the V cache may have a slight negative impact.
The overall performance impact is negligible for normal inference.

**Caveat with speculative decoding**: Using KV cache quantization with a draft
model causes a consistent ~16% performance drop, and q4_0 KV causes massive
acceptance-rate drops. **Avoid KV cache quantization if using speculative
decoding.**
### 5.2 mmap vs Full Load on Unified Memory

On Strix Halo's unified memory architecture:

- **`--no-mmap` is strongly recommended** for both ROCm and Vulkan.
- With mmap enabled, ROCm model loading is "catastrophically slower."
- Vulkan loading is "moderately slower" with mmap.
- Since CPU and GPU share physical RAM, there is no data copy when loading to "GPU memory" -- it is just a page table update.

For `llama-bench`: Always use `-mmp 0`.
For `llama-server`/`llama-cli`: Always use `--no-mmap`.

### 5.3 Prompt Caching

llama-server supports two levels of prompt caching:

**1. Automatic KV cache reuse (`cache_prompt: true`)**:
Reuses the KV cache from previous requests when prompts share a common prefix.
The server only reprocesses the suffix that differs.

**2. Host-memory prompt caching (`--cram N`)**:
Stores pre-computed prompt representations in system RAM.

- Reduces TTFT from ~4.2s to ~0.3s for cached requests (93% reduction)
- +6% token throughput (34 vs 32 t/s)
- Memory formula: `num_prefixes * avg_prefix_tokens * 8 bytes`

Configuration:

```bash
# --cram 256: 256 MB of host RAM for the prompt cache
# --cache-type-k/--cache-type-v q8_0: quantized KV cache
llama-server -m model.gguf \
  --cram 256 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  -fa \
  -ngl 99
```

Best for:

- System prompts > 5K tokens
- Multi-user chatbots with shared context
- Agentic use with repeated tool-call prefixes

### 5.4 UMA Detection Bug (Issue #18159)

llama.cpp's UMA detection (from PR #17368, designed for NVIDIA DGX Spark)
incorrectly activates on AMD APUs when `prop.integrated=1`. It reads
`/proc/meminfo` instead of calling `hipMemGetInfo()`, severely underreporting
available GPU memory (e.g., reporting 27GB instead of 96GB).

**Workarounds**:

- Build without `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
- Guard UMA detection with `!defined(GGML_USE_HIP)` (upstream fix pending)
- Use toolbox containers where this has been patched

### 5.5 KV Cache Placement on ROCm (Issue #18011)

On Strix Halo, the ROCm backend may place the KV cache in shared (CPU-accessible)
memory instead of GPU-mapped memory, causing performance degradation at high
context sizes. This is a known issue contributing to ROCm falling behind Vulkan
for tg at high contexts.

**Mitigation**: Use the rocWMMA-tuned branch, which maintains better memory
placement, or use Vulkan RADV for workloads where this matters.

---

## 6. llama-server Specific Optimizations

### 6.1 Recommended Server Configuration

```bash
# -c 32768:        total context, shared across all slots
# -np 4:           4 parallel slots (adjust for your use)
# -b 2048 -ub 512: logical / physical batch sizes
# --cont-batching: enabled by default on recent builds
# --jinja:         enable the Jinja2 chat template
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --no-mmap \
  -fa \
  -c 32768 \
  -np 4 \
  -b 2048 \
  -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --jinja
```

### 6.2 Parallel Slot Configuration (`-np`)

| Use Case | Slots | Context per Slot | Total `-c` |
|----------|-------|------------------|-----------|
| Single user chat | 1 | 32768 | 32768 |
| Agentic coding (Claude Code style) | 2-4 | 8192-16384 | 32768-65536 |
| Multi-user API | 4-8 | 4096-8192 | 32768-65536 |
| Eval harness | 1 | 32768+ | 32768+ |

Memory formula: each slot requires roughly `context_size * 2 * hidden_dim * n_layers * bytes_per_kv_element`
(for GQA models, replace `hidden_dim` with `n_kv_heads * head_dim`).
With q8_0 KV cache, this is roughly halved compared to f16.
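The formula above, worked for an illustrative 32-layer model with `hidden_dim` 4096 (a hypothetical shape, not a specific model; q8_0 is treated as ~1 byte per element, which slightly understates the block-scale overhead):

```shell
# KV bytes per slot = context * 2 (K and V) * hidden_dim * n_layers * bytes_per_elem
kv_gib_per_slot() {
  awk -v c="$1" -v h="$2" -v l="$3" -v b="$4" \
    'BEGIN { printf "%.2f\n", c * 2 * h * l * b / (1024^3) }'
}

kv_gib_per_slot 8192 4096 32 1    # q8_0: -> 2.00 (GiB)
kv_gib_per_slot 8192 4096 32 2    # f16:  -> 4.00 (GiB)
```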
|
||||
|
||||
### 6.3 Continuous Batching
|
||||
|
||||
Enabled by default (`--cont-batching`). Allows the server to process multiple
|
||||
requests simultaneously, interleaving prefill and decode operations.
|
||||
|
||||
For agentic workloads: One slot typically holds a large system prompt + conversation
|
||||
context, while additional slots handle parallel tool calls. Configure with:
|
||||
|
||||
```bash
|
||||
-np 4 -c 131072 # 4 slots, up to 32K context each
|
||||
```
|
||||
|
||||
### 6.4 Prompt Caching for Agentic Use
|
||||
|
||||
For agentic coding tools that send the same system prompt repeatedly:
|
||||
|
||||
1. Use `cache_prompt: true` in API requests (reuses KV cache prefix)
|
||||
2. Use `--system-prompt-file system.txt` for static system prompts (note: may be
|
||||
removed in recent versions; verify with your build)
|
||||
3. Use `--cram 128` to enable host-memory caching for prefix deduplication
|
||||
|
||||
### 6.5 Speculative Decoding
|
||||
|
||||
For token generation speedup with a draft model:
|
||||
|
||||
```bash
|
||||
llama-server -m main-model.gguf \
|
||||
--model-draft draft-model.gguf \
|
||||
-ngl 99 \
|
||||
--draft-max 8 \
|
||||
--draft-min 1 \
|
||||
--no-mmap \
|
||||
-fa
|
||||
```
|
||||
|
||||
**Caveat**: Do NOT combine speculative decoding with KV cache quantization.
|
||||
The 16% performance drop and reduced acceptance rate negate the benefits.
|
||||
|
||||
---
|
||||
|
||||
## 7. Upcoming llama.cpp Features (2026)
|
||||
|
||||
### 7.1 Backend-Agnostic Tensor Parallelism (PR #19378)
|
||||
|
||||
Merged January 2026. Adds `--split-mode tensor` for splitting computation across
|
||||
multiple GPUs via a new "meta" backend.
|
||||
|
||||
**Relevance to Strix Halo**: Limited. Single integrated GPU. However, for RPC
|
||||
configurations with multiple Strix Halo nodes (Jeff Geerling's Beowulf cluster),
|
||||
tensor parallelism could complement the existing layer-split approach.
|
||||
|
||||
Currently supports 1-2 GPUs with equal data split. `--tensor-split` has no effect yet.
|
||||
|
||||
### 7.2 TurboQuant KV Cache Compression (ICLR 2026)
|
||||
|
||||
Google's TurboQuant (Zandieh et al.) achieves 3-bit KV cache quantization with
|
||||
no training and negligible quality loss:
|
||||
|
||||
| Format | MSE vs FP16 | Compression |
|
||||
|--------|-------------|-------------|
|
||||
| TQ3 (3-bit) | 0.034 | 4.9x |
|
||||
| TQ4 (4-bit) | 0.009 | 3.8x |
|
||||
|
||||
**Timeline**: Open-source llama.cpp integration expected Q2-Q3 2026. A 6-phase
|
||||
integration plan exists covering GGML type registration, KV cache paths, FA
|
||||
integration, and CLI flags.
|
||||
|
||||
### 7.3 Vulkan Improvements
|
||||
|
||||
Active 2025-2026 developments:
|
||||
- Mesa RADV optimizations for RDNA4 AI workloads (Rhys Perry/Valve patches)
|
||||
- 13% pp improvement from CU mode optimization for LDS utilization
|
||||
- BFloat16 Vulkan support (`VK_KHR_shader_bfloat16`) maturing in Mesa 25.x
|
||||
- Partial offloading performance improvement for AMD (llama.cpp b8185, March 2026)
|
||||
|
||||
### 7.4 Flash Attention for Head Dimension 512
|
||||
|
||||
Pull request from March 2026 adds FA support for HD=512 in CUDA kernels.
|
||||
This benefits models with larger head dimensions (some newer architectures).
|
||||
The HIP path should inherit this via hipify.
|
||||
|
||||
### 7.5 ik_llama.cpp Fork Innovations
|
||||
|
||||
The `ik_llama.cpp` fork by ikawrakow introduces:
|
||||
- Row-interleaved quant packing (better memory access patterns)
|
||||
- Smart Expert Reduction for faster MoE inference
|
||||
- Tensor overrides with regex patterns for hybrid GPU/CPU placement
|
||||
- FlashMLA for DeepSeek models
|
||||
|
||||
**Caveat**: ik_llama.cpp only fully supports CPU and CUDA backends. ROCm/Vulkan
|
||||
are not maintained. Not recommended for AMD gfx1151.
|
||||
|
||||
---

## 8. Recommended Configurations

### 8.1 For llama-bench (Benchmarking)

**ROCm backend:**
```bash
ROCBLAS_USE_HIPBLASLT=1 \
toolbox run -c llama-rocm-7.2 -- \
/path/to/llama-bench \
-m /path/to/model.gguf \
-ngl 99 -mmp 0 -fa 1 \
-p 512 -n 128 -r 5
```

**Vulkan backend:**
```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
toolbox run -c llama-vulkan -- \
/path/to/llama-bench \
-m /path/to/model.gguf \
-ngl 99 -mmp 0 -fa 1 \
-p 512 -n 128 -r 5
```

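A practical note for harvesting llama-bench results: when `--cache-type-k`/`--cache-type-v` are set, the output table gains `type_k`/`type_v` columns, so the `test` and `t/s` columns shift to different indices. Indexing each row from the end is robust to the variable column count. A minimal sketch (the sample rows are illustrative, not real measurements):

```bash
# llama-bench prints a markdown table; extra type_k/type_v columns appear
# when KV cache quantization is enabled, shifting fixed column indices.
# Parse from the END of the row: last cell is "t/s", second-to-last is "test".
# The sample rows below are illustrative, not real measurements.
row_plain='| llama 8B Q4_K_M | 4.58 GiB | 8.03 B | ROCm | 99 | pp512 | 1234.56 ± 7.89 |'
row_kv='| llama 8B Q4_K_M | 4.58 GiB | 8.03 B | ROCm | 99 | q8_0 | q8_0 | tg128 | 45.67 ± 0.12 |'

parse_row() {
  echo "$1" | awk -F'|' '{
    n = 0
    for (i = 1; i <= NF; i++) {            # collect trimmed, non-empty cells
      gsub(/^[ \t]+|[ \t]+$/, "", $i)
      if ($i != "") cells[++n] = $i
    }
    split(cells[n], ts, " ")               # "1234.56 ± 7.89" -> mean t/s
    printf "%s %s\n", cells[n-1], ts[1]
  }'
}

parse_row "$row_plain"   # -> pp512 1234.56
parse_row "$row_kv"      # -> tg128 45.67
```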
### 8.2 For llama-server (Production Agentic Use)

**ROCm (best for long context):**
```bash
ROCBLAS_USE_HIPBLASLT=1 \
llama-server -m model.gguf \
-ngl 99 --no-mmap -fa \
-c 65536 -np 4 \
-b 2048 -ub 512 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--cram 256 \
--jinja --cont-batching \
--host 0.0.0.0 --port 8080
```

**Vulkan RADV (best for single-user tg):**
```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
llama-server -m model.gguf \
-ngl 99 --no-mmap -fa \
-c 32768 -np 2 \
-b 4096 -ub 256 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--jinja --cont-batching \
--host 0.0.0.0 --port 8080
```

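Once either server is up, a quick smoke test against the built-in endpoints confirms the configuration before pointing clients at it (this sketch assumes port 8080 as in the examples above):

```bash
# llama-server exposes a /health endpoint and an OpenAI-compatible API.
# Assumes the server from the examples above is listening on port 8080.
curl -s http://localhost:8080/health
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Reply with OK."}],"max_tokens":8}'
```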
### 8.3 Decision Matrix

| Question | Answer |
|----------|--------|
| Which backend for benchmarking? | Both. ROCm and Vulkan have different strengths. |
| Which backend for daily chat? | Vulkan RADV for best tg speed. |
| Which backend for long-context agentic? | ROCm + rocWMMA-tuned for context resilience. |
| Which quantization? | Q4_K_M or UD-Q4_K_XL for speed; Q5_K_M for quality. |
| Enable flash attention? | Yes, always on ROCm. Yes on Vulkan for short contexts. |
| Use `--no-mmap`? | Always. |
| Set `ROCBLAS_USE_HIPBLASLT=1`? | Always for ROCm. |
| Set `RADV_PERFTEST=nogttspill`? | Always for Vulkan RADV. |
| KV cache quantization? | q8_0 for both K and V unless using speculative decoding. |
| Batch size for MoE? | `-b 256` (lower than default improves some MoE models). |

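The `-b 256` suggestion in the matrix is easy to verify empirically rather than taken on faith. A hedged sweep sketch; the container name, binary path, and model path follow the earlier examples and should be adjusted to your setup:

```bash
# Illustrative batch-size sweep to find the MoE sweet spot on this hardware.
# Paths and container name are placeholders from the examples above.
for b in 128 256 512 1024 2048; do
  ROCBLAS_USE_HIPBLASLT=1 \
  toolbox run -c llama-rocm-7.2 -- \
    /path/to/llama-bench \
    -m /path/to/moe-model.gguf \
    -ngl 99 -mmp 0 -fa 1 \
    -b "$b" -p 512 -n 128 -r 3
done
```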
---

## 9. Sources

### GitHub Issues and Discussions
- [Performance of llama.cpp on AMD ROCm (HIP) - Discussion #15021](https://github.com/ggml-org/llama.cpp/discussions/15021)
- [Performance of llama.cpp with Vulkan - Discussion #10879](https://github.com/ggml-org/llama.cpp/discussions/10879)
- [HIP backend performs poorly on gfx1151 - Issue #13565](https://github.com/ggml-org/llama.cpp/issues/13565)
- [UMA detection incorrectly limits memory on AMD APUs - Issue #18159](https://github.com/ggml-org/llama.cpp/issues/18159)
- [ROCm model loading dumps KV cache to shared memory - Issue #18011](https://github.com/ggml-org/llama.cpp/issues/18011)
- [GATED_DELTA_NET underperformance on gfx1151 - Issue #20354](https://github.com/ggml-org/llama.cpp/issues/20354)
- [Under-Performance of ROCm 7.2 binaries - Issue #19984](https://github.com/ggml-org/llama.cpp/issues/19984)
- [ROCm 7.2 + rocWMMA compilation - Issue #19269](https://github.com/ggml-org/llama.cpp/issues/19269)
- [Building for gfx1151 - Issue #14734](https://github.com/ggml-org/llama.cpp/issues/14734)
- [AMDVLK 2GB buffer allocation limit - Issue #15054](https://github.com/ggml-org/llama.cpp/issues/15054)
- [Mastering Host-Memory Prompt Caching - Discussion #20574](https://github.com/ggml-org/llama.cpp/discussions/20574)
- [TurboQuant Extreme KV Cache Quantization - Discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969)
- [Backend-agnostic tensor parallelism - PR #19378](https://github.com/ggml-org/llama.cpp/pull/19378)
- [Massively Improved rocWMMA Performance - PR #16827](https://github.com/ggml-org/llama.cpp/pull/16827)
- [rocWMMA for gfx1151 performance boost - lemonade-sdk Issue #7](https://github.com/lemonade-sdk/llamacpp-rocm/issues/7)
- [Increase llama.cpp performance on AI Max 395+ - geerlingguy Issue #5](https://github.com/geerlingguy/beowulf-ai-cluster/issues/5)

### Wiki and Community Resources
- [Strix Halo Wiki - llama.cpp Performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Strix Halo Wiki - llama.cpp with ROCm](https://strixhalo.wiki/AI/llamacpp-with-ROCm)
- [AMD Strix Halo Backend Benchmarks (Grid View)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [LLM Tracker - AMD Strix Halo GPU Performance](https://llm-tracker.info/AMD-Strix-Halo-(Ryzen-AI-Max+-395)-GPU-Performance)
- [Framework Community - Strix Halo GPU LLM Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521)
- [Framework Community - Toolboxes for LLM inference on Strix Halo](https://community.frame.work/t/llama-cpp-vllm-toolboxes-for-llm-inference-on-strix-halo/74916)

### Articles and Blog Posts
- [Hardware Corner - Strix Halo LLM Optimization](https://www.hardware-corner.net/strix-halo-llm-optimization/)
- [Hardware Corner - RADV Vulkan Driver 13% Improvement](https://www.hardware-corner.net/llama-cpp-amd-radv-vulkan-driver-update/)
- [Phoronix - AMD ROCm 7.1 vs RADV Vulkan](https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan)
- [Phoronix - Valve Developer RADV Improvement](https://www.phoronix.com/news/RADV-Valve-Boost-Llama.cpp)
- [Yifei's Blog - Strix Halo Matrix Cores with llama.cpp](https://blog.yifei.sg/jekyll/update/2025/08/27/building-llamacpp-strix-halo.html)
- [Strix Halo CUDA/HIP Testing Notes (lhl)](https://github.com/lhl/strix-halo-testing/blob/main/llama-cpp-cuda-hip.md)

### Official Documentation
- [ROCm - llama.cpp compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/llama-cpp-compatibility.html)
- [ROCm - llama.cpp installation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html)
- [ROCm Blog - Llama.cpp Meets Instinct](https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html)
- [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
- [llama-server README](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)

### Papers
- "Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct" (January 2026, arXiv:2601.14277)
- "TurboQuant: Redefining AI efficiency with extreme compression" (Zandieh et al., ICLR 2026)
- [Unsloth Dynamic 2.0 GGUFs Documentation](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)

---

## Open Questions / Limitations

1. **rocWMMA-tuned patch upstream status**: PR #16827 may not be fully merged.
Monitor for inclusion in mainline llama.cpp or continue using patched toolboxes.

2. **ROCm 7.2 stability on gfx1151**: Multiple reports of crashes (MUT_MAL errors),
performance regressions, and compilation issues. ROCm 7.x is maturing but
not fully stable for gfx1151 as of March 2026.

3. **Vulkan CoopMat FA for AMD**: Will AMD ever get CoopMat2 support? The current
CoopMat1 path provides modest improvement. A native AMD CoopMat2 or equivalent
extension would close the gap with ROCm FA.

4. **KV cache placement on ROCm**: Issue #18011 (KV cache dumped to shared memory)
reduces ROCm tg performance at high contexts. Root cause appears to be in
HIP memory allocation behavior on APUs.

5. **GGML_HIP_UMA vs kernel-param GTT expansion**: The UMA flag uses slow
fine-grained memory. GTT expansion via `amdgpu.gttsize` kernel params provides
coarse-grained GPU-mapped memory that is much faster. The upstream approach
may eventually improve, but kernel params remain the correct method for now.
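
The kernel-param route can be sketched for Fedora with `grubby`. The sizes below are illustrative assumptions for a 64 GB machine (they are not from this document) and must be tuned to your own headroom before applying:

```bash
# Illustrative Fedora sketch: expand GTT via kernel parameters.
# amdgpu.gttsize is in MiB; ttm.pages_limit is in 4 KiB pages
# (57344 MiB and 14680064 pages are both ~56 GiB -- example values only).
sudo grubby --update-kernel=ALL \
  --args="amdgpu.gttsize=57344 ttm.pages_limit=14680064 iommu=pt"
# Reboot, then verify the new GTT size:
# dmesg | grep -i "amdgpu.*gtt"
```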

6. **GatedDeltaNet architecture support**: Both Vulkan (missing shader) and ROCm
(register pressure, memcpy bottleneck) perform poorly on GDN models. This
blocks efficient use of Qwen3.5-27B and similar models.

7. **TurboQuant integration timeline**: Expected Q2-Q3 2026 for llama.cpp.
Would provide 3-bit KV cache with no quality loss, roughly doubling available
context within the same memory budget.

8. **NPU utilization**: The 50 TOPS NPU on Strix Halo is currently unusable on
Linux for llama.cpp inference. AMD driver support for the NPU on Linux remains
pending.

---

## Overlap Notes

- **Kernel parameters** (`amdgpu.gttsize`, `ttm.pages_limit`, `iommu=pt`):
Already documented in the project's `scripts/optimize/kernel-params.sh`.
This research covers the llama.cpp side (why they matter for inference).

- **BIOS VRAM allocation**: Reducing dedicated VRAM in BIOS frees more memory
for GTT. This is documented in the project's audit scripts but is a prerequisite
for the optimizations described here.

- **Toolbox container builds**: The project uses pre-built toolboxes
(`llama-rocm-7.2`, `llama-vulkan`). The compilation flags documented here
describe what should be baked into those containers.