fix(benchmark): parse llama-bench output with variable column count
KV cache quantization adds type_k/type_v columns to llama-bench output,
shifting the test and t/s columns to different indices. Parse from the
end of the row instead of hardcoded positions. Also fix the KV suffix
separator (underscore to dash) to avoid regex ambiguity with type names
like q8_0.

Add a 5-phase optimization guide, an optimization log for tracking
results, and research docs on llama.cpp and inference landscape
optimizations.
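The end-anchored parse can be sketched with awk (the row below is illustrative, not from a real run; the repo's actual benchmark parser is not shown here):

```shell
# Illustrative llama-bench markdown row after KV cache quantization adds
# the type_k/type_v columns (values made up):
row='| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | q8_0 | q8_0 | tg128 | 50.25 ± 0.10 |'

# Counting from the END is stable no matter how many columns precede:
# t/s is the last cell, the test name the second-to-last.
tps=$(printf '%s\n' "$row" | awk -F'|' '{split($(NF-1), a, " "); print a[1]}')
bench=$(printf '%s\n' "$row" | awk -F'|' '{gsub(/ /, "", $(NF-2)); print $(NF-2)}')
echo "$bench: $tps t/s"
```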
This commit is contained in:

docs/inference-optimization-landscape.md (new file, 825 lines)

# LLM Inference Optimization Landscape (March 2026)

## Scope

Comprehensive survey of cutting-edge LLM inference optimization techniques applicable
to a high-end AMD APU workstation: Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S
(gfx1151, RDNA 3.5), 64 GB unified LPDDR5X memory, 256 GB/s bandwidth. Covers
inference engines, quantization, attention, MoE optimization, memory bandwidth, OS-level
tuning, hardware features, and model-level techniques. Research current as of March 2026.

---

## Table of Contents

1. [Inference Engines and Backends](#1-inference-engines-and-backends)
2. [Advanced Quantization Techniques](#2-advanced-quantization-techniques)
3. [Attention Optimization](#3-attention-optimization)
4. [MoE-Specific Optimizations](#4-moe-specific-optimizations)
5. [Memory Bandwidth Optimization](#5-memory-bandwidth-optimization)
6. [OS and Runtime Techniques](#6-os-and-runtime-techniques)
7. [Emerging Hardware Features](#7-emerging-hardware-features)
8. [Model-Level Optimizations](#8-model-level-optimizations)
9. [Prioritized Recommendations for Strix Halo](#9-prioritized-recommendations-for-strix-halo)
10. [Sources](#10-sources)

---

## 1. Inference Engines and Backends

### 1.1 llama.cpp -- Still the Foundation

llama.cpp remains the dominant local inference engine. All major interfaces (Ollama,
LM Studio, GPT4All, KoboldCpp) use it under the hood. For Strix Halo specifically:

- **ROCm/HIP backend**: Build with `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`. The `ROCBLAS_USE_HIPBLASLT=1`
environment variable forces hipBLASLt kernels, which deliver the best throughput on
gfx1151.
- **Vulkan backend**: The RADV Mesa driver has seen active RDNA 3.5/4 optimization
in Mesa 25.x. In some benchmarks Vulkan outperforms ROCm for single-shot inference
and shorter contexts. HIP+WMMA+FlashAttention is fastest for long contexts (tg8192+).
- **UMA detection bug (issue #18159)**: llama.cpp's UMA detection can incorrectly
limit available memory on AMD APUs with large TTM allocations. The `--mmp 0`
(disable mmap) flag is critical for ROCm on Strix Halo to avoid catastrophically
slow model loading.
- **Performance**: Llama-2-7B Q4_0 achieves ~1464 t/s prompt processing (pp512) and
~50 t/s token generation (tg128) on Strix Halo with ROCm.
- **Known regression**: A commit enabling WMMA-MMQ INT kernels for RDNA 3 introduced
significant prompt processing regression on gfx1151 with ROCm 7.x (issue #17917).

**Status**: Production-ready. Best single-engine choice for Strix Halo.

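Putting the flags above in one place, a build along these lines should work (a sketch, assuming a llama.cpp checkout and an installed ROCm toolchain; `--no-mmap` is llama-server's spelling of the mmap toggle, while llama-bench uses `-mmp 0`):

```shell
# Configure the HIP backend for gfx1151 with WMMA flash attention and UMA
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGGML_HIP_UMA=ON
cmake --build build -j"$(nproc)"

# Force hipBLASLt kernels at run time for best gfx1151 throughput
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server -m model.gguf --no-mmap
```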
### 1.2 KTransformers -- CPU/GPU Hybrid MoE Specialist

KTransformers (SOSP 2025) is the most significant new engine for hybrid inference.
It was purpose-built for running large MoE models (DeepSeek-R1/V3) on systems with
limited GPU memory but abundant CPU memory.

- **AMX-optimized kernels**: Uses Intel AMX instructions for CPU-side expert
computation. For AMD Zen 5, it falls back to AVX-512, which is still substantially
faster than naive CPU inference.
- **Async CPU-GPU scheduling**: Overlaps CPU expert computation with GPU attention
computation, hiding CPU latency.
- **Performance**: 4.62-19.74x prefill speedup, 1.25-4.09x decode speedup vs
existing hybrid systems. SGLang + KTransformers achieves 220+ tok/s total
throughput on trillion-parameter MoE models.
- **Relevance to Strix Halo**: Moderate. KTransformers shines when GPU VRAM is
scarce (24 GB discrete) and CPU RAM is abundant (382 GB). On Strix Halo, all 64 GB
is accessible to the GPU, so the CPU offloading advantage is diminished. However,
for models exceeding 64 GB, KTransformers-style hybrid inference becomes relevant.

**Status**: Production. Most useful for models that exceed available VRAM.

### 1.3 PowerInfer / PowerInfer-2

PowerInfer-2 targets smartphones, achieving 11.68 t/s on Mixtral 47B (22x faster
than alternatives). It exploits MoE sparsity by predicting which experts will
activate and only loading those. The core technique -- hot/cold neuron partitioning
and GPU-resident hot neurons -- is architecturally interesting but the implementation
targets mobile SoCs with discrete memory hierarchies, not unified-memory APUs where
all memory is equally accessible to the GPU.

**Status**: Research. Techniques are partially subsumed by llama.cpp's own MoE
offloading improvements.

### 1.4 MLC-LLM

MLC-LLM compiles models via TVM to target multiple backends including ROCm, Vulkan,
Metal, and OpenCL. It was one of the first engines to make AMD GPUs competitive for
LLM inference (2023 blog post). The Vulkan backend provides a universal fallback
that works on any GPU.

**Status**: Active but niche. For Strix Halo, llama.cpp's native ROCm/Vulkan
backends are more mature and better optimized.

### 1.5 mistral.rs / candle / burn

Rust-based inference engines:

- **mistral.rs**: Built on Hugging Face's candle library. Supports GGUF, GPTQ,
ISQ (in-situ quantization). Has CUDA support but no ROCm backend.
- **candle**: Hugging Face's Rust ML framework. GPU support via CUDA; no ROCm.
- **burn**: Rust ML framework with multiple backends (WGPU, Vulkan, CUDA). The
WGPU/Vulkan path could theoretically work on AMD, but LLM inference support
is limited.

**Status**: Not viable for Strix Halo in 2026. No ROCm support, and the Vulkan
paths are less optimized than llama.cpp's.

### 1.6 BitNet.cpp

Microsoft's official 1-bit LLM inference framework. Achieves 6x faster inference
and 82% lower energy consumption. GPU kernel support was added May 2025 for NVIDIA
and Apple Silicon. No AMD GPU kernels yet. CPU-only mode works on any x86 system
and could be relevant for future 1-bit models, but the model ecosystem (BitNet b1.58
variants) remains small.

**Status**: Watch. No AMD GPU support. CPU path works but model selection is limited.

### 1.7 vLLM and SGLang

Both are production LLM serving frameworks with AMD ROCm support:

- **vLLM v0.16.0** (Feb 2026): ROCm is now a first-class platform. 93% of AMD
test groups passing. Native AITER FP8 kernels, fused LayerNorm/SiLU, optimized
Paged Attention. Extended bitsandbytes quantization to warp-size-32 GPUs (RDNA).
- **SGLang**: Supports ROCm. KTransformers integration for hybrid MoE inference.

Both are overkill for single-user local inference but become relevant for serving
multiple users or running agentic workloads with concurrent requests.

**Status**: Production for server workloads. Consider if running multi-user or
agentic eval pipelines.

### 1.8 ExLlamaV3 / EXL3

ExLlamaV3 introduces the EXL3 format (based on QTIP from Cornell RelaxML), achieving
excellent perplexity at extreme compression (Llama 3.3 70B at 1.75 bpw, 19 GB). The
Marlin-inspired GEMM kernels are highly optimized for NVIDIA GPUs. AMD ROCm support
was absent at launch (early 2025) and current status is uncertain.

**Status**: Watch. Potentially best-in-class quantization quality, but AMD support
is unclear.

---

## 2. Advanced Quantization Techniques

### 2.1 GGUF Quantization Landscape

GGUF remains the dominant format for local inference via llama.cpp. The key variants:

| Format | Bits | Method | Best For |
|-----------|------|-----------------|-----------------------------|
| Q8_0 | 8 | Round-to-nearest| Maximum quality, 2x compress|
| Q6_K | 6.5 | K-quant | High quality, 2.5x compress |
| Q5_K_M | 5.5 | K-quant+imatrix | Balanced quality/size |
| Q4_K_M | 4.5 | K-quant+imatrix | Default recommendation |
| Q3_K_M | 3.9 | K-quant+imatrix | Aggressive, still usable |
| IQ3_XXS | 3.06 | I-quant+imatrix | Extreme compression |
| IQ2_XXS | 2.06 | I-quant+imatrix | Near-minimum viable |
| IQ1_S | 1.56 | I-quant+imatrix | Experimental |

**imatrix (Importance Matrix)**: The single most impactful quality improvement for
sub-4-bit quantization. The importance matrix identifies which weights produce large
activations during inference and allocates more precision to them. For aggressive
quantization (<4 bits), imatrix is no longer optional -- it is essential.

**Recommendation**: Q4_K_M + imatrix for most use cases. Q3_K_M + imatrix when
fitting a larger model matters more than marginal quality.

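As a sketch, the imatrix workflow with llama.cpp's bundled tools (file names illustrative; `calibration.txt` is any representative text corpus):

```shell
# 1. Measure which weights produce large activations on calibration text
./build/bin/llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix

# 2. Quantize, letting the importance matrix steer precision allocation
./build/bin/llama-quantize --imatrix model.imatrix \
  model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```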
### 2.2 Unsloth Dynamic 2.0

Unsloth Dynamic 2.0 (Feb 2026) represents the state-of-the-art in intelligent GGUF
quantization:

- **Per-layer adaptive quantization**: Each layer gets a custom quantization type
based on sensitivity analysis. The quantization scheme for Gemma 3 differs
significantly from Llama 4.
- **Universal MoE + dense support**: Dynamic 2.0 works on all architectures
(previously MoE-only).
- **Calibration dataset**: 1.5M+ token hand-curated dataset for improved
conversational quality.
- **Quality results**: Dynamic 3-bit DeepSeek V3.1 GGUF scores 75.6% on 5-shot
MMLU, surpassing many full-precision models.
- **KL Divergence tracking**: Every GGUF is benchmarked against the original model
on both perplexity and KL divergence.

**Relevance**: Directly applicable. Use Unsloth Dynamic 2.0 GGUFs when available
for any model. They consistently outperform standard k-quant GGUFs at the same
bit-width.

### 2.3 AQLM and QuIP#

Both target extreme compression (2-3 bits):

- **QuIP#** (ICML 2024): Uses randomized Hadamard transforms + E8 lattice codebooks.
First PTQ method where 3-bit outperforms theoretical lossless 4-bit. The E8
codebook fits in L1 cache, enabling inference speedups over FP16.
- **AQLM** v1.1.7 (April 2025): Additive quantization achieving Pareto optimality
below 3 bpw. Outperforms QuIP# on MoE models at 2-bit. Added arbitrary
8-dimensional codebooks on GPU.

Both require PyTorch/CUDA for dequantization kernels. Neither has native llama.cpp
integration or AMD support. They represent the theoretical frontier of what is
achievable at extreme compression but are not practical for Strix Halo today.

**Status**: Research. Watch for llama.cpp integration of QTIP (via ExLlamaV3/EXL3).

### 2.4 AWQ vs GPTQ vs GGUF on AMD

For AMD GPUs in the llama.cpp ecosystem:

- **GGUF**: The only practical choice. Native llama.cpp support with ROCm/Vulkan
acceleration. K-quants and I-quants are well-optimized.
- **AWQ/GPTQ**: Require Marlin kernels for competitive speed (741 tok/s with
Marlin-AWQ vs 67 tok/s without on NVIDIA). Marlin kernels are CUDA-only. On AMD,
these formats are accessible via vLLM or Hugging Face Transformers with ROCm, but
not through llama.cpp.
- **Performance hierarchy on AMD (via vLLM)**: GPTQ and AWQ with Marlin kernels are
fastest on NVIDIA; on AMD ROCm, the performance advantage over GGUF is minimal
and setup complexity is higher.

**Recommendation**: GGUF for llama.cpp on Strix Halo. AWQ/GPTQ only if using vLLM.

### 2.5 Mixed-Precision and Layer-Wise Quantization

Active research area with direct practical implications:

- **Attention vs FFN sensitivity**: Attention layers (QKV projections, output
projection) have varying sensitivity. FFN layers are often the largest component
and frequent targets for aggressive quantization (INT4).
- **Channel-Wise Mixed-Precision (CMPQ)**: Allocates quantization precision per
weight channel based on activation distributions. Adapts to any bit-width.
- **HOBBIT for MoE**: Maintains FP16 and INT4 versions of experts simultaneously.
Hot experts stay at FP16; cold experts use INT4 or even INT2. This concept is
partially implemented in Unsloth Dynamic 2.0's per-layer approach.
- **Fine-Grained Mixed Precision (FGMP)**: Goes below row-level granularity to
handle unstructured sensitivity patterns in both weights and activations.

**Relevance**: Unsloth Dynamic 2.0 already implements the practical version of
layer-wise mixed precision for GGUF. The research frontier is moving toward
sub-layer and channel-level mixed precision.

### 2.6 KV Cache Quantization

- **TurboQuant** (ICLR 2026): Being integrated into llama.cpp. TQ3 (3-bit) achieves
4.9x compression vs FP16 KV cache; TQ4 (4-bit) achieves 3.8x. This directly
reduces memory pressure for long-context inference.
- **llama.cpp native**: Already supports Q8_0 and Q4_0 KV cache quantization via
`--cache-type-k` and `--cache-type-v` flags.

**Relevance**: High. On a 64 GB system, KV cache can consume significant memory for
long contexts. Q4_0 KV cache is recommended; TurboQuant will push this further.

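For example, a long-context server launch with a Q4_0 KV cache might look like this (a sketch; quantizing the V cache generally requires flash attention to be enabled, and flag spellings have shifted across llama.cpp versions):

```shell
./build/bin/llama-server -m model.gguf -c 32768 \
  --flash-attn \
  --cache-type-k q4_0 \
  --cache-type-v q4_0
```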
---

## 3. Attention Optimization

### 3.1 Flash Attention on AMD

Current status for RDNA 3.5 / gfx1151:

- **Triton backend**: Supports CDNA and RDNA GPUs with fp16, bf16, fp32. This is
the primary Flash Attention path for non-Instinct AMD GPUs.
- **PyTorch integration**: Since PyTorch 2.5.0, `F.scaled_dot_product_attention`
automatically uses Flash Attention on RDNA cards via the Triton backend.
- **llama.cpp WMMA Flash Attention**: Enabled via `-DGGML_HIP_ROCWMMA_FATTN=ON`.
Uses RDNA 3.5's WMMA instructions for matrix multiply within the attention kernel.
This is the fastest path for long-context inference on Strix Halo.
- **CK (Composable Kernel) backend**: Supports MI200x, MI250x, MI300x, MI355x.
Not available for RDNA consumer GPUs.

**Gap**: Flash Attention 3 (with asynchronous pipelines and FP8 attention) is
NVIDIA Hopper-specific. No AMD equivalent exists.

### 3.2 SageAttention

SageAttention (ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight) achieves 2-5x speedup
over FlashAttention through quantized attention (8-bit Q/K matrices, FP16 values).
SageAttention3 further uses FP4 Tensor Cores on Blackwell GPUs.

**AMD status**: SageAttention's Triton implementation could theoretically work on
AMD GPUs, but no AMD-optimized kernels exist. The quantized attention concept is
sound and could be adapted.

**Status**: Watch. Would be high-impact if ported to AMD.

### 3.3 Paged Attention

Paged Attention (vLLM) manages KV cache as non-contiguous memory pages, eliminating
60-80% of memory waste from fragmentation. llama.cpp's server mode implements a
simplified version of this for concurrent request handling, but the full PagedAttention
system is more mature in vLLM.

**Relevance**: Moderate for single-user. High for multi-user serving.

### 3.4 GQA/MQA Architecture Implications

Modern models (Llama 2/3, Mistral, Qwen) use Grouped Query Attention:

- GQA reduces KV cache by up to 90% vs MHA (Multi-Head Attention)
- 30-40% faster inference than MHA with near-equivalent accuracy
- Enables larger batch sizes due to smaller memory footprint

**Practical impact**: When choosing models for Strix Halo, prefer GQA models. All
modern model families (Llama 3, Qwen 3, Gemma 3, Mistral) use GQA. Avoid older MHA
models when alternatives exist.

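To make the saving concrete, here is a back-of-the-envelope FP16 KV cache size at 8K context using Llama-3-8B-like dimensions (32 layers, 32 query heads, 8 KV heads, head dim 128; example values, not from the list above). At this 8:32 head ratio the saving is 4x rather than the 90% best case:

```shell
# bytes = 2 (K and V) * layers * kv_heads * head_dim * ctx_len * 2 (fp16)
mha_gib=$(awk 'BEGIN{print 2*32*32*128*8192*2 / 2^30}')  # MHA: 32 KV heads
gqa_gib=$(awk 'BEGIN{print 2*32*8*128*8192*2 / 2^30}')   # GQA:  8 KV heads
echo "8K-context KV cache: MHA ${mha_gib} GiB vs GQA ${gqa_gib} GiB"
```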
### 3.5 Ring Attention and Linear Attention

- **Ring Attention**: Distributes long sequences across multiple devices. Achieves
1M context prefill in 77s with 93% parallelization efficiency. Not applicable to
single-device Strix Halo.
- **Linear Attention**: Reduces KV cache from O(n) to O(1) and computation from
O(n^2) to O(n). The Ring-Linear models (hybrid softmax + linear attention) reduce
inference cost to 1/10 of dense models. This is a model architecture choice, not
a runtime optimization.

**Relevance**: Linear attention models would be transformative for long-context on
Strix Halo. Watch for Qwen, DeepSeek, or Llama variants with hybrid attention.

---

## 4. MoE-Specific Optimizations

### 4.1 Expert Offloading on Unified Memory

On discrete GPU systems, MoE inference involves expensive PCIe transfers of expert
weights between CPU RAM and GPU VRAM. On Strix Halo's unified memory, this bottleneck
is fundamentally different:

- All expert weights reside in the same physical memory accessible to both CPU and
GPU. There is no PCIe transfer cost.
- The bottleneck shifts to **memory bandwidth**: at 256 GB/s, loading a 2 GB expert
takes ~7.8 ms. With GGUF Q4 quantization, experts are 4x smaller, reducing this
to ~2 ms.
- **Implication**: Unified memory eliminates the offloading problem but does not
eliminate the bandwidth problem. The optimization focus should be on reducing the
number of expert weights that must be read per token.

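The bandwidth arithmetic above can be checked directly (expert sizes are the illustrative 2 GB FP16 and ~0.5 GB Q4 figures from the text):

```shell
# milliseconds to stream one expert's weights at 256 GB/s
fp16_ms=$(awk 'BEGIN{printf "%.1f", 2 / 256 * 1000}')    # 2 GB FP16 expert
q4_ms=$(awk 'BEGIN{printf "%.1f", 0.5 / 256 * 1000}')    # ~0.5 GB after Q4
echo "per-expert read: ${fp16_ms} ms (FP16) vs ${q4_ms} ms (Q4)"
```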
### 4.2 Expert Caching and Prediction

The research frontier in 2025-2026 focuses on predicting which experts will be needed:

- **OD-MoE**: 99.94% expert activation prediction accuracy, delivering ~75% of
fully GPU-cached speed using 1/3 GPU memory.
- **MoE-SpeQ**: Uses a small draft model to predict expert sequences, enabling
prefetching. Combines speculative decoding with expert prediction.
- **SP-MoE**: First speculative-decoding-aware expert offloading framework. Achieves
1.07-3.5x TPOT speedup by exploiting structural correspondence between draft
and target models.
- **SliceMoE**: Dynamic Bit-Sliced Caching -- caches experts at sub-expert
granularity, assigning precision on demand.
- **FlashMoE**: ML-based cache replacement for SSD-based expert offloading on edge.

**Relevance for Strix Halo**: Expert caching is less critical when all experts fit
in memory, but expert prediction can still help by enabling **prefetching into L2/
Infinity Cache** before the expert is needed, reducing effective memory latency.

### 4.3 Expert Pruning

- Static pruning: Remove least-used experts entirely (MC-SMoE, EEP). Can reduce
active parameters by up to 96.875% (TSEP). Requires fine-tuning.
- Dynamic pruning: Skip experts below an activation threshold at inference time.
38.2% FLOPs reduction with 1.32x speedup (Li et al.).
- **DynMoE**: 9% FLOPs reduction, 1.37x speedup through dynamic gating.

**Relevance**: Moderate. Dynamic expert skipping could reduce memory bandwidth
requirements on Strix Halo, but requires model-specific configuration.

### 4.4 MoE Quantization -- Inactive Expert Compression

HOBBIT maintains multiple precision versions of experts: FP16 hot experts, INT4 cold
experts, INT2 for rarely-used experts. On unified memory, a variant of this approach
could keep the working set of experts at higher precision while storing rarely-activated
experts at aggressive quantization, reducing total memory footprint.

MoE-CSP achieves 26x speedup through 4-bit/8-bit quantization with custom CUDA
kernels. QMoE achieves 20x memory reduction but lacks efficient 1-bit kernel support.

**Practical approach for Strix Halo**: Use Unsloth Dynamic 2.0 GGUFs, which already
implement per-layer (including per-expert) precision allocation.

---

## 5. Memory Bandwidth Optimization

### 5.1 The Fundamental Bottleneck

LLM inference (especially token generation / decode) is almost always memory-bandwidth
bound. On Strix Halo:

- **Available bandwidth**: 256 GB/s (LPDDR5X-8000, 256-bit bus)
- **Theoretical decode throughput** for a 7B Q4_0 model (~3.5 GB):
256 GB/s / 3.5 GB = ~73 tok/s (assuming 100% utilization)
- **Measured**: ~50 t/s (tg128), implying ~68% bandwidth utilization
- **Infinity Cache effect**: The 32 MB Infinity Cache acts as a bandwidth amplifier.
When working set fits in cache, effective bandwidth can exceed 256 GB/s. For LLM
inference, per-layer weights typically exceed 32 MB, so cache benefit is limited
to KV cache and activations.

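The roofline numbers above reduce to one line of arithmetic (model size and measured rate are the figures quoted in section 1.1):

```shell
# decode upper bound: every generated token streams all weights once
peak=$(awk 'BEGIN{printf "%.0f", 256 / 3.5}')        # GB/s over GB
util=$(awk -v p="$peak" 'BEGIN{printf "%.0f", 50 / p * 100}')
echo "theoretical ~${peak} tok/s; measured 50 tok/s => ~${util}% of peak"
```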
### 5.2 Techniques to Reduce Bandwidth Requirements

| Technique | Bandwidth Reduction | Status on Strix Halo |
|----------------------------|--------------------|-----------------------|
| Lower quantization (Q4->Q3)| ~25% | Available now |
| KV cache quantization (Q4) | ~75% for KV reads | Available now |
| Speculative decoding | 2-3x effective | Available now |
| Expert prediction/caching | Variable (MoE) | Research |
| Weight compression (EXL3) | Up to 8x | No AMD support |
| Activation checkpointing | Reduces peak memory | Available |

### 5.3 Speculative Decoding

The most impactful bandwidth optimization technique available today:

- **Principle**: A small, fast "draft" model generates N candidate tokens. The large
"target" model verifies all N tokens in a single forward pass (batch). Accepted
tokens are "free" -- they required no additional bandwidth from the target model.
- **Speedup**: 2-3x without accuracy loss. NVIDIA demonstrates 3.6x on H200.
- **EAGLE-3**: Lightweight autoregressive head attached to target model internals.
No separate draft model needed.
- **TurboSpec**: Closed-loop control system that dynamically adjusts speculative
parameters based on online feedback.
- **MoE-SpeQ**: Combines speculative decoding with expert prefetching.

**Relevance**: High. Speculative decoding is the single highest-impact optimization
for decode throughput on bandwidth-limited systems like Strix Halo. llama.cpp
supports speculative decoding via `--model-draft`.

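A typical launch pairs the target with a much smaller same-tokenizer draft model (model files and draft-token count here are illustrative; `--draft-max` is the spelling in recent llama.cpp builds):

```shell
./build/bin/llama-server \
  -m qwen2.5-32b-q4_k_m.gguf \
  --model-draft qwen2.5-0.5b-q8_0.gguf \
  --draft-max 16
```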
### 5.4 Prefetching Strategies

- **L2 cache prefetching**: Proactively load KV cache and next-layer weights into
GPU L2 during computation. Achieves 2.15x attention kernel speedup on NVIDIA H20.
- **PRESERVE**: Prefetch model weights from HBM to on-chip cache during communication
operations. Up to 1.6x end-to-end speedup.
- **Strix Halo consideration**: The 32 MB Infinity Cache + 2 MB L2 provides limited
on-chip storage. Prefetching activations and KV cache (which are smaller than
weights) into Infinity Cache during weight reads could help.

### 5.5 Batched Inference

Batching amortizes weight-read cost across multiple requests:

- Single request: ~68% bandwidth utilization on Strix Halo
- Batch of 4: Approaches compute-bound regime for prefill; still bandwidth-bound
for decode on most models
- **Continuous batching** (vLLM, llama.cpp server): 10-20x throughput improvement
over naive batching

**Trade-off**: Batching increases throughput but also increases per-request latency
and memory consumption (KV cache scales linearly with batch size).

---

## 6. OS and Runtime Techniques

### 6.1 Memory Management

**Huge Pages**: Transparent Huge Pages (THP) reduce TLB misses for large model
weights. On Fedora 43, THP is enabled by default. For explicit control:

```bash
# Check current THP setting
cat /sys/kernel/mm/transparent_hugepage/enabled

# For llama.cpp, ensure THP is at least "madvise"
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

For models loaded with mmap, THP automatically promotes 4 KB pages to 2 MB pages,
reducing page faults during inference.

**Memory Locking**: `mlock` prevents model weights from being swapped. llama.cpp's
`--mlock` flag enables this. Critical for systems running other workloads alongside
inference.

**mmap vs direct load**: On Strix Halo with ROCm, `--mmp 0` (disable mmap) is
recommended. mmap causes catastrophically slow model loading when GPU offloading is
active because of the double-copy path through page cache.

### 6.2 Process Pinning and NUMA

Strix Halo is a single-die APU, so NUMA topology is simple (typically 1 NUMA node).
However, CPU core affinity still matters:

```bash
# Pin inference to specific cores, keeping others free for OS
numactl --physcpubind=0-15 llama-server [args]

# Or via taskset
taskset -c 0-15 llama-server [args]
```

**Core isolation**: For minimum-jitter inference:
```bash
# Add to kernel cmdline
isolcpus=0-15 nohz_full=0-15 rcu_nocbs=0-15
```
This prevents the OS from scheduling unrelated tasks on inference cores.

### 6.3 CPU Frequency and Power

```bash
# Set performance governor for consistent throughput
sudo cpupower frequency-set -g performance

# Verify
cpupower frequency-info | grep "current CPU frequency"
```

### 6.4 cgroups v2 for Resource Isolation

Reserve memory and CPU for inference workloads:

```bash
# Enable the controllers for child cgroups, then create the inference cgroup
echo "+memory +cpu +cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
sudo mkdir /sys/fs/cgroup/inference

# Reserve 56 GB for inference (leave 8 GB for system)
echo $((56 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/inference/memory.min

# Pin CPU cores
echo "0-15" | sudo tee /sys/fs/cgroup/inference/cpuset.cpus

# Run inference in the cgroup by moving the current shell into it
echo $$ | sudo tee /sys/fs/cgroup/inference/cgroup.procs
llama-server [args]
```

### 6.5 io_uring for Model Loading

io_uring provides zero-copy, kernel-bypassing I/O that can accelerate initial model
loading. While llama.cpp does not natively use io_uring, the underlying mmap/read
path can benefit from io_uring-based file I/O when loading from NVMe:

- Eliminates context switch overhead during model load
- Enables true async I/O with completion ring buffers
- Most benefit when loading very large models (>32 GB) from storage

**Practical impact**: Minor for Strix Halo since model loading is a one-time cost,
and LPDDR5X bandwidth far exceeds NVMe read speeds.

### 6.6 eBPF-Based Performance Monitoring

eBPF enables zero-instrumentation monitoring of inference workloads:

```bash
# Monitor GPU DRM scheduler jobs (works with amdgpu driver)
sudo bpftrace -e 'tracepoint:gpu_scheduler:drm_sched_job { printf("GPU job: %s\n", str(args->name)); }'

# Track page faults during model loading
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'

# Monitor context switches on inference cores
sudo bpftrace -e 'tracepoint:sched:sched_switch /cpu == 0/ { @[args->next_comm] = count(); }'
```

The eunomia project provides ready-made eBPF programs for AI workload monitoring.

---

## 7. Emerging Hardware Features

### 7.1 AMD XDNA NPU

The Ryzen AI MAX+ 395 includes an XDNA 2 NPU rated at 50 TOPS. Current status for
LLM inference:

- **Software stack**: AMD Ryzen AI Software supports ONNX model execution on the NPU.
AMD Quark provides quantization for NPU deployment (SmoothQuant, GPTQ, Quarot).
- **LLM capability**: The NPU can accelerate small models and specific operations
(attention heads, small expert networks) but cannot run full large LLMs.
- **Linux support**: Kernel 7.1 (expected 2026) brings significant XDNA upstreaming.
Current Linux support is limited compared to Windows.
- **Practical use**: The NPU could potentially handle a speculative decoding draft
model while the GPU runs the main model. This is not yet implemented in any
inference engine.

**Status**: Not viable for LLM inference in March 2026. Watch for Linux kernel 7.1
and llama.cpp NPU backend development.

### 7.2 RDNA 3.5 Matrix Cores (WMMA)

The Radeon 8060S (gfx1151) has the same WMMA instruction set as RDNA 3 (gfx11xx),
which is a generation behind RDNA 4 (gfx12xx):

**RDNA 3 / 3.5 (gfx1151) WMMA capabilities**:

- FP16/BF16: 512 FLOPS/clock/CU
- INT8: 1024 OPS/clock/CU
- 16x16 matrix dimensions
- Requires inter-lane data shuffling for chained operations

**RDNA 4 (gfx12xx) improvements over RDNA 3.5**:

- FP16/BF16: 1024 FLOPS/clock/CU (2x)
- INT8: 2048 OPS/clock/CU (2x)
- New FP8/BF8 formats at 4x the FP16 rate
- 2:4 structured sparsity support (effectively another 2x)
- No inter-lane shuffling needed for chained WMMA (major efficiency gain)
- New efficient matrix load instruction

**Current usage in llama.cpp**: WMMA is used for Flash Attention
(`GGML_HIP_ROCWMMA_FATTN`) and matrix-multiply quantized (`MMQ`) kernels. The
ROCm 7.x regression for gfx1151 (issue #17917) specifically affects MMQ kernels.
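
The per-CU rates above translate into a rough peak-throughput figure. A back-of-envelope sketch, assuming 40 CUs and a ~2.9 GHz boost clock for the 8060S (both figures are assumptions, not stated in this document):

```shell
# Back-of-envelope peak FP16 WMMA throughput for the Radeon 8060S.
# Assumed (not from this document): 40 CUs, ~2.9 GHz boost clock.
cus=40
flops_per_clock_per_cu=512   # FP16/BF16 rate from the list above
clock_mhz=2900
# 512 FLOPS/clock/CU * 40 CU * 2900 MHz = MFLOP/s; divide by 10^6 for TFLOPS
tflops=$(( flops_per_clock_per_cu * cus * clock_mhz / 1000000 ))
echo "~${tflops} TFLOPS FP16 peak"   # ~59 TFLOPS
```

At those assumed clocks this lands near 59 TFLOPS FP16 peak; an RDNA 4 part with the same CU count and clock would double it.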

### 7.3 Vulkan Cooperative Matrices

The `VK_KHR_cooperative_matrix` Vulkan extension was merged into the RADV driver
for RDNA 3+ hardware. This provides a portable API for matrix operations that maps
to WMMA hardware:

- Enables inference engines to use matrix cores through Vulkan instead of
  vendor-specific ROCm/HIP APIs
- llama.cpp's Vulkan backend could leverage this for WMMA-accelerated matrix
  operations
- Currently less optimized than native HIP/ROCm paths

**Status**: Available in Mesa 25.x. Watch for llama.cpp Vulkan backend improvements
using cooperative matrices.

### 7.4 Infinity Cache for Inference

Strix Halo has a 32 MB Infinity Cache (MALL -- Memory Attached Last Level):

- **Architecture**: L1 (256 KB/shader array) -> L2 (2 MB) -> Infinity Cache (32 MB)
  -> LPDDR5X
- **Latency**: Slightly higher than discrete GPU Infinity Cache implementations
- **Hit rate**: Varies by workload. Graphics benchmarks show ~73% hit rate at peak.
- **LLM inference implications**: For a 7B Q4 model (~3.5 GB), per-layer weights
  are ~70-140 MB, far exceeding the 32 MB cache. Benefit is limited to:
  - KV cache for current context (fits well for shorter contexts)
  - Activations and intermediate results
  - Embedding layer (often accessed repeatedly)
  - Small models/layers that fit entirely in cache

The Infinity Cache is most impactful as a bandwidth amplifier -- when inference
accesses exhibit temporal locality (same data accessed multiple times within a
short window), effective bandwidth exceeds the 256 GB/s DRAM limit.
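
To see why the KV cache is the most plausible beneficiary, it helps to work out the per-token footprint. A sketch assuming a hypothetical 32-layer model with 128-dim heads and an FP16 cache (the layer/head counts are illustrative, not from this document):

```shell
# Per-token KV cache footprint: 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Hypothetical 32-layer model, 128-dim heads, FP16 cache (2 bytes/element).
mha=$(( 2 * 32 * 32 * 128 * 2 ))   # 32 KV heads (full multi-head attention)
gqa=$(( 2 * 32 * 8  * 128 * 2 ))   # 8 KV heads (grouped-query attention)
mall=$(( 32 * 1024 * 1024 ))       # 32 MB Infinity Cache
echo "MHA: $(( mha / 1024 )) KB/token -> $(( mall / mha )) tokens fit"
echo "GQA: $(( gqa / 1024 )) KB/token -> $(( mall / gqa )) tokens fit"
```

With GQA the cache holds a few hundred tokens of hot KV data (256 in this sketch, versus 64 for MHA), which is where the bandwidth-amplifier effect can actually materialize.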

---

## 8. Model-Level Optimizations

### 8.1 Prompt Compression

- **LLMLingua / LLMLingua-2** (Microsoft): Compresses input prompts by removing
  low-information tokens. 20x compression with a 1.5-point performance drop.
  1.7-5.7x end-to-end inference speedup. LLMLingua-2 is 3-6x faster than v1.
  Integrated into LangChain and LlamaIndex.
- **500xCompressor**: Compresses contexts into a single special token. 6x-480x
  compression. Adds only 0.25% parameters. More aggressive but less mature.

**Relevance**: High for RAG and agentic workloads where prompts are long. Reduces
both prefill time and KV cache memory.

### 8.2 Speculative Decoding (Model-Level)

Beyond the engine-level implementation described in Section 5.3:

- **Self-speculative decoding**: Model drafts its own tokens using early exit from
  lower layers. No separate draft model needed.
- **EAGLE-3**: Autoregressive head on target model internals. Higher acceptance
  rates than separate draft models.
- **Draft model latency > accuracy**: Research shows that draft model speed matters
  more than its language modeling accuracy for overall throughput.
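
The throughput math behind the last bullet: under an i.i.d. acceptance assumption, a draft of k tokens with per-token acceptance rate a yields an expected (1 - a^(k+1)) / (1 - a) tokens per target-model pass. The values below (k = 4, a = 0.7) are illustrative only:

```shell
# Expected tokens emitted per target-model pass with draft length k and
# i.i.d. per-token acceptance rate a: E = (1 - a^(k+1)) / (1 - a).
awk 'BEGIN { a = 0.7; k = 4;
             printf "expected tokens/pass: %.2f\n", (1 - a^(k+1)) / (1 - a) }'
# prints "expected tokens/pass: 2.77"
```

Because E saturates quickly in k, shaving draft-model latency raises end-to-end throughput more reliably than squeezing out a few extra points of acceptance rate.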

### 8.3 Mixture of Depths / Mixture of Recursions

- **Mixture of Depths (MoD)**: Dynamically allocates compute to tokens that need it.
  2-3x inference speedup with minimal quality degradation. Implemented at training
  time -- requires model architecture support.
- **Mixture of Recursions (MoR)** (NeurIPS 2025): Combines parameter sharing with
  adaptive token-level compute. Lightweight routers assign different recursion depths
  to individual tokens. 2x inference throughput with reduced KV cache sizes.

**Relevance**: These are model architecture choices, not runtime optimizations.
Watch for models trained with MoD/MoR architectures.

### 8.4 Structured Pruning

Post-training methods to permanently remove model components:

- **Width pruning**: Remove neurons, attention heads, or embedding channels. Better
  accuracy retention than depth pruning.
- **Depth pruning**: Remove entire layers. More latency reduction per parameter
  removed.
- **LLM-Pruner, SliceGPT, FLAP**: State-of-the-art structured pruning methods.
- **AMP**: Jointly prunes attention heads and MLP neurons.
- **NIRVANA** (2025): Structured pruning reimagined for LLM compression.

**Practical approach**: Structured pruning requires per-model effort and is generally
less practical than quantization for local inference. Exception: if a specific model
is too slow at a given quantization level, pruning the model first and then
quantizing can yield a better speed/quality trade-off.

### 8.5 Token Merging and Pruning

- **TokenSelect** (EMNLP 2025): Dynamic token-level KV cache selection for
  efficient long-context inference and length extrapolation.
- **LightThinker**: Step-by-step compression of chain-of-thought reasoning.
- **Attention sparsity**: Twilight (NeurIPS 2025) uses hierarchical top-p pruning
  for adaptive attention sparsity.

These techniques reduce the effective sequence length during inference, directly
reducing both compute and memory bandwidth requirements.

---

## 9. Prioritized Recommendations for Strix Halo

### Tier 1: Implement Now (High Impact, Available Today)

1. **Use Unsloth Dynamic 2.0 GGUFs** for all models. They provide the best
   quality-per-bit through intelligent layer-wise quantization.

2. **Build llama.cpp with WMMA Flash Attention**: `-DGGML_HIP_ROCWMMA_FATTN=ON`.
   Monitor issue #17917 for the MMQ regression fix. Avoid `-DGGML_HIP_UMA=ON` on
   Strix Halo: its fine-grained managed memory is slower than standard hipMalloc
   with GTT expanded via kernel parameters.

3. **Disable mmap for ROCm**: Always use `-mmp 0` (llama-bench) / `--no-mmap`
   (llama-server, llama-cli) to avoid the double-copy performance penalty.

4. **Enable KV cache quantization**: Use `--cache-type-k q4_0 --cache-type-v q4_0`
   for long-context workloads. Watch for TurboQuant integration.

5. **Set `ROCBLAS_USE_HIPBLASLT=1`**: Forces the optimized hipBLASLt kernels.

6. **Speculative decoding for decode-heavy workloads**: Use `--model-draft` with a
   small model from the same family.

7. **GPU performance governor and frequency pinning**: Ensures consistent throughput.
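
The Tier 1 items combine into a single launch command. This is a sketch only: the model paths are placeholders, and flag spellings should be verified against `llama-server --help` for your build, since they change between releases.

```shell
# Sketch of a llama-server launch combining the Tier 1 items above.
# Model paths are placeholders; verify flag names for your llama.cpp build.
ROCBLAS_USE_HIPBLASLT=1 \
llama-server \
  -m ~/models/main-model.gguf \
  --model-draft ~/models/draft-model.gguf \
  -ngl 99 --no-mmap -fa 1 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -c 32768
```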

### Tier 2: Evaluate (Moderate Impact, Some Setup Required)

8. **LLMLingua-2 for agentic/RAG workloads**: Compress long prompts before inference.
   3-6x prompt processing speedup.

9. **vLLM for multi-user serving**: If running concurrent inference requests
   (e.g., agentic eval pipelines), vLLM's continuous batching and PagedAttention
   provide 10-20x throughput improvement.

10. **cgroups v2 memory reservation**: Prevent the OS from reclaiming GPU-mapped
    memory under memory pressure.

11. **Vulkan backend for short-context workloads**: Test whether the Vulkan/RADV
    path is faster than ROCm for your specific model and context length.

12. **Process pinning** with `numactl` or `taskset` for reduced scheduling jitter.

### Tier 3: Watch and Prepare (High Potential, Not Ready)

13. **KTransformers for >64 GB models**: When running DeepSeek V3 or similar models
    that exceed available memory.

14. **ExLlamaV3/EXL3 AMD support**: If AMD kernels arrive, EXL3's QTIP-based
    quantization could significantly improve quality at extreme compression.

15. **XDNA NPU for draft model acceleration**: If/when llama.cpp adds NPU backend
    support, the NPU could run the draft model for speculative decoding.

16. **SageAttention AMD port**: 2-5x attention speedup through quantized attention.

17. **Linear attention models**: Watch for hybrid softmax/linear attention models
    from major labs that would dramatically improve long-context inference.

18. **Cooperative matrices in Vulkan**: As llama.cpp's Vulkan backend matures, this
    provides a portable path to WMMA acceleration without ROCm dependency.

---

## 10. Sources

### Papers and Conference Proceedings

- Raposo et al., "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models," 2024. https://arxiv.org/abs/2404.02258
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," EMNLP 2023. https://arxiv.org/abs/2305.13245
- Tseng et al., "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks," ICML 2024. https://arxiv.org/abs/2402.04396
- Egiazarian et al., "AQLM: Extreme Compression of Large Language Models via Additive Quantization," ICML 2024. https://arxiv.org/abs/2401.06118
- Chen et al., "KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models," SOSP 2025. https://dl.acm.org/doi/10.1145/3731569.3764843
- Min et al., "Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation," NeurIPS 2025. https://arxiv.org/abs/2507.10524
- Varadarajan et al., "Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures," 2025. https://arxiv.org/abs/2504.11750
- Zandieh et al., "TurboQuant: Extreme KV Cache Quantization," ICLR 2026. https://github.com/ggml-org/llama.cpp/discussions/20969
- Zhang et al., "SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration," ICLR 2025. https://arxiv.org/abs/2410.02367
- Ye et al., "FlashInfer: Efficient and Customizable Attention Engine for LLM Serving," 2025. https://arxiv.org/abs/2501.01005
- Jiang et al., "LLMLingua: Compressing Prompts for Accelerated Inference," EMNLP 2023. https://arxiv.org/abs/2310.05736
- Li et al., "A Survey on Inference Optimization Techniques for Mixture of Experts Models," 2024. https://arxiv.org/abs/2412.14219
- Liu et al., "MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching," 2025. https://arxiv.org/abs/2511.14102
- Zhou et al., "SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE Inference," 2025. https://arxiv.org/abs/2510.10302
- He et al., "SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints," 2025. https://arxiv.org/abs/2512.12990
- Jin et al., "OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference," 2025. https://arxiv.org/abs/2512.03927

### Documentation and Technical References

- AMD ROCm Strix Halo System Optimization: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/strixhalo.html
- AMD GPUOpen -- Using Matrix Cores of RDNA 4: https://gpuopen.com/learn/using_matrix_core_amd_rdna4/
- AMD GPUOpen -- Accelerating Generative AI on Radeon GPUs: https://gpuopen.com/learn/accelerating_generative_ai_on_amd_radeon_gpus/
- vLLM ROCm Blog: https://blog.vllm.ai/2026/02/27/rocm-attention-backend.html
- AMD ROCm vLLM Blog: https://rocm.blogs.amd.com/software-tools-optimization/vllm-omni/README.html
- AMD AI Inference on Ryzen AI NPU with Quark: https://www.amd.com/en/developer/resources/technical-articles/2025/ai-inference-acceleration-on-ryzen-ai-with-quark.html
- Chips and Cheese -- Evaluating Infinity Cache in Strix Halo: https://chipsandcheese.com/p/evaluating-the-infinity-cache-in
- Chips and Cheese -- RDNA 4 Architecture at Hot Chips 2025: https://chipsandcheese.com/p/amds-rdna4-gpu-architecture-at-hot
- Linux Kernel XDNA NPU Documentation: https://docs.kernel.org/accel/amdxdna/amdnpu.html

### Community Resources and Guides

- llama.cpp ROCm Performance Discussion: https://github.com/ggml-org/llama.cpp/discussions/15021
- llama.cpp Strix Halo UMA Detection Bug: https://github.com/ggml-org/llama.cpp/issues/18159
- llama.cpp Strix Halo Performance Regression: https://github.com/ggml-org/llama.cpp/issues/17917
- Strix Halo Wiki -- llama.cpp with ROCm: https://strixhalo.wiki/AI/llamacpp-with-ROCm
- Strix Halo Wiki -- Performance: https://strixhalo.wiki/AI/llamacpp-performance
- AMD Strix Halo Toolboxes: https://github.com/kyuz0/amd-strix-halo-toolboxes
- LLM Tracker -- AMD GPUs: https://llm-tracker.info/howto/AMD-GPUs
- LLM Tracker -- Strix Halo: https://llm-tracker.info/_TOORG/Strix-Halo
- Unsloth Dynamic 2.0 Documentation: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
- Unsloth Dynamic v2.0 Blog: https://unsloth.ai/blog/dynamic-v2
- KTransformers GitHub: https://github.com/kvcache-ai/ktransformers
- ExLlamaV3 GitHub: https://github.com/turboderp-org/exllamav3
- BitNet GitHub: https://github.com/microsoft/BitNet
- LLMLingua GitHub: https://github.com/microsoft/LLMLingua
- MoE Inference Awesome List: https://github.com/MoE-Inf/awesome-moe-inference
- Awesome LLM Inference: https://github.com/xlite-dev/Awesome-LLM-Inference
- Phoronix -- ROCm 7.1 vs Vulkan on AI PRO R9700: https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan
- eunomia -- OS-Level LLM Inference Optimizations: https://eunomia.dev/blog/2025/02/18/os-level-challenges-in-llm-inference-and-optimizations/
- RADV Cooperative Matrix for RDNA4: https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1524861-vulkan-cooperative-matrix-merged-for-rdna4-gpus-with-radv-dcc-support-inches-closer
- Kaitchup -- GGUF Quant Selection: https://kaitchup.substack.com/p/choosing-a-gguf-model-k-quants-i

806
docs/llama-cpp-optimization-research.md
Normal file
@@ -0,0 +1,806 @@

# llama.cpp Runtime and Compilation Optimization for AMD RDNA 3.5 (gfx1151)

Comprehensive research into maximizing inference performance on AMD Strix Halo
(Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified LPDDR5X-8000).
Researched March 2026.

---

## Scope

This document covers every known compilation flag, runtime parameter, environment
variable, and architectural optimization for llama.cpp targeting gfx1151 (RDNA 3.5)
with both ROCm/HIP and Vulkan backends on Fedora. It does not cover vLLM, ollama
internals, or non-llama.cpp inference engines except where their findings inform
llama.cpp optimization.

---

## Table of Contents

1. [Compilation Flags and Build Optimizations](#1-compilation-flags-and-build-optimizations)
2. [Runtime Flags and Environment Variables](#2-runtime-flags-and-environment-variables)
3. [Flash Attention and Attention Backends](#3-flash-attention-and-attention-backends)
4. [Quantization Strategies for Speed](#4-quantization-strategies-for-speed)
5. [Memory Layout and Caching](#5-memory-layout-and-caching)
6. [llama-server Specific Optimizations](#6-llama-server-specific-optimizations)
7. [Upcoming llama.cpp Features (2026)](#7-upcoming-llamacpp-features-2026)
8. [Recommended Configurations](#8-recommended-configurations)
9. [Sources](#9-sources)

---

## 1. Compilation Flags and Build Optimizations

### 1.1 GGML_HIP (ROCm) vs GGML_VULKAN: Which Backend to Build

Both backends are worth building. Neither is universally faster on gfx1151:

| Workload | Winner | Rationale |
|----------|--------|-----------|
| Token generation (short ctx) | Vulkan RADV | Lower driver overhead, mature kernel paths |
| Token generation (long ctx, 8K+) | ROCm + rocWMMA + FA | Maintains speed as context grows; uses less memory |
| Prompt processing (short ctx) | Mixed -- model-dependent | AMDVLK or ROCm hipBLASLt win on some shapes |
| Prompt processing (long ctx) | ROCm + rocWMMA-tuned | 96% speedup over untuned rocWMMA at 65K ctx |
| Memory efficiency at long ctx | ROCm + FA | Less memory than Vulkan equivalent |

Benchmark data (Qwen3-30B-A3B UD-Q4_K_XL, gfx1151, flash attention on):

| Backend | pp512 t/s | tg128 t/s | pp512@130K t/s | tg128@130K t/s |
|---------|-----------|-----------|----------------|----------------|
| Vulkan RADV | 755.14 | 85.11 | 17.24 | 12.54 |
| Vulkan AMDVLK | 741.60 | 81.79 | 10.75 | 3.51 |
| ROCm hipBLASLt | 651.93 | 63.95 | 40.35 | 4.97 |
| ROCm rocWMMA-tuned | 659.07 | 67.66 | 51.12 | 13.33 |

Key insight: RADV scales significantly better than AMDVLK for long contexts (3.6x
faster tg at 130K depth). ROCm with tuned rocWMMA provides the best long-context
prompt processing (3x faster than RADV at 130K).

### 1.2 ROCm Build Flags

**Minimal build:**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

**Optimized build (recommended):**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

#### Critical ROCm build flags:

| Flag | Effect | Recommendation |
|------|--------|----------------|
| `-DGGML_HIP=ON` | Enable HIP/ROCm backend | Required |
| `-DAMDGPU_TARGETS="gfx1151"` | Target Strix Halo GPU | Required -- do not use gfx1100 |
| `-DGGML_HIP_ROCWMMA_FATTN=ON` | Enable rocWMMA flash attention | Strongly recommended for pp |
| `-DGGML_HIP_GRAPHS=ON` | HIP graph kernel scheduling | Test -- may help reduce launch overhead |
| `-DGGML_HIP_NO_VMM=OFF` | Re-enable Virtual Memory Management | Default is disabled; test if needed |

#### Flags to be aware of but NOT set by default:

| Flag | Notes |
|------|-------|
| `-DGGML_HIP_UMA=ON` | Uses hipMallocManaged for UMA. **Avoid on Strix Halo** -- it uses fine-grained memory that is significantly slower. Standard hipMalloc + GTT expansion via kernel params is faster. |
| `-DGGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` | Forces FP16 compute in hipBLAS. Documented for RDNA4 -- may help pp performance on gfx1151. **Test before deploying.** |
| `-DGGML_CUDA_FA_ALL_QUANTS=ON` | Compiles all KV cache quant type combinations for FA. **Works for CUDA kernels which HIP reuses via hipify.** Increases compilation time substantially. Enable if you need quantized KV cache with flash attention. |

#### ROCm version considerations:

- **ROCm 7.2**: Known rocWMMA compilation issue (ambiguous template specializations in `mfma_impl.hpp`). Fixed in later point releases or by disabling rocWMMA.
- **ROCm 7.0 RC / 7.1**: Generally work well with gfx1151.
- **ROCm 6.4.4**: Some users report better performance than 7.x for certain workloads. The gfx1151 rocBLAS kernel regression means hipBLASLt is essential.
- **Recommendation**: Use ROCm 7.2+ with rocWMMA patches, or pre-built toolbox containers that have been validated for gfx1151.

### 1.3 Vulkan Build Flags

```bash
cmake -B build -S . \
  -DGGML_VULKAN=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

The Vulkan build auto-detects cooperative matrix support (KHR_coopmat) at runtime.
gfx1151 with RADV reports `matrix cores: KHR_coopmat` in llama-bench logs.

#### Vulkan-specific considerations:

| Topic | Detail |
|-------|--------|
| **RADV vs AMDVLK** | RADV (Mesa) is recommended for gfx1151. Better long-context scaling, no 2GB buffer allocation limit. |
| **AMDVLK buffer limit** | AMDVLK caps single Vulkan allocations at ~2 GiB (`VkPhysicalDeviceMaintenance3Properties::maxMemoryAllocationSize`). RADV allows ~4 GiB. This causes OOM for models with large compute buffers. |
| **CoopMat1 vs CoopMat2** | gfx1151 supports KHR_coopmat (CoopMat1). CoopMat2 (`VK_NV_cooperative_matrix2`) is NVIDIA-only. This means Vulkan flash attention on AMD falls back to CPU -- use ROCm for GPU-accelerated FA. |
| **Shader compilation** | Building from source with `glslc` available enables cooperative matrix shader variants. Pre-built binaries may omit them. |

### 1.4 LTO and PGO

llama.cpp does not have built-in LTO/PGO support in its CMake configuration.
You can enable LTO manually:

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DCMAKE_BUILD_TYPE=Release
```

Expected benefit: 2-5% improvement in CPU-bound paths. The GPU kernels are
compiled by the HIP/ROCm compiler and are not affected by host LTO.

PGO would require a two-pass build (instrument, profile, rebuild) and is not
commonly done for llama.cpp. The dominant bottleneck is GPU kernel performance
and memory bandwidth, not host-side code paths.

### 1.5 Compiler Tuning for ROCm

A known LLVM regression affects loop unrolling on RDNA. The following flag has
been reported to help:

```bash
-DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
```

This increases the unrolling threshold for local memory operations, which can
improve kernel performance for flash attention and matrix multiplication.

---

## 2. Runtime Flags and Environment Variables

### 2.1 ROCm Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `ROCBLAS_USE_HIPBLASLT=1` | **Critical** | Switches from rocBLAS Tensile kernels to hipBLASLt. On gfx1151, default rocBLAS achieves only 5.76 TFLOPS (<9% efficiency). hipBLASLt achieves >60% efficiency. **This is a 2-7x improvement for prompt processing.** |
| `HSA_OVERRIDE_GFX_VERSION=11.5.1` | Set inside toolbox containers | Required for ROCm to recognize gfx1151. Set in container, not by host scripts. |
| `HSA_ENABLE_SDMA=0` | Optional | Disables SDMA engine. May help on some configurations, but generally not needed on Strix Halo with recent kernels. |
| `HIP_VISIBLE_DEVICES=0` | Optional | Select specific GPU device. Useful in multi-GPU or container setups. |
| `GPU_MAX_HEAP_SIZE=100` | Optional | Allow 100% of GPU memory for heap. Default may be lower. |
| `GPU_MAX_ALLOC_PERCENT=100` | Optional | Allow single allocation up to 100% of GPU memory. |
| `ROCR_VISIBLE_DEVICES=0` | Optional | HSA-level device visibility control. |
| `AMD_LOG_LEVEL=0` | Optional | Suppress AMD driver logging noise. |

**The single most impactful environment variable is `ROCBLAS_USE_HIPBLASLT=1`.**
Without it, ROCm pp512 on Llama-2-7B drops from 882 t/s to 348 t/s (~2.5x slower).

### 2.2 Vulkan Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `AMD_VULKAN_ICD=RADV` | Recommended | Force RADV driver (skip AMDVLK). |
| `RADV_PERFTEST=nogttspill` | **Important** | Fixes GTT memory spilling issues on RADV. Can resolve significant pp performance drops (especially with FA off). |
| `GGML_VK_VISIBLE_DEVICES=0` | Optional | Select Vulkan device index. |
| `GGML_VULKAN_DISABLE_F16=1` | Debugging | Force FP32 compute. Slower but useful for debugging precision issues. |
| `GGML_LOG_LEVEL=2` | Debugging | Verbose logging to verify coopmat detection. |

### 2.3 Thread Count (`-t` flag)

For GPU-dominant inference (all layers offloaded), the thread count has minimal
impact on throughput. The recommendation:

- **Single-user inference**: `-t 4` to `-t 8` (enough for tokenization/sampling overhead)
- **Server with parallel slots**: `-t` equal to the physical core count (16 on the Ryzen AI MAX+ 395)
- **Hybrid CPU+GPU (partial offload)**: `-t` equal to the number of physical cores

The Ryzen AI MAX+ 395 has 16 full Zen 5 cores (unlike Strix Point, there are no
Zen 5c cores). For llama.cpp, using all physical cores (`-t 16`) is a reasonable
default for CPU-involved workloads.

### 2.4 Batch Size Tuning (`-b` and `-ub`)

| Flag | Default | Role |
|------|---------|------|
| `-b` / `--batch-size` | 2048 | Logical batch size (application level) |
| `-ub` / `--ubatch-size` | 512 | Physical batch size (device level) |

Tuning guidance for gfx1151:

- **MoE models**: `-b 256` significantly improves pp512 (reported 70% improvement on Qwen3-30B-A3B)
- **Dense models**: Default `-b 2048` is generally fine
- **Long context**: `-ub 2048` can improve performance, but test against OOM
- **Ultra-long context**: Reduce `-ub` if memory allocation fails

The Vulkan backend blog post for Strix Halo recommends `-c 32768 -b 4096 -ub 256`
for a good balance of performance and memory.
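
Because the optimal batch size is model-dependent, a short llama-bench sweep is the most reliable way to choose. A sketch only -- the model path is a placeholder:

```shell
# Sweep logical batch sizes with llama-bench; the model path is a placeholder.
# -mmp 0 disables mmap (see Section 2.5); -fa 1 enables flash attention.
for b in 256 512 1024 2048; do
  ./build/bin/llama-bench -m model.gguf -b "$b" -fa 1 -mmp 0
done
```

Compare the pp512 column across runs and keep the fastest `-b`; repeat with `-ub` if memory allows.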

### 2.5 Memory-Mapped Loading (`-mmp` / `--no-mmap`)

**Critical finding for unified memory APUs:**

> When you load large models to the GPU, memory mapping can make loading moderately
> slower for Vulkan, and **catastrophically slower for ROCm**. You should always set
> `-mmp 0` or `--no-mmap` to improve model loading times on Strix Halo.

For `llama-bench`, use `-mmp 0`. For `llama-server`/`llama-cli`, use `--no-mmap`.

On Strix Halo, both "GPU memory" and "CPU memory" share the same physical LPDDR5X.
The difference is which pages are mapped for GPU access. GPU-mapped pages have full
bandwidth (~215 GB/s). CPU-accessed pages get less than half (~84 GB/s for
CPU-to-GPU copies).
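
The ~215 GB/s figure also puts a hard ceiling on dense-model decode speed, since every weight is read once per generated token. A rough sketch using ~4.1 GB as an approximate Q4_K_M file size for a 7B model (the file size is an assumption, not from this document):

```shell
# Decode ceiling for a dense model: bandwidth / bytes read per token.
# ~4.1 GB approximates a 7B Q4_K_M file; 215 GB/s is the GPU-mapped figure above.
awk 'BEGIN { printf "ceiling: %.0f t/s\n", 215 / 4.1 }'
# prints "ceiling: 52 t/s"
```

The measured ~40 t/s for a 7B Q4_K_M model in Section 3.2 is about three quarters of this ceiling, consistent with decode being bandwidth-bound.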

**Always use `-ngl 99` (or higher) to ensure all layers are in GPU memory.**
Even on a unified memory system, GPU memory paths provide 2x the bandwidth.

### 2.6 GPU Layer Offloading (`-ngl`)

For Strix Halo with 64 GB unified memory:

- **Models < 50 GB**: `-ngl 99` offloads everything. No tuning needed.
- **Models 50-60 GB**: `-ngl 99` should still work with GTT expanded via kernel params.
- **Models > 60 GB**: May need partial offload. Use `-ngl <N>` where N is tuned to
  keep GPU memory under the GTT limit. Remaining layers run on the CPU at ~1/2 bandwidth.

**Never let the GPU spill to system RAM paths** -- performance will be worse than pure CPU.

---

## 3. Flash Attention and Attention Backends

### 3.1 When to Enable Flash Attention

**Rule of thumb for gfx1151:**

| Backend | Flash Attention | Recommendation |
|---------|----------------|----------------|
| ROCm + rocWMMA | `-fa 1` | **Always enable.** 24% pp improvement, maintains tg speed, uses less memory. |
| ROCm without rocWMMA | `-fa 1` | Enable, but smaller improvement. |
| Vulkan RADV | `-fa 1` | **Enable for short contexts.** Minor improvement at pp512/tg128. At long contexts, Vulkan FA may degrade performance. |
| Vulkan AMDVLK | `-fa 1` | Similar to RADV. |

**Key caveat**: Vulkan flash attention on AMD uses CoopMat1 (KHR_coopmat), not
the more efficient CoopMat2 (NVIDIA-only). For AMD, ROCm + rocWMMA is the superior
FA path.

### 3.2 rocWMMA Flash Attention Performance

Benchmark on gfx1151 (Llama-2-7B Q4_K_M):

| Configuration | pp512 t/s | tg128 t/s |
|---------------|-----------|-----------|
| HIP standard | 592.28 | 40.40 |
| HIP + hipBLASLt | 548.72 | 40.43 |
| HIP + rocWMMA + hipBLASLt | 1006.80 | 39.46 |
| HIP + rocWMMA (no hipBLASLt) | 899.73 | 39.45 |

rocWMMA provides ~70% improvement in prompt processing with flash attention.
Token generation is slightly slower (~2%) due to WMMA overhead at small batch.
|
||||||
|
|
||||||
|
### 3.3 The rocWMMA Long-Context Regression and Fix
|
||||||
|
|
||||||
|
The standard rocWMMA implementation has a **long-context decode regression**:
|
||||||
|
at 65K context, tg degrades by up to 57% compared to HIP-only baseline.
|
||||||
|
|
||||||
|
**The fix** (PR #16827, "rocm-wmma-tune" branch) implements:
|
||||||
|
|
||||||
|
1. **`__launch_bounds__(256, 2)`**: Ensures minimum 2 blocks per SM, improving occupancy
|
||||||
|
2. **Adaptive KQ stride**: Uses stride 128 when head dimension <= 128, reducing LDS footprint
|
||||||
|
3. **Selective WMMA usage**: WMMA only for prefill; decode reverts to VEC/TILE kernels
|
||||||
|
|
||||||
|
Results after fix (Llama 3.2 1B Q4_K_M on gfx1151):
|
||||||
|
- pp512 at 65K context: **96% faster** than untuned rocWMMA
|
||||||
|
- tg128 at 65K context: Matches HIP baseline (previously 57% degraded)
|
||||||
|
|
||||||
|
**Status**: This patch is available in `-rocwmma-improved` toolbox builds. It may
|
||||||
|
not be merged into upstream llama.cpp. Check Donato Capitella's toolboxes.
|
||||||
|
|
||||||
|
### 3.4 Vulkan Flash Attention Limitations on AMD
|
||||||
|
|
||||||
|
The Vulkan backend supports three FA paths:
|
||||||
|
|
||||||
|
| Path | Extension | AMD Support |
|
||||||
|
|------|-----------|-------------|
|
||||||
|
| FA_SCALAR | None | Yes (CPU fallback) |
|
||||||
|
| FA_COOPMAT1 | KHR_cooperative_matrix | Yes (gfx1151 reports support) |
|
||||||
|
| FA_COOPMAT2 | NV_cooperative_matrix2 | **No** (NVIDIA-only) |
|
||||||
|
|
||||||
|
FA_COOPMAT1 supports: f16, q4_0, q8_0, f32 KV cache types.
|
||||||
|
FA_COOPMAT2 additionally supports all quant types.
|
||||||
|
|
||||||
|
When Vulkan FA is enabled on AMD with RADV, it uses CoopMat1 for matrix operations.
|
||||||
|
This provides a modest improvement over scalar FA but is significantly less
|
||||||
|
efficient than ROCm + rocWMMA.
|
||||||
|
|
||||||
|
### 3.5 New Attention Models (GatedDeltaNet)
|
||||||
|
|
||||||
|
Models using GatedDeltaNet architecture (Qwen3.5-27B, Qwen3.5-35B-A3B) have
|
||||||
|
severe performance problems on gfx1151:
|
||||||
|
|
||||||
|
- **Vulkan**: No GATED_DELTA_NET compute shader exists; ops fall back to CPU
|
||||||
|
- **ROCm/HIP**: Kernel cross-compiles but suffers from register spilling (float s[S_v]
|
||||||
|
allocates up to 512 bytes per thread) and hipMemcpyWithStream bottleneck (92-95%
|
||||||
|
of decode time on models >15GB)
|
||||||
|
|
||||||
|
Result: Qwen3.5-27B runs at ~12 t/s on gfx1151 vs expected 50-80 t/s.
|
||||||
|
**Avoid GatedDeltaNet models on gfx1151 until kernel optimization lands.**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Quantization Strategies for Speed
|
||||||
|
|
||||||
|
### 4.1 Quantization Speed on RDNA 3.5
|
||||||
|
|
||||||
|
Token generation speed is dominated by memory bandwidth, not compute. Smaller
|
||||||
|
quantizations are faster because they reduce bytes-per-weight, allowing more
|
||||||
|
tokens per second within the ~215 GB/s bandwidth envelope.
|
||||||
|
|
||||||
|
Approximate throughput formula for decode (bandwidth-bound):
|
||||||
|
```
|
||||||
|
tg_tokens/s ≈ effective_bandwidth_GB/s / model_size_bytes * 1e9
|
||||||
|
```
|
||||||
|
|
||||||
|
For a 7B Q4_K_M model (~4.1 GB):
|
||||||
|
```
|
||||||
|
215 / 4.1 ≈ 52 t/s (theoretical max; practical ~50 t/s on gfx1151)
|
||||||
|
```
|
||||||
|
|
||||||
|
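The formula can be sanity-checked in a few lines; a minimal sketch using the figures quoted above:

```python
def tg_estimate(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical decode tokens/s when every token must stream all weights."""
    return bandwidth_gb_s / model_size_gb

# 7B Q4_K_M (~4.1 GB) on the ~215 GB/s GPU memory path:
print(round(tg_estimate(215, 4.1)))  # ~52 t/s theoretical ceiling
```

Real decode speed sits a few percent below this ceiling because the KV cache and activations also consume bandwidth.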
### 4.2 Quantization Type Comparison

| Quant | Bits/Weight | Quality | Speed (relative) | Notes |
|-------|-------------|---------|------------------|-------|
| Q4_0 | 4.0 | Low | Fastest | Legacy. Simple dequant. |
| Q4_K_M | 4.83 | Good | Very fast | K-quant with hierarchical blocks. Recommended default. |
| IQ4_XS | 4.25 | Good | Fast | Importance-weighted. Better quality/bit than Q4_K_M. |
| Q5_K_M | 5.69 | Very good | Fast | Sweet spot for quality-sensitive use. |
| Q6_K | 6.56 | Excellent | Moderate | Near-lossless quality. |
| Q8_0 | 8.0 | Near-perfect | Slower | 2x the bytes of Q4_K_M, ~2x slower tg. |
| F16 | 16.0 | Perfect | Slowest | Reference baseline. |

**For RDNA 3.5 specifically**:

- **Q4_K_M** is the best general-purpose quantization. The K-quant family uses
  hierarchical super-blocks (256 values) with per-sub-block scales, providing
  better quality than Q4_0 at a marginally higher dequant cost that is invisible
  at the GPU level.

- **Q4_0** has the simplest dequant kernels and is marginally faster than Q4_K_M
  on some GPU backends. However, the quality loss is significant. Use only for
  smoke tests or when every t/s matters more than quality.

- **IQ4_XS** (importance-matrix quantized) offers better quality per bit than
  Q4_K_M. Speed is similar. Requires an importance matrix file during quantization.
  **Recommended over Q4_K_M when you control the quantization process.**

- **Q8_0** does NOT have special hardware-accelerated dequant on RDNA 3.5.
  RDNA 3.5 lacks INT8 tensor core equivalents. Q8_0 performance relies on the
  same FP16 compute paths, just with more memory bandwidth consumed.

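The bits-per-weight column maps directly onto file size; a rough sketch (actual GGUF files differ slightly because embedding and output tensors often use a different quant than the bulk of the weights):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate model file size from parameter count and quant density."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate file sizes for a 7B-parameter model:
for quant, bpw in [("Q4_K_M", 4.83), ("Q5_K_M", 5.69), ("Q8_0", 8.0)]:
    print(f"7B {quant}: ~{model_size_gb(7e9, bpw):.1f} GB")
```

Combined with the bandwidth formula in 4.1, this lets you estimate the tg ceiling for any quant before downloading it.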
### 4.3 Importance Matrix (imatrix) Quantization

imatrix quantization records how much each weight affects output quality, then
allocates more precision bits to important weights. This is essential for
sub-4-bit quantizations (IQ2_XS, IQ3_XXS, IQ4_XS) where standard K-quant
shows measurable degradation.

```bash
# Generate importance matrix (GPU-accelerated)
llama-imatrix -m model-f16.gguf -f calibration_data.txt -ngl 99 -o imatrix.dat

# Quantize with imatrix
llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4_xs.gguf IQ4_XS
```

**Speed impact**: None. imatrix affects quantization quality, not inference speed.
The dequantization kernels are identical regardless of whether imatrix was used.

### 4.4 Unsloth Dynamic (UD) Quantizations

Unsloth Dynamic 2.0 selectively quantizes different layers at different bit widths,
choosing the optimal quantization per layer based on sensitivity analysis.

**Speed impact**: Minimal to none. UD quants use the same dequant kernels as
standard GGUF quantizations. A UD-Q4_K_XL file runs at the same speed as a
standard Q4_K_M of the same total size.

**Quality impact**: Significantly better. UD consistently outperforms standard
quantizations in 5-shot MMLU and KL divergence metrics at the same total file size.

**Recommendation**: Prefer UD quants (e.g., `UD-Q4_K_XL`, `UD-Q4_K_M`) from
Unsloth when available. They are a free quality upgrade with no speed penalty.

---

## 5. Memory Layout and Caching

### 5.1 KV Cache Quantization

KV cache quantization reduces the memory footprint of the attention cache,
allowing larger context windows within the same memory budget.

| Cache Type | Memory vs F16 | Quality Impact | Recommendation |
|------------|---------------|----------------|----------------|
| f16 (default) | 1.0x | None | Baseline |
| q8_0 | 0.5x | Negligible (+0.002-0.05 ppl) | **Recommended for production** |
| q4_0 | 0.33x | Noticeable (+0.2-0.25 ppl) | Use when memory-constrained |
| q4_1 | 0.33x | Slightly better than q4_0 | Alternative to q4_0 |
| iq4_nl | 0.33x | Better than q4_0 | Best 4-bit KV option |

Usage:
```bash
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 ...
# or for llama-bench:
llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 ...
```

**Performance impact**: Quantizing the K cache slightly **improves** throughput
(less memory to read). Quantizing the V cache may have a slight negative impact.
The overall performance impact is negligible for normal inference.

**Caveat with speculative decoding**: Using KV cache quantization with a draft
model causes a consistent ~16% performance drop, and q4_0 KV with speculative
decoding causes massive acceptance-rate drops. **Avoid KV quant if using
speculative decoding.**

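The multipliers above make context budgeting straightforward; a sketch, where the f16 bytes-per-token figure is a hypothetical example (derive it for a real model from its layer count and KV head dimensions):

```python
# Memory ratios vs f16, from the table above.
KV_RATIO = {"f16": 1.0, "q8_0": 0.5, "q4_0": 0.33, "iq4_nl": 0.33}

def kv_cache_gb(ctx_tokens: int, f16_bytes_per_token: int, cache_type: str) -> float:
    """KV cache footprint for a given context length and cache type."""
    return ctx_tokens * f16_bytes_per_token * KV_RATIO[cache_type] / 1e9

# Hypothetical model with 128 KiB of f16 KV per token, at 32K context:
for t in ("f16", "q8_0", "q4_0"):
    print(t, round(kv_cache_gb(32768, 131072, t), 2), "GB")
```

On a 64 GB machine this is exactly the calculation that decides whether 128K context fits next to a 40 GB model.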
### 5.2 mmap vs Full Load on Unified Memory

On Strix Halo's unified memory architecture:

- **`--no-mmap` is strongly recommended** for both ROCm and Vulkan.
- With mmap enabled, ROCm model loading is "catastrophically slower."
- Vulkan loading is "moderately slower" with mmap.
- Since CPU and GPU share physical RAM, there is no data copy when loading
  to "GPU memory" -- it is just a page table update.

For `llama-bench`: Always use `-mmp 0`.
For `llama-server`/`llama-cli`: Always use `--no-mmap`.

### 5.3 Prompt Caching

llama-server supports two levels of prompt caching:

**1. Automatic KV cache reuse (`cache_prompt: true`)**:
Reuses the KV cache from previous requests when prompts share a common prefix.
The server only reprocesses the suffix that differs.

**2. Host-memory prompt caching (`--cram N`)**:
Stores pre-computed prompt representations in system RAM.
- Reduces TTFT from ~4.2s to ~0.3s for cached requests (93% reduction)
- +6% token throughput (34 vs 32 t/s)
- Memory formula: `num_prefixes * avg_prefix_tokens * 8 bytes`

Configuration:
```bash
# --cram: MB of host RAM reserved for the prompt cache
llama-server -m model.gguf \
  --cram 256 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  -fa \
  -ngl 99
```

Best for:
- System prompts > 5K tokens
- Multi-user chatbots with shared context
- Agentic use with repeated tool-call prefixes

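Sizing `--cram` from the memory formula above; a minimal sketch:

```python
def cram_mb(num_prefixes: int, avg_prefix_tokens: int) -> float:
    """Host RAM needed for the prompt-prefix cache, per the formula above."""
    return num_prefixes * avg_prefix_tokens * 8 / 1e6

# e.g. 100 distinct cached prefixes averaging 5K tokens each:
print(cram_mb(100, 5000), "MB")  # 4.0 MB
```

The footprint is tiny relative to model weights, which is why a generous `--cram 256` is a safe default.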
### 5.4 UMA Detection Bug (Issue #18159)

llama.cpp's UMA detection (from PR #17368, designed for NVIDIA DGX Spark)
incorrectly activates on AMD APUs when `prop.integrated=1`. It reads
`/proc/meminfo` instead of calling `hipMemGetInfo()`, severely underreporting
available GPU memory (e.g., reporting 27GB instead of 96GB).

**Workarounds**:
- Build without `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
- Guard UMA detection with `!defined(GGML_USE_HIP)` (upstream fix pending)
- Use toolbox containers where this has been patched

### 5.5 KV Cache Placement on ROCm (Issue #18011)

On Strix Halo, the ROCm backend may dump the KV cache into shared (CPU-accessible)
memory instead of GPU-mapped memory, causing performance degradation at high
context sizes. This is a known issue contributing to ROCm falling behind Vulkan
for tg at high contexts.

**Mitigation**: Use the rocWMMA-tuned branch, which maintains better memory
placement, or use Vulkan RADV for workloads where this matters.

---

## 6. llama-server Specific Optimizations

### 6.1 Recommended Server Configuration

```bash
# -c: total context across all slots; -np: parallel slots (adjust for your use)
# -b/-ub: logical/physical batch size
# --cont-batching is enabled by default; --jinja enables Jinja2 chat templates
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --no-mmap \
  -fa \
  -c 32768 \
  -np 4 \
  -b 2048 \
  -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --jinja
```

### 6.2 Parallel Slot Configuration (`-np`)

| Use Case | Slots | Context per Slot | Total `-c` |
|----------|-------|------------------|------------|
| Single user chat | 1 | 32768 | 32768 |
| Agentic coding (Claude Code style) | 2-4 | 8192-16384 | 32768-65536 |
| Multi-user API | 4-8 | 4096-8192 | 32768-65536 |
| Eval harness | 1 | 32768+ | 32768+ |

Memory formula: Each slot requires `context_size * 2 * hidden_dim * n_layers * bytes_per_kv_element`.
With q8_0 KV cache, this is roughly halved compared to f16.

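The per-slot formula can be turned into a small planning helper; a sketch with illustrative 7B-class dimensions (note the formula assumes full multi-head attention; GQA models need proportionally less, by their n_heads / n_kv_heads ratio):

```python
def slot_kv_bytes(context_size: int, hidden_dim: int, n_layers: int,
                  bytes_per_kv_element: float) -> int:
    """Per-slot KV cache size, per the formula above (assumes full MHA)."""
    return int(context_size * 2 * hidden_dim * n_layers * bytes_per_kv_element)

# Illustrative 7B-class dims: hidden 4096, 32 layers, 8K context per slot.
gb = slot_kv_bytes(8192, 4096, 32, 2) / 2**30   # f16 = 2 bytes/element
print(f"{gb:.1f} GiB per slot (f16), ~{gb / 2:.1f} GiB with q8_0")
```

Multiply by `-np` to get the total `-c` budget you must fit next to the model weights.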
### 6.3 Continuous Batching

Enabled by default (`--cont-batching`). Allows the server to process multiple
requests simultaneously, interleaving prefill and decode operations.

For agentic workloads: One slot typically holds a large system prompt + conversation
context, while additional slots handle parallel tool calls. Configure with:

```bash
-np 4 -c 131072   # 4 slots, up to 32K context each
```

### 6.4 Prompt Caching for Agentic Use

For agentic coding tools that send the same system prompt repeatedly:

1. Use `cache_prompt: true` in API requests (reuses the KV cache prefix)
2. Use `--system-prompt-file system.txt` for static system prompts (note: may be
   removed in recent versions; verify with your build)
3. Use `--cram 128` to enable host-memory caching for prefix deduplication

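For step 1, `cache_prompt` is set per request in the completion payload; a sketch of a request body against llama-server's `/completion` endpoint (host, port, and the prompt text are examples):

```python
import json
from urllib import request

payload = {
    "prompt": "You are a coding assistant.\n\nUser: refactor this function...",
    "n_predict": 256,
    "cache_prompt": True,   # reuse the KV cache for the shared prefix
}

req = request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # uncomment with a running llama-server
print(json.loads(req.data)["cache_prompt"])  # prints True
```

Requests that repeat the system prompt then only pay prefill cost for the suffix that changed.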
### 6.5 Speculative Decoding

For a token-generation speedup with a draft model:

```bash
llama-server -m main-model.gguf \
  --model-draft draft-model.gguf \
  -ngl 99 \
  --draft-max 8 \
  --draft-min 1 \
  --no-mmap \
  -fa
```

**Caveat**: Do NOT combine speculative decoding with KV cache quantization.
The ~16% performance drop and reduced acceptance rate negate the benefits.

---

## 7. Upcoming llama.cpp Features (2026)

### 7.1 Backend-Agnostic Tensor Parallelism (PR #19378)

Merged January 2026. Adds `--split-mode tensor` for splitting computation across
multiple GPUs via a new "meta" backend.

**Relevance to Strix Halo**: Limited. Single integrated GPU. However, for RPC
configurations with multiple Strix Halo nodes (Jeff Geerling's Beowulf cluster),
tensor parallelism could complement the existing layer-split approach.

Currently supports 1-2 GPUs with an equal data split. `--tensor-split` has no effect yet.

### 7.2 TurboQuant KV Cache Compression (ICLR 2026)

Google's TurboQuant (Zandieh et al.) achieves 3-bit KV cache quantization with
no training and negligible quality loss:

| Format | MSE vs FP16 | Compression |
|--------|-------------|-------------|
| TQ3 (3-bit) | 0.034 | 4.9x |
| TQ4 (4-bit) | 0.009 | 3.8x |

**Timeline**: Open-source llama.cpp integration expected Q2-Q3 2026. A 6-phase
integration plan exists covering GGML type registration, KV cache paths, FA
integration, and CLI flags.

### 7.3 Vulkan Improvements

Active 2025-2026 developments:

- Mesa RADV optimizations for RDNA4 AI workloads (Rhys Perry/Valve patches)
- 13% pp improvement from CU mode optimization for LDS utilization
- BFloat16 Vulkan support (`VK_KHR_shader_bfloat16`) maturing in Mesa 25.x
- Partial offloading performance improvement for AMD (llama.cpp b8185, March 2026)

### 7.4 Flash Attention for Head Dimension 512

A pull request from March 2026 adds FA support for HD=512 in the CUDA kernels.
This benefits models with larger head dimensions (some newer architectures).
The HIP path should inherit this via hipify.

### 7.5 ik_llama.cpp Fork Innovations

The `ik_llama.cpp` fork by ikawrakow introduces:

- Row-interleaved quant packing (better memory access patterns)
- Smart Expert Reduction for faster MoE inference
- Tensor overrides with regex patterns for hybrid GPU/CPU placement
- FlashMLA for DeepSeek models

**Caveat**: ik_llama.cpp only fully supports the CPU and CUDA backends. ROCm/Vulkan
are not maintained. Not recommended for AMD gfx1151.

---

## 8. Recommended Configurations

### 8.1 For llama-bench (Benchmarking)

**ROCm backend:**
```bash
ROCBLAS_USE_HIPBLASLT=1 \
toolbox run -c llama-rocm-7.2 -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```

**Vulkan backend:**
```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
toolbox run -c llama-vulkan -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```

### 8.2 For llama-server (Production Agentic Use)

**ROCm (best for long context):**
```bash
ROCBLAS_USE_HIPBLASLT=1 \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 65536 -np 4 \
  -b 2048 -ub 512 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --cram 256 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```

**Vulkan RADV (best for single-user tg):**
```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 32768 -np 2 \
  -b 4096 -ub 256 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```

### 8.3 Decision Matrix

| Question | Answer |
|----------|--------|
| Which backend for benchmarking? | Both. ROCm and Vulkan have different strengths. |
| Which backend for daily chat? | Vulkan RADV for best tg speed. |
| Which backend for long-context agentic? | ROCm + rocWMMA-tuned for context resilience. |
| Which quantization? | Q4_K_M or UD-Q4_K_XL for speed; Q5_K_M for quality. |
| Enable flash attention? | Yes, always on ROCm. Yes on Vulkan for short contexts. |
| Use `--no-mmap`? | Always. |
| Set `ROCBLAS_USE_HIPBLASLT=1`? | Always for ROCm. |
| Set `RADV_PERFTEST=nogttspill`? | Always for Vulkan RADV. |
| KV cache quantization? | q8_0 for both K and V unless using speculative decoding. |
| Batch size for MoE? | `-b 256` (lower than default improves some MoE models). |

---

## 9. Sources

### GitHub Issues and Discussions

- [Performance of llama.cpp on AMD ROCm (HIP) - Discussion #15021](https://github.com/ggml-org/llama.cpp/discussions/15021)
- [Performance of llama.cpp with Vulkan - Discussion #10879](https://github.com/ggml-org/llama.cpp/discussions/10879)
- [HIP backend performs poorly on gfx1151 - Issue #13565](https://github.com/ggml-org/llama.cpp/issues/13565)
- [UMA detection incorrectly limits memory on AMD APUs - Issue #18159](https://github.com/ggml-org/llama.cpp/issues/18159)
- [ROCm model loading dumps KV cache to shared memory - Issue #18011](https://github.com/ggml-org/llama.cpp/issues/18011)
- [GATED_DELTA_NET underperformance on gfx1151 - Issue #20354](https://github.com/ggml-org/llama.cpp/issues/20354)
- [Under-Performance of ROCm 7.2 binaries - Issue #19984](https://github.com/ggml-org/llama.cpp/issues/19984)
- [ROCm 7.2 + rocWMMA compilation - Issue #19269](https://github.com/ggml-org/llama.cpp/issues/19269)
- [Building for gfx1151 - Issue #14734](https://github.com/ggml-org/llama.cpp/issues/14734)
- [AMDVLK 2GB buffer allocation limit - Issue #15054](https://github.com/ggml-org/llama.cpp/issues/15054)
- [Mastering Host-Memory Prompt Caching - Discussion #20574](https://github.com/ggml-org/llama.cpp/discussions/20574)
- [TurboQuant Extreme KV Cache Quantization - Discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969)
- [Backend-agnostic tensor parallelism - PR #19378](https://github.com/ggml-org/llama.cpp/pull/19378)
- [Massively Improved rocWMMA Performance - PR #16827](https://github.com/ggml-org/llama.cpp/pull/16827)
- [rocWMMA for gfx1151 performance boost - lemonade-sdk Issue #7](https://github.com/lemonade-sdk/llamacpp-rocm/issues/7)
- [Increase llama.cpp performance on AI Max 395+ - geerlingguy Issue #5](https://github.com/geerlingguy/beowulf-ai-cluster/issues/5)

### Wiki and Community Resources

- [Strix Halo Wiki - llama.cpp Performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Strix Halo Wiki - llama.cpp with ROCm](https://strixhalo.wiki/AI/llamacpp-with-ROCm)
- [AMD Strix Halo Backend Benchmarks (Grid View)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [LLM Tracker - AMD Strix Halo GPU Performance](https://llm-tracker.info/AMD-Strix-Halo-(Ryzen-AI-Max+-395)-GPU-Performance)
- [Framework Community - Strix Halo GPU LLM Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521)
- [Framework Community - Toolboxes for LLM inference on Strix Halo](https://community.frame.work/t/llama-cpp-vllm-toolboxes-for-llm-inference-on-strix-halo/74916)

### Articles and Blog Posts

- [Hardware Corner - Strix Halo LLM Optimization](https://www.hardware-corner.net/strix-halo-llm-optimization/)
- [Hardware Corner - RADV Vulkan Driver 13% Improvement](https://www.hardware-corner.net/llama-cpp-amd-radv-vulkan-driver-update/)
- [Phoronix - AMD ROCm 7.1 vs RADV Vulkan](https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan)
- [Phoronix - Valve Developer RADV Improvement](https://www.phoronix.com/news/RADV-Valve-Boost-Llama.cpp)
- [Yifei's Blog - Strix Halo Matrix Cores with llama.cpp](https://blog.yifei.sg/jekyll/update/2025/08/27/building-llamacpp-strix-halo.html)
- [Strix Halo CUDA/HIP Testing Notes (lhl)](https://github.com/lhl/strix-halo-testing/blob/main/llama-cpp-cuda-hip.md)

### Official Documentation

- [ROCm - llama.cpp compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/llama-cpp-compatibility.html)
- [ROCm - llama.cpp installation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html)
- [ROCm Blog - Llama.cpp Meets Instinct](https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html)
- [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
- [llama-server README](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)

### Papers

- "Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct" (January 2026, arXiv:2601.14277)
- "TurboQuant: Redefining AI efficiency with extreme compression" (Zandieh et al., ICLR 2026)
- [Unsloth Dynamic 2.0 GGUFs Documentation](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)

---

## Open Questions / Limitations

1. **rocWMMA-tuned patch upstream status**: PR #16827 may not be fully merged.
   Monitor for inclusion in mainline llama.cpp or continue using patched toolboxes.

2. **ROCm 7.2 stability on gfx1151**: Multiple reports of crashes (MUT_MAL errors),
   performance regressions, and compilation issues. ROCm 7.x is maturing but
   not fully stable for gfx1151 as of March 2026.

3. **Vulkan CoopMat FA for AMD**: Will AMD ever get CoopMat2 support? The current
   CoopMat1 path provides a modest improvement. A native AMD CoopMat2 or equivalent
   extension would close the gap with ROCm FA.

4. **KV cache placement on ROCm**: Issue #18011 (KV cache dumped to shared memory)
   reduces ROCm tg performance at high contexts. The root cause appears to be in
   HIP memory allocation behavior on APUs.

5. **GGML_HIP_UMA vs kernel-param GTT expansion**: The UMA flag uses slow
   fine-grained memory. GTT expansion via `amdgpu.gttsize` kernel params provides
   coarse-grained GPU-mapped memory that is much faster. The upstream approach
   may eventually improve, but kernel params remain the correct method for now.

6. **GatedDeltaNet architecture support**: Both Vulkan (missing shader) and ROCm
   (register pressure, memcpy bottleneck) perform poorly on GDN models. This
   blocks efficient use of Qwen3.5-27B and similar models.

7. **TurboQuant integration timeline**: Expected Q2-Q3 2026 for llama.cpp.
   Would provide 3-bit KV cache with no quality loss, roughly doubling available
   context within the same memory budget.

8. **NPU utilization**: The 50 TOPS NPU on Strix Halo is currently unusable on Linux
   for llama.cpp inference. AMD driver support for the NPU on Linux remains pending.

---

## Overlap Notes

- **Kernel parameters** (`amdgpu.gttsize`, `ttm.pages_limit`, `iommu=pt`):
  Already documented in the project's `scripts/optimize/kernel-params.sh`.
  This research covers the llama.cpp side (why they matter for inference).

- **BIOS VRAM allocation**: Reducing dedicated VRAM in the BIOS frees more memory
  for GTT. This is documented in the project's audit scripts and is a prerequisite
  for the optimizations described here.

- **Toolbox container builds**: The project uses pre-built toolboxes
  (`llama-rocm-7.2`, `llama-vulkan`). The compilation flags documented here
  describe what should be baked into those containers.

172
docs/optimization-log.md
Normal file
@@ -0,0 +1,172 @@
# Optimization Log
|
||||||
|
|
||||||
|
Living document tracking what was applied, tested, and the actual results. Each entry records the change, benchmark evidence, and verdict.
|
||||||
|
|
||||||
|
**Verdicts**: KEEP (applied permanently), REVERTED (tested, didn't help), PENDING (not yet tested), BLOCKED (can't test yet).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 1: Core System
|
||||||
|
|
||||||
|
### 1.1 Tuned Profile: accelerator-performance
|
||||||
|
|
||||||
|
- **Date**: 2026-03-26
|
||||||
|
- **Change**: `sudo tuned-adm profile accelerator-performance`
|
||||||
|
- **Benchmark**: `data/benchmarks/after-tuned-*`
|
||||||
|
- **Result**: +5-8% pp improvement, +2-3% tg improvement
|
||||||
|
- **Verdict**: KEEP
|
||||||
|
|
||||||
|
### 1.2 Kernel Boot Parameters
|
||||||
|
|
||||||
|
- **Date**: 2026-03-26
|
||||||
|
- **Change**: `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`
|
||||||
|
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
|
||||||
|
- **Result**: Combined with BIOS VRAM change. Large models now fit in GTT. Peak usage 38.8/59 GiB.
|
||||||
|
- **Verdict**: KEEP
|
||||||
|
|
||||||
|
### 1.3 BIOS VRAM Reduction (512 MB)
|
||||||
|
|
||||||
|
- **Date**: 2026-03-26
|
||||||
|
- **Change**: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
|
||||||
|
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
|
||||||
|
- **Result**: 31.5 GB freed for OS/GTT. Small models ~3-8% slower (GTT indirection vs dedicated VRAM), but system gained ability to run 37 GB+ models at 32K+ context. Net positive.
|
||||||
|
- **Trade-off**: Small model regression is acceptable given the massive capability gain.
|
||||||
|
- **Verdict**: KEEP
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 2: System Tuning
|
||||||
|
|
||||||
|
### 2.1 RyzenAdj 85W PPT
|
||||||
|
|
||||||
|
- **Date**: PENDING
|
||||||
|
- **Change**: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000`
|
||||||
|
- **Expected**: +12-19% CPU/GPU throughput (community data from Strix Halo Wiki)
|
||||||
|
- **Benchmark**: Not yet run
|
||||||
|
- **Notes**: HP ZBook ships at 60W. 85W is the community-recommended sweet spot.
|
||||||
|
- **Verdict**: PENDING
|
||||||
|
|
||||||
|
### 2.2 VM Sysctl Tuning
|
||||||
|
|
||||||
|
- **Date**: PENDING
|
||||||
|
- **Change**: `vm.swappiness=1, vm.dirty_ratio=40, vm.max_map_count=500000`
|
||||||
|
- **Expected**: Prevent model weight eviction, reduce I/O disruption
|
||||||
|
- **Benchmark**: Not yet run
|
||||||
|
- **Verdict**: PENDING
|
||||||
|
|
||||||
|
### 2.3 Transparent Huge Pages
|
||||||
|
|
||||||
|
- **Date**: PENDING
|
||||||
|
- **Change**: `transparent_hugepage=always`
|
||||||
|
- **Expected**: Faster model load time, possible 1-5% tg improvement from reduced TLB misses
|
||||||
|
- **Benchmark**: Not yet run
|
||||||
|
- **Verdict**: PENDING

### 2.4 RADV_PERFTEST=nogttspill

- **Date**: PENDING
- **Change**: `export RADV_PERFTEST=nogttspill`
- **Expected**: Fix pp degradation on Vulkan RADV (community-reported fix for Strix Halo)
- **Benchmark**: Not yet run — needs Vulkan-specific benchmark comparison
- **Verdict**: PENDING

### 2.5 amdgpu.noretry=0

- **Date**: PENDING
- **Change**: Kernel cmdline `amdgpu.noretry=0`
- **Expected**: Improved stability under memory pressure
- **Notes**: Only apply if experiencing GPU page faults or crashes during large model loading
- **Verdict**: PENDING

---

## Phase 3: Runtime Flags

### 3.1 KV Cache Quantization

- **Date**: PENDING (sweep running)
- **Change**: `-ctk q8_0 -ctv q8_0` / `-ctk q4_0 -ctv q4_0`
- **Benchmark**: `data/benchmarks/kv-sweep-128k-*` (in progress)
- **Expected**: Q8_0: ~50% less KV memory, negligible quality loss. Q4_0: ~75% less, noticeable quality impact.
- **Verdict**: PENDING
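The memory numbers follow from the usual KV cache size formula. A rough sketch — the layer/head/context shape is illustrative, not a specific model, and the q8_0 bytes-per-element figure assumes GGUF q8_0's 34-byte blocks of 32 elements:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elt):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt / 1024**3

# Illustrative GQA shape: 48 layers, 8 KV heads, head_dim 128, 128K context
shape = (48, 8, 128, 131072)
f16  = kv_cache_gib(*shape, bytes_per_elt=2.0)
q8_0 = kv_cache_gib(*shape, bytes_per_elt=34 / 32)  # 32 elts + fp16 scale = 34 bytes

print(f"f16:  {f16:.2f} GiB")   # 24.00 GiB
print(f"q8_0: {q8_0:.2f} GiB")  # 12.75 GiB, ~47% savings
```

The exact savings is 1 - 1.0625/2 ≈ 47%, which is where the "~50% less KV memory" figure comes from; q4_0's ~75% follows the same way from its smaller block encoding.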

### 3.2 MoE Batch Size `-b 256`

- **Date**: PENDING
- **Change**: Add `-b 256` to MoE benchmark runs
- **Expected**: Up to +70% pp improvement for MoE models (community benchmarks)
- **Benchmark**: Not yet run
- **Verdict**: PENDING

---

## Phase 4: Build Optimizations

### 4.1 rocWMMA Flash Attention

- **Date**: PENDING
- **Change**: Rebuild ROCm toolbox with `-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`
- **Expected**: +96% long-context performance (65K+)
- **Notes**: Need to check if Donato's toolboxes already include this
- **Verdict**: PENDING

### 4.2 rocWMMA Tuned Patch (PR #16827)

- **Date**: PENDING
- **Notes**: Fixes long-context regression. Check Donato's latest toolbox builds.
- **Verdict**: PENDING

---

## Phase 5: Future / Blocked

### 5.1 Speculative Decoding

- **Status**: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE fix)
- **Draft model**: Downloaded `Qwen3.5-0.8B-Q8_0.gguf` (812 MB) on 2026-03-27
- **Last checked**: 2026-03-27 — PR open since 2026-03-03, has ROCm buffer issues

### 5.2 Native MTP (Multi-Token Prediction)

- **Status**: BLOCKED — llama.cpp PR #20700
- **Last checked**: 2026-03-27 — WIP, not expected to merge soon

### 5.3 GPU Clock Fix

- **Status**: BLOCKED — ROCm issue #5750
- **Notes**: GPU may be stuck at 885 MHz instead of 2900 MHz on gfx1151
- **Last checked**: 2026-03-27

---

## Context Window Benchmarks

### 64K Context (pp4096/tg1024, MoE models)

- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx64k-*`
- **Results**: (check logs)

### 128K Context (pp8192/tg1024, MoE models)

- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx128k-realistic-*`
- **Results**: (check logs)

### 256K Context (pp16384/tg1024, MoE models)

- **Date**: 2026-03-27
- **Benchmark**: `data/benchmarks/ctx256k-*`
- **Results**: (check logs)

---

## How to Add Entries

When testing a new optimization:

1. Record the date and exact change
2. Run a benchmark: `make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."`
3. Compare: `make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new`
4. Update this log with results and verdict
5. If KEEP: document in [optimization.md](optimization.md) with the measured numbers

````diff
@@ -1,20 +1,32 @@
 # Optimization Guide
 
-Complete walkthrough for optimizing AMD Strix Halo for LLM workloads.
+Complete walkthrough for optimizing AMD Strix Halo for LLM inference workloads. Organized in phases from essential to experimental. Each phase builds on the previous.
 
 **Prerequisites**: Run `make audit` first to see your current state. Run `make benchmark-baseline` to capture pre-optimization performance numbers.
 
-## Step 1: Tuned Profile (no reboot)
+Track results in [optimization-log.md](optimization-log.md) as you apply each change.
+
+---
+
+## Phase 1: Core System (automated scripts)
+
+These are the foundational optimizations handled by this repo's scripts. Apply in order.
+
+### 1.1 Tuned Profile (no reboot)
 
 ```bash
 sudo make optimize-tuned
 ```
 
-Switches from `throughput-performance` to `accelerator-performance`, which disables higher-latency CPU STOP states. Provides 5-8% improvement in prompt processing throughput.
+Switches from `throughput-performance` to `accelerator-performance`, which disables higher-latency CPU STOP states and sets CPU governor to performance.
 
 Takes effect immediately. Previous profile is saved for rollback.
 
-## Step 2: Kernel Boot Parameters (reboot required)
+| Expected Impact | pp512 | tg128 |
+|----------------|-------|-------|
+| Tuned profile | +5-8% | +2-3% |
+
+### 1.2 Kernel Boot Parameters (reboot required)
 
 ```bash
 sudo make optimize-kernel
@@ -24,61 +36,315 @@ Adds three parameters to GRUB:
 
 | Parameter | Value (64 GB) | Purpose |
 |-----------|--------------|---------|
-| `iommu=pt` | — | IOMMU passthrough, reduces memory access latency |
-| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB |
+| `iommu=pt` | -- | IOMMU passthrough, reduces memory access latency |
+| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB (~59 GiB) |
 | `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |
 
-Values are computed dynamically based on your system's total physical RAM. The script backs up `/etc/default/grub` before modifying it.
+Values are computed dynamically based on your system's total physical RAM. The script backs up `/etc/default/grub` before modifying it. See [architecture.md](architecture.md) for the math.
 
-See [docs/architecture.md](architecture.md) for the math behind these values.
-
-## Step 3: BIOS VRAM Reduction (reboot + BIOS access)
+### 1.3 BIOS VRAM Reduction (reboot + BIOS access)
 
 ```bash
-make optimize-vram
+make optimize-vram   # Prints guidance — cannot modify BIOS directly
 ```
 
-This prints guidance — it cannot modify BIOS directly. The goal is to reduce dedicated VRAM from 32 GB to 0.5 GB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.
+Reduce dedicated VRAM (UMA Frame Buffer Size) from 32 GB to 512 MB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.
 
-See [docs/bios-vram-guide.md](bios-vram-guide.md) for the full BIOS walkthrough.
+See [bios-vram-guide.md](bios-vram-guide.md) for the full BIOS walkthrough (HP ZBook: F10 at boot).
 
-**Combine Steps 2 and 3 into a single reboot**: apply kernel params, then reboot into BIOS (F10) to change VRAM, then boot normally.
+**Combine 1.2 and 1.3 into a single reboot**: apply kernel params, then reboot into BIOS to change VRAM, then boot normally.
 
-## Step 4: Verify
+### 1.4 Verify
 
 ```bash
-make verify
+make verify   # 9-point checklist, target: 9/9
+make audit    # Single-screen system status with scores
 ```
 
-Checks 9 criteria and reports a score. Target: 9/9.
-
-## Step 5: Measure Impact
-
-```bash
-make benchmark
-make benchmark-compare BEFORE=data/baselines/TIMESTAMP AFTER=data/benchmarks/TAG-TIMESTAMP
-```
-
-See [docs/benchmarking.md](benchmarking.md) for methodology and result interpretation.
-
-## Expected Impact
-
-| Optimization | pp512 Improvement | tg128 Improvement |
-|-------------|-------------------|-------------------|
+### Phase 1 Expected Impact (combined)
+
+| Optimization | pp512 | tg128 |
+|-------------|-------|-------|
 | Tuned profile | +5-8% | +2-3% |
 | Kernel params + BIOS VRAM | +10-20% | +5-15% |
-| **Combined** | **+15-25%** | **+8-18%** |
+| **Phase 1 combined** | **+15-25%** | **+8-18%** |
 
 Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion.
+
+---
+
+## Phase 2: System Tuning (manual, no reboot unless noted)
+
+These require root but are safe to apply and revert.
+
+### 2.1 Power Budget Increase via RyzenAdj
+
+The HP ZBook Ultra G1a ships with a conservative 60W power limit. The Strix Halo chip supports 120W. Community testing shows **85W is the sweet spot**: +12-19% over 60W, with manageable thermals.
+
+```bash
+# Install ryzenadj (Fedora)
+sudo dnf install ryzenadj   # or build from https://github.com/FlyGoat/RyzenAdj
+
+# Apply 85W limits (milliwatts)
+sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000
+
+# Verify
+sudo ryzenadj -i | grep -E 'STAPM|PPT'
+```
+
+| Setting | HP Default | Recommended | Max (risky) |
+|---------|-----------|-------------|-------------|
+| STAPM | 60W | **85W** | 120W |
+| PPT Fast | 60W | **85W** | 120W |
+| PPT Slow | 20W | **85W** | 120W |
+
+**Notes**:
+- Settings are volatile — reset on reboot/sleep. Create a systemd service for persistence.
+- Going above 85W yields only +2-3% more (LLM inference is memory-bandwidth-bound at ~215 GB/s).
+- Monitor thermals: `sensors` or `amdgpu_top`. Throttling starts around 100C junction temp.
+- HP firmware may periodically reset limits. Verify after wake from sleep.
+- The 140W USB-C charger limits total system draw. At 100W+ APU, battery will drain even while plugged in.
+
+### 2.2 VM / Sysctl Tuning
+
+```bash
+# Apply immediately
+sudo sysctl -w vm.swappiness=1
+sudo sysctl -w vm.dirty_ratio=40
+sudo sysctl -w vm.dirty_background_ratio=10
+sudo sysctl -w vm.max_map_count=500000
+sudo sysctl -w vm.zone_reclaim_mode=0
+
+# Persist across reboots
+sudo tee /etc/sysctl.d/99-llm-inference.conf << 'EOF'
+vm.swappiness = 1
+vm.dirty_ratio = 40
+vm.dirty_background_ratio = 10
+vm.max_map_count = 500000
+vm.zone_reclaim_mode = 0
+EOF
+```
+
+| Parameter | Default | Recommended | Why |
+|-----------|---------|-------------|-----|
+| `vm.swappiness` | 60 | **1** | Prevent model weights from being swapped out |
+| `vm.dirty_ratio` | 20 | **40** | Reduce I/O flush storms during inference |
+| `vm.dirty_background_ratio` | 10 | **10** | Keep background writeback at default |
+| `vm.max_map_count` | 65530 | **500000** | Large models need many memory mappings |
+| `vm.zone_reclaim_mode` | 0 | **0** | Don't aggressively reclaim memory zones |
+
+### 2.3 Transparent Huge Pages
+
+THP reduces TLB misses for mmap'd model files (~55 GB model = 14M page table entries at 4KB vs 28K at 2MB).
+
+```bash
+# Apply immediately
+echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
+echo defer+madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
+
+# Persist via kernel cmdline (add to GRUB):
+# transparent_hugepage=always
+
+# Verify THP is being used
+grep -i huge /proc/meminfo
+grep thp /proc/vmstat
+```
+
+**Trade-off**: `always` may cause rare latency spikes during memory compaction. Use `madvise` if you need predictable latency, but note that llama.cpp does not call `madvise(MADV_HUGEPAGE)` so `always` is needed.
+
+### 2.4 RADV_PERFTEST=nogttspill (Vulkan backend)
+
+Prevents unnecessary GTT spill management on unified memory. Fixes prompt processing degradation with the Vulkan RADV backend.
+
+```bash
+# Per-session
+export RADV_PERFTEST=nogttspill
+
+# Persist system-wide
+echo 'RADV_PERFTEST=nogttspill' | sudo tee /etc/environment.d/radv.conf
+```
+
+Only affects the Vulkan RADV backend. No effect on ROCm.
+
+### 2.5 Additional Kernel Parameters (reboot required)
+
+These can be added to the GRUB cmdline alongside Phase 1 params:
+
+| Parameter | Value | Purpose | Priority |
+|-----------|-------|---------|----------|
+| `amdgpu.noretry=0` | 0 | Enable GPU page fault retry, improves stability | Medium — add if seeing GPU crashes |
+| `transparent_hugepage=always` | -- | Persist THP setting | Medium |
+| `preempt=voluntary` | -- | Reduce context switch overhead | Low — only for batch inference |
+| `processor.max_cstate=1` | 1 | Disable deep C-states | Low — tuned profile handles this |
+
+**Do NOT add**: `amdgpu.ppfeaturemask=0xffffffff` — OverDrive is non-functional on gfx1151 (ROCm issue #5750).
+
+---
+
+## Phase 3: Runtime Flags (per-invocation, no system changes)
+
+These are llama-bench / llama-server flags that affect performance without changing the system.
+
+### 3.1 Always Use `-mmp 0` (no mmap)
+
+On unified memory, mmap adds a double-copy penalty. The `--no-mmap` / `-mmp 0` flag loads weights directly. Already set in this repo's benchmark scripts.
+
+### 3.2 Batch Size for MoE Models (`-b 256`)
+
+Default batch size (2048) is too large for MoE on this hardware. Reducing to 256 can improve pp512 throughput by up to 70% on MoE models.
+
+```bash
+# In llama-bench
+llama-bench -m model.gguf -b 256 -ngl 99 -fa 1
+
+# In llama-server
+llama-server -m model.gguf -b 256 -ngl 99 -fa 1
+```
+
+### 3.3 KV Cache Quantization
+
+Q8_0 KV cache halves KV memory usage with negligible quality loss. Recommended as default for all serving.
+
+```bash
+# llama-server
+llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0
+
+# Benchmark sweep
+make benchmark ARGS="--tag kv-sweep --kv-types f16,q8_0,q4_0 --context 131072 --models MODEL.gguf --reps 3"
+```
+
+| KV Type | Memory Savings | Quality Impact | Recommendation |
+|---------|---------------|----------------|----------------|
+| f16 | Baseline | None | Default for benchmarks |
+| **q8_0** | **~50%** | **Negligible** | **Default for serving** |
+| q4_0 | ~75% | Noticeable on reasoning | Only for max context |
+
+### 3.4 Flash Attention (`-fa 1`)
+
+Always enable on ROCm (+24% pp improvement). On Vulkan, FA uses CoopMat1 (modest improvement). Already set in benchmark scripts.
+
+### 3.5 ROCBLAS_USE_HIPBLASLT=1 (ROCm only)
+
+Without this, ROCm pp on gfx1151 is 2-7x slower. Already set in benchmark scripts.
+
+### 3.6 Backend Selection
+
+Neither ROCm nor Vulkan is universally faster:
+
+| Workload | Best Backend | Why |
+|----------|-------------|-----|
+| Short context tg | Vulkan RADV | Lower per-token overhead |
+| Long context (8K-130K) | ROCm + rocWMMA | True HW flash attention |
+| General stability | Vulkan RADV | More mature on gfx1151 |
+
+Never use AMDVLK — RADV scales 3.6x better at extreme context depths.
+
+---
+
+## Phase 4: Build Optimizations (requires rebuilding containers)
+
+These require rebuilding the llama.cpp toolbox containers with specific flags.
+
+### 4.1 ROCm Build Flags
+
+```bash
+cmake -B build \
+  -DGGML_HIP=ON \
+  -DGGML_HIP_ROCWMMA_FATTN=ON \   # GPU-accelerated flash attention via WMMA
+  -DGGML_HIP_UMA=ON \             # Unified memory aware allocation
+  -DAMDGPU_TARGETS=gfx1151
+```
+
+`GGML_HIP_ROCWMMA_FATTN` is the only path to true GPU-accelerated flash attention on AMD (96% speedup at 65K context). The Vulkan CoopMat1 path is a software fallback.
+
+### 4.2 rocWMMA Tuned Patch (PR #16827)
+
+Fixes a long-context regression in rocWMMA. Implements adaptive KQ stride, better launch bounds, and selective WMMA (prefill only; decode reverts to VEC/TILE). Check if Donato Capitella's ROCm toolboxes include this.
+
+### 4.3 Vulkan Cooperative Matrices
+
+RADV supports `VK_KHR_cooperative_matrix` for RDNA 3+. Building llama.cpp with cooperative matrix support could enable WMMA-like speedups without ROCm dependency.
+
+---
+
+## Phase 5: Future / Currently Blocked
+
+These optimizations are not available today but are worth tracking.
+
+### 5.1 Speculative Decoding (blocked: llama.cpp PR #20075)
+
+Expected 1.8-2.5x tg speedup for coding tasks. Draft model (`Qwen3.5-0.8B-Q8_0.gguf`, 812 MB) already downloaded. Blocked because Qwen3.5 MoE uses hybrid GatedDeltaNet architecture that breaks llama.cpp's speculative rollback mechanism.
+
+**Track**: [llama.cpp PR #20075](https://github.com/ggml-org/llama.cpp/pull/20075) — fix for hybrid SSM/MoE speculative decoding.
+
+### 5.2 Native Multi-Token Prediction (blocked: llama.cpp PR #20700)
+
+Qwen3.5 was trained with built-in MTP heads. No separate draft model needed. Works in vLLM/SGLang today but not llama.cpp.
+
+**Track**: [llama.cpp PR #20700](https://github.com/ggml-org/llama.cpp/pull/20700) — MTP for Qwen3.5 with FastMTP vocabulary trimming.
+
+### 5.3 GPU Clock Fix (blocked: ROCm #5750)
+
+GPU clocks on gfx1151 may be stuck at ~885 MHz instead of 2900 MHz. `power_dpm_force_performance_level` and OverDrive are non-functional. If fixed, this could unlock significant additional throughput.
+
+**Track**: [ROCm issue #5750](https://github.com/ROCm/ROCm/issues/5750) — Strix Halo stuck in low power clocks.
+
+### 5.4 SageAttention
+
+2-5x speedup over FlashAttention via quantized attention computation. No AMD port exists yet.
+
+### 5.5 AMD XDNA NPU (50 TOPS)
+
+Not viable for LLM inference today. Linux support coming in kernel 7.1. Future potential: running a draft model on the NPU for speculative decoding while the GPU runs the main model.
+
+### 5.6 TurboQuant 3-bit KV Cache (ICLR 2026)
+
+4.9x KV cache compression with minimal quality loss. Being integrated into llama.cpp.
+
+### 5.7 LLMLingua-2 Prompt Compression
+
+20x prompt compression for agentic/RAG workloads. Reduces pp time by compressing input before inference. Applicable to the agentic eval pipeline.
+
+---
+
+## Hardware Limits (cannot be changed)
+
+Understanding what is fixed helps avoid wasted effort.
+
+| Resource | Value | Notes |
+|----------|-------|-------|
+| Memory bandwidth | **~215 GB/s** (measured) | 84% of 256 GB/s theoretical. Hard ceiling for tg speed. |
+| LPDDR5X-8000 | **8000 MT/s, 256-bit** | Soldered, no XMP/EXPO, no overclocking |
+| Infinity Fabric | **2 GHz FCLK** | Fixed, not tunable on Strix Halo |
+| Infinity Cache | **32 MB** | ~1 TB/s hit bandwidth. Per-layer weights exceed it. |
+| GPU clocks | **Up to 2900 MHz** | Currently broken in driver (see 5.3) |
+| Max power | **120W APU** | HP ZBook charger is 140W total system |
+
+---
+
 ## Rollback
 
 ```bash
-sudo make rollback
+sudo make rollback   # Restores GRUB backup and previous tuned profile
 ```
 
-Restores GRUB backup and previous tuned profile. BIOS VRAM must be reverted manually (F10 → restore previous UMA Frame Buffer Size).
+BIOS VRAM must be reverted manually (F10 at boot, restore previous UMA Frame Buffer Size).
+
+Phase 2 changes can be reverted individually:
+- RyzenAdj: `sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=60000`
+- Sysctl: `sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system`
+- THP: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`
+- nogttspill: `sudo rm /etc/environment.d/radv.conf`
+
+---
+
 ## Troubleshooting
 
-If anything goes wrong, see [docs/troubleshooting.md](troubleshooting.md).
+If anything goes wrong, see [troubleshooting.md](troubleshooting.md).
+
+## Further Reading
+
+- [Hardware analysis](llama-cpp-optimization-research.md) — Deep dive into llama.cpp flags, backends, quantization
+- [Inference landscape](inference-optimization-landscape.md) — Broader survey of engines, techniques, and future directions
+- [Benchmarking guide](benchmarking.md) — Methodology and result interpretation
+- [References](references.md) — All external links
````

```diff
@@ -21,8 +21,13 @@ The most comprehensive community resource for Strix Halo LLM optimization.
 ## Community
 
 - [Strix Halo Wiki — AI Capabilities](https://strixhalo.wiki/AI/AI_Capabilities_Overview) — Community benchmarks, model compatibility
+- [Strix Halo Wiki — Power Modes](https://strixhalo.wiki/Guides/Power-Modes-and-Performance) — RyzenAdj sweet spots (85W recommended)
+- [Strix Halo Wiki — llama.cpp Performance](https://strixhalo.wiki/AI/llamacpp-performance) — Backend comparison data
 - [Level1Techs Forum — HP G1a Guide](https://forum.level1techs.com/t/the-ultimate-arch-secureboot-guide-for-ryzen-ai-max-ft-hp-g1a-128gb-8060s-monster-laptop/230652) — Laptop-specific configuration
 - [Framework Community — GPU Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521) — Framework Desktop results
+- [Framework Community — Compiling vLLM on Strix Halo](https://community.frame.work/t/how-to-compiling-vllm-from-source-on-strix-halo/77241) — Native vLLM build guide
+- [Hardware Corner — Strix Halo LLM Optimization](https://www.hardware-corner.net/strix-halo-llm-optimization/) — Comprehensive optimization walkthrough
+- [Chips and Cheese — Strix Halo Memory Subsystem](https://chipsandcheese.com/p/strix-halos-memory-subsystem-tackling) — Bandwidth measurements (215 GB/s)
 - [LLM Tracker — Strix Halo](https://llm-tracker.info/_TOORG/Strix-Halo) — Centralized performance database
 
 ## Other Strix Halo Repos
@@ -61,6 +66,22 @@ The most comprehensive community resource for Strix Halo LLM optimization.
 - [SWE-bench](https://github.com/princeton-nlp/SWE-bench) — Real GitHub issue resolution
 - [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) — Optimized agentic framework for Qwen models
 
+## System Tuning
+
+- [RyzenAdj](https://github.com/FlyGoat/RyzenAdj) — Power management for Ryzen APUs (PPT/TDP control)
+- [geohot/ztop](https://github.com/geohot/ztop) — Power monitoring for Strix Halo (discovered 60W HP limits)
+- [ROCm Issue #5750](https://github.com/ROCm/ROCm/issues/5750) — GPU clocks stuck at idle on gfx1151
+- [Mesa RADV Environment Variables](https://docs.mesa3d.org/envvars.html) — RADV_PERFTEST=nogttspill docs
+- [Linux Kernel: amd-pstate](https://docs.kernel.org/admin-guide/pm/amd-pstate.html) — CPU power management
+
+## llama.cpp Optimization
+
+- [llama.cpp Speculative Decoding](https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md) — Draft model setup
+- [llama.cpp PR #20075](https://github.com/ggml-org/llama.cpp/pull/20075) — Fix speculative for Qwen3.5 MoE (pending)
+- [llama.cpp PR #20700](https://github.com/ggml-org/llama.cpp/pull/20700) — Native MTP for Qwen3.5 (WIP)
+- [llama.cpp PR #16827](https://github.com/ggml-org/llama.cpp/pull/16827) — rocWMMA tuned flash attention
+- [llama.cpp Issue #12444](https://github.com/ggml-org/llama.cpp/issues/12444) — Hugepage support proposal
+
 ## AMD GPU Profiling
 
 - [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — Hardware-level Vulkan/HIP profiling
```

```diff
@@ -217,7 +217,7 @@ for MODEL_PATH in "${MODEL_PATHS[@]}"; do
     KV_SUFFIX=""
     if [[ "$KV_K" != "f16" || "$KV_V" != "f16" ]]; then
         KV_ARGS+=(-ctk "$KV_K" -ctv "$KV_V")
-        KV_SUFFIX="__kv_${KV_K}_${KV_V}"
+        KV_SUFFIX="__kv_${KV_K}-${KV_V}"
     fi
 
     # Standard test
@@ -292,8 +292,8 @@ for logfile in sorted(result_dir.glob("*.log")):
     if "FAILED" in content:
         continue
 
-    # Extract KV cache type from filename (__kv_q8_0_q8_0)
-    kv_match = re.search(r'__kv_([a-z0-9_]+)_([a-z0-9_]+)\.log$', logfile.name)
+    # Extract KV cache type from filename (__kv_q8_0-q8_0)
+    kv_match = re.search(r'__kv_(.+)-(.+)\.log$', logfile.name)
     kv_type = f"{kv_match.group(1)}/{kv_match.group(2)}" if kv_match else "f16/f16"
 
     for line in content.splitlines():
@@ -304,12 +304,15 @@ for logfile in sorted(result_dir.glob("*.log")):
             continue
 
         parts = [p.strip() for p in line.split("|")]
-        if len(parts) < 10:
+        # Filter out empty parts from leading/trailing pipes
+        data = [p for p in parts if p and "---" not in p]
+        if len(data) < 6:
             continue
 
         try:
-            test_type = parts[8].strip() if len(parts) > 8 else ""
-            ts_raw = parts[9].strip() if len(parts) > 9 else ""
+            # test and t/s are always the last two columns
+            test_type = data[-2]
+            ts_raw = data[-1]
             if not test_type or not ts_raw:
                 continue
@@ -319,9 +322,9 @@ for logfile in sorted(result_dir.glob("*.log")):
 
             results.append({
                 "file": logfile.name,
-                "model": parts[1].strip(),
-                "size": parts[2].strip(),
-                "backend": parts[4].strip(),
+                "model": data[0],
+                "size": data[1],
+                "backend": data[3],
                 "test": test_type,
                 "tokens_per_sec": float(ts_match.group(1)),
                 "kv_cache": kv_type,
```

```diff
@@ -210,7 +210,7 @@ for MODEL_PATH in "${MODEL_PATHS[@]}"; do
     KV_SUFFIX=""
     if [[ "$KV_K" != "f16" || "$KV_V" != "f16" ]]; then
        KV_ARGS+=(-ctk "$KV_K" -ctv "$KV_V")
-        KV_SUFFIX="__kv_${KV_K}_${KV_V}"
+        KV_SUFFIX="__kv_${KV_K}-${KV_V}"
     fi
 
     # Standard test
@@ -267,8 +267,8 @@ for logfile in sorted(result_dir.glob("*.log")):
     if "FAILED" in content:
         continue
 
-    # Extract KV cache type from filename (__kv_q8_0_q8_0)
-    kv_match = re.search(r'__kv_([a-z0-9_]+)_([a-z0-9_]+)\.log$', logfile.name)
+    # Extract KV cache type from filename (__kv_q8_0-q8_0)
+    kv_match = re.search(r'__kv_(.+)-(.+)\.log$', logfile.name)
     kv_type = f"{kv_match.group(1)}/{kv_match.group(2)}" if kv_match else "f16/f16"
 
     for line in content.splitlines():
@@ -278,19 +278,22 @@ for logfile in sorted(result_dir.glob("*.log")):
         if "---" in line:
             continue
         parts = [p.strip() for p in line.split("|")]
-        if len(parts) < 10:
+        # Filter out empty parts from leading/trailing pipes
+        data = [p for p in parts if p and "---" not in p]
+        if len(data) < 6:
             continue
         try:
-            test_type = parts[8].strip()
-            ts_raw = parts[9].strip()
+            # test and t/s are always the last two columns
+            test_type = data[-2]
+            ts_raw = data[-1]
             ts_match = re.match(r'([\d.]+)', ts_raw)
             if not ts_match:
                 continue
             results.append({
                 "file": logfile.name,
-                "model": parts[1].strip(),
-                "size": parts[2].strip(),
-                "backend": parts[4].strip(),
+                "model": data[0],
+                "size": data[1],
+                "backend": data[3],
                 "test": test_type,
                 "tokens_per_sec": float(ts_match.group(1)),
                 "kv_cache": kv_type,
```
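The parse-from-the-end approach in the scripts above can be exercised standalone: filtering empty cells and taking `test`/`t/s` from the end of the row makes the parser indifferent to the extra `type_k`/`type_v` columns that KV cache quantization adds. A minimal sketch — the sample rows below are illustrative of llama-bench's markdown table shape, not captured output:

```python
import re

def parse_row(line):
    """Parse one llama-bench markdown table row, tolerating a variable
    column count (e.g. type_k/type_v columns from KV quantization)."""
    parts = [p.strip() for p in line.split("|")]
    # Filter out empty parts from leading/trailing pipes
    data = [p for p in parts if p and "---" not in p]
    if len(data) < 6:
        return None
    # test and t/s are always the last two columns
    test_type, ts_raw = data[-2], data[-1]
    ts_match = re.match(r"([\d.]+)", ts_raw)
    if not ts_match:
        return None
    return {"model": data[0], "backend": data[3],
            "test": test_type, "tokens_per_sec": float(ts_match.group(1))}

# Same row shape without and with the two extra KV cache type columns:
row_plain = "| llama 8B Q4_K | 4.58 GiB | 8.03 B | ROCm | 99 | pp512 | 450.12 ± 1.20 |"
row_kv    = "| llama 8B Q4_K | 4.58 GiB | 8.03 B | ROCm | 99 | q8_0 | q8_0 | pp512 | 432.50 ± 0.90 |"

print(parse_row(row_plain)["tokens_per_sec"])  # 450.12
print(parse_row(row_kv)["tokens_per_sec"])     # 432.5
```

Hardcoded indices (`parts[8]`, `parts[9]`) would have read the KV type columns as `test`/`t/s` in the second row; anchoring at the end of the row keeps both rows parsing correctly.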