fix(benchmark): parse llama-bench output with variable column count
KV cache quantization adds type_k/type_v columns to llama-bench output, shifting test and t/s to different indices. Parse from end of row instead of hardcoded positions. Also fix KV suffix separator (underscore to dash) to avoid regex ambiguity with type names like q8_0. Add 5-phase optimization guide, optimization log for tracking results, and research docs on llama.cpp and inference landscape optimizations.
825
docs/inference-optimization-landscape.md
Normal file
@@ -0,0 +1,825 @@
# LLM Inference Optimization Landscape (March 2026)

## Scope

Comprehensive survey of cutting-edge LLM inference optimization techniques applicable
to a high-end AMD APU workstation: Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S
(gfx1151, RDNA 3.5), 64 GB unified LPDDR5X memory, 256 GB/s bandwidth. Covers
inference engines, quantization, attention, MoE optimization, memory bandwidth, OS-level
tuning, hardware features, and model-level techniques. Research current as of March 2026.

---

## Table of Contents

1. [Inference Engines and Backends](#1-inference-engines-and-backends)
2. [Advanced Quantization Techniques](#2-advanced-quantization-techniques)
3. [Attention Optimization](#3-attention-optimization)
4. [MoE-Specific Optimizations](#4-moe-specific-optimizations)
5. [Memory Bandwidth Optimization](#5-memory-bandwidth-optimization)
6. [OS and Runtime Techniques](#6-os-and-runtime-techniques)
7. [Emerging Hardware Features](#7-emerging-hardware-features)
8. [Model-Level Optimizations](#8-model-level-optimizations)
9. [Prioritized Recommendations for Strix Halo](#9-prioritized-recommendations-for-strix-halo)
10. [Sources](#10-sources)

---

## 1. Inference Engines and Backends

### 1.1 llama.cpp -- Still the Foundation

llama.cpp remains the dominant local inference engine. All major interfaces (Ollama,
LM Studio, GPT4All, KoboldCpp) use it under the hood. For Strix Halo specifically:

- **ROCm/HIP backend**: Build with `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`. The `ROCBLAS_USE_HIPBLASLT=1`
environment variable forces hipBLASLt kernels, which deliver the best throughput on
gfx1151.
- **Vulkan backend**: The RADV Mesa driver has seen active RDNA 3.5/4 optimization
in Mesa 25.x. In some benchmarks Vulkan outperforms ROCm for single-shot inference
and shorter contexts. HIP+WMMA+FlashAttention is fastest for long contexts (tg8192+).
- **UMA detection bug (issue #18159)**: llama.cpp's UMA detection can incorrectly
limit available memory on AMD APUs with large TTM allocations. Disabling mmap
(`--no-mmap`, or `-mmp 0` for llama-bench) is critical for ROCm on Strix Halo to
avoid catastrophically slow model loading.
- **Performance**: Llama-2-7B Q4_0 achieves ~1464 t/s prompt processing (pp512) and
~50 t/s token generation (tg128) on Strix Halo with ROCm.
- **Known regression**: A commit enabling WMMA-MMQ INT kernels for RDNA 3 introduced
a significant prompt-processing regression on gfx1151 with ROCm 7.x (issue #17917).

**Status**: Production-ready. Best single-engine choice for Strix Halo.
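The flags above combine into a build-and-launch sequence along these lines (a sketch, not verified on this exact machine; the repository URL, paths, and model file are placeholders):

```bash
# Build llama.cpp with the HIP backend tuned for gfx1151 (Strix Halo)
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGGML_HIP_UMA=ON
cmake --build build --config Release -j

# Force hipBLASLt kernels; --no-mmap avoids the slow-load path on ROCm
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
  -m model-Q4_K_M.gguf --no-mmap -ngl 99
```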
### 1.2 KTransformers -- CPU/GPU Hybrid MoE Specialist

KTransformers (SOSP 2025) is the most significant new engine for hybrid inference.
It was purpose-built for running large MoE models (DeepSeek-R1/V3) on systems with
limited GPU memory but abundant CPU memory.

- **AMX-optimized kernels**: Uses Intel AMX instructions for CPU-side expert
computation. For AMD Zen 5, it falls back to AVX-512, which is still substantially
faster than naive CPU inference.
- **Async CPU-GPU scheduling**: Overlaps CPU expert computation with GPU attention
computation, hiding CPU latency.
- **Performance**: 4.62-19.74x prefill speedup, 1.25-4.09x decode speedup vs
existing hybrid systems. SGLang + KTransformers achieves 220+ tok/s total
throughput on trillion-parameter MoE models.
- **Relevance to Strix Halo**: Moderate. KTransformers shines when GPU VRAM is
scarce (24 GB discrete) and CPU RAM is abundant (382 GB). On Strix Halo, all 64 GB
is accessible to the GPU, so the CPU offloading advantage is diminished. However,
for models exceeding 64 GB, KTransformers-style hybrid inference becomes relevant.

**Status**: Production. Most useful for models that exceed available VRAM.
### 1.3 PowerInfer / PowerInfer-2

PowerInfer-2 targets smartphones, achieving 11.68 t/s on Mixtral 47B (22x faster
than alternatives). It exploits MoE sparsity by predicting which experts will
activate and only loading those. The core technique -- hot/cold neuron partitioning
and GPU-resident hot neurons -- is architecturally interesting, but the implementation
targets mobile SoCs with discrete memory hierarchies, not unified-memory APUs where
all memory is equally accessible to the GPU.

**Status**: Research. Techniques are partially subsumed by llama.cpp's own MoE
offloading improvements.
### 1.4 MLC-LLM

MLC-LLM compiles models via TVM to target multiple backends including ROCm, Vulkan,
Metal, and OpenCL. It was one of the first engines to make AMD GPUs competitive for
LLM inference (2023 blog post). The Vulkan backend provides a universal fallback
that works on any GPU.

**Status**: Active but niche. For Strix Halo, llama.cpp's native ROCm/Vulkan
backends are more mature and better optimized.
### 1.5 mistral.rs / candle / burn

Rust-based inference engines:

- **mistral.rs**: Built on Hugging Face's candle library. Supports GGUF, GPTQ,
ISQ (in-situ quantization). Has CUDA support but no ROCm backend.
- **candle**: Hugging Face's Rust ML framework. GPU support via CUDA; no ROCm.
- **burn**: Rust ML framework with multiple backends (WGPU, Vulkan, CUDA). The
WGPU/Vulkan path could theoretically work on AMD, but LLM inference support
is limited.

**Status**: Not viable for Strix Halo in 2026. No ROCm support, and the Vulkan
paths are less optimized than llama.cpp's.
### 1.6 BitNet.cpp

Microsoft's official 1-bit LLM inference framework. Achieves 6x faster inference
and 82% lower energy consumption. GPU kernel support was added May 2025 for NVIDIA
and Apple Silicon. No AMD GPU kernels yet. CPU-only mode works on any x86 system
and could be relevant for future 1-bit models, but the model ecosystem (BitNet b1.58
variants) remains small.

**Status**: Watch. No AMD GPU support. CPU path works but model selection is limited.
### 1.7 vLLM and SGLang

Both are production LLM serving frameworks with AMD ROCm support:

- **vLLM v0.16.0** (Feb 2026): ROCm is now a first-class platform. 93% of AMD
test groups passing. Native AITER FP8 kernels, fused LayerNorm/SiLU, optimized
Paged Attention. Extended bitsandbytes quantization to warp-size-32 GPUs (RDNA).
- **SGLang**: Supports ROCm. KTransformers integration for hybrid MoE inference.

Both are overkill for single-user local inference but become relevant for serving
multiple users or running agentic workloads with concurrent requests.

**Status**: Production for server workloads. Consider if running multi-user or
agentic eval pipelines.
### 1.8 ExLlamaV3 / EXL3

ExLlamaV3 introduces the EXL3 format (based on QTIP from Cornell RelaxML), achieving
excellent perplexity at extreme compression (Llama 3.3 70B at 1.75 bpw, 19 GB). The
Marlin-inspired GEMM kernels are highly optimized for NVIDIA GPUs. AMD ROCm support
was absent at launch (early 2025) and current status is uncertain.

**Status**: Watch. Potentially best-in-class quantization quality, but AMD support
is unclear.

---

## 2. Advanced Quantization Techniques
### 2.1 GGUF Quantization Landscape

GGUF remains the dominant format for local inference via llama.cpp. The key variants:

| Format  | Bits | Method           | Best For                     |
|---------|------|------------------|------------------------------|
| Q8_0    | 8    | Round-to-nearest | Maximum quality, 2x compress |
| Q6_K    | 6.5  | K-quant          | High quality, 2.5x compress  |
| Q5_K_M  | 5.5  | K-quant+imatrix  | Balanced quality/size        |
| Q4_K_M  | 4.5  | K-quant+imatrix  | Default recommendation       |
| Q3_K_M  | 3.9  | K-quant+imatrix  | Aggressive, still usable     |
| IQ3_XXS | 3.06 | I-quant+imatrix  | Extreme compression          |
| IQ2_XXS | 2.06 | I-quant+imatrix  | Near-minimum viable          |
| IQ1_S   | 1.56 | I-quant+imatrix  | Experimental                 |

**imatrix (Importance Matrix)**: The single most impactful quality improvement for
sub-4-bit quantization. The importance matrix identifies which weights produce large
activations during inference and allocates more precision to them. For aggressive
quantization (<4 bits), imatrix is no longer optional -- it is essential.

**Recommendation**: Q4_K_M + imatrix for most use cases. Q3_K_M + imatrix when
fitting a larger model matters more than marginal quality.
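The imatrix workflow in llama.cpp looks roughly like this (a sketch; file names are placeholders, and flag spellings should be checked against your llama.cpp version):

```bash
# 1. Collect an importance matrix by running a calibration text through the model
./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat

# 2. Quantize using the importance matrix to protect high-activation weights
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```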
### 2.2 Unsloth Dynamic 2.0

Unsloth Dynamic 2.0 (Feb 2026) represents the state-of-the-art in intelligent GGUF
quantization:

- **Per-layer adaptive quantization**: Each layer gets a custom quantization type
based on sensitivity analysis. The quantization scheme for Gemma 3 differs
significantly from Llama 4.
- **Universal MoE + dense support**: Dynamic 2.0 works on all architectures
(previously MoE-only).
- **Calibration dataset**: 1.5M+ token hand-curated dataset for improved
conversational quality.
- **Quality results**: Dynamic 3-bit DeepSeek V3.1 GGUF scores 75.6% on 5-shot
MMLU, surpassing many full-precision models.
- **KL Divergence tracking**: Every GGUF is benchmarked against the original model
on both perplexity and KL divergence.

**Relevance**: Directly applicable. Use Unsloth Dynamic 2.0 GGUFs when available
for any model. They consistently outperform standard k-quant GGUFs at the same
bit-width.
### 2.3 AQLM and QuIP#

Both target extreme compression (2-3 bits):

- **QuIP#** (ICML 2024): Uses randomized Hadamard transforms + E8 lattice codebooks.
First PTQ method where 3-bit outperforms theoretical lossless 4-bit. The E8
codebook fits in L1 cache, enabling inference speedups over FP16.
- **AQLM** v1.1.7 (April 2025): Additive quantization achieving Pareto optimality
below 3 bpw. Outperforms QuIP# on MoE models at 2-bit. Added arbitrary
8-dimensional codebooks on GPU.

Both require PyTorch/CUDA for dequantization kernels. Neither has native llama.cpp
integration or AMD support. They represent the theoretical frontier of what is
achievable at extreme compression but are not practical for Strix Halo today.

**Status**: Research. Watch for llama.cpp integration of QTIP (via ExLlamaV3/EXL3).
### 2.4 AWQ vs GPTQ vs GGUF on AMD

For AMD GPUs in the llama.cpp ecosystem:

- **GGUF**: The only practical choice. Native llama.cpp support with ROCm/Vulkan
acceleration. K-quants and I-quants are well-optimized.
- **AWQ/GPTQ**: Require Marlin kernels for competitive speed (741 tok/s with
Marlin-AWQ vs 67 tok/s without on NVIDIA). Marlin kernels are CUDA-only. On AMD,
these formats are accessible via vLLM or Hugging Face Transformers with ROCm, but
not through llama.cpp.
- **Performance hierarchy on AMD (via vLLM)**: GPTQ and AWQ with Marlin kernels are
fastest on NVIDIA; on AMD ROCm, the performance advantage over GGUF is minimal
and setup complexity is higher.

**Recommendation**: GGUF for llama.cpp on Strix Halo. AWQ/GPTQ only if using vLLM.
### 2.5 Mixed-Precision and Layer-Wise Quantization

Active research area with direct practical implications:

- **Attention vs FFN sensitivity**: Attention layers (QKV projections, output
projection) have varying sensitivity. FFN layers are often the largest component
and frequent targets for aggressive quantization (INT4).
- **Channel-Wise Mixed-Precision (CMPQ)**: Allocates quantization precision per
weight channel based on activation distributions. Adapts to any bit-width.
- **HOBBIT for MoE**: Maintains FP16 and INT4 versions of experts simultaneously.
Hot experts stay at FP16; cold experts use INT4 or even INT2. This concept is
partially implemented in Unsloth Dynamic 2.0's per-layer approach.
- **Fine-Grained Mixed Precision (FGMP)**: Goes below row-level granularity to
handle unstructured sensitivity patterns in both weights and activations.

**Relevance**: Unsloth Dynamic 2.0 already implements the practical version of
layer-wise mixed precision for GGUF. The research frontier is moving toward
sub-layer and channel-level mixed precision.
### 2.6 KV Cache Quantization

- **TurboQuant** (ICLR 2026): Being integrated into llama.cpp. TQ3 (3-bit) achieves
4.9x compression vs FP16 KV cache; TQ4 (4-bit) achieves 3.8x. This directly
reduces memory pressure for long-context inference.
- **llama.cpp native**: Already supports Q8_0 and Q4_0 KV cache quantization via
`--cache-type-k` and `--cache-type-v` flags.

**Relevance**: High. On a 64 GB system, KV cache can consume significant memory for
long contexts. Q4_0 KV cache is recommended; TurboQuant will push this further.

---
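A back-of-envelope estimate shows why this matters. For a Llama-3-8B-class GQA model (32 layers, 8 KV heads, head dim 128 -- assumed values for illustration):

```bash
# Bytes/token of KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes/elt
layers=32 kv_heads=8 head_dim=128
fp16_per_token=$(( 2 * layers * kv_heads * head_dim * 2 ))  # fp16 = 2 bytes/elt
echo "$fp16_per_token"                                      # 131072 bytes (~128 KiB)

# At a 32768-token context, in MiB: fp16 vs ~4.5-bit Q4_0 cells (ratio 4.5/16)
ctx=32768
echo $(( fp16_per_token * ctx / 1024 / 1024 ))              # 4096 MiB
echo $(( fp16_per_token * ctx * 45 / 160 / 1024 / 1024 ))   # 1152 MiB
```

Enabled in llama-server with `--cache-type-k q4_0 --cache-type-v q4_0` (a quantized V cache also requires Flash Attention to be enabled in current llama.cpp builds).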
## 3. Attention Optimization

### 3.1 Flash Attention on AMD

Current status for RDNA 3.5 / gfx1151:

- **Triton backend**: Supports CDNA and RDNA GPUs with fp16, bf16, fp32. This is
the primary Flash Attention path for non-Instinct AMD GPUs.
- **PyTorch integration**: Since PyTorch 2.5.0+, `F.scaled_dot_product_attention`
automatically uses Flash Attention on RDNA cards via the Triton backend.
- **llama.cpp WMMA Flash Attention**: Enabled via `-DGGML_HIP_ROCWMMA_FATTN=ON`.
Uses RDNA 3.5's WMMA instructions for matrix multiply within the attention kernel.
This is the fastest path for long-context inference on Strix Halo.
- **CK (Composable Kernel) backend**: Supports MI200x, MI250x, MI300x, MI355x.
Not available for RDNA consumer GPUs.

**Gap**: Flash Attention 3 (with asynchronous pipelines and FP8 attention) is
NVIDIA Hopper-specific. No AMD equivalent exists.
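At runtime, Flash Attention in llama.cpp is opt-in; a quick way to quantify its effect is llama-bench (a sketch -- the model file is a placeholder and flag syntax may differ across llama.cpp versions):

```bash
# Compare attention paths on the same model: Flash Attention off vs on
./llama-bench -m model-Q4_K_M.gguf -fa 0,1 -p 512 -n 128
```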
### 3.2 SageAttention

SageAttention (ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight) achieves 2-5x speedup
over FlashAttention through quantized attention (8-bit Q/K matrices, FP16 values).
SageAttention3 further uses FP4 Tensor Cores on Blackwell GPUs.

**AMD status**: SageAttention's Triton implementation could theoretically work on
AMD GPUs, but no AMD-optimized kernels exist. The quantized attention concept is
sound and could be adapted.

**Status**: Watch. Would be high-impact if ported to AMD.
### 3.3 Paged Attention

Paged Attention (vLLM) manages KV cache as non-contiguous memory pages, eliminating
60-80% of memory waste from fragmentation. llama.cpp's server mode implements a
simplified version of this for concurrent request handling, but the full PagedAttention
system is more mature in vLLM.

**Relevance**: Moderate for single-user. High for multi-user serving.
### 3.4 GQA/MQA Architecture Implications

Modern models (Llama 2/3, Mistral, Qwen) use Grouped Query Attention:

- GQA reduces KV cache by up to 90% vs MHA (Multi-Head Attention)
- 30-40% faster inference than MHA with near-equivalent accuracy
- Enables larger batch sizes due to smaller memory footprint

**Practical impact**: When choosing models for Strix Halo, prefer GQA models. All
modern model families (Llama 3, Qwen 3, Gemma 3, Mistral) use GQA. Avoid older MHA
models when alternatives exist.
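The reduction factor is just the ratio of query heads to KV heads; for a Llama-3-8B-style configuration (32 query heads, 8 KV heads -- assumed here for illustration):

```bash
# KV cache shrinks by n_heads / n_kv_heads relative to MHA
n_heads=32 n_kv_heads=8
echo "$(( n_heads / n_kv_heads ))x smaller KV cache"        # 4x
echo "$(( 100 - 100 * n_kv_heads / n_heads ))% reduction"   # 75%
```

More aggressive ratios (e.g. 64:4) are where the "up to 90%" figure comes from.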
### 3.5 Ring Attention and Linear Attention

- **Ring Attention**: Distributes long sequences across multiple devices. Achieves
1M context prefill in 77s with 93% parallelization efficiency. Not applicable to
single-device Strix Halo.
- **Linear Attention**: Reduces KV cache from O(n) to O(1) and computation from
O(n^2) to O(n). The Ring-Linear models (hybrid softmax + linear attention) reduce
inference cost to 1/10 of dense models. This is a model architecture choice, not
a runtime optimization.

**Relevance**: Linear attention models would be transformative for long-context on
Strix Halo. Watch for Qwen, DeepSeek, or Llama variants with hybrid attention.

---

## 4. MoE-Specific Optimizations

### 4.1 Expert Offloading on Unified Memory

On discrete GPU systems, MoE inference involves expensive PCIe transfers of expert
weights between CPU RAM and GPU VRAM. On Strix Halo's unified memory, this bottleneck
is fundamentally different:

- All expert weights reside in the same physical memory accessible to both CPU and
GPU. There is no PCIe transfer cost.
- The bottleneck shifts to **memory bandwidth**: at 256 GB/s, loading a 2 GB expert
takes ~7.8 ms. With GGUF Q4 quantization, experts are 4x smaller, reducing this
to ~2 ms.
- **Implication**: Unified memory eliminates the offloading problem but does not
eliminate the bandwidth problem. The optimization focus should be on reducing the
number of expert weights that must be read per token.
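The per-expert numbers above fall out of simple division (illustrative shell arithmetic):

```bash
# Time to stream one expert's weights from DRAM at 256 GB/s, in microseconds
bw_gbs=256
expert_mb=2000                                  # ~2 GB expert
echo "$(( expert_mb * 1000 / bw_gbs )) us"      # 7812 us, ~7.8 ms
echo "$(( expert_mb / 4 * 1000 / bw_gbs )) us"  # 1953 us, ~2 ms after 4x Q4 shrink
```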
### 4.2 Expert Caching and Prediction

The research frontier in 2025-2026 focuses on predicting which experts will be needed:

- **OD-MoE**: 99.94% expert activation prediction accuracy, delivering ~75% of
fully GPU-cached speed using 1/3 GPU memory.
- **MoE-SpeQ**: Uses a small draft model to predict expert sequences, enabling
prefetching. Combines speculative decoding with expert prediction.
- **SP-MoE**: First speculative-decoding-aware expert offloading framework. Achieves
1.07-3.5x TPOT speedup by exploiting structural correspondence between draft
and target models.
- **SliceMoE**: Dynamic Bit-Sliced Caching -- caches experts at sub-expert
granularity, assigning precision on demand.
- **FlashMoE**: ML-based cache replacement for SSD-based expert offloading on edge.

**Relevance for Strix Halo**: Expert caching is less critical when all experts fit
in memory, but expert prediction can still help by enabling **prefetching into
L2/Infinity Cache** before the expert is needed, reducing effective memory latency.
### 4.3 Expert Pruning

- Static pruning: Remove least-used experts entirely (MC-SMoE, EEP). Can reduce
active parameters by up to 96.875% (TSEP). Requires fine-tuning.
- Dynamic pruning: Skip experts below an activation threshold at inference time.
38.2% FLOPs reduction with 1.32x speedup (Li et al.).
- **DynMoE**: 9% FLOPs reduction, 1.37x speedup through dynamic gating.

**Relevance**: Moderate. Dynamic expert skipping could reduce memory bandwidth
requirements on Strix Halo, but requires model-specific configuration.
### 4.4 MoE Quantization -- Inactive Expert Compression

HOBBIT maintains multiple precision versions of experts: FP16 hot experts, INT4 cold
experts, INT2 for rarely-used experts. On unified memory, a variant of this approach
could keep the working set of experts at higher precision while storing rarely-activated
experts at aggressive quantization, reducing total memory footprint.

MoE-CSP achieves 26x speedup through 4-bit/8-bit quantization with custom CUDA
kernels. QMoE achieves 20x memory reduction but lacks efficient 1-bit kernel support.

**Practical approach for Strix Halo**: Use Unsloth Dynamic 2.0 GGUFs, which already
implement per-layer (including per-expert) precision allocation.

---

## 5. Memory Bandwidth Optimization

### 5.1 The Fundamental Bottleneck

LLM inference (especially token generation / decode) is almost always memory-bandwidth
bound. On Strix Halo:

- **Available bandwidth**: 256 GB/s (LPDDR5X-8000, 256-bit bus)
- **Theoretical decode throughput** for a 7B Q4_0 model (~3.5 GB):
  256 GB/s / 3.5 GB = ~73 tok/s (assuming 100% utilization)
- **Measured**: ~50 t/s (tg128), implying ~68% bandwidth utilization
- **Infinity Cache effect**: The 32 MB Infinity Cache acts as a bandwidth amplifier.
When working set fits in cache, effective bandwidth can exceed 256 GB/s. For LLM
inference, per-layer weights typically exceed 32 MB, so cache benefit is limited
to KV cache and activations.
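This roofline is worth keeping as a one-liner when evaluating any model/quant combination (illustrative; assumes every weight is read once per generated token):

```bash
# Upper bound on decode speed = bandwidth / bytes read per token
bw_mbs=256000        # 256 GB/s expressed in MB/s
model_mb=3500        # 7B Q4_0 weights, ~3.5 GB
echo "$(( bw_mbs / model_mb )) tok/s ceiling"                # 73 tok/s

# Measured tg128 (~50 t/s) vs the ceiling gives utilization
echo "$(( 100 * 50 / (bw_mbs / model_mb) ))% utilization"    # 68%
```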
### 5.2 Techniques to Reduce Bandwidth Requirements

| Technique                   | Bandwidth Reduction | Status on Strix Halo |
|-----------------------------|---------------------|----------------------|
| Lower quantization (Q4->Q3) | ~25%                | Available now        |
| KV cache quantization (Q4)  | ~75% for KV reads   | Available now        |
| Speculative decoding        | 2-3x effective      | Available now        |
| Expert prediction/caching   | Variable (MoE)      | Research             |
| Weight compression (EXL3)   | Up to 8x            | No AMD support       |
| Activation checkpointing    | Reduces peak memory | Available            |
### 5.3 Speculative Decoding

The most impactful bandwidth optimization technique available today:

- **Principle**: A small, fast "draft" model generates N candidate tokens. The large
"target" model verifies all N tokens in a single forward pass (batch). Accepted
tokens are "free" -- they required no additional bandwidth from the target model.
- **Speedup**: 2-3x without accuracy loss. NVIDIA demonstrates 3.6x on H200.
- **EAGLE-3**: Lightweight autoregressive head attached to target model internals.
No separate draft model needed.
- **TurboSpec**: Closed-loop control system that dynamically adjusts speculative
parameters based on online feedback.
- **MoE-SpeQ**: Combines speculative decoding with expert prefetching.

**Relevance**: High. Speculative decoding is the single highest-impact optimization
for decode throughput on bandwidth-limited systems like Strix Halo. llama.cpp
supports speculative decoding via `--model-draft`.
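A typical llama.cpp setup pairs a large target with a small same-tokenizer draft model (a sketch; the model choices and draft parameters are illustrative and worth tuning per workload):

```bash
# Target model plus a small draft model for speculation
./llama-server \
  -m  qwen3-32b-Q4_K_M.gguf \
  -md qwen3-0.6b-Q8_0.gguf \
  --draft-max 16 --draft-min 1 \
  -ngl 99 --no-mmap
```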
### 5.4 Prefetching Strategies

- **L2 cache prefetching**: Proactively load KV cache and next-layer weights into
GPU L2 during computation. Achieves 2.15x attention kernel speedup on NVIDIA H20.
- **PRESERVE**: Prefetch model weights from HBM to on-chip cache during communication
operations. Up to 1.6x end-to-end speedup.
- **Strix Halo consideration**: The 32 MB Infinity Cache + 2 MB L2 provides limited
on-chip storage. Prefetching activations and KV cache (which are smaller than
weights) into Infinity Cache during weight reads could help.

### 5.5 Batched Inference

Batching amortizes weight-read cost across multiple requests:

- Single request: ~68% bandwidth utilization on Strix Halo
- Batch of 4: Approaches compute-bound regime for prefill; still bandwidth-bound
for decode on most models
- **Continuous batching** (vLLM, llama.cpp server): 10-20x throughput improvement
over naive batching

**Trade-off**: Batching increases throughput but also increases per-request latency
and memory consumption (KV cache scales linearly with batch size).

---
## 6. OS and Runtime Techniques

### 6.1 Memory Management

**Huge Pages**: Transparent Huge Pages (THP) reduce TLB misses for large model
weights. On Fedora 43, THP is enabled by default. For explicit control:

```bash
# Check current THP setting
cat /sys/kernel/mm/transparent_hugepage/enabled

# For llama.cpp, ensure THP is at least "madvise"
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

For models loaded with mmap, THP automatically promotes 4 KB pages to 2 MB pages,
reducing page faults during inference.

**Memory Locking**: `mlock` prevents model weights from being swapped. llama.cpp's
`--mlock` flag enables this. Critical for systems running other workloads alongside
inference.

**mmap vs direct load**: On Strix Halo with ROCm, disabling mmap (`--no-mmap`) is
recommended. mmap causes catastrophically slow model loading when GPU offloading is
active because of the double-copy path through page cache.
### 6.2 Process Pinning and NUMA

Strix Halo is a single-die APU, so NUMA topology is simple (typically 1 NUMA node).
However, CPU core affinity still matters:

```bash
# Pin inference to specific cores, keeping others free for OS
numactl --physcpubind=0-15 llama-server [args]

# Or via taskset
taskset -c 0-15 llama-server [args]
```

**Core isolation**: For minimum-jitter inference:

```bash
# Add to kernel cmdline
isolcpus=0-15 nohz_full=0-15 rcu_nocbs=0-15
```

This prevents the OS from scheduling unrelated tasks on inference cores.
### 6.3 CPU Frequency and Power

```bash
# Set performance governor for consistent throughput
sudo cpupower frequency-set -g performance

# Verify
cpupower frequency-info | grep "current CPU frequency"
```
### 6.4 cgroups v2 for Resource Isolation

Reserve memory and CPU for inference workloads:

```bash
# Enable controllers for child cgroups (set on the parent, here the root cgroup)
echo "+memory +cpu +cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Create inference cgroup
sudo mkdir /sys/fs/cgroup/inference

# Reserve 56 GB for inference (leave 8 GB for system)
echo $((56 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/inference/memory.min

# Pin CPU cores
echo "0-15" | sudo tee /sys/fs/cgroup/inference/cpuset.cpus

# Run inference in the cgroup: move the current shell in, then launch
echo $$ | sudo tee /sys/fs/cgroup/inference/cgroup.procs
llama-server [args]
```
### 6.5 io_uring for Model Loading

io_uring provides zero-copy, kernel-bypassing I/O that can accelerate initial model
loading. While llama.cpp does not natively use io_uring, the underlying mmap/read
path can benefit from io_uring-based file I/O when loading from NVMe:

- Eliminates context switch overhead during model load
- Enables true async I/O with completion ring buffers
- Most benefit when loading very large models (>32 GB) from storage

**Practical impact**: Minor for Strix Halo since model loading is a one-time cost,
and LPDDR5X bandwidth far exceeds NVMe read speeds.
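Whether the storage path is anywhere near being a bottleneck can be measured with fio's io_uring engine (a sketch; the file path is a placeholder):

```bash
# Sequential read throughput via io_uring, roughly what a cold model load sees
fio --name=modelread --ioengine=io_uring --rw=read --bs=1M \
    --direct=1 --iodepth=16 --readonly --filename=/models/model.gguf
```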
### 6.6 eBPF-Based Performance Monitoring

eBPF enables zero-instrumentation monitoring of inference workloads:

```bash
# Monitor GPU DRM scheduler jobs (works with amdgpu driver)
sudo bpftrace -e 'tracepoint:gpu_scheduler:drm_sched_job { printf("GPU job: %s\n", str(args->name)); }'

# Track page faults during model loading
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'

# Monitor context switches on inference cores
sudo bpftrace -e 'tracepoint:sched:sched_switch /cpu == 0/ { @[args->next_comm] = count(); }'
```

The eunomia project provides ready-made eBPF programs for AI workload monitoring.

---
## 7. Emerging Hardware Features

### 7.1 AMD XDNA NPU

The Ryzen AI MAX+ 395 includes an XDNA 2 NPU rated at 50 TOPS. Current status for
LLM inference:

- **Software stack**: AMD Ryzen AI Software supports ONNX model execution on the NPU.
AMD Quark provides quantization for NPU deployment (SmoothQuant, GPTQ, Quarot).
- **LLM capability**: The NPU can accelerate small models and specific operations
(attention heads, small expert networks) but cannot run full large LLMs.
- **Linux support**: Kernel 7.1 (expected 2026) brings significant XDNA upstreaming.
Current Linux support is limited compared to Windows.
- **Practical use**: The NPU could potentially handle a speculative decoding draft
model while the GPU runs the main model. This is not yet implemented in any
inference engine.

**Status**: Not viable for LLM inference in March 2026. Watch for Linux kernel 7.1
and llama.cpp NPU backend development.
### 7.2 RDNA 3.5 Matrix Cores (WMMA)

The Radeon 8060S (gfx1151) has the same WMMA instruction set as RDNA 3 (gfx11xx),
which is a generation behind RDNA 4 (gfx12xx):

**RDNA 3 / 3.5 (gfx1151) WMMA capabilities**:
- FP16/BF16: 512 FLOPS/clock/CU
- INT8: 1024 OPS/clock/CU
- 16x16 matrix dimensions
- Requires inter-lane data shuffling for chained operations

**RDNA 4 (gfx12xx) improvements over RDNA 3.5**:
- FP16/BF16: 1024 FLOPS/clock/CU (2x)
- INT8: 2048 OPS/clock/CU (2x)
- New FP8/BF8 formats at 4x the FP16 rate
- 2:4 structured sparsity support (effectively 2x more)
- No inter-lane shuffling needed for chained WMMA (major efficiency gain)
- New efficient matrix load instruction

**Current usage in llama.cpp**: WMMA is used for Flash Attention
(`GGML_HIP_ROCWMMA_FATTN`) and matrix-multiply quantized (`MMQ`) kernels. The
ROCm 7.x regression for gfx1151 (issue #17917) specifically affects MMQ kernels.
### 7.3 Vulkan Cooperative Matrices

The `VK_KHR_cooperative_matrix` Vulkan extension was merged into the RADV driver
for RDNA 3+ hardware. This provides a portable API for matrix operations that maps
to WMMA hardware:

- Enables inference engines to use matrix cores through Vulkan instead of
vendor-specific ROCm/HIP APIs
- llama.cpp's Vulkan backend could leverage this for WMMA-accelerated matrix
operations
- Currently less optimized than native HIP/ROCm paths

**Status**: Available in Mesa 25.x. Watch for llama.cpp Vulkan backend improvements
using cooperative matrices.
|
||||
|
||||
### 7.4 Infinity Cache for Inference

Strix Halo has a 32 MB Infinity Cache (MALL -- Memory Attached Last Level):

- **Architecture**: L1 (256 KB/shader array) -> L2 (2 MB) -> Infinity Cache (32 MB)
  -> LPDDR5X
- **Latency**: Slightly higher than discrete-GPU Infinity Cache implementations
- **Hit rate**: Varies by workload; graphics benchmarks show ~73% hit rate at peak.
- **LLM inference implications**: For a 7B Q4 model (~3.5 GB), per-layer weights
  are ~70-140 MB, far exceeding the 32 MB cache. The benefit is limited to:
  - KV cache for the current context (fits well for shorter contexts)
  - Activations and intermediate results
  - The embedding layer (often accessed repeatedly)
  - Small models/layers that fit entirely in cache

The Infinity Cache is most impactful as a bandwidth amplifier -- when inference
accesses exhibit temporal locality (the same data accessed multiple times within a
short window), effective bandwidth exceeds the 256 GB/s DRAM limit.

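A simple two-level model shows how hit rate amplifies effective bandwidth. The
cache-side bandwidth figure below is an assumed illustrative value, not a measured
Strix Halo number:

```python
# Two-level bandwidth model: a fraction `hit_rate` of accesses are served by the
# 32 MB Infinity Cache, the rest go to LPDDR5X at 256 GB/s.
DRAM_BW = 256e9    # bytes/s, LPDDR5X
CACHE_BW = 800e9   # bytes/s, ASSUMED MALL bandwidth for illustration

def effective_bw(hit_rate: float) -> float:
    # Average time to move one byte is the hit-rate-weighted sum of per-level costs.
    t = hit_rate / CACHE_BW + (1 - hit_rate) / DRAM_BW
    return 1 / t

for h in (0.0, 0.25, 0.5, 0.73):
    print(f"hit rate {h:.0%}: {effective_bw(h) / 1e9:.0f} GB/s")
```

Under these assumptions, the ~73% peak hit rate roughly doubles effective
bandwidth -- but weight streaming during decode has little temporal locality, so
real LLM hit rates will be far lower.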
---

## 8. Model-Level Optimizations

### 8.1 Prompt Compression

- **LLMLingua / LLMLingua-2** (Microsoft): Compresses input prompts by removing
  low-information tokens. Up to 20x compression at a ~1.5-point benchmark drop, and
  a 1.7-5.7x end-to-end inference speedup. LLMLingua-2's compressor is 3-6x faster
  than v1's. Integrated into LangChain and LlamaIndex.
- **500xCompressor**: Compresses contexts into a single special token. 6x-480x
  compression. Adds only 0.25% parameters. More aggressive but less mature.

**Relevance**: High for RAG and agentic workloads with long prompts. Reduces
both prefill time and KV cache memory.

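To make the idea concrete, here is a toy sketch of token-level prompt compression.
This is NOT the LLMLingua algorithm (which scores tokens with a small language
model's perplexity and a trained classifier); it merely drops common stopwords to
show the shape of the transformation:

```python
# Toy prompt compression: drop low-information tokens before sending the prompt
# to the inference engine. Real compressors (LLMLingua) score tokens with an LM.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "and", "that", "in", "it"}

def compress(prompt: str) -> str:
    kept = [w for w in prompt.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)

p = "the goal of the benchmark is to measure the decode throughput of the model"
print(compress(p))  # "goal benchmark measure decode throughput model"
```

The compressed prompt is shorter, so both prefill compute and KV cache footprint
shrink proportionally -- the model never sees the dropped tokens.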
### 8.2 Speculative Decoding (Model-Level)

Beyond the engine-level implementation described in Section 5.3:

- **Self-speculative decoding**: The model drafts its own tokens using early exit
  from lower layers. No separate draft model needed.
- **EAGLE-3**: An autoregressive head on the target model's internals. Higher
  acceptance rates than separate draft models.
- **Draft model latency > accuracy**: Research shows that draft model speed matters
  more than its language-modeling accuracy for overall throughput.

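The latency-over-accuracy point follows from the standard speedup model for
speculative decoding (assuming each drafted token is accepted independently with
probability `alpha`; `c` is the draft's cost relative to one target forward pass):

```python
# Expected speedup of speculative decoding with k drafted tokens per round.
# Expected tokens produced per target verification pass (including the bonus
# token sampled from the target) is the geometric sum (1 - alpha^(k+1)) / (1 - alpha).
def speedup(alpha: float, k: int, c: float) -> float:
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost = k * c + 1  # k draft passes plus one target verification pass
    return expected_tokens / cost

# A fast, mediocre draft beats an accurate but slow one:
print(round(speedup(alpha=0.6, k=4, c=0.05), 2))  # 1.92
print(round(speedup(alpha=0.8, k=4, c=0.50), 2))  # 1.12
```

Despite a much lower acceptance rate, the cheap draft wins because its per-token
cost barely dilutes the verification pass.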
### 8.3 Mixture of Depths / Mixture of Recursions

- **Mixture of Depths (MoD)**: Dynamically allocates compute to the tokens that
  need it. 2-3x inference speedup with minimal quality degradation. Implemented at
  training time -- requires model architecture support.
- **Mixture of Recursions (MoR)** (NeurIPS 2025): Combines parameter sharing with
  adaptive token-level compute. Lightweight routers assign different recursion
  depths to individual tokens. 2x inference throughput with reduced KV cache sizes.

**Relevance**: These are model architecture choices, not runtime optimizations.
Watch for models trained with MoD/MoR architectures.

### 8.4 Structured Pruning

Post-training methods that permanently remove model components:

- **Width pruning**: Remove neurons, attention heads, or embedding channels. Better
  accuracy retention than depth pruning.
- **Depth pruning**: Remove entire layers. More latency reduction per parameter
  removed.
- **LLM-Pruner, SliceGPT, FLAP**: State-of-the-art structured pruning methods.
- **AMP**: Jointly prunes attention heads and MLP neurons.
- **NIRVANA** (2025): Structured pruning reimagined for LLM compression.

**Practical approach**: Structured pruning requires per-model effort and is
generally less practical than quantization for local inference. Exception: if a
specific model is too slow at a given quantization level, pruning first and then
quantizing can yield a better speed/quality trade-off.

### 8.5 Token Merging and Pruning

- **TokenSelect** (EMNLP 2025): Dynamic token-level KV cache selection for
  efficient long-context inference and length extrapolation.
- **LightThinker**: Step-by-step compression of chain-of-thought reasoning.
- **Attention sparsity**: Twilight (NeurIPS 2025) uses hierarchical top-p pruning
  for adaptive attention sparsity.

These techniques reduce the effective sequence length during inference, directly
reducing both compute and memory bandwidth requirements.

---

## 9. Prioritized Recommendations for Strix Halo

### Tier 1: Implement Now (High Impact, Available Today)

1. **Use Unsloth Dynamic 2.0 GGUFs** for all models. They provide the best
   quality-per-bit through intelligent layer-wise quantization.

2. **Build llama.cpp with WMMA Flash Attention**: `-DGGML_HIP_ROCWMMA_FATTN=ON
   -DGGML_HIP_UMA=ON`. Monitor issue #17917 for the MMQ regression fix.

3. **Disable mmap for ROCm**: Always use `--no-mmap` (or `-mmp 0` in llama-bench)
   to avoid the double-copy performance penalty.

4. **Enable KV cache quantization**: Use `--cache-type-k q4_0 --cache-type-v q4_0`
   for long-context workloads. Watch for TurboQuant integration.

5. **Set `ROCBLAS_USE_HIPBLASLT=1`**: Forces the optimized hipBLASLt kernels.

6. **Speculative decoding for decode-heavy workloads**: Use `--model-draft` with a
   small model from the same family.

7. **GPU performance governor and frequency pinning**: Ensures consistent
   throughput.

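Pulled together, the Tier 1 items above look roughly like the following. This is a
sketch, not a verified recipe: the model path is a placeholder, and exact flag
spellings can vary between llama.cpp versions.

```sh
# Build llama.cpp with HIP, WMMA Flash Attention, and UMA for gfx1151.
cmake -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON \
      -DAMDGPU_TARGETS=gfx1151
cmake --build build -j

# Serve with hipBLASLt kernels, no mmap, and quantized KV cache.
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
    -m /path/to/model.Q4_K_M.gguf \
    --no-mmap \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    -ngl 99
```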
### Tier 2: Evaluate (Moderate Impact, Some Setup Required)

8. **LLMLingua-2 for agentic/RAG workloads**: Compress long prompts before
   inference. 3-6x prompt processing speedup.

9. **vLLM for multi-user serving**: If running concurrent inference requests
   (e.g., agentic eval pipelines), vLLM's continuous batching and PagedAttention
   provide a 10-20x throughput improvement.

10. **cgroups v2 memory reservation**: Prevent the OS from reclaiming GPU-mapped
    memory under memory pressure.

11. **Vulkan backend for short-context workloads**: Test whether the Vulkan/RADV
    path is faster than ROCm for your specific model and context length.

12. **Process pinning** with `numactl` or `taskset` for reduced scheduling jitter.

### Tier 3: Watch and Prepare (High Potential, Not Ready)

13. **KTransformers for >64 GB models**: When running DeepSeek V3 or similar models
    that exceed available memory.

14. **ExLlamaV3/EXL3 AMD support**: If AMD kernels arrive, EXL3's QTIP-based
    quantization could significantly improve quality at extreme compression.

15. **XDNA NPU for draft model acceleration**: If/when llama.cpp adds NPU backend
    support, the NPU could run the draft model for speculative decoding.

16. **SageAttention AMD port**: 2-5x attention speedup through quantized attention.

17. **Linear attention models**: Watch for hybrid softmax/linear attention models
    from major labs that would dramatically improve long-context inference.

18. **Cooperative matrices in Vulkan**: As llama.cpp's Vulkan backend matures, this
    provides a portable path to WMMA acceleration without a ROCm dependency.

---

## 10. Sources

### Papers and Conference Proceedings

- Raposo et al., "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models," 2024. https://arxiv.org/abs/2404.02258
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," ICML 2023. https://arxiv.org/abs/2305.13245
- Tseng et al., "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks," ICML 2024. https://arxiv.org/abs/2402.04396
- Egiazarian et al., "AQLM: Extreme Compression of Large Language Models via Additive Quantization," ICLR 2025. https://arxiv.org/abs/2401.06118
- Chen et al., "KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models," SOSP 2025. https://dl.acm.org/doi/10.1145/3731569.3764843
- Min et al., "Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation," NeurIPS 2025. https://arxiv.org/abs/2507.10524
- Varadarajan et al., "Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures," 2025. https://arxiv.org/abs/2504.11750
- Zandieh et al., "TurboQuant: Extreme KV Cache Quantization," ICLR 2026. https://github.com/ggml-org/llama.cpp/discussions/20969
- Agrawal et al., "SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration," ICLR 2025. https://arxiv.org/abs/2410.02367
- Ye et al., "FlashInfer: Efficient and Customizable Attention Engine for LLM Serving," 2025. https://arxiv.org/abs/2501.01005
- Jiang et al., "LLMLingua: Compressing Prompts for Accelerated Inference," EMNLP 2023. https://arxiv.org/abs/2310.05736
- Li et al., "A Survey on Inference Optimization Techniques for Mixture of Experts Models," 2024. https://arxiv.org/abs/2412.14219
- Liu et al., "MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching," 2025. https://arxiv.org/abs/2511.14102
- Zhou et al., "SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE Inference," 2025. https://arxiv.org/abs/2510.10302
- He et al., "SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints," 2025. https://arxiv.org/abs/2512.12990
- Jin et al., "OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference," 2025. https://arxiv.org/abs/2512.03927

### Documentation and Technical References

- AMD ROCm Strix Halo System Optimization: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/strixhalo.html
- AMD GPUOpen -- Using Matrix Cores of RDNA 4: https://gpuopen.com/learn/using_matrix_core_amd_rdna4/
- AMD GPUOpen -- Accelerating Generative AI on Radeon GPUs: https://gpuopen.com/learn/accelerating_generative_ai_on_amd_radeon_gpus/
- vLLM ROCm Blog: https://blog.vllm.ai/2026/02/27/rocm-attention-backend.html
- AMD ROCm vLLM Blog: https://rocm.blogs.amd.com/software-tools-optimization/vllm-omni/README.html
- AMD AI Inference on Ryzen AI NPU with Quark: https://www.amd.com/en/developer/resources/technical-articles/2025/ai-inference-acceleration-on-ryzen-ai-with-quark.html
- Chips and Cheese -- Evaluating Infinity Cache in Strix Halo: https://chipsandcheese.com/p/evaluating-the-infinity-cache-in
- Chips and Cheese -- RDNA 4 Architecture at Hot Chips 2025: https://chipsandcheese.com/p/amds-rdna4-gpu-architecture-at-hot
- Linux Kernel XDNA NPU Documentation: https://docs.kernel.org/accel/amdxdna/amdnpu.html

### Community Resources and Guides

- llama.cpp ROCm Performance Discussion: https://github.com/ggml-org/llama.cpp/discussions/15021
- llama.cpp Strix Halo UMA Detection Bug: https://github.com/ggml-org/llama.cpp/issues/18159
- llama.cpp Strix Halo Performance Regression: https://github.com/ggml-org/llama.cpp/issues/17917
- Strix Halo Wiki -- llama.cpp with ROCm: https://strixhalo.wiki/AI/llamacpp-with-ROCm
- Strix Halo Wiki -- Performance: https://strixhalo.wiki/AI/llamacpp-performance
- AMD Strix Halo Toolboxes: https://github.com/kyuz0/amd-strix-halo-toolboxes
- LLM Tracker -- AMD GPUs: https://llm-tracker.info/howto/AMD-GPUs
- LLM Tracker -- Strix Halo: https://llm-tracker.info/_TOORG/Strix-Halo
- Unsloth Dynamic 2.0 Documentation: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
- Unsloth Dynamic v2.0 Blog: https://unsloth.ai/blog/dynamic-v2
- KTransformers GitHub: https://github.com/kvcache-ai/ktransformers
- ExLlamaV3 GitHub: https://github.com/turboderp-org/exllamav3
- BitNet GitHub: https://github.com/microsoft/BitNet
- LLMLingua GitHub: https://github.com/microsoft/LLMLingua
- MoE Inference Awesome List: https://github.com/MoE-Inf/awesome-moe-inference
- Awesome LLM Inference: https://github.com/xlite-dev/Awesome-LLM-Inference
- Phoronix -- ROCm 7.1 vs Vulkan on AI PRO R9700: https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan
- eunomia -- OS-Level LLM Inference Optimizations: https://eunomia.dev/blog/2025/02/18/os-level-challenges-in-llm-inference-and-optimizations/
- RADV Cooperative Matrix for RDNA4: https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1524861-vulkan-cooperative-matrix-merged-for-rdna4-gpus-with-radv-dcc-support-inches-closer
- Kaitchup -- GGUF Quant Selection: https://kaitchup.substack.com/p/choosing-a-gguf-model-k-quants-i