LLM Inference Optimization Landscape (March 2026)
Scope
Comprehensive survey of cutting-edge LLM inference optimization techniques applicable to a high-end AMD APU workstation: Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S (gfx1151, RDNA 3.5), 64 GB unified LPDDR5X memory, 256 GB/s bandwidth. Covers inference engines, quantization, attention, MoE optimization, memory bandwidth, OS-level tuning, hardware features, and model-level techniques. Research current as of March 2026.
Table of Contents
- Inference Engines and Backends
- Advanced Quantization Techniques
- Attention Optimization
- MoE-Specific Optimizations
- Memory Bandwidth Optimization
- OS and Runtime Techniques
- Emerging Hardware Features
- Model-Level Optimizations
- Prioritized Recommendations for Strix Halo
- Sources
1. Inference Engines and Backends
1.1 llama.cpp -- Still the Foundation
llama.cpp remains the dominant local inference engine. All major interfaces (Ollama, LM Studio, GPT4All, KoboldCpp) use it under the hood. For Strix Halo specifically:
- ROCm/HIP backend: Build with `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`. The `ROCBLAS_USE_HIPBLASLT=1` environment variable forces hipBLASLt kernels, which deliver the best throughput on gfx1151.
- Vulkan backend: The RADV Mesa driver has seen active RDNA 3.5/4 optimization in Mesa 25.x. In some benchmarks Vulkan outperforms ROCm for single-shot inference and shorter contexts. HIP+WMMA+FlashAttention is fastest for long contexts (tg8192+).
- UMA detection bug (issue #18159): llama.cpp's UMA detection can incorrectly limit available memory on AMD APUs with large TTM allocations. The `--mmp 0` (disable mmap) flag is critical for ROCm on Strix Halo to avoid catastrophically slow model loading.
- Performance: Llama-2-7B Q4_0 achieves ~1464 t/s prompt processing (pp512) and ~50 t/s token generation (tg128) on Strix Halo with ROCm.
- Known regression: A commit enabling WMMA-MMQ INT kernels for RDNA 3 introduced significant prompt processing regression on gfx1151 with ROCm 7.x (issue #17917).
Status: Production-ready. Best single-engine choice for Strix Halo.
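The build flags above can be combined into a single build sketch. This is illustrative, not authoritative: the repository URL and parallelism settings are assumptions, so verify against the current llama.cpp build documentation for ROCm.

```shell
# Sketch: llama.cpp build for Strix Halo (gfx1151) with HIP, UMA, and
# WMMA Flash Attention enabled. Adjust for your ROCm installation.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGGML_HIP_UMA=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"

# At runtime, force the optimized hipBLASLt kernels
export ROCBLAS_USE_HIPBLASLT=1
```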
1.2 KTransformers -- CPU/GPU Hybrid MoE Specialist
KTransformers (SOSP 2025) is the most significant new engine for hybrid inference. It was purpose-built for running large MoE models (DeepSeek-R1/V3) on systems with limited GPU memory but abundant CPU memory.
- AMX-optimized kernels: Uses Intel AMX instructions for CPU-side expert computation. For AMD Zen 5, it falls back to AVX-512, which is still substantially faster than naive CPU inference.
- Async CPU-GPU scheduling: Overlaps CPU expert computation with GPU attention computation, hiding CPU latency.
- Performance: 4.62-19.74x prefill speedup, 1.25-4.09x decode speedup vs existing hybrid systems. SGLang + KTransformers achieves 220+ tok/s total throughput on trillion-parameter MoE models.
- Relevance to Strix Halo: Moderate. KTransformers shines when GPU VRAM is scarce (24 GB discrete) and CPU RAM is abundant (382 GB). On Strix Halo, all 64 GB is accessible to the GPU, so the CPU offloading advantage is diminished. However, for models exceeding 64 GB, KTransformers-style hybrid inference becomes relevant.
Status: Production. Most useful for models that exceed available VRAM.
1.3 PowerInfer / PowerInfer-2
PowerInfer-2 targets smartphones, achieving 11.68 t/s on Mixtral 47B (22x faster than alternatives). It exploits MoE sparsity by predicting which experts will activate and only loading those. The core technique -- hot/cold neuron partitioning and GPU-resident hot neurons -- is architecturally interesting but the implementation targets mobile SoCs with discrete memory hierarchies, not unified-memory APUs where all memory is equally accessible to the GPU.
Status: Research. Techniques are partially subsumed by llama.cpp's own MoE offloading improvements.
1.4 MLC-LLM
MLC-LLM compiles models via TVM to target multiple backends including ROCm, Vulkan, Metal, and OpenCL. It was one of the first engines to make AMD GPUs competitive for LLM inference (2023 blog post). The Vulkan backend provides a universal fallback that works on any GPU.
Status: Active but niche. For Strix Halo, llama.cpp's native ROCm/Vulkan backends are more mature and better optimized.
1.5 mistral.rs / candle / burn
Rust-based inference engines:
- mistral.rs: Built on Hugging Face's candle library. Supports GGUF, GPTQ, ISQ (in-situ quantization). Has CUDA support but no ROCm backend.
- candle: Hugging Face's Rust ML framework. GPU support via CUDA; no ROCm.
- burn: Rust ML framework with multiple backends (WGPU, Vulkan, CUDA). The WGPU/Vulkan path could theoretically work on AMD, but LLM inference support is limited.
Status: Not viable for Strix Halo in 2026. No ROCm support, and the Vulkan paths are less optimized than llama.cpp's.
1.6 BitNet.cpp
Microsoft's official 1-bit LLM inference framework. Achieves 6x faster inference and 82% lower energy consumption. GPU kernel support was added May 2025 for NVIDIA and Apple Silicon. No AMD GPU kernels yet. CPU-only mode works on any x86 system and could be relevant for future 1-bit models, but the model ecosystem (BitNet b1.58 variants) remains small.
Status: Watch. No AMD GPU support. CPU path works but model selection is limited.
1.7 vLLM and SGLang
Both are production LLM serving frameworks with AMD ROCm support:
- vLLM v0.16.0 (Feb 2026): ROCm is now a first-class platform. 93% of AMD test groups passing. Native AITER FP8 kernels, fused LayerNorm/SiLU, optimized Paged Attention. Extended bitsandbytes quantization to warp-size-32 GPUs (RDNA).
- SGLang: Supports ROCm. KTransformers integration for hybrid MoE inference.
Both are overkill for single-user local inference but become relevant for serving multiple users or running agentic workloads with concurrent requests.
Status: Production for server workloads. Consider if running multi-user or agentic eval pipelines.
1.8 ExLlamaV3 / EXL3
ExLlamaV3 introduces the EXL3 format (based on QTIP from Cornell RelaxML), achieving excellent perplexity at extreme compression (Llama 3.3 70B at 1.75 bpw, 19 GB). The Marlin-inspired GEMM kernels are highly optimized for NVIDIA GPUs. AMD ROCm support was absent at launch (early 2025) and current status is uncertain.
Status: Watch. Potentially best-in-class quantization quality, but AMD support is unclear.
2. Advanced Quantization Techniques
2.1 GGUF Quantization Landscape
GGUF remains the dominant format for local inference via llama.cpp. The key variants:
| Format | Bits | Method | Best For |
|---|---|---|---|
| Q8_0 | 8 | Round-to-nearest | Maximum quality, 2x compression |
| Q6_K | 6.5 | K-quant | High quality, 2.5x compression |
| Q5_K_M | 5.5 | K-quant+imatrix | Balanced quality/size |
| Q4_K_M | 4.5 | K-quant+imatrix | Default recommendation |
| Q3_K_M | 3.9 | K-quant+imatrix | Aggressive, still usable |
| IQ3_XXS | 3.06 | I-quant+imatrix | Extreme compression |
| IQ2_XXS | 2.06 | I-quant+imatrix | Near-minimum viable |
| IQ1_S | 1.56 | I-quant+imatrix | Experimental |
imatrix (Importance Matrix): The single most impactful quality improvement for sub-4-bit quantization. The importance matrix identifies which weights produce large activations during inference and allocates more precision to them. For aggressive quantization (<4 bits), imatrix is no longer optional -- it is essential.
Recommendation: Q4_K_M + imatrix for most use cases. Q3_K_M + imatrix when fitting a larger model matters more than marginal quality.
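As a sketch, producing an imatrix-aware Q4_K_M quant with llama.cpp's bundled tools might look like the following. File names are placeholders, and tool names and options should be verified against your llama.cpp build.

```shell
# Compute an importance matrix from a calibration text, then apply it
# during quantization so sensitive weights keep more precision.
llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
llama-quantize --imatrix model.imatrix \
  model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```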
2.2 Unsloth Dynamic 2.0
Unsloth Dynamic 2.0 (Feb 2026) represents the state-of-the-art in intelligent GGUF quantization:
- Per-layer adaptive quantization: Each layer gets a custom quantization type based on sensitivity analysis. The quantization scheme for Gemma 3 differs significantly from Llama 4.
- Universal MoE + dense support: Dynamic 2.0 works on all architectures (previously MoE-only).
- Calibration dataset: 1.5M+ token hand-curated dataset for improved conversational quality.
- Quality results: Dynamic 3-bit DeepSeek V3.1 GGUF scores 75.6% on 5-shot MMLU, surpassing many full-precision models.
- KL Divergence tracking: Every GGUF is benchmarked against the original model on both perplexity and KL divergence.
Relevance: Directly applicable. Use Unsloth Dynamic 2.0 GGUFs when available for any model. They consistently outperform standard k-quant GGUFs at the same bit-width.
2.3 AQLM and QuIP#
Both target extreme compression (2-3 bits):
- QuIP# (ICML 2024): Uses randomized Hadamard transforms + E8 lattice codebooks. First PTQ method where 3-bit outperforms theoretical lossless 4-bit. The E8 codebook fits in L1 cache, enabling inference speedups over FP16.
- AQLM v1.1.7 (April 2025): Additive quantization achieving Pareto optimality below 3 bpw. Outperforms QuIP# on MoE models at 2-bit. Added arbitrary 8-dimensional codebooks on GPU.
Both require PyTorch/CUDA for dequantization kernels. Neither has native llama.cpp integration or AMD support. They represent the theoretical frontier of what is achievable at extreme compression but are not practical for Strix Halo today.
Status: Research. Watch for llama.cpp integration of QTIP (via ExLlamaV3/EXL3).
2.4 AWQ vs GPTQ vs GGUF on AMD
For AMD GPUs in the llama.cpp ecosystem:
- GGUF: The only practical choice. Native llama.cpp support with ROCm/Vulkan acceleration. K-quants and I-quants are well-optimized.
- AWQ/GPTQ: Require Marlin kernels for competitive speed (741 tok/s with Marlin-AWQ vs 67 tok/s without on NVIDIA). Marlin kernels are CUDA-only. On AMD, these formats are accessible via vLLM or Hugging Face Transformers with ROCm, but not through llama.cpp.
- Performance hierarchy on AMD (via vLLM): GPTQ and AWQ with Marlin kernels are fastest on NVIDIA; on AMD ROCm, the performance advantage over GGUF is minimal and setup complexity is higher.
Recommendation: GGUF for llama.cpp on Strix Halo. AWQ/GPTQ only if using vLLM.
2.5 Mixed-Precision and Layer-Wise Quantization
Active research area with direct practical implications:
- Attention vs FFN sensitivity: Attention layers (QKV projections, output projection) have varying sensitivity. FFN layers are often the largest component and frequent targets for aggressive quantization (INT4).
- Channel-Wise Mixed-Precision (CMPQ): Allocates quantization precision per weight channel based on activation distributions. Adapts to any bit-width.
- HOBBIT for MoE: Maintains FP16 and INT4 versions of experts simultaneously. Hot experts stay at FP16; cold experts use INT4 or even INT2. This concept is partially implemented in Unsloth Dynamic 2.0's per-layer approach.
- Fine-Grained Mixed Precision (FGMP): Goes below row-level granularity to handle unstructured sensitivity patterns in both weights and activations.
Relevance: Unsloth Dynamic 2.0 already implements the practical version of layer-wise mixed precision for GGUF. The research frontier is moving toward sub-layer and channel-level mixed precision.
2.6 KV Cache Quantization
- TurboQuant (ICLR 2026): Being integrated into llama.cpp. TQ3 (3-bit) achieves 4.9x compression vs FP16 KV cache; TQ4 (4-bit) achieves 3.8x. This directly reduces memory pressure for long-context inference.
- llama.cpp native: Already supports Q8_0 and Q4_0 KV cache quantization via the `--cache-type-k` and `--cache-type-v` flags.
Relevance: High. On a 64 GB system, KV cache can consume significant memory for long contexts. Q4_0 KV cache is recommended; TurboQuant will push this further.
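A hedged launch sketch for long-context serving with a quantized KV cache. The model path is a placeholder; note that quantizing the V cache generally requires flash attention to be enabled, and the flash-attention flag spelling has changed across llama.cpp versions, so check `--help` on your build.

```shell
# Long-context server with Q4_0 KV cache to cut cache memory ~4x vs FP16.
llama-server -m model-Q4_K_M.gguf \
  -c 32768 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -ngl 99
```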
3. Attention Optimization
3.1 Flash Attention on AMD
Current status for RDNA 3.5 / gfx1151:
- Triton backend: Supports CDNA and RDNA GPUs with fp16, bf16, fp32. This is the primary Flash Attention path for non-Instinct AMD GPUs.
- PyTorch integration: Since PyTorch 2.5.0+, `F.scaled_dot_product_attention` automatically uses Flash Attention on RDNA cards via the Triton backend.
- llama.cpp WMMA Flash Attention: Enabled via `-DGGML_HIP_ROCWMMA_FATTN=ON`. Uses RDNA 3.5's WMMA instructions for matrix multiply within the attention kernel. This is the fastest path for long-context inference on Strix Halo.
- CK (Composable Kernel) backend: Supports MI200x, MI250x, MI300x, MI355x. Not available for RDNA consumer GPUs.
Gap: Flash Attention 3 (with asynchronous pipelines and FP8 attention) is NVIDIA Hopper-specific. No AMD equivalent exists.
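To measure what WMMA Flash Attention buys on a specific model, llama-bench can sweep the flag directly. The model path is a placeholder; assumes a build with `-DGGML_HIP_ROCWMMA_FATTN=ON`.

```shell
# Benchmark with flash attention off and on (-fa 0,1), at short and long
# prompt lengths (-p) plus token generation (-n), to expose the
# long-context advantage of the WMMA path.
llama-bench -m model-Q4_K_M.gguf -fa 0,1 -p 512,8192 -n 128
```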
3.2 SageAttention
SageAttention (ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight) achieves 2-5x speedup over FlashAttention through quantized attention (8-bit Q/K matrices, FP16 values). SageAttention3 further uses FP4 Tensor Cores on Blackwell GPUs.
AMD status: SageAttention's Triton implementation could theoretically work on AMD GPUs, but no AMD-optimized kernels exist. The quantized attention concept is sound and could be adapted.
Status: Watch. Would be high-impact if ported to AMD.
3.3 Paged Attention
Paged Attention (vLLM) manages KV cache as non-contiguous memory pages, eliminating 60-80% of memory waste from fragmentation. llama.cpp's server mode implements a simplified version of this for concurrent request handling, but the full PagedAttention system is more mature in vLLM.
Relevance: Moderate for single-user. High for multi-user serving.
3.4 GQA/MQA Architecture Implications
Modern models (Llama 2/3, Mistral, Qwen) use Grouped Query Attention:
- GQA reduces KV cache by up to 90% vs MHA (Multi-Head Attention)
- 30-40% faster inference than MHA with near-equivalent accuracy
- Enables larger batch sizes due to smaller memory footprint
Practical impact: When choosing models for Strix Halo, prefer GQA models. All modern model families (Llama 3, Qwen 3, Gemma 3, Mistral) use GQA. Avoid older MHA models when alternatives exist.
3.5 Ring Attention and Linear Attention
- Ring Attention: Distributes long sequences across multiple devices. Achieves 1M context prefill in 77s with 93% parallelization efficiency. Not applicable to single-device Strix Halo.
- Linear Attention: Reduces KV cache from O(n) to O(1) and computation from O(n^2) to O(n). The Ring-Linear models (hybrid softmax + linear attention) reduce inference cost to 1/10 of dense models. This is a model architecture choice, not a runtime optimization.
Relevance: Linear attention models would be transformative for long-context on Strix Halo. Watch for Qwen, DeepSeek, or Llama variants with hybrid attention.
4. MoE-Specific Optimizations
4.1 Expert Offloading on Unified Memory
On discrete GPU systems, MoE inference involves expensive PCIe transfers of expert weights between CPU RAM and GPU VRAM. On Strix Halo's unified memory, this bottleneck is fundamentally different:
- All expert weights reside in the same physical memory accessible to both CPU and GPU. There is no PCIe transfer cost.
- The bottleneck shifts to memory bandwidth: at 256 GB/s, loading a 2 GB expert takes ~7.8 ms. With GGUF Q4 quantization, experts are 4x smaller, reducing this to ~2 ms.
- Implication: Unified memory eliminates the offloading problem but does not eliminate the bandwidth problem. The optimization focus should be on reducing the number of expert weights that must be read per token.
4.2 Expert Caching and Prediction
The research frontier in 2025-2026 focuses on predicting which experts will be needed:
- OD-MoE: 99.94% expert activation prediction accuracy, delivering ~75% of fully GPU-cached speed using 1/3 GPU memory.
- MoE-SpeQ: Uses a small draft model to predict expert sequences, enabling prefetching. Combines speculative decoding with expert prediction.
- SP-MoE: First speculative-decoding-aware expert offloading framework. Achieves 1.07-3.5x TPOT speedup by exploiting structural correspondence between draft and target models.
- SliceMoE: Dynamic Bit-Sliced Caching -- caches experts at sub-expert granularity, assigning precision on demand.
- FlashMoE: ML-based cache replacement for SSD-based expert offloading on edge.
Relevance for Strix Halo: Expert caching is less critical when all experts fit in memory, but expert prediction can still help by enabling prefetching into L2/Infinity Cache before the expert is needed, reducing effective memory latency.
4.3 Expert Pruning
- Static pruning: Remove least-used experts entirely (MC-SMoE, EEP). Can reduce active parameters by up to 96.875% (TSEP). Requires fine-tuning.
- Dynamic pruning: Skip experts below an activation threshold at inference time. 38.2% FLOPs reduction with 1.32x speedup (Li et al.).
- DynMoE: 9% FLOPs reduction, 1.37x speedup through dynamic gating.
Relevance: Moderate. Dynamic expert skipping could reduce memory bandwidth requirements on Strix Halo, but requires model-specific configuration.
4.4 MoE Quantization -- Inactive Expert Compression
HOBBIT maintains multiple precision versions of experts: FP16 hot experts, INT4 cold experts, INT2 for rarely-used experts. On unified memory, a variant of this approach could keep the working set of experts at higher precision while storing rarely-activated experts at aggressive quantization, reducing total memory footprint.
MoE-CSP achieves 26x speedup through 4-bit/8-bit quantization with custom CUDA kernels. QMoE achieves 20x memory reduction but lacks efficient 1-bit kernel support.
Practical approach for Strix Halo: Use Unsloth Dynamic 2.0 GGUFs, which already implement per-layer (including per-expert) precision allocation.
5. Memory Bandwidth Optimization
5.1 The Fundamental Bottleneck
LLM inference (especially token generation / decode) is almost always memory-bandwidth bound. On Strix Halo:
- Available bandwidth: 256 GB/s (LPDDR5X-8000, 256-bit bus)
- Theoretical decode throughput for a 7B Q4_0 model (~3.5 GB): 256 GB/s / 3.5 GB = ~73 tok/s (assuming 100% utilization)
- Measured: ~50 t/s (tg128), implying ~68% bandwidth utilization
- Infinity Cache effect: The 32 MB Infinity Cache acts as a bandwidth amplifier. When working set fits in cache, effective bandwidth can exceed 256 GB/s. For LLM inference, per-layer weights typically exceed 32 MB, so cache benefit is limited to KV cache and activations.
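The decode ceiling quoted above is simple arithmetic: every generated token must stream the full weight set from DRAM at least once, so throughput is bounded by bandwidth divided by model size.

```shell
# Decode is bandwidth-bound: tok/s <= bandwidth / model_size.
awk 'BEGIN {
  bw_gbs  = 256.0   # GB/s, LPDDR5X-8000 on a 256-bit bus
  size_gb = 3.5     # GB, 7B model at Q4_0
  printf "decode ceiling: %.1f tok/s\n", bw_gbs / size_gb
}'
# prints "decode ceiling: 73.1 tok/s"
```

The measured ~50 t/s against this ~73 t/s ceiling is where the ~68% utilization figure comes from.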
5.2 Techniques to Reduce Bandwidth Requirements
| Technique | Bandwidth Reduction | Status on Strix Halo |
|---|---|---|
| Lower quantization (Q4->Q3) | ~25% | Available now |
| KV cache quantization (Q4) | ~75% for KV reads | Available now |
| Speculative decoding | 2-3x effective | Available now |
| Expert prediction/caching | Variable (MoE) | Research |
| Weight compression (EXL3) | Up to 8x | No AMD support |
| Activation checkpointing | Reduces peak memory | Available |
5.3 Speculative Decoding
The most impactful bandwidth optimization technique available today:
- Principle: A small, fast "draft" model generates N candidate tokens. The large "target" model verifies all N tokens in a single forward pass (batch). Accepted tokens are "free" -- they required no additional bandwidth from the target model.
- Speedup: 2-3x without accuracy loss. NVIDIA demonstrates 3.6x on H200.
- EAGLE-3: Lightweight autoregressive head attached to target model internals. No separate draft model needed.
- TurboSpec: Closed-loop control system that dynamically adjusts speculative parameters based on online feedback.
- MoE-SpeQ: Combines speculative decoding with expert prefetching.
Relevance: High. Speculative decoding is the single highest-impact optimization for decode throughput on bandwidth-limited systems like Strix Halo. llama.cpp supports speculative decoding via `--model-draft`.
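A minimal launch sketch, assuming a small draft model from the same family. Model names are placeholders, and draft-specific flag spellings vary by version, so check `llama-server --help`.

```shell
# Target model plus a small same-family draft model; the target verifies
# drafted tokens in batched forward passes, amortizing weight reads.
llama-server \
  -m  qwen3-32b-Q4_K_M.gguf \
  -md qwen3-0.6b-Q4_K_M.gguf \
  -ngl 99 -ngld 99
```

Draft models from the same family share a tokenizer and output distribution, which keeps acceptance rates high.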
5.4 Prefetching Strategies
- L2 cache prefetching: Proactively load KV cache and next-layer weights into GPU L2 during computation. Achieves 2.15x attention kernel speedup on NVIDIA H20.
- PRESERVE: Prefetch model weights from HBM to on-chip cache during communication operations. Up to 1.6x end-to-end speedup.
- Strix Halo consideration: The 32 MB Infinity Cache + 2 MB L2 provides limited on-chip storage. Prefetching activations and KV cache (which are smaller than weights) into Infinity Cache during weight reads could help.
5.5 Batched Inference
Batching amortizes weight-read cost across multiple requests:
- Single request: ~68% bandwidth utilization on Strix Halo
- Batch of 4: Approaches compute-bound regime for prefill; still bandwidth-bound for decode on most models
- Continuous batching (vLLM, llama.cpp server): 10-20x throughput improvement over naive batching
Trade-off: Batching increases throughput but also increases per-request latency and memory consumption (KV cache scales linearly with batch size).
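For concurrent requests, llama-server exposes slot-based batching; a sketch under the assumption of recent flag spellings (the model path is a placeholder, and continuous batching is enabled by default on recent builds):

```shell
# Serve up to 4 concurrent slots. The -c context budget is shared across
# slots, so KV cache memory grows with -np, as noted in the trade-off above.
llama-server -m model-Q4_K_M.gguf -c 16384 -np 4 -ngl 99
```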
6. OS and Runtime Techniques
6.1 Memory Management
Huge Pages: Transparent Huge Pages (THP) reduce TLB misses for large model weights. On Fedora 43, THP is enabled by default. For explicit control:
```shell
# Check current THP setting
cat /sys/kernel/mm/transparent_hugepage/enabled
# For llama.cpp, ensure THP is at least "madvise"
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```
For models loaded with mmap, THP automatically promotes 4 KB pages to 2 MB pages, reducing page faults during inference.
Memory Locking: `mlock` prevents model weights from being swapped. llama.cpp's `--mlock` flag enables this. Critical for systems running other workloads alongside inference.
mmap vs direct load: On Strix Halo with ROCm, `--mmp 0` (disable mmap) is recommended. mmap causes catastrophically slow model loading when GPU offloading is active because of the double-copy path through page cache.
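One practical wrinkle: `--mlock` only succeeds if RLIMIT_MEMLOCK is large enough for the model. A sketch (the model path is a placeholder; the limits configuration is shown as comments):

```shell
# Check the current locked-memory limit (kilobytes, or "unlimited")
ulimit -l
# Raise it persistently via /etc/security/limits.conf, or for a systemd
# service set LimitMEMLOCK=infinity; then launch with locking, no mmap:
llama-server -m model.gguf --mlock --no-mmap -ngl 99
```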
6.2 Process Pinning and NUMA
Strix Halo is a single-die APU, so NUMA topology is simple (typically 1 NUMA node). However, CPU core affinity still matters:
```shell
# Pin inference to specific cores, keeping others free for OS
numactl --physcpubind=0-15 llama-server [args]
# Or via taskset
taskset -c 0-15 llama-server [args]
```
Core isolation: For minimum-jitter inference:
```
# Add to kernel cmdline
isolcpus=0-15 nohz_full=0-15 rcu_nocbs=0-15
```
This prevents the OS from scheduling unrelated tasks on inference cores.
6.3 CPU Frequency and Power
```shell
# Set performance governor for consistent throughput
sudo cpupower frequency-set -g performance
# Verify
cpupower frequency-info | grep "current CPU frequency"
```
6.4 cgroups v2 for Resource Isolation
Reserve memory and CPU for inference workloads:
```shell
# Create inference cgroup
sudo mkdir /sys/fs/cgroup/inference
# Enable controllers for child cgroups: written to the PARENT's
# subtree_control, and cpuset must be enabled before using cpuset.cpus
echo "+memory +cpu +cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
# Reserve 56 GB for inference (leave 8 GB for system)
echo $((56 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/inference/memory.min
# Pin CPU cores
echo "0-15" | sudo tee /sys/fs/cgroup/inference/cpuset.cpus
# Run inference in the cgroup
sudo cgexec -g memory,cpu:inference llama-server [args]
```
6.5 io_uring for Model Loading
io_uring provides zero-copy, kernel-bypassing I/O that can accelerate initial model loading. While llama.cpp does not natively use io_uring, the underlying mmap/read path can benefit from io_uring-based file I/O when loading from NVMe:
- Eliminates context switch overhead during model load
- Enables true async I/O with completion ring buffers
- Most benefit when loading very large models (>32 GB) from storage
Practical impact: Minor for Strix Halo since model loading is a one-time cost, and LPDDR5X bandwidth far exceeds NVMe read speeds.
6.6 eBPF-Based Performance Monitoring
eBPF enables zero-instrumentation monitoring of inference workloads:
```shell
# Monitor GPU DRM scheduler jobs (works with amdgpu driver)
sudo bpftrace -e 'tracepoint:drm:drm_sched_job { printf("GPU job: %s\n", args->name); }'
# Track page faults during model loading
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'
# Monitor context switches on inference cores
sudo bpftrace -e 'tracepoint:sched:sched_switch /cpu == 0/ { @[args->next_comm] = count(); }'
```
The eunomia project provides ready-made eBPF programs for AI workload monitoring.
7. Emerging Hardware Features
7.1 AMD XDNA NPU
The Ryzen AI MAX+ 395 includes an XDNA 2 NPU rated at 50 TOPS. Current status for LLM inference:
- Software stack: AMD Ryzen AI Software supports ONNX model execution on the NPU. AMD Quark provides quantization for NPU deployment (SmoothQuant, GPTQ, Quarot).
- LLM capability: The NPU can accelerate small models and specific operations (attention heads, small expert networks) but cannot run full large LLMs.
- Linux support: Kernel 7.1 (expected 2026) brings significant XDNA upstreaming. Current Linux support is limited compared to Windows.
- Practical use: The NPU could potentially handle a speculative decoding draft model while the GPU runs the main model. This is not yet implemented in any inference engine.
Status: Not viable for LLM inference in March 2026. Watch for Linux kernel 7.1 and llama.cpp NPU backend development.
7.2 RDNA 3.5 Matrix Cores (WMMA)
The Radeon 8060S (gfx1151) has the same WMMA instruction set as RDNA 3 (gfx11xx), which is a generation behind RDNA 4 (gfx12xx):
RDNA 3 / 3.5 (gfx1151) WMMA capabilities:
- FP16/BF16: 512 FLOPS/clock/CU
- INT8: 1024 OPS/clock/CU
- 16x16 matrix dimensions
- Requires inter-lane data shuffling for chained operations
RDNA 4 (gfx12xx) improvements over RDNA 3.5:
- FP16/BF16: 1024 FLOPS/clock/CU (2x)
- INT8: 2048 OPS/clock/CU (2x)
- New FP8/BF8 formats at 4x the FP16 rate
- 4:2 structured sparsity support (effectively 2x more)
- No inter-lane shuffling needed for chained WMMA (major efficiency gain)
- New efficient matrix load instruction
Current usage in llama.cpp: WMMA is used for Flash Attention
(GGML_HIP_ROCWMMA_FATTN) and matrix-multiply quantized (MMQ) kernels. The
ROCm 7.x regression for gfx1151 (issue #17917) specifically affects MMQ kernels.
7.3 Vulkan Cooperative Matrices
The VK_KHR_cooperative_matrix Vulkan extension was merged into the RADV driver
for RDNA 3+ hardware. This provides a portable API for matrix operations that maps
to WMMA hardware:
- Enables inference engines to use matrix cores through Vulkan instead of vendor-specific ROCm/HIP APIs
- llama.cpp's Vulkan backend could leverage this for WMMA-accelerated matrix operations
- Currently less optimized than native HIP/ROCm paths
Status: Available in Mesa 25.x. Watch for llama.cpp Vulkan backend improvements using cooperative matrices.
7.4 Infinity Cache for Inference
Strix Halo has a 32 MB Infinity Cache (MALL -- Memory Attached Last Level):
- Architecture: L1 (256 KB/shader array) -> L2 (2 MB) -> Infinity Cache (32 MB) -> LPDDR5X
- Latency: Slightly higher than discrete GPU Infinity Cache implementations
- Hit rate: Varies by workload. Graphics benchmarks show ~73% hit rate at peak.
- LLM inference implications: For a 7B Q4 model (~3.5 GB), per-layer weights
are ~70-140 MB, far exceeding the 32 MB cache. Benefit is limited to:
- KV cache for current context (fits well for shorter contexts)
- Activations and intermediate results
- Embedding layer (often accessed repeatedly)
- Small models/layers that fit entirely in cache
The Infinity Cache is most impactful as a bandwidth amplifier -- when inference accesses exhibit temporal locality (same data accessed multiple times within a short window), effective bandwidth exceeds the 256 GB/s DRAM limit.
8. Model-Level Optimizations
8.1 Prompt Compression
- LLMLingua / LLMLingua-2 (Microsoft): Compresses input prompts by removing low-information tokens. 20x compression with 1.5 point performance drop. 1.7-5.7x end-to-end inference speedup. LLMLingua-2 is 3-6x faster than v1. Integrated into LangChain and LlamaIndex.
- 500xCompressor: Compresses contexts into a single special token. 6x-480x compression. Adds only 0.25% parameters. More aggressive but less mature.
Relevance: High for RAG and agentic workloads where prompts are long. Reduces both prefill time and KV cache memory.
8.2 Speculative Decoding (Model-Level)
Beyond the engine-level implementation described in Section 5.3:
- Self-speculative decoding: Model drafts its own tokens using early exit from lower layers. No separate draft model needed.
- EAGLE-3: Autoregressive head on target model internals. Higher acceptance rates than separate draft models.
- Draft model latency > accuracy: Research shows that draft model speed matters more than its language modeling accuracy for overall throughput.
8.3 Mixture of Depths / Mixture of Recursions
- Mixture of Depths (MoD): Dynamically allocates compute to tokens that need it. 2-3x inference speedup with minimal quality degradation. Implemented at training time -- requires model architecture support.
- Mixture of Recursions (MoR) (NeurIPS 2025): Combines parameter sharing with adaptive token-level compute. Lightweight routers assign different recursion depths to individual tokens. 2x inference throughput with reduced KV cache sizes.
Relevance: These are model architecture choices, not runtime optimizations. Watch for models trained with MoD/MoR architectures.
8.4 Structured Pruning
Post-training methods to permanently remove model components:
- Width pruning: Remove neurons, attention heads, or embedding channels. Better accuracy retention than depth pruning.
- Depth pruning: Remove entire layers. More latency reduction per parameter removed.
- LLM-Pruner, SliceGPT, FLAP: State-of-the-art structured pruning methods.
- AMP: Jointly prunes attention heads and MLP neurons.
- NIRVANA (2025): Structured pruning reimagined for LLM compression.
Practical approach: Structured pruning requires per-model effort and is generally less practical than quantization for local inference. Exception: if a specific model is too slow at a given quantization level, pruning the model first and then quantizing can yield a better speed/quality trade-off.
8.5 Token Merging and Pruning
- TokenSelect (EMNLP 2025): Dynamic token-level KV cache selection for efficient long-context inference and length extrapolation.
- LightThinker: Step-by-step compression of chain-of-thought reasoning.
- Attention sparsity: Twilight (NeurIPS 2025) uses hierarchical top-p pruning for adaptive attention sparsity.
These techniques reduce the effective sequence length during inference, directly reducing both compute and memory bandwidth requirements.
9. Prioritized Recommendations for Strix Halo
Tier 1: Implement Now (High Impact, Available Today)
- Use Unsloth Dynamic 2.0 GGUFs for all models: They provide the best quality-per-bit through intelligent layer-wise quantization.
- Build llama.cpp with WMMA Flash Attention: `-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`. Monitor issue #17917 for the MMQ regression fix.
- Disable mmap for ROCm: Always use `--mmp 0` / `--no-mmap` to avoid the double-copy performance penalty.
- Enable KV cache quantization: Use `--cache-type-k q4_0 --cache-type-v q4_0` for long-context workloads. Watch for TurboQuant integration.
- Set `ROCBLAS_USE_HIPBLASLT=1`: Forces the optimized hipBLASLt kernels.
- Speculative decoding for decode-heavy workloads: Use `--model-draft` with a small model from the same family.
- GPU performance governor and frequency pinning: Ensures consistent throughput.
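Putting the Tier 1 items together, a single launch might look like the following sketch. Model paths are placeholders, and flag spellings should be checked against the installed build's `--help`.

```shell
# Consolidated Tier 1 launch: hipBLASLt kernels, no mmap, quantized KV
# cache, and a same-family draft model for speculative decoding.
export ROCBLAS_USE_HIPBLASLT=1
llama-server \
  -m  model-Q4_K_M.gguf \
  -md draft-model-Q4_K_M.gguf \
  --no-mmap \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -ngl 99 -c 32768
```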
Tier 2: Evaluate (Moderate Impact, Some Setup Required)
- LLMLingua-2 for agentic/RAG workloads: Compress long prompts before inference. 3-6x prompt processing speedup.
- vLLM for multi-user serving: If running concurrent inference requests (e.g., agentic eval pipelines), vLLM's continuous batching and PagedAttention provide 10-20x throughput improvement.
- cgroups v2 memory reservation: Prevent the OS from reclaiming GPU-mapped memory under memory pressure.
- Vulkan backend for short-context workloads: Test whether the Vulkan/RADV path is faster than ROCm for your specific model and context length.
- Process pinning with `numactl` or `taskset` for reduced scheduling jitter.
Tier 3: Watch and Prepare (High Potential, Not Ready)
- KTransformers for >64 GB models: When running DeepSeek V3 or similar models that exceed available memory.
- ExLlamaV3/EXL3 AMD support: If AMD kernels arrive, EXL3's QTIP-based quantization could significantly improve quality at extreme compression.
- XDNA NPU for draft model acceleration: If/when llama.cpp adds NPU backend support, the NPU could run the draft model for speculative decoding.
- SageAttention AMD port: 2-5x attention speedup through quantized attention.
- Linear attention models: Watch for hybrid softmax/linear attention models from major labs that would dramatically improve long-context inference.
- Cooperative matrices in Vulkan: As llama.cpp's Vulkan backend matures, this provides a portable path to WMMA acceleration without a ROCm dependency.
10. Sources
Papers and Conference Proceedings
- Raposo et al., "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models," 2024. https://arxiv.org/abs/2404.02258
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," ICML 2023. https://arxiv.org/abs/2305.13245
- Tseng et al., "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks," ICML 2024. https://arxiv.org/abs/2402.04396
- Egiazarian et al., "AQLM: Extreme Compression of Large Language Models via Additive Quantization," ICLR 2025. https://arxiv.org/abs/2401.06118
- Chen et al., "KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models," SOSP 2025. https://dl.acm.org/doi/10.1145/3731569.3764843
- Min et al., "Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation," NeurIPS 2025. https://arxiv.org/abs/2507.10524
- Varadarajan et al., "Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures," 2025. https://arxiv.org/abs/2504.11750
- Zandieh et al., "TurboQuant: Extreme KV Cache Quantization," ICLR 2026. https://github.com/ggml-org/llama.cpp/discussions/20969
- Agrawal et al., "SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration," ICLR 2025. https://arxiv.org/abs/2410.02367
- Ye et al., "FlashInfer: Efficient and Customizable Attention Engine for LLM Serving," 2025. https://arxiv.org/abs/2501.01005
- Jiang et al., "LLMLingua: Compressing Prompts for Accelerated Inference," EMNLP 2023. https://arxiv.org/abs/2310.05736
- Li et al., "A Survey on Inference Optimization Techniques for Mixture of Experts Models," 2024. https://arxiv.org/abs/2412.14219
- Liu et al., "MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching," 2025. https://arxiv.org/abs/2511.14102
- Zhou et al., "SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE Inference," 2025. https://arxiv.org/abs/2510.10302
- He et al., "SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints," 2025. https://arxiv.org/abs/2512.12990
- Jin et al., "OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference," 2025. https://arxiv.org/abs/2512.03927
Documentation and Technical References
- AMD ROCm Strix Halo System Optimization: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/strixhalo.html
- AMD GPUOpen -- Using Matrix Cores of RDNA 4: https://gpuopen.com/learn/using_matrix_core_amd_rdna4/
- AMD GPUOpen -- Accelerating Generative AI on Radeon GPUs: https://gpuopen.com/learn/accelerating_generative_ai_on_amd_radeon_gpus/
- vLLM ROCm Blog: https://blog.vllm.ai/2026/02/27/rocm-attention-backend.html
- AMD ROCm vLLM Blog: https://rocm.blogs.amd.com/software-tools-optimization/vllm-omni/README.html
- AMD AI Inference on Ryzen AI NPU with Quark: https://www.amd.com/en/developer/resources/technical-articles/2025/ai-inference-acceleration-on-ryzen-ai-with-quark.html
- Chips and Cheese -- Evaluating Infinity Cache in Strix Halo: https://chipsandcheese.com/p/evaluating-the-infinity-cache-in
- Chips and Cheese -- RDNA 4 Architecture at Hot Chips 2025: https://chipsandcheese.com/p/amds-rdna4-gpu-architecture-at-hot
- Linux Kernel XDNA NPU Documentation: https://docs.kernel.org/accel/amdxdna/amdnpu.html
Community Resources and Guides
- llama.cpp ROCm Performance Discussion: https://github.com/ggml-org/llama.cpp/discussions/15021
- llama.cpp Strix Halo UMA Detection Bug: https://github.com/ggml-org/llama.cpp/issues/18159
- llama.cpp Strix Halo Performance Regression: https://github.com/ggml-org/llama.cpp/issues/17917
- Strix Halo Wiki -- llama.cpp with ROCm: https://strixhalo.wiki/AI/llamacpp-with-ROCm
- Strix Halo Wiki -- Performance: https://strixhalo.wiki/AI/llamacpp-performance
- AMD Strix Halo Toolboxes: https://github.com/kyuz0/amd-strix-halo-toolboxes
- LLM Tracker -- AMD GPUs: https://llm-tracker.info/howto/AMD-GPUs
- LLM Tracker -- Strix Halo: https://llm-tracker.info/_TOORG/Strix-Halo
- Unsloth Dynamic 2.0 Documentation: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
- Unsloth Dynamic v2.0 Blog: https://unsloth.ai/blog/dynamic-v2
- KTransformers GitHub: https://github.com/kvcache-ai/ktransformers
- ExLlamaV3 GitHub: https://github.com/turboderp-org/exllamav3
- BitNet GitHub: https://github.com/microsoft/BitNet
- LLMLingua GitHub: https://github.com/microsoft/LLMLingua
- MoE Inference Awesome List: https://github.com/MoE-Inf/awesome-moe-inference
- Awesome LLM Inference: https://github.com/xlite-dev/Awesome-LLM-Inference
- Phoronix -- ROCm 7.1 vs Vulkan on AI PRO R9700: https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan
- eunomia -- OS-Level LLM Inference Optimizations: https://eunomia.dev/blog/2025/02/18/os-level-challenges-in-llm-inference-and-optimizations/
- RADV Cooperative Matrix for RDNA4: https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1524861-vulkan-cooperative-matrix-merged-for-rdna4-gpus-with-radv-dcc-support-inches-closer
- Kaitchup -- GGUF Quant Selection: https://kaitchup.substack.com/p/choosing-a-gguf-model-k-quants-i