
llama.cpp Runtime and Compilation Optimization for AMD RDNA 3.5 (gfx1151)

Comprehensive research into maximizing inference performance on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified LPDDR5x-8000). Researched March 2026.


Scope

This document covers every known compilation flag, runtime parameter, environment variable, and architectural optimization for llama.cpp targeting gfx1151 (RDNA 3.5) with both ROCm/HIP and Vulkan backends on Fedora. It does not cover vLLM, ollama internals, or non-llama.cpp inference engines except where their findings inform llama.cpp optimization.


Table of Contents

  1. Compilation Flags and Build Optimizations
  2. Runtime Flags and Environment Variables
  3. Flash Attention and Attention Backends
  4. Quantization Strategies for Speed
  5. Memory Layout and Caching
  6. llama-server Specific Optimizations
  7. Upcoming llama.cpp Features (2026)
  8. Recommended Configurations
  9. Sources

1. Compilation Flags and Build Optimizations

1.1 GGML_HIP (ROCm) vs GGML_VULKAN: Which Backend to Build

Both backends are worth building. Neither is universally faster on gfx1151:

| Workload | Winner | Rationale |
|---|---|---|
| Token generation (short ctx) | Vulkan RADV | Lower driver overhead, mature kernel paths |
| Token generation (long ctx, 8K+) | ROCm + rocWMMA + FA | Maintains speed as context grows; uses less memory |
| Prompt processing (short ctx) | Mixed -- model-dependent | AMDVLK or ROCm hipBLASLt win on some shapes |
| Prompt processing (long ctx) | ROCm + rocWMMA-tuned | 96% speedup over untuned rocWMMA at 65K ctx |
| Memory efficiency at long ctx | ROCm + FA | Less memory than Vulkan equivalent |

Benchmark data (Qwen3-30B-A3B UD-Q4_K_XL, gfx1151, flash attention on):

| Backend | pp512 t/s | tg128 t/s | pp512@130K t/s | tg128@130K t/s |
|---|---|---|---|---|
| Vulkan RADV | 755.14 | 85.11 | 17.24 | 12.54 |
| Vulkan AMDVLK | 741.60 | 81.79 | 10.75 | 3.51 |
| ROCm hipBLASLt | 651.93 | 63.95 | 40.35 | 4.97 |
| ROCm rocWMMA-tuned | 659.07 | 67.66 | 51.12 | 13.33 |

Key insight: RADV scales significantly better than AMDVLK for long contexts (3.6x faster tg at 130K depth). ROCm with tuned rocWMMA provides the best long-context prompt processing (3x faster than RADV at 130K).

1.2 ROCm Build Flags

Minimal build:

cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

Optimized build (recommended):

cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

Critical ROCm build flags:

| Flag | Effect | Recommendation |
|---|---|---|
| -DGGML_HIP=ON | Enable HIP/ROCm backend | Required |
| -DAMDGPU_TARGETS="gfx1151" | Target Strix Halo GPU | Required -- do not use gfx1100 |
| -DGGML_HIP_ROCWMMA_FATTN=ON | Enable rocWMMA flash attention | Strongly recommended for pp |
| -DGGML_HIP_GRAPHS=ON | HIP graph kernel scheduling | Test -- may help reduce launch overhead |
| -DGGML_HIP_NO_VMM=OFF | Re-enable Virtual Memory Management | Default is disabled; test if needed |

Flags to be aware of but NOT set by default:

| Flag | Notes |
|---|---|
| -DGGML_HIP_UMA=ON | Uses hipMallocManaged for UMA. Avoid on Strix Halo -- it uses fine-grained memory that is significantly slower. Standard hipMalloc plus GTT expansion via kernel params is faster. |
| -DGGML_CUDA_FORCE_CUBLAS_COMPUTE_16F | Forces FP16 compute in hipBLAS. Documented for RDNA4 -- may help pp performance on gfx1151. Test before deploying. |
| -DGGML_CUDA_FA_ALL_QUANTS=ON | Compiles all KV cache quant type combinations for FA. Works for CUDA kernels, which HIP reuses via hipify. Substantially increases compilation time. Enable if you need quantized KV cache with flash attention. |

ROCm version considerations:

  • ROCm 7.2: Known rocWMMA compilation issue (ambiguous template specializations in mfma_impl.hpp). Fixed in later point releases or by disabling rocWMMA.
  • ROCm 7.0 RC / 7.1: Generally work well with gfx1151.
  • ROCm 6.4.4: Some users report better performance than 7.x for certain workloads. The gfx1151 rocBLAS kernel regression means hipBLASLt is essential.
  • Recommendation: Use ROCm 7.2+ with rocWMMA patches, or pre-built toolbox containers that have been validated for gfx1151.

1.3 Vulkan Build Flags

cmake -B build -S . \
  -DGGML_VULKAN=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

The Vulkan build auto-detects cooperative matrix support (KHR_coopmat) at runtime. gfx1151 with RADV reports matrix cores: KHR_coopmat in llama-bench logs.

Vulkan-specific considerations:

| Topic | Detail |
|---|---|
| RADV vs AMDVLK | RADV (Mesa) is recommended for gfx1151: better long-context scaling and no 2 GiB buffer allocation limit. |
| AMDVLK buffer limit | AMDVLK caps single Vulkan allocations at ~2 GiB (VkPhysicalDeviceLimits::maxMemoryAllocationSize); RADV allows ~4 GiB. This causes OOM for models with large compute buffers. |
| CoopMat1 vs CoopMat2 | gfx1151 supports KHR_coopmat (CoopMat1). CoopMat2 (VK_NV_cooperative_matrix2) is NVIDIA-only, so Vulkan flash attention on AMD is limited to the less efficient CoopMat1 path -- use ROCm for the fastest GPU-accelerated FA. |
| Shader compilation | Building from source with glslc available enables cooperative matrix shader variants. Pre-built binaries may omit them. |
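To confirm which ICD is active and whether cooperative matrix support is exposed, a quick check with vulkaninfo works (assumes vulkan-tools is installed; exact output format varies by Mesa version):

```shell
# Confirm RADV is the active driver and a cooperative-matrix extension is exposed.
AMD_VULKAN_ICD=RADV vulkaninfo --summary | grep -i driverName
vulkaninfo 2>/dev/null | grep -i cooperative_matrix
```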

1.4 LTO and PGO

llama.cpp does not have built-in LTO/PGO support in its CMake configuration. You can enable LTO manually:

cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DCMAKE_BUILD_TYPE=Release

Expected benefit: 2-5% improvement in CPU-bound paths. The GPU kernels are compiled by the HIP/ROCm compiler and are not affected by host LTO.

PGO would require a two-pass build (instrument, profile, rebuild) and is not commonly done for llama.cpp. The dominant bottleneck is GPU kernel performance and memory bandwidth, not host-side code paths.

1.5 Compiler Tuning for ROCm

A known LLVM regression affects loop unrolling on RDNA. The following flag has been reported to help:

-DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"

This increases the unrolling threshold for local memory operations, which can improve kernel performance for flash attention and matrix multiplication.


2. Runtime Flags and Environment Variables

2.1 ROCm Environment Variables

| Variable | Priority | Effect |
|---|---|---|
| ROCBLAS_USE_HIPBLASLT=1 | Critical | Switches from rocBLAS tensile kernels to hipBLASLt. On gfx1151, default rocBLAS achieves only 5.76 TFLOPS (<9% efficiency); hipBLASLt achieves >60% efficiency. This is a 2-7x improvement for prompt processing. |
| HSA_OVERRIDE_GFX_VERSION=11.5.1 | Container-only | Required for ROCm to recognize gfx1151. Set inside toolbox containers, not by host scripts. |
| HSA_ENABLE_SDMA=0 | Optional | Disables the SDMA engine. May help on some configurations, but generally not needed on Strix Halo with recent kernels. |
| HIP_VISIBLE_DEVICES=0 | Optional | Select a specific GPU device. Useful in multi-GPU or container setups. |
| GPU_MAX_HEAP_SIZE=100 | Optional | Allow 100% of GPU memory for the heap. The default may be lower. |
| GPU_MAX_ALLOC_PERCENT=100 | Optional | Allow a single allocation up to 100% of GPU memory. |
| ROCR_VISIBLE_DEVICES=0 | Optional | HSA-level device visibility control. |
| AMD_LOG_LEVEL=0 | Optional | Suppress AMD driver logging noise. |

The single most impactful environment variable is ROCBLAS_USE_HIPBLASLT=1. Without it, ROCm pp512 on Llama-2-7B drops from 882 t/s to 348 t/s (4x slower).
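Collected as a profile, the variables above look like this (values mirror the table; HSA_OVERRIDE_GFX_VERSION belongs inside the toolbox container, per the table):

```shell
# ROCm environment profile for gfx1151, from the table above.
export ROCBLAS_USE_HIPBLASLT=1      # critical: 2-7x faster prompt processing
export HIP_VISIBLE_DEVICES=0        # single-GPU selection
export GPU_MAX_HEAP_SIZE=100        # allow the full GPU memory for the heap
export GPU_MAX_ALLOC_PERCENT=100    # allow full-size single allocations
# Inside toolbox containers only:
# export HSA_OVERRIDE_GFX_VERSION=11.5.1
```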

2.2 Vulkan Environment Variables

| Variable | Priority | Effect |
|---|---|---|
| AMD_VULKAN_ICD=RADV | Recommended | Force the RADV driver (skip AMDVLK). |
| RADV_PERFTEST=nogttspill | Important | Fixes GTT memory spilling issues on RADV. Can resolve significant pp performance drops (especially with FA off). |
| GGML_VK_VISIBLE_DEVICES=0 | Optional | Select the Vulkan device index. |
| GGML_VULKAN_DISABLE_F16=1 | Debugging | Force FP32 compute. Slower, but useful for debugging precision issues. |
| GGML_LOG_LEVEL=2 | Debugging | Verbose logging to verify coopmat detection. |

2.3 Thread Count (-t flag)

For GPU-dominant inference (all layers offloaded), thread count has minimal impact on throughput. The recommendation:

  • Single-user inference: -t 4 to -t 8 (enough for tokenization/sampling overhead)
  • Server with parallel slots: -t equal to the physical core count
  • Hybrid CPU+GPU (partial offload): -t equal to the number of physical cores

The Ryzen AI MAX+ 395 has 16 Zen 5 cores (32 threads); Strix Halo has no compact cores. For CPU-involved workloads, start at -t 16 and benchmark downward (-t 12, -t 8): leaving headroom for the OS and sampling threads sometimes wins.

2.4 Batch Size Tuning (-b and -ub)

| Flag | Default | Role |
|---|---|---|
| -b / --batch-size | 2048 | Logical batch size (application level) |
| -ub / --ubatch-size | 512 | Physical batch size (device level) |

Tuning guidance for gfx1151:

  • MoE models: -b 256 significantly improves pp512 (reported 70% improvement on Qwen3-30B-A3B)
  • Dense models: Default -b 2048 is generally fine
  • Long context: -ub 2048 can improve performance, but test against OOM
  • Ultra-long context: Reduce -ub if memory allocation fails

The Vulkan backend blog post for Strix Halo recommends: -c 32768 -b 4096 -ub 256 for a good balance of performance and memory.
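Because the best values are model- and workload-dependent, a short sweep with llama-bench settles it empirically (binary path and model file are placeholders):

```shell
# Sweep logical/physical batch sizes; pick the combination with the
# best pp512 result. Paths are placeholders -- adjust for your build.
for b in 256 512 1024 2048; do
  for ub in 256 512; do
    ./llama-bench -m model.gguf -ngl 99 -mmp 0 -fa 1 \
      -b "$b" -ub "$ub" -p 512 -n 128 -r 3
  done
done
```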

2.5 Memory-Mapped Loading (-mmp / --no-mmap)

Critical finding for unified memory APUs:

When loading large models to the GPU, memory mapping makes loading moderately slower under Vulkan and catastrophically slower under ROCm. Always disable it on Strix Halo: use -mmp 0 for llama-bench and --no-mmap for llama-server/llama-cli.

On Strix Halo, both "GPU memory" and "CPU memory" share the same physical LPDDR5x. The difference is which pages are mapped for GPU access. GPU-mapped pages have full bandwidth (~215 GB/s). CPU-accessed pages get approximately half (~84 GB/s for CPU-to-GPU copies).

Always use -ngl 99 (or higher) to ensure all layers are on GPU memory. Even on a unified memory system, GPU memory paths provide 2x the bandwidth.

2.6 GPU Layer Offloading (-ngl)

For Strix Halo with 64GB unified memory:

  • Models < 50GB: -ngl 99 offloads everything. No tuning needed.
  • Models 50-60GB: -ngl 99 should still work with GTT expanded via kernel params.
  • Models > 60GB: May need partial offload. Use -ngl <N> where N is tuned to keep GPU memory under the GTT limit. Remaining layers run on CPU at ~1/2 bandwidth.

Never let GPU spill to system RAM paths -- performance will be worse than pure CPU.


3. Flash Attention and Attention Backends

3.1 When to Enable Flash Attention

Rule of thumb for gfx1151:

| Backend | Flash Attention | Recommendation |
|---|---|---|
| ROCm + rocWMMA | -fa 1 | Always enable. 24% pp improvement, maintains tg speed, uses less memory. |
| ROCm without rocWMMA | -fa 1 | Enable, but the improvement is smaller. |
| Vulkan RADV | -fa 1 | Enable for short contexts. Minor improvement at pp512/tg128; at long contexts, Vulkan FA may degrade performance. |
| Vulkan AMDVLK | -fa 1 | Similar to RADV. |

Key caveat: Vulkan flash attention on AMD uses CoopMat1 (KHR_coopmat), not the more efficient CoopMat2 (NVIDIA-only). For AMD, ROCm + rocWMMA is the superior FA path.

3.2 rocWMMA Flash Attention Performance

Benchmark on gfx1151 (Llama2-7B Q4_K_M):

| Configuration | pp512 t/s | tg128 t/s |
|---|---|---|
| HIP standard | 592.28 | 40.40 |
| HIP + hipBLASLt | 548.72 | 40.43 |
| HIP + rocWMMA + hipBLASLt | 1006.80 | 39.46 |
| HIP + rocWMMA (no hipBLASLt) | 899.73 | 39.45 |

rocWMMA provides ~70% improvement in prompt processing with flash attention. Token generation is slightly slower (~2%) due to WMMA overhead at small batch.

3.3 The rocWMMA Long-Context Regression and Fix

The standard rocWMMA implementation has a long-context decode regression: at 65K context, tg degrades by up to 57% compared to HIP-only baseline.

The fix (PR #16827, "rocm-wmma-tune" branch) implements:

  1. __launch_bounds__(256, 2): Ensures minimum 2 blocks per SM, improving occupancy
  2. Adaptive KQ stride: Uses stride 128 when head dimension <= 128, reducing LDS footprint
  3. Selective WMMA usage: WMMA only for prefill; decode reverts to VEC/TILE kernels

Results after fix (Llama 3.2 1B Q4_K_M on gfx1151):

  • pp512 at 65K context: 96% faster than untuned rocWMMA
  • tg128 at 65K context: Matches HIP baseline (previously 57% degraded)

Status: This patch is available in -rocwmma-improved toolbox builds. It may not be merged into upstream llama.cpp. Check Donato Capitella's toolboxes.

3.4 Vulkan Flash Attention Limitations on AMD

The Vulkan backend supports three FA paths:

| Path | Extension | AMD Support |
|---|---|---|
| FA_SCALAR | None | Yes (scalar fallback path) |
| FA_COOPMAT1 | KHR_cooperative_matrix | Yes (gfx1151 reports support) |
| FA_COOPMAT2 | NV_cooperative_matrix2 | No (NVIDIA-only) |

FA_COOPMAT1 supports: f16, q4_0, q8_0, f32 KV cache types. FA_COOPMAT2 additionally supports all quant types.

When Vulkan FA is enabled on AMD with RADV, it uses CoopMat1 for matrix operations. This provides a modest improvement over scalar FA but is significantly less efficient than ROCm + rocWMMA.

3.5 New Attention Models (GatedDeltaNet)

Models using GatedDeltaNet architecture (Qwen3.5-27B, Qwen3.5-35B-A3B) have severe performance problems on gfx1151:

  • Vulkan: No GATED_DELTA_NET compute shader exists; ops fall back to CPU
  • ROCm/HIP: Kernel cross-compiles but suffers from register spilling (float s[S_v] allocates up to 512 bytes per thread) and hipMemcpyWithStream bottleneck (92-95% of decode time on models >15GB)

Result: Qwen3.5-27B runs at ~12 t/s on gfx1151 vs expected 50-80 t/s. Avoid GatedDeltaNet models on gfx1151 until kernel optimization lands.


4. Quantization Strategies for Speed

4.1 Quantization Speed on RDNA 3.5

Token generation speed is dominated by memory bandwidth, not compute. Smaller quantizations are faster because they reduce bytes-per-weight, allowing more tokens per second within the ~215 GB/s bandwidth envelope.

Approximate throughput formula for decode (bandwidth-bound):

tg_tokens/s ≈ effective_bandwidth_GB/s / model_size_GB

For a 7B Q4_K_M model (~4.1 GB):

215 / 4.1 ≈ 52 t/s  (theoretical max; practical ~50 t/s on gfx1151)
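The estimate generalizes to any model size. A small helper (illustrative only, not a llama.cpp tool) applies the same formula:

```shell
# Bandwidth-bound decode estimate: tokens/s = bandwidth / model size.
# 215 GB/s is the Strix Halo effective-bandwidth figure used above.
est_tg() {
  awk -v bw=215 -v gb="$1" 'BEGIN { printf "%.1f\n", bw / gb }'
}
est_tg 4.1    # 7B Q4_K_M (~4.1 GB)  -> 52.4
est_tg 17.5   # ~30B Q4 (~17.5 GB)   -> 12.3
```

Real numbers land below the estimate because attention and KV-cache reads consume bandwidth too.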

4.2 Quantization Type Comparison

| Quant | Bits/Weight | Quality | Speed (relative) | Notes |
|---|---|---|---|---|
| Q4_0 | 4.0 | Low | Fastest | Legacy. Simple dequant. |
| Q4_K_M | 4.83 | Good | Very fast | K-quant with hierarchical blocks. Recommended default. |
| IQ4_XS | 4.25 | Good | Fast | Importance-weighted. Better quality/bit than Q4_K_M. |
| Q5_K_M | 5.69 | Very good | Fast | Sweet spot for quality-sensitive use. |
| Q6_K | 6.56 | Excellent | Moderate | Near-lossless quality. |
| Q8_0 | 8.0 | Near-perfect | Slower | 2x the bytes of Q4_K_M, ~2x slower tg. |
| F16 | 16.0 | Perfect | Slowest | Reference baseline. |

For RDNA 3.5 specifically:

  • Q4_K_M is the best general-purpose quantization. The K-quant family uses hierarchical super-blocks (256 values) with per-sub-block scales, providing better quality than Q4_0 at marginally higher dequant cost that is invisible at the GPU level.

  • Q4_0 has the simplest dequant kernels and is marginally faster than Q4_K_M on some GPU backends. However, the quality loss is significant. Use only for smoke tests or when every t/s matters more than quality.

  • IQ4_XS (importance-matrix quantized) offers better quality per bit than Q4_K_M. Speed is similar. Requires an importance matrix file during quantization. Recommended over Q4_K_M when you control the quantization process.

  • Q8_0 does NOT have special hardware-accelerated dequant on RDNA 3.5. RDNA 3.5 lacks INT8 tensor core equivalents. Q8_0 performance relies on the same FP16 compute paths, just with more memory bandwidth consumed.
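To make the bits-per-weight column concrete, a small helper (an illustration, not a llama.cpp tool) converts it to approximate weight size; metadata and embedding overhead are ignored:

```shell
# Approximate GGUF weight size in GB from parameter count (in billions)
# and bits per weight: size = params * bpw / 8.
model_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.2f\n", p * bpw / 8 }'
}
model_gb 7 4.83   # Q4_K_M -> 4.23
model_gb 7 6.56   # Q6_K   -> 5.74
model_gb 7 8.0    # Q8_0   -> 7.00
```

Combined with the bandwidth formula in 4.1, this is enough to predict relative tg speed between quants of the same model.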

4.3 Importance Matrix (imatrix) Quantization

imatrix quantization records how much each weight affects output quality, then allocates more precision bits to important weights. This is essential for sub-4-bit quantizations (IQ2_XS, IQ3_XXS, IQ4_XS) where standard K-quant shows measurable degradation.

# Generate importance matrix (GPU-accelerated)
llama-imatrix -m model-f16.gguf -f calibration_data.txt -ngl 99 -o imatrix.dat

# Quantize with imatrix
llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4_xs.gguf IQ4_XS

Speed impact: None. imatrix affects quantization quality, not inference speed. The dequantization kernels are identical regardless of whether imatrix was used.

4.4 Unsloth Dynamic (UD) Quantizations

Unsloth Dynamic 2.0 selectively quantizes different layers at different bit widths, choosing the optimal quantization per layer based on sensitivity analysis.

Speed impact: Minimal to none. UD quants use the same dequant kernels as standard GGUF quantizations. A UD-Q4_K_XL file runs at the same speed as a standard Q4_K_M of the same total size.

Quality impact: Significantly better. UD consistently outperforms standard quantizations in 5-shot MMLU and KL divergence metrics at the same total file size.

Recommendation: Prefer UD quants (e.g., UD-Q4_K_XL, UD-Q4_K_M) from Unsloth when available. They are a free quality upgrade with no speed penalty.


5. Memory Layout and Caching

5.1 KV Cache Quantization

KV cache quantization reduces the memory footprint of the attention cache, allowing larger context windows within the same memory budget.

| Cache Type | Memory vs F16 | Quality Impact | Recommendation |
|---|---|---|---|
| f16 (default) | 1.0x | None | Baseline |
| q8_0 | 0.5x | Negligible (+0.002-0.05 ppl) | Recommended for production |
| q4_0 | 0.33x | Noticeable (+0.2-0.25 ppl) | Use when memory-constrained |
| q4_1 | 0.33x | Slightly better than q4_0 | Alternative to q4_0 |
| iq4_nl | 0.33x | Better than q4_0 | Best 4-bit KV option |

Usage:

llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 ...
# llama-bench equivalent:
llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 ...

Performance impact: Quantizing K cache slightly improves throughput (less memory to read). Quantizing V cache may have a slight negative impact. Overall performance impact is negligible for normal inference.

Caveat with speculative decoding: Using KV cache quantization with a draft model causes a consistent ~16% performance drop. q4_0 KV with speculative decoding causes massive acceptance rate drops. Avoid KV quant if using speculative decoding.

5.2 mmap vs Full Load on Unified Memory

On Strix Halo's unified memory architecture:

  • --no-mmap is strongly recommended for both ROCm and Vulkan.
  • With mmap enabled, ROCm model loading is "catastrophically slower."
  • Vulkan loading is "moderately slower" with mmap.
  • Since CPU and GPU share physical RAM, there is no data copy when loading to "GPU memory" -- it is just a page table update.

For llama-bench: Always use -mmp 0. For llama-server/llama-cli: Always use --no-mmap.

5.3 Prompt Caching

llama-server supports two levels of prompt caching:

1. Automatic KV cache reuse (cache_prompt: true): Reuses KV cache from previous requests when prompts share a common prefix. The server only reprocesses the suffix that differs.

2. Host-memory prompt caching (--cram N): Stores pre-computed prompt representations in system RAM.

  • Reduces TTFT from ~4.2s to ~0.3s for cached requests (93% reduction)
  • +6% token throughput (34 vs 32 t/s)
  • Memory formula: num_prefixes * avg_prefix_tokens * 8 bytes

Configuration:

# --cram 256: 256 MB of host RAM for the prompt cache
# --cache-type-k/-v q8_0: quantized KV cache
llama-server -m model.gguf \
  --cram 256 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  -fa \
  -ngl 99

Best for:

  • System prompts > 5K tokens
  • Multi-user chatbots with shared context
  • Agentic use with repeated tool-call prefixes

5.4 UMA Detection Bug (Issue #18159)

llama.cpp's UMA detection (from PR #17368, designed for NVIDIA DGX Spark) incorrectly activates on AMD APUs when prop.integrated=1. It reads /proc/meminfo instead of hipMemGetInfo(), severely underreporting available GPU memory (e.g., reporting 27GB instead of 96GB).

Workarounds:

  • Build without GGML_CUDA_ENABLE_UNIFIED_MEMORY
  • Guard UMA detection with !defined(GGML_USE_HIP) (upstream fix pending)
  • Use toolbox containers where this has been patched

5.5 KV Cache Placement on ROCm (Issue #18011)

On Strix Halo, the ROCm backend may dump KV cache into shared (CPU-accessible) memory instead of GPU-mapped memory, causing performance degradation at high context sizes. This is a known issue contributing to ROCm falling behind Vulkan for tg at high contexts.

Mitigation: Use the rocWMMA-tuned branch which maintains better memory placement, or use Vulkan RADV for workloads where this matters.


6. llama-server Specific Optimizations

6.1 Baseline Server Configuration

# -c 32768: total context shared across all slots
# -np 4: parallel slots (adjust for your use)
# -b 2048 / -ub 512: logical / physical batch sizes
# --cont-batching is enabled by default; --jinja enables the Jinja2 chat template
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --no-mmap \
  -fa \
  -c 32768 \
  -np 4 \
  -b 2048 \
  -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --jinja

6.2 Parallel Slot Configuration (-np)

| Use Case | Slots | Context per Slot | Total -c |
|---|---|---|---|
| Single-user chat | 1 | 32768 | 32768 |
| Agentic coding (Claude Code style) | 2-4 | 8192-16384 | 32768-65536 |
| Multi-user API | 4-8 | 4096-8192 | 32768-65536 |
| Eval harness | 1 | 32768+ | 32768+ |

Memory formula: Each slot requires context_size * 2 * hidden_dim * n_layers * bytes_per_kv_element. With q8_0 KV cache, this is roughly halved compared to f16.
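The formula above can be sketched as a sizing helper (hypothetical, for illustration; it ignores grouped-query attention, which shrinks the KV dimension on many modern models):

```shell
# Per-slot KV cache size per the formula above (the factor 2 covers K and V).
# Args: context_size hidden_dim n_layers bytes_per_element
kv_bytes() {
  echo $(( $1 * 2 * $2 * $3 * $4 ))
}
kv_bytes 8192 4096 32 2   # f16 (2 B/elem): 4294967296 bytes = 4 GiB per slot
kv_bytes 8192 4096 32 1   # q8_0 (~1 B/elem): 2147483648 bytes, about half
```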

6.3 Continuous Batching

Enabled by default (--cont-batching). Allows the server to process multiple requests simultaneously, interleaving prefill and decode operations.

For agentic workloads: One slot typically holds a large system prompt + conversation context, while additional slots handle parallel tool calls. Configure with:

-np 4 -c 131072  # 4 slots, up to 32K context each

6.4 Prompt Caching for Agentic Use

For agentic coding tools that send the same system prompt repeatedly:

  1. Use cache_prompt: true in API requests (reuses KV cache prefix)
  2. Use --system-prompt-file system.txt for static system prompts (note: may be removed in recent versions; verify with your build)
  3. Use --cram 128 to enable host-memory caching for prefix deduplication
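A request using KV-cache prefix reuse against llama-server's native /completion endpoint might look like this (host and port follow the server examples in this document):

```shell
# Repeated requests sharing this prompt prefix only reprocess the suffix.
curl -s http://localhost:8080/completion -d '{
  "prompt": "You are a careful coding assistant. ...",
  "n_predict": 64,
  "cache_prompt": true
}'
```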

6.5 Speculative Decoding

For token generation speedup with a draft model:

llama-server -m main-model.gguf \
  --model-draft draft-model.gguf \
  -ngl 99 \
  --draft-max 8 \
  --draft-min 1 \
  --no-mmap \
  -fa

Caveat: Do NOT combine speculative decoding with KV cache quantization. The 16% performance drop and reduced acceptance rate negate the benefits.


7. Upcoming llama.cpp Features (2026)

7.1 Backend-Agnostic Tensor Parallelism (PR #19378)

Merged January 2026. Adds --split-mode tensor for splitting computation across multiple GPUs via a new "meta" backend.

Relevance to Strix Halo: Limited. Single integrated GPU. However, for RPC configurations with multiple Strix Halo nodes (Jeff Geerling's Beowulf cluster), tensor parallelism could complement the existing layer-split approach.

Currently supports 1-2 GPUs with equal data split. --tensor-split has no effect yet.

7.2 TurboQuant KV Cache Compression (ICLR 2026)

Google's TurboQuant (Zandieh et al.) achieves 3-bit KV cache quantization with no training and negligible quality loss:

| Format | MSE vs FP16 | Compression |
|---|---|---|
| TQ3 (3-bit) | 0.034 | 4.9x |
| TQ4 (4-bit) | 0.009 | 3.8x |

Timeline: Open-source llama.cpp integration expected Q2-Q3 2026. A 6-phase integration plan exists covering GGML type registration, KV cache paths, FA integration, and CLI flags.

7.3 Vulkan Improvements

Active 2025-2026 developments:

  • Mesa RADV optimizations for RDNA4 AI workloads (Rhys Perry/Valve patches)
  • 13% pp improvement from CU mode optimization for LDS utilization
  • BFloat16 Vulkan support (VK_KHR_shader_bfloat16) maturing in Mesa 25.x
  • Partial offloading performance improvement for AMD (llama.cpp b8185, March 2026)

7.4 Flash Attention for Head Dimension 512

Pull request from March 2026 adds FA support for HD=512 in CUDA kernels. This benefits models with larger head dimensions (some newer architectures). The HIP path should inherit this via hipify.

7.5 ik_llama.cpp Fork Innovations

The ik_llama.cpp fork by ikawrakow introduces:

  • Row-interleaved quant packing (better memory access patterns)
  • Smart Expert Reduction for faster MoE inference
  • Tensor overrides with regex patterns for hybrid GPU/CPU placement
  • FlashMLA for DeepSeek models

Caveat: ik_llama.cpp only fully supports CPU and CUDA backends. ROCm/Vulkan are not maintained. Not recommended for AMD gfx1151.


8. Recommended Configurations

8.1 For llama-bench (Benchmarking)

ROCm backend:

ROCBLAS_USE_HIPBLASLT=1 \
toolbox run -c llama-rocm-7.2 -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5

Vulkan backend:

AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
toolbox run -c llama-vulkan -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5

8.2 For llama-server (Production Agentic Use)

ROCm (best for long context):

ROCBLAS_USE_HIPBLASLT=1 \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 65536 -np 4 \
  -b 2048 -ub 512 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --cram 256 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080

Vulkan RADV (best for single-user tg):

AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 32768 -np 2 \
  -b 4096 -ub 256 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080

8.3 Decision Matrix

| Question | Answer |
|---|---|
| Which backend for benchmarking? | Both. ROCm and Vulkan have different strengths. |
| Which backend for daily chat? | Vulkan RADV for best tg speed. |
| Which backend for long-context agentic use? | ROCm + rocWMMA-tuned for context resilience. |
| Which quantization? | Q4_K_M or UD-Q4_K_XL for speed; Q5_K_M for quality. |
| Enable flash attention? | Yes, always on ROCm. Yes on Vulkan for short contexts. |
| Use --no-mmap? | Always. |
| Set ROCBLAS_USE_HIPBLASLT=1? | Always for ROCm. |
| Set RADV_PERFTEST=nogttspill? | Always for Vulkan RADV. |
| KV cache quantization? | q8_0 for both K and V, unless using speculative decoding. |
| Batch size for MoE? | -b 256 (lower than default improves some MoE models). |

9. Sources

GitHub Issues and Discussions

Wiki and Community Resources

Articles and Blog Posts

Official Documentation

Papers

  • "Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct" (January 2026, arXiv:2601.14277)
  • "TurboQuant: Redefining AI efficiency with extreme compression" (Zandieh et al., ICLR 2026)
  • Unsloth Dynamic 2.0 GGUFs Documentation

Open Questions / Limitations

  1. rocWMMA-tuned patch upstream status: PR #16827 may not be fully merged. Monitor for inclusion in mainline llama.cpp or continue using patched toolboxes.

  2. ROCm 7.2 stability on gfx1151: Multiple reports of crashes (MUT_MAL errors), performance regressions, and compilation issues. ROCm 7.x is maturing but not fully stable for gfx1151 as of March 2026.

  3. Vulkan CoopMat FA for AMD: Will AMD ever get CoopMat2 support? The current CoopMat1 path provides modest improvement. A native AMD CoopMat2 or equivalent extension would close the gap with ROCm FA.

  4. KV cache placement on ROCm: Issue #18011 (KV cache dumped to shared memory) reduces ROCm tg performance at high contexts. Root cause appears to be in HIP memory allocation behavior on APUs.

  5. GGML_HIP_UMA vs kernel-param GTT expansion: The UMA flag uses slow fine-grained memory. GTT expansion via amdgpu.gttsize kernel params provides coarse-grained GPU-mapped memory that is much faster. The upstream approach may eventually improve, but kernel params remain the correct method for now.

  6. GatedDeltaNet architecture support: Both Vulkan (missing shader) and ROCm (register pressure, memcpy bottleneck) perform poorly on GDN models. This blocks efficient use of Qwen3.5-27B and similar models.

  7. TurboQuant integration timeline: Expected Q2-Q3 2026 for llama.cpp. Would provide 3-bit KV cache with no quality loss, roughly doubling available context within the same memory budget.

  8. NPU utilization: The 50 TOPS NPU on Strix Halo is currently Linux-unusable for llama.cpp inference. AMD driver support for NPU on Linux remains pending.


Overlap Notes

  • Kernel parameters (amdgpu.gttsize, ttm.pages_limit, iommu=pt): Already documented in the project's scripts/optimize/kernel-params.sh. This research covers the llama.cpp side (why they matter for inference).

  • BIOS VRAM allocation: Reducing dedicated VRAM in BIOS frees more memory for GTT. This is documented in the project's audit scripts but is a prerequisite for the optimizations described here.

  • Toolbox container builds: The project uses pre-built toolboxes (llama-rocm-7.2, llama-vulkan). The compilation flags documented here describe what should be baked into those containers.