llama.cpp Runtime and Compilation Optimization for AMD RDNA 3.5 (gfx1151)
Comprehensive research into maximizing inference performance on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified LPDDR5x-8000). Researched March 2026.
Scope
This document covers every known compilation flag, runtime parameter, environment variable, and architectural optimization for llama.cpp targeting gfx1151 (RDNA 3.5) with both ROCm/HIP and Vulkan backends on Fedora. It does not cover vLLM, ollama internals, or non-llama.cpp inference engines except where their findings inform llama.cpp optimization.
Table of Contents
- Compilation Flags and Build Optimizations
- Runtime Flags and Environment Variables
- Flash Attention and Attention Backends
- Quantization Strategies for Speed
- Memory Layout and Caching
- llama-server Specific Optimizations
- Upcoming llama.cpp Features (2026)
- Recommended Configurations
- Sources
1. Compilation Flags and Build Optimizations
1.1 GGML_HIP (ROCm) vs GGML_VULKAN: Which Backend to Build
Both backends are worth building. Neither is universally faster on gfx1151:
| Workload | Winner | Rationale |
|---|---|---|
| Token generation (short ctx) | Vulkan RADV | Lower driver overhead, mature kernel paths |
| Token generation (long ctx, 8K+) | ROCm + rocWMMA + FA | Maintains speed as context grows; uses less memory |
| Prompt processing (short ctx) | Mixed -- model-dependent | AMDVLK or ROCm hipBLASLt win on some shapes |
| Prompt processing (long ctx) | ROCm + rocWMMA-tuned | 96% speedup over untuned rocWMMA at 65K ctx |
| Memory efficiency at long ctx | ROCm + FA | Less memory than Vulkan equivalent |
Benchmark data (Qwen3-30B-A3B UD-Q4_K_XL, gfx1151, flash attention on):
| Backend | pp512 t/s | tg128 t/s | pp512@130K t/s | tg128@130K t/s |
|---|---|---|---|---|
| Vulkan RADV | 755.14 | 85.11 | 17.24 | 12.54 |
| Vulkan AMDVLK | 741.60 | 81.79 | 10.75 | 3.51 |
| ROCm hipBLASLt | 651.93 | 63.95 | 40.35 | 4.97 |
| ROCm rocWMMA-tuned | 659.07 | 67.66 | 51.12 | 13.33 |
Key insight: RADV scales significantly better than AMDVLK for long contexts (3.6x faster tg at 130K depth). ROCm with tuned rocWMMA provides the best long-context prompt processing (3x faster than RADV at 130K).
1.2 ROCm Build Flags
Minimal build:
```shell
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```
Optimized build (recommended):
```shell
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```
Critical ROCm build flags:
| Flag | Effect | Recommendation |
|---|---|---|
| `-DGGML_HIP=ON` | Enable HIP/ROCm backend | Required |
| `-DAMDGPU_TARGETS="gfx1151"` | Target Strix Halo GPU | Required -- do not use gfx1100 |
| `-DGGML_HIP_ROCWMMA_FATTN=ON` | Enable rocWMMA flash attention | Strongly recommended for pp |
| `-DGGML_HIP_GRAPHS=ON` | HIP graph kernel scheduling | Test -- may help reduce launch overhead |
| `-DGGML_HIP_NO_VMM=OFF` | Re-enable Virtual Memory Management | Default is disabled; test if needed |
Flags to be aware of but NOT set by default:
| Flag | Notes |
|---|---|
| `-DGGML_HIP_UMA=ON` | Uses hipMallocManaged for UMA. Avoid on Strix Halo -- it uses fine-grained memory that is significantly slower. Standard hipMalloc + GTT expansion via kernel params is faster. |
| `-DGGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` | Forces FP16 compute in hipBLAS. Documented for RDNA4 -- may help pp performance on gfx1151. Test before deploying. |
| `-DGGML_CUDA_FA_ALL_QUANTS=ON` | Compiles all KV cache quant type combinations for FA. Works for CUDA kernels, which HIP reuses via hipify. Increases compilation time substantially. Enable if you need quantized KV cache with flash attention. |
ROCm version considerations:
- ROCm 7.2: Known rocWMMA compilation issue (ambiguous template specializations in `mfma_impl.hpp`). Fixed in later point releases or by disabling rocWMMA.
- ROCm 7.0 RC / 7.1: Generally work well with gfx1151.
- ROCm 6.4.4: Some users report better performance than 7.x for certain workloads. The gfx1151 rocBLAS kernel regression means hipBLASLt is essential.
- Recommendation: Use ROCm 7.2+ with rocWMMA patches, or pre-built toolbox containers that have been validated for gfx1151.
1.3 Vulkan Build Flags
```shell
cmake -B build -S . \
  -DGGML_VULKAN=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```
The Vulkan build auto-detects cooperative matrix support (KHR_coopmat) at runtime.
gfx1151 with RADV reports matrix cores: KHR_coopmat in llama-bench logs.
Vulkan-specific considerations:
| Topic | Detail |
|---|---|
| RADV vs AMDVLK | RADV (Mesa) is recommended for gfx1151. Better long-context scaling, no 2GB buffer allocation limit. |
| AMDVLK buffer limit | AMDVLK caps single Vulkan allocations at ~2 GiB (VkPhysicalDeviceLimits::maxMemoryAllocationSize). RADV allows ~4 GiB. This causes OOM for models with large compute buffers. |
| CoopMat1 vs CoopMat2 | gfx1151 supports KHR_coopmat (CoopMat1). CoopMat2 (VK_NV_cooperative_matrix2) is NVIDIA-only. This means Vulkan flash attention on AMD falls back to CPU -- use ROCm for GPU-accelerated FA. |
| Shader compilation | Building from source with glslc available enables cooperative matrix shader variants. Pre-built binaries may omit them. |
1.4 LTO and PGO
llama.cpp does not have built-in LTO/PGO support in its CMake configuration. You can enable LTO manually:
```shell
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DCMAKE_BUILD_TYPE=Release
```
Expected benefit: 2-5% improvement in CPU-bound paths. The GPU kernels are compiled by the HIP/ROCm compiler and are not affected by host LTO.
PGO would require a two-pass build (instrument, profile, rebuild) and is not commonly done for llama.cpp. The dominant bottleneck is GPU kernel performance and memory bandwidth, not host-side code paths.
1.5 Compiler Tuning for ROCm
A known LLVM regression affects loop unrolling on RDNA. The following flag has been reported to help:
```shell
-DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
```
This increases the unrolling threshold for local memory operations, which can improve kernel performance for flash attention and matrix multiplication.
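Combining sections 1.2 and 1.5, a full optimized ROCm configure step might look like the sketch below. The unroll-threshold flag is a workaround worth A/B-benchmarking on your toolchain, not a guaranteed win:

```shell
# Optimized ROCm build: rocWMMA flash attention plus the RDNA loop-unrolling
# workaround. Benchmark with and without the -mllvm flag before adopting it.
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```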
2. Runtime Flags and Environment Variables
2.1 ROCm Environment Variables
| Variable | Priority | Effect |
|---|---|---|
| `ROCBLAS_USE_HIPBLASLT=1` | Critical | Switches from rocBLAS Tensile kernels to hipBLASLt. On gfx1151, default rocBLAS achieves only 5.76 TFLOPS (<9% efficiency); hipBLASLt achieves >60% efficiency. This is a 2-7x improvement for prompt processing. |
| `HSA_OVERRIDE_GFX_VERSION=11.5.1` | Set inside toolbox containers | Required for ROCm to recognize gfx1151. Set in the container, not by host scripts. |
| `HSA_ENABLE_SDMA=0` | Optional | Disables the SDMA engine. May help on some configurations, but generally not needed on Strix Halo with recent kernels. |
| `HIP_VISIBLE_DEVICES=0` | Optional | Select a specific GPU device. Useful in multi-GPU or container setups. |
| `GPU_MAX_HEAP_SIZE=100` | Optional | Allow 100% of GPU memory for the heap. The default may be lower. |
| `GPU_MAX_ALLOC_PERCENT=100` | Optional | Allow a single allocation up to 100% of GPU memory. |
| `ROCR_VISIBLE_DEVICES=0` | Optional | HSA-level device visibility control. |
| `AMD_LOG_LEVEL=0` | Optional | Suppress AMD driver logging noise. |
The single most impactful environment variable is `ROCBLAS_USE_HIPBLASLT=1`.
Without it, ROCm pp512 on Llama-2-7B drops from 882 t/s to 348 t/s (roughly 2.5x slower).
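A quick way to confirm the impact on your own system is an A/B run of llama-bench with and without the variable (a sketch; the model path is illustrative):

```shell
# A/B the hipBLASLt switch; expect a large pp512 delta on gfx1151.
# -n 0 skips token-generation runs so only prompt processing is measured.
for v in 0 1; do
  echo "ROCBLAS_USE_HIPBLASLT=$v"
  ROCBLAS_USE_HIPBLASLT=$v ./build/bin/llama-bench \
    -m ~/models/llama-2-7b.Q4_K_M.gguf \
    -ngl 99 -mmp 0 -fa 1 -p 512 -n 0 -r 3
done
```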
2.2 Vulkan Environment Variables
| Variable | Priority | Effect |
|---|---|---|
| `AMD_VULKAN_ICD=RADV` | Recommended | Force the RADV driver (skip AMDVLK). |
| `RADV_PERFTEST=nogttspill` | Important | Fixes GTT memory spilling issues on RADV. Can resolve significant pp performance drops (especially with FA off). |
| `GGML_VK_VISIBLE_DEVICES=0` | Optional | Select the Vulkan device index. |
| `GGML_VULKAN_DISABLE_F16=1` | Debugging | Force FP32 compute. Slower, but useful for debugging precision issues. |
| `GGML_LOG_LEVEL=2` | Debugging | Verbose logging to verify coopmat detection. |
2.3 Thread Count (-t flag)
For GPU-dominant inference (all layers offloaded), the thread count has minimal impact on throughput. The recommendation:
- Single-user inference: `-t 4` to `-t 8` (enough for tokenization/sampling overhead)
- Server with parallel slots: `-t` equal to the physical core count (12 Zen 5 cores on the Ryzen AI MAX+ 395)
- Hybrid CPU+GPU (partial offload): `-t` equal to the number of physical cores
The Ryzen AI MAX+ 395 has 16 cores (12 Zen 5 + 4 Zen 5c). For llama.cpp, using
all 12 big cores (-t 12) is optimal for CPU-involved workloads.
2.4 Batch Size Tuning (-b and -ub)
| Flag | Default | Role |
|---|---|---|
| `-b` / `--batch-size` | 2048 | Logical batch size (application level) |
| `-ub` / `--ubatch-size` | 512 | Physical batch size (device level) |
Tuning guidance for gfx1151:
- MoE models: `-b 256` significantly improves pp512 (a reported 70% improvement on Qwen3-30B-A3B)
- Dense models: the default `-b 2048` is generally fine
- Long context: `-ub 2048` can improve performance, but test against OOM
- Ultra-long context: reduce `-ub` if memory allocation fails
The Vulkan backend blog post for Strix Halo recommends: -c 32768 -b 4096 -ub 256
for a good balance of performance and memory.
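Finding the best logical batch size for a particular model is easiest with a small sweep (a sketch; the model path is illustrative):

```shell
# Sweep -b and compare pp512 rows; MoE models often peak well below the
# default of 2048, dense models usually near it.
for b in 256 512 1024 2048 4096; do
  ./build/bin/llama-bench -m ~/models/qwen3-30b-a3b.gguf \
    -ngl 99 -mmp 0 -fa 1 -b "$b" -p 512 -n 0 -r 3
done
```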
2.5 Memory-Mapped Loading (-mmp / --no-mmap)
Critical finding for unified memory APUs:
When you load large models to the GPU, memory mapping makes loading moderately slower for Vulkan and catastrophically slower for ROCm. Always set `--mmap 0` or `--no-mmap` to improve model loading times on Strix Halo.
For llama-bench, use -mmp 0. For llama-server/llama-cli, use --no-mmap.
On Strix Halo, both "GPU memory" and "CPU memory" share the same physical LPDDR5x. The difference is which pages are mapped for GPU access. GPU-mapped pages have full bandwidth (~215 GB/s). CPU-accessed pages get approximately half (~84 GB/s for CPU-to-GPU copies).
Always use -ngl 99 (or higher) to ensure all layers are on GPU memory.
Even on a unified memory system, GPU memory paths provide 2x the bandwidth.
2.6 GPU Layer Offloading (-ngl)
For Strix Halo with 64GB unified memory:
- Models < 50 GB: `-ngl 99` offloads everything; no tuning needed.
- Models 50-60 GB: `-ngl 99` should still work with GTT expanded via kernel params.
- Models > 60 GB: may need partial offload. Use `-ngl <N>`, where N is tuned to keep GPU memory under the GTT limit; remaining layers run on CPU at ~1/2 bandwidth.
Never let GPU spill to system RAM paths -- performance will be worse than pure CPU.
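For a first guess at `-ngl` before fine-tuning, a crude proportional estimate works (a sketch; it ignores per-layer size differences and leaves no headroom for the KV cache, so round down from the result):

```shell
# Rough starting point for -ngl on a model too big to fully offload:
# layers_that_fit = floor(gpu_budget / model_size * n_layers), capped at n_layers
ngl_guess() { awk -v size="$1" -v budget="$2" -v layers="$3" \
  'BEGIN { n = int(budget / size * layers); print (n > layers ? layers : n) }'; }
ngl_guess 70 60 80   # 70 GB model, 60 GB GTT budget, 80 layers -> 68
```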
3. Flash Attention and Attention Backends
3.1 When to Enable Flash Attention
Rule of thumb for gfx1151:
| Backend | Flash Attention | Recommendation |
|---|---|---|
| ROCm + rocWMMA | `-fa 1` | Always enable. 24% pp improvement, maintains tg speed, uses less memory. |
| ROCm without rocWMMA | `-fa 1` | Enable, but the improvement is smaller. |
| Vulkan RADV | `-fa 1` | Enable for short contexts; minor improvement at pp512/tg128. At long contexts, Vulkan FA may degrade performance. |
| Vulkan AMDVLK | `-fa 1` | Similar to RADV. |
Key caveat: Vulkan flash attention on AMD uses CoopMat1 (KHR_coopmat), not the more efficient CoopMat2 (NVIDIA-only). For AMD, ROCm + rocWMMA is the superior FA path.
3.2 rocWMMA Flash Attention Performance
Benchmark on gfx1151 (Llama2-7B Q4_K_M):
| Configuration | pp512 t/s | tg128 t/s |
|---|---|---|
| HIP standard | 592.28 | 40.40 |
| HIP + hipBLASLt | 548.72 | 40.43 |
| HIP + rocWMMA + hipBLASLt | 1006.80 | 39.46 |
| HIP + rocWMMA (no hipBLASLt) | 899.73 | 39.45 |
rocWMMA provides ~70% improvement in prompt processing with flash attention. Token generation is slightly slower (~2%) due to WMMA overhead at small batch.
3.3 The rocWMMA Long-Context Regression and Fix
The standard rocWMMA implementation has a long-context decode regression: at 65K context, tg degrades by up to 57% compared to HIP-only baseline.
The fix (PR #16827, "rocm-wmma-tune" branch) implements:
- `__launch_bounds__(256, 2)`: ensures a minimum of 2 blocks per SM, improving occupancy
- Adaptive KQ stride: uses stride 128 when head dimension <= 128, reducing the LDS footprint
- Selective WMMA usage: WMMA only for prefill; decode reverts to VEC/TILE kernels
Results after fix (Llama 3.2 1B Q4_K_M on gfx1151):
- pp512 at 65K context: 96% faster than untuned rocWMMA
- tg128 at 65K context: Matches HIP baseline (previously 57% degraded)
Status: This patch is available in -rocwmma-improved toolbox builds. It may
not be merged into upstream llama.cpp. Check Donato Capitella's toolboxes.
3.4 Vulkan Flash Attention Limitations on AMD
The Vulkan backend supports three FA paths:
| Path | Extension | AMD Support |
|---|---|---|
| FA_SCALAR | None | Yes (CPU fallback) |
| FA_COOPMAT1 | KHR_cooperative_matrix | Yes (gfx1151 reports support) |
| FA_COOPMAT2 | NV_cooperative_matrix2 | No (NVIDIA-only) |
FA_COOPMAT1 supports: f16, q4_0, q8_0, f32 KV cache types. FA_COOPMAT2 additionally supports all quant types.
When Vulkan FA is enabled on AMD with RADV, it uses CoopMat1 for matrix operations. This provides a modest improvement over scalar FA but is significantly less efficient than ROCm + rocWMMA.
3.5 New Attention Models (GatedDeltaNet)
Models using GatedDeltaNet architecture (Qwen3.5-27B, Qwen3.5-35B-A3B) have severe performance problems on gfx1151:
- Vulkan: No GATED_DELTA_NET compute shader exists; ops fall back to CPU
- ROCm/HIP: Kernel cross-compiles but suffers from register spilling (float s[S_v] allocates up to 512 bytes per thread) and hipMemcpyWithStream bottleneck (92-95% of decode time on models >15GB)
Result: Qwen3.5-27B runs at ~12 t/s on gfx1151 vs expected 50-80 t/s. Avoid GatedDeltaNet models on gfx1151 until kernel optimization lands.
4. Quantization Strategies for Speed
4.1 Quantization Speed on RDNA 3.5
Token generation speed is dominated by memory bandwidth, not compute. Smaller quantizations are faster because they reduce bytes-per-weight, allowing more tokens per second within the ~215 GB/s bandwidth envelope.
Approximate throughput formula for decode (bandwidth-bound):

```
tg_tokens/s ≈ effective_bandwidth_GB/s / model_size_GB
```

For a 7B Q4_K_M model (~4.1 GB):

```
215 / 4.1 ≈ 52 t/s (theoretical max; practical ~50 t/s on gfx1151)
```
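The same estimate as a reusable one-liner, handy for sanity-checking any bandwidth/model-size pairing before benchmarking:

```shell
# Bandwidth-bound decode ceiling: bandwidth (GB/s) divided by model size (GB)
tg_ceiling() { awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'; }
tg_ceiling 215 4.1   # 7B Q4_K_M on Strix Halo -> 52.4 t/s theoretical max
```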
4.2 Quantization Type Comparison
| Quant | Bits/Weight | Quality | Speed (relative) | Notes |
|---|---|---|---|---|
| Q4_0 | 4.0 | Low | Fastest | Legacy. Simple dequant. |
| Q4_K_M | 4.83 | Good | Very fast | K-quant with hierarchical blocks. Recommended default. |
| IQ4_XS | 4.25 | Good | Fast | Importance-weighted. Better quality/bit than Q4_K_M. |
| Q5_K_M | 5.69 | Very good | Fast | Sweet spot for quality-sensitive use. |
| Q6_K | 6.56 | Excellent | Moderate | Near-lossless quality. |
| Q8_0 | 8.0 | Near-perfect | Slower | 2x the bytes of Q4_K_M, ~2x slower tg. |
| F16 | 16.0 | Perfect | Slowest | Reference baseline. |
For RDNA 3.5 specifically:
- Q4_K_M is the best general-purpose quantization. The K-quant family uses hierarchical super-blocks (256 values) with per-sub-block scales, providing better quality than Q4_0 at a marginally higher dequant cost that is invisible at the GPU level.
- Q4_0 has the simplest dequant kernels and is marginally faster than Q4_K_M on some GPU backends. However, the quality loss is significant. Use it only for smoke tests or when every t/s matters more than quality.
- IQ4_XS (importance-matrix quantized) offers better quality per bit than Q4_K_M. Speed is similar. It requires an importance matrix file during quantization. Recommended over Q4_K_M when you control the quantization process.
- Q8_0 does NOT have special hardware-accelerated dequant on RDNA 3.5, which lacks INT8 tensor core equivalents. Q8_0 relies on the same FP16 compute paths, just with more memory bandwidth consumed.
4.3 Importance Matrix (imatrix) Quantization
imatrix quantization records how much each weight affects output quality, then allocates more precision bits to important weights. This is essential for sub-4-bit quantizations (IQ2_XS, IQ3_XXS, IQ4_XS) where standard K-quant shows measurable degradation.
```shell
# Generate importance matrix (GPU-accelerated)
llama-imatrix -m model-f16.gguf -f calibration_data.txt -ngl 99 -o imatrix.dat

# Quantize with imatrix
llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4_xs.gguf IQ4_XS
```
Speed impact: None. imatrix affects quantization quality, not inference speed. The dequantization kernels are identical regardless of whether imatrix was used.
4.4 Unsloth Dynamic (UD) Quantizations
Unsloth Dynamic 2.0 selectively quantizes different layers at different bit widths, choosing the optimal quantization per layer based on sensitivity analysis.
Speed impact: Minimal to none. UD quants use the same dequant kernels as standard GGUF quantizations. A UD-Q4_K_XL file runs at the same speed as a standard Q4_K_M of the same total size.
Quality impact: Significantly better. UD consistently outperforms standard quantizations in 5-shot MMLU and KL divergence metrics at the same total file size.
Recommendation: Prefer UD quants (e.g., UD-Q4_K_XL, UD-Q4_K_M) from
Unsloth when available. They are a free quality upgrade with no speed penalty.
5. Memory Layout and Caching
5.1 KV Cache Quantization
KV cache quantization reduces the memory footprint of the attention cache, allowing larger context windows within the same memory budget.
| Cache Type | Memory vs F16 | Quality Impact | Recommendation |
|---|---|---|---|
| f16 (default) | 1.0x | None | Baseline |
| q8_0 | 0.5x | Negligible (+0.002-0.05 ppl) | Recommended for production |
| q4_0 | 0.33x | Noticeable (+0.2-0.25 ppl) | Use when memory-constrained |
| q4_1 | 0.33x | Slightly better than q4_0 | Alternative to q4_0 |
| iq4_nl | 0.33x | Better than q4_0 | Best 4-bit KV option |
Usage:
```shell
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 ...
# or for llama-bench:
llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 ...
```
Performance impact: Quantizing K cache slightly improves throughput (less memory to read). Quantizing V cache may have a slight negative impact. Overall performance impact is negligible for normal inference.
Caveat with speculative decoding: Using KV cache quantization with a draft model causes a consistent ~16% performance drop. q4_0 KV with speculative decoding causes massive acceptance rate drops. Avoid KV quant if using speculative decoding.
5.2 mmap vs Full Load on Unified Memory
On Strix Halo's unified memory architecture:
- `--no-mmap` is strongly recommended for both ROCm and Vulkan.
- With mmap enabled, ROCm model loading is "catastrophically slower."
- Vulkan loading is "moderately slower" with mmap.
- Since CPU and GPU share physical RAM, there is no data copy when loading to "GPU memory" -- it is just a page table update.
For llama-bench: Always use -mmp 0.
For llama-server/llama-cli: Always use --no-mmap.
5.3 Prompt Caching
llama-server supports two levels of prompt caching:
1. Automatic KV cache reuse (`cache_prompt: true`): reuses the KV cache from previous requests when prompts share a common prefix. The server only reprocesses the suffix that differs.
2. Host-memory prompt caching (`--cram N`): stores pre-computed prompt representations in system RAM.
   - Reduces TTFT from ~4.2 s to ~0.3 s for cached requests (93% reduction)
   - +6% token throughput (34 vs 32 t/s)
   - Memory formula: `num_prefixes * avg_prefix_tokens * 8 bytes`
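As a worked example of the memory formula: four distinct 8K-token prefixes cost only about a quarter of a MiB of host RAM, so `--cram` budgets can stay small:

```shell
# Host prompt-cache footprint: num_prefixes * avg_prefix_tokens * 8 bytes
cram_bytes() { awk -v n="$1" -v t="$2" 'BEGIN { print n * t * 8 }'; }
cram_bytes 4 8192   # 4 prefixes of 8192 tokens -> 262144 bytes (256 KiB)
```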
Configuration:
```shell
llama-server -m model.gguf \
  --cram 256 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  -fa \
  -ngl 99
# --cram 256:            256 MB of host RAM for the prompt cache
# --cache-type-k/v q8_0: quantized KV cache
```
Best for:
- System prompts > 5K tokens
- Multi-user chatbots with shared context
- Agentic use with repeated tool-call prefixes
5.4 UMA Detection Bug (Issue #18159)
llama.cpp's UMA detection (from PR #17368, designed for NVIDIA DGX Spark)
incorrectly activates on AMD APUs when prop.integrated=1. It reads
/proc/meminfo instead of hipMemGetInfo(), severely underreporting available
GPU memory (e.g., reporting 27GB instead of 96GB).
Workarounds:
- Build without `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
- Guard UMA detection with `!defined(GGML_USE_HIP)` (upstream fix pending)
- Use toolbox containers where this has been patched
5.5 KV Cache Placement on ROCm (Issue #18011)
On Strix Halo, the ROCm backend may dump KV cache into shared (CPU-accessible) memory instead of GPU-mapped memory, causing performance degradation at high context sizes. This is a known issue contributing to ROCm falling behind Vulkan for tg at high contexts.
Mitigation: Use the rocWMMA-tuned branch which maintains better memory placement, or use Vulkan RADV for workloads where this matters.
6. llama-server Specific Optimizations
6.1 Recommended Server Configuration
```shell
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --no-mmap \
  -fa \
  -c 32768 \
  -np 4 \
  -b 2048 \
  -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --jinja
# -c 32768:        total context, shared across all slots
# -np 4:           4 parallel slots (adjust for your use)
# -b 2048 -ub 512: logical / physical batch sizes
# --cont-batching: enabled by default; listed for clarity
# --jinja:         enable Jinja2 chat templates
```
6.2 Parallel Slot Configuration (-np)
| Use Case | Slots | Context per Slot | Total -c |
|---|---|---|---|
| Single user chat | 1 | 32768 | 32768 |
| Agentic coding (Claude Code style) | 2-4 | 8192-16384 | 32768-65536 |
| Multi-user API | 4-8 | 4096-8192 | 32768-65536 |
| Eval harness | 1 | 32768+ | 32768+ |
Memory formula: Each slot requires context_size * 2 * hidden_dim * n_layers * bytes_per_kv_element.
With q8_0 KV cache, this is roughly halved compared to f16.
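A worked example of the formula (a sketch: it uses `hidden_dim` directly, as the formula above states; models with GQA have a smaller effective KV width, so treat the result as an upper bound):

```shell
# Per-slot KV cache size: context * 2 * hidden_dim * n_layers * bytes_per_element
# bytes_per_element: 2 for f16, roughly 1 for q8_0
kv_gib() { awk -v c="$1" -v h="$2" -v l="$3" -v b="$4" \
  'BEGIN { printf "%.2f\n", c * 2 * h * l * b / 2^30 }'; }
kv_gib 8192 4096 32 2   # 8K ctx, 7B-class dims, f16 -> 4.00 GiB per slot
```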
6.3 Continuous Batching
Enabled by default (--cont-batching). Allows the server to process multiple
requests simultaneously, interleaving prefill and decode operations.
For agentic workloads: One slot typically holds a large system prompt + conversation context, while additional slots handle parallel tool calls. Configure with:
-np 4 -c 131072 # 4 slots, up to 32K context each
6.4 Prompt Caching for Agentic Use
For agentic coding tools that send the same system prompt repeatedly:
- Use `cache_prompt: true` in API requests (reuses the KV cache prefix)
- Use `--system-prompt-file system.txt` for static system prompts (note: may be removed in recent versions; verify with your build)
- Use `--cram 128` to enable host-memory caching for prefix deduplication
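A minimal sketch of opting in from a client, using the llama-server `/completion` endpoint's `cache_prompt` field (the prompt text and host/port are illustrative):

```shell
# Request KV-cache prefix reuse for a repeated system prompt; on the second
# call with the same prefix, only the differing suffix is reprocessed.
curl -s http://localhost:8080/completion -d '{
  "prompt": "You are a careful coding assistant.\n\nRefactor this function: ...",
  "n_predict": 128,
  "cache_prompt": true
}'
```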
6.5 Speculative Decoding
For token generation speedup with a draft model:
```shell
llama-server -m main-model.gguf \
  --model-draft draft-model.gguf \
  -ngl 99 \
  --draft-max 8 \
  --draft-min 1 \
  --no-mmap \
  -fa
```
Caveat: Do NOT combine speculative decoding with KV cache quantization. The 16% performance drop and reduced acceptance rate negate the benefits.
7. Upcoming llama.cpp Features (2026)
7.1 Backend-Agnostic Tensor Parallelism (PR #19378)
Merged January 2026. Adds --split-mode tensor for splitting computation across
multiple GPUs via a new "meta" backend.
Relevance to Strix Halo: Limited. Single integrated GPU. However, for RPC configurations with multiple Strix Halo nodes (Jeff Geerling's Beowulf cluster), tensor parallelism could complement the existing layer-split approach.
Currently supports 1-2 GPUs with equal data split. --tensor-split has no effect yet.
7.2 TurboQuant KV Cache Compression (ICLR 2026)
Google's TurboQuant (Zandieh et al.) achieves 3-bit KV cache quantization with no training and negligible quality loss:
| Format | MSE vs FP16 | Compression |
|---|---|---|
| TQ3 (3-bit) | 0.034 | 4.9x |
| TQ4 (4-bit) | 0.009 | 3.8x |
Timeline: Open-source llama.cpp integration expected Q2-Q3 2026. A 6-phase integration plan exists covering GGML type registration, KV cache paths, FA integration, and CLI flags.
7.3 Vulkan Improvements
Active 2025-2026 developments:
- Mesa RADV optimizations for RDNA4 AI workloads (Rhys Perry/Valve patches)
- 13% pp improvement from CU mode optimization for LDS utilization
- BFloat16 Vulkan support (`VK_KHR_shader_bfloat16`) maturing in Mesa 25.x
- Partial offloading performance improvement for AMD (llama.cpp b8185, March 2026)
7.4 Flash Attention for Head Dimension 512
Pull request from March 2026 adds FA support for HD=512 in CUDA kernels. This benefits models with larger head dimensions (some newer architectures). The HIP path should inherit this via hipify.
7.5 ik_llama.cpp Fork Innovations
The ik_llama.cpp fork by ikawrakow introduces:
- Row-interleaved quant packing (better memory access patterns)
- Smart Expert Reduction for faster MoE inference
- Tensor overrides with regex patterns for hybrid GPU/CPU placement
- FlashMLA for DeepSeek models
Caveat: ik_llama.cpp only fully supports CPU and CUDA backends. ROCm/Vulkan are not maintained. Not recommended for AMD gfx1151.
8. Recommended Configurations
8.1 For llama-bench (Benchmarking)
ROCm backend:
```shell
ROCBLAS_USE_HIPBLASLT=1 \
toolbox run -c llama-rocm-7.2 -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```
Vulkan backend:
```shell
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
toolbox run -c llama-vulkan -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```
8.2 For llama-server (Production Agentic Use)
ROCm (best for long context):
```shell
ROCBLAS_USE_HIPBLASLT=1 \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 65536 -np 4 \
  -b 2048 -ub 512 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --cram 256 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```
Vulkan RADV (best for single-user tg):
```shell
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 32768 -np 2 \
  -b 4096 -ub 256 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```
8.3 Decision Matrix
| Question | Answer |
|---|---|
| Which backend for benchmarking? | Both. ROCm and Vulkan have different strengths. |
| Which backend for daily chat? | Vulkan RADV for best tg speed. |
| Which backend for long-context agentic? | ROCm + rocWMMA-tuned for context resilience. |
| Which quantization? | Q4_K_M or UD-Q4_K_XL for speed; Q5_K_M for quality. |
| Enable flash attention? | Yes, always on ROCm. Yes on Vulkan for short contexts. |
| Use `--no-mmap`? | Always. |
| Set `ROCBLAS_USE_HIPBLASLT=1`? | Always for ROCm. |
| Set `RADV_PERFTEST=nogttspill`? | Always for Vulkan RADV. |
| KV cache quantization? | q8_0 for both K and V unless using speculative decoding. |
| Batch size for MoE? | -b 256 (lower than default improves some MoE models). |
9. Sources
GitHub Issues and Discussions
- Performance of llama.cpp on AMD ROCm (HIP) - Discussion #15021
- Performance of llama.cpp with Vulkan - Discussion #10879
- HIP backend performs poorly on gfx1151 - Issue #13565
- UMA detection incorrectly limits memory on AMD APUs - Issue #18159
- ROCm model loading dumps KV cache to shared memory - Issue #18011
- GATED_DELTA_NET underperformance on gfx1151 - Issue #20354
- Under-Performance of ROCm 7.2 binaries - Issue #19984
- ROCm 7.2 + rocWMMA compilation - Issue #19269
- Building for gfx1151 - Issue #14734
- AMDVLK 2GB buffer allocation limit - Issue #15054
- Mastering Host-Memory Prompt Caching - Discussion #20574
- TurboQuant Extreme KV Cache Quantization - Discussion #20969
- Backend-agnostic tensor parallelism - PR #19378
- Massively Improved rocWMMA Performance - PR #16827
- rocWMMA for gfx1151 performance boost - lemonade-sdk Issue #7
- Increase llama.cpp performance on AI Max 395+ - geerlingguy Issue #5
Wiki and Community Resources
- Strix Halo Wiki - llama.cpp Performance
- Strix Halo Wiki - llama.cpp with ROCm
- AMD Strix Halo Backend Benchmarks (Grid View)
- LLM Tracker - AMD Strix Halo GPU Performance
- Framework Community - Strix Halo GPU LLM Performance Tests
- Framework Community - Toolboxes for LLM inference on Strix Halo
Articles and Blog Posts
- Hardware Corner - Strix Halo LLM Optimization
- Hardware Corner - RADV Vulkan Driver 13% Improvement
- Phoronix - AMD ROCm 7.1 vs RADV Vulkan
- Phoronix - Valve Developer RADV Improvement
- Yifei's Blog - Strix Halo Matrix Cores with llama.cpp
- Strix Halo CUDA/HIP Testing Notes (lhl)
Official Documentation
- ROCm - llama.cpp compatibility
- ROCm - llama.cpp installation
- ROCm Blog - Llama.cpp Meets Instinct
- llama.cpp build documentation
- llama-server README
Papers
- "Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct" (January 2026, arXiv:2601.14277)
- "TurboQuant: Redefining AI efficiency with extreme compression" (Zandieh et al., ICLR 2026)
- Unsloth Dynamic 2.0 GGUFs Documentation
Open Questions / Limitations
- rocWMMA-tuned patch upstream status: PR #16827 may not be fully merged. Monitor for inclusion in mainline llama.cpp, or continue using patched toolboxes.
- ROCm 7.2 stability on gfx1151: multiple reports of crashes (MUT_MAL errors), performance regressions, and compilation issues. ROCm 7.x is maturing but not fully stable for gfx1151 as of March 2026.
- Vulkan CoopMat FA for AMD: will AMD ever get CoopMat2 support? The current CoopMat1 path provides a modest improvement. A native AMD CoopMat2 or equivalent extension would close the gap with ROCm FA.
- KV cache placement on ROCm: Issue #18011 (KV cache dumped to shared memory) reduces ROCm tg performance at high contexts. The root cause appears to be in HIP memory allocation behavior on APUs.
- GGML_HIP_UMA vs kernel-param GTT expansion: the UMA flag uses slow fine-grained memory. GTT expansion via `amdgpu.gttsize` kernel params provides coarse-grained GPU-mapped memory that is much faster. The upstream approach may eventually improve, but kernel params remain the correct method for now.
- GatedDeltaNet architecture support: both Vulkan (missing shader) and ROCm (register pressure, memcpy bottleneck) perform poorly on GDN models. This blocks efficient use of Qwen3.5-27B and similar models.
- TurboQuant integration timeline: expected Q2-Q3 2026 for llama.cpp. Would provide 3-bit KV cache with no quality loss, roughly doubling available context within the same memory budget.
- NPU utilization: the 50 TOPS NPU on Strix Halo is currently unusable from Linux for llama.cpp inference. AMD driver support for the NPU on Linux remains pending.
Overlap Notes
- Kernel parameters (`amdgpu.gttsize`, `ttm.pages_limit`, `iommu=pt`): already documented in the project's `scripts/optimize/kernel-params.sh`. This research covers the llama.cpp side (why they matter for inference).
- BIOS VRAM allocation: reducing dedicated VRAM in BIOS frees more memory for GTT. This is documented in the project's audit scripts, but it is a prerequisite for the optimizations described here.
- Toolbox container builds: the project uses pre-built toolboxes (`llama-rocm-7.2`, `llama-vulkan`). The compilation flags documented here describe what should be baked into those containers.