llama.cpp Runtime and Compilation Optimization for AMD RDNA 3.5 (gfx1151)
Comprehensive research into maximizing inference performance on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified LPDDR5x-8000). Researched March 2026.
Scope
This document covers every known compilation flag, runtime parameter, environment variable, and architectural optimization for llama.cpp targeting gfx1151 (RDNA 3.5) with both ROCm/HIP and Vulkan backends on Fedora. It does not cover vLLM, ollama internals, or non-llama.cpp inference engines except where their findings inform llama.cpp optimization.
Table of Contents
- Compilation Flags and Build Optimizations
- Runtime Flags and Environment Variables
- Flash Attention and Attention Backends
- Quantization Strategies for Speed
- Memory Layout and Caching
- llama-server Specific Optimizations
- Upcoming llama.cpp Features (2026)
- Recommended Configurations
- Sources
1. Compilation Flags and Build Optimizations
1.1 GGML_HIP (ROCm) vs GGML_VULKAN: Which Backend to Build
Both backends are worth building. Neither is universally faster on gfx1151:
| Workload | Winner | Rationale |
|---|---|---|
| Token generation (short ctx) | Vulkan RADV | Lower driver overhead, mature kernel paths |
| Token generation (long ctx, 8K+) | ROCm + rocWMMA + FA | Maintains speed as context grows; uses less memory |
| Prompt processing (short ctx) | Mixed -- model-dependent | AMDVLK or ROCm hipBLASLt win on some shapes |
| Prompt processing (long ctx) | ROCm + rocWMMA-tuned | 96% speedup over untuned rocWMMA at 65K ctx |
| Memory efficiency at long ctx | ROCm + FA | Less memory than Vulkan equivalent |
Benchmark data (Qwen3-30B-A3B UD-Q4_K_XL, gfx1151, flash attention on):
| Backend | pp512 t/s | tg128 t/s | pp512@130K t/s | tg128@130K t/s |
|---|---|---|---|---|
| Vulkan RADV | 755.14 | 85.11 | 17.24 | 12.54 |
| Vulkan AMDVLK | 741.60 | 81.79 | 10.75 | 3.51 |
| ROCm hipBLASLt | 651.93 | 63.95 | 40.35 | 4.97 |
| ROCm rocWMMA-tuned | 659.07 | 67.66 | 51.12 | 13.33 |
Key insight: RADV scales significantly better than AMDVLK for long contexts (3.6x faster tg at 130K depth). ROCm with tuned rocWMMA provides the best long-context prompt processing (3x faster than RADV at 130K).
1.2 ROCm Build Flags
Minimal build:
```shell
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```
Optimized build (recommended):
```shell
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```
Critical ROCm build flags:
| Flag | Effect | Recommendation |
|---|---|---|
| `-DGGML_HIP=ON` | Enable HIP/ROCm backend | Required |
| `-DAMDGPU_TARGETS="gfx1151"` | Target Strix Halo GPU | Required -- do not use gfx1100 |
| `-DGGML_HIP_ROCWMMA_FATTN=ON` | Enable rocWMMA flash attention | Strongly recommended for pp |
| `-DGGML_HIP_GRAPHS=ON` | HIP graph kernel scheduling | Test -- may help reduce launch overhead |
| `-DGGML_HIP_NO_VMM=OFF` | Re-enable Virtual Memory Management | Default is disabled; test if needed |
Flags to be aware of but NOT set by default:
| Flag | Notes |
|---|---|
| `-DGGML_HIP_UMA=ON` | Uses hipMallocManaged for UMA. Avoid on Strix Halo -- it uses fine-grained memory that is significantly slower. Standard hipMalloc + GTT expansion via kernel params is faster. |
| `-DGGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` | Forces FP16 compute in hipBLAS. Documented for RDNA4 -- may help pp performance on gfx1151. Test before deploying. |
| `-DGGML_CUDA_FA_ALL_QUANTS=ON` | Compiles all KV cache quant type combinations for FA. Works for CUDA kernels, which HIP reuses via hipify. Increases compilation time substantially. Enable if you need quantized KV cache with flash attention. |
ROCm version considerations:
- ROCm 7.2: Known rocWMMA compilation issue (ambiguous template specializations in `mfma_impl.hpp`). Fixed in later point releases or by disabling rocWMMA.
- ROCm 7.0 RC / 7.1: Generally work well with gfx1151.
- ROCm 6.4.4: Some users report better performance than 7.x for certain workloads. The gfx1151 rocBLAS kernel regression means hipBLASLt is essential.
- Recommendation: Use ROCm 7.2+ with rocWMMA patches, or pre-built toolbox containers that have been validated for gfx1151.
1.3 Vulkan Build Flags
```shell
cmake -B build -S . \
  -DGGML_VULKAN=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```
The Vulkan build auto-detects cooperative matrix support (KHR_coopmat) at runtime.
gfx1151 with RADV reports matrix cores: KHR_coopmat in llama-bench logs.
Vulkan-specific considerations:
| Topic | Detail |
|---|---|
| RADV vs AMDVLK | RADV (Mesa) is recommended for gfx1151. Better long-context scaling, no 2GB buffer allocation limit. |
| AMDVLK buffer limit | AMDVLK caps single Vulkan allocations at ~2 GiB (VkPhysicalDeviceLimits::maxMemoryAllocationSize). RADV allows ~4 GiB. This causes OOM for models with large compute buffers. |
| CoopMat1 vs CoopMat2 | gfx1151 supports KHR_coopmat (CoopMat1). CoopMat2 (VK_NV_cooperative_matrix2) is NVIDIA-only. This means Vulkan flash attention on AMD falls back to CPU -- use ROCm for GPU-accelerated FA. |
| Shader compilation | Building from source with glslc available enables cooperative matrix shader variants. Pre-built binaries may omit them. |
1.4 LTO and PGO
llama.cpp does not have built-in LTO/PGO support in its CMake configuration. You can enable LTO manually:
```shell
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DCMAKE_BUILD_TYPE=Release
```
Expected benefit: 2-5% improvement in CPU-bound paths. The GPU kernels are compiled by the HIP/ROCm compiler and are not affected by host LTO.
PGO would require a two-pass build (instrument, profile, rebuild) and is not commonly done for llama.cpp. The dominant bottleneck is GPU kernel performance and memory bandwidth, not host-side code paths.
1.5 Compiler Tuning for ROCm
A known LLVM regression affects loop unrolling on RDNA. The following flag has been reported to help:
```shell
-DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
```
This increases the unrolling threshold for local memory operations, which can improve kernel performance for flash attention and matrix multiplication.
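Combining sections 1.2 and 1.5, a full optimized ROCm configure step might look like the sketch below. The unroll-threshold flag is a workaround worth A/B-benchmarking on your toolchain, not a guaranteed win:

```shell
# Optimized ROCm build: rocWMMA flash attention plus the RDNA loop-unrolling
# workaround. Benchmark with and without the -mllvm flag before adopting it.
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```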
2. Runtime Flags and Environment Variables
2.1 ROCm Environment Variables
| Variable | Priority | Effect |
|---|---|---|
| `ROCBLAS_USE_HIPBLASLT=1` | Critical | Switches from rocBLAS Tensile kernels to hipBLASLt. On gfx1151, default rocBLAS achieves only 5.76 TFLOPS (<9% efficiency); hipBLASLt achieves >60% efficiency. This is a 2-7x improvement for prompt processing. |
| `HSA_OVERRIDE_GFX_VERSION=11.5.1` | Set inside toolbox containers | Required for ROCm to recognize gfx1151. Set in the container, not by host scripts. |
| `HSA_ENABLE_SDMA=0` | Optional | Disables the SDMA engine. May help on some configurations, but generally not needed on Strix Halo with recent kernels. |
| `HIP_VISIBLE_DEVICES=0` | Optional | Select a specific GPU device. Useful in multi-GPU or container setups. |
| `GPU_MAX_HEAP_SIZE=100` | Optional | Allow 100% of GPU memory for the heap. The default may be lower. |
| `GPU_MAX_ALLOC_PERCENT=100` | Optional | Allow a single allocation up to 100% of GPU memory. |
| `ROCR_VISIBLE_DEVICES=0` | Optional | HSA-level device visibility control. |
| `AMD_LOG_LEVEL=0` | Optional | Suppress AMD driver logging noise. |
The single most impactful environment variable is `ROCBLAS_USE_HIPBLASLT=1`.
Without it, ROCm pp512 on Llama-2-7B drops from 882 t/s to 348 t/s (roughly 2.5x slower).
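A quick way to confirm the impact on your own system is an A/B run of llama-bench with and without the variable (a sketch; the model path is illustrative):

```shell
# A/B the hipBLASLt switch; expect a large pp512 delta on gfx1151.
# -n 0 skips token-generation runs so only prompt processing is measured.
for v in 0 1; do
  echo "ROCBLAS_USE_HIPBLASLT=$v"
  ROCBLAS_USE_HIPBLASLT=$v ./build/bin/llama-bench \
    -m ~/models/llama-2-7b.Q4_K_M.gguf \
    -ngl 99 -mmp 0 -fa 1 -p 512 -n 0 -r 3
done
```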
2.2 Vulkan Environment Variables
| Variable | Priority | Effect |
|---|---|---|
| `AMD_VULKAN_ICD=RADV` | Recommended | Force the RADV driver (skip AMDVLK). |
| `RADV_PERFTEST=nogttspill` | Important | Fixes GTT memory spilling issues on RADV. Can resolve significant pp performance drops (especially with FA off). |
| `GGML_VK_VISIBLE_DEVICES=0` | Optional | Select the Vulkan device index. |
| `GGML_VULKAN_DISABLE_F16=1` | Debugging | Force FP32 compute. Slower, but useful for debugging precision issues. |
| `GGML_LOG_LEVEL=2` | Debugging | Verbose logging to verify coopmat detection. |
2.3 Thread Count (-t flag)
For GPU-dominant inference (all layers offloaded), the thread count has minimal impact on throughput. The recommendation:
- Single-user inference: `-t 4` to `-t 8` (enough for tokenization/sampling overhead)
- Server with parallel slots: `-t` equal to the physical core count (12 Zen 5 cores on the Ryzen AI MAX+ 395)
- Hybrid CPU+GPU (partial offload): `-t` equal to the number of physical cores
The Ryzen AI MAX+ 395 has 16 cores (12 Zen 5 + 4 Zen 5c). For llama.cpp, using
all 12 big cores (-t 12) is optimal for CPU-involved workloads.
2.4 Batch Size Tuning (-b and -ub)
| Flag | Default | Role |
|---|---|---|
| `-b` / `--batch-size` | 2048 | Logical batch size (application level) |
| `-ub` / `--ubatch-size` | 512 | Physical batch size (device level) |
Tuning guidance for gfx1151:
- MoE models: `-b 256` significantly improves pp512 (a reported 70% improvement on Qwen3-30B-A3B)
- Dense models: the default `-b 2048` is generally fine
- Long context: `-ub 2048` can improve performance, but test against OOM
- Ultra-long context: reduce `-ub` if memory allocation fails
The Vulkan backend blog post for Strix Halo recommends: -c 32768 -b 4096 -ub 256
for a good balance of performance and memory.
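Finding the best logical batch size for a particular model is easiest with a small sweep (a sketch; the model path is illustrative):

```shell
# Sweep -b and compare pp512 rows; MoE models often peak well below the
# default of 2048, dense models usually near it.
for b in 256 512 1024 2048 4096; do
  ./build/bin/llama-bench -m ~/models/qwen3-30b-a3b.gguf \
    -ngl 99 -mmp 0 -fa 1 -b "$b" -p 512 -n 0 -r 3
done
```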
2.5 Memory-Mapped Loading (-mmp / --no-mmap)
Critical finding for unified memory APUs:
When you load large models to the GPU, memory mapping makes loading moderately slower for Vulkan and catastrophically slower for ROCm. Always set `--mmap 0` or `--no-mmap` to improve model loading times on Strix Halo.
For llama-bench, use -mmp 0. For llama-server/llama-cli, use --no-mmap.
On Strix Halo, both "GPU memory" and "CPU memory" share the same physical LPDDR5x. The difference is which pages are mapped for GPU access. GPU-mapped pages have full bandwidth (~215 GB/s). CPU-accessed pages get approximately half (~84 GB/s for CPU-to-GPU copies).
Always use -ngl 99 (or higher) to ensure all layers are on GPU memory.
Even on a unified memory system, GPU memory paths provide 2x the bandwidth.
2.6 GPU Layer Offloading (-ngl)
For Strix Halo with 64GB unified memory:
- Models < 50 GB: `-ngl 99` offloads everything; no tuning needed.
- Models 50-60 GB: `-ngl 99` should still work with GTT expanded via kernel params.
- Models > 60 GB: may need partial offload. Use `-ngl <N>`, where N is tuned to keep GPU memory under the GTT limit; remaining layers run on CPU at ~1/2 bandwidth.
Never let GPU spill to system RAM paths -- performance will be worse than pure CPU.
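For a first guess at `-ngl` before fine-tuning, a crude proportional estimate works (a sketch; it ignores per-layer size differences and leaves no headroom for the KV cache, so round down from the result):

```shell
# Rough starting point for -ngl on a model too big to fully offload:
# layers_that_fit = floor(gpu_budget / model_size * n_layers), capped at n_layers
ngl_guess() { awk -v size="$1" -v budget="$2" -v layers="$3" \
  'BEGIN { n = int(budget / size * layers); print (n > layers ? layers : n) }'; }
ngl_guess 70 60 80   # 70 GB model, 60 GB GTT budget, 80 layers -> 68
```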
3. Flash Attention and Attention Backends
3.1 When to Enable Flash Attention
Rule of thumb for gfx1151:
| Backend | Flash Attention | Recommendation |
|---|---|---|
| ROCm + rocWMMA | `-fa 1` | Always enable. 24% pp improvement, maintains tg speed, uses less memory. |
| ROCm without rocWMMA | `-fa 1` | Enable, but the improvement is smaller. |
| Vulkan RADV | `-fa 1` | Enable for short contexts; minor improvement at pp512/tg128. At long contexts, Vulkan FA may degrade performance. |
| Vulkan AMDVLK | `-fa 1` | Similar to RADV. |
Key caveat: Vulkan flash attention on AMD uses CoopMat1 (KHR_coopmat), not the more efficient CoopMat2 (NVIDIA-only). For AMD, ROCm + rocWMMA is the superior FA path.
3.2 rocWMMA Flash Attention Performance
Benchmark on gfx1151 (Llama2-7B Q4_K_M):
| Configuration | pp512 t/s | tg128 t/s |
|---|---|---|
| HIP standard | 592.28 | 40.40 |
| HIP + hipBLASLt | 548.72 | 40.43 |
| HIP + rocWMMA + hipBLASLt | 1006.80 | 39.46 |
| HIP + rocWMMA (no hipBLASLt) | 899.73 | 39.45 |
rocWMMA provides ~70% improvement in prompt processing with flash attention. Token generation is slightly slower (~2%) due to WMMA overhead at small batch.
3.3 The rocWMMA Long-Context Regression and Fix
The standard rocWMMA implementation has a long-context decode regression: at 65K context, tg degrades by up to 57% compared to HIP-only baseline.
The fix (PR #16827, "rocm-wmma-tune" branch) implements:
- `__launch_bounds__(256, 2)`: ensures a minimum of 2 blocks per SM, improving occupancy
- Adaptive KQ stride: uses stride 128 when head dimension <= 128, reducing the LDS footprint
- Selective WMMA usage: WMMA only for prefill; decode reverts to VEC/TILE kernels
Results after fix (Llama 3.2 1B Q4_K_M on gfx1151):
- pp512 at 65K context: 96% faster than untuned rocWMMA
- tg128 at 65K context: Matches HIP baseline (previously 57% degraded)
Status: This patch is available in -rocwmma-improved toolbox builds. It may
not be merged into upstream llama.cpp. Check Donato Capitella's toolboxes.
3.4 Vulkan Flash Attention Limitations on AMD
The Vulkan backend supports three FA paths:
| Path | Extension | AMD Support |
|---|---|---|
| FA_SCALAR | None | Yes (CPU fallback) |
| FA_COOPMAT1 | KHR_cooperative_matrix | Yes (gfx1151 reports support) |
| FA_COOPMAT2 | NV_cooperative_matrix2 | No (NVIDIA-only) |
FA_COOPMAT1 supports: f16, q4_0, q8_0, f32 KV cache types. FA_COOPMAT2 additionally supports all quant types.
When Vulkan FA is enabled on AMD with RADV, it uses CoopMat1 for matrix operations. This provides a modest improvement over scalar FA but is significantly less efficient than ROCm + rocWMMA.
3.5 New Attention Models (GatedDeltaNet)
Models using GatedDeltaNet architecture (Qwen3.5-27B, Qwen3.5-35B-A3B) have severe performance problems on gfx1151:
- Vulkan: No GATED_DELTA_NET compute shader exists; ops fall back to CPU
- ROCm/HIP: Kernel cross-compiles but suffers from register spilling (float s[S_v] allocates up to 512 bytes per thread) and hipMemcpyWithStream bottleneck (92-95% of decode time on models >15GB)
Result: Qwen3.5-27B runs at ~12 t/s on gfx1151 vs expected 50-80 t/s. Avoid GatedDeltaNet models on gfx1151 until kernel optimization lands.
4. Quantization Strategies for Speed
4.1 Quantization Speed on RDNA 3.5
Token generation speed is dominated by memory bandwidth, not compute. Smaller quantizations are faster because they reduce bytes-per-weight, allowing more tokens per second within the ~215 GB/s bandwidth envelope.
Approximate throughput formula for decode (bandwidth-bound):

```
tg_tokens/s ≈ effective_bandwidth_GB/s / model_size_GB
```

For a 7B Q4_K_M model (~4.1 GB):

```
215 / 4.1 ≈ 52 t/s (theoretical max; practical ~50 t/s on gfx1151)
```
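The same estimate as a reusable one-liner, handy for sanity-checking any bandwidth/model-size pairing before benchmarking:

```shell
# Bandwidth-bound decode ceiling: bandwidth (GB/s) divided by model size (GB)
tg_ceiling() { awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'; }
tg_ceiling 215 4.1   # 7B Q4_K_M on Strix Halo -> 52.4 t/s theoretical max
```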
4.2 Quantization Type Comparison
| Quant | Bits/Weight | Quality | Speed (relative) | Notes |
|---|---|---|---|---|
| Q4_0 | 4.0 | Low | Fastest | Legacy. Simple dequant. |
| Q4_K_M | 4.83 | Good | Very fast | K-quant with hierarchical blocks. Recommended default. |
| IQ4_XS | 4.25 | Good | Fast | Importance-weighted. Better quality/bit than Q4_K_M. |
| Q5_K_M | 5.69 | Very good | Fast | Sweet spot for quality-sensitive use. |
| Q6_K | 6.56 | Excellent | Moderate | Near-lossless quality. |
| Q8_0 | 8.0 | Near-perfect | Slower | 2x the bytes of Q4_K_M, ~2x slower tg. |
| F16 | 16.0 | Perfect | Slowest | Reference baseline. |
For RDNA 3.5 specifically:
- Q4_K_M is the best general-purpose quantization. The K-quant family uses hierarchical super-blocks (256 values) with per-sub-block scales, providing better quality than Q4_0 at a marginally higher dequant cost that is invisible at the GPU level.
- Q4_0 has the simplest dequant kernels and is marginally faster than Q4_K_M on some GPU backends. However, the quality loss is significant. Use it only for smoke tests or when every t/s matters more than quality.
- IQ4_XS (importance-matrix quantized) offers better quality per bit than Q4_K_M. Speed is similar. It requires an importance matrix file during quantization. Recommended over Q4_K_M when you control the quantization process.
- Q8_0 does NOT have special hardware-accelerated dequant on RDNA 3.5, which lacks INT8 tensor core equivalents. Q8_0 relies on the same FP16 compute paths, just with more memory bandwidth consumed.
4.3 Importance Matrix (imatrix) Quantization
imatrix quantization records how much each weight affects output quality, then allocates more precision bits to important weights. This is essential for sub-4-bit quantizations (IQ2_XS, IQ3_XXS, IQ4_XS) where standard K-quant shows measurable degradation.
```shell
# Generate importance matrix (GPU-accelerated)
llama-imatrix -m model-f16.gguf -f calibration_data.txt -ngl 99 -o imatrix.dat

# Quantize with imatrix
llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4_xs.gguf IQ4_XS
```
Speed impact: None. imatrix affects quantization quality, not inference speed. The dequantization kernels are identical regardless of whether imatrix was used.
4.4 Unsloth Dynamic (UD) Quantizations
Unsloth Dynamic 2.0 selectively quantizes different layers at different bit widths, choosing the optimal quantization per layer based on sensitivity analysis.
Speed impact: Minimal to none. UD quants use the same dequant kernels as standard GGUF quantizations. A UD-Q4_K_XL file runs at the same speed as a standard Q4_K_M of the same total size.
Quality impact: Significantly better. UD consistently outperforms standard quantizations in 5-shot MMLU and KL divergence metrics at the same total file size.
Recommendation: Prefer UD quants (e.g., UD-Q4_K_XL, UD-Q4_K_M) from
Unsloth when available. They are a free quality upgrade with no speed penalty.
5. Memory Layout and Caching
5.1 KV Cache Quantization
KV cache quantization reduces the memory footprint of the attention cache, allowing larger context windows within the same memory budget.
| Cache Type | Memory vs F16 | Quality Impact | Recommendation |
|---|---|---|---|
| f16 (default) | 1.0x | None | Baseline |
| q8_0 | 0.5x | Negligible (+0.002-0.05 ppl) | Recommended for production |
| q4_0 | 0.33x | Noticeable (+0.2-0.25 ppl) | Use when memory-constrained |
| q4_1 | 0.33x | Slightly better than q4_0 | Alternative to q4_0 |
| iq4_nl | 0.33x | Better than q4_0 | Best 4-bit KV option |
Usage:
```shell
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 ...
# or for llama-bench:
llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 ...
```
Performance impact: Quantizing K cache slightly improves throughput (less memory to read). Quantizing V cache may have a slight negative impact. Overall performance impact is negligible for normal inference.
Caveat with speculative decoding: Using KV cache quantization with a draft model causes a consistent ~16% performance drop. q4_0 KV with speculative decoding causes massive acceptance rate drops. Avoid KV quant if using speculative decoding.
5.2 mmap vs Full Load on Unified Memory
On Strix Halo's unified memory architecture:
- `--no-mmap` is strongly recommended for both ROCm and Vulkan.
- With mmap enabled, ROCm model loading is "catastrophically slower."
- Vulkan loading is "moderately slower" with mmap.
- Since CPU and GPU share physical RAM, there is no data copy when loading to "GPU memory" -- it is just a page table update.
For llama-bench: Always use -mmp 0.
For llama-server/llama-cli: Always use --no-mmap.
5.3 Prompt Caching
llama-server supports two levels of prompt caching:
1. Automatic KV cache reuse (`cache_prompt: true`): reuses the KV cache from previous requests when prompts share a common prefix. The server only reprocesses the suffix that differs.
2. Host-memory prompt caching (`--cram N`): stores pre-computed prompt representations in system RAM.
   - Reduces TTFT from ~4.2 s to ~0.3 s for cached requests (93% reduction)
   - +6% token throughput (34 vs 32 t/s)
   - Memory formula: `num_prefixes * avg_prefix_tokens * 8 bytes`
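As a worked example of the memory formula: four distinct 8K-token prefixes cost only about a quarter of a MiB of host RAM, so `--cram` budgets can stay small:

```shell
# Host prompt-cache footprint: num_prefixes * avg_prefix_tokens * 8 bytes
cram_bytes() { awk -v n="$1" -v t="$2" 'BEGIN { print n * t * 8 }'; }
cram_bytes 4 8192   # 4 prefixes of 8192 tokens -> 262144 bytes (256 KiB)
```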
Configuration:
```shell
llama-server -m model.gguf \
  --cram 256 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  -fa \
  -ngl 99
# --cram 256:            256 MB of host RAM for the prompt cache
# --cache-type-k/v q8_0: quantized KV cache
```
Best for:
- System prompts > 5K tokens
- Multi-user chatbots with shared context
- Agentic use with repeated tool-call prefixes
5.4 UMA Detection Bug (Issue #18159)
llama.cpp's UMA detection (from PR #17368, designed for NVIDIA DGX Spark)
incorrectly activates on AMD APUs when prop.integrated=1. It reads
/proc/meminfo instead of hipMemGetInfo(), severely underreporting available
GPU memory (e.g., reporting 27GB instead of 96GB).
Workarounds:
- Build without `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
- Guard UMA detection with `!defined(GGML_USE_HIP)` (upstream fix pending)
- Use toolbox containers where this has been patched
5.5 KV Cache Placement on ROCm (Issue #18011)
On Strix Halo, the ROCm backend may dump KV cache into shared (CPU-accessible) memory instead of GPU-mapped memory, causing performance degradation at high context sizes. This is a known issue contributing to ROCm falling behind Vulkan for tg at high contexts.
Mitigation: Use the rocWMMA-tuned branch which maintains better memory placement, or use Vulkan RADV for workloads where this matters.
6. llama-server Specific Optimizations
6.1 Recommended Server Configuration
```shell
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --no-mmap \
  -fa \
  -c 32768 \
  -np 4 \
  -b 2048 \
  -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --jinja
# -c 32768:        total context, shared across all slots
# -np 4:           4 parallel slots (adjust for your use)
# -b 2048 -ub 512: logical / physical batch sizes
# --cont-batching: enabled by default; listed for clarity
# --jinja:         enable Jinja2 chat templates
```
6.2 Parallel Slot Configuration (-np)
| Use Case | Slots | Context per Slot | Total -c |
|---|---|---|---|
| Single user chat | 1 | 32768 | 32768 |
| Agentic coding (Claude Code style) | 2-4 | 8192-16384 | 32768-65536 |
| Multi-user API | 4-8 | 4096-8192 | 32768-65536 |
| Eval harness | 1 | 32768+ | 32768+ |
Memory formula: Each slot requires context_size * 2 * hidden_dim * n_layers * bytes_per_kv_element.
With q8_0 KV cache, this is roughly halved compared to f16.
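A worked example of the formula (a sketch: it uses `hidden_dim` directly, as the formula above states; models with GQA have a smaller effective KV width, so treat the result as an upper bound):

```shell
# Per-slot KV cache size: context * 2 * hidden_dim * n_layers * bytes_per_element
# bytes_per_element: 2 for f16, roughly 1 for q8_0
kv_gib() { awk -v c="$1" -v h="$2" -v l="$3" -v b="$4" \
  'BEGIN { printf "%.2f\n", c * 2 * h * l * b / 2^30 }'; }
kv_gib 8192 4096 32 2   # 8K ctx, 7B-class dims, f16 -> 4.00 GiB per slot
```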
6.3 Continuous Batching
Enabled by default (--cont-batching). Allows the server to process multiple
requests simultaneously, interleaving prefill and decode operations.
For agentic workloads: One slot typically holds a large system prompt + conversation context, while additional slots handle parallel tool calls. Configure with:
-np 4 -c 131072 # 4 slots, up to 32K context each
6.4 Prompt Caching for Agentic Use
For agentic coding tools that send the same system prompt repeatedly:
- Use `cache_prompt: true` in API requests (reuses the KV cache prefix)
- Use `--system-prompt-file system.txt` for static system prompts (note: may be removed in recent versions; verify with your build)
- Use `--cram 128` to enable host-memory caching for prefix deduplication
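A minimal sketch of opting in from a client, using the llama-server `/completion` endpoint's `cache_prompt` field (the prompt text and host/port are illustrative):

```shell
# Request KV-cache prefix reuse for a repeated system prompt; on the second
# call with the same prefix, only the differing suffix is reprocessed.
curl -s http://localhost:8080/completion -d '{
  "prompt": "You are a careful coding assistant.\n\nRefactor this function: ...",
  "n_predict": 128,
  "cache_prompt": true
}'
```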
6.5 Speculative Decoding
For token generation speedup with a draft model:
```shell
llama-server -m main-model.gguf \
  --model-draft draft-model.gguf \
  -ngl 99 \
  --draft-max 8 \
  --draft-min 1 \
  --no-mmap \
  -fa
```
Caveat: Do NOT combine speculative decoding with KV cache quantization. The 16% performance drop and reduced acceptance rate negate the benefits.
7. Upcoming llama.cpp Features (2026)
7.1 Backend-Agnostic Tensor Parallelism (PR #19378)
Merged January 2026. Adds --split-mode tensor for splitting computation across
multiple GPUs via a new "meta" backend.
Relevance to Strix Halo: Limited. Single integrated GPU. However, for RPC configurations with multiple Strix Halo nodes (Jeff Geerling's Beowulf cluster), tensor parallelism could complement the existing layer-split approach.
Currently supports 1-2 GPUs with equal data split. --tensor-split has no effect yet.
7.2 TurboQuant KV Cache Compression (ICLR 2026)
Google's TurboQuant (Zandieh et al.) achieves 3-bit KV cache quantization with no training and negligible quality loss:
| Format | MSE vs FP16 | Compression |
|---|---|---|
| TQ3 (3-bit) | 0.034 | 4.9x |
| TQ4 (4-bit) | 0.009 | 3.8x |
Timeline: Open-source llama.cpp integration expected Q2-Q3 2026. A 6-phase integration plan exists covering GGML type registration, KV cache paths, FA integration, and CLI flags.
7.3 Vulkan Improvements
Active 2025-2026 developments:
- Mesa RADV optimizations for RDNA4 AI workloads (Rhys Perry/Valve patches)
- 13% pp improvement from CU mode optimization for LDS utilization
- BFloat16 Vulkan support (`VK_KHR_shader_bfloat16`) maturing in Mesa 25.x
- Partial offloading performance improvement for AMD (llama.cpp b8185, March 2026)
7.4 Flash Attention for Head Dimension 512
Pull request from March 2026 adds FA support for HD=512 in CUDA kernels. This benefits models with larger head dimensions (some newer architectures). The HIP path should inherit this via hipify.
7.5 ik_llama.cpp Fork Innovations
The ik_llama.cpp fork by ikawrakow introduces:
- Row-interleaved quant packing (better memory access patterns)
- Smart Expert Reduction for faster MoE inference
- Tensor overrides with regex patterns for hybrid GPU/CPU placement
- FlashMLA for DeepSeek models
Caveat: ik_llama.cpp only fully supports CPU and CUDA backends. ROCm/Vulkan are not maintained. Not recommended for AMD gfx1151.
8. Recommended Configurations
8.1 For llama-bench (Benchmarking)
ROCm backend:
```shell
ROCBLAS_USE_HIPBLASLT=1 \
toolbox run -c llama-rocm-7.2 -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```
Vulkan backend:
```shell
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
toolbox run -c llama-vulkan -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```
8.2 For llama-server (Production Agentic Use)
ROCm (best for long context):
```shell
ROCBLAS_USE_HIPBLASLT=1 \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 65536 -np 4 \
  -b 2048 -ub 512 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --cram 256 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```
Vulkan RADV (best for single-user tg):
```shell
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 32768 -np 2 \
  -b 4096 -ub 256 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```
8.3 Decision Matrix
| Question | Answer |
|---|---|
| Which backend for benchmarking? | Both. ROCm and Vulkan have different strengths. |
| Which backend for daily chat? | Vulkan RADV for best tg speed. |
| Which backend for long-context agentic? | ROCm + rocWMMA-tuned for context resilience. |
| Which quantization? | Q4_K_M or UD-Q4_K_XL for speed; Q5_K_M for quality. |
| Enable flash attention? | Yes, always on ROCm. Yes on Vulkan for short contexts. |
| Use `--no-mmap`? | Always. |
| Set `ROCBLAS_USE_HIPBLASLT=1`? | Always for ROCm. |
| Set `RADV_PERFTEST=nogttspill`? | Always for Vulkan RADV. |
| KV cache quantization? | q8_0 for both K and V unless using speculative decoding. |
| Batch size for MoE? | -b 256 (lower than default improves some MoE models). |
9. Sources
GitHub Issues and Discussions
- Performance of llama.cpp on AMD ROCm (HIP) - Discussion #15021
- Performance of llama.cpp with Vulkan - Discussion #10879
- HIP backend performs poorly on gfx1151 - Issue #13565
- UMA detection incorrectly limits memory on AMD APUs - Issue #18159
- ROCm model loading dumps KV cache to shared memory - Issue #18011
- GATED_DELTA_NET underperformance on gfx1151 - Issue #20354
- Under-Performance of ROCm 7.2 binaries - Issue #19984
- ROCm 7.2 + rocWMMA compilation - Issue #19269
- Building for gfx1151 - Issue #14734
- AMDVLK 2GB buffer allocation limit - Issue #15054
- Mastering Host-Memory Prompt Caching - Discussion #20574
- TurboQuant Extreme KV Cache Quantization - Discussion #20969
- Backend-agnostic tensor parallelism - PR #19378
- Massively Improved rocWMMA Performance - PR #16827
- rocWMMA for gfx1151 performance boost - lemonade-sdk Issue #7
- Increase llama.cpp performance on AI Max 395+ - geerlingguy Issue #5
Wiki and Community Resources
- Strix Halo Wiki - llama.cpp Performance
- Strix Halo Wiki - llama.cpp with ROCm
- AMD Strix Halo Backend Benchmarks (Grid View)
- LLM Tracker - AMD Strix Halo GPU Performance
- Framework Community - Strix Halo GPU LLM Performance Tests
- Framework Community - Toolboxes for LLM inference on Strix Halo
Articles and Blog Posts
- Hardware Corner - Strix Halo LLM Optimization
- Hardware Corner - RADV Vulkan Driver 13% Improvement
- Phoronix - AMD ROCm 7.1 vs RADV Vulkan
- Phoronix - Valve Developer RADV Improvement
- Yifei's Blog - Strix Halo Matrix Cores with llama.cpp
- Strix Halo CUDA/HIP Testing Notes (lhl)
Official Documentation
- ROCm - llama.cpp compatibility
- ROCm - llama.cpp installation
- ROCm Blog - Llama.cpp Meets Instinct
- llama.cpp build documentation
- llama-server README
Papers
- "Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct" (January 2026, arXiv:2601.14277)
- "TurboQuant: Redefining AI efficiency with extreme compression" (Zandieh et al., ICLR 2026)
- Unsloth Dynamic 2.0 GGUFs Documentation
Open Questions / Limitations
- rocWMMA-tuned patch upstream status: PR #16827 may not be fully merged. Monitor for inclusion in mainline llama.cpp, or continue using patched toolboxes.
- ROCm 7.2 stability on gfx1151: multiple reports of crashes (MUT_MAL errors), performance regressions, and compilation issues. ROCm 7.x is maturing but not fully stable for gfx1151 as of March 2026.
- Vulkan CoopMat FA for AMD: will AMD ever get CoopMat2 support? The current CoopMat1 path provides a modest improvement. A native AMD CoopMat2 or equivalent extension would close the gap with ROCm FA.
- KV cache placement on ROCm: Issue #18011 (KV cache dumped to shared memory) reduces ROCm tg performance at high contexts. The root cause appears to be in HIP memory allocation behavior on APUs.
- GGML_HIP_UMA vs kernel-param GTT expansion: the UMA flag uses slow fine-grained memory. GTT expansion via `amdgpu.gttsize` kernel params provides coarse-grained GPU-mapped memory that is much faster. The upstream approach may eventually improve, but kernel params remain the correct method for now.
- GatedDeltaNet architecture support: both Vulkan (missing shader) and ROCm (register pressure, memcpy bottleneck) perform poorly on GDN models. This blocks efficient use of Qwen3.5-27B and similar models.
- TurboQuant integration timeline: expected Q2-Q3 2026 for llama.cpp. Would provide 3-bit KV cache with no quality loss, roughly doubling available context within the same memory budget.
- NPU utilization: the 50 TOPS NPU on Strix Halo is currently unusable from Linux for llama.cpp inference. AMD driver support for the NPU on Linux remains pending.
Overlap Notes
- Kernel parameters (`amdgpu.gttsize`, `ttm.pages_limit`, `iommu=pt`): already documented in the project's `scripts/optimize/kernel-params.sh`. This research covers the llama.cpp side (why they matter for inference).
- BIOS VRAM allocation: reducing dedicated VRAM in BIOS frees more memory for GTT. This is documented in the project's audit scripts, but it is a prerequisite for the optimizations described here.
- Toolbox container builds: the project uses pre-built toolboxes (`llama-rocm-7.2`, `llama-vulkan`). The compilation flags documented here describe what should be baked into those containers.