# llama.cpp Runtime and Compilation Optimization for AMD RDNA 3.5 (gfx1151)

Comprehensive research into maximizing inference performance on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified LPDDR5x-8000). Researched March 2026.

---

## Scope

This document covers every known compilation flag, runtime parameter, environment variable, and architectural optimization for llama.cpp targeting gfx1151 (RDNA 3.5) with both ROCm/HIP and Vulkan backends on Fedora. It does not cover vLLM, ollama internals, or non-llama.cpp inference engines except where their findings inform llama.cpp optimization.

---

## Table of Contents

1. [Compilation Flags and Build Optimizations](#1-compilation-flags-and-build-optimizations)
2. [Runtime Flags and Environment Variables](#2-runtime-flags-and-environment-variables)
3. [Flash Attention and Attention Backends](#3-flash-attention-and-attention-backends)
4. [Quantization Strategies for Speed](#4-quantization-strategies-for-speed)
5. [Memory Layout and Caching](#5-memory-layout-and-caching)
6. [llama-server Specific Optimizations](#6-llama-server-specific-optimizations)
7. [Upcoming llama.cpp Features (2026)](#7-upcoming-llamacpp-features-2026)
8. [Recommended Configurations](#8-recommended-configurations)
9. [Sources](#9-sources)

---

## 1. Compilation Flags and Build Optimizations

### 1.1 GGML_HIP (ROCm) vs GGML_VULKAN: Which Backend to Build

Both backends are worth building.
Neither is universally faster on gfx1151:

| Workload | Winner | Rationale |
|----------|--------|-----------|
| Token generation (short ctx) | Vulkan RADV | Lower driver overhead, mature kernel paths |
| Token generation (long ctx, 8K+) | ROCm + rocWMMA + FA | Maintains speed as context grows; uses less memory |
| Prompt processing (short ctx) | Mixed -- model-dependent | AMDVLK or ROCm hipBLASLt win on some shapes |
| Prompt processing (long ctx) | ROCm + rocWMMA-tuned | 96% speedup over untuned rocWMMA at 65K ctx |
| Memory efficiency at long ctx | ROCm + FA | Less memory than the Vulkan equivalent |

Benchmark data (Qwen3-30B-A3B UD-Q4_K_XL, gfx1151, flash attention on):

| Backend | pp512 t/s | tg128 t/s | pp512@130K t/s | tg128@130K t/s |
|---------|-----------|-----------|----------------|----------------|
| Vulkan RADV | 755.14 | 85.11 | 17.24 | 12.54 |
| Vulkan AMDVLK | 741.60 | 81.79 | 10.75 | 3.51 |
| ROCm hipBLASLt | 651.93 | 63.95 | 40.35 | 4.97 |
| ROCm rocWMMA-tuned | 659.07 | 67.66 | 51.12 | 13.33 |

Key insight: RADV scales significantly better than AMDVLK for long contexts (3.6x faster tg at 130K depth). ROCm with tuned rocWMMA provides the best long-context prompt processing (3x faster than RADV at 130K).

### 1.2 ROCm Build Flags

**Minimal build:**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

**Optimized build (recommended):**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

#### Critical ROCm build flags:

| Flag | Effect | Recommendation |
|------|--------|----------------|
| `-DGGML_HIP=ON` | Enable HIP/ROCm backend | Required |
| `-DAMDGPU_TARGETS="gfx1151"` | Target Strix Halo GPU | Required -- do not use gfx1100 |
| `-DGGML_HIP_ROCWMMA_FATTN=ON` | Enable rocWMMA flash attention | Strongly recommended for pp |
| `-DGGML_HIP_GRAPHS=ON` | HIP graph kernel scheduling | Test -- may help reduce launch overhead |
| `-DGGML_HIP_NO_VMM=OFF` | Re-enable Virtual Memory Management | Default is disabled; test if needed |

#### Flags to be aware of but NOT set by default:

| Flag | Notes |
|------|-------|
| `-DGGML_HIP_UMA=ON` | Uses hipMallocManaged for UMA. **Avoid on Strix Halo** -- it uses fine-grained memory that is significantly slower. Standard hipMalloc + GTT expansion via kernel params is faster. |
| `-DGGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` | Forces FP16 compute in hipBLAS. Documented for RDNA4 -- may help pp performance on gfx1151. **Test before deploying.** |
| `-DGGML_CUDA_FA_ALL_QUANTS=ON` | Compiles all KV cache quant type combinations for FA. **Works for CUDA kernels which HIP reuses via hipify.** Increases compilation time substantially. Enable if you need quantized KV cache with flash attention. |

#### ROCm version considerations:

- **ROCm 7.2**: Known rocWMMA compilation issue (ambiguous template specializations in `mfma_impl.hpp`). Fixed in later point releases or by disabling rocWMMA.
- **ROCm 7.0 RC / 7.1**: Generally work well with gfx1151.
- **ROCm 6.4.4**: Some users report better performance than 7.x for certain workloads. The gfx1151 rocBLAS kernel regression means hipBLASLt is essential.
- **Recommendation**: Use ROCm 7.2+ with rocWMMA patches, or pre-built toolbox containers that have been validated for gfx1151.

### 1.3 Vulkan Build Flags

```bash
cmake -B build -S . \
  -DGGML_VULKAN=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

The Vulkan build auto-detects cooperative matrix support (KHR_coopmat) at runtime. gfx1151 with RADV reports `matrix cores: KHR_coopmat` in llama-bench logs.

#### Vulkan-specific considerations:

| Topic | Detail |
|-------|--------|
| **RADV vs AMDVLK** | RADV (Mesa) is recommended for gfx1151. Better long-context scaling, no 2GB buffer allocation limit. |
| **AMDVLK buffer limit** | AMDVLK caps single Vulkan allocations at ~2 GiB (`VkPhysicalDeviceLimits::maxMemoryAllocationSize`). RADV allows ~4 GiB. This causes OOM for models with large compute buffers. |
| **CoopMat1 vs CoopMat2** | gfx1151 supports KHR_coopmat (CoopMat1). CoopMat2 (`VK_NV_cooperative_matrix2`) is NVIDIA-only, so Vulkan flash attention on AMD is limited to the less efficient CoopMat1 path (see 3.4) -- ROCm + rocWMMA remains the fastest GPU-accelerated FA route. |
| **Shader compilation** | Building from source with `glslc` available enables cooperative matrix shader variants. Pre-built binaries may omit them. |

### 1.4 LTO and PGO

llama.cpp does not have built-in LTO/PGO support in its CMake configuration. You can enable LTO manually:

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DCMAKE_BUILD_TYPE=Release
```

Expected benefit: 2-5% improvement in CPU-bound paths. The GPU kernels are compiled by the HIP/ROCm compiler and are not affected by host LTO.

PGO would require a two-pass build (instrument, profile, rebuild) and is not commonly done for llama.cpp. The dominant bottleneck is GPU kernel performance and memory bandwidth, not host-side code paths.

### 1.5 Compiler Tuning for ROCm

A known LLVM regression affects loop unrolling on RDNA.
The following flag has been reported to help:

```bash
-DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
```

This increases the unrolling threshold for local memory operations, which can improve kernel performance for flash attention and matrix multiplication.

---

## 2. Runtime Flags and Environment Variables

### 2.1 ROCm Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `ROCBLAS_USE_HIPBLASLT=1` | **Critical** | Switches from rocBLAS tensile kernels to hipBLASLt. On gfx1151, default rocBLAS achieves only 5.76 TFLOPS (<9% efficiency). hipBLASLt achieves >60% efficiency. **This is a 2-7x improvement for prompt processing.** |
| `HSA_OVERRIDE_GFX_VERSION=11.5.1` | Set inside toolbox containers | Required for ROCm to recognize gfx1151. Set in the container, not by host scripts. |
| `HSA_ENABLE_SDMA=0` | Optional | Disables the SDMA engine. May help on some configurations, but generally not needed on Strix Halo with recent kernels. |
| `HIP_VISIBLE_DEVICES=0` | Optional | Select a specific GPU device. Useful in multi-GPU or container setups. |
| `GPU_MAX_HEAP_SIZE=100` | Optional | Allow 100% of GPU memory for the heap. The default may be lower. |
| `GPU_MAX_ALLOC_PERCENT=100` | Optional | Allow a single allocation up to 100% of GPU memory. |
| `ROCR_VISIBLE_DEVICES=0` | Optional | HSA-level device visibility control. |
| `AMD_LOG_LEVEL=0` | Optional | Suppress AMD driver logging noise. |

**The single most impactful environment variable is `ROCBLAS_USE_HIPBLASLT=1`.** Without it, ROCm pp512 on Llama-2-7B drops from 882 t/s to 348 t/s (~2.5x slower).

### 2.2 Vulkan Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `AMD_VULKAN_ICD=RADV` | Recommended | Force the RADV driver (skip AMDVLK). |
| `RADV_PERFTEST=nogttspill` | **Important** | Fixes GTT memory spilling issues on RADV. Can resolve significant pp performance drops (especially with FA off). |
| `GGML_VK_VISIBLE_DEVICES=0` | Optional | Select the Vulkan device index. |
| `GGML_VULKAN_DISABLE_F16=1` | Debugging | Force FP32 compute. Slower but useful for debugging precision issues. |
| `GGML_LOG_LEVEL=2` | Debugging | Verbose logging to verify coopmat detection. |

### 2.3 Thread Count (`-t` flag)

For GPU-dominant inference (all layers offloaded), the thread count has minimal impact on throughput. Recommendations:

- **Single-user inference**: `-t 4` to `-t 8` (enough for tokenization/sampling overhead)
- **Server with parallel slots**: `-t` equal to the physical core count (16 on the Ryzen AI MAX+ 395)
- **Hybrid CPU+GPU (partial offload)**: `-t` equal to the number of physical cores

The Ryzen AI MAX+ 395 has 16 full Zen 5 cores (Strix Halo does not use Zen 5c compact cores). For CPU-involved llama.cpp workloads, start at `-t 16` and benchmark downward: decode is memory-bandwidth-bound, so throughput often plateaus before all cores are busy.

### 2.4 Batch Size Tuning (`-b` and `-ub`)

| Flag | Default | Role |
|------|---------|------|
| `-b` / `--batch-size` | 2048 | Logical batch size (application level) |
| `-ub` / `--ubatch-size` | 512 | Physical batch size (device level) |

Tuning guidance for gfx1151:

- **MoE models**: `-b 256` significantly improves pp512 (a reported 70% improvement on Qwen3-30B-A3B)
- **Dense models**: The default `-b 2048` is generally fine
- **Long context**: `-ub 2048` can improve performance, but test against OOM
- **Ultra-long context**: Reduce `-ub` if memory allocation fails

The Vulkan backend blog post for Strix Halo recommends `-c 32768 -b 4096 -ub 256` for a good balance of performance and memory.

### 2.5 Memory-Mapped Loading (`-mmp` / `--no-mmap`)

**Critical finding for unified memory APUs:**

> When you load large models to the GPU, memory mapping can make loading moderately
> slower for Vulkan, and **catastrophically slower for ROCm**. You should always set
> `--mmap 0` or `--no-mmap` to improve model loading times on Strix Halo.

For `llama-bench`, use `-mmp 0`. For `llama-server`/`llama-cli`, use `--no-mmap`.

On Strix Halo, both "GPU memory" and "CPU memory" share the same physical LPDDR5x.
The difference is which pages are mapped for GPU access. GPU-mapped pages get full bandwidth (~215 GB/s); CPU-accessed pages get less than half of that (~84 GB/s for CPU-to-GPU copies).

**Always use `-ngl 99` (or higher) to ensure all layers are in GPU memory.** Even on a unified memory system, the GPU memory paths provide over 2x the bandwidth.

### 2.6 GPU Layer Offloading (`-ngl`)

For Strix Halo with 64GB unified memory:

- **Models < 50GB**: `-ngl 99` offloads everything. No tuning needed.
- **Models 50-60GB**: `-ngl 99` should still work with GTT expanded via kernel params.
- **Models > 60GB**: May need partial offload. Use `-ngl N` where N is tuned to keep GPU memory under the GTT limit. Remaining layers run on the CPU at ~1/2 bandwidth.

**Never let the GPU spill to system RAM paths** -- performance will be worse than pure CPU.

---

## 3. Flash Attention and Attention Backends

### 3.1 When to Enable Flash Attention

**Rule of thumb for gfx1151:**

| Backend | Flash Attention | Recommendation |
|---------|----------------|----------------|
| ROCm + rocWMMA | `-fa 1` | **Always enable.** 24% pp improvement, maintains tg speed, uses less memory. |
| ROCm without rocWMMA | `-fa 1` | Enable, but with a smaller improvement. |
| Vulkan RADV | `-fa 1` | **Enable for short contexts.** Minor improvement at pp512/tg128. At long contexts, Vulkan FA may degrade performance. |
| Vulkan AMDVLK | `-fa 1` | Similar to RADV. |

**Key caveat**: Vulkan flash attention on AMD uses CoopMat1 (KHR_coopmat), not the more efficient CoopMat2 (NVIDIA-only). For AMD, ROCm + rocWMMA is the superior FA path.

### 3.2 rocWMMA Flash Attention Performance

Benchmark on gfx1151 (Llama2-7B Q4_K_M):

| Configuration | pp512 t/s | tg128 t/s |
|---------------|-----------|-----------|
| HIP standard | 592.28 | 40.40 |
| HIP + hipBLASLt | 548.72 | 40.43 |
| HIP + rocWMMA + hipBLASLt | 1006.80 | 39.46 |
| HIP + rocWMMA (no hipBLASLt) | 899.73 | 39.45 |

rocWMMA provides ~70% improvement in prompt processing with flash attention.
Token generation is slightly slower (~2%) due to WMMA overhead at small batch sizes.

### 3.3 The rocWMMA Long-Context Regression and Fix

The standard rocWMMA implementation has a **long-context decode regression**: at 65K context, tg degrades by up to 57% compared to the HIP-only baseline.

**The fix** (PR #16827, "rocm-wmma-tune" branch) implements:

1. **`__launch_bounds__(256, 2)`**: Ensures a minimum of 2 blocks per SM, improving occupancy
2. **Adaptive KQ stride**: Uses stride 128 when head dimension <= 128, reducing the LDS footprint
3. **Selective WMMA usage**: WMMA only for prefill; decode reverts to the VEC/TILE kernels

Results after the fix (Llama 3.2 1B Q4_K_M on gfx1151):

- pp512 at 65K context: **96% faster** than untuned rocWMMA
- tg128 at 65K context: Matches the HIP baseline (previously 57% degraded)

**Status**: This patch is available in `-rocwmma-improved` toolbox builds. It may not be merged into upstream llama.cpp. Check Donato Capitella's toolboxes.

### 3.4 Vulkan Flash Attention Limitations on AMD

The Vulkan backend supports three FA paths:

| Path | Extension | AMD Support |
|------|-----------|-------------|
| FA_SCALAR | None | Yes (fallback path, no matrix cores) |
| FA_COOPMAT1 | KHR_cooperative_matrix | Yes (gfx1151 reports support) |
| FA_COOPMAT2 | NV_cooperative_matrix2 | **No** (NVIDIA-only) |

FA_COOPMAT1 supports f16, q4_0, q8_0, and f32 KV cache types. FA_COOPMAT2 additionally supports all quant types.

When Vulkan FA is enabled on AMD with RADV, it uses CoopMat1 for matrix operations. This provides a modest improvement over scalar FA but is significantly less efficient than ROCm + rocWMMA.
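Before relying on Vulkan FA numbers, it is worth confirming which path the driver actually exposes. A diagnostic sketch (assumes `vulkaninfo` from vulkan-tools is installed; the grep targets are the extension and log strings named above, and the binary/model paths are placeholders):

```shell
# RADV should list VK_KHR_cooperative_matrix on gfx1151
AMD_VULKAN_ICD=RADV vulkaninfo 2>/dev/null | grep -i cooperative_matrix

# llama.cpp logs the selected matrix-core path at startup;
# look for "matrix cores: KHR_coopmat" in the device line
GGML_LOG_LEVEL=2 /path/to/llama-bench -m /path/to/model.gguf -ngl 99 -mmp 0 2>&1 \
  | grep -i coopmat
```

If the first command prints nothing, the FA_COOPMAT1 path is unavailable and Vulkan FA will take the scalar fallback.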
### 3.5 New Attention Models (GatedDeltaNet)

Models using the GatedDeltaNet architecture (Qwen3.5-27B, Qwen3.5-35B-A3B) have severe performance problems on gfx1151:

- **Vulkan**: No GATED_DELTA_NET compute shader exists; the ops fall back to CPU
- **ROCm/HIP**: The kernel cross-compiles but suffers from register spilling (`float s[S_v]` allocates up to 512 bytes per thread) and a hipMemcpyWithStream bottleneck (92-95% of decode time on models >15GB)

Result: Qwen3.5-27B runs at ~12 t/s on gfx1151 vs an expected 50-80 t/s. **Avoid GatedDeltaNet models on gfx1151 until kernel optimization lands.**

---

## 4. Quantization Strategies for Speed

### 4.1 Quantization Speed on RDNA 3.5

Token generation speed is dominated by memory bandwidth, not compute. Smaller quantizations are faster because they reduce bytes-per-weight, allowing more tokens per second within the ~215 GB/s bandwidth envelope.

Approximate throughput formula for decode (bandwidth-bound):

```
tg_tokens/s ≈ effective_bandwidth_GB/s / model_size_bytes * 1e9
```

For a 7B Q4_K_M model (~4.1 GB):

```
215 / 4.1 ≈ 52 t/s (theoretical max; practical ~50 t/s on gfx1151)
```

### 4.2 Quantization Type Comparison

| Quant | Bits/Weight | Quality | Speed (relative) | Notes |
|-------|------------|---------|------------------|-------|
| Q4_0 | 4.0 | Low | Fastest | Legacy. Simple dequant. |
| Q4_K_M | 4.83 | Good | Very fast | K-quant with hierarchical blocks. Recommended default. |
| IQ4_XS | 4.25 | Good | Fast | Importance-weighted. Better quality/bit than Q4_K_M. |
| Q5_K_M | 5.69 | Very good | Fast | Sweet spot for quality-sensitive use. |
| Q6_K | 6.56 | Excellent | Moderate | Near-lossless quality. |
| Q8_0 | 8.0 | Near-perfect | Slower | ~1.7x the bytes of Q4_K_M; tg slows proportionally. |
| F16 | 16.0 | Perfect | Slowest | Reference baseline. |

**For RDNA 3.5 specifically**:

- **Q4_K_M** is the best general-purpose quantization.
The K-quant family uses hierarchical super-blocks (256 values) with per-sub-block scales, providing better quality than Q4_0 at a marginally higher dequant cost that is invisible at the GPU level.
- **Q4_0** has the simplest dequant kernels and is marginally faster than Q4_K_M on some GPU backends. However, the quality loss is significant. Use it only for smoke tests or when every t/s matters more than quality.
- **IQ4_XS** (importance-matrix quantized) offers better quality per bit than Q4_K_M. Speed is similar. Requires an importance matrix file during quantization. **Recommended over Q4_K_M when you control the quantization process.**
- **Q8_0** does NOT have special hardware-accelerated dequant on RDNA 3.5. RDNA 3.5 lacks INT8 tensor core equivalents. Q8_0 performance relies on the same FP16 compute paths, just with more memory bandwidth consumed.

### 4.3 Importance Matrix (imatrix) Quantization

imatrix quantization records how much each weight affects output quality, then allocates more precision bits to important weights. This is essential for sub-4-bit quantizations (IQ2_XS, IQ3_XXS, IQ4_XS) where standard K-quant shows measurable degradation.

```bash
# Generate the importance matrix (GPU-accelerated)
llama-imatrix -m model-f16.gguf -f calibration_data.txt -ngl 99 -o imatrix.dat

# Quantize with the imatrix
llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4_xs.gguf IQ4_XS
```

**Speed impact**: None. imatrix affects quantization quality, not inference speed. The dequantization kernels are identical regardless of whether an imatrix was used.

### 4.4 Unsloth Dynamic (UD) Quantizations

Unsloth Dynamic 2.0 selectively quantizes different layers at different bit widths, choosing the optimal quantization per layer based on sensitivity analysis.

**Speed impact**: Minimal to none. UD quants use the same dequant kernels as standard GGUF quantizations. A UD-Q4_K_XL file runs at the same speed as a standard Q4_K_M of the same total size.
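The speed-parity claim is easy to sanity-check on your own hardware with an A/B `llama-bench` run (a sketch; the model file names and binary path are placeholders):

```shell
# tg128 should land within noise between the two files;
# the UD advantage shows up in quality metrics, not in t/s.
for m in model-Q4_K_M.gguf model-UD-Q4_K_XL.gguf; do
  /path/to/llama-bench -m "$m" -ngl 99 -mmp 0 -fa 1 -p 512 -n 128 -r 3
done
```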
**Quality impact**: Significantly better. UD consistently outperforms standard quantizations in 5-shot MMLU and KL divergence metrics at the same total file size.

**Recommendation**: Prefer UD quants (e.g., `UD-Q4_K_XL`, `UD-Q4_K_M`) from Unsloth when available. They are a free quality upgrade with no speed penalty.

---

## 5. Memory Layout and Caching

### 5.1 KV Cache Quantization

KV cache quantization reduces the memory footprint of the attention cache, allowing larger context windows within the same memory budget.

| Cache Type | Memory vs F16 | Quality Impact | Recommendation |
|------------|--------------|----------------|----------------|
| f16 (default) | 1.0x | None | Baseline |
| q8_0 | 0.5x | Negligible (+0.002-0.05 ppl) | **Recommended for production** |
| q4_0 | 0.33x | Noticeable (+0.2-0.25 ppl) | Use when memory-constrained |
| q4_1 | 0.33x | Slightly better than q4_0 | Alternative to q4_0 |
| iq4_nl | 0.33x | Better than q4_0 | Best 4-bit KV option |

Usage:

```bash
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 ...
# llama-bench equivalent: -ctk q8_0 -ctv q8_0
```

**Performance impact**: Quantizing the K cache slightly **improves** throughput (less memory to read). Quantizing the V cache may have a slight negative impact. The overall performance impact is negligible for normal inference.

**Caveat with speculative decoding**: Using KV cache quantization with a draft model causes a consistent ~16% performance drop. q4_0 KV with speculative decoding causes massive acceptance-rate drops. **Avoid KV quant if using speculative decoding.**

### 5.2 mmap vs Full Load on Unified Memory

On Strix Halo's unified memory architecture:

- **`--no-mmap` is strongly recommended** for both ROCm and Vulkan.
- With mmap enabled, ROCm model loading is "catastrophically slower."
- Vulkan loading is "moderately slower" with mmap.
- Since CPU and GPU share physical RAM, there is no data copy when loading to "GPU memory" -- it is just a page table update.

For `llama-bench`: Always use `-mmp 0`. For `llama-server`/`llama-cli`: Always use `--no-mmap`.

### 5.3 Prompt Caching

llama-server supports two levels of prompt caching:

**1. Automatic KV cache reuse (`cache_prompt: true`)**: Reuses the KV cache from previous requests when prompts share a common prefix. The server only reprocesses the suffix that differs.

**2. Host-memory prompt caching (`--cram N`)**: Stores pre-computed prompt representations in system RAM.

- Reduces TTFT from ~4.2s to ~0.3s for cached requests (93% reduction)
- +6% token throughput (34 vs 32 t/s)
- Memory formula: `num_prefixes * avg_prefix_tokens * 8 bytes`

Configuration:

```bash
# --cram 256: reserve 256 MB of host RAM for the prompt cache
llama-server -m model.gguf \
  --cram 256 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  -fa \
  -ngl 99
```

Best for:

- System prompts > 5K tokens
- Multi-user chatbots with shared context
- Agentic use with repeated tool-call prefixes

### 5.4 UMA Detection Bug (Issue #18159)

llama.cpp's UMA detection (from PR #17368, designed for the NVIDIA DGX Spark) incorrectly activates on AMD APUs when `prop.integrated=1`. It reads `/proc/meminfo` instead of calling `hipMemGetInfo()`, severely underreporting available GPU memory (e.g., reporting 27GB instead of 96GB).

**Workarounds**:

- Build without `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
- Guard UMA detection with `!defined(GGML_USE_HIP)` (upstream fix pending)
- Use toolbox containers where this has been patched

### 5.5 KV Cache Placement on ROCm (Issue #18011)

On Strix Halo, the ROCm backend may dump the KV cache into shared (CPU-accessible) memory instead of GPU-mapped memory, causing performance degradation at high context sizes. This is a known issue contributing to ROCm falling behind Vulkan for tg at high contexts.
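One way to observe this on a live system is to watch GTT vs. VRAM consumption while the server processes a long prompt. A diagnostic sketch (the sysfs counters are standard amdgpu entries; the `rocm-smi` argument spelling may vary by ROCm release, so verify with `rocm-smi --help`):

```shell
# Kernel-side memory counters for the first GPU, in bytes
cat /sys/class/drm/card0/device/mem_info_vram_used
cat /sys/class/drm/card0/device/mem_info_gtt_used

# Convenience view, refreshed once per second
watch -n 1 rocm-smi --showmeminfo vram gtt
```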
**Mitigation**: Use the rocWMMA-tuned branch, which maintains better memory placement, or use Vulkan RADV for workloads where this matters.

---

## 6. llama-server Specific Optimizations

### 6.1 Recommended Server Configuration

```bash
# -c 32768: total context shared across all slots
# -np 4:    four parallel slots (adjust for your use)
# -b/-ub:   logical/physical batch sizes
# --cont-batching is enabled by default; --jinja enables the Jinja2 chat template
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --no-mmap \
  -fa \
  -c 32768 \
  -np 4 \
  -b 2048 \
  -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --jinja
```

### 6.2 Parallel Slot Configuration (`-np`)

| Use Case | Slots | Context per Slot | Total `-c` |
|----------|-------|------------------|-----------|
| Single user chat | 1 | 32768 | 32768 |
| Agentic coding (Claude Code style) | 2-4 | 8192-16384 | 32768-65536 |
| Multi-user API | 4-8 | 4096-8192 | 32768-65536 |
| Eval harness | 1 | 32768+ | 32768+ |

Memory formula: Each slot requires `context_size * 2 * hidden_dim * n_layers * bytes_per_kv_element`. With a q8_0 KV cache, this is roughly halved compared to f16.

### 6.3 Continuous Batching

Enabled by default (`--cont-batching`). Allows the server to process multiple requests simultaneously, interleaving prefill and decode operations.

For agentic workloads: One slot typically holds a large system prompt + conversation context, while additional slots handle parallel tool calls. Configure with:

```bash
-np 4 -c 131072  # 4 slots, up to 32K context each
```

### 6.4 Prompt Caching for Agentic Use

For agentic coding tools that send the same system prompt repeatedly:

1. Use `cache_prompt: true` in API requests (reuses the KV cache prefix)
2. Use `--system-prompt-file system.txt` for static system prompts (note: may be removed in recent versions; verify with your build)
3. Use `--cram 128` to enable host-memory caching for prefix deduplication

### 6.5 Speculative Decoding

For token-generation speedup with a draft model:

```bash
llama-server -m main-model.gguf \
  --model-draft draft-model.gguf \
  -ngl 99 \
  --draft-max 8 \
  --draft-min 1 \
  --no-mmap \
  -fa
```

**Caveat**: Do NOT combine speculative decoding with KV cache quantization. The 16% performance drop and reduced acceptance rate negate the benefits.

---

## 7. Upcoming llama.cpp Features (2026)

### 7.1 Backend-Agnostic Tensor Parallelism (PR #19378)

Merged January 2026. Adds `--split-mode tensor` for splitting computation across multiple GPUs via a new "meta" backend.

**Relevance to Strix Halo**: Limited -- a single integrated GPU. However, for RPC configurations with multiple Strix Halo nodes (Jeff Geerling's Beowulf cluster), tensor parallelism could complement the existing layer-split approach. It currently supports 1-2 GPUs with an equal data split; `--tensor-split` has no effect yet.

### 7.2 TurboQuant KV Cache Compression (ICLR 2026)

Google's TurboQuant (Zandieh et al.) achieves 3-bit KV cache quantization with no training and negligible quality loss:

| Format | MSE vs FP16 | Compression |
|--------|-------------|-------------|
| TQ3 (3-bit) | 0.034 | 4.9x |
| TQ4 (4-bit) | 0.009 | 3.8x |

**Timeline**: Open-source llama.cpp integration expected Q2-Q3 2026. A 6-phase integration plan exists covering GGML type registration, KV cache paths, FA integration, and CLI flags.

### 7.3 Vulkan Improvements

Active 2025-2026 developments:

- Mesa RADV optimizations for RDNA4 AI workloads (Rhys Perry/Valve patches)
- 13% pp improvement from CU mode optimization for LDS utilization
- BFloat16 Vulkan support (`VK_KHR_shader_bfloat16`) maturing in Mesa 25.x
- Partial offloading performance improvement for AMD (llama.cpp b8185, March 2026)

### 7.4 Flash Attention for Head Dimension 512

A pull request from March 2026 adds FA support for HD=512 in the CUDA kernels.
This benefits models with larger head dimensions (some newer architectures). The HIP path should inherit this via hipify.

### 7.5 ik_llama.cpp Fork Innovations

The `ik_llama.cpp` fork by ikawrakow introduces:

- Row-interleaved quant packing (better memory access patterns)
- Smart Expert Reduction for faster MoE inference
- Tensor overrides with regex patterns for hybrid GPU/CPU placement
- FlashMLA for DeepSeek models

**Caveat**: ik_llama.cpp only fully supports the CPU and CUDA backends. ROCm/Vulkan are not maintained. Not recommended for AMD gfx1151.

---

## 8. Recommended Configurations

### 8.1 For llama-bench (Benchmarking)

**ROCm backend:**

```bash
ROCBLAS_USE_HIPBLASLT=1 \
toolbox run -c llama-rocm-7.2 -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```

**Vulkan backend:**

```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
toolbox run -c llama-vulkan -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```

### 8.2 For llama-server (Production Agentic Use)

**ROCm (best for long context):**

```bash
ROCBLAS_USE_HIPBLASLT=1 \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 65536 -np 4 \
  -b 2048 -ub 512 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --cram 256 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```

**Vulkan RADV (best for single-user tg):**

```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 32768 -np 2 \
  -b 4096 -ub 256 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```

### 8.3 Decision Matrix

| Question | Answer |
|----------|--------|
| Which backend for benchmarking? | Both. ROCm and Vulkan have different strengths. |
| Which backend for daily chat? | Vulkan RADV for the best tg speed. |
| Which backend for long-context agentic? | ROCm + rocWMMA-tuned for context resilience. |
| Which quantization? | Q4_K_M or UD-Q4_K_XL for speed; Q5_K_M for quality. |
| Enable flash attention? | Yes, always on ROCm. Yes on Vulkan for short contexts. |
| Use `--no-mmap`? | Always. |
| Set `ROCBLAS_USE_HIPBLASLT=1`? | Always for ROCm. |
| Set `RADV_PERFTEST=nogttspill`? | Always for Vulkan RADV. |
| KV cache quantization? | q8_0 for both K and V unless using speculative decoding. |
| Batch size for MoE? | `-b 256` (lower than the default improves some MoE models). |

---

## 9. Sources

### GitHub Issues and Discussions

- [Performance of llama.cpp on AMD ROCm (HIP) - Discussion #15021](https://github.com/ggml-org/llama.cpp/discussions/15021)
- [Performance of llama.cpp with Vulkan - Discussion #10879](https://github.com/ggml-org/llama.cpp/discussions/10879)
- [HIP backend performs poorly on gfx1151 - Issue #13565](https://github.com/ggml-org/llama.cpp/issues/13565)
- [UMA detection incorrectly limits memory on AMD APUs - Issue #18159](https://github.com/ggml-org/llama.cpp/issues/18159)
- [ROCm model loading dumps KV cache to shared memory - Issue #18011](https://github.com/ggml-org/llama.cpp/issues/18011)
- [GATED_DELTA_NET underperformance on gfx1151 - Issue #20354](https://github.com/ggml-org/llama.cpp/issues/20354)
- [Under-Performance of ROCm 7.2 binaries - Issue #19984](https://github.com/ggml-org/llama.cpp/issues/19984)
- [ROCm 7.2 + rocWMMA compilation - Issue #19269](https://github.com/ggml-org/llama.cpp/issues/19269)
- [Building for gfx1151 - Issue #14734](https://github.com/ggml-org/llama.cpp/issues/14734)
- [AMDVLK 2GB buffer allocation limit - Issue #15054](https://github.com/ggml-org/llama.cpp/issues/15054)
- [Mastering Host-Memory Prompt Caching - Discussion #20574](https://github.com/ggml-org/llama.cpp/discussions/20574)
- [TurboQuant Extreme KV Cache Quantization - Discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969)
- [Backend-agnostic tensor parallelism - PR #19378](https://github.com/ggml-org/llama.cpp/pull/19378)
- [Massively Improved rocWMMA Performance - PR #16827](https://github.com/ggml-org/llama.cpp/pull/16827)
- [rocWMMA for gfx1151 performance boost - lemonade-sdk Issue #7](https://github.com/lemonade-sdk/llamacpp-rocm/issues/7)
- [Increase llama.cpp performance on AI Max 395+ - geerlingguy Issue #5](https://github.com/geerlingguy/beowulf-ai-cluster/issues/5)

### Wiki and Community Resources

- [Strix Halo Wiki - llama.cpp Performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Strix Halo Wiki - llama.cpp with ROCm](https://strixhalo.wiki/AI/llamacpp-with-ROCm)
- [AMD Strix Halo Backend Benchmarks (Grid View)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [LLM Tracker - AMD Strix Halo GPU Performance](https://llm-tracker.info/AMD-Strix-Halo-(Ryzen-AI-Max+-395)-GPU-Performance)
- [Framework Community - Strix Halo GPU LLM Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521)
- [Framework Community - Toolboxes for LLM inference on Strix Halo](https://community.frame.work/t/llama-cpp-vllm-toolboxes-for-llm-inference-on-strix-halo/74916)

### Articles and Blog Posts

- [Hardware Corner - Strix Halo LLM Optimization](https://www.hardware-corner.net/strix-halo-llm-optimization/)
- [Hardware Corner - RADV Vulkan Driver 13% Improvement](https://www.hardware-corner.net/llama-cpp-amd-radv-vulkan-driver-update/)
- [Phoronix - AMD ROCm 7.1 vs RADV Vulkan](https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan)
- [Phoronix - Valve Developer RADV Improvement](https://www.phoronix.com/news/RADV-Valve-Boost-Llama.cpp)
- [Yifei's Blog - Strix Halo Matrix Cores with llama.cpp](https://blog.yifei.sg/jekyll/update/2025/08/27/building-llamacpp-strix-halo.html)
- [Strix Halo CUDA/HIP Testing Notes (lhl)](https://github.com/lhl/strix-halo-testing/blob/main/llama-cpp-cuda-hip.md)

### Official Documentation

- [ROCm - llama.cpp compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/llama-cpp-compatibility.html)
- [ROCm - llama.cpp installation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html)
- [ROCm Blog - Llama.cpp Meets Instinct](https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html)
- [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
- [llama-server README](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)

### Papers

- "Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct" (January 2026, arXiv:2601.14277)
- "TurboQuant: Redefining AI efficiency with extreme compression" (Zandieh et al., ICLR 2026)
- [Unsloth Dynamic 2.0 GGUFs Documentation](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)

---

## Open Questions / Limitations

1. **rocWMMA-tuned patch upstream status**: PR #16827 may not be fully merged. Monitor for inclusion in mainline llama.cpp or continue using patched toolboxes.
2. **ROCm 7.2 stability on gfx1151**: Multiple reports of crashes (MUT_MAL errors), performance regressions, and compilation issues. ROCm 7.x is maturing but not fully stable for gfx1151 as of March 2026.
3. **Vulkan CoopMat FA for AMD**: Will AMD ever get CoopMat2 support? The current CoopMat1 path provides a modest improvement. A native AMD CoopMat2 or equivalent extension would close the gap with ROCm FA.
4. **KV cache placement on ROCm**: Issue #18011 (KV cache dumped to shared memory) reduces ROCm tg performance at high contexts. The root cause appears to be in HIP memory allocation behavior on APUs.
5. **GGML_HIP_UMA vs kernel-param GTT expansion**: The UMA flag uses slow fine-grained memory. GTT expansion via `amdgpu.gttsize` kernel params provides coarse-grained GPU-mapped memory that is much faster. The upstream approach may eventually improve, but kernel params remain the correct method for now.
6. **GatedDeltaNet architecture support**: Both Vulkan (missing shader) and ROCm (register pressure, memcpy bottleneck) perform poorly on GDN models. This blocks efficient use of Qwen3.5-27B and similar models.
7. **TurboQuant integration timeline**: Expected Q2-Q3 2026 for llama.cpp. Would provide a 3-bit KV cache with no quality loss, roughly doubling available context within the same memory budget.
8. **NPU utilization**: The 50 TOPS NPU on Strix Halo is currently unusable for llama.cpp inference on Linux. AMD driver support for the NPU on Linux remains pending.

---

## Overlap Notes

- **Kernel parameters** (`amdgpu.gttsize`, `ttm.pages_limit`, `iommu=pt`): Already documented in the project's `scripts/optimize/kernel-params.sh`. This research covers the llama.cpp side (why they matter for inference).
- **BIOS VRAM allocation**: Reducing dedicated VRAM in the BIOS frees more memory for GTT. This is documented in the project's audit scripts but is a prerequisite for the optimizations described here.
- **Toolbox container builds**: The project uses pre-built toolboxes (`llama-rocm-7.2`, `llama-vulkan`). The compilation flags documented here describe what should be baked into those containers.