# llama.cpp Runtime and Compilation Optimization for AMD RDNA 3.5 (gfx1151)

Comprehensive research into maximizing inference performance on AMD Strix Halo
(Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified LPDDR5x-8000).
Researched March 2026.

---

## Scope

This document covers every known compilation flag, runtime parameter, environment
variable, and architectural optimization for llama.cpp targeting gfx1151 (RDNA 3.5)
with both ROCm/HIP and Vulkan backends on Fedora. It does not cover vLLM, ollama
internals, or non-llama.cpp inference engines except where their findings inform
llama.cpp optimization.

---

## Table of Contents

1. [Compilation Flags and Build Optimizations](#1-compilation-flags-and-build-optimizations)
2. [Runtime Flags and Environment Variables](#2-runtime-flags-and-environment-variables)
3. [Flash Attention and Attention Backends](#3-flash-attention-and-attention-backends)
4. [Quantization Strategies for Speed](#4-quantization-strategies-for-speed)
5. [Memory Layout and Caching](#5-memory-layout-and-caching)
6. [llama-server Specific Optimizations](#6-llama-server-specific-optimizations)
7. [Upcoming llama.cpp Features (2026)](#7-upcoming-llamacpp-features-2026)
8. [Recommended Configurations](#8-recommended-configurations)
9. [Sources](#9-sources)

---

## 1. Compilation Flags and Build Optimizations

### 1.1 GGML_HIP (ROCm) vs GGML_VULKAN: Which Backend to Build

Both backends are worth building. Neither is universally faster on gfx1151:

| Workload | Winner | Rationale |
|----------|--------|-----------|
| Token generation (short ctx) | Vulkan RADV | Lower driver overhead, mature kernel paths |
| Token generation (long ctx, 8K+) | ROCm + rocWMMA + FA | Maintains speed as context grows; uses less memory |
| Prompt processing (short ctx) | Mixed -- model-dependent | AMDVLK or ROCm hipBLASLt win on some shapes |
| Prompt processing (long ctx) | ROCm + rocWMMA-tuned | 96% speedup over untuned rocWMMA at 65K ctx |
| Memory efficiency at long ctx | ROCm + FA | Less memory than Vulkan equivalent |

Benchmark data (Qwen3-30B-A3B UD-Q4_K_XL, gfx1151, flash attention on):

| Backend | pp512 t/s | tg128 t/s | pp512@130K t/s | tg128@130K t/s |
|---------|-----------|-----------|----------------|----------------|
| Vulkan RADV | 755.14 | 85.11 | 17.24 | 12.54 |
| Vulkan AMDVLK | 741.60 | 81.79 | 10.75 | 3.51 |
| ROCm hipBLASLt | 651.93 | 63.95 | 40.35 | 4.97 |
| ROCm rocWMMA-tuned | 659.07 | 67.66 | 51.12 | 13.33 |

Key insight: RADV scales significantly better than AMDVLK for long contexts (3.6x
faster tg at 130K depth). ROCm with tuned rocWMMA provides the best long-context
prompt processing (3x faster than RADV at 130K).

### 1.2 ROCm Build Flags

**Minimal build:**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

**Optimized build (recommended):**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

#### Critical ROCm build flags:

| Flag | Effect | Recommendation |
|------|--------|----------------|
| `-DGGML_HIP=ON` | Enable HIP/ROCm backend | Required |
| `-DAMDGPU_TARGETS="gfx1151"` | Target Strix Halo GPU | Required -- do not use gfx1100 |
| `-DGGML_HIP_ROCWMMA_FATTN=ON` | Enable rocWMMA flash attention | Strongly recommended for pp |
| `-DGGML_HIP_GRAPHS=ON` | HIP graph kernel scheduling | Test -- may help reduce launch overhead |
| `-DGGML_HIP_NO_VMM=OFF` | Re-enable Virtual Memory Management | Default is disabled; test if needed |

#### Flags to be aware of but NOT set by default:

| Flag | Notes |
|------|-------|
| `-DGGML_HIP_UMA=ON` | Uses hipMallocManaged for UMA. **Avoid on Strix Halo** -- it uses fine-grained memory that is significantly slower. Standard hipMalloc + GTT expansion via kernel params is faster. |
| `-DGGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` | Forces FP16 compute in hipBLAS. Documented for RDNA4 -- may help pp performance on gfx1151. **Test before deploying.** |
| `-DGGML_CUDA_FA_ALL_QUANTS=ON` | Compiles all KV cache quant type combinations for FA. **Works for CUDA kernels which HIP reuses via hipify.** Increases compilation time substantially. Enable if you need quantized KV cache with flash attention. |

#### ROCm version considerations:

- **ROCm 7.2**: Known rocWMMA compilation issue (ambiguous template specializations in `mfma_impl.hpp`). Fixed in later point releases or by disabling rocWMMA.
- **ROCm 7.0 RC / 7.1**: Generally work well with gfx1151.
- **ROCm 6.4.4**: Some users report better performance than 7.x for certain workloads. The gfx1151 rocBLAS kernel regression means hipBLASLt is essential.
- **Recommendation**: Use ROCm 7.2+ with rocWMMA patches, or pre-built toolbox containers that have been validated for gfx1151.

### 1.3 Vulkan Build Flags

```bash
cmake -B build -S . \
  -DGGML_VULKAN=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

The Vulkan build auto-detects cooperative matrix support (KHR_coopmat) at runtime.
gfx1151 with RADV reports `matrix cores: KHR_coopmat` in llama-bench logs.

#### Vulkan-specific considerations:

| Topic | Detail |
|-------|--------|
| **RADV vs AMDVLK** | RADV (Mesa) is recommended for gfx1151. Better long-context scaling, no 2GB buffer allocation limit. |
| **AMDVLK buffer limit** | AMDVLK caps single Vulkan allocations at ~2 GiB (`VkPhysicalDeviceLimits::maxMemoryAllocationSize`). RADV allows ~4 GiB. This causes OOM for models with large compute buffers. |
| **CoopMat1 vs CoopMat2** | gfx1151 supports KHR_coopmat (CoopMat1). CoopMat2 (`VK_NV_cooperative_matrix2`) is NVIDIA-only. The more efficient CoopMat2 flash-attention path is therefore unavailable on AMD; Vulkan FA uses the CoopMat1 or scalar path (see section 3.4), and ROCm + rocWMMA remains the stronger FA option. |
| **Shader compilation** | Building from source with `glslc` available enables cooperative matrix shader variants. Pre-built binaries may omit them. |

### 1.4 LTO and PGO

llama.cpp does not have built-in LTO/PGO support in its CMake configuration.
You can enable LTO manually:

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DCMAKE_BUILD_TYPE=Release
```

Expected benefit: 2-5% improvement in CPU-bound paths. The GPU kernels are
compiled by the HIP/ROCm compiler and are not affected by host LTO.

PGO would require a two-pass build (instrument, profile, rebuild) and is not
commonly done for llama.cpp. The dominant bottleneck is GPU kernel performance
and memory bandwidth, not host-side code paths.

### 1.5 Compiler Tuning for ROCm

A known LLVM regression affects loop unrolling on RDNA. The following flag has
been reported to help:

```bash
-DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
```

This increases the unrolling threshold for local memory operations, which can
improve kernel performance for flash attention and matrix multiplication.

---

## 2. Runtime Flags and Environment Variables

### 2.1 ROCm Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `ROCBLAS_USE_HIPBLASLT=1` | **Critical** | Switches from rocBLAS tensile kernels to hipBLASLt. On gfx1151, default rocBLAS achieves only 5.76 TFLOPS (<9% efficiency). hipBLASLt achieves >60% efficiency. **This is a 2-7x improvement for prompt processing.** |
| `HSA_OVERRIDE_GFX_VERSION=11.5.1` | Set inside toolbox containers | Required for ROCm to recognize gfx1151. Set in container, not by host scripts. |
| `HSA_ENABLE_SDMA=0` | Optional | Disables SDMA engine. May help on some configurations, but generally not needed on Strix Halo with recent kernels. |
| `HIP_VISIBLE_DEVICES=0` | Optional | Select specific GPU device. Useful in multi-GPU or container setups. |
| `GPU_MAX_HEAP_SIZE=100` | Optional | Allow 100% of GPU memory for heap. Default may be lower. |
| `GPU_MAX_ALLOC_PERCENT=100` | Optional | Allow single allocation up to 100% of GPU memory. |
| `ROCR_VISIBLE_DEVICES=0` | Optional | HSA-level device visibility control. |
| `AMD_LOG_LEVEL=0` | Optional | Suppress AMD driver logging noise. |

**The single most impactful environment variable is `ROCBLAS_USE_HIPBLASLT=1`.**
Without it, ROCm pp512 on Llama-2-7B drops from 882 t/s to 348 t/s (~2.5x slower).

### 2.2 Vulkan Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `AMD_VULKAN_ICD=RADV` | Recommended | Force RADV driver (skip AMDVLK). |
| `RADV_PERFTEST=nogttspill` | **Important** | Fixes GTT memory spilling issues on RADV. Can resolve significant pp performance drops (especially with FA off). |
| `GGML_VK_VISIBLE_DEVICES=0` | Optional | Select Vulkan device index. |
| `GGML_VULKAN_DISABLE_F16=1` | Debugging | Force FP32 compute. Slower but useful for debugging precision issues. |
| `GGML_LOG_LEVEL=2` | Debugging | Verbose logging to verify coopmat detection. |

### 2.3 Thread Count (`-t` flag)

For GPU-dominant inference (all layers offloaded), the thread count has minimal
impact on throughput. The recommendation:

- **Single-user inference**: `-t 4` to `-t 8` (enough for tokenization/sampling overhead)
- **Server with parallel slots**: `-t 12` (one thread per Zen 5 big core on the Ryzen AI MAX+ 395)
- **Hybrid CPU+GPU (partial offload)**: `-t` equal to the number of big cores

The Ryzen AI MAX+ 395 has 16 cores (12 Zen 5 + 4 Zen 5c). For llama.cpp, using
all 12 big cores (`-t 12`) is optimal for CPU-involved workloads.

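For other machines, a starting `-t` value can be derived from the CPU topology. A rough sketch (it assumes 2-way SMT and ignores big/little core splits, so on Strix Halo the explicit `-t 12` above is still the better choice):

```bash
# nproc reports logical CPUs; with 2-way SMT, halving approximates physical cores.
t=$(( $(nproc) / 2 ))
if [ "$t" -lt 1 ]; then t=1; fi
echo "-t $t"
```
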
### 2.4 Batch Size Tuning (`-b` and `-ub`)

| Flag | Default | Role |
|------|---------|------|
| `-b` / `--batch-size` | 2048 | Logical batch size (application level) |
| `-ub` / `--ubatch-size` | 512 | Physical batch size (device level) |

Tuning guidance for gfx1151:

- **MoE models**: `-b 256` significantly improves pp512 (reported 70% improvement on Qwen3-30B-A3B)
- **Dense models**: Default `-b 2048` is generally fine
- **Long context**: `-ub 2048` can improve performance, but test against OOM
- **Ultra-long context**: Reduce `-ub` if memory allocation fails

The Vulkan backend blog post for Strix Halo recommends: `-c 32768 -b 4096 -ub 256`
for a good balance of performance and memory.
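
When tuning a specific model, sweeping `-b` with `llama-bench` is the quickest way to find the knee. A small sketch that prints the sweep commands to run inside your toolbox (the model path is a placeholder):

```bash
# Print a llama-bench batch-size sweep (pp512 only; -n 0 skips token generation).
model=/path/to/model.gguf
for b in 256 512 1024 2048; do
  echo llama-bench -m "$model" -ngl 99 -mmp 0 -fa 1 -b "$b" -p 512 -n 0
done
```
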
### 2.5 Memory-Mapped Loading (`-mmp` / `--no-mmap`)

**Critical finding for unified memory APUs:**

> When you load large models to the GPU, memory mapping can make loading moderately
> slower for Vulkan, and **catastrophically slower for ROCm**. You should always set
> `--mmap 0` or `--no-mmap` to improve model loading times on Strix Halo.

For `llama-bench`, use `-mmp 0`. For `llama-server`/`llama-cli`, use `--no-mmap`.

On Strix Halo, both "GPU memory" and "CPU memory" share the same physical LPDDR5x.
The difference is which pages are mapped for GPU access. GPU-mapped pages have full
bandwidth (~215 GB/s). CPU-accessed pages get well under half (~84 GB/s for
CPU-to-GPU copies).

**Always use `-ngl 99` (or higher) to ensure all layers are on GPU memory.**
Even on a unified memory system, GPU memory paths provide more than 2x the bandwidth.

### 2.6 GPU Layer Offloading (`-ngl`)

For Strix Halo with 64GB unified memory:

- **Models < 50GB**: `-ngl 99` offloads everything. No tuning needed.
- **Models 50-60GB**: `-ngl 99` should still work with GTT expanded via kernel params.
- **Models > 60GB**: May need partial offload. Use `-ngl <N>` where N is tuned to
  keep GPU memory under the GTT limit. Remaining layers run on CPU at ~1/2 bandwidth.

**Never let GPU spill to system RAM paths** -- performance will be worse than pure CPU.

---

## 3. Flash Attention and Attention Backends

### 3.1 When to Enable Flash Attention

**Rule of thumb for gfx1151:**

| Backend | Flash Attention | Recommendation |
|---------|----------------|----------------|
| ROCm + rocWMMA | `-fa 1` | **Always enable.** 24% pp improvement, maintains tg speed, uses less memory. |
| ROCm without rocWMMA | `-fa 1` | Enable, but smaller improvement. |
| Vulkan RADV | `-fa 1` | **Enable for short contexts.** Minor improvement at pp512/tg128. At long contexts, Vulkan FA may degrade performance. |
| Vulkan AMDVLK | `-fa 1` | Similar to RADV. |

**Key caveat**: Vulkan flash attention on AMD uses CoopMat1 (KHR_coopmat), not
the more efficient CoopMat2 (NVIDIA-only). For AMD, ROCm + rocWMMA is the superior
FA path.

### 3.2 rocWMMA Flash Attention Performance

Benchmark on gfx1151 (Llama2-7B Q4_K_M):

| Configuration | pp512 t/s | tg128 t/s |
|---------------|-----------|-----------|
| HIP standard | 592.28 | 40.40 |
| HIP + hipBLASLt | 548.72 | 40.43 |
| HIP + rocWMMA + hipBLASLt | 1006.80 | 39.46 |
| HIP + rocWMMA (no hipBLASLt) | 899.73 | 39.45 |

rocWMMA provides ~70% improvement in prompt processing with flash attention.
Token generation is slightly slower (~2%) due to WMMA overhead at small batch.

### 3.3 The rocWMMA Long-Context Regression and Fix

The standard rocWMMA implementation has a **long-context decode regression**:
at 65K context, tg degrades by up to 57% compared to HIP-only baseline.

**The fix** (PR #16827, "rocm-wmma-tune" branch) implements:

1. **`__launch_bounds__(256, 2)`**: Ensures minimum 2 blocks per SM, improving occupancy
2. **Adaptive KQ stride**: Uses stride 128 when head dimension <= 128, reducing LDS footprint
3. **Selective WMMA usage**: WMMA only for prefill; decode reverts to VEC/TILE kernels

Results after fix (Llama 3.2 1B Q4_K_M on gfx1151):

- pp512 at 65K context: **96% faster** than untuned rocWMMA
- tg128 at 65K context: Matches HIP baseline (previously 57% degraded)

**Status**: This patch is available in `-rocwmma-improved` toolbox builds. It may
not be merged into upstream llama.cpp. Check Donato Capitella's toolboxes.

### 3.4 Vulkan Flash Attention Limitations on AMD

The Vulkan backend supports three FA paths:

| Path | Extension | AMD Support |
|------|-----------|-------------|
| FA_SCALAR | None | Yes (CPU fallback) |
| FA_COOPMAT1 | KHR_cooperative_matrix | Yes (gfx1151 reports support) |
| FA_COOPMAT2 | NV_cooperative_matrix2 | **No** (NVIDIA-only) |

FA_COOPMAT1 supports: f16, q4_0, q8_0, f32 KV cache types.
FA_COOPMAT2 additionally supports all quant types.

When Vulkan FA is enabled on AMD with RADV, it uses CoopMat1 for matrix operations.
This provides a modest improvement over scalar FA but is significantly less
efficient than ROCm + rocWMMA.

### 3.5 New Attention Models (GatedDeltaNet)

Models using GatedDeltaNet architecture (Qwen3.5-27B, Qwen3.5-35B-A3B) have
severe performance problems on gfx1151:

- **Vulkan**: No GATED_DELTA_NET compute shader exists; ops fall back to CPU
- **ROCm/HIP**: Kernel cross-compiles but suffers from register spilling (`float s[S_v]`
  allocates up to 512 bytes per thread) and hipMemcpyWithStream bottleneck (92-95%
  of decode time on models >15GB)

Result: Qwen3.5-27B runs at ~12 t/s on gfx1151 vs expected 50-80 t/s.
**Avoid GatedDeltaNet models on gfx1151 until kernel optimization lands.**

---

## 4. Quantization Strategies for Speed

### 4.1 Quantization Speed on RDNA 3.5

Token generation speed is dominated by memory bandwidth, not compute. Smaller
quantizations are faster because they reduce bytes-per-weight, allowing more
tokens per second within the ~215 GB/s bandwidth envelope.

Approximate throughput formula for decode (bandwidth-bound):

```
tg_tokens/s ≈ effective_bandwidth_GB/s / model_size_bytes * 1e9
```

For a 7B Q4_K_M model (~4.1 GB):

```
215 / 4.1 ≈ 52 t/s (theoretical max; practical ~50 t/s on gfx1151)
```
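
The same arithmetic can be scripted for other model sizes; a sketch using the figures above:

```bash
# Bandwidth-bound decode ceiling: tokens/s ≈ bandwidth / bytes read per token.
awk 'BEGIN {
  bw_gbs  = 215   # effective GPU memory bandwidth, GB/s
  size_gb = 4.1   # weight bytes read per token (7B Q4_K_M), GB
  printf "~%.0f t/s theoretical max\n", bw_gbs / size_gb
}'
```

This prints `~52 t/s theoretical max`, matching the figure above.
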
### 4.2 Quantization Type Comparison

| Quant | Bits/Weight | Quality | Speed (relative) | Notes |
|-------|------------|---------|-------------------|-------|
| Q4_0 | 4.0 | Low | Fastest | Legacy. Simple dequant. |
| Q4_K_M | 4.83 | Good | Very fast | K-quant with hierarchical blocks. Recommended default. |
| IQ4_XS | 4.25 | Good | Fast | Importance-weighted. Better quality/bit than Q4_K_M. |
| Q5_K_M | 5.69 | Very good | Fast | Sweet spot for quality-sensitive use. |
| Q6_K | 6.56 | Excellent | Moderate | Near-lossless quality. |
| Q8_0 | 8.0 | Near-perfect | Slower | ~1.7x the bytes of Q4_K_M, proportionally slower tg. |
| F16 | 16.0 | Perfect | Slowest | Reference baseline. |

**For RDNA 3.5 specifically**:

- **Q4_K_M** is the best general-purpose quantization. The K-quant family uses
  hierarchical super-blocks (256 values) with per-sub-block scales, providing
  better quality than Q4_0 at marginally higher dequant cost that is invisible
  at the GPU level.

- **Q4_0** has the simplest dequant kernels and is marginally faster than Q4_K_M
  on some GPU backends. However, the quality loss is significant. Use only for
  smoke tests or when every t/s matters more than quality.

- **IQ4_XS** (importance-matrix quantized) offers better quality per bit than
  Q4_K_M. Speed is similar. Requires an importance matrix file during quantization.
  **Recommended over Q4_K_M when you control the quantization process.**

- **Q8_0** does NOT have special hardware-accelerated dequant on RDNA 3.5.
  RDNA 3.5 lacks INT8 tensor core equivalents. Q8_0 performance relies on the
  same FP16 compute paths, just with more memory bandwidth consumed.
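
Under the bandwidth-bound model from section 4.1, the table's relative speeds follow directly from bits per weight. A sketch for a 7B model (the 6.74e9 parameter count is an assumption; the outputs are estimates, not benchmarks):

```bash
# Estimated file size and decode ceiling per quant type at ~215 GB/s.
awk 'BEGIN {
  n = split("Q4_0:4.0 Q4_K_M:4.83 Q5_K_M:5.69 Q6_K:6.56 Q8_0:8.0", quants, " ")
  params = 6.74e9; bw = 215e9
  for (i = 1; i <= n; i++) {
    split(quants[i], kv, ":")
    bytes = params * kv[2] / 8          # bits per weight -> total bytes
    printf "%-8s %4.1f GB  ~%2.0f t/s\n", kv[1], bytes / 1e9, bw / bytes
  }
}'
```

The Q4_K_M row lands near the ~52 t/s estimate above, and Q8_0 comes out around 1.7x slower, consistent with the table.
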
### 4.3 Importance Matrix (imatrix) Quantization

imatrix quantization records how much each weight affects output quality, then
allocates more precision bits to important weights. This is essential for
sub-4-bit quantizations (IQ2_XS, IQ3_XXS, IQ4_XS) where standard K-quant
shows measurable degradation.

```bash
# Generate importance matrix (GPU-accelerated)
llama-imatrix -m model-f16.gguf -f calibration_data.txt -ngl 99 -o imatrix.dat

# Quantize with imatrix
llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4_xs.gguf IQ4_XS
```

**Speed impact**: None. imatrix affects quantization quality, not inference speed.
The dequantization kernels are identical regardless of whether imatrix was used.

### 4.4 Unsloth Dynamic (UD) Quantizations

Unsloth Dynamic 2.0 selectively quantizes different layers at different bit widths,
choosing the optimal quantization per layer based on sensitivity analysis.

**Speed impact**: Minimal to none. UD quants use the same dequant kernels as
standard GGUF quantizations. A UD-Q4_K_XL file runs at the same speed as a
standard Q4_K_M of the same total size.

**Quality impact**: Significantly better. UD consistently outperforms standard
quantizations in 5-shot MMLU and KL divergence metrics at the same total file size.

**Recommendation**: Prefer UD quants (e.g., `UD-Q4_K_XL`, `UD-Q4_K_M`) from
Unsloth when available. They are a free quality upgrade with no speed penalty.

---

## 5. Memory Layout and Caching

### 5.1 KV Cache Quantization

KV cache quantization reduces the memory footprint of the attention cache,
allowing larger context windows within the same memory budget.

| Cache Type | Memory vs F16 | Quality Impact | Recommendation |
|------------|--------------|----------------|----------------|
| f16 (default) | 1.0x | None | Baseline |
| q8_0 | 0.5x | Negligible (+0.002-0.05 ppl) | **Recommended for production** |
| q4_0 | 0.33x | Noticeable (+0.2-0.25 ppl) | Use when memory-constrained |
| q4_1 | 0.33x | Slightly better than q4_0 | Alternative to q4_0 |
| iq4_nl | 0.33x | Better than q4_0 | Best 4-bit KV option |

Usage:

```bash
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 ...
# or for llama-bench (short forms of --cache-type-k / --cache-type-v):
llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 ...
```

**Performance impact**: Quantizing K cache slightly **improves** throughput
(less memory to read). Quantizing V cache may have a slight negative impact.
Overall performance impact is negligible for normal inference.

**Caveat with speculative decoding**: Using KV cache quantization with a draft
model causes a consistent ~16% performance drop. q4_0 KV with speculative
decoding causes massive acceptance rate drops. **Avoid KV quant if using
speculative decoding.**

### 5.2 mmap vs Full Load on Unified Memory

On Strix Halo's unified memory architecture:

- **`--no-mmap` is strongly recommended** for both ROCm and Vulkan.
- With mmap enabled, ROCm model loading is "catastrophically slower."
- Vulkan loading is "moderately slower" with mmap.
- Since CPU and GPU share physical RAM, there is no data copy when loading
  to "GPU memory" -- it is just a page table update.

For `llama-bench`: Always use `-mmp 0`.
For `llama-server`/`llama-cli`: Always use `--no-mmap`.

### 5.3 Prompt Caching

llama-server supports two levels of prompt caching:

**1. Automatic KV cache reuse (`cache_prompt: true`)**:
Reuses KV cache from previous requests when prompts share a common prefix.
The server only reprocesses the suffix that differs.

**2. Host-memory prompt caching (`--cram N`)**:
Stores pre-computed prompt representations in system RAM.

- Reduces TTFT from ~4.2s to ~0.3s for cached requests (93% reduction)
- +6% token throughput (34 vs 32 t/s)
- Memory formula: `num_prefixes * avg_prefix_tokens * 8 bytes`

Configuration:

```bash
# --cram 256: use 256 MB of host RAM for the prompt cache
# --cache-type-k/-v q8_0: KV cache quantization
llama-server -m model.gguf \
  --cram 256 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  -fa \
  -ngl 99
```

Best for:

- System prompts > 5K tokens
- Multi-user chatbots with shared context
- Agentic use with repeated tool-call prefixes
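
On the client side, prefix reuse is opt-in per request. A sketch of a `/completion` request body (the endpoint and port assume a default llama-server; the prompt text is a placeholder):

```bash
# cache_prompt asks llama-server to reuse the longest matching KV-cache prefix
# from an earlier request instead of re-processing it.
payload='{"prompt": "SYSTEM PROMPT...\n\nuser question", "n_predict": 256, "cache_prompt": true}'
echo "$payload"
# Send it with: curl -s http://localhost:8080/completion -d "$payload"
```
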
### 5.4 UMA Detection Bug (Issue #18159)

llama.cpp's UMA detection (from PR #17368, designed for NVIDIA DGX Spark)
incorrectly activates on AMD APUs when `prop.integrated=1`. It reads
`/proc/meminfo` instead of `hipMemGetInfo()`, severely underreporting available
GPU memory (e.g., reporting 27GB instead of 96GB).

**Workarounds**:

- Build without `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
- Guard UMA detection with `!defined(GGML_USE_HIP)` (upstream fix pending)
- Use toolbox containers where this has been patched

### 5.5 KV Cache Placement on ROCm (Issue #18011)

On Strix Halo, the ROCm backend may dump KV cache into shared (CPU-accessible)
memory instead of GPU-mapped memory, causing performance degradation at high
context sizes. This is a known issue contributing to ROCm falling behind Vulkan
for tg at high contexts.

**Mitigation**: Use the rocWMMA-tuned branch which maintains better memory
placement, or use Vulkan RADV for workloads where this matters.

---

## 6. llama-server Specific Optimizations

### 6.1 Recommended Server Configuration

```bash
# -c 32768: total context across all slots
# -np 4: 4 parallel slots (adjust for your use)
# -b 2048 / -ub 512: logical and physical batch sizes
# --cont-batching is on by default; --jinja enables the Jinja2 chat template
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --no-mmap \
  -fa \
  -c 32768 \
  -np 4 \
  -b 2048 \
  -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --jinja
```

### 6.2 Parallel Slot Configuration (`-np`)

| Use Case | Slots | Context per Slot | Total `-c` |
|----------|-------|------------------|------------|
| Single user chat | 1 | 32768 | 32768 |
| Agentic coding (Claude Code style) | 2-4 | 8192-16384 | 32768-65536 |
| Multi-user API | 4-8 | 4096-8192 | 32768-65536 |
| Eval harness | 1 | 32768+ | 32768+ |

Memory formula: Each slot requires `context_size * 2 * hidden_dim * n_layers * bytes_per_kv_element`.
With q8_0 KV cache, this is roughly halved compared to f16.
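
As a worked example, the formula can be evaluated directly. This sketch assumes a Llama-3-8B-like shape (32 layers, 8 KV heads of head dimension 128; with GQA the effective `hidden_dim` for cache purposes is `n_kv_heads * head_dim`) and one 16K-token slot with f16 (2-byte) entries:

```bash
# Per-slot KV bytes = ctx * 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes.
awk 'BEGIN {
  ctx = 16384; layers = 32; kv_heads = 8; head_dim = 128; bytes = 2   # f16
  gib = ctx * 2 * layers * kv_heads * head_dim * bytes / 2^30
  printf "%.1f GiB per slot (half that with q8_0)\n", gib
}'
```

Four such slots at f16 would take ~8 GiB; q8_0 halves it, which is why the server configurations in this document quantize the KV cache.
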
### 6.3 Continuous Batching

Enabled by default (`--cont-batching`). Allows the server to process multiple
requests simultaneously, interleaving prefill and decode operations.

For agentic workloads: One slot typically holds a large system prompt + conversation
context, while additional slots handle parallel tool calls. Configure with:

```bash
-np 4 -c 131072  # 4 slots, up to 32K context each
```

### 6.4 Prompt Caching for Agentic Use

For agentic coding tools that send the same system prompt repeatedly:

1. Use `cache_prompt: true` in API requests (reuses KV cache prefix)
2. Use `--system-prompt-file system.txt` for static system prompts (note: may be
   removed in recent versions; verify with your build)
3. Use `--cram 128` to enable host-memory caching for prefix deduplication

### 6.5 Speculative Decoding

For token generation speedup with a draft model:

```bash
llama-server -m main-model.gguf \
  --model-draft draft-model.gguf \
  -ngl 99 \
  --draft-max 8 \
  --draft-min 1 \
  --no-mmap \
  -fa
```

**Caveat**: Do NOT combine speculative decoding with KV cache quantization.
The 16% performance drop and reduced acceptance rate negate the benefits.

---

## 7. Upcoming llama.cpp Features (2026)

### 7.1 Backend-Agnostic Tensor Parallelism (PR #19378)

Merged January 2026. Adds `--split-mode tensor` for splitting computation across
multiple GPUs via a new "meta" backend.

**Relevance to Strix Halo**: Limited. Single integrated GPU. However, for RPC
configurations with multiple Strix Halo nodes (Jeff Geerling's Beowulf cluster),
tensor parallelism could complement the existing layer-split approach.

Currently supports 1-2 GPUs with equal data split. `--tensor-split` has no effect yet.

### 7.2 TurboQuant KV Cache Compression (ICLR 2026)

Google's TurboQuant (Zandieh et al.) achieves 3-bit KV cache quantization with
no training and negligible quality loss:

| Format | MSE vs FP16 | Compression |
|--------|-------------|-------------|
| TQ3 (3-bit) | 0.034 | 4.9x |
| TQ4 (4-bit) | 0.009 | 3.8x |

**Timeline**: Open-source llama.cpp integration expected Q2-Q3 2026. A 6-phase
integration plan exists covering GGML type registration, KV cache paths, FA
integration, and CLI flags.

### 7.3 Vulkan Improvements

Active 2025-2026 developments:

- Mesa RADV optimizations for RDNA4 AI workloads (Rhys Perry/Valve patches)
- 13% pp improvement from CU mode optimization for LDS utilization
- BFloat16 Vulkan support (`VK_KHR_shader_bfloat16`) maturing in Mesa 25.x
- Partial offloading performance improvement for AMD (llama.cpp b8185, March 2026)

### 7.4 Flash Attention for Head Dimension 512

A March 2026 pull request adds FA support for HD=512 in CUDA kernels.
This benefits models with larger head dimensions (some newer architectures).
The HIP path should inherit this via hipify.

### 7.5 ik_llama.cpp Fork Innovations

The `ik_llama.cpp` fork by ikawrakow introduces:

- Row-interleaved quant packing (better memory access patterns)
- Smart Expert Reduction for faster MoE inference
- Tensor overrides with regex patterns for hybrid GPU/CPU placement
- FlashMLA for DeepSeek models

**Caveat**: ik_llama.cpp only fully supports CPU and CUDA backends. ROCm/Vulkan
are not maintained. Not recommended for AMD gfx1151.

---

## 8. Recommended Configurations

### 8.1 For llama-bench (Benchmarking)

**ROCm backend:**

```bash
ROCBLAS_USE_HIPBLASLT=1 \
toolbox run -c llama-rocm-7.2 -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```

**Vulkan backend:**

```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
toolbox run -c llama-vulkan -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```

### 8.2 For llama-server (Production Agentic Use)

**ROCm (best for long context):**

```bash
ROCBLAS_USE_HIPBLASLT=1 \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 65536 -np 4 \
  -b 2048 -ub 512 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --cram 256 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```

**Vulkan RADV (best for single-user tg):**

```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 32768 -np 2 \
  -b 4096 -ub 256 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```

### 8.3 Decision Matrix

| Question | Answer |
|----------|--------|
| Which backend for benchmarking? | Both. ROCm and Vulkan have different strengths. |
| Which backend for daily chat? | Vulkan RADV for best tg speed. |
| Which backend for long-context agentic? | ROCm + rocWMMA-tuned for context resilience. |
| Which quantization? | Q4_K_M or UD-Q4_K_XL for speed; Q5_K_M for quality. |
| Enable flash attention? | Yes, always on ROCm. Yes on Vulkan for short contexts. |
| Use `--no-mmap`? | Always. |
| Set `ROCBLAS_USE_HIPBLASLT=1`? | Always for ROCm. |
| Set `RADV_PERFTEST=nogttspill`? | Always for Vulkan RADV. |
| KV cache quantization? | q8_0 for both K and V unless using speculative decoding. |
| Batch size for MoE? | `-b 256` (lower than default improves some MoE models). |

---

## 9. Sources
|
|
|
|
### GitHub Issues and Discussions
|
|
- [Performance of llama.cpp on AMD ROCm (HIP) - Discussion #15021](https://github.com/ggml-org/llama.cpp/discussions/15021)
|
|
- [Performance of llama.cpp with Vulkan - Discussion #10879](https://github.com/ggml-org/llama.cpp/discussions/10879)
|
|
- [HIP backend performs poorly on gfx1151 - Issue #13565](https://github.com/ggml-org/llama.cpp/issues/13565)
|
|
- [UMA detection incorrectly limits memory on AMD APUs - Issue #18159](https://github.com/ggml-org/llama.cpp/issues/18159)
|
|
- [ROCm model loading dumps KV cache to shared memory - Issue #18011](https://github.com/ggml-org/llama.cpp/issues/18011)
|
|
- [GATED_DELTA_NET underperformance on gfx1151 - Issue #20354](https://github.com/ggml-org/llama.cpp/issues/20354)
|
|
- [Under-Performance of ROCm 7.2 binaries - Issue #19984](https://github.com/ggml-org/llama.cpp/issues/19984)
|
|
- [ROCm 7.2 + rocWMMA compilation - Issue #19269](https://github.com/ggml-org/llama.cpp/issues/19269)
|
|
- [Building for gfx1151 - Issue #14734](https://github.com/ggml-org/llama.cpp/issues/14734)
|
|
- [AMDVLK 2GB buffer allocation limit - Issue #15054](https://github.com/ggml-org/llama.cpp/issues/15054)
|
|
- [Mastering Host-Memory Prompt Caching - Discussion #20574](https://github.com/ggml-org/llama.cpp/discussions/20574)
|
|
- [TurboQuant Extreme KV Cache Quantization - Discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969)
|
|
- [Backend-agnostic tensor parallelism - PR #19378](https://github.com/ggml-org/llama.cpp/pull/19378)
|
|
- [Massively Improved rocWMMA Performance - PR #16827](https://github.com/ggml-org/llama.cpp/pull/16827)
|
|
- [rocWMMA for gfx1151 performance boost - lemonade-sdk Issue #7](https://github.com/lemonade-sdk/llamacpp-rocm/issues/7)
|
|
- [Increase llama.cpp performance on AI Max 395+ - geerlingguy Issue #5](https://github.com/geerlingguy/beowulf-ai-cluster/issues/5)
|
|
|
|

### Wiki and Community Resources

- [Strix Halo Wiki - llama.cpp Performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Strix Halo Wiki - llama.cpp with ROCm](https://strixhalo.wiki/AI/llamacpp-with-ROCm)
- [AMD Strix Halo Backend Benchmarks (Grid View)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [LLM Tracker - AMD Strix Halo GPU Performance](https://llm-tracker.info/AMD-Strix-Halo-(Ryzen-AI-Max+-395)-GPU-Performance)
- [Framework Community - Strix Halo GPU LLM Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521)
- [Framework Community - Toolboxes for LLM inference on Strix Halo](https://community.frame.work/t/llama-cpp-vllm-toolboxes-for-llm-inference-on-strix-halo/74916)

### Articles and Blog Posts

- [Hardware Corner - Strix Halo LLM Optimization](https://www.hardware-corner.net/strix-halo-llm-optimization/)
- [Hardware Corner - RADV Vulkan Driver 13% Improvement](https://www.hardware-corner.net/llama-cpp-amd-radv-vulkan-driver-update/)
- [Phoronix - AMD ROCm 7.1 vs RADV Vulkan](https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan)
- [Phoronix - Valve Developer RADV Improvement](https://www.phoronix.com/news/RADV-Valve-Boost-Llama.cpp)
- [Yifei's Blog - Strix Halo Matrix Cores with llama.cpp](https://blog.yifei.sg/jekyll/update/2025/08/27/building-llamacpp-strix-halo.html)
- [Strix Halo CUDA/HIP Testing Notes (lhl)](https://github.com/lhl/strix-halo-testing/blob/main/llama-cpp-cuda-hip.md)

### Official Documentation

- [ROCm - llama.cpp compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/llama-cpp-compatibility.html)
- [ROCm - llama.cpp installation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html)
- [ROCm Blog - Llama.cpp Meets Instinct](https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html)
- [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
- [llama-server README](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)

### Papers

- "Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct" (January 2026, arXiv:2601.14277)
- "TurboQuant: Redefining AI efficiency with extreme compression" (Zandieh et al., ICLR 2026)
- [Unsloth Dynamic 2.0 GGUFs Documentation](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)

---

## Open Questions / Limitations

1. **rocWMMA-tuned patch upstream status**: PR #16827 may not be fully merged.
   Monitor for inclusion in mainline llama.cpp or continue using patched toolboxes.

2. **ROCm 7.2 stability on gfx1151**: Multiple reports of crashes (MUT_MAL errors),
   performance regressions, and compilation issues. ROCm 7.x is maturing but is
   not fully stable for gfx1151 as of March 2026.

3. **Vulkan CoopMat FA for AMD**: Will AMD ever get CoopMat2 support? The current
   CoopMat1 path provides only a modest improvement. A native AMD CoopMat2 or
   equivalent extension would close the gap with ROCm FA.

4. **KV cache placement on ROCm**: Issue #18011 (KV cache dumped to shared memory)
   reduces ROCm tg performance at high contexts. The root cause appears to be in
   HIP memory allocation behavior on APUs.

5. **GGML_HIP_UMA vs kernel-param GTT expansion**: The UMA flag uses slow
   fine-grained memory. GTT expansion via the `amdgpu.gttsize` kernel parameter
   provides coarse-grained GPU-mapped memory that is much faster. The upstream
   approach may eventually improve, but kernel parameters remain the correct
   method for now.

6. **GatedDeltaNet architecture support**: Both Vulkan (missing shader) and ROCm
   (register pressure, memcpy bottleneck) perform poorly on GDN models. This
   blocks efficient use of Qwen3.5-27B and similar models.

7. **TurboQuant integration timeline**: llama.cpp integration is expected in
   Q2-Q3 2026. It would provide a 3-bit KV cache with no quality loss, roughly
   doubling available context within the same memory budget.

8. **NPU utilization**: The 50 TOPS NPU on Strix Halo is currently unusable for
   llama.cpp inference on Linux; AMD driver support remains pending.
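
As a concrete companion to item 5 above, the GTT route is configured entirely on the kernel command line rather than at build time. An illustrative fragment (the values assume a 64 GB system and are examples only, not tuned recommendations; the project's `scripts/optimize/kernel-params.sh` is the authoritative source):

```bash
# /etc/default/grub -- illustrative values only.
# amdgpu.gttsize is in MiB (61440 MiB = 60 GiB);
# ttm.pages_limit is in 4 KiB pages (15728640 pages = 60 GiB).
GRUB_CMDLINE_LINUX="... amdgpu.gttsize=61440 ttm.pages_limit=15728640 iommu=pt"
```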

---

## Overlap Notes

- **Kernel parameters** (`amdgpu.gttsize`, `ttm.pages_limit`, `iommu=pt`):
  Already documented in the project's `scripts/optimize/kernel-params.sh`.
  This research covers the llama.cpp side (why they matter for inference).

- **BIOS VRAM allocation**: Reducing dedicated VRAM in BIOS frees more memory
  for GTT. This is documented in the project's audit scripts but is a prerequisite
  for the optimizations described here.

- **Toolbox container builds**: The project uses pre-built toolboxes
  (`llama-rocm-7.2`, `llama-vulkan`). The compilation flags documented here
  describe what should be baked into those containers.
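
As a sketch of what "baked in" means here, a minimal ROCm configure step for gfx1151 might look like the following (illustrative only; see Section 1 and the llama.cpp build documentation for the full flag set, and note that `GGML_HIP_ROCWMMA_FATTN` requires the rocWMMA headers to be installed):

```bash
# Illustrative: configure llama.cpp's HIP backend for gfx1151 (RDNA 3.5).
cmake -S llama.cpp -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```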