# llama.cpp Runtime and Compilation Optimization for AMD RDNA 3.5 (gfx1151)

Comprehensive research into maximizing inference performance on AMD Strix Halo
(Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified LPDDR5x-8000).
Researched March 2026.

---

## Scope

This document covers every known compilation flag, runtime parameter, environment
variable, and architectural optimization for llama.cpp targeting gfx1151 (RDNA 3.5)
with both ROCm/HIP and Vulkan backends on Fedora. It does not cover vLLM, ollama
internals, or non-llama.cpp inference engines except where their findings inform
llama.cpp optimization.

---

## Table of Contents

1. [Compilation Flags and Build Optimizations](#1-compilation-flags-and-build-optimizations)
2. [Runtime Flags and Environment Variables](#2-runtime-flags-and-environment-variables)
3. [Flash Attention and Attention Backends](#3-flash-attention-and-attention-backends)
4. [Quantization Strategies for Speed](#4-quantization-strategies-for-speed)
5. [Memory Layout and Caching](#5-memory-layout-and-caching)
6. [llama-server Specific Optimizations](#6-llama-server-specific-optimizations)
7. [Upcoming llama.cpp Features (2026)](#7-upcoming-llamacpp-features-2026)
8. [Recommended Configurations](#8-recommended-configurations)
9. [Sources](#9-sources)

---

## 1. Compilation Flags and Build Optimizations

### 1.1 GGML_HIP (ROCm) vs GGML_VULKAN: Which Backend to Build

Both backends are worth building. Neither is universally faster on gfx1151:

| Workload | Winner | Rationale |
|----------|--------|-----------|
| Token generation (short ctx) | Vulkan RADV | Lower driver overhead, mature kernel paths |
| Token generation (long ctx, 8K+) | ROCm + rocWMMA + FA | Maintains speed as context grows; uses less memory |
| Prompt processing (short ctx) | Mixed -- model-dependent | AMDVLK or ROCm hipBLASLt win on some shapes |
| Prompt processing (long ctx) | ROCm + rocWMMA-tuned | 96% speedup over untuned rocWMMA at 65K ctx |
| Memory efficiency at long ctx | ROCm + FA | Less memory than Vulkan equivalent |

Benchmark data (Qwen3-30B-A3B UD-Q4_K_XL, gfx1151, flash attention on):

| Backend | pp512 t/s | tg128 t/s | pp512@130K t/s | tg128@130K t/s |
|---------|-----------|-----------|----------------|----------------|
| Vulkan RADV | 755.14 | 85.11 | 17.24 | 12.54 |
| Vulkan AMDVLK | 741.60 | 81.79 | 10.75 | 3.51 |
| ROCm hipBLASLt | 651.93 | 63.95 | 40.35 | 4.97 |
| ROCm rocWMMA-tuned | 659.07 | 67.66 | 51.12 | 13.33 |

Key insight: RADV scales significantly better than AMDVLK for long contexts (3.6x
faster tg at 130K depth). ROCm with tuned rocWMMA provides the best long-context
prompt processing (3x faster than RADV at 130K).

### 1.2 ROCm Build Flags

**Minimal build:**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

**Optimized build (recommended):**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

#### Critical ROCm build flags:

| Flag | Effect | Recommendation |
|------|--------|----------------|
| `-DGGML_HIP=ON` | Enable HIP/ROCm backend | Required |
| `-DAMDGPU_TARGETS="gfx1151"` | Target Strix Halo GPU | Required -- do not use gfx1100 |
| `-DGGML_HIP_ROCWMMA_FATTN=ON` | Enable rocWMMA flash attention | Strongly recommended for pp |
| `-DGGML_HIP_GRAPHS=ON` | HIP graph kernel scheduling | Test -- may help reduce launch overhead |
| `-DGGML_HIP_NO_VMM=OFF` | Re-enable Virtual Memory Management | Default is disabled; test if needed |

#### Flags to be aware of but NOT set by default:

| Flag | Notes |
|------|-------|
| `-DGGML_HIP_UMA=ON` | Uses hipMallocManaged for UMA. **Avoid on Strix Halo** -- it uses fine-grained memory that is significantly slower. Standard hipMalloc + GTT expansion via kernel params is faster. |
| `-DGGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` | Forces FP16 compute in hipBLAS. Documented for RDNA4 -- may help pp performance on gfx1151. **Test before deploying.** |
| `-DGGML_CUDA_FA_ALL_QUANTS=ON` | Compiles all KV cache quant type combinations for FA. **Works for CUDA kernels which HIP reuses via hipify.** Increases compilation time substantially. Enable if you need quantized KV cache with flash attention. |

#### ROCm version considerations:

- **ROCm 7.2**: Known rocWMMA compilation issue (ambiguous template specializations in `mfma_impl.hpp`). Fixed in later point releases or by disabling rocWMMA.
- **ROCm 7.0 RC / 7.1**: Generally work well with gfx1151.
- **ROCm 6.4.4**: Some users report better performance than 7.x for certain workloads. The gfx1151 rocBLAS kernel regression means hipBLASLt is essential.
- **Recommendation**: Use ROCm 7.2+ with rocWMMA patches, or pre-built toolbox containers that have been validated for gfx1151.

### 1.3 Vulkan Build Flags

```bash
cmake -B build -S . \
  -DGGML_VULKAN=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

The Vulkan build auto-detects cooperative matrix support (KHR_coopmat) at runtime.
gfx1151 with RADV reports `matrix cores: KHR_coopmat` in llama-bench logs.

#### Vulkan-specific considerations:

| Topic | Detail |
|-------|--------|
| **RADV vs AMDVLK** | RADV (Mesa) is recommended for gfx1151. Better long-context scaling, no 2GB buffer allocation limit. |
| **AMDVLK buffer limit** | AMDVLK caps single Vulkan allocations at ~2 GiB (`VkPhysicalDeviceLimits::maxMemoryAllocationSize`). RADV allows ~4 GiB. This causes OOM for models with large compute buffers. |
| **CoopMat1 vs CoopMat2** | gfx1151 supports KHR_coopmat (CoopMat1). CoopMat2 (`VK_NV_cooperative_matrix2`) is NVIDIA-only. The more efficient CoopMat2 flash-attention path is therefore unavailable on AMD; Vulkan FA uses the CoopMat1 or scalar path (see section 3.4), and ROCm + rocWMMA remains the stronger FA option. |
| **Shader compilation** | Building from source with `glslc` available enables cooperative matrix shader variants. Pre-built binaries may omit them. |

### 1.4 LTO and PGO

llama.cpp does not have built-in LTO/PGO support in its CMake configuration.
You can enable LTO manually:

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DCMAKE_BUILD_TYPE=Release
```

Expected benefit: 2-5% improvement in CPU-bound paths. The GPU kernels are
compiled by the HIP/ROCm compiler and are not affected by host LTO.

PGO would require a two-pass build (instrument, profile, rebuild) and is not
commonly done for llama.cpp. The dominant bottleneck is GPU kernel performance
and memory bandwidth, not host-side code paths.

### 1.5 Compiler Tuning for ROCm

A known LLVM regression affects loop unrolling on RDNA. The following flag has
been reported to help:

```bash
-DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
```

This increases the unrolling threshold for local memory operations, which can
improve kernel performance for flash attention and matrix multiplication.

---

## 2. Runtime Flags and Environment Variables

### 2.1 ROCm Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `ROCBLAS_USE_HIPBLASLT=1` | **Critical** | Switches from rocBLAS tensile kernels to hipBLASLt. On gfx1151, default rocBLAS achieves only 5.76 TFLOPS (<9% efficiency). hipBLASLt achieves >60% efficiency. **This is a 2-7x improvement for prompt processing.** |
| `HSA_OVERRIDE_GFX_VERSION=11.5.1` | Set inside toolbox containers | Required for ROCm to recognize gfx1151. Set in container, not by host scripts. |
| `HSA_ENABLE_SDMA=0` | Optional | Disables SDMA engine. May help on some configurations, but generally not needed on Strix Halo with recent kernels. |
| `HIP_VISIBLE_DEVICES=0` | Optional | Select specific GPU device. Useful in multi-GPU or container setups. |
| `GPU_MAX_HEAP_SIZE=100` | Optional | Allow 100% of GPU memory for heap. Default may be lower. |
| `GPU_MAX_ALLOC_PERCENT=100` | Optional | Allow single allocation up to 100% of GPU memory. |
| `ROCR_VISIBLE_DEVICES=0` | Optional | HSA-level device visibility control. |
| `AMD_LOG_LEVEL=0` | Optional | Suppress AMD driver logging noise. |

**The single most impactful environment variable is `ROCBLAS_USE_HIPBLASLT=1`.**
Without it, ROCm pp512 on Llama-2-7B drops from 882 t/s to 348 t/s (~2.5x slower).

### 2.2 Vulkan Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `AMD_VULKAN_ICD=RADV` | Recommended | Force RADV driver (skip AMDVLK). |
| `RADV_PERFTEST=nogttspill` | **Important** | Fixes GTT memory spilling issues on RADV. Can resolve significant pp performance drops (especially with FA off). |
| `GGML_VK_VISIBLE_DEVICES=0` | Optional | Select Vulkan device index. |
| `GGML_VULKAN_DISABLE_F16=1` | Debugging | Force FP32 compute. Slower but useful for debugging precision issues. |
| `GGML_LOG_LEVEL=2` | Debugging | Verbose logging to verify coopmat detection. |

### 2.3 Thread Count (`-t` flag)

For GPU-dominant inference (all layers offloaded), the thread count has minimal
impact on throughput. The recommendation:

- **Single-user inference**: `-t 4` to `-t 8` (enough for tokenization/sampling overhead)
- **Server with parallel slots**: `-t 12` (one thread per Zen 5 big core on the Ryzen AI MAX+ 395)
- **Hybrid CPU+GPU (partial offload)**: `-t` equal to the number of big cores

The Ryzen AI MAX+ 395 has 16 cores (12 Zen 5 + 4 Zen 5c). For llama.cpp, using
all 12 big cores (`-t 12`) is optimal for CPU-involved workloads.

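For other machines, a starting `-t` value can be derived from the CPU topology. A rough sketch (it assumes 2-way SMT and ignores big/little core splits, so on Strix Halo the explicit `-t 12` above is still the better choice):

```bash
# nproc reports logical CPUs; with 2-way SMT, halving approximates physical cores.
t=$(( $(nproc) / 2 ))
if [ "$t" -lt 1 ]; then t=1; fi
echo "-t $t"
```
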
### 2.4 Batch Size Tuning (`-b` and `-ub`)

| Flag | Default | Role |
|------|---------|------|
| `-b` / `--batch-size` | 2048 | Logical batch size (application level) |
| `-ub` / `--ubatch-size` | 512 | Physical batch size (device level) |

Tuning guidance for gfx1151:

- **MoE models**: `-b 256` significantly improves pp512 (reported 70% improvement on Qwen3-30B-A3B)
- **Dense models**: Default `-b 2048` is generally fine
- **Long context**: `-ub 2048` can improve performance, but test against OOM
- **Ultra-long context**: Reduce `-ub` if memory allocation fails

The Vulkan backend blog post for Strix Halo recommends: `-c 32768 -b 4096 -ub 256`
for a good balance of performance and memory.
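
When tuning a specific model, sweeping `-b` with `llama-bench` is the quickest way to find the knee. A small sketch that prints the sweep commands to run inside your toolbox (the model path is a placeholder):

```bash
# Print a llama-bench batch-size sweep (pp512 only; -n 0 skips token generation).
model=/path/to/model.gguf
for b in 256 512 1024 2048; do
  echo llama-bench -m "$model" -ngl 99 -mmp 0 -fa 1 -b "$b" -p 512 -n 0
done
```
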
### 2.5 Memory-Mapped Loading (`-mmp` / `--no-mmap`)

**Critical finding for unified memory APUs:**

> When you load large models to the GPU, memory mapping can make loading moderately
> slower for Vulkan, and **catastrophically slower for ROCm**. You should always set
> `--mmap 0` or `--no-mmap` to improve model loading times on Strix Halo.

For `llama-bench`, use `-mmp 0`. For `llama-server`/`llama-cli`, use `--no-mmap`.

On Strix Halo, both "GPU memory" and "CPU memory" share the same physical LPDDR5x.
The difference is which pages are mapped for GPU access. GPU-mapped pages have full
bandwidth (~215 GB/s). CPU-accessed pages get well under half (~84 GB/s for
CPU-to-GPU copies).

**Always use `-ngl 99` (or higher) to ensure all layers are on GPU memory.**
Even on a unified memory system, GPU memory paths provide more than 2x the bandwidth.

### 2.6 GPU Layer Offloading (`-ngl`)

For Strix Halo with 64GB unified memory:

- **Models < 50GB**: `-ngl 99` offloads everything. No tuning needed.
- **Models 50-60GB**: `-ngl 99` should still work with GTT expanded via kernel params.
- **Models > 60GB**: May need partial offload. Use `-ngl <N>` where N is tuned to
  keep GPU memory under the GTT limit. Remaining layers run on CPU at ~1/2 bandwidth.

**Never let GPU spill to system RAM paths** -- performance will be worse than pure CPU.

---

## 3. Flash Attention and Attention Backends

### 3.1 When to Enable Flash Attention

**Rule of thumb for gfx1151:**

| Backend | Flash Attention | Recommendation |
|---------|----------------|----------------|
| ROCm + rocWMMA | `-fa 1` | **Always enable.** 24% pp improvement, maintains tg speed, uses less memory. |
| ROCm without rocWMMA | `-fa 1` | Enable, but smaller improvement. |
| Vulkan RADV | `-fa 1` | **Enable for short contexts.** Minor improvement at pp512/tg128. At long contexts, Vulkan FA may degrade performance. |
| Vulkan AMDVLK | `-fa 1` | Similar to RADV. |

**Key caveat**: Vulkan flash attention on AMD uses CoopMat1 (KHR_coopmat), not
the more efficient CoopMat2 (NVIDIA-only). For AMD, ROCm + rocWMMA is the superior
FA path.

### 3.2 rocWMMA Flash Attention Performance

Benchmark on gfx1151 (Llama2-7B Q4_K_M):

| Configuration | pp512 t/s | tg128 t/s |
|---------------|-----------|-----------|
| HIP standard | 592.28 | 40.40 |
| HIP + hipBLASLt | 548.72 | 40.43 |
| HIP + rocWMMA + hipBLASLt | 1006.80 | 39.46 |
| HIP + rocWMMA (no hipBLASLt) | 899.73 | 39.45 |

rocWMMA provides ~70% improvement in prompt processing with flash attention.
Token generation is slightly slower (~2%) due to WMMA overhead at small batch.

### 3.3 The rocWMMA Long-Context Regression and Fix

The standard rocWMMA implementation has a **long-context decode regression**:
at 65K context, tg degrades by up to 57% compared to HIP-only baseline.

**The fix** (PR #16827, "rocm-wmma-tune" branch) implements:

1. **`__launch_bounds__(256, 2)`**: Ensures minimum 2 blocks per SM, improving occupancy
2. **Adaptive KQ stride**: Uses stride 128 when head dimension <= 128, reducing LDS footprint
3. **Selective WMMA usage**: WMMA only for prefill; decode reverts to VEC/TILE kernels

Results after fix (Llama 3.2 1B Q4_K_M on gfx1151):

- pp512 at 65K context: **96% faster** than untuned rocWMMA
- tg128 at 65K context: Matches HIP baseline (previously 57% degraded)

**Status**: This patch is available in `-rocwmma-improved` toolbox builds. It may
not be merged into upstream llama.cpp. Check Donato Capitella's toolboxes.

### 3.4 Vulkan Flash Attention Limitations on AMD

The Vulkan backend supports three FA paths:

| Path | Extension | AMD Support |
|------|-----------|-------------|
| FA_SCALAR | None | Yes (CPU fallback) |
| FA_COOPMAT1 | KHR_cooperative_matrix | Yes (gfx1151 reports support) |
| FA_COOPMAT2 | NV_cooperative_matrix2 | **No** (NVIDIA-only) |

FA_COOPMAT1 supports: f16, q4_0, q8_0, f32 KV cache types.
FA_COOPMAT2 additionally supports all quant types.

When Vulkan FA is enabled on AMD with RADV, it uses CoopMat1 for matrix operations.
This provides a modest improvement over scalar FA but is significantly less
efficient than ROCm + rocWMMA.

### 3.5 New Attention Models (GatedDeltaNet)

Models using GatedDeltaNet architecture (Qwen3.5-27B, Qwen3.5-35B-A3B) have
severe performance problems on gfx1151:

- **Vulkan**: No GATED_DELTA_NET compute shader exists; ops fall back to CPU
- **ROCm/HIP**: Kernel cross-compiles but suffers from register spilling (`float s[S_v]`
  allocates up to 512 bytes per thread) and hipMemcpyWithStream bottleneck (92-95%
  of decode time on models >15GB)

Result: Qwen3.5-27B runs at ~12 t/s on gfx1151 vs expected 50-80 t/s.
**Avoid GatedDeltaNet models on gfx1151 until kernel optimization lands.**

---

## 4. Quantization Strategies for Speed

### 4.1 Quantization Speed on RDNA 3.5

Token generation speed is dominated by memory bandwidth, not compute. Smaller
quantizations are faster because they reduce bytes-per-weight, allowing more
tokens per second within the ~215 GB/s bandwidth envelope.

Approximate throughput formula for decode (bandwidth-bound):

```
tg_tokens/s ≈ effective_bandwidth_GB/s / model_size_bytes * 1e9
```

For a 7B Q4_K_M model (~4.1 GB):

```
215 / 4.1 ≈ 52 t/s (theoretical max; practical ~50 t/s on gfx1151)
```
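
The same arithmetic can be scripted for other model sizes; a sketch using the figures above:

```bash
# Bandwidth-bound decode ceiling: tokens/s ≈ bandwidth / bytes read per token.
awk 'BEGIN {
  bw_gbs  = 215   # effective GPU memory bandwidth, GB/s
  size_gb = 4.1   # weight bytes read per token (7B Q4_K_M), GB
  printf "~%.0f t/s theoretical max\n", bw_gbs / size_gb
}'
```

This prints `~52 t/s theoretical max`, matching the figure above.
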
### 4.2 Quantization Type Comparison

| Quant | Bits/Weight | Quality | Speed (relative) | Notes |
|-------|------------|---------|-------------------|-------|
| Q4_0 | 4.0 | Low | Fastest | Legacy. Simple dequant. |
| Q4_K_M | 4.83 | Good | Very fast | K-quant with hierarchical blocks. Recommended default. |
| IQ4_XS | 4.25 | Good | Fast | Importance-weighted. Better quality/bit than Q4_K_M. |
| Q5_K_M | 5.69 | Very good | Fast | Sweet spot for quality-sensitive use. |
| Q6_K | 6.56 | Excellent | Moderate | Near-lossless quality. |
| Q8_0 | 8.0 | Near-perfect | Slower | ~1.7x the bytes of Q4_K_M, proportionally slower tg. |
| F16 | 16.0 | Perfect | Slowest | Reference baseline. |

**For RDNA 3.5 specifically**:

- **Q4_K_M** is the best general-purpose quantization. The K-quant family uses
  hierarchical super-blocks (256 values) with per-sub-block scales, providing
  better quality than Q4_0 at marginally higher dequant cost that is invisible
  at the GPU level.

- **Q4_0** has the simplest dequant kernels and is marginally faster than Q4_K_M
  on some GPU backends. However, the quality loss is significant. Use only for
  smoke tests or when every t/s matters more than quality.

- **IQ4_XS** (importance-matrix quantized) offers better quality per bit than
  Q4_K_M. Speed is similar. Requires an importance matrix file during quantization.
  **Recommended over Q4_K_M when you control the quantization process.**

- **Q8_0** does NOT have special hardware-accelerated dequant on RDNA 3.5.
  RDNA 3.5 lacks INT8 tensor core equivalents. Q8_0 performance relies on the
  same FP16 compute paths, just with more memory bandwidth consumed.
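
Under the bandwidth-bound model from section 4.1, the table's relative speeds follow directly from bits per weight. A sketch for a 7B model (the 6.74e9 parameter count is an assumption; the outputs are estimates, not benchmarks):

```bash
# Estimated file size and decode ceiling per quant type at ~215 GB/s.
awk 'BEGIN {
  n = split("Q4_0:4.0 Q4_K_M:4.83 Q5_K_M:5.69 Q6_K:6.56 Q8_0:8.0", quants, " ")
  params = 6.74e9; bw = 215e9
  for (i = 1; i <= n; i++) {
    split(quants[i], kv, ":")
    bytes = params * kv[2] / 8          # bits per weight -> total bytes
    printf "%-8s %4.1f GB  ~%2.0f t/s\n", kv[1], bytes / 1e9, bw / bytes
  }
}'
```

The Q4_K_M row lands near the ~52 t/s estimate above, and Q8_0 comes out around 1.7x slower, consistent with the table.
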
### 4.3 Importance Matrix (imatrix) Quantization

imatrix quantization records how much each weight affects output quality, then
allocates more precision bits to important weights. This is essential for
sub-4-bit quantizations (IQ2_XS, IQ3_XXS, IQ4_XS) where standard K-quant
shows measurable degradation.

```bash
# Generate importance matrix (GPU-accelerated)
llama-imatrix -m model-f16.gguf -f calibration_data.txt -ngl 99 -o imatrix.dat

# Quantize with imatrix
llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4_xs.gguf IQ4_XS
```

**Speed impact**: None. imatrix affects quantization quality, not inference speed.
The dequantization kernels are identical regardless of whether imatrix was used.

### 4.4 Unsloth Dynamic (UD) Quantizations

Unsloth Dynamic 2.0 selectively quantizes different layers at different bit widths,
choosing the optimal quantization per layer based on sensitivity analysis.

**Speed impact**: Minimal to none. UD quants use the same dequant kernels as
standard GGUF quantizations. A UD-Q4_K_XL file runs at the same speed as a
standard Q4_K_M of the same total size.

**Quality impact**: Significantly better. UD consistently outperforms standard
quantizations in 5-shot MMLU and KL divergence metrics at the same total file size.

**Recommendation**: Prefer UD quants (e.g., `UD-Q4_K_XL`, `UD-Q4_K_M`) from
Unsloth when available. They are a free quality upgrade with no speed penalty.

---

## 5. Memory Layout and Caching

### 5.1 KV Cache Quantization

KV cache quantization reduces the memory footprint of the attention cache,
allowing larger context windows within the same memory budget.

| Cache Type | Memory vs F16 | Quality Impact | Recommendation |
|------------|--------------|----------------|----------------|
| f16 (default) | 1.0x | None | Baseline |
| q8_0 | 0.5x | Negligible (+0.002-0.05 ppl) | **Recommended for production** |
| q4_0 | 0.33x | Noticeable (+0.2-0.25 ppl) | Use when memory-constrained |
| q4_1 | 0.33x | Slightly better than q4_0 | Alternative to q4_0 |
| iq4_nl | 0.33x | Better than q4_0 | Best 4-bit KV option |

Usage:

```bash
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 ...
# or for llama-bench (short forms of --cache-type-k / --cache-type-v):
llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 ...
```

**Performance impact**: Quantizing K cache slightly **improves** throughput
(less memory to read). Quantizing V cache may have a slight negative impact.
Overall performance impact is negligible for normal inference.

**Caveat with speculative decoding**: Using KV cache quantization with a draft
model causes a consistent ~16% performance drop. q4_0 KV with speculative
decoding causes massive acceptance rate drops. **Avoid KV quant if using
speculative decoding.**

### 5.2 mmap vs Full Load on Unified Memory

On Strix Halo's unified memory architecture:

- **`--no-mmap` is strongly recommended** for both ROCm and Vulkan.
- With mmap enabled, ROCm model loading is "catastrophically slower."
- Vulkan loading is "moderately slower" with mmap.
- Since CPU and GPU share physical RAM, there is no data copy when loading
  to "GPU memory" -- it is just a page table update.

For `llama-bench`: Always use `-mmp 0`.
For `llama-server`/`llama-cli`: Always use `--no-mmap`.

### 5.3 Prompt Caching

llama-server supports two levels of prompt caching:

**1. Automatic KV cache reuse (`cache_prompt: true`)**:
Reuses KV cache from previous requests when prompts share a common prefix.
The server only reprocesses the suffix that differs.

**2. Host-memory prompt caching (`--cram N`)**:
Stores pre-computed prompt representations in system RAM.

- Reduces TTFT from ~4.2s to ~0.3s for cached requests (93% reduction)
- +6% token throughput (34 vs 32 t/s)
- Memory formula: `num_prefixes * avg_prefix_tokens * 8 bytes`

Configuration:

```bash
# --cram 256: use 256 MB of host RAM for the prompt cache
# --cache-type-k/-v q8_0: KV cache quantization
llama-server -m model.gguf \
  --cram 256 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  -fa \
  -ngl 99
```

Best for:

- System prompts > 5K tokens
- Multi-user chatbots with shared context
- Agentic use with repeated tool-call prefixes
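
On the client side, prefix reuse is opt-in per request. A sketch of a `/completion` request body (the endpoint and port assume a default llama-server; the prompt text is a placeholder):

```bash
# cache_prompt asks llama-server to reuse the longest matching KV-cache prefix
# from an earlier request instead of re-processing it.
payload='{"prompt": "SYSTEM PROMPT...\n\nuser question", "n_predict": 256, "cache_prompt": true}'
echo "$payload"
# Send it with: curl -s http://localhost:8080/completion -d "$payload"
```
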
### 5.4 UMA Detection Bug (Issue #18159)

llama.cpp's UMA detection (from PR #17368, designed for NVIDIA DGX Spark)
incorrectly activates on AMD APUs when `prop.integrated=1`. It reads
`/proc/meminfo` instead of `hipMemGetInfo()`, severely underreporting available
GPU memory (e.g., reporting 27GB instead of 96GB).

**Workarounds**:

- Build without `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
- Guard UMA detection with `!defined(GGML_USE_HIP)` (upstream fix pending)
- Use toolbox containers where this has been patched

### 5.5 KV Cache Placement on ROCm (Issue #18011)

On Strix Halo, the ROCm backend may dump KV cache into shared (CPU-accessible)
memory instead of GPU-mapped memory, causing performance degradation at high
context sizes. This is a known issue contributing to ROCm falling behind Vulkan
for tg at high contexts.

**Mitigation**: Use the rocWMMA-tuned branch which maintains better memory
placement, or use Vulkan RADV for workloads where this matters.

---

## 6. llama-server Specific Optimizations

### 6.1 Recommended Server Configuration

```bash
# -c 32768: total context across all slots
# -np 4: 4 parallel slots (adjust for your use)
# -b 2048 / -ub 512: logical and physical batch sizes
# --cont-batching is on by default; --jinja enables the Jinja2 chat template
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --no-mmap \
  -fa \
  -c 32768 \
  -np 4 \
  -b 2048 \
  -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --jinja
```

### 6.2 Parallel Slot Configuration (`-np`)

| Use Case | Slots | Context per Slot | Total `-c` |
|----------|-------|------------------|------------|
| Single user chat | 1 | 32768 | 32768 |
| Agentic coding (Claude Code style) | 2-4 | 8192-16384 | 32768-65536 |
| Multi-user API | 4-8 | 4096-8192 | 32768-65536 |
| Eval harness | 1 | 32768+ | 32768+ |

Memory formula: Each slot requires `context_size * 2 * hidden_dim * n_layers * bytes_per_kv_element`.
With q8_0 KV cache, this is roughly halved compared to f16.
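
As a worked example, the formula can be evaluated directly. This sketch assumes a Llama-3-8B-like shape (32 layers, 8 KV heads of head dimension 128; with GQA the effective `hidden_dim` for cache purposes is `n_kv_heads * head_dim`) and one 16K-token slot with f16 (2-byte) entries:

```bash
# Per-slot KV bytes = ctx * 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes.
awk 'BEGIN {
  ctx = 16384; layers = 32; kv_heads = 8; head_dim = 128; bytes = 2   # f16
  gib = ctx * 2 * layers * kv_heads * head_dim * bytes / 2^30
  printf "%.1f GiB per slot (half that with q8_0)\n", gib
}'
```

Four such slots at f16 would take ~8 GiB; q8_0 halves it, which is why the server configurations in this document quantize the KV cache.
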
### 6.3 Continuous Batching

Enabled by default (`--cont-batching`). Allows the server to process multiple
requests simultaneously, interleaving prefill and decode operations.

For agentic workloads: One slot typically holds a large system prompt + conversation
context, while additional slots handle parallel tool calls. Configure with:

```bash
-np 4 -c 131072  # 4 slots, up to 32K context each
```

### 6.4 Prompt Caching for Agentic Use

For agentic coding tools that send the same system prompt repeatedly:

1. Use `cache_prompt: true` in API requests (reuses KV cache prefix)
2. Use `--system-prompt-file system.txt` for static system prompts (note: may be
   removed in recent versions; verify with your build)
3. Use `--cram 128` to enable host-memory caching for prefix deduplication

### 6.5 Speculative Decoding

For token generation speedup with a draft model:

```bash
llama-server -m main-model.gguf \
  --model-draft draft-model.gguf \
  -ngl 99 \
  --draft-max 8 \
  --draft-min 1 \
  --no-mmap \
  -fa
```

**Caveat**: Do NOT combine speculative decoding with KV cache quantization.
The 16% performance drop and reduced acceptance rate negate the benefits.

---

## 7. Upcoming llama.cpp Features (2026)

### 7.1 Backend-Agnostic Tensor Parallelism (PR #19378)

Merged January 2026. Adds `--split-mode tensor` for splitting computation across
multiple GPUs via a new "meta" backend.

**Relevance to Strix Halo**: Limited. Single integrated GPU. However, for RPC
configurations with multiple Strix Halo nodes (Jeff Geerling's Beowulf cluster),
tensor parallelism could complement the existing layer-split approach.

Currently supports 1-2 GPUs with equal data split. `--tensor-split` has no effect yet.

### 7.2 TurboQuant KV Cache Compression (ICLR 2026)

Google's TurboQuant (Zandieh et al.) achieves 3-bit KV cache quantization with
no training and negligible quality loss:

| Format | MSE vs FP16 | Compression |
|--------|-------------|-------------|
| TQ3 (3-bit) | 0.034 | 4.9x |
| TQ4 (4-bit) | 0.009 | 3.8x |

**Timeline**: Open-source llama.cpp integration expected Q2-Q3 2026. A 6-phase
integration plan exists covering GGML type registration, KV cache paths, FA
integration, and CLI flags.

### 7.3 Vulkan Improvements

Active 2025-2026 developments:

- Mesa RADV optimizations for RDNA4 AI workloads (Rhys Perry/Valve patches)
- 13% pp improvement from CU mode optimization for LDS utilization
- BFloat16 Vulkan support (`VK_KHR_shader_bfloat16`) maturing in Mesa 25.x
- Partial offloading performance improvement for AMD (llama.cpp b8185, March 2026)

### 7.4 Flash Attention for Head Dimension 512

A March 2026 pull request adds FA support for HD=512 in CUDA kernels.
This benefits models with larger head dimensions (some newer architectures).
The HIP path should inherit this via hipify.

### 7.5 ik_llama.cpp Fork Innovations

The `ik_llama.cpp` fork by ikawrakow introduces:

- Row-interleaved quant packing (better memory access patterns)
- Smart Expert Reduction for faster MoE inference
- Tensor overrides with regex patterns for hybrid GPU/CPU placement
- FlashMLA for DeepSeek models

**Caveat**: ik_llama.cpp only fully supports CPU and CUDA backends. ROCm/Vulkan
are not maintained. Not recommended for AMD gfx1151.

---

## 8. Recommended Configurations

### 8.1 For llama-bench (Benchmarking)

**ROCm backend:**

```bash
ROCBLAS_USE_HIPBLASLT=1 \
toolbox run -c llama-rocm-7.2 -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```

**Vulkan backend:**

```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
toolbox run -c llama-vulkan -- \
  /path/to/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -mmp 0 -fa 1 \
  -p 512 -n 128 -r 5
```

### 8.2 For llama-server (Production Agentic Use)

**ROCm (best for long context):**

```bash
ROCBLAS_USE_HIPBLASLT=1 \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 65536 -np 4 \
  -b 2048 -ub 512 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --cram 256 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```

**Vulkan RADV (best for single-user tg):**

```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
llama-server -m model.gguf \
  -ngl 99 --no-mmap -fa \
  -c 32768 -np 2 \
  -b 4096 -ub 256 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --cont-batching \
  --host 0.0.0.0 --port 8080
```

### 8.3 Decision Matrix

| Question | Answer |
|----------|--------|
| Which backend for benchmarking? | Both. ROCm and Vulkan have different strengths. |
| Which backend for daily chat? | Vulkan RADV for best tg speed. |
| Which backend for long-context agentic? | ROCm + rocWMMA-tuned for context resilience. |
| Which quantization? | Q4_K_M or UD-Q4_K_XL for speed; Q5_K_M for quality. |
| Enable flash attention? | Yes, always on ROCm. Yes on Vulkan for short contexts. |
| Use `--no-mmap`? | Always. |
| Set `ROCBLAS_USE_HIPBLASLT=1`? | Always for ROCm. |
| Set `RADV_PERFTEST=nogttspill`? | Always for Vulkan RADV. |
| KV cache quantization? | q8_0 for both K and V unless using speculative decoding. |
| Batch size for MoE? | `-b 256` (lower than default improves some MoE models). |

---

## 9. Sources
|
|
|
|
### GitHub Issues and Discussions
|
|
- [Performance of llama.cpp on AMD ROCm (HIP) - Discussion #15021](https://github.com/ggml-org/llama.cpp/discussions/15021)
|
|
- [Performance of llama.cpp with Vulkan - Discussion #10879](https://github.com/ggml-org/llama.cpp/discussions/10879)
|
|
- [HIP backend performs poorly on gfx1151 - Issue #13565](https://github.com/ggml-org/llama.cpp/issues/13565)
|
|
- [UMA detection incorrectly limits memory on AMD APUs - Issue #18159](https://github.com/ggml-org/llama.cpp/issues/18159)
|
|
- [ROCm model loading dumps KV cache to shared memory - Issue #18011](https://github.com/ggml-org/llama.cpp/issues/18011)
|
|
- [GATED_DELTA_NET underperformance on gfx1151 - Issue #20354](https://github.com/ggml-org/llama.cpp/issues/20354)
|
|
- [Under-Performance of ROCm 7.2 binaries - Issue #19984](https://github.com/ggml-org/llama.cpp/issues/19984)
|
|
- [ROCm 7.2 + rocWMMA compilation - Issue #19269](https://github.com/ggml-org/llama.cpp/issues/19269)
|
|
- [Building for gfx1151 - Issue #14734](https://github.com/ggml-org/llama.cpp/issues/14734)
|
|
- [AMDVLK 2GB buffer allocation limit - Issue #15054](https://github.com/ggml-org/llama.cpp/issues/15054)
|
|
- [Mastering Host-Memory Prompt Caching - Discussion #20574](https://github.com/ggml-org/llama.cpp/discussions/20574)
|
|
- [TurboQuant Extreme KV Cache Quantization - Discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969)
|
|
- [Backend-agnostic tensor parallelism - PR #19378](https://github.com/ggml-org/llama.cpp/pull/19378)
|
|
- [Massively Improved rocWMMA Performance - PR #16827](https://github.com/ggml-org/llama.cpp/pull/16827)
|
|
- [rocWMMA for gfx1151 performance boost - lemonade-sdk Issue #7](https://github.com/lemonade-sdk/llamacpp-rocm/issues/7)
|
|
- [Increase llama.cpp performance on AI Max 395+ - geerlingguy Issue #5](https://github.com/geerlingguy/beowulf-ai-cluster/issues/5)
|
|
|
|

### Wiki and Community Resources

- [Strix Halo Wiki - llama.cpp Performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Strix Halo Wiki - llama.cpp with ROCm](https://strixhalo.wiki/AI/llamacpp-with-ROCm)
- [AMD Strix Halo Backend Benchmarks (Grid View)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [LLM Tracker - AMD Strix Halo GPU Performance](https://llm-tracker.info/AMD-Strix-Halo-(Ryzen-AI-Max+-395)-GPU-Performance)
- [Framework Community - Strix Halo GPU LLM Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521)
- [Framework Community - Toolboxes for LLM inference on Strix Halo](https://community.frame.work/t/llama-cpp-vllm-toolboxes-for-llm-inference-on-strix-halo/74916)

### Articles and Blog Posts

- [Hardware Corner - Strix Halo LLM Optimization](https://www.hardware-corner.net/strix-halo-llm-optimization/)
- [Hardware Corner - RADV Vulkan Driver 13% Improvement](https://www.hardware-corner.net/llama-cpp-amd-radv-vulkan-driver-update/)
- [Phoronix - AMD ROCm 7.1 vs RADV Vulkan](https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan)
- [Phoronix - Valve Developer RADV Improvement](https://www.phoronix.com/news/RADV-Valve-Boost-Llama.cpp)
- [Yifei's Blog - Strix Halo Matrix Cores with llama.cpp](https://blog.yifei.sg/jekyll/update/2025/08/27/building-llamacpp-strix-halo.html)
- [Strix Halo CUDA/HIP Testing Notes (lhl)](https://github.com/lhl/strix-halo-testing/blob/main/llama-cpp-cuda-hip.md)

### Official Documentation

- [ROCm - llama.cpp compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/llama-cpp-compatibility.html)
- [ROCm - llama.cpp installation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html)
- [ROCm Blog - Llama.cpp Meets Instinct](https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html)
- [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
- [llama-server README](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)

### Papers

- "Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct" (January 2026, arXiv:2601.14277)
- "TurboQuant: Redefining AI efficiency with extreme compression" (Zandieh et al., ICLR 2026)
- [Unsloth Dynamic 2.0 GGUFs Documentation](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)

---

## Open Questions / Limitations

1. **rocWMMA-tuned patch upstream status**: PR #16827 may not be fully merged.
   Monitor for inclusion in mainline llama.cpp or continue using patched toolboxes.

2. **ROCm 7.2 stability on gfx1151**: Multiple reports of crashes (MUT_MAL errors),
   performance regressions, and compilation issues. ROCm 7.x is maturing but is
   not fully stable for gfx1151 as of March 2026.

3. **Vulkan CoopMat FA for AMD**: Will AMD ever get CoopMat2 support? The current
   CoopMat1 path provides only a modest improvement. A native AMD CoopMat2 or
   equivalent extension would close the gap with ROCm FA.

4. **KV cache placement on ROCm**: Issue #18011 (KV cache dumped to shared memory)
   reduces ROCm tg performance at high contexts. The root cause appears to be in
   HIP memory allocation behavior on APUs.

5. **GGML_HIP_UMA vs kernel-param GTT expansion**: The UMA flag uses slow
   fine-grained memory. GTT expansion via the `amdgpu.gttsize` kernel parameter
   provides coarse-grained GPU-mapped memory that is much faster. The upstream
   approach may eventually improve, but kernel parameters remain the correct
   method for now.

6. **GatedDeltaNet architecture support**: Both Vulkan (missing shader) and ROCm
   (register pressure, memcpy bottleneck) perform poorly on GDN models. This
   blocks efficient use of Qwen3.5-27B and similar models.

7. **TurboQuant integration timeline**: llama.cpp integration is expected in
   Q2-Q3 2026. It would provide a 3-bit KV cache with no quality loss, roughly
   doubling available context within the same memory budget.

8. **NPU utilization**: The 50 TOPS NPU on Strix Halo is currently unusable for
   llama.cpp inference on Linux; AMD driver support remains pending.
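
As a concrete companion to item 5 above, the GTT route is configured entirely on the kernel command line rather than at build time. An illustrative fragment (the values assume a 64 GB system and are examples only, not tuned recommendations; the project's `scripts/optimize/kernel-params.sh` is the authoritative source):

```bash
# /etc/default/grub -- illustrative values only.
# amdgpu.gttsize is in MiB (61440 MiB = 60 GiB);
# ttm.pages_limit is in 4 KiB pages (15728640 pages = 60 GiB).
GRUB_CMDLINE_LINUX="... amdgpu.gttsize=61440 ttm.pages_limit=15728640 iommu=pt"
```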

---

## Overlap Notes

- **Kernel parameters** (`amdgpu.gttsize`, `ttm.pages_limit`, `iommu=pt`):
  Already documented in the project's `scripts/optimize/kernel-params.sh`.
  This research covers the llama.cpp side (why they matter for inference).

- **BIOS VRAM allocation**: Reducing dedicated VRAM in BIOS frees more memory
  for GTT. This is documented in the project's audit scripts but is a prerequisite
  for the optimizations described here.

- **Toolbox container builds**: The project uses pre-built toolboxes
  (`llama-rocm-7.2`, `llama-vulkan`). The compilation flags documented here
  describe what should be baked into those containers.
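
As a sketch of what "baked in" means here, a minimal ROCm configure step for gfx1151 might look like the following (illustrative only; see Section 1 and the llama.cpp build documentation for the full flag set, and note that `GGML_HIP_ROCWMMA_FATTN` requires the rocWMMA headers to be installed):

```bash
# Illustrative: configure llama.cpp's HIP backend for gfx1151 (RDNA 3.5).
cmake -S llama.cpp -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```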