fix(benchmark): parse llama-bench output with variable column count
KV cache quantization adds type_k/type_v columns to llama-bench output, shifting test and t/s to different indices. Parse from the end of the row instead of from hardcoded positions. Also fix the KV suffix separator (underscore to dash) to avoid regex ambiguity with type names like q8_0.

Also add a 5-phase optimization guide, an optimization log for tracking results, and research docs on llama.cpp and inference-landscape optimizations.
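The positional fix can be sketched as an awk helper that indexes fields from the end of a llama-bench markdown row; with leading and trailing `|` separators, t/s is always the second-to-last field and the test name the one before it, regardless of how many type_k/type_v columns appear. The helper name and sample rows are illustrative, not the repository's actual benchmark code:

```shell
# Extract "test" and "t/s" from a llama-bench markdown table row.
# Fixed column indices break when type_k/type_v columns are present,
# so index from the end of the row instead.
parse_row() {
  printf '%s\n' "$1" | awk -F'|' '{
    # With leading and trailing "|", $1 and $NF are empty,
    # so t/s is $(NF-1) and the test name is $(NF-2).
    test_col = $(NF-2); tps = $(NF-1)
    gsub(/^ +| +$/, "", test_col); gsub(/^ +| +$/, "", tps)
    print test_col, tps
  }'
}

# Row with the extra type_k/type_v columns present:
parse_row "| llama 7B Q4_K_M | 4.07 GiB | 6.74 B | ROCm | 99 | q8_0 | q8_0 | pp512 | 651.93 |"
# -> pp512 651.93
```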
806 docs/llama-cpp-optimization-research.md (new file)
@@ -0,0 +1,806 @@
# llama.cpp Runtime and Compilation Optimization for AMD RDNA 3.5 (gfx1151)

Comprehensive research into maximizing inference performance on AMD Strix Halo
(Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified LPDDR5x-8000).
Researched March 2026.

---

## Scope

This document covers every known compilation flag, runtime parameter, environment
variable, and architectural optimization for llama.cpp targeting gfx1151 (RDNA 3.5)
with both ROCm/HIP and Vulkan backends on Fedora. It does not cover vLLM, ollama
internals, or non-llama.cpp inference engines except where their findings inform
llama.cpp optimization.

---

## Table of Contents

1. [Compilation Flags and Build Optimizations](#1-compilation-flags-and-build-optimizations)
2. [Runtime Flags and Environment Variables](#2-runtime-flags-and-environment-variables)
3. [Flash Attention and Attention Backends](#3-flash-attention-and-attention-backends)
4. [Quantization Strategies for Speed](#4-quantization-strategies-for-speed)
5. [Memory Layout and Caching](#5-memory-layout-and-caching)
6. [llama-server Specific Optimizations](#6-llama-server-specific-optimizations)
7. [Upcoming llama.cpp Features (2026)](#7-upcoming-llamacpp-features-2026)
8. [Recommended Configurations](#8-recommended-configurations)
9. [Sources](#9-sources)

---
## 1. Compilation Flags and Build Optimizations

### 1.1 GGML_HIP (ROCm) vs GGML_VULKAN: Which Backend to Build

Both backends are worth building. Neither is universally faster on gfx1151:

| Workload | Winner | Rationale |
|----------|--------|-----------|
| Token generation (short ctx) | Vulkan RADV | Lower driver overhead, mature kernel paths |
| Token generation (long ctx, 8K+) | ROCm + rocWMMA + FA | Maintains speed as context grows; uses less memory |
| Prompt processing (short ctx) | Mixed -- model-dependent | AMDVLK or ROCm hipBLASLt win on some shapes |
| Prompt processing (long ctx) | ROCm + rocWMMA-tuned | 96% speedup over untuned rocWMMA at 65K ctx |
| Memory efficiency at long ctx | ROCm + FA | Less memory than Vulkan equivalent |

Benchmark data (Qwen3-30B-A3B UD-Q4_K_XL, gfx1151, flash attention on):

| Backend | pp512 t/s | tg128 t/s | pp512@130K t/s | tg128@130K t/s |
|---------|-----------|-----------|----------------|----------------|
| Vulkan RADV | 755.14 | 85.11 | 17.24 | 12.54 |
| Vulkan AMDVLK | 741.60 | 81.79 | 10.75 | 3.51 |
| ROCm hipBLASLt | 651.93 | 63.95 | 40.35 | 4.97 |
| ROCm rocWMMA-tuned | 659.07 | 67.66 | 51.12 | 13.33 |

Key insight: RADV scales significantly better than AMDVLK for long contexts (3.6x
faster tg at 130K depth). ROCm with tuned rocWMMA provides the best long-context
prompt processing (3x faster than RADV at 130K).
### 1.2 ROCm Build Flags

**Minimal build:**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

**Optimized build (recommended):**

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

#### Critical ROCm build flags:

| Flag | Effect | Recommendation |
|------|--------|----------------|
| `-DGGML_HIP=ON` | Enable HIP/ROCm backend | Required |
| `-DAMDGPU_TARGETS="gfx1151"` | Target Strix Halo GPU | Required -- do not use gfx1100 |
| `-DGGML_HIP_ROCWMMA_FATTN=ON` | Enable rocWMMA flash attention | Strongly recommended for pp |
| `-DGGML_HIP_GRAPHS=ON` | HIP graph kernel scheduling | Test -- may help reduce launch overhead |
| `-DGGML_HIP_NO_VMM=OFF` | Re-enable Virtual Memory Management | Default is disabled; test if needed |

#### Flags to be aware of but NOT set by default:

| Flag | Notes |
|------|-------|
| `-DGGML_HIP_UMA=ON` | Uses hipMallocManaged for UMA. **Avoid on Strix Halo** -- it uses fine-grained memory that is significantly slower. Standard hipMalloc + GTT expansion via kernel params is faster. |
| `-DGGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` | Forces FP16 compute in hipBLAS. Documented for RDNA4 -- may help pp performance on gfx1151. **Test before deploying.** |
| `-DGGML_CUDA_FA_ALL_QUANTS=ON` | Compiles all KV cache quant type combinations for FA. **Works for CUDA kernels, which HIP reuses via hipify.** Increases compilation time substantially. Enable if you need quantized KV cache with flash attention. |

#### ROCm version considerations:

- **ROCm 7.2**: Known rocWMMA compilation issue (ambiguous template specializations in `mfma_impl.hpp`). Fixed in later point releases or by disabling rocWMMA.
- **ROCm 7.0 RC / 7.1**: Generally work well with gfx1151.
- **ROCm 6.4.4**: Some users report better performance than 7.x for certain workloads. The gfx1151 rocBLAS kernel regression means hipBLASLt is essential.
- **Recommendation**: Use ROCm 7.2+ with rocWMMA patches, or pre-built toolbox containers that have been validated for gfx1151.
### 1.3 Vulkan Build Flags

```bash
cmake -B build -S . \
  -DGGML_VULKAN=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```

The Vulkan build auto-detects cooperative matrix support (KHR_coopmat) at runtime.
gfx1151 with RADV reports `matrix cores: KHR_coopmat` in llama-bench logs.

#### Vulkan-specific considerations:

| Topic | Detail |
|-------|--------|
| **RADV vs AMDVLK** | RADV (Mesa) is recommended for gfx1151. Better long-context scaling, no 2GB buffer allocation limit. |
| **AMDVLK buffer limit** | AMDVLK caps single Vulkan allocations at ~2 GiB (`VkPhysicalDeviceLimits::maxMemoryAllocationSize`). RADV allows ~4 GiB. This causes OOM for models with large compute buffers. |
| **CoopMat1 vs CoopMat2** | gfx1151 supports KHR_coopmat (CoopMat1). CoopMat2 (`VK_NV_cooperative_matrix2`) is NVIDIA-only. Vulkan flash attention on AMD is therefore limited to the less efficient CoopMat1 path -- use ROCm for the fastest GPU-accelerated FA. |
| **Shader compilation** | Building from source with `glslc` available enables cooperative matrix shader variants. Pre-built binaries may omit them. |
### 1.4 LTO and PGO

llama.cpp does not enable LTO or PGO by default. You can enable LTO via the
standard CMake flag:

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DCMAKE_BUILD_TYPE=Release
```

Expected benefit: 2-5% improvement in CPU-bound paths. The GPU kernels are
compiled by the HIP/ROCm compiler and are not affected by host LTO.

PGO would require a two-pass build (instrument, profile, rebuild) and is not
commonly done for llama.cpp. The dominant bottleneck is GPU kernel performance
and memory bandwidth, not host-side code paths.

### 1.5 Compiler Tuning for ROCm

A known LLVM regression affects loop unrolling on RDNA. The following flag has
been reported to help:

```bash
-DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
```

This increases the unrolling threshold for local memory operations, which can
improve kernel performance for flash attention and matrix multiplication.

---
## 2. Runtime Flags and Environment Variables

### 2.1 ROCm Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `ROCBLAS_USE_HIPBLASLT=1` | **Critical** | Switches from rocBLAS Tensile kernels to hipBLASLt. On gfx1151, default rocBLAS achieves only 5.76 TFLOPS (<9% efficiency); hipBLASLt achieves >60% efficiency. **This is a 2-7x improvement for prompt processing.** |
| `HSA_OVERRIDE_GFX_VERSION=11.5.1` | Set inside toolbox containers | Required for ROCm to recognize gfx1151. Set in the container, not by host scripts. |
| `HSA_ENABLE_SDMA=0` | Optional | Disables the SDMA engine. May help on some configurations, but generally not needed on Strix Halo with recent kernels. |
| `HIP_VISIBLE_DEVICES=0` | Optional | Select a specific GPU device. Useful in multi-GPU or container setups. |
| `GPU_MAX_HEAP_SIZE=100` | Optional | Allow 100% of GPU memory for the heap. The default may be lower. |
| `GPU_MAX_ALLOC_PERCENT=100` | Optional | Allow a single allocation up to 100% of GPU memory. |
| `ROCR_VISIBLE_DEVICES=0` | Optional | HSA-level device visibility control. |
| `AMD_LOG_LEVEL=0` | Optional | Suppress AMD driver logging noise. |

**The single most impactful environment variable is `ROCBLAS_USE_HIPBLASLT=1`.**
Without it, ROCm pp512 on Llama-2-7B drops from 882 t/s to 348 t/s (~2.5x slower).

### 2.2 Vulkan Environment Variables

| Variable | Value | Effect |
|----------|-------|--------|
| `AMD_VULKAN_ICD=RADV` | Recommended | Force the RADV driver (skip AMDVLK). |
| `RADV_PERFTEST=nogttspill` | **Important** | Fixes GTT memory spilling issues on RADV. Can resolve significant pp performance drops (especially with FA off). |
| `GGML_VK_VISIBLE_DEVICES=0` | Optional | Select the Vulkan device index. |
| `GGML_VULKAN_DISABLE_F16=1` | Debugging | Force FP32 compute. Slower but useful for debugging precision issues. |
| `GGML_LOG_LEVEL=2` | Debugging | Verbose logging to verify coopmat detection. |

### 2.3 Thread Count (`-t` flag)

For GPU-dominant inference (all layers offloaded), the thread count has minimal
impact on throughput. The recommendation:

- **Single-user inference**: `-t 4` to `-t 8` (enough for tokenization/sampling overhead)
- **Server with parallel slots**: `-t` equal to the physical big-core count (12 Zen 5 cores on the Ryzen AI MAX+ 395)
- **Hybrid CPU+GPU (partial offload)**: `-t` equal to the number of physical cores

The Ryzen AI MAX+ 395 has 16 cores (12 Zen 5 + 4 Zen 5c). For llama.cpp, using
all 12 big cores (`-t 12`) is optimal for CPU-involved workloads.

### 2.4 Batch Size Tuning (`-b` and `-ub`)

| Flag | Default | Role |
|------|---------|------|
| `-b` / `--batch-size` | 2048 | Logical batch size (application level) |
| `-ub` / `--ubatch-size` | 512 | Physical batch size (device level) |

Tuning guidance for gfx1151:

- **MoE models**: `-b 256` significantly improves pp512 (a reported 70% improvement on Qwen3-30B-A3B)
- **Dense models**: The default `-b 2048` is generally fine
- **Long context**: `-ub 2048` can improve performance, but test against OOM
- **Ultra-long context**: Reduce `-ub` if memory allocation fails

The Vulkan backend blog post for Strix Halo recommends `-c 32768 -b 4096 -ub 256`
for a good balance of performance and memory.
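Because the best batch size is model-dependent, it is worth sweeping a few values empirically. A small generator like the following prints one llama-bench invocation per candidate `-b` value (review the output, then pipe it to `sh` to run); the model path is a placeholder:

```shell
# Print a llama-bench sweep over candidate logical batch sizes.
# Review the generated commands, then pipe to `sh` to execute them.
gen_bench_sweep() {
  model=$1
  for b in 256 512 1024 2048; do
    echo "llama-bench -m $model -ngl 99 -mmp 0 -fa 1 -b $b -p 512 -n 128 -r 3"
  done
}

gen_bench_sweep /path/to/model.gguf
```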
### 2.5 Memory-Mapped Loading (`-mmp` / `--no-mmap`)

**Critical finding for unified memory APUs:**

> When you load large models to the GPU, memory mapping can make loading moderately
> slower for Vulkan, and **catastrophically slower for ROCm**. You should always set
> `--mmap 0` or `--no-mmap` to improve model loading times on Strix Halo.

For `llama-bench`, use `-mmp 0`. For `llama-server`/`llama-cli`, use `--no-mmap`.

On Strix Halo, both "GPU memory" and "CPU memory" share the same physical LPDDR5x.
The difference is which pages are mapped for GPU access. GPU-mapped pages have full
bandwidth (~215 GB/s). CPU-accessed pages get approximately half (~84 GB/s for
CPU-to-GPU copies).

**Always use `-ngl 99` (or higher) to ensure all layers are on GPU memory.**
Even on a unified memory system, GPU memory paths provide 2x the bandwidth.

### 2.6 GPU Layer Offloading (`-ngl`)

For Strix Halo with 64GB unified memory:

- **Models < 50GB**: `-ngl 99` offloads everything. No tuning needed.
- **Models 50-60GB**: `-ngl 99` should still work with GTT expanded via kernel params.
- **Models > 60GB**: May need partial offload. Use `-ngl <N>` where N is tuned to keep GPU memory under the GTT limit. Remaining layers run on CPU at ~1/2 bandwidth.
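For reference, GTT expansion on Strix Halo is commonly done via kernel command-line parameters along these lines. The exact parameter names and values depend on kernel version and installed RAM, so treat this fragment as an assumption to verify against your distribution's documentation, not a drop-in setting:

```
# Example kernel command-line additions (verify for your kernel and RAM):
# ttm.pages_limit: maximum GTT pages (4 KiB each); 16777216 pages = 64 GiB
# amdgpu.gttsize:  GTT size in MiB (older interface, deprecated on newer kernels)
ttm.pages_limit=16777216 ttm.page_pool_size=16777216
```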
**Never let GPU spill to system RAM paths** -- performance will be worse than pure CPU.

---
## 3. Flash Attention and Attention Backends

### 3.1 When to Enable Flash Attention

**Rule of thumb for gfx1151:**

| Backend | Flash Attention | Recommendation |
|---------|----------------|----------------|
| ROCm + rocWMMA | `-fa 1` | **Always enable.** 24% pp improvement, maintains tg speed, uses less memory. |
| ROCm without rocWMMA | `-fa 1` | Enable, but the improvement is smaller. |
| Vulkan RADV | `-fa 1` | **Enable for short contexts.** Minor improvement at pp512/tg128. At long contexts, Vulkan FA may degrade performance. |
| Vulkan AMDVLK | `-fa 1` | Similar to RADV. |

**Key caveat**: Vulkan flash attention on AMD uses CoopMat1 (KHR_coopmat), not
the more efficient CoopMat2 (NVIDIA-only). For AMD, ROCm + rocWMMA is the superior
FA path.

### 3.2 rocWMMA Flash Attention Performance

Benchmark on gfx1151 (Llama2-7B Q4_K_M):

| Configuration | pp512 t/s | tg128 t/s |
|---------------|-----------|-----------|
| HIP standard | 592.28 | 40.40 |
| HIP + hipBLASLt | 548.72 | 40.43 |
| HIP + rocWMMA + hipBLASLt | 1006.80 | 39.46 |
| HIP + rocWMMA (no hipBLASLt) | 899.73 | 39.45 |

rocWMMA provides ~70% improvement in prompt processing with flash attention.
Token generation is slightly slower (~2%) due to WMMA overhead at small batch sizes.

### 3.3 The rocWMMA Long-Context Regression and Fix

The standard rocWMMA implementation has a **long-context decode regression**:
at 65K context, tg degrades by up to 57% compared to the HIP-only baseline.

**The fix** (PR #16827, "rocm-wmma-tune" branch) implements:

1. **`__launch_bounds__(256, 2)`**: Guarantees at least 2 blocks per compute unit (CU), improving occupancy
2. **Adaptive KQ stride**: Uses stride 128 when head dimension <= 128, reducing LDS footprint
3. **Selective WMMA usage**: WMMA only for prefill; decode reverts to VEC/TILE kernels

Results after the fix (Llama 3.2 1B Q4_K_M on gfx1151):

- pp512 at 65K context: **96% faster** than untuned rocWMMA
- tg128 at 65K context: Matches the HIP baseline (previously 57% degraded)

**Status**: This patch is available in `-rocwmma-improved` toolbox builds. It may
not be merged into upstream llama.cpp. Check Donato Capitella's toolboxes.
### 3.4 Vulkan Flash Attention Limitations on AMD

The Vulkan backend supports three FA paths:

| Path | Extension | AMD Support |
|------|-----------|-------------|
| FA_SCALAR | None | Yes (CPU fallback) |
| FA_COOPMAT1 | KHR_cooperative_matrix | Yes (gfx1151 reports support) |
| FA_COOPMAT2 | NV_cooperative_matrix2 | **No** (NVIDIA-only) |

FA_COOPMAT1 supports f16, q4_0, q8_0, and f32 KV cache types.
FA_COOPMAT2 additionally supports all quant types.

When Vulkan FA is enabled on AMD with RADV, it uses CoopMat1 for matrix operations.
This provides a modest improvement over scalar FA but is significantly less
efficient than ROCm + rocWMMA.

### 3.5 New Attention Models (GatedDeltaNet)

Models using the GatedDeltaNet architecture (Qwen3.5-27B, Qwen3.5-35B-A3B) have
severe performance problems on gfx1151:

- **Vulkan**: No GATED_DELTA_NET compute shader exists; ops fall back to CPU
- **ROCm/HIP**: The kernel cross-compiles but suffers from register spilling (`float s[S_v]` allocates up to 512 bytes per thread) and a hipMemcpyWithStream bottleneck (92-95% of decode time on models >15GB)

Result: Qwen3.5-27B runs at ~12 t/s on gfx1151 vs an expected 50-80 t/s.
**Avoid GatedDeltaNet models on gfx1151 until kernel optimization lands.**

---

## 4. Quantization Strategies for Speed

### 4.1 Quantization Speed on RDNA 3.5

Token generation speed is dominated by memory bandwidth, not compute. Smaller
quantizations are faster because they reduce bytes-per-weight, allowing more
tokens per second within the ~215 GB/s bandwidth envelope.

Approximate throughput formula for decode (bandwidth-bound):

```
tg_tokens_per_s ≈ effective_bandwidth_GBps / model_size_GB
```

For a 7B Q4_K_M model (~4.1 GB):

```
215 / 4.1 ≈ 52 t/s (theoretical max; practical ~50 t/s on gfx1151)
```
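The same arithmetic, applied to a few common model sizes, gives a quick ceiling estimate (it ignores KV cache reads and compute overhead, so real numbers land somewhat below it):

```shell
# Theoretical bandwidth-bound decode ceiling: bandwidth (GB/s) / model size (GB)
for size_gb in 4.1 8.2 17.3 32.8; do
  awk -v bw=215 -v s="$size_gb" \
    'BEGIN { printf "%s GB model -> ~%.0f t/s ceiling\n", s, bw / s }'
done
```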
### 4.2 Quantization Type Comparison

| Quant | Bits/Weight | Quality | Speed (relative) | Notes |
|-------|-------------|---------|------------------|-------|
| Q4_0 | 4.0 | Low | Fastest | Legacy. Simple dequant. |
| Q4_K_M | 4.83 | Good | Very fast | K-quant with hierarchical blocks. Recommended default. |
| IQ4_XS | 4.25 | Good | Fast | Importance-weighted. Better quality/bit than Q4_K_M. |
| Q5_K_M | 5.69 | Very good | Fast | Sweet spot for quality-sensitive use. |
| Q6_K | 6.56 | Excellent | Moderate | Near-lossless quality. |
| Q8_0 | 8.0 | Near-perfect | Slower | ~1.7x the bytes of Q4_K_M, with proportionally slower tg. |
| F16 | 16.0 | Perfect | Slowest | Reference baseline. |

**For RDNA 3.5 specifically**:

- **Q4_K_M** is the best general-purpose quantization. The K-quant family uses hierarchical super-blocks (256 values) with per-sub-block scales, providing better quality than Q4_0 at a marginally higher dequant cost that is invisible at the GPU level.

- **Q4_0** has the simplest dequant kernels and is marginally faster than Q4_K_M on some GPU backends. However, the quality loss is significant. Use it only for smoke tests or when every t/s matters more than quality.

- **IQ4_XS** (importance-matrix quantized) offers better quality per bit than Q4_K_M. Speed is similar. Requires an importance matrix file during quantization. **Recommended over Q4_K_M when you control the quantization process.**

- **Q8_0** does NOT have special hardware-accelerated dequant on RDNA 3.5. RDNA 3.5 lacks INT8 tensor core equivalents. Q8_0 performance relies on the same FP16 compute paths, just with more memory bandwidth consumed.
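Since decode speed tracks file size, bits-per-weight translates directly into throughput. A quick size estimator, using the bits-per-weight figures from the table above (real GGUF files add a small overhead for metadata, embeddings, and mixed-precision tensors, so treat the result as a lower bound):

```shell
# Approximate model file size: params * bits_per_weight / 8 bytes
est_size_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 / 1e9 }'
}

est_size_gb 7e9 4.83   # Q4_K_M, 7B -> 4.2
est_size_gb 7e9 8.0    # Q8_0, 7B   -> 7.0
```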
### 4.3 Importance Matrix (imatrix) Quantization

imatrix quantization records how much each weight affects output quality, then
allocates more precision bits to important weights. This is essential for
sub-4-bit quantizations (IQ2_XS, IQ3_XXS, IQ4_XS), where standard K-quants
show measurable degradation.

```bash
# Generate importance matrix (GPU-accelerated)
llama-imatrix -m model-f16.gguf -f calibration_data.txt -ngl 99 -o imatrix.dat

# Quantize with imatrix
llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4_xs.gguf IQ4_XS
```

**Speed impact**: None. imatrix affects quantization quality, not inference speed.
The dequantization kernels are identical regardless of whether imatrix was used.

### 4.4 Unsloth Dynamic (UD) Quantizations

Unsloth Dynamic 2.0 selectively quantizes different layers at different bit widths,
choosing the optimal quantization per layer based on sensitivity analysis.

**Speed impact**: Minimal to none. UD quants use the same dequant kernels as
standard GGUF quantizations. A UD-Q4_K_XL file runs at the same speed as a
standard Q4_K_M of the same total size.

**Quality impact**: Significantly better. UD consistently outperforms standard
quantizations in 5-shot MMLU and KL divergence metrics at the same total file size.

**Recommendation**: Prefer UD quants (e.g., `UD-Q4_K_XL`, `UD-Q4_K_M`) from
Unsloth when available. They are a free quality upgrade with no speed penalty.

---

## 5. Memory Layout and Caching

### 5.1 KV Cache Quantization

KV cache quantization reduces the memory footprint of the attention cache,
allowing larger context windows within the same memory budget.

| Cache Type | Memory vs F16 | Quality Impact | Recommendation |
|------------|--------------|----------------|----------------|
| f16 (default) | 1.0x | None | Baseline |
| q8_0 | 0.5x | Negligible (+0.002-0.05 ppl) | **Recommended for production** |
| q4_0 | 0.33x | Noticeable (+0.2-0.25 ppl) | Use when memory-constrained |
| q4_1 | 0.33x | Slightly better than q4_0 | Alternative to q4_0 |
| iq4_nl | 0.33x | Better than q4_0 | Best 4-bit KV option |

Usage:

```bash
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 ...
# or for llama-bench:
llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 ...
```

**Performance impact**: Quantizing the K cache slightly **improves** throughput
(less memory to read). Quantizing the V cache may have a slight negative impact.
The overall performance impact is negligible for normal inference.

**Caveat with speculative decoding**: Using KV cache quantization with a draft
model causes a consistent ~16% performance drop, and q4_0 KV causes massive
acceptance-rate drops. **Avoid KV cache quantization if using speculative
decoding.**
### 5.2 mmap vs Full Load on Unified Memory

On Strix Halo's unified memory architecture:

- **`--no-mmap` is strongly recommended** for both ROCm and Vulkan.
- With mmap enabled, ROCm model loading is "catastrophically slower."
- Vulkan loading is "moderately slower" with mmap.
- Since CPU and GPU share physical RAM, there is no data copy when loading to "GPU memory" -- it is just a page table update.

For `llama-bench`: Always use `-mmp 0`.
For `llama-server`/`llama-cli`: Always use `--no-mmap`.

### 5.3 Prompt Caching

llama-server supports two levels of prompt caching:

**1. Automatic KV cache reuse (`cache_prompt: true`)**:
Reuses the KV cache from previous requests when prompts share a common prefix.
The server only reprocesses the suffix that differs.

**2. Host-memory prompt caching (`--cram N`)**:
Stores pre-computed prompt representations in system RAM.

- Reduces TTFT from ~4.2s to ~0.3s for cached requests (93% reduction)
- +6% token throughput (34 vs 32 t/s)
- Memory formula: `num_prefixes * avg_prefix_tokens * 8 bytes`

Configuration:

```bash
# --cram 256: 256 MB of host RAM for the prompt cache
# --cache-type-k/--cache-type-v q8_0: quantized KV cache
llama-server -m model.gguf \
  --cram 256 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  -fa \
  -ngl 99
```

Best for:

- System prompts > 5K tokens
- Multi-user chatbots with shared context
- Agentic use with repeated tool-call prefixes

### 5.4 UMA Detection Bug (Issue #18159)

llama.cpp's UMA detection (from PR #17368, designed for NVIDIA DGX Spark)
incorrectly activates on AMD APUs when `prop.integrated=1`. It reads
`/proc/meminfo` instead of calling `hipMemGetInfo()`, severely underreporting
available GPU memory (e.g., reporting 27GB instead of 96GB).

**Workarounds**:

- Build without `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
- Guard UMA detection with `!defined(GGML_USE_HIP)` (upstream fix pending)
- Use toolbox containers where this has been patched

### 5.5 KV Cache Placement on ROCm (Issue #18011)

On Strix Halo, the ROCm backend may place the KV cache in shared (CPU-accessible)
memory instead of GPU-mapped memory, causing performance degradation at high
context sizes. This is a known issue contributing to ROCm falling behind Vulkan
for tg at high contexts.

**Mitigation**: Use the rocWMMA-tuned branch, which maintains better memory
placement, or use Vulkan RADV for workloads where this matters.

---

## 6. llama-server Specific Optimizations

### 6.1 Recommended Server Configuration

```bash
# -c 32768:        total context, shared across all slots
# -np 4:           4 parallel slots (adjust for your use)
# -b 2048 -ub 512: logical / physical batch sizes
# --cont-batching: enabled by default on recent builds
# --jinja:         enable the Jinja2 chat template
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --no-mmap \
  -fa \
  -c 32768 \
  -np 4 \
  -b 2048 \
  -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --jinja
```

### 6.2 Parallel Slot Configuration (`-np`)

| Use Case | Slots | Context per Slot | Total `-c` |
|----------|-------|------------------|-----------|
| Single user chat | 1 | 32768 | 32768 |
| Agentic coding (Claude Code style) | 2-4 | 8192-16384 | 32768-65536 |
| Multi-user API | 4-8 | 4096-8192 | 32768-65536 |
| Eval harness | 1 | 32768+ | 32768+ |

Memory formula: each slot requires roughly `context_size * 2 * hidden_dim * n_layers * bytes_per_kv_element`
(for GQA models, replace `hidden_dim` with `n_kv_heads * head_dim`).
With q8_0 KV cache, this is roughly halved compared to f16.
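The formula above, worked for an illustrative 32-layer model with `hidden_dim` 4096 (a hypothetical shape, not a specific model; q8_0 is treated as ~1 byte per element, which slightly understates the block-scale overhead):

```shell
# KV bytes per slot = context * 2 (K and V) * hidden_dim * n_layers * bytes_per_elem
kv_gib_per_slot() {
  awk -v c="$1" -v h="$2" -v l="$3" -v b="$4" \
    'BEGIN { printf "%.2f\n", c * 2 * h * l * b / (1024^3) }'
}

kv_gib_per_slot 8192 4096 32 1    # q8_0: -> 2.00 (GiB)
kv_gib_per_slot 8192 4096 32 2    # f16:  -> 4.00 (GiB)
```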
|
||||
|
||||
### 6.3 Continuous Batching
|
||||
|
||||
Enabled by default (`--cont-batching`). Allows the server to process multiple
|
||||
requests simultaneously, interleaving prefill and decode operations.
|
||||
|
||||
For agentic workloads: One slot typically holds a large system prompt + conversation
|
||||
context, while additional slots handle parallel tool calls. Configure with:
|
||||
|
||||
```bash
|
||||
-np 4 -c 131072 # 4 slots, up to 32K context each
|
||||
```
|
||||
|
||||
### 6.4 Prompt Caching for Agentic Use
|
||||
|
||||
For agentic coding tools that send the same system prompt repeatedly:
|
||||
|
||||
1. Use `cache_prompt: true` in API requests (reuses KV cache prefix)
|
||||
2. Use `--system-prompt-file system.txt` for static system prompts (note: may be
|
||||
removed in recent versions; verify with your build)
|
||||
3. Use `--cram 128` to enable host-memory caching for prefix deduplication
|
||||
|
||||
### 6.5 Speculative Decoding
|
||||
|
||||
For token generation speedup with a draft model:
|
||||
|
||||
```bash
|
||||
llama-server -m main-model.gguf \
|
||||
--model-draft draft-model.gguf \
|
||||
-ngl 99 \
|
||||
--draft-max 8 \
|
||||
--draft-min 1 \
|
||||
--no-mmap \
|
||||
-fa
|
||||
```
|
||||
|
||||
**Caveat**: Do NOT combine speculative decoding with KV cache quantization.
|
||||
The 16% performance drop and reduced acceptance rate negate the benefits.
|
||||
|
||||
---
|
||||
|
||||
## 7. Upcoming llama.cpp Features (2026)
|
||||
|
||||
### 7.1 Backend-Agnostic Tensor Parallelism (PR #19378)
|
||||
|
||||
Merged January 2026. Adds `--split-mode tensor` for splitting computation across
|
||||
multiple GPUs via a new "meta" backend.
|
||||
|
||||
**Relevance to Strix Halo**: Limited. Single integrated GPU. However, for RPC
|
||||
configurations with multiple Strix Halo nodes (Jeff Geerling's Beowulf cluster),
|
||||
tensor parallelism could complement the existing layer-split approach.
|
||||
|
||||
Currently supports 1-2 GPUs with equal data split. `--tensor-split` has no effect yet.
|
||||
|
||||
### 7.2 TurboQuant KV Cache Compression (ICLR 2026)
|
||||
|
||||
Google's TurboQuant (Zandieh et al.) achieves 3-bit KV cache quantization with
|
||||
no training and negligible quality loss:
|
||||
|
||||
| Format | MSE vs FP16 | Compression |
|
||||
|--------|-------------|-------------|
|
||||
| TQ3 (3-bit) | 0.034 | 4.9x |
|
||||
| TQ4 (4-bit) | 0.009 | 3.8x |
|
||||
|
||||
**Timeline**: Open-source llama.cpp integration expected Q2-Q3 2026. A 6-phase
|
||||
integration plan exists covering GGML type registration, KV cache paths, FA
|
||||
integration, and CLI flags.
|
||||
|
||||
### 7.3 Vulkan Improvements
|
||||
|
||||
Active 2025-2026 developments:
|
||||
- Mesa RADV optimizations for RDNA4 AI workloads (Rhys Perry/Valve patches)
|
||||
- 13% pp improvement from CU mode optimization for LDS utilization
|
||||
- BFloat16 Vulkan support (`VK_KHR_shader_bfloat16`) maturing in Mesa 25.x
|
||||
- Partial offloading performance improvement for AMD (llama.cpp b8185, March 2026)
|
||||
|
||||
### 7.4 Flash Attention for Head Dimension 512
|
||||
|
||||
Pull request from March 2026 adds FA support for HD=512 in CUDA kernels.
|
||||
This benefits models with larger head dimensions (some newer architectures).
|
||||
The HIP path should inherit this via hipify.
|
||||
|
||||
### 7.5 ik_llama.cpp Fork Innovations
|
||||
|
||||
The `ik_llama.cpp` fork by ikawrakow introduces:
|
||||
- Row-interleaved quant packing (better memory access patterns)
|
||||
- Smart Expert Reduction for faster MoE inference
|
||||
- Tensor overrides with regex patterns for hybrid GPU/CPU placement
|
||||
- FlashMLA for DeepSeek models
|
||||
|
||||
**Caveat**: ik_llama.cpp only fully supports CPU and CUDA backends. ROCm/Vulkan
|
||||
are not maintained. Not recommended for AMD gfx1151.
|
||||
|
||||
---

## 8. Recommended Configurations

### 8.1 For llama-bench (Benchmarking)

**ROCm backend:**
```bash
ROCBLAS_USE_HIPBLASLT=1 \
toolbox run -c llama-rocm-7.2 -- \
/path/to/llama-bench \
-m /path/to/model.gguf \
-ngl 99 -mmp 0 -fa 1 \
-p 512 -n 128 -r 5
```

**Vulkan backend:**
```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
toolbox run -c llama-vulkan -- \
/path/to/llama-bench \
-m /path/to/model.gguf \
-ngl 99 -mmp 0 -fa 1 \
-p 512 -n 128 -r 5
```

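A practical note for harvesting llama-bench results: when `--cache-type-k`/`--cache-type-v` are set, the output table gains `type_k`/`type_v` columns, so the `test` and `t/s` columns shift to different indices. Indexing each row from the end is robust to the variable column count. A minimal sketch (the sample rows are illustrative, not real measurements):

```bash
# llama-bench prints a markdown table; extra type_k/type_v columns appear
# when KV cache quantization is enabled, shifting fixed column indices.
# Parse from the END of the row: last cell is "t/s", second-to-last is "test".
# The sample rows below are illustrative, not real measurements.
row_plain='| llama 8B Q4_K_M | 4.58 GiB | 8.03 B | ROCm | 99 | pp512 | 1234.56 ± 7.89 |'
row_kv='| llama 8B Q4_K_M | 4.58 GiB | 8.03 B | ROCm | 99 | q8_0 | q8_0 | tg128 | 45.67 ± 0.12 |'

parse_row() {
  echo "$1" | awk -F'|' '{
    n = 0
    for (i = 1; i <= NF; i++) {            # collect trimmed, non-empty cells
      gsub(/^[ \t]+|[ \t]+$/, "", $i)
      if ($i != "") cells[++n] = $i
    }
    split(cells[n], ts, " ")               # "1234.56 ± 7.89" -> mean t/s
    printf "%s %s\n", cells[n-1], ts[1]
  }'
}

parse_row "$row_plain"   # -> pp512 1234.56
parse_row "$row_kv"      # -> tg128 45.67
```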
### 8.2 For llama-server (Production Agentic Use)

**ROCm (best for long context):**
```bash
ROCBLAS_USE_HIPBLASLT=1 \
llama-server -m model.gguf \
-ngl 99 --no-mmap -fa \
-c 65536 -np 4 \
-b 2048 -ub 512 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--cram 256 \
--jinja --cont-batching \
--host 0.0.0.0 --port 8080
```

**Vulkan RADV (best for single-user tg):**
```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
llama-server -m model.gguf \
-ngl 99 --no-mmap -fa \
-c 32768 -np 2 \
-b 4096 -ub 256 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--jinja --cont-batching \
--host 0.0.0.0 --port 8080
```

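Once either server is up, a quick smoke test against the built-in endpoints confirms the configuration before pointing clients at it (this sketch assumes port 8080 as in the examples above):

```bash
# llama-server exposes a /health endpoint and an OpenAI-compatible API.
# Assumes the server from the examples above is listening on port 8080.
curl -s http://localhost:8080/health
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Reply with OK."}],"max_tokens":8}'
```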
### 8.3 Decision Matrix

| Question | Answer |
|----------|--------|
| Which backend for benchmarking? | Both. ROCm and Vulkan have different strengths. |
| Which backend for daily chat? | Vulkan RADV for best tg speed. |
| Which backend for long-context agentic? | ROCm + rocWMMA-tuned for context resilience. |
| Which quantization? | Q4_K_M or UD-Q4_K_XL for speed; Q5_K_M for quality. |
| Enable flash attention? | Yes, always on ROCm. Yes on Vulkan for short contexts. |
| Use `--no-mmap`? | Always. |
| Set `ROCBLAS_USE_HIPBLASLT=1`? | Always for ROCm. |
| Set `RADV_PERFTEST=nogttspill`? | Always for Vulkan RADV. |
| KV cache quantization? | q8_0 for both K and V unless using speculative decoding. |
| Batch size for MoE? | `-b 256` (lower than default improves some MoE models). |

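The `-b 256` suggestion in the matrix is easy to verify empirically rather than taken on faith. A hedged sweep sketch; the container name, binary path, and model path follow the earlier examples and should be adjusted to your setup:

```bash
# Illustrative batch-size sweep to find the MoE sweet spot on this hardware.
# Paths and container name are placeholders from the examples above.
for b in 128 256 512 1024 2048; do
  ROCBLAS_USE_HIPBLASLT=1 \
  toolbox run -c llama-rocm-7.2 -- \
    /path/to/llama-bench \
    -m /path/to/moe-model.gguf \
    -ngl 99 -mmp 0 -fa 1 \
    -b "$b" -p 512 -n 128 -r 3
done
```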
---

## 9. Sources

### GitHub Issues and Discussions
- [Performance of llama.cpp on AMD ROCm (HIP) - Discussion #15021](https://github.com/ggml-org/llama.cpp/discussions/15021)
- [Performance of llama.cpp with Vulkan - Discussion #10879](https://github.com/ggml-org/llama.cpp/discussions/10879)
- [HIP backend performs poorly on gfx1151 - Issue #13565](https://github.com/ggml-org/llama.cpp/issues/13565)
- [UMA detection incorrectly limits memory on AMD APUs - Issue #18159](https://github.com/ggml-org/llama.cpp/issues/18159)
- [ROCm model loading dumps KV cache to shared memory - Issue #18011](https://github.com/ggml-org/llama.cpp/issues/18011)
- [GATED_DELTA_NET underperformance on gfx1151 - Issue #20354](https://github.com/ggml-org/llama.cpp/issues/20354)
- [Under-Performance of ROCm 7.2 binaries - Issue #19984](https://github.com/ggml-org/llama.cpp/issues/19984)
- [ROCm 7.2 + rocWMMA compilation - Issue #19269](https://github.com/ggml-org/llama.cpp/issues/19269)
- [Building for gfx1151 - Issue #14734](https://github.com/ggml-org/llama.cpp/issues/14734)
- [AMDVLK 2GB buffer allocation limit - Issue #15054](https://github.com/ggml-org/llama.cpp/issues/15054)
- [Mastering Host-Memory Prompt Caching - Discussion #20574](https://github.com/ggml-org/llama.cpp/discussions/20574)
- [TurboQuant Extreme KV Cache Quantization - Discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969)
- [Backend-agnostic tensor parallelism - PR #19378](https://github.com/ggml-org/llama.cpp/pull/19378)
- [Massively Improved rocWMMA Performance - PR #16827](https://github.com/ggml-org/llama.cpp/pull/16827)
- [rocWMMA for gfx1151 performance boost - lemonade-sdk Issue #7](https://github.com/lemonade-sdk/llamacpp-rocm/issues/7)
- [Increase llama.cpp performance on AI Max 395+ - geerlingguy Issue #5](https://github.com/geerlingguy/beowulf-ai-cluster/issues/5)

### Wiki and Community Resources
- [Strix Halo Wiki - llama.cpp Performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Strix Halo Wiki - llama.cpp with ROCm](https://strixhalo.wiki/AI/llamacpp-with-ROCm)
- [AMD Strix Halo Backend Benchmarks (Grid View)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [LLM Tracker - AMD Strix Halo GPU Performance](https://llm-tracker.info/AMD-Strix-Halo-(Ryzen-AI-Max+-395)-GPU-Performance)
- [Framework Community - Strix Halo GPU LLM Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521)
- [Framework Community - Toolboxes for LLM inference on Strix Halo](https://community.frame.work/t/llama-cpp-vllm-toolboxes-for-llm-inference-on-strix-halo/74916)

### Articles and Blog Posts
- [Hardware Corner - Strix Halo LLM Optimization](https://www.hardware-corner.net/strix-halo-llm-optimization/)
- [Hardware Corner - RADV Vulkan Driver 13% Improvement](https://www.hardware-corner.net/llama-cpp-amd-radv-vulkan-driver-update/)
- [Phoronix - AMD ROCm 7.1 vs RADV Vulkan](https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan)
- [Phoronix - Valve Developer RADV Improvement](https://www.phoronix.com/news/RADV-Valve-Boost-Llama.cpp)
- [Yifei's Blog - Strix Halo Matrix Cores with llama.cpp](https://blog.yifei.sg/jekyll/update/2025/08/27/building-llamacpp-strix-halo.html)
- [Strix Halo CUDA/HIP Testing Notes (lhl)](https://github.com/lhl/strix-halo-testing/blob/main/llama-cpp-cuda-hip.md)

### Official Documentation
- [ROCm - llama.cpp compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/llama-cpp-compatibility.html)
- [ROCm - llama.cpp installation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html)
- [ROCm Blog - Llama.cpp Meets Instinct](https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html)
- [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
- [llama-server README](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)

### Papers
- "Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct" (January 2026, arXiv:2601.14277)
- "TurboQuant: Redefining AI efficiency with extreme compression" (Zandieh et al., ICLR 2026)
- [Unsloth Dynamic 2.0 GGUFs Documentation](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)

---

## Open Questions / Limitations

1. **rocWMMA-tuned patch upstream status**: PR #16827 may not be fully merged.
Monitor for inclusion in mainline llama.cpp or continue using patched toolboxes.

2. **ROCm 7.2 stability on gfx1151**: Multiple reports of crashes (MUT_MAL errors),
performance regressions, and compilation issues. ROCm 7.x is maturing but
not fully stable for gfx1151 as of March 2026.

3. **Vulkan CoopMat FA for AMD**: Will AMD ever get CoopMat2 support? The current
CoopMat1 path provides modest improvement. A native AMD CoopMat2 or equivalent
extension would close the gap with ROCm FA.

4. **KV cache placement on ROCm**: Issue #18011 (KV cache dumped to shared memory)
reduces ROCm tg performance at high contexts. Root cause appears to be in
HIP memory allocation behavior on APUs.

5. **GGML_HIP_UMA vs kernel-param GTT expansion**: The UMA flag uses slow
fine-grained memory. GTT expansion via `amdgpu.gttsize` kernel params provides
coarse-grained GPU-mapped memory that is much faster. The upstream approach
may eventually improve, but kernel params remain the correct method for now.
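
The kernel-param route can be sketched for Fedora with `grubby`. The sizes below are illustrative assumptions for a 64 GB machine (they are not from this document) and must be tuned to your own headroom before applying:

```bash
# Illustrative Fedora sketch: expand GTT via kernel parameters.
# amdgpu.gttsize is in MiB; ttm.pages_limit is in 4 KiB pages
# (57344 MiB and 14680064 pages are both ~56 GiB -- example values only).
sudo grubby --update-kernel=ALL \
  --args="amdgpu.gttsize=57344 ttm.pages_limit=14680064 iommu=pt"
# Reboot, then verify the new GTT size:
# dmesg | grep -i "amdgpu.*gtt"
```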

6. **GatedDeltaNet architecture support**: Both Vulkan (missing shader) and ROCm
(register pressure, memcpy bottleneck) perform poorly on GDN models. This
blocks efficient use of Qwen3.5-27B and similar models.

7. **TurboQuant integration timeline**: Expected Q2-Q3 2026 for llama.cpp.
Would provide 3-bit KV cache with no quality loss, roughly doubling available
context within the same memory budget.

8. **NPU utilization**: The 50 TOPS NPU on Strix Halo is currently unusable on
Linux for llama.cpp inference. AMD driver support for the NPU on Linux remains
pending.

---

## Overlap Notes

- **Kernel parameters** (`amdgpu.gttsize`, `ttm.pages_limit`, `iommu=pt`):
Already documented in the project's `scripts/optimize/kernel-params.sh`.
This research covers the llama.cpp side (why they matter for inference).

- **BIOS VRAM allocation**: Reducing dedicated VRAM in BIOS frees more memory
for GTT. This is documented in the project's audit scripts but is a prerequisite
for the optimizations described here.

- **Toolbox container builds**: The project uses pre-built toolboxes
(`llama-rocm-7.2`, `llama-vulkan`). The compilation flags documented here
describe what should be baked into those containers.