fix(benchmark): parse llama-bench output with variable column count
KV cache quantization adds type_k/type_v columns to llama-bench output, shifting test and t/s to different indices. Parse from end of row instead of hardcoded positions. Also fix KV suffix separator (underscore to dash) to avoid regex ambiguity with type names like q8_0. Add 5-phase optimization guide, optimization log for tracking results, and research docs on llama.cpp and inference landscape optimizations.
825
docs/inference-optimization-landscape.md
Normal file
@@ -0,0 +1,825 @@
# LLM Inference Optimization Landscape (March 2026)

## Scope

Comprehensive survey of cutting-edge LLM inference optimization techniques applicable
to a high-end AMD APU workstation: Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S
(gfx1151, RDNA 3.5), 64 GB unified LPDDR5X memory, 256 GB/s bandwidth. Covers
inference engines, quantization, attention, MoE optimization, memory bandwidth, OS-level
tuning, hardware features, and model-level techniques. Research current as of March 2026.

---

## Table of Contents

1. [Inference Engines and Backends](#1-inference-engines-and-backends)
2. [Advanced Quantization Techniques](#2-advanced-quantization-techniques)
3. [Attention Optimization](#3-attention-optimization)
4. [MoE-Specific Optimizations](#4-moe-specific-optimizations)
5. [Memory Bandwidth Optimization](#5-memory-bandwidth-optimization)
6. [OS and Runtime Techniques](#6-os-and-runtime-techniques)
7. [Emerging Hardware Features](#7-emerging-hardware-features)
8. [Model-Level Optimizations](#8-model-level-optimizations)
9. [Prioritized Recommendations for Strix Halo](#9-prioritized-recommendations-for-strix-halo)
10. [Sources](#10-sources)

---

## 1. Inference Engines and Backends

### 1.1 llama.cpp -- Still the Foundation

llama.cpp remains the dominant local inference engine. All major interfaces (Ollama,
LM Studio, GPT4All, KoboldCpp) use it under the hood. For Strix Halo specifically:

- **ROCm/HIP backend**: Build with `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`. The `ROCBLAS_USE_HIPBLASLT=1`
environment variable forces hipBLASLt kernels, which deliver the best throughput on
gfx1151.
- **Vulkan backend**: The RADV Mesa driver has seen active RDNA 3.5/4 optimization
in Mesa 25.x. In some benchmarks Vulkan outperforms ROCm for single-shot inference
and shorter contexts. HIP+WMMA+FlashAttention is fastest for long contexts (tg8192+).
- **UMA detection bug (issue #18159)**: llama.cpp's UMA detection can incorrectly
limit available memory on AMD APUs with large TTM allocations. Disabling mmap
(`--no-mmap`, or `-mmp 0` for llama-bench) is critical for ROCm on Strix Halo to
avoid catastrophically slow model loading.
- **Performance**: Llama-2-7B Q4_0 achieves ~1464 t/s prompt processing (pp512) and
~50 t/s token generation (tg128) on Strix Halo with ROCm.
- **Known regression**: A commit enabling WMMA-MMQ INT kernels for RDNA 3 introduced
a significant prompt-processing regression on gfx1151 with ROCm 7.x (issue #17917).

**Status**: Production-ready. Best single-engine choice for Strix Halo.
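The flags above combine into a build-and-launch sequence along these lines (a sketch, not verified on this exact machine; the repository URL, paths, and model file are placeholders):

```bash
# Build llama.cpp with the HIP backend tuned for gfx1151 (Strix Halo)
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGGML_HIP_UMA=ON
cmake --build build --config Release -j

# Force hipBLASLt kernels; --no-mmap avoids the slow-load path on ROCm
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
  -m model-Q4_K_M.gguf --no-mmap -ngl 99
```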
### 1.2 KTransformers -- CPU/GPU Hybrid MoE Specialist

KTransformers (SOSP 2025) is the most significant new engine for hybrid inference.
It was purpose-built for running large MoE models (DeepSeek-R1/V3) on systems with
limited GPU memory but abundant CPU memory.

- **AMX-optimized kernels**: Uses Intel AMX instructions for CPU-side expert
computation. For AMD Zen 5, it falls back to AVX-512, which is still substantially
faster than naive CPU inference.
- **Async CPU-GPU scheduling**: Overlaps CPU expert computation with GPU attention
computation, hiding CPU latency.
- **Performance**: 4.62-19.74x prefill speedup, 1.25-4.09x decode speedup vs
existing hybrid systems. SGLang + KTransformers achieves 220+ tok/s total
throughput on trillion-parameter MoE models.
- **Relevance to Strix Halo**: Moderate. KTransformers shines when GPU VRAM is
scarce (24 GB discrete) and CPU RAM is abundant (382 GB). On Strix Halo, all 64 GB
is accessible to the GPU, so the CPU offloading advantage is diminished. However,
for models exceeding 64 GB, KTransformers-style hybrid inference becomes relevant.

**Status**: Production. Most useful for models that exceed available VRAM.
### 1.3 PowerInfer / PowerInfer-2

PowerInfer-2 targets smartphones, achieving 11.68 t/s on Mixtral 47B (22x faster
than alternatives). It exploits MoE sparsity by predicting which experts will
activate and only loading those. The core technique -- hot/cold neuron partitioning
and GPU-resident hot neurons -- is architecturally interesting, but the implementation
targets mobile SoCs with discrete memory hierarchies, not unified-memory APUs where
all memory is equally accessible to the GPU.

**Status**: Research. Techniques are partially subsumed by llama.cpp's own MoE
offloading improvements.
### 1.4 MLC-LLM

MLC-LLM compiles models via TVM to target multiple backends including ROCm, Vulkan,
Metal, and OpenCL. It was one of the first engines to make AMD GPUs competitive for
LLM inference (2023 blog post). The Vulkan backend provides a universal fallback
that works on any GPU.

**Status**: Active but niche. For Strix Halo, llama.cpp's native ROCm/Vulkan
backends are more mature and better optimized.
### 1.5 mistral.rs / candle / burn

Rust-based inference engines:

- **mistral.rs**: Built on Hugging Face's candle library. Supports GGUF, GPTQ,
ISQ (in-situ quantization). Has CUDA support but no ROCm backend.
- **candle**: Hugging Face's Rust ML framework. GPU support via CUDA; no ROCm.
- **burn**: Rust ML framework with multiple backends (WGPU, Vulkan, CUDA). The
WGPU/Vulkan path could theoretically work on AMD, but LLM inference support
is limited.

**Status**: Not viable for Strix Halo in 2026. No ROCm support, and the Vulkan
paths are less optimized than llama.cpp's.
### 1.6 BitNet.cpp

Microsoft's official 1-bit LLM inference framework. Achieves 6x faster inference
and 82% lower energy consumption. GPU kernel support was added May 2025 for NVIDIA
and Apple Silicon. No AMD GPU kernels yet. CPU-only mode works on any x86 system
and could be relevant for future 1-bit models, but the model ecosystem (BitNet b1.58
variants) remains small.

**Status**: Watch. No AMD GPU support. CPU path works but model selection is limited.
### 1.7 vLLM and SGLang

Both are production LLM serving frameworks with AMD ROCm support:

- **vLLM v0.16.0** (Feb 2026): ROCm is now a first-class platform. 93% of AMD
test groups passing. Native AITER FP8 kernels, fused LayerNorm/SiLU, optimized
Paged Attention. Extended bitsandbytes quantization to warp-size-32 GPUs (RDNA).
- **SGLang**: Supports ROCm. KTransformers integration for hybrid MoE inference.

Both are overkill for single-user local inference but become relevant for serving
multiple users or running agentic workloads with concurrent requests.

**Status**: Production for server workloads. Consider if running multi-user or
agentic eval pipelines.
### 1.8 ExLlamaV3 / EXL3

ExLlamaV3 introduces the EXL3 format (based on QTIP from Cornell RelaxML), achieving
excellent perplexity at extreme compression (Llama 3.3 70B at 1.75 bpw, 19 GB). The
Marlin-inspired GEMM kernels are highly optimized for NVIDIA GPUs. AMD ROCm support
was absent at launch (early 2025) and current status is uncertain.

**Status**: Watch. Potentially best-in-class quantization quality, but AMD support
is unclear.

---

## 2. Advanced Quantization Techniques
### 2.1 GGUF Quantization Landscape

GGUF remains the dominant format for local inference via llama.cpp. The key variants:

| Format  | Bits | Method           | Best For                     |
|---------|------|------------------|------------------------------|
| Q8_0    | 8    | Round-to-nearest | Maximum quality, 2x compress |
| Q6_K    | 6.5  | K-quant          | High quality, 2.5x compress  |
| Q5_K_M  | 5.5  | K-quant+imatrix  | Balanced quality/size        |
| Q4_K_M  | 4.5  | K-quant+imatrix  | Default recommendation       |
| Q3_K_M  | 3.9  | K-quant+imatrix  | Aggressive, still usable     |
| IQ3_XXS | 3.06 | I-quant+imatrix  | Extreme compression          |
| IQ2_XXS | 2.06 | I-quant+imatrix  | Near-minimum viable          |
| IQ1_S   | 1.56 | I-quant+imatrix  | Experimental                 |

**imatrix (Importance Matrix)**: The single most impactful quality improvement for
sub-4-bit quantization. The importance matrix identifies which weights produce large
activations during inference and allocates more precision to them. For aggressive
quantization (<4 bits), imatrix is no longer optional -- it is essential.

**Recommendation**: Q4_K_M + imatrix for most use cases. Q3_K_M + imatrix when
fitting a larger model matters more than marginal quality.
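The imatrix workflow in llama.cpp looks roughly like this (a sketch; file names are placeholders, and flag spellings should be checked against your llama.cpp version):

```bash
# 1. Collect an importance matrix by running a calibration text through the model
./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat

# 2. Quantize using the importance matrix to protect high-activation weights
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```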
### 2.2 Unsloth Dynamic 2.0

Unsloth Dynamic 2.0 (Feb 2026) represents the state-of-the-art in intelligent GGUF
quantization:

- **Per-layer adaptive quantization**: Each layer gets a custom quantization type
based on sensitivity analysis. The quantization scheme for Gemma 3 differs
significantly from Llama 4.
- **Universal MoE + dense support**: Dynamic 2.0 works on all architectures
(previously MoE-only).
- **Calibration dataset**: 1.5M+ token hand-curated dataset for improved
conversational quality.
- **Quality results**: Dynamic 3-bit DeepSeek V3.1 GGUF scores 75.6% on 5-shot
MMLU, surpassing many full-precision models.
- **KL Divergence tracking**: Every GGUF is benchmarked against the original model
on both perplexity and KL divergence.

**Relevance**: Directly applicable. Use Unsloth Dynamic 2.0 GGUFs when available
for any model. They consistently outperform standard k-quant GGUFs at the same
bit-width.
### 2.3 AQLM and QuIP#

Both target extreme compression (2-3 bits):

- **QuIP#** (ICML 2024): Uses randomized Hadamard transforms + E8 lattice codebooks.
First PTQ method where 3-bit outperforms theoretical lossless 4-bit. The E8
codebook fits in L1 cache, enabling inference speedups over FP16.
- **AQLM** v1.1.7 (April 2025): Additive quantization achieving Pareto optimality
below 3 bpw. Outperforms QuIP# on MoE models at 2-bit. Added arbitrary
8-dimensional codebooks on GPU.

Both require PyTorch/CUDA for dequantization kernels. Neither has native llama.cpp
integration or AMD support. They represent the theoretical frontier of what is
achievable at extreme compression but are not practical for Strix Halo today.

**Status**: Research. Watch for llama.cpp integration of QTIP (via ExLlamaV3/EXL3).
### 2.4 AWQ vs GPTQ vs GGUF on AMD

For AMD GPUs in the llama.cpp ecosystem:

- **GGUF**: The only practical choice. Native llama.cpp support with ROCm/Vulkan
acceleration. K-quants and I-quants are well-optimized.
- **AWQ/GPTQ**: Require Marlin kernels for competitive speed (741 tok/s with
Marlin-AWQ vs 67 tok/s without on NVIDIA). Marlin kernels are CUDA-only. On AMD,
these formats are accessible via vLLM or Hugging Face Transformers with ROCm, but
not through llama.cpp.
- **Performance hierarchy on AMD (via vLLM)**: GPTQ and AWQ with Marlin kernels are
fastest on NVIDIA; on AMD ROCm, the performance advantage over GGUF is minimal
and setup complexity is higher.

**Recommendation**: GGUF for llama.cpp on Strix Halo. AWQ/GPTQ only if using vLLM.
### 2.5 Mixed-Precision and Layer-Wise Quantization

Active research area with direct practical implications:

- **Attention vs FFN sensitivity**: Attention layers (QKV projections, output
projection) have varying sensitivity. FFN layers are often the largest component
and frequent targets for aggressive quantization (INT4).
- **Channel-Wise Mixed-Precision (CMPQ)**: Allocates quantization precision per
weight channel based on activation distributions. Adapts to any bit-width.
- **HOBBIT for MoE**: Maintains FP16 and INT4 versions of experts simultaneously.
Hot experts stay at FP16; cold experts use INT4 or even INT2. This concept is
partially implemented in Unsloth Dynamic 2.0's per-layer approach.
- **Fine-Grained Mixed Precision (FGMP)**: Goes below row-level granularity to
handle unstructured sensitivity patterns in both weights and activations.

**Relevance**: Unsloth Dynamic 2.0 already implements the practical version of
layer-wise mixed precision for GGUF. The research frontier is moving toward
sub-layer and channel-level mixed precision.
### 2.6 KV Cache Quantization

- **TurboQuant** (ICLR 2026): Being integrated into llama.cpp. TQ3 (3-bit) achieves
4.9x compression vs FP16 KV cache; TQ4 (4-bit) achieves 3.8x. This directly
reduces memory pressure for long-context inference.
- **llama.cpp native**: Already supports Q8_0 and Q4_0 KV cache quantization via
`--cache-type-k` and `--cache-type-v` flags.

**Relevance**: High. On a 64 GB system, KV cache can consume significant memory for
long contexts. Q4_0 KV cache is recommended; TurboQuant will push this further.

---
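A back-of-envelope estimate shows why this matters. For a Llama-3-8B-class GQA model (32 layers, 8 KV heads, head dim 128 -- assumed values for illustration):

```bash
# Bytes/token of KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes/elt
layers=32 kv_heads=8 head_dim=128
fp16_per_token=$(( 2 * layers * kv_heads * head_dim * 2 ))  # fp16 = 2 bytes/elt
echo "$fp16_per_token"                                      # 131072 bytes (~128 KiB)

# At a 32768-token context, in MiB: fp16 vs ~4.5-bit Q4_0 cells (ratio 4.5/16)
ctx=32768
echo $(( fp16_per_token * ctx / 1024 / 1024 ))              # 4096 MiB
echo $(( fp16_per_token * ctx * 45 / 160 / 1024 / 1024 ))   # 1152 MiB
```

Enabled in llama-server with `--cache-type-k q4_0 --cache-type-v q4_0` (a quantized V cache also requires Flash Attention to be enabled in current llama.cpp builds).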
## 3. Attention Optimization

### 3.1 Flash Attention on AMD

Current status for RDNA 3.5 / gfx1151:

- **Triton backend**: Supports CDNA and RDNA GPUs with fp16, bf16, fp32. This is
the primary Flash Attention path for non-Instinct AMD GPUs.
- **PyTorch integration**: Since PyTorch 2.5.0+, `F.scaled_dot_product_attention`
automatically uses Flash Attention on RDNA cards via the Triton backend.
- **llama.cpp WMMA Flash Attention**: Enabled via `-DGGML_HIP_ROCWMMA_FATTN=ON`.
Uses RDNA 3.5's WMMA instructions for matrix multiply within the attention kernel.
This is the fastest path for long-context inference on Strix Halo.
- **CK (Composable Kernel) backend**: Supports MI200x, MI250x, MI300x, MI355x.
Not available for RDNA consumer GPUs.

**Gap**: Flash Attention 3 (with asynchronous pipelines and FP8 attention) is
NVIDIA Hopper-specific. No AMD equivalent exists.
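At runtime, Flash Attention in llama.cpp is opt-in; a quick way to quantify its effect is llama-bench (a sketch -- the model file is a placeholder and flag syntax may differ across llama.cpp versions):

```bash
# Compare attention paths on the same model: Flash Attention off vs on
./llama-bench -m model-Q4_K_M.gguf -fa 0,1 -p 512 -n 128
```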
### 3.2 SageAttention

SageAttention (ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight) achieves 2-5x speedup
over FlashAttention through quantized attention (8-bit Q/K matrices, FP16 values).
SageAttention3 further uses FP4 Tensor Cores on Blackwell GPUs.

**AMD status**: SageAttention's Triton implementation could theoretically work on
AMD GPUs, but no AMD-optimized kernels exist. The quantized attention concept is
sound and could be adapted.

**Status**: Watch. Would be high-impact if ported to AMD.
### 3.3 Paged Attention

Paged Attention (vLLM) manages KV cache as non-contiguous memory pages, eliminating
60-80% of memory waste from fragmentation. llama.cpp's server mode implements a
simplified version of this for concurrent request handling, but the full PagedAttention
system is more mature in vLLM.

**Relevance**: Moderate for single-user. High for multi-user serving.
### 3.4 GQA/MQA Architecture Implications

Modern models (Llama 2/3, Mistral, Qwen) use Grouped Query Attention:

- GQA reduces KV cache by up to 90% vs MHA (Multi-Head Attention)
- 30-40% faster inference than MHA with near-equivalent accuracy
- Enables larger batch sizes due to smaller memory footprint

**Practical impact**: When choosing models for Strix Halo, prefer GQA models. All
modern model families (Llama 3, Qwen 3, Gemma 3, Mistral) use GQA. Avoid older MHA
models when alternatives exist.
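The reduction factor is just the ratio of query heads to KV heads; for a Llama-3-8B-style configuration (32 query heads, 8 KV heads -- assumed here for illustration):

```bash
# KV cache shrinks by n_heads / n_kv_heads relative to MHA
n_heads=32 n_kv_heads=8
echo "$(( n_heads / n_kv_heads ))x smaller KV cache"        # 4x
echo "$(( 100 - 100 * n_kv_heads / n_heads ))% reduction"   # 75%
```

More aggressive ratios (e.g. 64:4) are where the "up to 90%" figure comes from.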
### 3.5 Ring Attention and Linear Attention

- **Ring Attention**: Distributes long sequences across multiple devices. Achieves
1M context prefill in 77s with 93% parallelization efficiency. Not applicable to
single-device Strix Halo.
- **Linear Attention**: Reduces KV cache from O(n) to O(1) and computation from
O(n^2) to O(n). The Ring-Linear models (hybrid softmax + linear attention) reduce
inference cost to 1/10 of dense models. This is a model architecture choice, not
a runtime optimization.

**Relevance**: Linear attention models would be transformative for long-context on
Strix Halo. Watch for Qwen, DeepSeek, or Llama variants with hybrid attention.

---

## 4. MoE-Specific Optimizations

### 4.1 Expert Offloading on Unified Memory

On discrete GPU systems, MoE inference involves expensive PCIe transfers of expert
weights between CPU RAM and GPU VRAM. On Strix Halo's unified memory, this bottleneck
is fundamentally different:

- All expert weights reside in the same physical memory accessible to both CPU and
GPU. There is no PCIe transfer cost.
- The bottleneck shifts to **memory bandwidth**: at 256 GB/s, loading a 2 GB expert
takes ~7.8 ms. With GGUF Q4 quantization, experts are 4x smaller, reducing this
to ~2 ms.
- **Implication**: Unified memory eliminates the offloading problem but does not
eliminate the bandwidth problem. The optimization focus should be on reducing the
number of expert weights that must be read per token.
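The per-expert numbers above fall out of simple division (illustrative shell arithmetic):

```bash
# Time to stream one expert's weights from DRAM at 256 GB/s, in microseconds
bw_gbs=256
expert_mb=2000                                  # ~2 GB expert
echo "$(( expert_mb * 1000 / bw_gbs )) us"      # 7812 us, ~7.8 ms
echo "$(( expert_mb / 4 * 1000 / bw_gbs )) us"  # 1953 us, ~2 ms after 4x Q4 shrink
```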
### 4.2 Expert Caching and Prediction

The research frontier in 2025-2026 focuses on predicting which experts will be needed:

- **OD-MoE**: 99.94% expert activation prediction accuracy, delivering ~75% of
fully GPU-cached speed using 1/3 GPU memory.
- **MoE-SpeQ**: Uses a small draft model to predict expert sequences, enabling
prefetching. Combines speculative decoding with expert prediction.
- **SP-MoE**: First speculative-decoding-aware expert offloading framework. Achieves
1.07-3.5x TPOT speedup by exploiting structural correspondence between draft
and target models.
- **SliceMoE**: Dynamic Bit-Sliced Caching -- caches experts at sub-expert
granularity, assigning precision on demand.
- **FlashMoE**: ML-based cache replacement for SSD-based expert offloading on edge.

**Relevance for Strix Halo**: Expert caching is less critical when all experts fit
in memory, but expert prediction can still help by enabling **prefetching into
L2/Infinity Cache** before the expert is needed, reducing effective memory latency.
### 4.3 Expert Pruning

- Static pruning: Remove least-used experts entirely (MC-SMoE, EEP). Can reduce
active parameters by up to 96.875% (TSEP). Requires fine-tuning.
- Dynamic pruning: Skip experts below an activation threshold at inference time.
38.2% FLOPs reduction with 1.32x speedup (Li et al.).
- **DynMoE**: 9% FLOPs reduction, 1.37x speedup through dynamic gating.

**Relevance**: Moderate. Dynamic expert skipping could reduce memory bandwidth
requirements on Strix Halo, but requires model-specific configuration.
### 4.4 MoE Quantization -- Inactive Expert Compression

HOBBIT maintains multiple precision versions of experts: FP16 hot experts, INT4 cold
experts, INT2 for rarely-used experts. On unified memory, a variant of this approach
could keep the working set of experts at higher precision while storing rarely-activated
experts at aggressive quantization, reducing total memory footprint.

MoE-CSP achieves 26x speedup through 4-bit/8-bit quantization with custom CUDA
kernels. QMoE achieves 20x memory reduction but lacks efficient 1-bit kernel support.

**Practical approach for Strix Halo**: Use Unsloth Dynamic 2.0 GGUFs, which already
implement per-layer (including per-expert) precision allocation.

---

## 5. Memory Bandwidth Optimization

### 5.1 The Fundamental Bottleneck

LLM inference (especially token generation / decode) is almost always memory-bandwidth
bound. On Strix Halo:

- **Available bandwidth**: 256 GB/s (LPDDR5X-8000, 256-bit bus)
- **Theoretical decode throughput** for a 7B Q4_0 model (~3.5 GB):
  256 GB/s / 3.5 GB = ~73 tok/s (assuming 100% utilization)
- **Measured**: ~50 t/s (tg128), implying ~68% bandwidth utilization
- **Infinity Cache effect**: The 32 MB Infinity Cache acts as a bandwidth amplifier.
When working set fits in cache, effective bandwidth can exceed 256 GB/s. For LLM
inference, per-layer weights typically exceed 32 MB, so cache benefit is limited
to KV cache and activations.
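This roofline is worth keeping as a one-liner when evaluating any model/quant combination (illustrative; assumes every weight is read once per generated token):

```bash
# Upper bound on decode speed = bandwidth / bytes read per token
bw_mbs=256000        # 256 GB/s expressed in MB/s
model_mb=3500        # 7B Q4_0 weights, ~3.5 GB
echo "$(( bw_mbs / model_mb )) tok/s ceiling"                # 73 tok/s

# Measured tg128 (~50 t/s) vs the ceiling gives utilization
echo "$(( 100 * 50 / (bw_mbs / model_mb) ))% utilization"    # 68%
```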
### 5.2 Techniques to Reduce Bandwidth Requirements

| Technique                   | Bandwidth Reduction | Status on Strix Halo |
|-----------------------------|---------------------|----------------------|
| Lower quantization (Q4->Q3) | ~25%                | Available now        |
| KV cache quantization (Q4)  | ~75% for KV reads   | Available now        |
| Speculative decoding        | 2-3x effective      | Available now        |
| Expert prediction/caching   | Variable (MoE)      | Research             |
| Weight compression (EXL3)   | Up to 8x            | No AMD support       |
| Activation checkpointing    | Reduces peak memory | Available            |
### 5.3 Speculative Decoding

The most impactful bandwidth optimization technique available today:

- **Principle**: A small, fast "draft" model generates N candidate tokens. The large
"target" model verifies all N tokens in a single forward pass (batch). Accepted
tokens are "free" -- they required no additional bandwidth from the target model.
- **Speedup**: 2-3x without accuracy loss. NVIDIA demonstrates 3.6x on H200.
- **EAGLE-3**: Lightweight autoregressive head attached to target model internals.
No separate draft model needed.
- **TurboSpec**: Closed-loop control system that dynamically adjusts speculative
parameters based on online feedback.
- **MoE-SpeQ**: Combines speculative decoding with expert prefetching.

**Relevance**: High. Speculative decoding is the single highest-impact optimization
for decode throughput on bandwidth-limited systems like Strix Halo. llama.cpp
supports speculative decoding via `--model-draft`.
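A typical llama.cpp setup pairs a large target with a small same-tokenizer draft model (a sketch; the model choices and draft parameters are illustrative and worth tuning per workload):

```bash
# Target model plus a small draft model for speculation
./llama-server \
  -m  qwen3-32b-Q4_K_M.gguf \
  -md qwen3-0.6b-Q8_0.gguf \
  --draft-max 16 --draft-min 1 \
  -ngl 99 --no-mmap
```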
### 5.4 Prefetching Strategies

- **L2 cache prefetching**: Proactively load KV cache and next-layer weights into
GPU L2 during computation. Achieves 2.15x attention kernel speedup on NVIDIA H20.
- **PRESERVE**: Prefetch model weights from HBM to on-chip cache during communication
operations. Up to 1.6x end-to-end speedup.
- **Strix Halo consideration**: The 32 MB Infinity Cache + 2 MB L2 provides limited
on-chip storage. Prefetching activations and KV cache (which are smaller than
weights) into Infinity Cache during weight reads could help.

### 5.5 Batched Inference

Batching amortizes weight-read cost across multiple requests:

- Single request: ~68% bandwidth utilization on Strix Halo
- Batch of 4: Approaches compute-bound regime for prefill; still bandwidth-bound
for decode on most models
- **Continuous batching** (vLLM, llama.cpp server): 10-20x throughput improvement
over naive batching

**Trade-off**: Batching increases throughput but also increases per-request latency
and memory consumption (KV cache scales linearly with batch size).

---
## 6. OS and Runtime Techniques

### 6.1 Memory Management

**Huge Pages**: Transparent Huge Pages (THP) reduce TLB misses for large model
weights. On Fedora 43, THP is enabled by default. For explicit control:

```bash
# Check current THP setting
cat /sys/kernel/mm/transparent_hugepage/enabled

# For llama.cpp, ensure THP is at least "madvise"
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

For models loaded with mmap, THP automatically promotes 4 KB pages to 2 MB pages,
reducing page faults during inference.

**Memory Locking**: `mlock` prevents model weights from being swapped. llama.cpp's
`--mlock` flag enables this. Critical for systems running other workloads alongside
inference.

**mmap vs direct load**: On Strix Halo with ROCm, disabling mmap (`--no-mmap`) is
recommended. mmap causes catastrophically slow model loading when GPU offloading is
active because of the double-copy path through page cache.
### 6.2 Process Pinning and NUMA

Strix Halo is a single-die APU, so NUMA topology is simple (typically 1 NUMA node).
However, CPU core affinity still matters:

```bash
# Pin inference to specific cores, keeping others free for OS
numactl --physcpubind=0-15 llama-server [args]

# Or via taskset
taskset -c 0-15 llama-server [args]
```

**Core isolation**: For minimum-jitter inference:

```bash
# Add to kernel cmdline
isolcpus=0-15 nohz_full=0-15 rcu_nocbs=0-15
```

This prevents the OS from scheduling unrelated tasks on inference cores.
### 6.3 CPU Frequency and Power

```bash
# Set performance governor for consistent throughput
sudo cpupower frequency-set -g performance

# Verify
cpupower frequency-info | grep "current CPU frequency"
```
### 6.4 cgroups v2 for Resource Isolation

Reserve memory and CPU for inference workloads:

```bash
# Enable controllers for child cgroups (set on the parent, here the root cgroup)
echo "+memory +cpu +cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Create inference cgroup
sudo mkdir /sys/fs/cgroup/inference

# Reserve 56 GB for inference (leave 8 GB for system)
echo $((56 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/inference/memory.min

# Pin CPU cores
echo "0-15" | sudo tee /sys/fs/cgroup/inference/cpuset.cpus

# Run inference in the cgroup: move the current shell in, then launch
echo $$ | sudo tee /sys/fs/cgroup/inference/cgroup.procs
llama-server [args]
```
### 6.5 io_uring for Model Loading

io_uring provides zero-copy, kernel-bypassing I/O that can accelerate initial model
loading. While llama.cpp does not natively use io_uring, the underlying mmap/read
path can benefit from io_uring-based file I/O when loading from NVMe:

- Eliminates context switch overhead during model load
- Enables true async I/O with completion ring buffers
- Most benefit when loading very large models (>32 GB) from storage

**Practical impact**: Minor for Strix Halo since model loading is a one-time cost,
and LPDDR5X bandwidth far exceeds NVMe read speeds.
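Whether the storage path is anywhere near being a bottleneck can be measured with fio's io_uring engine (a sketch; the file path is a placeholder):

```bash
# Sequential read throughput via io_uring, roughly what a cold model load sees
fio --name=modelread --ioengine=io_uring --rw=read --bs=1M \
    --direct=1 --iodepth=16 --readonly --filename=/models/model.gguf
```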
### 6.6 eBPF-Based Performance Monitoring

eBPF enables zero-instrumentation monitoring of inference workloads:

```bash
# Monitor GPU DRM scheduler jobs (works with amdgpu driver)
sudo bpftrace -e 'tracepoint:gpu_scheduler:drm_sched_job { printf("GPU job: %s\n", str(args->name)); }'

# Track page faults during model loading
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'

# Monitor context switches on inference cores
sudo bpftrace -e 'tracepoint:sched:sched_switch /cpu == 0/ { @[args->next_comm] = count(); }'
```

The eunomia project provides ready-made eBPF programs for AI workload monitoring.

---
## 7. Emerging Hardware Features

### 7.1 AMD XDNA NPU

The Ryzen AI MAX+ 395 includes an XDNA 2 NPU rated at 50 TOPS. Current status for
LLM inference:

- **Software stack**: AMD Ryzen AI Software supports ONNX model execution on the NPU.
AMD Quark provides quantization for NPU deployment (SmoothQuant, GPTQ, Quarot).
- **LLM capability**: The NPU can accelerate small models and specific operations
(attention heads, small expert networks) but cannot run full large LLMs.
- **Linux support**: Kernel 7.1 (expected 2026) brings significant XDNA upstreaming.
Current Linux support is limited compared to Windows.
- **Practical use**: The NPU could potentially handle a speculative decoding draft
model while the GPU runs the main model. This is not yet implemented in any
inference engine.

**Status**: Not viable for LLM inference in March 2026. Watch for Linux kernel 7.1
and llama.cpp NPU backend development.
### 7.2 RDNA 3.5 Matrix Cores (WMMA)

The Radeon 8060S (gfx1151) has the same WMMA instruction set as RDNA 3 (gfx11xx),
which is a generation behind RDNA 4 (gfx12xx):

**RDNA 3 / 3.5 (gfx1151) WMMA capabilities**:
- FP16/BF16: 512 FLOPS/clock/CU
- INT8: 1024 OPS/clock/CU
- 16x16 matrix dimensions
- Requires inter-lane data shuffling for chained operations

**RDNA 4 (gfx12xx) improvements over RDNA 3.5**:
- FP16/BF16: 1024 FLOPS/clock/CU (2x)
- INT8: 2048 OPS/clock/CU (2x)
- New FP8/BF8 formats at 4x the FP16 rate
- 2:4 structured sparsity support (effectively 2x more)
- No inter-lane shuffling needed for chained WMMA (major efficiency gain)
- New efficient matrix load instruction

**Current usage in llama.cpp**: WMMA is used for Flash Attention
(`GGML_HIP_ROCWMMA_FATTN`) and matrix-multiply quantized (`MMQ`) kernels. The
ROCm 7.x regression for gfx1151 (issue #17917) specifically affects MMQ kernels.
### 7.3 Vulkan Cooperative Matrices

The `VK_KHR_cooperative_matrix` Vulkan extension was merged into the RADV driver
for RDNA 3+ hardware. This provides a portable API for matrix operations that maps
to WMMA hardware:

- Enables inference engines to use matrix cores through Vulkan instead of
vendor-specific ROCm/HIP APIs
- llama.cpp's Vulkan backend could leverage this for WMMA-accelerated matrix
operations
- Currently less optimized than native HIP/ROCm paths

**Status**: Available in Mesa 25.x. Watch for llama.cpp Vulkan backend improvements
using cooperative matrices.
|
||||
|
||||
### 7.4 Infinity Cache for Inference

Strix Halo has a 32 MB Infinity Cache (MALL -- Memory Attached Last Level):

- **Architecture**: L1 (256 KB/shader array) -> L2 (2 MB) -> Infinity Cache (32 MB)
  -> LPDDR5X
- **Latency**: Slightly higher than discrete-GPU Infinity Cache implementations
- **Hit rate**: Varies by workload; graphics benchmarks show ~73% hit rate at peak.
- **LLM inference implications**: For a 7B Q4 model (~3.5 GB), per-layer weights
  are ~70-140 MB, far exceeding the 32 MB cache. The benefit is limited to:
  - KV cache for the current context (fits well for shorter contexts)
  - Activations and intermediate results
  - The embedding layer (often accessed repeatedly)
  - Small models/layers that fit entirely in cache

The Infinity Cache is most impactful as a bandwidth amplifier -- when inference
accesses exhibit temporal locality (the same data accessed multiple times within a
short window), effective bandwidth exceeds the 256 GB/s DRAM limit.

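A simple two-level model shows how hit rate amplifies effective bandwidth. The
cache-side bandwidth figure below is an assumed illustrative value, not a measured
Strix Halo number:

```python
# Two-level bandwidth model: a fraction `hit_rate` of accesses are served by the
# 32 MB Infinity Cache, the rest go to LPDDR5X at 256 GB/s.
DRAM_BW = 256e9    # bytes/s, LPDDR5X
CACHE_BW = 800e9   # bytes/s, ASSUMED MALL bandwidth for illustration

def effective_bw(hit_rate: float) -> float:
    # Average time to move one byte is the hit-rate-weighted sum of per-level costs.
    t = hit_rate / CACHE_BW + (1 - hit_rate) / DRAM_BW
    return 1 / t

for h in (0.0, 0.25, 0.5, 0.73):
    print(f"hit rate {h:.0%}: {effective_bw(h) / 1e9:.0f} GB/s")
```

Under these assumptions, the ~73% peak hit rate roughly doubles effective
bandwidth -- but weight streaming during decode has little temporal locality, so
real LLM hit rates will be far lower.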
---

## 8. Model-Level Optimizations

### 8.1 Prompt Compression

- **LLMLingua / LLMLingua-2** (Microsoft): Compresses input prompts by removing
  low-information tokens. Up to 20x compression at a ~1.5-point benchmark drop, and
  a 1.7-5.7x end-to-end inference speedup. LLMLingua-2's compressor is 3-6x faster
  than v1's. Integrated into LangChain and LlamaIndex.
- **500xCompressor**: Compresses contexts into a single special token. 6x-480x
  compression. Adds only 0.25% parameters. More aggressive but less mature.

**Relevance**: High for RAG and agentic workloads with long prompts. Reduces
both prefill time and KV cache memory.

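To make the idea concrete, here is a toy sketch of token-level prompt compression.
This is NOT the LLMLingua algorithm (which scores tokens with a small language
model's perplexity and a trained classifier); it merely drops common stopwords to
show the shape of the transformation:

```python
# Toy prompt compression: drop low-information tokens before sending the prompt
# to the inference engine. Real compressors (LLMLingua) score tokens with an LM.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "and", "that", "in", "it"}

def compress(prompt: str) -> str:
    kept = [w for w in prompt.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)

p = "the goal of the benchmark is to measure the decode throughput of the model"
print(compress(p))  # "goal benchmark measure decode throughput model"
```

The compressed prompt is shorter, so both prefill compute and KV cache footprint
shrink proportionally -- the model never sees the dropped tokens.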
### 8.2 Speculative Decoding (Model-Level)

Beyond the engine-level implementation described in Section 5.3:

- **Self-speculative decoding**: The model drafts its own tokens using early exit
  from lower layers. No separate draft model needed.
- **EAGLE-3**: An autoregressive head on the target model's internals. Higher
  acceptance rates than separate draft models.
- **Draft model latency > accuracy**: Research shows that draft model speed matters
  more than its language-modeling accuracy for overall throughput.

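The latency-over-accuracy point follows from the standard speedup model for
speculative decoding (assuming each drafted token is accepted independently with
probability `alpha`; `c` is the draft's cost relative to one target forward pass):

```python
# Expected speedup of speculative decoding with k drafted tokens per round.
# Expected tokens produced per target verification pass (including the bonus
# token sampled from the target) is the geometric sum (1 - alpha^(k+1)) / (1 - alpha).
def speedup(alpha: float, k: int, c: float) -> float:
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost = k * c + 1  # k draft passes plus one target verification pass
    return expected_tokens / cost

# A fast, mediocre draft beats an accurate but slow one:
print(round(speedup(alpha=0.6, k=4, c=0.05), 2))  # 1.92
print(round(speedup(alpha=0.8, k=4, c=0.50), 2))  # 1.12
```

Despite a much lower acceptance rate, the cheap draft wins because its per-token
cost barely dilutes the verification pass.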
### 8.3 Mixture of Depths / Mixture of Recursions

- **Mixture of Depths (MoD)**: Dynamically allocates compute to the tokens that
  need it. 2-3x inference speedup with minimal quality degradation. Implemented at
  training time -- requires model architecture support.
- **Mixture of Recursions (MoR)** (NeurIPS 2025): Combines parameter sharing with
  adaptive token-level compute. Lightweight routers assign different recursion
  depths to individual tokens. 2x inference throughput with reduced KV cache sizes.

**Relevance**: These are model architecture choices, not runtime optimizations.
Watch for models trained with MoD/MoR architectures.

### 8.4 Structured Pruning

Post-training methods that permanently remove model components:

- **Width pruning**: Remove neurons, attention heads, or embedding channels. Better
  accuracy retention than depth pruning.
- **Depth pruning**: Remove entire layers. More latency reduction per parameter
  removed.
- **LLM-Pruner, SliceGPT, FLAP**: State-of-the-art structured pruning methods.
- **AMP**: Jointly prunes attention heads and MLP neurons.
- **NIRVANA** (2025): Structured pruning reimagined for LLM compression.

**Practical approach**: Structured pruning requires per-model effort and is
generally less practical than quantization for local inference. Exception: if a
specific model is too slow at a given quantization level, pruning first and then
quantizing can yield a better speed/quality trade-off.

### 8.5 Token Merging and Pruning

- **TokenSelect** (EMNLP 2025): Dynamic token-level KV cache selection for
  efficient long-context inference and length extrapolation.
- **LightThinker**: Step-by-step compression of chain-of-thought reasoning.
- **Attention sparsity**: Twilight (NeurIPS 2025) uses hierarchical top-p pruning
  for adaptive attention sparsity.

These techniques reduce the effective sequence length during inference, directly
reducing both compute and memory bandwidth requirements.

---

## 9. Prioritized Recommendations for Strix Halo

### Tier 1: Implement Now (High Impact, Available Today)

1. **Use Unsloth Dynamic 2.0 GGUFs** for all models. They provide the best
   quality-per-bit through intelligent layer-wise quantization.

2. **Build llama.cpp with WMMA Flash Attention**: `-DGGML_HIP_ROCWMMA_FATTN=ON
   -DGGML_HIP_UMA=ON`. Monitor issue #17917 for the MMQ regression fix.

3. **Disable mmap for ROCm**: Always use `--no-mmap` (or `-mmp 0` in llama-bench)
   to avoid the double-copy performance penalty.

4. **Enable KV cache quantization**: Use `--cache-type-k q4_0 --cache-type-v q4_0`
   for long-context workloads. Watch for TurboQuant integration.

5. **Set `ROCBLAS_USE_HIPBLASLT=1`**: Forces the optimized hipBLASLt kernels.

6. **Speculative decoding for decode-heavy workloads**: Use `--model-draft` with a
   small model from the same family.

7. **GPU performance governor and frequency pinning**: Ensures consistent
   throughput.

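Pulled together, the Tier 1 items above look roughly like the following. This is a
sketch, not a verified recipe: the model path is a placeholder, and exact flag
spellings can vary between llama.cpp versions.

```sh
# Build llama.cpp with HIP, WMMA Flash Attention, and UMA for gfx1151.
cmake -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON \
      -DAMDGPU_TARGETS=gfx1151
cmake --build build -j

# Serve with hipBLASLt kernels, no mmap, and quantized KV cache.
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
    -m /path/to/model.Q4_K_M.gguf \
    --no-mmap \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    -ngl 99
```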
### Tier 2: Evaluate (Moderate Impact, Some Setup Required)

8. **LLMLingua-2 for agentic/RAG workloads**: Compress long prompts before
   inference. 3-6x prompt processing speedup.

9. **vLLM for multi-user serving**: If running concurrent inference requests
   (e.g., agentic eval pipelines), vLLM's continuous batching and PagedAttention
   provide a 10-20x throughput improvement.

10. **cgroups v2 memory reservation**: Prevent the OS from reclaiming GPU-mapped
    memory under memory pressure.

11. **Vulkan backend for short-context workloads**: Test whether the Vulkan/RADV
    path is faster than ROCm for your specific model and context length.

12. **Process pinning** with `numactl` or `taskset` for reduced scheduling jitter.

### Tier 3: Watch and Prepare (High Potential, Not Ready)

13. **KTransformers for >64 GB models**: When running DeepSeek V3 or similar models
    that exceed available memory.

14. **ExLlamaV3/EXL3 AMD support**: If AMD kernels arrive, EXL3's QTIP-based
    quantization could significantly improve quality at extreme compression.

15. **XDNA NPU for draft model acceleration**: If/when llama.cpp adds NPU backend
    support, the NPU could run the draft model for speculative decoding.

16. **SageAttention AMD port**: 2-5x attention speedup through quantized attention.

17. **Linear attention models**: Watch for hybrid softmax/linear attention models
    from major labs that would dramatically improve long-context inference.

18. **Cooperative matrices in Vulkan**: As llama.cpp's Vulkan backend matures, this
    provides a portable path to WMMA acceleration without a ROCm dependency.

---

## 10. Sources

### Papers and Conference Proceedings

- Raposo et al., "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models," 2024. https://arxiv.org/abs/2404.02258
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," ICML 2023. https://arxiv.org/abs/2305.13245
- Tseng et al., "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks," ICML 2024. https://arxiv.org/abs/2402.04396
- Egiazarian et al., "AQLM: Extreme Compression of Large Language Models via Additive Quantization," ICLR 2025. https://arxiv.org/abs/2401.06118
- Chen et al., "KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models," SOSP 2025. https://dl.acm.org/doi/10.1145/3731569.3764843
- Min et al., "Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation," NeurIPS 2025. https://arxiv.org/abs/2507.10524
- Varadarajan et al., "Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures," 2025. https://arxiv.org/abs/2504.11750
- Zandieh et al., "TurboQuant: Extreme KV Cache Quantization," ICLR 2026. https://github.com/ggml-org/llama.cpp/discussions/20969
- Agrawal et al., "SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration," ICLR 2025. https://arxiv.org/abs/2410.02367
- Ye et al., "FlashInfer: Efficient and Customizable Attention Engine for LLM Serving," 2025. https://arxiv.org/abs/2501.01005
- Jiang et al., "LLMLingua: Compressing Prompts for Accelerated Inference," EMNLP 2023. https://arxiv.org/abs/2310.05736
- Li et al., "A Survey on Inference Optimization Techniques for Mixture of Experts Models," 2024. https://arxiv.org/abs/2412.14219
- Liu et al., "MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching," 2025. https://arxiv.org/abs/2511.14102
- Zhou et al., "SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE Inference," 2025. https://arxiv.org/abs/2510.10302
- He et al., "SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints," 2025. https://arxiv.org/abs/2512.12990
- Jin et al., "OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference," 2025. https://arxiv.org/abs/2512.03927

### Documentation and Technical References

- AMD ROCm Strix Halo System Optimization: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/strixhalo.html
- AMD GPUOpen -- Using Matrix Cores of RDNA 4: https://gpuopen.com/learn/using_matrix_core_amd_rdna4/
- AMD GPUOpen -- Accelerating Generative AI on Radeon GPUs: https://gpuopen.com/learn/accelerating_generative_ai_on_amd_radeon_gpus/
- vLLM ROCm Blog: https://blog.vllm.ai/2026/02/27/rocm-attention-backend.html
- AMD ROCm vLLM Blog: https://rocm.blogs.amd.com/software-tools-optimization/vllm-omni/README.html
- AMD AI Inference on Ryzen AI NPU with Quark: https://www.amd.com/en/developer/resources/technical-articles/2025/ai-inference-acceleration-on-ryzen-ai-with-quark.html
- Chips and Cheese -- Evaluating Infinity Cache in Strix Halo: https://chipsandcheese.com/p/evaluating-the-infinity-cache-in
- Chips and Cheese -- RDNA 4 Architecture at Hot Chips 2025: https://chipsandcheese.com/p/amds-rdna4-gpu-architecture-at-hot
- Linux Kernel XDNA NPU Documentation: https://docs.kernel.org/accel/amdxdna/amdnpu.html

### Community Resources and Guides

- llama.cpp ROCm Performance Discussion: https://github.com/ggml-org/llama.cpp/discussions/15021
- llama.cpp Strix Halo UMA Detection Bug: https://github.com/ggml-org/llama.cpp/issues/18159
- llama.cpp Strix Halo Performance Regression: https://github.com/ggml-org/llama.cpp/issues/17917
- Strix Halo Wiki -- llama.cpp with ROCm: https://strixhalo.wiki/AI/llamacpp-with-ROCm
- Strix Halo Wiki -- Performance: https://strixhalo.wiki/AI/llamacpp-performance
- AMD Strix Halo Toolboxes: https://github.com/kyuz0/amd-strix-halo-toolboxes
- LLM Tracker -- AMD GPUs: https://llm-tracker.info/howto/AMD-GPUs
- LLM Tracker -- Strix Halo: https://llm-tracker.info/_TOORG/Strix-Halo
- Unsloth Dynamic 2.0 Documentation: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
- Unsloth Dynamic v2.0 Blog: https://unsloth.ai/blog/dynamic-v2
- KTransformers GitHub: https://github.com/kvcache-ai/ktransformers
- ExLlamaV3 GitHub: https://github.com/turboderp-org/exllamav3
- BitNet GitHub: https://github.com/microsoft/BitNet
- LLMLingua GitHub: https://github.com/microsoft/LLMLingua
- MoE Inference Awesome List: https://github.com/MoE-Inf/awesome-moe-inference
- Awesome LLM Inference: https://github.com/xlite-dev/Awesome-LLM-Inference
- Phoronix -- ROCm 7.1 vs Vulkan on AI PRO R9700: https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan
- eunomia -- OS-Level LLM Inference Optimizations: https://eunomia.dev/blog/2025/02/18/os-level-challenges-in-llm-inference-and-optimizations/
- RADV Cooperative Matrix for RDNA4: https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1524861-vulkan-cooperative-matrix-merged-for-rdna4-gpus-with-radv-dcc-support-inches-closer
- Kaitchup -- GGUF Quant Selection: https://kaitchup.substack.com/p/choosing-a-gguf-model-k-quants-i