docs: update optimization guide with measured hardware data

Replace estimated values with clpeak measurements: DRAM 216-233 GB/s,
GPU clocks confirmed 2900 MHz under load (ROCm #5750 is sysfs reporting
only). Correct backend recommendation to Vulkan RADV (2.7x faster tg
than ROCm at 131K). Update KV cache recommendation to q4_0. Add
Nemotron-Cascade-2 to coder shootout results. Remove Nemotron-3-Nano
from catalog (replaced by Cascade-2). Update Q4_K_L to Q4_K_XL entry.
Felipe Cardoso
2026-03-30 19:56:18 +02:00
parent 1549bc27c0
commit ea70687cd2
3 changed files with 157 additions and 208 deletions


@@ -18,10 +18,9 @@ glm-4.7-flash-q6|lmstudio-community/GLM-4.7-Flash-GGUF|GLM-4.7-Flash-Q6_K.gguf|2
qwen3.5-27b-q4|unsloth/Qwen3.5-27B-GGUF|Qwen3.5-27B-Q4_K_M.gguf|17|dense|Dense 27B, quality-first
# ── MoE models (fast generation, best for 64GB) ─────────
qwen3.5-35b-a3b-q4|unsloth/Qwen3.5-35B-A3B-GGUF|Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf|21|moe|MoE 35B, 3B active, Unsloth dynamic XL
qwen3.5-35b-a3b-q8|unsloth/Qwen3.5-35B-A3B-GGUF|Qwen3.5-35B-A3B-Q8_0.gguf|37|moe|MoE 35B Q8, near-full precision
nemotron-cascade2-q8|bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF|nvidia_Nemotron-Cascade-2-30B-A3B-Q8_0.gguf|31|moe|Nemotron Cascade 2, Mamba-2 hybrid (replaces Nano)
# ── Coding models ─────────────────────────────────────────
qwen3-coder-30b-a3b-q6|unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF|Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL.gguf|26|moe|Agentic coding MoE, pure Transformer


@@ -143,11 +143,12 @@ Living document tracking what was applied, tested, and the actual results. Each
- **Status**: BLOCKED — llama.cpp PR #20700
- **Last checked**: 2026-03-27 — WIP, not expected to merge soon
### 5.3 GPU Clock Reporting
- **Status**: NOT A REAL ISSUE — sysfs reporting is broken, actual clocks are fine
- **Measured**: clpeak (2026-03-30) confirms GPU reaches 2900 MHz under compute load
- **Notes**: ROCm issue #5750 is about sysfs `pp_dpm_sclk` reporting, not actual performance. No action needed.
- **Verdict**: CLOSED — no performance impact
---
@@ -187,7 +188,31 @@ Living document tracking what was applied, tested, and the actual results. Each
| **UD-Q4_K_XL** | 20.7 GB | 835 | **56.4** | **Daily driver** — best quality/speed. |
| Q8_0 | 34.4 GB | 850 | 51.4 | Best quality, 10% slower tg. |
**Decision**: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). Q4_K_L deleted — Q4_K_XL is strictly better at only +2 GB.
### Coder Model Shootout (2026-03-30)
- **Benchmark**: `data/benchmarks/coder-shootout-*`
- **Config**: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
- **RyzenAdj**: STAPM=81W (sustained ~70W)
| Model | Architecture | File Size | pp2048 (t/s) | tg1024 (t/s) |
|-------|-------------|-----------|-------------|-------------|
| **Qwen3-Coder-30B** UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | **61.0** |
| **Qwen3.5-35B-A3B** UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | **821** | 54.9 |
| **Nemotron-Cascade-2** Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| **Qwen3-Coder-Next** UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
**Analysis**:
- tg speed scales inversely with model size (bandwidth-bound at ~215 GB/s)
- Pure Transformer (Qwen3-Coder-30B) has lowest overhead per token
- DeltaNet hybrid (Qwen3.5) has best pp — DeltaNet layers are efficient for prefill
- Qwen3-Coder-Next (80B at 3-bit) is 25% slower tg but has >70% SWE-bench vs ~50% for the 30B
**Recommended roles**:
- **Qwen3-Coder-30B**: Interactive tool-use / function-calling loops (fastest tg, purpose-built)
- **Qwen3.5-35B-A3B**: General tasks, long prompt processing (best pp, best all-rounder)
- **Qwen3-Coder-Next**: Complex multi-file coding tasks where quality > speed
---


@@ -18,13 +18,9 @@ These are the foundational optimizations handled by this repo's scripts. Apply i
sudo make optimize-tuned
```
Switches to `accelerator-performance`: disables higher-latency CPU STOP states, sets CPU governor to performance.
Takes effect immediately. Previous profile is saved for rollback.
**Measured**: +5-8% pp, +2-3% tg.
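The switch can be spot-checked by hand. A minimal sketch, assuming the `tuned` daemon is installed (which `make optimize-tuned` requires anyway):

```shell
# Verify the profile that `make optimize-tuned` applies
tuned-adm active    # should report: Current active profile: accelerator-performance
# The profile forces the performance governor; confirm on one core:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor    # expect: performance
```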
### 1.2 Kernel Boot Parameters (reboot required)
@@ -32,15 +28,13 @@ Takes effect immediately. Previous profile is saved for rollback.
sudo make optimize-kernel
```
Adds three parameters to GRUB:
| Parameter | Value (64 GB) | Purpose |
|-----------|--------------|---------|
| `iommu=pt` | -- | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB (~59 GiB) |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |
Values computed dynamically from total physical RAM. See [architecture.md](architecture.md) for the math.
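The 64 GB values in the table can be reproduced with shell arithmetic. This is a sketch of the math only; the ~5 GiB OS reserve is inferred from the published numbers, not necessarily the script's exact formula:

```shell
# Reproduce the 64 GB values (assumed: GTT = total RAM minus ~5 GiB OS
# reserve; ttm.pages_limit = GTT expressed in 4 KiB pages).
total_mib=65536                     # 64 GB system
gtt_mib=$(( total_mib - 5120 ))     # 60416 MiB (~59 GiB)
pages_limit=$(( gtt_mib * 256 ))    # 256 pages of 4 KiB per MiB -> 15466496
echo "amdgpu.gttsize=${gtt_mib} ttm.pages_limit=${pages_limit}"
```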
### 1.3 BIOS VRAM Reduction (reboot + BIOS access)
@@ -48,11 +42,9 @@ Values are computed dynamically based on your system's total physical RAM. The s
make optimize-vram # Prints guidance — cannot modify BIOS directly
```
Reduce UMA Frame Buffer Size from 32 GB to 512 MB. Frees 31.5 GB for GTT. See [bios-vram-guide.md](bios-vram-guide.md) (HP ZBook: F10 at boot).
**Combine 1.2 and 1.3 into a single reboot.**
### 1.4 Verify
@@ -61,264 +53,201 @@ make verify # 9-point checklist, target: 9/9
make audit # Single-screen system status with scores
```
### Phase 1 Measured Impact
| Optimization | pp | tg |
|-------------|----|----|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | Enables 37 GB+ models | +5-15% |
| **Phase 1 combined** | **+15-25%** | **+8-18%** |
Trade-off: Small models (<5 GB) are ~3-8% slower due to GTT indirection vs dedicated VRAM. Acceptable given the massive capability gain.
---
## Phase 2: System Tuning
These require root but are safe to apply and revert.
All Phase 2 optimizations are applied with a single command:
```bash
sudo make optimize-power
```
This applies RyzenAdj, sysctl, THP, and RADV nogttspill. Sysctl and nogttspill persist across reboots. RyzenAdj and THP are volatile.
For RyzenAdj persistence across reboots:
```bash
sudo cp configs/ryzenadj-llm.service configs/ryzenadj-resume.service /etc/systemd/system/
sudo systemctl enable --now ryzenadj-llm.service
sudo systemctl enable ryzenadj-resume.service
```
### 2.1 RyzenAdj PPT Increase
The HP ZBook Ultra G1a ships at 59W sustained. RyzenAdj raises this.
**Measured on HP ZBook**: STAPM raised to 81W, but **HP firmware hard-caps PPT SLOW at 70W**. Effective sustained power: ~70W (was ~59W). Cannot be overridden without modded BIOS.
**Measured impact**: Qwen3.5-35B-A3B tg1024 went from **~39 t/s → 57 t/s (+46%)**. This was the single largest improvement in the entire optimization journey.
Thermals: 70-73C under sustained load. 30C headroom. Cooling handles it easily.
### 2.2 VM / Sysctl Tuning
Persisted to `/etc/sysctl.d/99-llm-inference.conf`:

```
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
```
| Parameter | Default | Set To | Why |
|-----------|---------|--------|-----|
| `vm.swappiness` | 60 | **1** | Prevent model weight eviction |
| `vm.dirty_ratio` | 20 | **40** | Reduce I/O flush storms during inference |
| `vm.dirty_background_ratio` | 10 | **10** | Keep background writeback at default |
| `vm.max_map_count` | 65530 | **500000** | Large models need many memory mappings |
| `vm.zone_reclaim_mode` | 0 | **0** | Don't aggressively reclaim memory zones |
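A quick read-only check that the table's values are live (a sketch, not a repo script; needs no root):

```shell
# Compare each expected key=value pair against the running kernel
for kv in vm.swappiness=1 vm.dirty_ratio=40 vm.dirty_background_ratio=10 \
          vm.max_map_count=500000 vm.zone_reclaim_mode=0; do
  key=${kv%%=*}; want=${kv#*=}
  got=$(sysctl -n "$key")
  [ "$got" = "$want" ] && echo "OK   $key=$got" || echo "WARN $key=$got (want $want)"
done
```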
### 2.3 Transparent Huge Pages
Set to `always`. Reduces TLB misses for mmap'd model files. Volatile — add `transparent_hugepage=always` to kernel cmdline for persistence.
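To confirm THP is active and actually backing a loaded model, two read-only checks suffice:

```shell
cat /sys/kernel/mm/transparent_hugepage/enabled   # expect: [always] madvise never
grep AnonHugePages /proc/meminfo                  # grows to tens of GB with a model loaded
```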
### 2.4 RADV_PERFTEST=nogttspill
Persisted to `/etc/environment.d/radv-llm.conf`. Prevents GTT spill management overhead on unified memory. Vulkan RADV only.
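For reference, the equivalent manual steps (`make optimize-power` already does this; the file path matches the one above):

```shell
# Per-session only (child processes inherit it):
export RADV_PERFTEST=nogttspill
# Persist system-wide, as the repo does:
echo 'RADV_PERFTEST=nogttspill' | sudo tee /etc/environment.d/radv-llm.conf
```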
---
## Phase 3: Runtime Flags (per-invocation)
These flags should be used when running llama-bench, llama-server, or llama-cli. Already set in this repo's benchmark scripts.
### 3.1 `-mmp 0` (no mmap) — mandatory
On unified memory, mmap adds a double-copy penalty. Always disable.
### 3.2 KV Cache Quantization — use Q4_0
**Measured** (Vulkan RADV, Qwen3.5-35B-A3B):
| KV Type | pp2048 | tg1024 | Memory Savings |
|---------|--------|--------|---------------|
| f16 | 456 | 39.8 | Baseline |
| q8_0 | 418 | 38.5 | ~50% (slightly slower on Vulkan) |
| **q4_0** | **460** | **41.1** | **75% (fastest on Vulkan)** |
Q4_0 is faster because less memory bandwidth is spent reading KV cache. Use as default for serving. Quality impact is noticeable only on reasoning-heavy tasks.
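A serving invocation reflecting this recommendation might look like the following sketch (the model path is a placeholder; the other flags are the ones used throughout this guide):

```shell
llama-server -m model.gguf -ngl 99 -fa 1 --no-mmap \
  --cache-type-k q4_0 --cache-type-v q4_0
```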
### 3.3 Flash Attention (`-fa 1`) — always enable
+24% pp on ROCm. Modest improvement on Vulkan (CoopMat1). Already in benchmark scripts.
### 3.4 `ROCBLAS_USE_HIPBLASLT=1` (ROCm only) — mandatory
Without this, ROCm pp is 2-7x slower on gfx1151. Already in benchmark scripts.
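Taken together, 3.1-3.4 give the baseline ROCm benchmark invocation, roughly (a sketch; model path is a placeholder):

```shell
ROCBLAS_USE_HIPBLASLT=1 llama-bench -m model.gguf -ngl 99 -fa 1 -mmp 0
```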
### 3.5 Backend Selection
**Measured** (Qwen3.5-35B-A3B Q8, 128K context):
| Workload | Vulkan RADV | ROCm 7.2 | Winner |
|----------|------------|----------|--------|
| pp2048 | 456 | 445 | Vulkan (+2%) |
| tg1024 | 39.8 | 21.5 | **Vulkan (1.9x)** |
| pp8192 @ 131K | 130 | 84 | **Vulkan (1.5x)** |
| tg32 @ 131K | 22.0 | 8.1 | **Vulkan (2.7x)** |
**Vulkan RADV dominates across all workloads on this hardware.** ROCm is significantly slower for token generation, especially at long context. Never use AMDVLK.
### 3.6 MoE Batch Size `-b 256` — NOT YET TESTED
Community reports up to +70% pp improvement for MoE models. Default batch (2048) may be too large. Needs benchmarking.
---
## Phase 4: Build Optimizations (not yet tested)
These require rebuilding the llama.cpp toolbox containers. Given that **Vulkan RADV already outperforms ROCm significantly**, the ROI of these ROCm-specific optimizations is unclear.
### 4.1 ROCm Build Flags
```bash
cmake -B build \
-DGGML_HIP=ON \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DGGML_HIP_UMA=ON \
-DAMDGPU_TARGETS=gfx1151
```
`GGML_HIP_ROCWMMA_FATTN` enables GPU-accelerated flash attention via WMMA. Could close the gap between ROCm and Vulkan for long-context workloads. Check if Donato Capitella's ROCm toolboxes already include this.
### 4.2 Vulkan Cooperative Matrices
RADV supports `VK_KHR_cooperative_matrix` for RDNA 3+. Already used by llama.cpp for matrix operations. The current Vulkan toolbox shows `matrix cores: KHR_coopmat` — this is likely already active.
---
## Phase 5: Future / Currently Blocked
These optimizations are not available today but are worth tracking.
### 5.1 Speculative Decoding (blocked: llama.cpp PR #20075)
Expected 1.8-2.5x tg speedup for coding tasks. Draft model (`Qwen3.5-0.8B-Q8_0.gguf`, 812 MB) already downloaded. Blocked because Qwen3.5 MoE uses hybrid GatedDeltaNet architecture that breaks llama.cpp's speculative rollback mechanism.
**Track**: [llama.cpp PR #20075](https://github.com/ggml-org/llama.cpp/pull/20075) — fix for hybrid SSM/MoE speculative decoding.
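Once supported, the invocation would presumably pair the main and draft models along the lines of llama-server's existing speculative-decoding flags (a sketch only, untested on this architecture; paths are placeholders):

```shell
llama-server -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -md Qwen3.5-0.8B-Q8_0.gguf --draft-max 16 --draft-min 4 \
  -ngl 99 -fa 1 --no-mmap
```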
### 5.2 Native Multi-Token Prediction (blocked: llama.cpp PR #20700)
Qwen3.5 was trained with built-in MTP heads. No separate draft model needed. Works in vLLM/SGLang today but not llama.cpp.
**Track**: [llama.cpp PR #20700](https://github.com/ggml-org/llama.cpp/pull/20700) — MTP for Qwen3.5 with FastMTP vocabulary trimming.
### 5.3 GPU Clock Reporting Fix
ROCm issue #5750 reports GPU clocks appearing stuck at ~885 MHz in sysfs. However, **clpeak confirms clocks reach 2900 MHz under compute load** (measured 2026-03-30). The issue is likely broken sysfs reporting, not actual clock throttling. No performance impact.
**Tracking**: ROCm issue #5750 (sysfs reporting only, not a real blocker).
### 5.4 Other Future Items
- **SageAttention**: 2-5x over FlashAttention. No AMD port yet.
- **XDNA NPU** (50 TOPS): Linux support coming in kernel 7.1. Could run draft model for speculative decoding.
- **TurboQuant 3-bit KV** (ICLR 2026): 4.9x KV compression. Being integrated into llama.cpp.
- **LLMLingua-2**: 20x prompt compression for agentic/RAG workloads.
---
## Hardware Limits (measured via clpeak on this system, 2026-03-30)
| Resource | Value | Notes |
|----------|-------|-------|
| DRAM bandwidth | **216-233 GB/s** | clpeak float: 215.7, float16: 232.6. Theoretical max: 256 GB/s. |
| Infinity Cache | **32 MB, ~1 TB/s** | MoE effective throughput exceeds DRAM BW due to cache hits |
| GPU clocks | **2900 MHz confirmed** | clpeak shows clocks reach 2900 MHz under load. ROCm #5750 sysfs reporting may be broken but actual clocks are correct. |
| FP16 compute | **21.9 TFLOPS** | 20 CUs x 2900 MHz |
| FP32 compute | **12.3 TFLOPS** | |
| LPDDR5X-8000 | 8000 MT/s, 256-bit | Soldered, no overclocking |
| Infinity Fabric | 2 GHz FCLK | Fixed |
| HP ZBook sustained power | **70W** (firmware cap) | RyzenAdj can set 85W but HP overrides to 70W |
**Note on MoE bandwidth**: MoE models (3B active params) read ~13.5 GB per token at Q4. At 55 t/s this implies ~740 GB/s effective throughput — well above the 216 GB/s DRAM ceiling. The 32 MB Infinity Cache (~1 TB/s) is actively boosting throughput for repeated KV cache and activation reads.
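The note's arithmetic checks out; a sanity-check sketch using the assumed inputs above (~13.5 GB read per token at Q4, ~55 t/s decode):

```shell
# Effective throughput implied by MoE decode speed
awk 'BEGIN {
  eff = 13.5 * 55    # GB read per token x tokens per second
  printf "%.1f GB/s effective, %.1fx the 216 GB/s DRAM ceiling\n", eff, eff / 216
}'
# -> 742.5 GB/s effective, 3.4x the 216 GB/s DRAM ceiling
```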
---
## Performance Summary (all measured, Vulkan RADV, q4_0 KV)
### Model Shootout (pp2048/tg1024, Phase 1+2 applied)
| Model | Arch | Size | pp2048 | tg1024 |
|-------|------|------|--------|--------|
| **Qwen3-Coder-30B** UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | **61.0** |
| **Qwen3.5-35B-A3B** UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | **821** | 54.9 |
| **Nemotron-Cascade-2** Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| **Qwen3-Coder-Next** UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
tg speed is bandwidth-bound: smaller model = faster tokens.
### Optimization Journey (Qwen3.5-35B-A3B, tg on Vulkan)
| Stage | tg (t/s) | Improvement |
|-------|----------|-------------|
| Pre-optimization (stock) | ~33 | Baseline |
| After Phase 1 (tuned + kernel + BIOS) | ~39 | +18% |
| After Phase 2 (RyzenAdj + sysctl + THP) | **~57** | **+46%** |
---
@@ -328,23 +257,19 @@ Understanding what is fixed helps avoid wasted effort.
sudo make rollback # Restores GRUB backup and previous tuned profile
```
Phase 2 rollback:
- RyzenAdj: `sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=60000`
- Sysctl: `sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system`
- THP: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`
- nogttspill: `sudo rm /etc/environment.d/radv-llm.conf`
- BIOS VRAM: revert manually (F10 at boot)
---
## Troubleshooting
If anything goes wrong, see [troubleshooting.md](troubleshooting.md).
## Further Reading
- [Hardware analysis](llama-cpp-optimization-research.md) — llama.cpp flags, backends, quantization deep dive
- [Inference landscape](inference-optimization-landscape.md) — Broader survey of engines and techniques
- [Benchmarking guide](benchmarking.md) — Methodology and result interpretation
- [References](references.md) — All external links