# Optimization Guide
Complete walkthrough for optimizing AMD Strix Halo for LLM inference workloads. Organized in phases from essential to experimental. Each phase builds on the previous.
**Prerequisites**: Run `make audit` first to see your current state. Run `make benchmark-baseline` to capture pre-optimization performance numbers.
Track results in [optimization-log.md](optimization-log.md) as you apply each change.
---
## Phase 1: Core System (automated scripts)
These are the foundational optimizations handled by this repo's scripts. Apply in order.
### 1.1 Tuned Profile (no reboot)
```bash
sudo make optimize-tuned
```
Switches to `accelerator-performance`: restricts deep CPU idle states (C-states) and sets the CPU governor to performance.
**Measured**: +5-8% pp, +2-3% tg.
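To confirm the profile took effect (assuming `tuned` is installed and running):
```bash
tuned-adm active
# Expected: Current active profile: accelerator-performance
```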
### 1.2 Kernel Boot Parameters (reboot required)
```bash
sudo make optimize-kernel
```
| Parameter | Value (64 GB) | Purpose |
|-----------|--------------|---------|
| `iommu=pt` | -- | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB (~59 GiB) |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |
Values computed dynamically from total physical RAM. See [architecture.md](architecture.md) for the math.
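A sketch of that math for the 64 GB case. The exact reservation policy is the repo's; the 5 GiB OS reserve below is an assumption chosen because it reproduces the table values.

```bash
# Sketch: derive amdgpu.gttsize and ttm.pages_limit from total RAM.
# Assumption: ~5 GiB reserved for kernel + userspace (matches the 64 GB row).
TOTAL_MIB=65536                    # 64 GB machine
RESERVE_MIB=5120                   # leave ~5 GiB for the OS
GTT_MIB=$(( TOTAL_MIB - RESERVE_MIB ))
PAGES_LIMIT=$(( GTT_MIB * 256 ))   # 256 four-KiB pages per MiB
echo "amdgpu.gttsize=${GTT_MIB} ttm.pages_limit=${PAGES_LIMIT}"
# prints: amdgpu.gttsize=60416 ttm.pages_limit=15466496
```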
### 1.3 BIOS VRAM Reduction (reboot + BIOS access)
```bash
make optimize-vram # Prints guidance — cannot modify BIOS directly
```
Reduce UMA Frame Buffer Size from 32 GB to 512 MB. Frees 31.5 GB for GTT. See [bios-vram-guide.md](bios-vram-guide.md) (HP ZBook: F10 at boot).
**Combine 1.2 and 1.3 into a single reboot.**
### 1.4 Verify
```bash
make verify # 9-point checklist, target: 9/9
make audit # Single-screen system status with scores
```
### Phase 1 Measured Impact
| Optimization | pp | tg |
|-------------|----|----|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | Enables 37 GB+ models | +5-15% |
| **Phase 1 combined** | **+15-25%** | **+8-18%** |
Trade-off: Small models (<5 GB) are ~3-8% slower due to GTT indirection vs dedicated VRAM. Acceptable given the massive capability gain.
---
## Phase 2: System Tuning
All Phase 2 optimizations are applied with a single command:
```bash
sudo make optimize-power
```
This applies RyzenAdj, sysctl, THP, and RADV nogttspill. Sysctl and nogttspill persist across reboots. RyzenAdj and THP are volatile.
For RyzenAdj persistence across reboots:
```bash
sudo cp configs/ryzenadj-llm.service configs/ryzenadj-resume.service /etc/systemd/system/
sudo systemctl enable --now ryzenadj-llm.service
sudo systemctl enable ryzenadj-resume.service
```
### 2.1 RyzenAdj PPT Increase
The HP ZBook Ultra G1a ships at 59W sustained. RyzenAdj raises this.
**Measured on HP ZBook**: STAPM raised to 81W, but **HP firmware hard-caps PPT SLOW at 70W**. Effective sustained power: ~70W (was ~59W). Cannot be overridden without modded BIOS.
**Measured impact**: Qwen3.5-35B-A3B tg1024 went from **~39 t/s → 57 t/s (+46%)**. This was the single largest improvement in the entire optimization journey.
Thermals: 70-73C under sustained load. 30C headroom. Cooling handles it easily.
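A sketch of the RyzenAdj invocation (values in mW; the exact limits the repo's script requests are an assumption based on the numbers above — on the HP ZBook, firmware clamps PPT SLOW to ~70 W regardless of what is requested):

```bash
# Request 81 W sustained (STAPM) and 85 W burst limits.
STAPM_MW=81000
FAST_MW=85000
SLOW_MW=85000
CMD=(ryzenadj --stapm-limit="$STAPM_MW" --fast-limit="$FAST_MW" --slow-limit="$SLOW_MW")
echo "would run: sudo ${CMD[*]}"
# Drop the echo and run the command with sudo once the values are confirmed.
```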
### 2.2 VM / Sysctl Tuning
Persisted to `/etc/sysctl.d/99-llm-inference.conf`:
| Parameter | Default | Set To | Why |
|-----------|---------|--------|-----|
| `vm.swappiness` | 60 | **1** | Prevent model weight eviction |
| `vm.dirty_ratio` | 20 | **40** | Reduce I/O flush storms during inference |
| `vm.max_map_count` | 65530 | **500000** | Large models need many memory mappings |
| `vm.zone_reclaim_mode` | 0 | **0** | Don't aggressively reclaim memory zones |
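Equivalently, the drop-in file contains:
```bash
# /etc/sysctl.d/99-llm-inference.conf
vm.swappiness = 1
vm.dirty_ratio = 40
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
```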
### 2.3 Transparent Huge Pages
Set to `always`. Reduces TLB misses for mmap'd model files. Volatile — add `transparent_hugepage=always` to kernel cmdline for persistence.
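To apply by hand (the `grubby` step assumes a distro that ships it; otherwise edit the GRUB config directly):
```bash
# Volatile: takes effect now, lost on reboot
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# Persistent: append to the kernel command line (grubby-based distros)
sudo grubby --update-kernel=ALL --args="transparent_hugepage=always"
```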
### 2.4 RADV_PERFTEST=nogttspill
Persisted to `/etc/environment.d/radv-llm.conf`. Prevents GTT spill management overhead on unified memory. Vulkan RADV only.
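The persisted file is a single assignment:
```bash
# /etc/environment.d/radv-llm.conf
RADV_PERFTEST=nogttspill
```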
---
## Phase 3: Runtime Flags (per-invocation)
These flags should be used when running llama-bench, llama-server, or llama-cli. Already set in this repo's benchmark scripts.
### 3.1 `-mmp 0` (no mmap) — mandatory
On unified memory, mmap adds a double-copy penalty; always disable it. `-mmp 0` is the llama-bench spelling; llama-server and llama-cli use `--no-mmap`.
### 3.2 KV Cache Quantization — use Q4_0
**Measured** (Vulkan RADV, Qwen3.5-35B-A3B):
| KV Type | pp2048 | tg1024 | Memory Savings |
|---------|--------|--------|---------------|
| f16 | 456 | 39.8 | Baseline |
| q8_0 | 418 | 38.5 | ~50% (slightly slower on Vulkan) |
| **q4_0** | **460** | **41.1** | **75% (fastest on Vulkan)** |
Q4_0 is faster because less memory bandwidth is spent reading KV cache. Use as default for serving. Quality impact is noticeable only on reasoning-heavy tasks.
### 3.3 Flash Attention (`-fa 1`) — always enable
+24% pp on ROCm. Modest improvement on Vulkan (CoopMat1). Already in benchmark scripts.
### 3.4 `ROCBLAS_USE_HIPBLASLT=1` (ROCm only) — mandatory
Without this, ROCm pp is 2-7x slower on gfx1151. Already in benchmark scripts.
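Putting 3.1–3.4 together, a typical invocation looks like this (the model path is illustrative):
```bash
# Recommended runtime flags combined:
#   -mmp 0            disable mmap (avoids the unified-memory double copy)
#   -fa 1             flash attention on
#   -ctk/-ctv q4_0    quantized KV cache
# ROCBLAS_USE_HIPBLASLT=1 is a no-op on Vulkan but mandatory on ROCm.
ROCBLAS_USE_HIPBLASLT=1 llama-bench \
  -m models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -mmp 0 -fa 1 -ctk q4_0 -ctv q4_0
```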
### 3.5 Backend Selection
**Measured** (Qwen3.5-35B-A3B Q8, 128K context):
| Workload | Vulkan RADV | ROCm 7.2 | Winner |
|----------|------------|----------|--------|
| pp2048 | 456 | 445 | Vulkan (+2%) |
| tg1024 | 39.8 | 21.5 | **Vulkan (1.9x)** |
| pp8192 @ 131K | 130 | 84 | **Vulkan (1.5x)** |
| tg32 @ 131K | 22.0 | 8.1 | **Vulkan (2.7x)** |
**Vulkan RADV dominates across all workloads on this hardware.** ROCm is significantly slower for token generation, especially at long context. Never use AMDVLK.
### 3.6 MoE Batch Size `-b 256` — NOT YET TESTED
Community reports up to +70% pp improvement for MoE models. Default batch (2048) may be too large. Needs benchmarking.
---
## Phase 4: Build Optimizations (not yet tested)
These require rebuilding the llama.cpp toolbox containers. Given that **Vulkan RADV already outperforms ROCm significantly**, the ROI of these ROCm-specific optimizations is unclear.
### 4.1 ROCm Build Flags
```bash
cmake -B build \
  -DGGML_HIP=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGGML_HIP_UMA=ON \
  -DAMDGPU_TARGETS=gfx1151
```
`GGML_HIP_ROCWMMA_FATTN` enables GPU-accelerated flash attention via WMMA. Could close the gap between ROCm and Vulkan for long-context workloads. Check if Donato Capitella's ROCm toolboxes already include this.
### 4.2 Vulkan Cooperative Matrices
RADV supports `VK_KHR_cooperative_matrix` for RDNA 3+. Already used by llama.cpp for matrix operations. The current Vulkan toolbox shows `matrix cores: KHR_coopmat` — this is likely already active.
---
## Phase 5: Future / Currently Blocked
### 5.1 Speculative Decoding
Expected 1.8-2.5x tg speedup. Draft model (`Qwen3.5-0.8B-Q8_0.gguf`) downloaded.
**Blocked**: llama.cpp PR #20075 — hybrid SSM/MoE speculative rollback.
### 5.2 Native Multi-Token Prediction
Qwen3.5 has built-in MTP heads. No draft model needed.
**Blocked**: llama.cpp PR #20700 — MTP for Qwen3.5.
### 5.3 GPU Clock Reporting Fix
ROCm issue #5750 reports GPU clocks appearing stuck at ~885 MHz in sysfs. However, **clpeak confirms clocks reach 2900 MHz under compute load** (measured 2026-03-30). The issue is likely broken sysfs reporting, not actual clock throttling. No performance impact.
**Tracking**: ROCm issue #5750 (sysfs reporting only, not a real blocker).
### 5.4 Other Future Items
- **SageAttention**: 2-5x over FlashAttention. No AMD port yet.
- **XDNA NPU** (50 TOPS): Linux support coming in kernel 7.1. Could run draft model for speculative decoding.
- **TurboQuant 3-bit KV** (ICLR 2026): 4.9x KV compression. Being integrated into llama.cpp.
- **LLMLingua-2**: 20x prompt compression for agentic/RAG workloads.
---
## Hardware Limits (measured via clpeak on this system, 2026-03-30)
| Resource | Value | Notes |
|----------|-------|-------|
| DRAM bandwidth | **216-233 GB/s** | clpeak float: 215.7, float16: 232.6. Theoretical max: 256 GB/s. |
| Infinity Cache | **32 MB, ~1 TB/s** | MoE effective throughput exceeds DRAM BW due to cache hits |
| GPU clocks | **2900 MHz confirmed** | clpeak shows clocks reach 2900 MHz under load. ROCm #5750 sysfs reporting may be broken but actual clocks are correct. |
| FP16 compute | **21.9 TFLOPS** | 40 CUs (20 WGPs) x 2900 MHz |
| FP32 compute | **12.3 TFLOPS** | |
| LPDDR5X-8000 | 8000 MT/s, 256-bit | Soldered, no overclocking |
| Infinity Fabric | 2 GHz FCLK | Fixed |
| HP ZBook sustained power | **70W** (firmware cap) | RyzenAdj can set 85W but HP overrides to 70W |
**Note on MoE bandwidth**: MoE models (3B active params) read ~13.5 GB per token at Q4. At 55 t/s this implies ~740 GB/s effective throughput — well above the 216 GB/s DRAM ceiling. The 32 MB Infinity Cache (~1 TB/s) is actively boosting throughput for repeated KV cache and activation reads.
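The arithmetic behind that note, spelled out:

```bash
# Back-of-envelope: effective GB/s = GB read per token x tokens per second
EFF=$(awk 'BEGIN { printf "%.1f", 13.5 * 55 }')
echo "effective throughput: ${EFF} GB/s"
# prints: effective throughput: 742.5 GB/s  (vs. the 216 GB/s DRAM ceiling)
```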
---
## Performance Summary (all measured, Vulkan RADV, q4_0 KV)
### Model Shootout (pp2048/tg1024, Phase 1+2 applied)
| Model | Arch | Size | pp2048 | tg1024 |
|-------|------|------|--------|--------|
| **Qwen3-Coder-30B** UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | **61.0** |
| **Qwen3.5-35B-A3B** UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | **821** | 54.9 |
| **Nemotron-Cascade-2** Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| **Qwen3-Coder-Next** UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
tg speed is bandwidth-bound: the fewer bytes read per token (active weights plus KV cache), the faster the generation.
### Optimization Journey (Qwen3.5-35B-A3B, tg on Vulkan)
| Stage | tg (t/s) | Improvement |
|-------|----------|-------------|
| Pre-optimization (stock) | ~33 | Baseline |
| After Phase 1 (tuned + kernel + BIOS) | ~39 | +18% |
| After Phase 2 (RyzenAdj + sysctl + THP) | **~57** | **+46%** |
---
## Rollback
```bash
sudo make rollback # Restores GRUB backup and previous tuned profile
```
Phase 2 rollback:
- RyzenAdj: `sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=60000`
- Sysctl: `sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system`
- THP: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`
- nogttspill: `sudo rm /etc/environment.d/radv-llm.conf`
- BIOS VRAM: revert manually (F10 at boot)
---
## Further Reading
- [Optimization log](optimization-log.md) — Detailed test results and verdicts
- [Hardware analysis](llama-cpp-optimization-research.md) — llama.cpp flags, backends, quantization deep dive
- [Inference landscape](inference-optimization-landscape.md) — Broader survey of engines and techniques
- [Benchmarking guide](benchmarking.md) — Methodology and result interpretation
- [References](references.md) — All external links