# Optimization Log

A living document tracking what was applied, what was tested, and the actual results. Each entry records the change, the benchmark evidence, and a verdict.

**Verdicts**: KEEP (applied permanently), REVERTED (tested, didn't help), PENDING (not yet tested), BLOCKED (can't test yet).

---

## Phase 1: Core System
### 1.1 Tuned Profile: accelerator-performance

- **Date**: 2026-03-26
- **Change**: `sudo tuned-adm profile accelerator-performance`
- **Benchmark**: `data/benchmarks/after-tuned-*`
- **Result**: +5-8% pp improvement, +2-3% tg improvement
- **Verdict**: KEEP
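
To re-apply and confirm the profile took effect, a minimal sketch (`active` and `verify` are standard `tuned-adm` subcommands):

```shell
# Switch profiles, then confirm the change actually took effect.
sudo tuned-adm profile accelerator-performance
tuned-adm active    # reports the currently active profile
tuned-adm verify    # re-checks that the profile's settings are in place
```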
### 1.2 Kernel Boot Parameters

- **Date**: 2026-03-26
- **Change**: `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
- **Result**: Combined with BIOS VRAM change. Large models now fit in GTT. Peak usage 38.8/59 GiB.
- **Verdict**: KEEP
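
One way to persist these parameters is `grubby` (an assumption: a grubby-managed bootloader such as Fedora's; plain GRUB setups edit `GRUB_CMDLINE_LINUX` instead):

```shell
# Append the parameters to every installed kernel's cmdline.
sudo grubby --update-kernel=ALL \
  --args="iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496"
# After a reboot, confirm they are live:
grep -o 'amdgpu.gttsize=[0-9]*' /proc/cmdline
```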
### 1.3 BIOS VRAM Reduction (512 MB)

- **Date**: 2026-03-26
- **Change**: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
- **Result**: 31.5 GB freed for OS/GTT. Small models are ~3-8% slower (GTT indirection vs dedicated VRAM), but the system gained the ability to run 37 GB+ models at 32K+ context. Net positive.
- **Trade-off**: The small-model regression is acceptable given the massive capability gain.
- **Verdict**: KEEP
---

## Phase 2: System Tuning
### 2.1 RyzenAdj 85W PPT

- **Date**: PENDING
- **Change**: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000`
- **Expected**: +12-19% CPU/GPU throughput (community data from Strix Halo Wiki)
- **Benchmark**: Not yet run
- **Notes**: HP ZBook ships at 60W. 85W is the community-recommended sweet spot.
- **Verdict**: PENDING
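
When this gets tested, the limits can be read back immediately; a sketch (note that `ryzenadj` changes do not survive a reboot or suspend, so a KEEP verdict would also need something like a systemd oneshot unit):

```shell
# Raise the STAPM/fast/slow limits to 85 W, then read the applied values back.
sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000
sudo ryzenadj --info | grep -iE 'stapm|ppt'
```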
### 2.2 VM Sysctl Tuning

- **Date**: PENDING
- **Change**: `vm.swappiness=1, vm.dirty_ratio=40, vm.max_map_count=500000`
- **Expected**: Prevent model weight eviction, reduce I/O disruption
- **Benchmark**: Not yet run
- **Verdict**: PENDING
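
When applying, a drop-in file keeps the values across reboots; a sketch (the file name is arbitrary):

```shell
# Write the values to a persistent drop-in, then load them.
sudo tee /etc/sysctl.d/99-llm-tuning.conf >/dev/null <<'EOF'
vm.swappiness = 1
vm.dirty_ratio = 40
vm.max_map_count = 500000
EOF
sudo sysctl --system     # reloads every sysctl config file
sysctl vm.swappiness     # spot-check one value
```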
### 2.3 Transparent Huge Pages

- **Date**: PENDING
- **Change**: `transparent_hugepage=always`
- **Expected**: Faster model load time, possible 1-5% tg improvement from reduced TLB misses
- **Benchmark**: Not yet run
- **Verdict**: PENDING
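
THP can also be flipped at runtime, which makes for a cheaper A/B test than a reboot; a sketch using the standard sysfs knob:

```shell
# Show the current mode (the active value is shown in brackets), then switch it.
cat /sys/kernel/mm/transparent_hugepage/enabled
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

The `transparent_hugepage=always` boot parameter makes the same setting permanent.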
### 2.4 RADV_PERFTEST=nogttspill

- **Date**: PENDING
- **Change**: `export RADV_PERFTEST=nogttspill`
- **Expected**: Fix pp degradation on Vulkan RADV (community-reported fix for Strix Halo)
- **Benchmark**: Not yet run — needs Vulkan-specific benchmark comparison
- **Verdict**: PENDING
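
The comparison itself is just the same benchmark with and without the variable; a sketch (the model path is a placeholder):

```shell
# Baseline vs. nogttspill on the Vulkan build; compare the pp rows.
./llama-bench -m models/example.gguf -p 4096 -n 1024
RADV_PERFTEST=nogttspill ./llama-bench -m models/example.gguf -p 4096 -n 1024
```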
### 2.5 amdgpu.noretry=0

- **Date**: PENDING
- **Change**: Kernel cmdline `amdgpu.noretry=0`
- **Expected**: Improved stability under memory pressure
- **Notes**: Only apply if experiencing GPU page faults or crashes during large model loading
- **Verdict**: PENDING
---

## Phase 3: Runtime Flags
### 3.1 KV Cache Quantization

- **Date**: PENDING (sweep running)
- **Change**: `-ctk q8_0 -ctv q8_0` / `-ctk q4_0 -ctv q4_0`
- **Benchmark**: `data/benchmarks/kv-sweep-128k-*` (in progress)
- **Expected**: Q8_0: ~50% less KV memory, negligible quality loss. Q4_0: ~75% less, noticeable quality impact.
- **Verdict**: PENDING
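
The memory side of the trade-off can be estimated before the sweep finishes. A back-of-the-envelope sketch (the model shape is hypothetical, not taken from the sweep): f16 stores 2 bytes per element, while q8_0 packs each 32-element block into 34 bytes and q4_0 into 18.

```shell
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elt.
# Hypothetical dense-model shape: 48 layers, 8 KV heads, head_dim 128, 128K context.
awk 'BEGIN {
  elts = 2 * 48 * 8 * 128 * 131072;
  printf "f16:  %.2f GiB\n", elts * 2       / 2^30;   # 24.00
  printf "q8_0: %.2f GiB\n", elts * 34 / 32 / 2^30;   # 12.75
  printf "q4_0: %.2f GiB\n", elts * 18 / 32 / 2^30;   #  6.75
}'
```

That works out to ~47% savings for q8_0 and ~72% for q4_0, consistent with the rough 50%/75% figures above.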
### 3.2 MoE Batch Size `-b 256`

- **Date**: PENDING
- **Change**: Add `-b 256` to MoE benchmark runs
- **Expected**: Up to +70% pp improvement for MoE models (community benchmarks)
- **Benchmark**: Not yet run
- **Verdict**: PENDING
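
The run itself is a one-flag change; a sketch (the model path is a placeholder, and the first invocation uses llama-bench's default batch size):

```shell
# Default batch vs. -b 256 on a MoE model; compare the pp rows.
./llama-bench -m models/moe-example.gguf -p 4096 -n 1024
./llama-bench -m models/moe-example.gguf -p 4096 -n 1024 -b 256
```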
---

## Phase 4: Build Optimizations
### 4.1 rocWMMA Flash Attention

- **Date**: PENDING
- **Change**: Rebuild ROCm toolbox with `-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`
- **Expected**: +96% long-context performance (65K+)
- **Notes**: Need to check if Donato's toolboxes already include this
- **Verdict**: PENDING
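
If the toolboxes don't already include it, the rebuild is two cmake invocations; a sketch (assumes ROCm dev packages are present in the toolbox; `GGML_HIP=ON` and `AMDGPU_TARGETS` are the standard llama.cpp HIP build options, with gfx1151 as the Strix Halo target):

```shell
# Configure with rocWMMA flash attention and UMA support, then build.
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 \
      -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON \
      -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)"
```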
### 4.2 rocWMMA Tuned Patch (PR #16827)

- **Date**: PENDING
- **Notes**: Fixes long-context regression. Check Donato's latest toolbox builds.
- **Verdict**: PENDING
---

## Phase 5: Future / Blocked
### 5.1 Speculative Decoding

- **Status**: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE fix)
- **Draft model**: Downloaded `Qwen3.5-0.8B-Q8_0.gguf` (812 MB) on 2026-03-27
- **Last checked**: 2026-03-27 — PR open since 2026-03-03, has ROCm buffer issues
### 5.2 Native MTP (Multi-Token Prediction)

- **Status**: BLOCKED — llama.cpp PR #20700
- **Last checked**: 2026-03-27 — WIP, not expected to merge soon
### 5.3 GPU Clock Fix

- **Status**: BLOCKED — ROCm issue #5750
- **Notes**: GPU may be stuck at 885 MHz instead of 2900 MHz on gfx1151
- **Last checked**: 2026-03-27
---

## Context Window Benchmarks
### 64K Context (pp4096/tg1024, MoE models)

- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx64k-*`
- **Results**: (check logs)
### 128K Context (pp8192/tg1024, MoE models)

- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx128k-realistic-*`
- **Results**: (check logs)
### 256K Context (pp16384/tg1024, MoE models)

- **Date**: 2026-03-27
- **Benchmark**: `data/benchmarks/ctx256k-*`
- **Results**: (check logs)
---

## How to Add Entries

When testing a new optimization:

1. Record the date and exact change
2. Run a benchmark: `make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."`
3. Compare: `make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new`
4. Update this log with results and verdict
5. If KEEP: document in [optimization.md](optimization.md) with the measured numbers
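
A copy-paste skeleton matching the entries above:

```
### X.Y Short Name

- **Date**: YYYY-MM-DD
- **Change**: `exact command or flag`
- **Benchmark**: `data/benchmarks/TAG-*`
- **Expected**: predicted effect, with source
- **Result**: measured pp/tg delta vs baseline
- **Verdict**: PENDING
```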