# Optimization Log
Living document tracking what was applied, tested, and the actual results. Each entry records the change, benchmark evidence, and verdict. Throughput figures use llama-bench shorthand: pp = prompt processing, tg = token generation, both in tokens/s.
**Verdicts**: KEEP (applied permanently), REVERTED (tested, didn't help), PENDING (not yet tested), BLOCKED (can't test yet).
---
## Phase 1: Core System
### 1.1 Tuned Profile: accelerator-performance
- **Date**: 2026-03-26
- **Change**: `sudo tuned-adm profile accelerator-performance`
- **Benchmark**: `data/benchmarks/after-tuned-*`
- **Result**: +5-8% pp improvement, +2-3% tg improvement
- **Verdict**: KEEP
### 1.2 Kernel Boot Parameters
- **Date**: 2026-03-26
- **Change**: `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
- **Result**: Combined with BIOS VRAM change. Large models now fit in GTT. Peak usage 38.8/59 GiB.
- **Verdict**: KEEP
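The two boot parameters express the same budget in different units: `ttm.pages_limit` counts 4 KiB pages, while `amdgpu.gttsize` is in MiB. A quick sanity check that the values above agree:

```shell
# ttm.pages_limit is in 4 KiB pages; amdgpu.gttsize is in MiB.
# 15466496 pages * 4 KiB = 60416 MiB (= 59 GiB), matching gttsize above.
echo $(( 15466496 * 4 / 1024 ))
```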
### 1.3 BIOS VRAM Reduction (512 MB)
- **Date**: 2026-03-26
- **Change**: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
- **Result**: 31.5 GB freed for OS/GTT. Small models ~3-8% slower (GTT indirection vs dedicated VRAM), but system gained ability to run 37 GB+ models at 32K+ context. Net positive.
- **Trade-off**: Small model regression is acceptable given the massive capability gain.
- **Verdict**: KEEP
---
## Phase 2: System Tuning
### 2.1 RyzenAdj 85W PPT
- **Date**: PENDING
- **Change**: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000`
- **Expected**: +12-19% CPU/GPU throughput (community data from Strix Halo Wiki)
- **Benchmark**: Not yet run
- **Notes**: HP ZBook ships at 60W. 85W is the community-recommended sweet spot.
- **Verdict**: PENDING
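RyzenAdj settings reset on reboot, so once the limits are validated they need to be reapplied at boot. A sketch using a systemd oneshot unit (unit name and binary path are assumptions for this host):

```shell
# Hypothetical unit to reapply the 85W limits at boot; adjust ExecStart path.
sudo tee /etc/systemd/system/ryzenadj-ppt.service <<'EOF'
[Unit]
Description=Apply 85W PPT limits via RyzenAdj
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable --now ryzenadj-ppt.service
```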
### 2.2 VM Sysctl Tuning
- **Date**: PENDING
- **Change**: `vm.swappiness=1, vm.dirty_ratio=40, vm.max_map_count=500000`
- **Expected**: Prevent model weight eviction, reduce I/O disruption
- **Benchmark**: Not yet run
- **Verdict**: PENDING
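Once validated, these can be persisted via a drop-in under `/etc/sysctl.d/` (filename is an arbitrary choice):

```shell
sudo tee /etc/sysctl.d/99-llm-tuning.conf <<'EOF'
vm.swappiness = 1
vm.dirty_ratio = 40
vm.max_map_count = 500000
EOF
sudo sysctl --system   # reload all sysctl drop-ins
```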
### 2.3 Transparent Huge Pages
- **Date**: PENDING
- **Change**: `transparent_hugepage=always`
- **Expected**: Faster model load time, possible 1-5% tg improvement from reduced TLB misses
- **Benchmark**: Not yet run
- **Verdict**: PENDING
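The boot parameter persists across reboots, but THP can also be toggled at runtime for a quick before/after benchmark without rebooting:

```shell
cat /sys/kernel/mm/transparent_hugepage/enabled   # active mode shown in [brackets]
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```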
### 2.4 RADV_PERFTEST=nogttspill
- **Date**: PENDING
- **Change**: `export RADV_PERFTEST=nogttspill`
- **Expected**: Fix pp degradation on Vulkan RADV (community-reported fix for Strix Halo)
- **Benchmark**: Not yet run — needs Vulkan-specific benchmark comparison
- **Verdict**: PENDING
### 2.5 amdgpu.noretry=0
- **Date**: PENDING
- **Change**: Kernel cmdline `amdgpu.noretry=0`
- **Expected**: Improved stability under memory pressure
- **Notes**: Only apply if experiencing GPU page faults or crashes during large model loading
- **Verdict**: PENDING
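If those symptoms do appear, one way to add the parameter on Fedora-family systems is grubby (assuming this host uses it; adjust for your bootloader):

```shell
sudo grubby --update-kernel=ALL --args="amdgpu.noretry=0"
```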
---
## Phase 3: Runtime Flags
### 3.1 KV Cache Quantization
- **Date**: PENDING (sweep running)
- **Change**: `-ctk q8_0 -ctv q8_0` / `-ctk q4_0 -ctv q4_0`
- **Benchmark**: `data/benchmarks/kv-sweep-128k-*` (in progress)
- **Expected**: Q8_0: ~50% less KV memory, negligible quality loss. Q4_0: ~75% less, noticeable quality impact.
- **Verdict**: PENDING
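The memory at stake can be estimated from model shape. A back-of-the-envelope sketch for a hypothetical 32-layer model with 8 KV heads of head_dim 128 (f16 cache at 2 bytes/element; per the expectations above, q8_0 roughly halves this and q4_0 roughly quarters it):

```shell
# f16 KV bytes per token = 2 (K and V) * n_layers * n_kv_heads * head_dim * 2 bytes
echo $(( 2 * 32 * 8 * 128 * 2 ))                         # 131072 bytes/token
echo $(( 2 * 32 * 8 * 128 * 2 * 32768 / 1024 / 1024 ))   # 4096 MiB at 32K context
```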
### 3.2 MoE Batch Size `-b 256`
- **Date**: PENDING
- **Change**: Add `-b 256` to MoE benchmark runs
- **Expected**: Up to +70% pp improvement for MoE models (community benchmarks)
- **Benchmark**: Not yet run
- **Verdict**: PENDING
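For the A/B run, the flag slots straight into llama-bench (model path is a placeholder):

```shell
# default batch vs -b 256, same prompt/generation lengths as the MoE suite
llama-bench -m /path/to/moe-model.gguf -p 4096 -n 1024
llama-bench -m /path/to/moe-model.gguf -p 4096 -n 1024 -b 256
```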
---
## Phase 4: Build Optimizations
### 4.1 rocWMMA Flash Attention
- **Date**: PENDING
- **Change**: Rebuild ROCm toolbox with `-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`
- **Expected**: Up to +96% throughput at long context (65K+ tokens)
- **Notes**: Need to check if Donato's toolboxes already include this
- **Verdict**: PENDING
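If a rebuild turns out to be needed, a sketch of the flags on top of the standard llama.cpp HIP build (GPU target for Strix Halo assumed to be gfx1151):

```shell
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 \
      -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON
cmake --build build --config Release -j
```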
### 4.2 rocWMMA Tuned Patch (PR #16827)
- **Date**: PENDING
- **Notes**: Fixes long-context regression. Check Donato's latest toolbox builds.
- **Verdict**: PENDING
---
## Phase 5: Future / Blocked
### 5.1 Speculative Decoding
- **Status**: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE fix)
- **Draft model**: Downloaded `Qwen3.5-0.8B-Q8_0.gguf` (812 MB) on 2026-03-27
- **Last checked**: 2026-03-27 — PR open since 2026-03-03, has ROCm buffer issues
### 5.2 Native MTP (Multi-Token Prediction)
- **Status**: BLOCKED — llama.cpp PR #20700
- **Last checked**: 2026-03-27 — WIP, not expected to merge soon
### 5.3 GPU Clock Fix
- **Status**: BLOCKED — ROCm issue #5750
- **Notes**: GPU may be stuck at 885 MHz instead of 2900 MHz on gfx1151
- **Last checked**: 2026-03-27
---
## Context Window Benchmarks
### 64K Context (pp4096/tg1024, MoE models)
- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx64k-*`
- **Results**: (check logs)
### 128K Context (pp8192/tg1024, MoE models)
- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx128k-realistic-*`
- **Results**: (check logs)
### 256K Context (pp16384/tg1024, MoE models)
- **Date**: 2026-03-27
- **Benchmark**: `data/benchmarks/ctx256k-*`
- **Results**: (check logs)
---
## How to Add Entries
When testing a new optimization:
1. Record the date and exact change
2. Run a benchmark: `make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."`
3. Compare: `make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new`
4. Update this log with results and verdict
5. If KEEP: document in [optimization.md](optimization.md) with the measured numbers
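A worked example of the loop above (the tag and paths are placeholders, not real runs):

```shell
make benchmark ARGS="--tag after-thp"
make benchmark-compare BEFORE=data/benchmarks/baseline AFTER=data/benchmarks/after-thp
```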