# Optimization Log

A living document tracking what was applied, what was tested, and the actual results. Each entry records the change, the benchmark evidence, and a verdict.

**Verdicts**: KEEP (applied permanently), REVERTED (tested, didn't help), PENDING (not yet tested), BLOCKED (can't test yet).

---

## Phase 1: Core System
### 1.1 Tuned Profile: accelerator-performance

- **Date**: 2026-03-26
- **Change**: `sudo tuned-adm profile accelerator-performance`
- **Benchmark**: `data/benchmarks/after-tuned-*`
- **Result**: +5-8% pp improvement, +2-3% tg improvement
- **Verdict**: KEEP
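
To re-apply and confirm the profile took effect, a minimal sketch (`active` and `verify` are standard `tuned-adm` subcommands):

```shell
# Switch profiles, then confirm the change actually took effect.
sudo tuned-adm profile accelerator-performance
tuned-adm active    # reports the currently active profile
tuned-adm verify    # re-checks that the profile's settings are in place
```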
### 1.2 Kernel Boot Parameters

- **Date**: 2026-03-26
- **Change**: `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
- **Result**: Combined with BIOS VRAM change. Large models now fit in GTT. Peak usage 38.8/59 GiB.
- **Verdict**: KEEP
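
One way to persist these parameters is `grubby` (an assumption: a grubby-managed bootloader such as Fedora's; plain GRUB setups edit `GRUB_CMDLINE_LINUX` instead):

```shell
# Append the parameters to every installed kernel's cmdline.
sudo grubby --update-kernel=ALL \
  --args="iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496"
# After a reboot, confirm they are live:
grep -o 'amdgpu.gttsize=[0-9]*' /proc/cmdline
```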
### 1.3 BIOS VRAM Reduction (512 MB)

- **Date**: 2026-03-26
- **Change**: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
- **Result**: 31.5 GB freed for OS/GTT. Small models are ~3-8% slower (GTT indirection vs dedicated VRAM), but the system gained the ability to run 37 GB+ models at 32K+ context. Net positive.
- **Trade-off**: The small-model regression is acceptable given the massive capability gain.
- **Verdict**: KEEP
---

## Phase 2: System Tuning
### 2.1 RyzenAdj 85W PPT

- **Date**: PENDING
- **Change**: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000`
- **Expected**: +12-19% CPU/GPU throughput (community data from Strix Halo Wiki)
- **Benchmark**: Not yet run
- **Notes**: HP ZBook ships at 60W. 85W is the community-recommended sweet spot.
- **Verdict**: PENDING
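
When this gets tested, the limits can be read back immediately; a sketch (note that `ryzenadj` changes do not survive a reboot or suspend, so a KEEP verdict would also need something like a systemd oneshot unit):

```shell
# Raise the STAPM/fast/slow limits to 85 W, then read the applied values back.
sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000
sudo ryzenadj --info | grep -iE 'stapm|ppt'
```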
### 2.2 VM Sysctl Tuning

- **Date**: PENDING
- **Change**: `vm.swappiness=1, vm.dirty_ratio=40, vm.max_map_count=500000`
- **Expected**: Prevent model weight eviction, reduce I/O disruption
- **Benchmark**: Not yet run
- **Verdict**: PENDING
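
When applying, a drop-in file keeps the values across reboots; a sketch (the file name is arbitrary):

```shell
# Write the values to a persistent drop-in, then load them.
sudo tee /etc/sysctl.d/99-llm-tuning.conf >/dev/null <<'EOF'
vm.swappiness = 1
vm.dirty_ratio = 40
vm.max_map_count = 500000
EOF
sudo sysctl --system     # reloads every sysctl config file
sysctl vm.swappiness     # spot-check one value
```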
### 2.3 Transparent Huge Pages

- **Date**: PENDING
- **Change**: `transparent_hugepage=always`
- **Expected**: Faster model load time, possible 1-5% tg improvement from reduced TLB misses
- **Benchmark**: Not yet run
- **Verdict**: PENDING
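
THP can also be flipped at runtime, which makes for a cheaper A/B test than a reboot; a sketch using the standard sysfs knob:

```shell
# Show the current mode (the active value is shown in brackets), then switch it.
cat /sys/kernel/mm/transparent_hugepage/enabled
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

The `transparent_hugepage=always` boot parameter makes the same setting permanent.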
### 2.4 RADV_PERFTEST=nogttspill

- **Date**: PENDING
- **Change**: `export RADV_PERFTEST=nogttspill`
- **Expected**: Fix pp degradation on Vulkan RADV (community-reported fix for Strix Halo)
- **Benchmark**: Not yet run — needs Vulkan-specific benchmark comparison
- **Verdict**: PENDING
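
The comparison itself is just the same benchmark with and without the variable; a sketch (the model path is a placeholder):

```shell
# Baseline vs. nogttspill on the Vulkan build; compare the pp rows.
./llama-bench -m models/example.gguf -p 4096 -n 1024
RADV_PERFTEST=nogttspill ./llama-bench -m models/example.gguf -p 4096 -n 1024
```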
### 2.5 amdgpu.noretry=0

- **Date**: PENDING
- **Change**: Kernel cmdline `amdgpu.noretry=0`
- **Expected**: Improved stability under memory pressure
- **Notes**: Only apply if experiencing GPU page faults or crashes during large model loading
- **Verdict**: PENDING
---

## Phase 3: Runtime Flags
### 3.1 KV Cache Quantization

- **Date**: PENDING (sweep running)
- **Change**: `-ctk q8_0 -ctv q8_0` / `-ctk q4_0 -ctv q4_0`
- **Benchmark**: `data/benchmarks/kv-sweep-128k-*` (in progress)
- **Expected**: Q8_0: ~50% less KV memory, negligible quality loss. Q4_0: ~75% less, noticeable quality impact.
- **Verdict**: PENDING
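
The memory side of the trade-off can be estimated before the sweep finishes. A back-of-the-envelope sketch (the model shape is hypothetical, not taken from the sweep): f16 stores 2 bytes per element, while q8_0 packs each 32-element block into 34 bytes and q4_0 into 18.

```shell
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elt.
# Hypothetical dense-model shape: 48 layers, 8 KV heads, head_dim 128, 128K context.
awk 'BEGIN {
  elts = 2 * 48 * 8 * 128 * 131072;
  printf "f16:  %.2f GiB\n", elts * 2       / 2^30;   # 24.00
  printf "q8_0: %.2f GiB\n", elts * 34 / 32 / 2^30;   # 12.75
  printf "q4_0: %.2f GiB\n", elts * 18 / 32 / 2^30;   #  6.75
}'
```

That works out to ~47% savings for q8_0 and ~72% for q4_0, consistent with the rough 50%/75% figures above.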
### 3.2 MoE Batch Size `-b 256`

- **Date**: PENDING
- **Change**: Add `-b 256` to MoE benchmark runs
- **Expected**: Up to +70% pp improvement for MoE models (community benchmarks)
- **Benchmark**: Not yet run
- **Verdict**: PENDING
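
The run itself is a one-flag change; a sketch (the model path is a placeholder, and the first invocation uses llama-bench's default batch size):

```shell
# Default batch vs. -b 256 on a MoE model; compare the pp rows.
./llama-bench -m models/moe-example.gguf -p 4096 -n 1024
./llama-bench -m models/moe-example.gguf -p 4096 -n 1024 -b 256
```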
---

## Phase 4: Build Optimizations
### 4.1 rocWMMA Flash Attention

- **Date**: PENDING
- **Change**: Rebuild ROCm toolbox with `-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`
- **Expected**: +96% long-context performance (65K+)
- **Notes**: Need to check if Donato's toolboxes already include this
- **Verdict**: PENDING
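
If the toolboxes don't already include it, the rebuild is two cmake invocations; a sketch (assumes ROCm dev packages are present in the toolbox; `GGML_HIP=ON` and `AMDGPU_TARGETS` are the standard llama.cpp HIP build options, with gfx1151 as the Strix Halo target):

```shell
# Configure with rocWMMA flash attention and UMA support, then build.
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 \
      -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON \
      -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)"
```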
### 4.2 rocWMMA Tuned Patch (PR #16827)

- **Date**: PENDING
- **Notes**: Fixes long-context regression. Check Donato's latest toolbox builds.
- **Verdict**: PENDING
---

## Phase 5: Future / Blocked
### 5.1 Speculative Decoding

- **Status**: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE fix)
- **Draft model**: Downloaded `Qwen3.5-0.8B-Q8_0.gguf` (812 MB) on 2026-03-27
- **Last checked**: 2026-03-27 — PR open since 2026-03-03, has ROCm buffer issues
### 5.2 Native MTP (Multi-Token Prediction)

- **Status**: BLOCKED — llama.cpp PR #20700
- **Last checked**: 2026-03-27 — WIP, not expected to merge soon
### 5.3 GPU Clock Fix

- **Status**: BLOCKED — ROCm issue #5750
- **Notes**: GPU may be stuck at 885 MHz instead of 2900 MHz on gfx1151
- **Last checked**: 2026-03-27
---

## Context Window Benchmarks
### 64K Context (pp4096/tg1024, MoE models)

- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx64k-*`
- **Results**: (check logs)
### 128K Context (pp8192/tg1024, MoE models)

- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx128k-realistic-*`
- **Results**: (check logs)
### 256K Context (pp16384/tg1024, MoE models)

- **Date**: 2026-03-27
- **Benchmark**: `data/benchmarks/ctx256k-*`
- **Results**: (check logs)
---

## How to Add Entries

When testing a new optimization:

1. Record the date and exact change
2. Run a benchmark: `make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."`
3. Compare: `make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new`
4. Update this log with results and verdict
5. If KEEP: document in [optimization.md](optimization.md) with the measured numbers
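
A copy-paste skeleton matching the entries above:

```
### X.Y Short Name

- **Date**: YYYY-MM-DD
- **Change**: `exact command or flag`
- **Benchmark**: `data/benchmarks/TAG-*`
- **Expected**: predicted effect, with source
- **Result**: measured pp/tg delta vs baseline
- **Verdict**: PENDING
```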