# Optimization Log

Living document tracking what was applied, tested, and the actual results. Each entry records the change, benchmark evidence, and verdict.

**Verdicts**: KEEP (applied permanently), REVERTED (tested, didn't help), NO IMPACT (tested, no measurable change), PENDING (not yet tested), BLOCKED (can't test yet), CLOSED (investigated, no action needed).

---

## Phase 1: Core System

### 1.1 Tuned Profile: accelerator-performance

- **Date**: 2026-03-26
- **Change**: `sudo tuned-adm profile accelerator-performance`
- **Benchmark**: `data/benchmarks/after-tuned-*`
- **Result**: +5-8% pp improvement, +2-3% tg improvement
- **Verdict**: KEEP

### 1.2 Kernel Boot Parameters

- **Date**: 2026-03-26
- **Change**: `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
- **Result**: Combined with BIOS VRAM change. Large models now fit in GTT. Peak usage 38.8/59 GiB.
- **Verdict**: KEEP

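The two memory limits in the kernel parameters above describe the same 59 GiB budget in different units. A minimal sketch of the arithmetic, assuming 4 KiB pages and that `amdgpu.gttsize` is expressed in MiB; `pages_to_mib` is a hypothetical helper, not part of this repo:

```bash
#!/bin/sh
# Sanity-check that amdgpu.gttsize and ttm.pages_limit agree.
# Assumptions: 4 KiB pages, gttsize given in MiB.

gttsize_mib=60416       # amdgpu.gttsize from the kernel cmdline
pages_limit=15466496    # ttm.pages_limit from the kernel cmdline
page_size_kib=4         # assumed page size

# Convert a page count to MiB.
pages_to_mib() {
  echo $(( $1 * page_size_kib / 1024 ))
}

ttm_mib=$(pages_to_mib "$pages_limit")
echo "ttm.pages_limit = ${ttm_mib} MiB"
echo "amdgpu.gttsize  = ${gttsize_mib} MiB ($(( gttsize_mib / 1024 )) GiB)"
```

If the two values ever diverge after an edit to the boot parameters, the smaller one is likely to be the effective ceiling, so it is worth keeping them in step.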
### 1.3 BIOS VRAM Reduction (512 MB)

- **Date**: 2026-03-26
- **Change**: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
- **Result**: 31.5 GB freed for OS/GTT. Small models ~3-8% slower (GTT indirection vs dedicated VRAM), but system gained ability to run 37 GB+ models at 32K+ context. Net positive.
- **Trade-off**: Small model regression is acceptable given the massive capability gain.
- **Verdict**: KEEP

---

## Phase 2: System Tuning

### 2.1 RyzenAdj PPT Increase

- **Date**: 2026-03-30
- **Change**: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000 --apu-slow-limit=85000`
- **Result**: STAPM raised from 59W→81W. PPT Fast raised to 81W. **However, PPT SLOW and APU SLOW stuck at 70W**: the HP ZBook BIOS EC overrides these limits. Effective sustained power: ~70W (was ~59W).
- **Benchmark**: `data/benchmarks/qwen35-shootout-v2-*` (Vulkan, q4_0 KV, pp2048/tg1024), tg results:
  - UD-Q4_K_L: **57.0 t/s** (was ~39 t/s before RyzenAdj = **+46%**)
  - UD-Q4_K_XL: **56.4 t/s**
  - Q8_0: **51.4 t/s** (was ~39-41 t/s before = **+25% to +32%**)
- **Thermals**: 70-73C under load, 30C headroom. Cooling handles it easily.
- **Notes**: Settings are volatile (reset on reboot/sleep). Use `sudo make optimize-power` or install the systemd service for persistence. HP firmware hard-caps slow PPT at 70W regardless.
- **Verdict**: KEEP (significant real-world improvement despite the HP firmware limit)

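The percentage gains in this entry follow directly from the before/after tg figures. A small sketch of that arithmetic, assuming `awk` is available; `pct_change` is a hypothetical helper rather than part of the repo's tooling, and the baselines are the approximate ~39-41 t/s values quoted above:

```bash
#!/bin/sh
# Percentage change between a baseline and a new measurement.
# usage: pct_change BEFORE AFTER
pct_change() {
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%+.1f\n", (a - b) / b * 100 }'
}

pct_change 39 57.0    # UD-Q4_K_L against the ~39 t/s baseline
pct_change 41 51.4    # Q8_0 against the upper baseline estimate
pct_change 39 51.4    # Q8_0 against the lower baseline estimate
```

Because the Q8_0 baseline is only known as a range, the gain is also a range rather than a single figure.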
### 2.2 VM Sysctl Tuning

- **Date**: 2026-03-30
- **Change**: `vm.swappiness=1, vm.dirty_ratio=40, vm.dirty_background_ratio=10, vm.max_map_count=500000, vm.zone_reclaim_mode=0`
- **Applied via**: `sudo make optimize-power` (persists to `/etc/sysctl.d/99-llm-inference.conf`)
- **Notes**: Hard to isolate impact (applied together with other Phase 2 changes). Prevents model weight eviction and I/O disruption.
- **Verdict**: KEEP (low risk, persists across reboots)

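To confirm the values above are actually live after a reboot, they can be read back from `/proc/sys`. This is a rough read-only sketch, not the repo's own tooling; `sysctl_path` and `check_sysctl` are hypothetical helper names, and keys the running kernel does not expose are skipped rather than treated as errors:

```bash
#!/bin/sh
# Map a sysctl key such as vm.swappiness to its /proc/sys path.
sysctl_path() {
  echo "/proc/sys/$(echo "$1" | tr . /)"
}

# Compare the live value of a key with the expected one (read-only).
# usage: check_sysctl KEY EXPECTED
check_sysctl() {
  cs_path=$(sysctl_path "$1")
  if [ ! -r "$cs_path" ]; then
    echo "SKIP $1 (not readable)"
    return 0
  fi
  cs_got=$(cat "$cs_path")
  if [ "$cs_got" = "$2" ]; then
    echo "OK   $1=$cs_got"
  else
    echo "DIFF $1=$cs_got (want $2)"
  fi
}

check_sysctl vm.swappiness 1
check_sysctl vm.dirty_ratio 40
check_sysctl vm.dirty_background_ratio 10
check_sysctl vm.max_map_count 500000
check_sysctl vm.zone_reclaim_mode 0
```

A `DIFF` line after boot would suggest the file in `/etc/sysctl.d/` was not applied or was overridden by a later drop-in.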
### 2.3 Transparent Huge Pages

- **Date**: 2026-03-30
- **Change**: `echo always > /sys/kernel/mm/transparent_hugepage/enabled`
- **Applied via**: `sudo make optimize-power` (volatile; add `transparent_hugepage=always` to the kernel cmdline for persistence)
- **Notes**: Reduces TLB misses for mmap'd model files. Hard to isolate impact.
- **Verdict**: KEEP (low risk)

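Since this setting is volatile, it helps to check which mode is active before benchmarking. The sysfs file lists every mode and puts the active one in square brackets (for example `always [madvise] never`). A minimal sketch of extracting it, where `thp_active_mode` is a hypothetical helper:

```bash
#!/bin/sh
# Extract the bracketed (active) mode from the THP sysfs string.
# usage: thp_active_mode "always [madvise] never"
thp_active_mode() {
  echo "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
  echo "THP mode: $(thp_active_mode "$(cat "$thp_file")")"
else
  echo "THP sysfs file not readable on this system"
fi
```

Anything other than `always` after a reboot or resume means the setting needs to be reapplied.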
### 2.4 RADV_PERFTEST=nogttspill

- **Date**: 2026-03-30
- **Change**: `RADV_PERFTEST=nogttspill` persisted to `/etc/environment.d/radv-llm.conf`
- **Applied via**: `sudo make optimize-power`
- **Notes**: Prevents GTT spill management overhead on unified memory Vulkan. Takes effect on next login. For current session: `export RADV_PERFTEST=nogttspill`
- **Verdict**: KEEP (persists across reboots)

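Because the variable only takes effect on the next login, a benchmark started from an older session may silently run without it. A small sketch of a pre-flight check, treating `RADV_PERFTEST` as a comma-separated list; `has_radv_flag` is a hypothetical helper:

```bash
#!/bin/sh
# Return success if FLAG appears in a comma-separated option list.
# usage: has_radv_flag FLAG "LIST"
has_radv_flag() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

if has_radv_flag nogttspill "${RADV_PERFTEST:-}"; then
  echo "nogttspill is active in this session"
else
  echo "nogttspill is NOT set; run: export RADV_PERFTEST=nogttspill"
fi
```

Wrapping the list in commas before matching avoids false positives from options that merely contain the flag name as a substring.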
### 2.5 amdgpu.noretry=0

- **Date**: PENDING
- **Change**: Kernel cmdline `amdgpu.noretry=0`
- **Expected**: Improved stability under memory pressure
- **Notes**: Only apply if experiencing GPU page faults or crashes during large model loading
- **Verdict**: PENDING

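When this entry is eventually tested, it is worth recording whether the parameter was really present on the booted kernel. A rough sketch of reading one `key=value` parameter from a cmdline string; `cmdline_value` is a hypothetical helper and the parsing is deliberately naive (whitespace-separated words, no quoting support):

```bash
#!/bin/sh
# Print the value of KEY from a kernel command-line string.
# usage: cmdline_value KEY "CMDLINE"; returns 1 if KEY is absent.
cmdline_value() {
  for cv_word in $2; do
    case "$cv_word" in
      "$1="*) echo "${cv_word#*=}"; return 0 ;;
    esac
  done
  return 1
}

if [ -r /proc/cmdline ]; then
  cmdline_value amdgpu.noretry "$(cat /proc/cmdline)" \
    || echo "amdgpu.noretry not set on the running kernel"
fi
```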
---

## Phase 3: Runtime Flags

### 3.1 KV Cache Quantization

- **Date**: 2026-03-27
- **Change**: `--kv-types f16,q8_0,q4_0` sweep
- **Benchmark**: `data/benchmarks/kv-sweep-256k-*`
- **Result** (Vulkan RADV, Qwen3.5-35B-A3B Q8, pp2048/tg1024, t/s):
  - f16: 456 pp, 39.8 tg
  - q8_0: 418 pp, 38.5 tg (slight Vulkan regression, unexpected)
  - **q4_0: 460 pp, 41.1 tg** (fastest overall, +3% tg over f16)
- **Result** (ROCm, same model):
  - f16: 445 pp, 21.5 tg
  - q8_0: 495 pp, 21.7 tg (+11% pp, same tg)
  - q4_0: 494 pp, 21.8 tg (+11% pp, same tg)
- **Conclusion**: q4_0 is the sweet spot on Vulkan (fastest tg and roughly 72% less KV memory than f16). On ROCm, KV quant helps pp but not tg.
- **Verdict**: KEEP (use q4_0 KV as default for serving)

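The KV memory saving can be estimated from the storage cost per block. A minimal sketch, assuming ggml's block layouts of 32 elements per block (64 bytes for f16, 34 bytes for q8_0, 18 bytes for q4_0, the quantized types carrying a 2-byte scale); `kv_saving_pct` is a hypothetical helper. It puts q4_0 at about 72% below f16, a little under the nominal 75% that 4 vs 16 bits would suggest:

```bash
#!/bin/sh
# Percent of KV memory saved relative to f16, given bytes per 32-element block.
# usage: kv_saving_pct BYTES_PER_BLOCK
kv_saving_pct() {
  awk -v q="$1" -v f=64 'BEGIN { printf "%.3f\n", (1 - q / f) * 100 }'
}

echo "q8_0 saves $(kv_saving_pct 34)% vs f16"
echo "q4_0 saves $(kv_saving_pct 18)% vs f16"
```

The estimate covers only the cache tensors themselves; any backend-specific padding or scratch buffers are not modeled here.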
### 3.2 MoE Batch Size `-b 256`

- **Date**: 2026-03-30
- **Change**: `-b 256` vs default (2048)
- **Benchmark**: `data/benchmarks/batch-default-*` vs `data/benchmarks/batch-256-*`
- **Result** (Vulkan RADV, Qwen3.5-35B-A3B UD-Q4_K_XL, q4_0 KV, pp2048, t/s):
  - Default: 826 pp, 55.9 tg
  - b=256: 843 pp, 55.5 tg (+2% pp, -0.7% tg, within noise)
- **Notes**: Community-reported +70% improvement does not reproduce on Vulkan RADV. May only apply to ROCm or CPU backends, or to longer prompts (pp8192+).
- **Verdict**: NO IMPACT on Vulkan (not recommended)

---

## Phase 4: Build Optimizations

### 4.1 rocWMMA Flash Attention

- **Date**: PENDING
- **Change**: Rebuild ROCm toolbox with `-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`
- **Expected**: +96% long-context performance (65K+)
- **Notes**: Need to check if Donato's toolboxes already include this
- **Verdict**: PENDING

### 4.2 rocWMMA Tuned Patch (PR #16827)

- **Date**: PENDING
- **Notes**: Fixes long-context regression. Check Donato's latest toolbox builds.
- **Verdict**: PENDING

---

## Phase 5: Future / Blocked

### 5.1 Speculative Decoding

- **Status**: BLOCKED on llama.cpp PR #20075 (hybrid SSM/MoE fix)
- **Draft model**: Downloaded `Qwen3.5-0.8B-Q8_0.gguf` (812 MB) on 2026-03-27
- **Last checked**: 2026-03-27. PR open since 2026-03-03, has ROCm buffer issues

### 5.2 Native MTP (Multi-Token Prediction)

- **Status**: BLOCKED on llama.cpp PR #20700
- **Last checked**: 2026-03-27. WIP, not expected to merge soon

### 5.3 GPU Clock Reporting

- **Status**: NOT A REAL ISSUE. sysfs reporting is broken, actual clocks are fine
- **Measured**: clpeak (2026-03-30) confirms GPU reaches 2900 MHz under compute load
- **Notes**: ROCm issue #5750 is about sysfs `pp_dpm_sclk` reporting, not actual performance. No action needed.
- **Verdict**: CLOSED (no performance impact)

---

## Context Window Benchmarks

### 64K Context (pp4096/tg1024, MoE models)

- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx64k-*`
- **Results**: (check logs)

### 128K Context (pp8192/tg1024, MoE models)

- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx128k-realistic-*`
- **Results**: (check logs)

### 256K Context (pp16384/tg1024, MoE models)

- **Date**: 2026-03-27
- **Benchmark**: `data/benchmarks/ctx256k-*`
- **Results**: (check logs)

---

## Model Quant Shootout

### Qwen3.5-35B-A3B: Q4_K_L vs Q4_K_XL vs Q8 (2026-03-30)

- **Benchmark**: `data/benchmarks/qwen35-shootout-v2-*`
- **Config**: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
- **RyzenAdj**: STAPM=81W (sustained ~70W due to HP firmware cap)

| Quant | File Size | pp2048 (t/s) | tg1024 (t/s) | Recommendation |
|-------|-----------|-------------|-------------|----------------|
| UD-Q4_K_L | 18.8 GB | 825 | **57.0** | Fastest. Good quality. |
| **UD-Q4_K_XL** | 20.7 GB | 835 | **56.4** | **Daily driver**: best quality/speed. |
| Q8_0 | 34.4 GB | 850 | 51.4 | Best quality, 10% slower tg. |

**Decision**: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). Q4_K_L deleted: Q4_K_XL is the better quality/speed trade-off at only +1.9 GB, with tg within about 1% (56.4 vs 57.0 t/s).

### Coder Model Shootout (2026-03-30)

- **Benchmark**: `data/benchmarks/coder-shootout-*`
- **Config**: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
- **RyzenAdj**: STAPM=81W (sustained ~70W)

| Model | Architecture | File Size | pp2048 (t/s) | tg1024 (t/s) |
|-------|-------------|-----------|-------------|-------------|
| **Qwen3-Coder-30B** UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | **61.0** |
| **Qwen3.5-35B-A3B** UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | **821** | 54.9 |
| **Nemotron-Cascade-2** Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| **Qwen3-Coder-Next** UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |

**Analysis**:

- tg speed roughly scales inversely with model size (bandwidth-bound at ~215 GB/s); Qwen3.5-35B-A3B is the exception, with a smaller file than Qwen3-Coder-30B but lower tg
- Pure Transformer (Qwen3-Coder-30B) has lowest overhead per token
- DeltaNet hybrid (Qwen3.5) has best pp: DeltaNet layers are efficient for prefill
- Qwen3-Coder-Next (80B at 3-bit) is about 23% slower in tg than Qwen3-Coder-30B (46.8 vs 61.0 t/s) but has >70% SWE-bench vs ~50% for the 30B

**Recommended roles**:

- **Qwen3-Coder-30B**: Interactive tool-use / function-calling loops (fastest tg, purpose-built)
- **Qwen3.5-35B-A3B**: General tasks, long prompt processing (best pp, best all-rounder)
- **Qwen3-Coder-Next**: Complex multi-file coding tasks where quality > speed

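If tg were purely bandwidth-bound at the ~215 GB/s quoted in the analysis, each measured tg rate would imply a certain amount of memory traffic per generated token. The sketch below is only a back-of-the-envelope view of that relationship, not a measurement: `gb_per_token` is a hypothetical helper, and real traffic also depends on which experts are active, KV cache reads, and caching effects.

```bash
#!/bin/sh
# Implied memory traffic per token (GB) if tg is bandwidth-bound.
# usage: gb_per_token BANDWIDTH_GB_PER_S TOKENS_PER_S
gb_per_token() {
  awk -v bw="$1" -v tg="$2" 'BEGIN { printf "%.2f\n", bw / tg }'
}

bw=215   # GB/s, the approximate figure from the analysis above
echo "Qwen3-Coder-30B:    $(gb_per_token "$bw" 61.0) GB/token"
echo "Qwen3.5-35B-A3B:    $(gb_per_token "$bw" 54.9) GB/token"
echo "Nemotron-Cascade-2: $(gb_per_token "$bw" 52.8) GB/token"
echo "Qwen3-Coder-Next:   $(gb_per_token "$bw" 46.8) GB/token"
```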
---

## How to Add Entries

When testing a new optimization:

1. Record the date and exact change
2. Run a benchmark: `make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."`
3. Compare: `make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new`
4. Update this log with results and verdict
5. If KEEP: document in [optimization.md](optimization.md) with the measured numbers