feat(serve): add optimized llama-server launcher with n-gram speculation

Add `make serve` and `make serve-ngram` for launching llama-server with
baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention,
no-mmap, full GPU offload). N-gram speculative decoding gives 1.1-1.4x
tg speedup on repetitive content without upstream PR dependencies.
Update Phase 5 status: MTP is months away (4 unmerged PRs, no MoE
support), draft-model speculation stalled on ROCm buffer crashes.
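A sketch of what the two new targets plausibly run, assembled from the settings named above (the `MODEL` variable and exact target layout are assumptions; the n-gram flags are those recorded in the plan notes this commit updates; Vulkan RADV is selected by the llama.cpp build and Mesa driver, not a runtime flag):

```make
# Sketch of the new Makefile targets (flag set assumed from the commit
# description; MODEL is a placeholder the caller overrides).
MODEL ?= models/model.gguf

serve:
	llama-server -m $(MODEL) -ngl 99 --flash-attn \
	  --cache-type-k q4_0 --cache-type-v q4_0 --no-mmap $(ARGS)

serve-ngram:
	llama-server -m $(MODEL) -ngl 99 --flash-attn \
	  --cache-type-k q4_0 --cache-type-v q4_0 --no-mmap \
	  --spec-type ngram-simple --draft-max 64 --draft-min 4 $(ARGS)
```

Invoke as `make serve-ngram ARGS="-m MODEL.gguf"`; `ARGS` is appended last so callers can override the baked-in defaults.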
Author: Felipe Cardoso
Date: 2026-03-30 21:12:30 +02:00
Parent: ba24091791 · Commit: dd403a907c
4 changed files with 169 additions and 5 deletions


@@ -135,16 +135,32 @@ Living document tracking what was applied, tested, and the actual results. Each
 ## Phase 5: Future / Blocked
-### 5.1 Speculative Decoding
+### 5.1 Speculative Decoding (draft model)
-- **Status**: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE fix)
+- **Status**: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE checkpoint/restore)
 - **Draft model**: Downloaded `Qwen3.5-0.8B-Q8_0.gguf` (812 MB) on 2026-03-27
-- **Last checked**: 2026-03-27 — PR open since 2026-03-03, has ROCm buffer issues
+- **Last checked**: 2026-03-30 — PR stalled since Mar 5. ROCm buffer crashes in `copy_cell()`. Works on Metal/CUDA but not AMD. Months away from landing.
 ### 5.2 Native MTP (Multi-Token Prediction)
-- **Status**: BLOCKED — llama.cpp PR #20700
-- **Last checked**: 2026-03-27 — WIP, not expected to merge soon
+- **Status**: BLOCKED — multiple dependencies unmerged
+- **Last checked**: 2026-03-30
+- **Details**: 4 separate PRs in flight, none merged:
+  - PR #18886: MTP API framework (DRAFT since Feb 6) — foundation for all MTP work
+  - PR #20700: MTP for Qwen3.5 **dense only** (WIP, author says "not expected to merge soon")
+  - PR #15225: GLM-style MTP (open since Aug 2025, "slower than baseline")
+  - PR #18039: EAGLE3 speculative decoding (open since Dec 2025)
+- **Key gap**: No MTP implementation exists for MoE models. PR #20700 only covers dense Qwen3.5 (0.8B-27B), not the 35B-A3B MoE.
+- **Timeline estimate**: MTP API (#18886) must merge first, then model-specific implementations adapted. Months, not weeks.
+### 5.2a N-gram Speculative Decoding (AVAILABLE NOW)
+- **Status**: WORKS TODAY — no upstream PRs needed
+- **How**: `llama-server --spec-type ngram-simple --draft-max 64 --draft-min 4`
+- **Expected**: 1.1-1.4x tg speedup on repetitive content (code, structured output)
+- **Added to**: `make serve-ngram ARGS="-m MODEL.gguf"` and `bin/serve --ngram`
+- **Notes**: Pattern-matches drafts from the token history; no draft model needed. Best for code generation where patterns repeat. No quality impact.
+- **Verdict**: AVAILABLE — use the `--ngram` flag when serving
 ### 5.3 GPU Clock Reporting
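The n-gram speculation described in 5.2a (drafting tokens by pattern-matching the recent token history, with no draft model) can be sketched as below. This is an illustrative reimplementation under stated assumptions, not llama.cpp's actual `ngram-simple` code; the function name and structure are hypothetical, while `n=4` and `draft_max=64` mirror the `--draft-min 4 --draft-max 64` flags above.

```python
def ngram_draft(history, n=4, draft_max=64):
    """Propose draft tokens by matching the last n tokens of `history`
    against an earlier occurrence of the same n-gram, then copying the
    tokens that followed it. Returns [] when no match is found, in which
    case decoding falls back to the normal one-token-at-a-time path.
    Illustrative sketch only; llama.cpp's ngram-simple differs in detail.
    """
    if len(history) < n + 1:
        return []
    key = tuple(history[-n:])
    # Scan backwards for the most recent earlier occurrence of the n-gram.
    for i in range(len(history) - n - 1, -1, -1):
        if tuple(history[i:i + n]) == key:
            # Copy up to draft_max tokens that followed the matched n-gram;
            # the target model then verifies them in a single batch.
            return history[i + n:i + n + draft_max]
    return []
```

Drafts cost almost nothing to produce, and wrong drafts are simply rejected during verification, which is why this gives a speedup on repetitive content with no quality impact.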