feat(serve): add optimized llama-server launcher with n-gram speculation

Add `make serve` and `make serve-ngram` for launching llama-server with
baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention,
no-mmap, full GPU offload). N-gram speculative decoding gives 1.1-1.4x
tg speedup on repetitive content without upstream PR dependencies.
Update Phase 5 status: MTP is months away (4 unmerged PRs, no MoE
support), draft-model speculation stalled on ROCm buffer crashes.
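A sketch of what the two new targets plausibly run, assembled from the settings named above (the `MODEL` variable and exact target layout are assumptions; the n-gram flags are those recorded in the plan notes this commit updates; Vulkan RADV is selected by the llama.cpp build and Mesa driver, not a runtime flag):

```make
# Sketch of the new Makefile targets (flag set assumed from the commit
# description; MODEL is a placeholder the caller overrides).
MODEL ?= models/model.gguf

serve:
	llama-server -m $(MODEL) -ngl 99 --flash-attn \
	  --cache-type-k q4_0 --cache-type-v q4_0 --no-mmap $(ARGS)

serve-ngram:
	llama-server -m $(MODEL) -ngl 99 --flash-attn \
	  --cache-type-k q4_0 --cache-type-v q4_0 --no-mmap \
	  --spec-type ngram-simple --draft-max 64 --draft-min 4 $(ARGS)
```

Invoke as `make serve-ngram ARGS="-m MODEL.gguf"`; `ARGS` is appended last so callers can override the baked-in defaults.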
Author: Felipe Cardoso
Date: 2026-03-30 21:12:30 +02:00
Parent: ba24091791 · Commit: dd403a907c
4 changed files with 169 additions and 5 deletions


@@ -135,16 +135,32 @@ Living document tracking what was applied, tested, and the actual results. Each
 ## Phase 5: Future / Blocked
-### 5.1 Speculative Decoding
+### 5.1 Speculative Decoding (draft model)
-- **Status**: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE fix)
+- **Status**: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE checkpoint/restore)
 - **Draft model**: Downloaded `Qwen3.5-0.8B-Q8_0.gguf` (812 MB) on 2026-03-27
-- **Last checked**: 2026-03-27 — PR open since 2026-03-03, has ROCm buffer issues
+- **Last checked**: 2026-03-30 — PR stalled since Mar 5. ROCm buffer crashes in `copy_cell()`. Works on Metal/CUDA but not AMD. Months away from landing.
 ### 5.2 Native MTP (Multi-Token Prediction)
-- **Status**: BLOCKED — llama.cpp PR #20700
-- **Last checked**: 2026-03-27 — WIP, not expected to merge soon
+- **Status**: BLOCKED — multiple dependencies unmerged
+- **Last checked**: 2026-03-30
+- **Details**: 4 separate PRs in flight, none merged:
+  - PR #18886: MTP API framework (DRAFT since Feb 6) — foundation for all MTP work
+  - PR #20700: MTP for Qwen3.5 **dense only** (WIP, author says "not expected to merge soon")
+  - PR #15225: GLM-style MTP (open since Aug 2025, "slower than baseline")
+  - PR #18039: EAGLE3 speculative decoding (open since Dec 2025)
+- **Key gap**: No MTP implementation exists for MoE models. PR #20700 only covers dense Qwen3.5 (0.8B-27B), not the 35B-A3B MoE.
+- **Timeline estimate**: MTP API (#18886) must merge first, then model-specific implementations adapted. Months, not weeks.
+### 5.2a N-gram Speculative Decoding (AVAILABLE NOW)
+- **Status**: WORKS TODAY — no upstream PRs needed
+- **How**: `llama-server --spec-type ngram-simple --draft-max 64 --draft-min 4`
+- **Expected**: 1.1-1.4x tg speedup on repetitive content (code, structured output)
+- **Added to**: `make serve-ngram ARGS="-m MODEL.gguf"` and `bin/serve --ngram`
+- **Notes**: Pattern-matches drafts from the token history; no draft model needed. Best for code generation where patterns repeat. No quality impact.
+- **Verdict**: AVAILABLE — use the `--ngram` flag when serving
 ### 5.3 GPU Clock Reporting
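The n-gram speculation described in 5.2a (drafting tokens by pattern-matching the recent token history, with no draft model) can be sketched as below. This is an illustrative reimplementation under stated assumptions, not llama.cpp's actual `ngram-simple` code; the function name and structure are hypothetical, while `n=4` and `draft_max=64` mirror the `--draft-min 4 --draft-max 64` flags above.

```python
def ngram_draft(history, n=4, draft_max=64):
    """Propose draft tokens by matching the last n tokens of `history`
    against an earlier occurrence of the same n-gram, then copying the
    tokens that followed it. Returns [] when no match is found, in which
    case decoding falls back to the normal one-token-at-a-time path.
    Illustrative sketch only; llama.cpp's ngram-simple differs in detail.
    """
    if len(history) < n + 1:
        return []
    key = tuple(history[-n:])
    # Scan backwards for the most recent earlier occurrence of the n-gram.
    for i in range(len(history) - n - 1, -1, -1):
        if tuple(history[i:i + n]) == key:
            # Copy up to draft_max tokens that followed the matched n-gram;
            # the target model then verifies them in a single batch.
            return history[i + n:i + n + draft_max]
    return []
```

Drafts cost almost nothing to produce, and wrong drafts are simply rejected during verification, which is why this gives a speedup on repetitive content with no quality impact.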