feat(serve): add optimized llama-server launcher with n-gram speculation

Add `make serve` and `make serve-ngram` for launching llama-server with
baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention,
no-mmap, full GPU offload). N-gram speculative decoding gives a
1.1-1.4x token-generation (tg) speedup on repetitive content without
depending on unmerged upstream PRs.
Update Phase 5 status: MTP (multi-token prediction) is months away
(4 unmerged PRs, no MoE support) and draft-model speculation is
stalled on ROCm buffer crashes.
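For reference, the baked-in settings map onto llama-server flags roughly
as below. This is a minimal sketch, not the actual
scripts/serve/launch.sh: the model path, host/port, and the
AMD_VULKAN_ICD driver selector are assumptions, and the n-gram
speculation flag is omitted because its name varies across llama.cpp
versions.

```shell
#!/usr/bin/env bash
# Sketch of the launch settings (hypothetical paths; real launcher differs).
set -euo pipefail

MODEL="${MODEL:-models/model.gguf}"   # assumed default model path

# Assumption: AMDVLK's loader honors this to pick the Mesa RADV driver.
export AMD_VULKAN_ICD=RADV

args=(
  -m "$MODEL"
  -ngl 99               # full GPU offload
  -fa                   # flash attention
  -ctk q4_0 -ctv q4_0   # q4_0 KV cache (K and V)
  --no-mmap             # load weights into RAM instead of mmap
  --host 127.0.0.1 --port 8080
)

# The real launcher would end with: exec llama-server "${args[@]}"
printf 'llama-server %s\n' "${args[*]}"
```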
Felipe Cardoso
2026-03-30 21:12:30 +02:00
parent ba24091791
commit dd403a907c
4 changed files with 169 additions and 5 deletions

bin/serve Executable file

@@ -0,0 +1,5 @@
#!/usr/bin/env bash
# Server dispatcher: resolve the repo root and hand off to the launcher
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
exec bash "$SCRIPT_DIR/scripts/serve/launch.sh" "$@"