Add `make serve` and `make serve-ngram` targets for launching llama-server with baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention, no-mmap, full GPU offload). N-gram speculative decoding gives a 1.1-1.4x token-generation (tg) speedup on repetitive content without depending on unmerged upstream PRs. Update Phase 5 status: MTP is months away (4 unmerged PRs, no MoE support); draft-model speculation is stalled on ROCm buffer crashes.
6 lines
173 B
Bash
Executable File
#!/usr/bin/env bash
# Server dispatcher: resolves the repo root (parent of this script's
# directory) and forwards all arguments to scripts/serve/launch.sh.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
exec bash "$SCRIPT_DIR/scripts/serve/launch.sh" "$@"
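For context, the settings named in the commit message map onto real llama-server options (`--flash-attn`, `--cache-type-k`/`--cache-type-v`, `--no-mmap`, `--n-gpu-layers`). Below is a hedged sketch of the kind of launcher the dispatcher above could forward to; it is not the repo's actual `scripts/serve/launch.sh`, and the model path, port, and `AMD_VULKAN_ICD` driver pin are placeholder assumptions:

```shell
#!/usr/bin/env bash
# Sketch of a llama-server launcher with the commit's baked-in settings.
# MODEL and PORT defaults are placeholders; AMD_VULKAN_ICD=RADV is an
# assumed (Mesa) way to pin the RADV Vulkan driver over AMDVLK.
set -euo pipefail

MODEL="${MODEL:-/path/to/model.gguf}"   # placeholder model path
PORT="${PORT:-8080}"                    # placeholder port

export AMD_VULKAN_ICD=RADV              # prefer RADV (assumption)

args=(
  --model "$MODEL"
  --port "$PORT"
  --flash-attn            # flash attention
  --cache-type-k q4_0     # quantize KV cache keys to q4_0
  --cache-type-v q4_0     # quantize KV cache values to q4_0
  --no-mmap               # load weights into RAM instead of mmap-ing
  --n-gpu-layers 99       # full GPU offload
)

# Print the assembled command; a real launcher would exec it instead:
#   exec llama-server "${args[@]}"
cmd="llama-server ${args[*]}"
echo "$cmd"
```

Keeping the flags in a bash array avoids quoting bugs when paths contain spaces, and the `${VAR:-default}` pattern lets callers override the model or port from the environment without editing the script.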