From 474d94a07e2023027686925c0f4959bc5fa70d20 Mon Sep 17 00:00:00 2001
From: Felipe Cardoso
Date: Fri, 3 Apr 2026 20:03:53 +0200
Subject: [PATCH] chore: update model catalog with gemma 4, opus distill, and
 hw-bandwidth target

---
 Makefile            |  4 ++++
 configs/models.conf | 15 +++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/Makefile b/Makefile
index bc8bfad..3fffa70 100644
--- a/Makefile
+++ b/Makefile
@@ -45,6 +45,10 @@ serve: ## Launch llama-server with optimized settings (ARGS="-m MODEL.gguf")
 serve-ngram: ## Launch with n-gram speculative decoding (ARGS="-m MODEL.gguf")
	@bash bin/serve --ngram $(ARGS)
 
+# --- Hardware Info ---
+hw-bandwidth: ## Measure GPU memory bandwidth and compute (clpeak)
+	@clpeak 2>&1
+
 # --- Optimize ---
 optimize: ## Interactive optimization walkthrough
	@bash bin/optimize --all
diff --git a/configs/models.conf b/configs/models.conf
index 4326607..141f4d2 100644
--- a/configs/models.conf
+++ b/configs/models.conf
@@ -5,17 +5,24 @@
 # Download with: huggingface-cli download REPO FILE --local-dir /data/models/llms/REPO
 
 # ── Smoke tests (quick, small) ───────────────────────────
-qwen3.5-0.8b-q8|unsloth/Qwen3.5-0.8B-GGUF|Qwen3.5-0.8B-Q8_0.gguf|0.8|smoke|Tiny, Q8 full precision
+qwen2.5-0.5b-q8|lmstudio-community/Qwen2.5-0.5B-Instruct-GGUF|Qwen2.5-0.5B-Instruct-Q8_0.gguf|0.4|smoke|Tiny Qwen2.5, Q8
+qwen3.5-0.8b-q8|unsloth/Qwen3.5-0.8B-GGUF|Qwen3.5-0.8B-Q8_0.gguf|0.8|smoke|Tiny Qwen3.5, Q8
 qwen3.5-2b-q4|unsloth/Qwen3.5-2B-GGUF|Qwen3.5-2B-Q4_K_S.gguf|1.2|smoke|Small dense 2B
 qwen3.5-4b-q4|unsloth/Qwen3.5-4B-GGUF|Qwen3.5-4B-Q4_K_S.gguf|2.5|smoke|Small dense 4B
 
 # ── Standard dense models ────────────────────────────────
 qwen3.5-9b-q4|unsloth/Qwen3.5-9B-GGUF|Qwen3.5-9B-Q4_K_S.gguf|5.1|dense|Dense 9B
 gpt-oss-20b-mxfp4|lmstudio-community/gpt-oss-20b-GGUF|gpt-oss-20b-MXFP4.gguf|12|dense|GPT-OSS 20B MXFP4
-glm-4.7-flash-q6|lmstudio-community/GLM-4.7-Flash-GGUF|GLM-4.7-Flash-Q6_K.gguf|23|dense|GLM 4.7 Flash Q6
+glm-4.7-flash-q6|unsloth/GLM-4.7-Flash-GGUF|GLM-4.7-Flash-UD-Q6_K_XL.gguf|24|moe|GLM 4.7 Flash, UD Q6 (MoE 30B, 3B active)
 
-# ── Qwen3.5-27B dense (download needed) ─────────────────
+# ── Gemma 4 ────────────────────────────────────────────
+gemma-4-26b-a4b-q6xl|unsloth/gemma-4-26B-A4B-it-GGUF|gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf|22|moe|Gemma 4 MoE 26B, 4B active, UD Q6 XL
+gemma-4-26b-a4b-q4s|unsloth/gemma-4-26B-A4B-it-GGUF|gemma-4-26B-A4B-it-UD-Q4_K_S.gguf|15|moe|Gemma 4 MoE 26B, 4B active, UD Q4
+gemma-4-31b-q3xl|unsloth/gemma-4-31B-it-GGUF|gemma-4-31B-it-UD-Q3_K_XL.gguf|14|dense|Gemma 4 dense 31B, UD Q3 XL
+
+# ── Qwen3.5-27B dense ──────────────────────────────────
 qwen3.5-27b-q4|unsloth/Qwen3.5-27B-GGUF|Qwen3.5-27B-Q4_K_M.gguf|17|dense|Dense 27B, quality-first
+qwen3.5-27b-opus-distill|Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF|Qwen3.5-27B.Q4_K_M.gguf|15|dense|Dense 27B, Claude Opus reasoning distilled v2
 
 # ── MoE models (fast generation, best for 64GB) ─────────
 qwen3.5-35b-a3b-q4|unsloth/Qwen3.5-35B-A3B-GGUF|Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf|21|moe|MoE 35B, 3B active, Unsloth dynamic XL
@@ -24,6 +30,7 @@ nemotron-cascade2-q8|bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF|nvidia_Nem
 
 # ── Coding models ─────────────────────────────────────────
 qwen3-coder-30b-a3b-q6|unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF|Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL.gguf|26|moe|Agentic coding MoE, pure Transformer
+qwen3-coder-next-q3|unsloth/Qwen3-Coder-Next-GGUF|Qwen3-Coder-Next-UD-Q3_K_XL.gguf|34|moe|80B MoE coder, >70% SWE-bench, hybrid DeltaNet
 
 # ── Draft models (speculative decoding) ───────────────────
 qwen3.5-0.8b-q8-draft|unsloth/Qwen3.5-0.8B-GGUF|Qwen3.5-0.8B-Q8_0.gguf|0.8|draft|Draft for Qwen3.5 speculative decoding
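Reviewer note: each `models.conf` entry is a pipe-delimited record. The field names below (`name|repo|file|size_gb|tag|desc`) are my reading of the columns; the catalog's actual loader (presumably in `bin/serve`) is not part of this patch. A minimal bash sketch of splitting one entry, using a line added by this patch:

```shell
#!/usr/bin/env bash
# Split one models.conf record into its six pipe-delimited fields.
# Field names are inferred from the catalog columns, not taken from the repo.
line='gemma-4-26b-a4b-q6xl|unsloth/gemma-4-26B-A4B-it-GGUF|gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf|22|moe|Gemma 4 MoE 26B, 4B active, UD Q6 XL'

IFS='|' read -r name repo file size_gb tag desc <<<"$line"

echo "name=$name"
echo "repo=$repo"
echo "file=$file"
echo "size_gb=${size_gb} GB"
echo "tag=$tag"
# Per the comment header in models.conf, the matching download command is:
#   huggingface-cli download "$repo" "$file" --local-dir "/data/models/llms/$repo"
```

Comment lines (`#`) and blanks would need to be skipped before splitting when reading the whole file.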