feat: add Qwen3.5 model catalog and agentic evaluation framework

Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
  Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
  in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
  (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
  (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
  recommendations for agentic use

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Felipe Cardoso
2026-03-26 00:20:23 +01:00
parent 71053997be
commit 58124cd657
11 changed files with 1354 additions and 16 deletions

configs/models.conf Normal file

@@ -0,0 +1,22 @@
# Model catalog for benchmarking
# Format: NAME|HF_REPO|FILE|SIZE_GB|CATEGORY|DESCRIPTION
#
# Categories: smoke, standard, moe, dense, coding, agentic
# Download with: huggingface-cli download REPO FILE --local-dir data/models
# ── Smoke tests (quick, small) ───────────────────────────
qwen3-4b|unsloth/Qwen3-4B-GGUF|Qwen3-4B-Q4_K_M.gguf|3|smoke|Quick validation
# ── Standard benchmarks ──────────────────────────────────
qwen3-14b|unsloth/Qwen3-14B-GGUF|Qwen3-14B-Q4_K_M.gguf|9|standard|Standard test model
# ── Qwen3.5 MoE models (fast generation, best for 64GB) ─
qwen3.5-35b-a3b-q8|unsloth/Qwen3.5-35B-A3B-GGUF|Qwen3.5-35B-A3B-Q8_0.gguf|37|moe|Top pick: near-full precision, 3B active
qwen3.5-35b-a3b-q4|unsloth/Qwen3.5-35B-A3B-GGUF|Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf|22|moe|Best quality/size ratio, 3B active
# ── Qwen3.5 dense models ────────────────────────────────
qwen3.5-27b-q4|unsloth/Qwen3.5-27B-GGUF|Qwen3.5-27B-Q4_K_M.gguf|17|dense|Dense 27B, quality-first
qwen3.5-27b-q8|unsloth/Qwen3.5-27B-GGUF|Qwen3.5-27B-Q8_0.gguf|29|dense|Dense 27B, max quality
# ── Coding / agentic models ─────────────────────────────
qwen3-coder-30b-a3b|unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF|Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf|18|agentic|Best for tool use + coding, 3B active
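
The catalog format above (NAME|HF_REPO|FILE|SIZE_GB|CATEGORY|DESCRIPTION) is simple enough to read with a few lines of code. The following is a minimal sketch in Python, not the parser used by the repository's scripts. The names `ModelEntry`, `parse_catalog`, `download_command`, and `is_downloaded` are hypothetical. It also assumes that a downloaded file ends up directly under `data/models/FILE`, which matches the `--local-dir` hint in the header comment but is not confirmed by the commit.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ModelEntry:
    """One catalog row (hypothetical type, for illustration only)."""
    name: str
    hf_repo: str
    file: str
    size_gb: int
    category: str
    description: str


def parse_catalog(text: str) -> list[ModelEntry]:
    """Parse pipe-delimited catalog lines, skipping comments and blank lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # maxsplit=5 keeps any '|' inside the free-text description intact.
        fields = line.split("|", 5)
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        name, repo, file, size, category, description = fields
        entries.append(ModelEntry(name, repo, file, int(size), category, description))
    return entries


def download_command(entry: ModelEntry, models_dir: str = "data/models") -> list[str]:
    """Build the download command given in the catalog header comment."""
    return ["huggingface-cli", "download", entry.hf_repo, entry.file,
            "--local-dir", models_dir]


def is_downloaded(entry: ModelEntry, models_dir: str = "data/models") -> bool:
    """Assumption: the model file sits directly under models_dir."""
    return (Path(models_dir) / entry.file).is_file()


# Two rows copied from the catalog above, used as sample input.
SAMPLE = """\
# Format: NAME|HF_REPO|FILE|SIZE_GB|CATEGORY|DESCRIPTION
qwen3-4b|unsloth/Qwen3-4B-GGUF|Qwen3-4B-Q4_K_M.gguf|3|smoke|Quick validation
qwen3.5-35b-a3b-q8|unsloth/Qwen3.5-35B-A3B-GGUF|Qwen3.5-35B-A3B-Q8_0.gguf|37|moe|Top pick: near-full precision, 3B active
"""

if __name__ == "__main__":
    for entry in parse_catalog(SAMPLE):
        status = "downloaded" if is_downloaded(entry) else "missing"
        print(f"{entry.name:<22} {entry.size_gb:>3} GB  {entry.category:<8} {status}")
        print("  " + " ".join(download_command(entry)))
```

Splitting with a field limit rather than a plain `split("|")` means a description containing a pipe character would still parse, and malformed rows fail loudly instead of being silently skipped.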