feat: add Qwen3.5 model catalog and agentic evaluation framework

Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
  Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
  in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
  (ollama or llama.cpp). Suites: quick (HumanEval+IFEval),
  code (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison,
  model recommendations for agentic use

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 00:20:23 +01:00
parent 71053997be
commit 58124cd657
11 changed files with 1354 additions and 16 deletions
--- a/bin/agentic
+++ b/bin/agentic
@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+# Agentic evaluation dispatcher
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+
+case "${1:-help}" in
+    setup)   exec bash "$SCRIPT_DIR/scripts/agentic/setup.sh" ;;
+    run)     exec bash "$SCRIPT_DIR/scripts/agentic/run-eval.sh" "${@:2}" ;;
+    quick)   exec bash "$SCRIPT_DIR/scripts/agentic/run-eval.sh" --suite quick "${@:2}" ;;
+    code)    exec bash "$SCRIPT_DIR/scripts/agentic/run-eval.sh" --suite code "${@:2}" ;;
+    tooluse) exec bash "$SCRIPT_DIR/scripts/agentic/run-eval.sh" --suite tooluse "${@:2}" ;;
+    full)    exec bash "$SCRIPT_DIR/scripts/agentic/run-eval.sh" --suite full "${@:2}" ;;
+    *)
+        echo "Usage: agentic <command> [options]"
+        echo ""
+        echo "Commands:"
+        echo "  setup     Install evaluation frameworks (inspect-ai, evalplus, bigcodebench)"
+        echo "  quick     EvalPlus HumanEval+ + IFEval (~1 hour)"
+        echo "  code      EvalPlus + BigCodeBench (~2-3 hours)"
+        echo "  tooluse   BFCL function calling evaluation (~1-2 hours)"
+        echo "  full      All evaluations (~5-6 hours)"
+        echo "  run       Custom run (--suite SUITE --model NAME --endpoint URL)"
+        echo ""
+        echo "All commands except setup require --model NAME. Examples:"
+        echo "  agentic quick --model qwen3.5:35b-a3b-q8_0"
+        echo "  agentic full --model qwen3-coder:30b-a3b --endpoint http://localhost:8080/v1"
+        exit 1
+        ;;
+esac