Model Evaluation

Online Resources for Model Evaluation

  1. Artificial Analysis - The best starting point. An independent benchmarking platform tracking LLM quality, speed, and cost — useful for comparing hosted API providers serving the same model, or for tracking how performance and pricing evolve over time.
  2. LM Arena - Elo-based leaderboard ranked via human preference votes from blind side-by-side model comparisons. Widely cited as a real-world quality signal distinct from academic benchmarks.
  3. LLM Stats - Similar to LM Arena but more consolidated. Aggregates benchmark scores, context window sizes, licensing, and other statistics across a wide range of models.

How to Evaluate a Model

Start with memory fit. Before assessing quality, determine whether the model can run at all on your hardware. A model that doesn’t fit in VRAM can be quantized to a lower bit depth to reduce its footprint, or layers can be offloaded to system RAM at the cost of slower inference. This is the practical baseline — a model that runs slowly is still usable; one that doesn’t fit isn’t.
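The fit check above can be done as back-of-the-envelope arithmetic: weight size plus KV cache against available VRAM. A minimal sketch — the layer counts, head dimensions, and bits-per-weight figure below are hypothetical examples, and real runtimes add allocator overhead on top:

```python
# Rough VRAM fit check: weights + KV cache vs. available memory.
# All figures are ballpark estimates, not exact allocator behavior.

def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB at a given bit depth."""
    return n_params_b * bits_per_weight / 8  # billions of params * bits / 8 = GB

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 tensors (K and V) per layer per token, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

def fits(vram_gb: float, n_params_b: float, bits: float, kv_gb: float,
         overhead_gb: float = 1.0) -> bool:
    """Does weights + KV cache + a fixed overhead allowance fit in VRAM?"""
    return weights_gb(n_params_b, bits) + kv_gb + overhead_gb <= vram_gb

# Hypothetical 27B dense model at ~4.8 effective bits/weight (Q4-class),
# with made-up architecture numbers for illustration:
w = weights_gb(27, 4.8)                              # ~16.2 GB of weights
kv = kv_cache_gb(n_layers=48, n_kv_heads=8,
                 head_dim=128, context_tokens=32_768)  # ~6.4 GB at 32k context
print(f"weights ~{w:.1f} GB, kv ~{kv:.1f} GB, fits in 24GB: {fits(24, 27, 4.8, kv)}")
```

If the check fails, the two levers are exactly the ones described above: drop the bit depth (smaller weights) or offload layers to system RAM (same weights, slower inference).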

Quantization degrades quality below Q4. Q4_K_M is the widely accepted quality floor for general use — the balance of size reduction and output quality is good enough for most tasks. Going below Q4 (Q3, Q2, and the 1-bit variants) causes increasingly noticeable degradation. The exception is IQ (importance-matrix) quantizations, which use a calibration dataset to apply precision selectively and can outperform K-quants at the same bit depth, particularly at lower levels.
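The size impact of each level follows directly from its effective bits per weight. A sketch using rough illustrative figures — actual bits-per-weight varies by model and by llama.cpp version, so treat the table values as placeholders, not reference data:

```python
# Approximate effective bits-per-weight for common GGUF quant formats.
# These are rough illustrative figures, NOT authoritative values.
BPW = {
    "Q8_0":   8.5,
    "Q6_K":   6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,   # the widely accepted quality floor
    "Q3_K_M": 3.9,
    "Q2_K":   2.7,
}

def file_size_gb(n_params_b: float, bpw: float) -> float:
    """Estimated quantized file size for a model with n_params_b billion params."""
    return n_params_b * bpw / 8

for name, bpw in BPW.items():
    print(f"{name:8s} ~{file_size_gb(27, bpw):5.1f} GB for a 27B model")
```

The pattern to notice: each step down saves a few GB, but below Q4_K_M the savings shrink while the quality cost grows.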

Quantization quality is measured by perplexity — the exponential of the model’s average negative log-likelihood on a reference text, i.e. how surprised the model is by each correct next token. What matters for quantization is the increase relative to the full-precision original: a smaller rise means the quantized model’s output distribution stays closer to the unquantized version. Many model releases publish perplexity scores for each quantization level, but this is not guaranteed.
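The definition above reduces to a one-liner over per-token log-probabilities. A minimal sketch (tools such as llama.cpp’s perplexity utility compute the same quantity over a full corpus):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model assigning probability 0.25 to every reference token has
# perplexity 4 -- "as uncertain as a uniform choice among 4 tokens".
ppl = perplexity([math.log(0.25)] * 100)

# Judging a quant: compare against the full-precision baseline on the
# same text; the delta, not the absolute value, is the quality signal.
# delta = perplexity(quantized_logprobs) - perplexity(fp16_logprobs)
```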

Where to find quantized models. Unsloth and bartowski on Hugging Face are the two most prominent sources for high-quality GGUF quantizations, and are typically among the first to publish quantizations for new model releases. When perplexity data is available, it’s usually included alongside the model download on Hugging Face.

Evaluating reasoning and thinking capability. Some models expose a dedicated thinking mode — extended chain-of-thought reasoning before the final answer — while others reason implicitly. When evaluating reasoning, look for scores on AIME (competition math), GPQA Diamond (graduate-level science), and MATH-500. These benchmarks are hard to game and correlate well with real-world complex problem solving. A model’s thinking budget (the token limit on internal reasoning) also matters: a model trained with a 40k thinking budget behaves differently from one with 80k.
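When comparing thinking models yourself, it helps to separate the reasoning trace from the final answer and measure how much of the budget it consumed. A sketch assuming the model wraps reasoning in `<think>...</think>` tags — a common convention in recent open-weight releases, but not universal, and the whitespace-based token estimate is a deliberate simplification:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Split model output into (reasoning trace, final answer).
    Assumes <think>...</think> delimiters; returns empty reasoning if absent."""
    m = THINK_RE.search(text)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = THINK_RE.sub("", text, count=1).strip()
    return reasoning, answer

def within_budget(reasoning: str, budget_tokens: int) -> bool:
    # Crude token estimate via whitespace split; real tokenizers differ.
    return len(reasoning.split()) <= budget_tokens

out = "<think>2 + 2 is 4 because of basic addition.</think>The answer is 4."
reasoning, answer = split_thinking(out)
```

Logging the trace length across a sample of hard prompts shows whether a model actually uses its budget or routinely stops early — useful context when comparing a 40k-budget model against an 80k one.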

Evaluating tool use. TAU-bench (airline and retail variants) is the most meaningful public benchmark for agentic tool use — it tests multi-turn, real-world task completion rather than single tool calls in isolation. Tool Calling (V*) measures structured calling accuracy across schemas. A model that scores well on both is reliable for function-calling pipelines and autonomous agents.
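Before trusting benchmark numbers, a quick local sanity check is to validate the calls a model actually emits against your tool schema. A hand-rolled sketch — the tool name, argument names, and types below are hypothetical, and production pipelines typically use a real JSON Schema validator instead:

```python
import json

def validate_tool_call(raw: str, schema: dict) -> tuple[bool, str]:
    """Check a model-emitted tool call against a minimal schema:
    valid JSON, correct tool name, required args present, basic type match."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if call.get("name") != schema["name"]:
        return False, f"unknown tool {call.get('name')!r}"
    args = call.get("arguments", {})
    for arg, typ in schema["required"].items():
        if arg not in args:
            return False, f"missing argument {arg!r}"
        if not isinstance(args[arg], typ):
            return False, f"{arg!r} should be {typ.__name__}"
    return True, "ok"

# Hypothetical schema for illustration:
flight_schema = {
    "name": "book_flight",
    "required": {"origin": str, "destination": str, "passengers": int},
}

good = '{"name": "book_flight", "arguments": {"origin": "SFO", "destination": "JFK", "passengers": 2}}'
bad = '{"name": "book_flight", "arguments": {"origin": "SFO"}}'
```

Running a few dozen prompts through a check like this catches the most common structured-calling failures (malformed JSON, dropped required arguments) faster than any leaderboard can.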

Evaluating general capability. IFEval measures instruction-following fidelity (does the model do exactly what you ask, in the format you specified?). SWE-bench Verified measures real-world software engineering — resolving actual GitHub issues. Together these cover two distinct failure modes: models that hallucinate or drift from instructions, and models that can’t execute multi-step technical tasks.
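What makes IFEval useful is that its constraints are programmatically verifiable, so you can run the same style of spot check on your own prompts. A sketch using a hypothetical instruction ("answer in exactly 3 bullet points, and never use the word 'very'"):

```python
def check_instructions(output: str) -> dict[str, bool]:
    """IFEval-style verifiable checks for one hypothetical prompt's
    constraints: exactly 3 '- ' bullets, and a banned word."""
    lines = [line for line in output.strip().splitlines() if line.strip()]
    return {
        "exactly_3_bullets": len(lines) == 3 and all(line.startswith("- ") for line in lines),
        "no_banned_word": "very" not in output.lower(),
    }

compliant = "- fast\n- small\n- cheap"
result = check_instructions(compliant)
```

Pass rates on a handful of such checks, over your own prompts, tend to track the drift-from-instructions failure mode far better than eyeballing transcripts.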

Model Examples

Qwen3.5-27B — A fully dense (no MoE) 27B model from Alibaba Cloud (Feb 2026). All 27B parameters are active on every forward pass, which makes inference predictable and efficient without expert routing overhead. Its strength is breadth: IFEval 95.0 (instruction following), Tool Calling V* 93.7 (structured tool use), SWE-bench Verified 72.4 (coding/engineering), and GPQA Diamond 85.5 (graduate-level reasoning). It has a native 262k context window and supports a thinking mode for harder problems. Practically, it fits in ~22GB of VRAM, making it runnable on consumer or prosumer hardware. A good default choice when you need a capable all-rounder that runs locally.

MiniMax-M1 — A hybrid MoE model with 456B total parameters but only ~46B active per token. It uses a linear attention mechanism (Lightning Attention) whose cost grows roughly linearly with context length, rather than quadratically as in standard attention, enabling a native 1M token context window at practical compute cost. Its standout qualities are long-context retrieval and agentic tool use: it outperforms all open-weight models on TAU-bench, which is the most realistic public benchmark for multi-turn tool use. Reasoning benchmarks are strong (AIME 2024: 86.0, MATH-500: 96.8). The tradeoff is hardware demand — 456B total weights require a multi-GPU setup even with MoE sparsity. A good choice when the task involves long documents, complex agentic pipelines, or extended reasoning chains.
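The dense-vs-MoE tradeoff between these two examples comes down to which parameter count drives which cost: total parameters must be resident in memory, while only active parameters are read per token. A rough sketch, assuming a Q4-class ~4.8 bits/weight for both models:

```python
def resident_and_active_gb(total_b: float, active_b: float,
                           bits_per_weight: float) -> tuple[float, float]:
    """(GB needed to hold all weights, GB of weights read per forward pass)."""
    gb = lambda n_params_b: n_params_b * bits_per_weight / 8
    return gb(total_b), gb(active_b)

# Dense 27B: resident == per-token; what you store is what you compute with.
dense_resident, dense_active = resident_and_active_gb(27, 27, 4.8)    # 16.2 / 16.2

# MoE 456B with ~46B active: per-token reads resemble a dense ~46B model,
# but all 456B weights must still live somewhere -- hence multi-GPU.
moe_resident, moe_active = resident_and_active_gb(456, 46, 4.8)       # 273.6 / 27.6
```

This is why the MoE model's inference speed can approach a mid-sized dense model while its memory bill remains that of a very large one.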