Evals#

gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?

To answer these questions, we have created an evaluation suite that tests model capabilities across a broad range of tasks.

The suite covers fundamental tool use, web browsing, project initialization, and a growing set of practical programming tasks that reflect real-world agentic work: building APIs, refactoring code, parsing data formats, writing tests, and more.

Model Leaderboard#

The table below shows pass rates across our eval suites for each model (best tool format per model). Models are ranked by overall pass rate, with breakdowns by suite type.

$ python -m gptme.eval.leaderboard --results-dir eval_results --format rst --min-tests 4
============================  ==========  ===============  ==========  ==========
Model                         Format      Overall          Basic       Practical 
============================  ==========  ===============  ==========  ==========
Claude 3.5 Sonnet (Jun 2024)  default     7/7 (100%)       4/4         -         
Gemini 1.5 Pro (OR)           default     7/7 (100%)       4/4         -         
GPT-4 Turbo                   default     7/7 (100%)       4/4         -         
Hermes 3 405B                 default     7/7 (100%)       4/4         -         
Claude 3.5 Haiku              default     6/6 (100%)       4/4         -         
Claude Sonnet 4.6             tool        54/60 (90%)      14/18       40/41     
GPT-4o                        tool        5/5 (100%)       4/4         -         
GPT-4o Mini                   tool        5/5 (100%)       4/4         -         
Llama 3.1 405B                markdown    5/5 (100%)       4/4         -         
Claude 3.5 Sonnet (Oct 2024)  default     5/5 (100%)       4/4         -         
Claude Sonnet 4               markdown    5/5 (100%)       4/4         -         
Kimi K2                       xml         4/4 (100%)       4/4         -         
Gemini 2.5 Flash              markdown    4/4 (100%)       4/4         -         
Kimi K2 (OR)                  markdown    4/4 (100%)       4/4         -         
Magistral Medium              markdown    4/4 (100%)       4/4         -         
Qwen3 Max                     markdown    4/4 (100%)       4/4         -         
Grok 4 Fast                   markdown    4/4 (100%)       4/4         -         
Claude Sonnet 4.5             tool        16/18 (89%)      16/18       -         
GPT-4o Mini (OR)              tool        15/18 (83%)      15/18       -         
Claude 3 Haiku                default     6/7 (86%)        3/4         -         
o1-preview                    default     6/7 (86%)        3/4         -         
GPT-5                         markdown    4/5 (80%)        4/4         -         
GPT-5 Mini                    xml         4/5 (80%)        3/4         -         
Claude Haiku 4.5              tool        13/19 (68%)      13/18       -         
Llama 3.1 70B                 default     5/7 (71%)        2/4         -         
Gemini 1.5 Flash (OR)         default     5/7 (71%)        3/4         -         
Gemma 2 27B                   default     5/7 (71%)        4/4         -         
Llama 3.2 90B                 default     3/4 (75%)        3/4         -         
Llama 3.2 11B                 default     3/4 (75%)        3/4         -         
Grok Code Fast                markdown    3/4 (75%)        3/4         -         
Hermes 4 70B                  xml         3/4 (75%)        3/4         -         
o1-mini                       default     4/7 (57%)        3/4         -         
Llama 3.1 8B                  default     3/5 (60%)        2/4         -         
Qwen3 32B                     markdown    3/5 (60%)        2/4         -         
Gemma 2 9B                    default     3/7 (43%)        2/4         -         
Hermes 2 Pro 8B               default     1/4 (25%)        1/4         -         
Claude Opus 4.1               markdown    1/5 (20%)        1/4         -         
Gemini 1.5 Flash              default     0/5 (0%)         0/4         -         
Hermes 3 70B                  default     0/4 (0%)         0/4         -         
============================  ==========  ===============  ==========  ==========

Notes:

  • Format shows the best-performing --tool-format for each model.

  • Basic tests cover fundamental tool use (file I/O, shell, git, Python).

  • Practical tests cover real-world programming tasks (APIs, data processing, refactoring).

  • Models with fewer than 4 tests are excluded.

  • Results use a 300-second timeout per test. Some models may perform better with longer timeouts.

To generate this table locally:

gptme-eval --leaderboard --leaderboard-format rst
gptme-eval --leaderboard --leaderboard-format csv        # for data analysis
gptme-eval --leaderboard --leaderboard-format markdown   # for GitHub/blog
gptme-eval --leaderboard --leaderboard-format html       # self-contained HTML page

Usage#

You can run the simple hello eval like this:

gptme-eval hello --model anthropic/claude-sonnet-4-6

However, we recommend running it in Docker to improve isolation and reproducibility:

make build-docker
docker run \
    -e "ANTHROPIC_API_KEY=<your api key>" \
    -v $(pwd)/eval_results:/app/eval_results \
    gptme-eval hello --model anthropic/claude-sonnet-4-6

Available Eval Suites#

The evaluation suite is organized into named suites that can be run individually or together:

basic

Fundamental tool use: reading and writing files, patching code, running Python in IPython, executing shell commands, using git, counting words, transforming JSON, multi-file refactoring, writing tests, generating CLI programs, and fixing bugs. (~18 tests)

browser

Web browsing and data extraction using the browser tool.

init_projects

Project initialization: init-git, init-react, init-rust. Tests the ability to scaffold new projects from scratch.

practical, practical2, …, practical33

A growing series of real-world programming tasks that go beyond basic file I/O. The practical suites now cover 99 tasks across data processing, refactoring, algorithms, async/concurrency, SQL, validation, graph search, dynamic programming, tree data structures, and classic interview problems.

Early suites give a good feel for the format:

==========  ==============================================================  ============================================
Suite       Description                                                     Tests
==========  ==============================================================  ============================================
practical   Web APIs, log parsing, error handling                           build-api, parse-log, add-error-handling
practical2  Data filtering, templating, CSV validation                      sort-and-filter, template-fill, validate-csv
practical3  Unit test writing, SQLite persistence                           write-tests-calculator, sqlite-store
practical4  Data aggregation, schedule overlap detection, topological sort  group-by, schedule-overlaps, topo-sort
practical5  Code refactoring, data pipelines, regex scrubbing               rename-function, data-pipeline, regex-scrub
practical6  CSV analysis, word frequency counting, config merging           csv-analysis, word-frequency, merge-configs
practical7  INI-to-JSON conversion, JSON diff, changelog generation         ini-to-json, json-diff, changelog-gen
==========  ==============================================================  ============================================

Later suites extend coverage with semver sorting, Roman numerals, matrix and bracket tasks, async pipelines and worker queues, SQL analytics, tries, LRU caches, interval merging, min-stack, knight moves, histogram area, edit distance, BST operations, coin change, Dijkstra, spiral matrix, number of islands, Kadane’s algorithm, 0/1 knapsack, flood fill, trapping rain water, word break, permutations, longest common subsequence, stock trading with cooldown, image rotation, N-Queens, longest increasing subsequence, cycle detection, sliding window maximum, decode ways, meeting rooms, longest palindromic substring, jump game, task scheduler, house robber, max product subarray, finding all anagrams, minimum path sum, gas station, next permutation, word break II, unique paths, rotate array, decode string, top-k frequent elements, partition equal subset sum, 3sum, majority element (Boyer-Moore voting), counting bits, combination sum, generate parentheses, single number (XOR), product except self, find duplicate (Floyd’s cycle detection), and missing number.
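
To give a feel for the difficulty level of these tasks, here is a sketch of Kadane's algorithm (maximum subarray sum), one of the problems named above. The function name and signature are illustrative only; they are not the eval's actual expected interface.

```python
def max_subarray_sum(nums: list[int]) -> int:
    """Kadane's algorithm: maximum sum of a contiguous subarray, in O(n)."""
    if not nums:
        raise ValueError("expected a non-empty list")
    best = current = nums[0]
    for x in nums[1:]:
        # Either extend the current subarray or start fresh at x.
        current = max(x, current + x)
        best = max(best, current)
    return best

print(max_subarray_sum([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # → 6 (subarray [4, -1, 2, 1])
```

An eval of this kind typically asks the model to write such a function to a file and verifies it against a set of test cases.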

For the current authoritative suite list, run gptme-eval --list.

Run specific tests or suites by name:

gptme-eval build-api --model anthropic/claude-sonnet-4-6
gptme-eval sort-and-filter rename-function --model anthropic/claude-sonnet-4-6

Run all practical suites at once (useful for benchmarking):

gptme-eval all-practical --model anthropic/claude-sonnet-4-6

# Or run every suite (basic + browser + init_projects + practical):
gptme-eval all --model anthropic/claude-sonnet-4-6

Raw Results#

Full per-test results from all eval runs are stored as CSV files in eval_results/ subdirectories. Results are published to the eval-results branch of the repository.

To view raw results locally:

# View latest results
cat eval_results/*/eval_results.csv | head -50

# Export leaderboard as CSV for analysis
gptme-eval --leaderboard --leaderboard-format csv

# Export as JSON for programmatic use
gptme-eval --leaderboard --leaderboard-format json
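
If you want to post-process results programmatically, a minimal sketch with the standard library's csv module looks like this. The column names (`model`, `test`, `passed`) are assumptions for illustration; check the header row of your actual eval_results.csv files and adjust accordingly.

```python
import csv
import io

# Hypothetical per-test results; the real eval_results.csv schema may
# use different column names -- inspect the header row of your files.
sample = """model,test,passed
anthropic/claude-sonnet-4-6,hello,true
anthropic/claude-sonnet-4-6,build-api,true
anthropic/claude-sonnet-4-6,parse-log,false
"""

rows = list(csv.DictReader(io.StringIO(sample)))
passed = sum(r["passed"] == "true" for r in rows)
print(f"{passed}/{len(rows)} ({passed / len(rows):.0%})")  # → 2/3 (67%)
```

For real data, replace the in-memory sample with `open("eval_results/<run>/eval_results.csv")` and group rows by model before computing pass rates.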

Other evals#

SWE-Bench support is now available via gptme-eval-swebench. It can:

  • inspect datasets and instances with --info

  • generate predictions.jsonl in the official SWE-Bench format

  • resume interrupted runs with --resume

  • optionally invoke the official harness with --run-harness
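
The predictions file is JSON Lines: one object per instance, using the official SWE-Bench prediction keys. A minimal sketch of the format (the patch and model label below are placeholders; gptme-eval-swebench generates this file for you):

```python
import json

# One prediction per line; keys follow the official SWE-Bench
# predictions format. The patch here is a placeholder diff.
predictions = [
    {
        "instance_id": "django__django-11099",
        "model_name_or_path": "gptme-claude-sonnet-4-6",  # free-form label
        "model_patch": "diff --git a/file.py b/file.py\n...",
    }
]

with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```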

Example single-instance smoke test:

gptme-eval-swebench \
    -m anthropic/claude-sonnet-4-6 \
    -i django__django-11099

Example full SWE-Bench Lite run:

gptme-eval-swebench \
    -m anthropic/claude-sonnet-4-6 \
    --resume \
    --run-harness \
    --dataset princeton-nlp/SWE-bench_Lite \
    --run-id gptme_baseline_2026

Notes:

  • The built-in summary printed by gptme-eval-swebench is a lightweight file-coverage heuristic. For authoritative pass/fail results and leaderboard submission, use the official SWE-Bench harness.

  • --run-harness requires Docker plus swebench[evaluation] dependencies.

  • Use gptme-eval-swebench --info to inspect dataset size and specific instance IDs before launching an expensive run.

See also:

  • PR #1994 — SWE-Bench harness integration

  • PR #2045 — resume support for interrupted runs