Evals#

gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?

To answer these questions, we have built an evaluation suite that tests LLM capabilities across a broad range of tasks.

The suite covers fundamental tool use, web browsing, project initialization, and a growing set of practical programming tasks that reflect real-world agentic work: building APIs, refactoring code, parsing data formats, writing tests, and more.

Model Leaderboard#

The table below shows pass rates across our eval suites for each model (best tool format per model). Models are ranked by overall pass rate, with breakdowns by suite type.

$ python -m gptme.eval.leaderboard --results-dir eval_results --format rst --min-tests 4
============================  ==========  ===============  ==========  ==========
Model                         Format      Overall          Basic       Practical 
============================  ==========  ===============  ==========  ==========
Claude Sonnet 4.6             tool        59/60 (98%)      18/18       40/41     
Claude 3.5 Sonnet (Jun 2024)  default     7/7 (100%)       4/4         -         
Gemini 1.5 Pro (OR)           default     7/7 (100%)       4/4         -         
GPT-4 Turbo                   default     7/7 (100%)       4/4         -         
Hermes 3 405B                 default     7/7 (100%)       4/4         -         
Claude 3.5 Haiku              default     6/6 (100%)       4/4         -         
GPT-4o                        tool        5/5 (100%)       4/4         -         
GPT-4o Mini                   tool        5/5 (100%)       4/4         -         
Llama 3.1 405B                markdown    5/5 (100%)       4/4         -         
Claude 3.5 Sonnet (Oct 2024)  default     5/5 (100%)       4/4         -         
Claude Sonnet 4               markdown    5/5 (100%)       4/4         -         
Claude Haiku 4.5              tool        17/19 (89%)      17/18       -         
Kimi K2                       xml         4/4 (100%)       4/4         -         
Gemini 2.5 Flash              markdown    4/4 (100%)       4/4         -         
Kimi K2 (OR)                  markdown    4/4 (100%)       4/4         -         
Magistral Medium              markdown    4/4 (100%)       4/4         -         
Qwen3 Max                     markdown    4/4 (100%)       4/4         -         
Grok 4 Fast                   markdown    4/4 (100%)       4/4         -         
Claude Sonnet 4.5             tool        16/18 (89%)      16/18       -         
GPT-4o Mini (OR)              tool        15/18 (83%)      15/18       -         
Claude 3 Haiku                default     6/7 (86%)        3/4         -         
o1-preview                    default     6/7 (86%)        3/4         -         
GPT-5                         markdown    4/5 (80%)        4/4         -         
GPT-5 Mini                    xml         4/5 (80%)        3/4         -         
Llama 3.1 70B                 default     5/7 (71%)        2/4         -         
Gemini 1.5 Flash (OR)         default     5/7 (71%)        3/4         -         
Gemma 2 27B                   default     5/7 (71%)        4/4         -         
Llama 3.2 90B                 default     3/4 (75%)        3/4         -         
Llama 3.2 11B                 default     3/4 (75%)        3/4         -         
Grok Code Fast                markdown    3/4 (75%)        3/4         -         
Hermes 4 70B                  xml         3/4 (75%)        3/4         -         
o1-mini                       default     4/7 (57%)        3/4         -         
Llama 3.1 8B                  default     3/5 (60%)        2/4         -         
Qwen3 32B                     markdown    3/5 (60%)        2/4         -         
Gemma 2 9B                    default     3/7 (43%)        2/4         -         
Hermes 2 Pro 8B               default     1/4 (25%)        1/4         -         
Claude Opus 4.1               markdown    1/5 (20%)        1/4         -         
Gemini 1.5 Flash              default     0/5 (0%)         0/4         -         
Hermes 3 70B                  default     0/4 (0%)         0/4         -         
============================  ==========  ===============  ==========  ==========

Notes:

  • Format shows the best-performing --tool-format for each model.

  • Basic tests cover fundamental tool use (file I/O, shell, git, Python).

  • Practical tests cover real-world programming tasks (APIs, data processing, refactoring).

  • Models with fewer than 4 tests are excluded.

  • Results use a 300-second timeout per test. Some models may perform better with longer timeouts.
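The Overall, Basic, and Practical columns use a `passed/total (percent)` cell format. If you post-process the RST or markdown table yourself, the fraction can be pulled out with a small helper; this is a sketch for illustration, not part of gptme:

```python
import re

def parse_pass_rate(cell: str) -> tuple[int, int]:
    """Parse a pass-rate cell like '59/60 (98%)' into (passed, total).

    Only the leading fraction is parsed; the percentage is derived data.
    """
    m = re.match(r"\s*(\d+)/(\d+)", cell)
    if not m:
        raise ValueError(f"unrecognized pass-rate cell: {cell!r}")
    return int(m.group(1)), int(m.group(2))
```

For cells like `4/4` without a percentage, the same regex still applies, since it only anchors on the fraction.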

To generate this table locally:

gptme eval --leaderboard --leaderboard-format rst
gptme eval --leaderboard --leaderboard-format csv       # for data analysis
gptme eval --leaderboard --leaderboard-format markdown   # for GitHub/blog
gptme eval --leaderboard --leaderboard-format html       # self-contained HTML page

Usage#

You can run the simple hello eval like this:

gptme-eval hello --model anthropic/claude-sonnet-4-6

However, we recommend running it in Docker to improve isolation and reproducibility:

make build-docker
docker run \
    -e "ANTHROPIC_API_KEY=<your api key>" \
    -v $(pwd)/eval_results:/app/eval_results \
    gptme-eval hello --model anthropic/claude-sonnet-4-6

Available Eval Suites#

The evaluation suite is organized into named suites that can be run individually or together:

basic

Fundamental tool use: reading and writing files, patching code, running Python in IPython, executing shell commands, using git, counting words, transforming JSON, multi-file refactoring, writing tests, generating CLI programs, and fixing bugs. (~18 tests)

browser

Web browsing and data extraction using the browser tool.

init_projects

Project initialization: init-git, init-react, init-rust. Tests the ability to scaffold new projects from scratch.

practical, practical2, …, practical21

A growing series of real-world programming tasks that go beyond basic file I/O. Together, the practical suites cover 62 tasks spanning data processing, refactoring, algorithms, async/concurrency, SQL, validation, graph search, dynamic programming, and tree data structures.

Early suites give a good feel for the format:

==========  ==============================================================  ============================================
Suite       Description                                                     Tests
==========  ==============================================================  ============================================
practical   Web APIs, log parsing, error handling                           build-api, parse-log, add-error-handling
practical2  Data filtering, templating, CSV validation                      sort-and-filter, template-fill, validate-csv
practical3  Unit test writing, SQLite persistence                           write-tests-calculator, sqlite-store
practical4  Data aggregation, schedule overlap detection, topological sort  group-by, schedule-overlaps, topo-sort
practical5  Code refactoring, data pipelines, regex scrubbing               rename-function, data-pipeline, regex-scrub
practical6  CSV analysis, word frequency counting, config merging           csv-analysis, word-frequency, merge-configs
practical7  INI-to-JSON conversion, JSON diff, changelog generation         ini-to-json, json-diff, changelog-gen
==========  ==============================================================  ============================================

Later suites extend coverage with semver sorting, Roman numerals, matrix and bracket tasks, async pipelines and worker queues, SQL analytics, tries, LRU caches, interval merging, min-stack, knight moves, histogram area, edit distance, BST operations, coin change, Dijkstra, spiral matrix, number of islands, Kadane’s algorithm, 0/1 knapsack, and flood fill.

For the current authoritative suite list, run gptme-eval --list.

Run specific tests or suites by name:

gptme-eval build-api --model anthropic/claude-sonnet-4-6
gptme-eval sort-and-filter rename-function --model anthropic/claude-sonnet-4-6

Run all practical suites at once (useful for benchmarking):

gptme-eval practical practical2 practical3 practical4 practical5 practical6 practical7 \
    practical8 practical9 practical10 practical11 practical12 practical13 \
    practical14 practical15 practical16 practical17 practical18 practical19 \
    practical20 practical21 \
    --model anthropic/claude-sonnet-4-6
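Typing out every suite name by hand is error-prone. Since the names follow a regular scheme (`practical`, then `practical2` through `practical21`), the list can be generated programmatically; a minimal sketch (assumes `gptme-eval` is on your PATH):

```python
import subprocess

# Build the suite list following the naming scheme used above:
# "practical", then "practical2" .. "practical21".
suites = ["practical"] + [f"practical{i}" for i in range(2, 22)]

cmd = ["gptme-eval", *suites, "--model", "anthropic/claude-sonnet-4-6"]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually run the evals
```

This also makes it easy to stay current when new practical suites are added: bump the upper bound of the range.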

Raw Results#

Full per-test results from all eval runs are stored as CSV files in eval_results/ subdirectories. Results are published to the eval-results branch of the repository.

To view raw results locally:

# View latest results
cat eval_results/*/eval_results.csv | head -50

# Export leaderboard as CSV for analysis
gptme eval --leaderboard --leaderboard-format csv

# Export as JSON for programmatic use
gptme eval --leaderboard --leaderboard-format json

Other evals#

We have considered running gptme on external benchmarks such as SWE-Bench, but the integration is not yet complete (see PR #142).

If you are interested in running gptme on other evals, drop a comment in the issues!