Evals
gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?
To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs across many kinds of tasks.
The suite covers fundamental tool use, web browsing, project initialization, and a growing set of practical programming tasks that reflect real-world agentic work: building APIs, refactoring code, parsing data formats, writing tests, and more.
Recommended Model
The recommended model is Claude Sonnet 4.6 (anthropic/claude-sonnet-4-6 or openrouter/anthropic/claude-sonnet-4-6) for its:

- Strong agentic capabilities
- Strong coding capabilities
- Strong performance across all tool types and formats
- Reasoning capabilities
- Vision & computer use capabilities
Decent alternatives include:

- Gemini 3 Pro (openrouter/google/gemini-3-pro-preview, gemini/gemini-3-pro-preview)
- GPT-5, GPT-4o (openai/gpt-5, openai/gpt-4o)
- Grok 4 (xai/grok-4, openrouter/x-ai/grok-4)
- Qwen3 Coder 480B A35B (openrouter/qwen/qwen3-coder)
- Kimi K2 (openrouter/moonshotai/kimi-k2-thinking, openrouter/moonshotai/kimi-k2)
- MiniMax M2 (openrouter/minimax/minimax-m2)
- Llama 3.1 405B (openrouter/meta-llama/llama-3.1-405b-instruct)
- DeepSeek V3 (deepseek/deepseek-chat)
- DeepSeek R1 (deepseek/deepseek-reasoner)
Note that some models may perform better or worse with different --tool-format options (markdown, xml, or tool for native tool-calling).
Note that many providers on OpenRouter have poor performance and reliability, so test your chosen model/provider combination before committing to it. This is especially true for open-weight models, which any provider can host at any quality. You can pin a specific provider by appending :provider, e.g. openrouter/qwen/qwen3-coder:alibaba/opensource.
Note that pricing varies widely between models once caching is accounted for, making some providers much cheaper than others. Anthropic's prompt caching is known and tested to work well, significantly reducing costs for conversations with many turns.
You can get an overview of actual model usage in the wild from the OpenRouter app analytics for gptme.
Model Leaderboard
The table below shows pass rates across our eval suites for each model (best tool format per model). Models are ranked by overall pass rate, with breakdowns by suite type.
$ python -m gptme.eval.leaderboard --results-dir eval_results --format rst --min-tests 4
============================ ========== =============== ========== ==========
Model                        Format     Overall         Basic      Practical
============================ ========== =============== ========== ==========
Claude Sonnet 4.6            tool       59/60 (98%)     18/18      40/41
Claude 3.5 Sonnet (Jun 2024) default    7/7 (100%)      4/4        -
Gemini 1.5 Pro (OR)          default    7/7 (100%)      4/4        -
GPT-4 Turbo                  default    7/7 (100%)      4/4        -
Hermes 3 405B                default    7/7 (100%)      4/4        -
Claude 3.5 Haiku             default    6/6 (100%)      4/4        -
GPT-4o                       tool       5/5 (100%)      4/4        -
GPT-4o Mini                  tool       5/5 (100%)      4/4        -
Llama 3.1 405B               markdown   5/5 (100%)      4/4        -
Claude 3.5 Sonnet (Oct 2024) default    5/5 (100%)      4/4        -
Claude Sonnet 4              markdown   5/5 (100%)      4/4        -
Claude Haiku 4.5             tool       17/19 (89%)     17/18      -
Kimi K2                      xml        4/4 (100%)      4/4        -
Gemini 2.5 Flash             markdown   4/4 (100%)      4/4        -
Kimi K2 (OR)                 markdown   4/4 (100%)      4/4        -
Magistral Medium             markdown   4/4 (100%)      4/4        -
Qwen3 Max                    markdown   4/4 (100%)      4/4        -
Grok 4 Fast                  markdown   4/4 (100%)      4/4        -
Claude Sonnet 4.5            tool       16/18 (89%)     16/18      -
GPT-4o Mini (OR)             tool       15/18 (83%)     15/18      -
Claude 3 Haiku               default    6/7 (86%)       3/4        -
o1-preview                   default    6/7 (86%)       3/4        -
GPT-5                        markdown   4/5 (80%)       4/4        -
GPT-5 Mini                   xml        4/5 (80%)       3/4        -
Llama 3.1 70B                default    5/7 (71%)       2/4        -
Gemini 1.5 Flash (OR)        default    5/7 (71%)       3/4        -
Gemma 2 27B                  default    5/7 (71%)       4/4        -
Llama 3.2 90B                default    3/4 (75%)       3/4        -
Llama 3.2 11B                default    3/4 (75%)       3/4        -
Grok Code Fast               markdown   3/4 (75%)       3/4        -
Hermes 4 70B                 xml        3/4 (75%)       3/4        -
o1-mini                      default    4/7 (57%)       3/4        -
Llama 3.1 8B                 default    3/5 (60%)       2/4        -
Qwen3 32B                    markdown   3/5 (60%)       2/4        -
Gemma 2 9B                   default    3/7 (43%)       2/4        -
Hermes 2 Pro 8B              default    1/4 (25%)       1/4        -
Claude Opus 4.1              markdown   1/5 (20%)       1/4        -
Gemini 1.5 Flash             default    0/5 (0%)        0/4        -
Hermes 3 70B                 default    0/4 (0%)        0/4        -
============================ ========== =============== ========== ==========
Notes:

- Format shows the best-performing --tool-format for each model.
- Basic tests cover fundamental tool use (file I/O, shell, git, Python).
- Practical tests cover real-world programming tasks (APIs, data processing, refactoring).
- Models with fewer than 4 tests are excluded.
- Results use a 300-second timeout per test; some models may perform better with longer timeouts.
To generate this table locally:
gptme eval --leaderboard --leaderboard-format rst
gptme eval --leaderboard --leaderboard-format csv # for data analysis
gptme eval --leaderboard --leaderboard-format markdown # for GitHub/blog
gptme eval --leaderboard --leaderboard-format html # self-contained HTML page
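The ranking logic behind the table is simple: filter out models with too few runs, then sort by overall pass rate. Here is a minimal sketch with made-up data (not the actual implementation, which lives in gptme.eval.leaderboard):

```python
# Sketch of the leaderboard ranking logic with hypothetical sample data.
# model -> (tests passed, tests run)
results = {
    "Claude Sonnet 4.6": (59, 60),
    "Claude Haiku 4.5": (17, 19),
    "GPT-4o Mini (OR)": (15, 18),
}

MIN_TESTS = 4  # models with fewer test runs are excluded


def pass_rate(passed: int, total: int) -> float:
    return passed / total if total else 0.0


# Rank models by overall pass rate, best first
leaderboard = sorted(
    ((name, p, t) for name, (p, t) in results.items() if t >= MIN_TESTS),
    key=lambda row: pass_rate(row[1], row[2]),
    reverse=True,
)

for name, p, t in leaderboard:
    print(f"{name:<28} {p}/{t} ({pass_rate(p, t):.0%})")
```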
Usage
You can run the simple hello eval like this:
gptme-eval hello --model anthropic/claude-sonnet-4-6
However, we recommend running it in Docker to improve isolation and reproducibility:
make build-docker
docker run \
-e "ANTHROPIC_API_KEY=<your api key>" \
-v $(pwd)/eval_results:/app/eval_results \
gptme-eval hello --model anthropic/claude-sonnet-4-6
Available Eval Suites
The evaluation suite is organized into named suites that can be run individually or together:
- basic
Fundamental tool use: reading and writing files, patching code, running Python in IPython, executing shell commands, using git, counting words, transforming JSON, multi-file refactoring, writing tests, generating CLI programs, and fixing bugs. (~18 tests)
- browser
Web browsing and data extraction using the browser tool.
- init_projects
Project initialization: init-git, init-react, init-rust. Tests the ability to scaffold new projects from scratch.
- practical, practical2, …, practical21
A growing series of real-world programming tasks that go beyond basic file I/O. The practical suites now cover 62 tasks across data processing, refactoring, algorithms, async/concurrency, SQL, validation, graph search, dynamic programming, and tree data structures.
Early suites give a good feel for the format:
========== ============================================================== ============================================
Suite      Description                                                    Tests
========== ============================================================== ============================================
practical  Web APIs, log parsing, error handling                          build-api, parse-log, add-error-handling
practical2 Data filtering, templating, CSV validation                     sort-and-filter, template-fill, validate-csv
practical3 Unit test writing, SQLite persistence                          write-tests-calculator, sqlite-store
practical4 Data aggregation, schedule overlap detection, topological sort group-by, schedule-overlaps, topo-sort
practical5 Code refactoring, data pipelines, regex scrubbing              rename-function, data-pipeline, regex-scrub
practical6 CSV analysis, word frequency counting, config merging          csv-analysis, word-frequency, merge-configs
practical7 INI-to-JSON conversion, JSON diff, changelog generation        ini-to-json, json-diff, changelog-gen
========== ============================================================== ============================================
Later suites extend coverage with semver sorting, Roman numerals, matrix and bracket tasks, async pipelines and worker queues, SQL analytics, tries, LRU caches, interval merging, min-stack, knight moves, histogram area, edit distance, BST operations, coin change, Dijkstra, spiral matrix, number of islands, Kadane’s algorithm, 0/1 knapsack, and flood fill.
For the current authoritative suite list, run gptme-eval --list.
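To give a concrete sense of the task scope, here is a sketch of what a reference solution for one of the Kadane's-algorithm tasks mentioned above might look like (illustrative only; function name and exact requirements are assumptions, since the actual prompts and checks are defined in the suite files):

```python
# Illustrative solution for a Kadane's-algorithm-style task:
# find the maximum sum of any non-empty contiguous subarray.
# The function name and signature are hypothetical, not from the eval spec.
def max_subarray_sum(nums: list[int]) -> int:
    """Return the maximum sum over all non-empty contiguous subarrays."""
    best = current = nums[0]
    for x in nums[1:]:
        # Either extend the current run or start a new one at x
        current = max(x, current + x)
        best = max(best, current)
    return best


print(max_subarray_sum([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # -> 6 (subarray [4, -1, 2, 1])
```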
Run specific tests or suites by name:
gptme-eval build-api --model anthropic/claude-sonnet-4-6
gptme-eval sort-and-filter rename-function --model anthropic/claude-sonnet-4-6
Run all practical suites at once (useful for benchmarking):
gptme-eval practical practical2 practical3 practical4 practical5 practical6 practical7 \
practical8 practical9 practical10 practical11 practical12 practical13 \
practical14 practical15 practical16 practical17 practical18 practical19 \
practical20 practical21 \
--model anthropic/claude-sonnet-4-6
Raw Results
Full per-test results from all eval runs are stored as CSV files in eval_results/ subdirectories.
Results are published to the eval-results branch of the repository.
To view raw results locally:
# View latest results
cat eval_results/*/eval_results.csv | head -50
# Export leaderboard as CSV for analysis
gptme eval --leaderboard --leaderboard-format csv
# Export as JSON for programmatic use
gptme eval --leaderboard --leaderboard-format json
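If you prefer to aggregate the raw CSVs yourself, a Python sketch along these lines can work. Note the column names ("Model", "Passed") are assumptions for illustration; check the header of your actual eval_results.csv files, which may differ:

```python
# Sketch: aggregate per-model pass counts from eval result CSVs.
# Column names here ("Model", "Passed") are assumed; verify against
# the real CSV header before using.
import csv
import io
from collections import defaultdict

# Stand-in for open("eval_results/<run>/eval_results.csv")
sample = io.StringIO(
    "Model,Test,Passed\n"
    "claude-sonnet-4-6,hello,True\n"
    "claude-sonnet-4-6,build-api,True\n"
    "gpt-4o,hello,False\n"
)

# model -> [passed, total]
totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
for row in csv.DictReader(sample):
    totals[row["Model"]][1] += 1
    totals[row["Model"]][0] += row["Passed"] == "True"

for model, (passed, total) in totals.items():
    print(f"{model}: {passed}/{total}")
```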
Other evals
We have considered running gptme on other evals such as SWE-Bench, but have not yet completed that work (see PR #142).
If you are interested in running gptme on other evals, drop a comment in the issues!