Evals#
gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?
To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.
Note
The evaluation suite is still tiny and under development, but the eval harness is fully functional.
Recommended Model#
The recommended model is Claude Sonnet 4.5 (anthropic/claude-sonnet-4-5 and openrouter/anthropic/claude-sonnet-4-5) for its:
Strong agentic capabilities
Strong coding capabilities
Strong performance across all tool types and formats
Reasoning capabilities
Vision & computer use capabilities
Decent alternatives include:
Gemini 3 Pro (openrouter/google/gemini-3-pro-preview, gemini/gemini-3-pro-preview)
GPT-5, GPT-4o (openai/gpt-5, openai/gpt-4o)
Grok 4 (xai/grok-4, openrouter/x-ai/grok-4)
Qwen3 Coder 480B A35B (openrouter/qwen/qwen3-coder)
Kimi K2 (openrouter/moonshotai/kimi-k2-thinking, openrouter/moonshotai/kimi-k2)
MiniMax M2 (openrouter/minimax/minimax-m2)
Llama 3.1 405B (openrouter/meta-llama/llama-3.1-405b-instruct)
DeepSeek V3 (deepseek/deepseek-chat)
DeepSeek R1 (deepseek/deepseek-reasoner)
Note that some models may perform better or worse with different --tool-format options (markdown, xml, or tool for native tool-calling).
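One way to compare tool formats for a given model is to run the same eval once per format. The sketch below only builds the command lines and does not execute anything. It assumes that gptme-eval accepts a --tool-format flag with the three values named above; the helper name build_eval_commands is made up for this illustration.

```python
import shlex
from itertools import product

# Tool formats named in the note above.
TOOL_FORMATS = ["markdown", "xml", "tool"]


def build_eval_commands(models, evals=("hello",), tool_formats=TOOL_FORMATS):
    """Return one gptme-eval command line per (model, tool format) pair.

    Assumption: gptme-eval accepts --tool-format; check `gptme-eval --help`.
    """
    commands = []
    for model, fmt in product(models, tool_formats):
        argv = ["gptme-eval", *evals, "--model", model, "--tool-format", fmt]
        # shlex.join quotes arguments so each line is safe to paste into a shell.
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for cmd in build_eval_commands(["anthropic/claude-sonnet-4-5"]):
        print(cmd)
```

Printing the commands first makes it easy to review them, or to paste them into the Docker invocation, before spending API credits.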
Note that many providers on OpenRouter have poor performance and reliability, so be sure to test your chosen model/provider combination before committing to it. This is especially true for open-weight models, which any provider can host at any quality. You can choose a specific provider by appending :provider to the model name, e.g. openrouter/qwen/qwen3-coder:alibaba/opensource.
Note that pricing for models varies widely when accounting for caching, making some providers much cheaper than others. Anthropic is known and tested to cache well, significantly reducing costs for conversations with many turns.
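To see why caching matters most in long conversations, consider a rough cost model in which every turn resends the full history. The sketch below is an illustration under simplifying assumptions: every turn adds the same number of tokens, only input tokens are counted, and cache-write surcharges and cache expiry are ignored. The prices in the example are hypothetical placeholders, not any provider's actual rates.

```python
def conversation_input_cost(turns, tokens_per_turn, price_per_mtok,
                            cached_price_per_mtok=None):
    """Estimate the input cost (in dollars) of a multi-turn conversation.

    On turn k the prompt holds k * tokens_per_turn tokens. Without caching,
    all of them are billed at price_per_mtok. With caching, the
    (k - 1) * tokens_per_turn tokens seen on earlier turns are billed at
    cached_price_per_mtok instead.
    """
    total = 0.0
    for k in range(1, turns + 1):
        old_tokens = (k - 1) * tokens_per_turn  # history already sent before
        new_tokens = tokens_per_turn            # tokens added this turn
        if cached_price_per_mtok is None:
            total += (old_tokens + new_tokens) * price_per_mtok
        else:
            total += old_tokens * cached_price_per_mtok + new_tokens * price_per_mtok
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical prices: $3 per million input tokens, $0.30 for cached reads.
    uncached = conversation_input_cost(20, 1000, 3.0)
    cached = conversation_input_cost(20, 1000, 3.0, cached_price_per_mtok=0.3)
    print(f"uncached: ${uncached:.3f}, cached: ${cached:.3f}")
```

Under these assumptions the uncached cost grows quadratically with the number of turns, while the cached cost is dominated by the much cheaper cached reads, so the gap widens as conversations get longer.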
You can get an overview of actual model usage in the wild from the OpenRouter app analytics for gptme.
Usage#
You can run the simple hello eval like this:
gptme-eval hello --model anthropic/claude-sonnet-4-5
However, we recommend running it in Docker to improve isolation and reproducibility:
make build-docker
docker run \
    -e "ANTHROPIC_API_KEY=<your api key>" \
    -v "$(pwd)/eval_results:/app/eval_results" \
    gptme-eval hello --model anthropic/claude-sonnet-4-5
Available Evals#
The current evaluations test basic tool use in gptme: reading, writing, and patching files; running code in IPython and commands in the shell; using git; and creating new projects with npm and cargo. There are also basic tests for web browsing and data extraction.
Results#
Here are the results of the evals we have run so far:
$ gptme-eval eval_results/*/eval_results.csv
2026-02-22 12:04:33,637 - INFO - gptme.config - Using user configuration from ~/.config/gptme/config.toml
Model Format hello hello-patch hello-ask prime100 init-git init-rust whois-superuserlabs-ceo
---------------------------------------- -------- ---------------- ---------------- ---------------- ---------------- ---------------- ------------- -------------------------
anthropic/claude-3-5-haiku-20241022 markdown ✅ 66/69 456tk ✅ 68/69 377tk ✅ 68/69 458tk ✅ 67/69 885tk ✅ 65/68 905tk ❓ N/A ✅ 1/1 2619tk
anthropic/claude-3-5-haiku-20241022 tool 🔶 67/126 287tk 🔶 68/126 330tk 🔶 68/126 433tk ❌ 15/126 595tk 🔶 73/126 643tk ❓ N/A ❓ N/A
anthropic/claude-3-5-haiku-20241022 xml 🔶 67/114 298tk 🔶 68/114 329tk 🔶 46/114 435tk ❌ 11/114 715tk 🔶 62/114 638tk ❓ N/A ❓ N/A
anthropic/claude-3-5-sonnet-20240620 markdown ✅ 34/34 630tk ✅ 34/34 542tk ✅ 34/34 710tk ✅ 34/34 924tk ✅ 34/34 1098tk ✅ 4/4 1504tk 🔶 3/4 1306tk
anthropic/claude-3-5-sonnet-20241022 markdown 🔶 146/294 286tk 🔶 186/294 302tk 🔶 190/294 346tk 🔶 191/294 567tk 🔶 191/294 536tk ❓ N/A ❓ N/A
anthropic/claude-3-5-sonnet-20241022 tool 🔶 80/238 230tk 🔶 59/238 288tk 🔶 80/238 324tk 🔶 77/238 420tk 🔶 70/238 446tk ❓ N/A ❓ N/A
anthropic/claude-3-5-sonnet-20241022 xml ❌ 10/226 251tk 🔶 80/226 271tk 🔶 76/226 308tk ❌ 12/226 447tk ❌ 41/226 466tk ❓ N/A ❓ N/A
anthropic/claude-3-haiku-20240307 markdown ✅ 34/34 388tk ✅ 34/34 375tk ✅ 34/34 432tk ❌ 6/34 781tk 🔶 24/34 903tk 🔶 3/4 670tk ✅ 4/4 1535tk
anthropic/claude-haiku-4-5 tool 🔶 91/116 350tk 🔶 91/116 456tk 🔶 91/116 524tk 🔶 39/116 1075tk 🔶 72/116 904tk ❓ N/A ❓ N/A
anthropic/claude-haiku-4-5 xml ✅ 111/116 386tk ✅ 110/116 470tk ✅ 112/116 557tk ✅ 112/116 803tk 🔶 73/116 944tk ❓ N/A ❓ N/A
anthropic/claude-opus-4-1-20250805 markdown 🔶 1/2 393tk ❌ 0/1 197tk ❌ 0/1 204tk ❌ 0/1 190tk ❌ 0/1 186tk ❓ N/A ❓ N/A
anthropic/claude-opus-4-1-20250805 tool ❌ 0/1 181tk ❌ 0/1 196tk ❌ 0/1 202tk ❌ 0/1 188tk ❌ 0/1 184tk ❓ N/A ❓ N/A
anthropic/claude-opus-4-1-20250805 xml 🔶 1/2 372tk ❌ 0/1 197tk ❌ 0/1 205tk ❌ 0/1 191tk ❌ 0/1 187tk ❓ N/A ❓ N/A
anthropic/claude-sonnet-4-20250514 markdown ✅ 1/1 439tk ✅ 1/1 480tk ✅ 1/1 532tk ✅ 1/1 1180tk ✅ 1/1 2006tk ❓ N/A ❓ N/A
anthropic/claude-sonnet-4-20250514 tool ✅ 1/1 283tk ✅ 1/1 320tk ✅ 1/1 471tk ✅ 1/1 1133tk ✅ 1/1 1070tk ❓ N/A ❓ N/A
anthropic/claude-sonnet-4-20250514 xml ✅ 1/1 404tk ✅ 1/1 495tk ✅ 1/1 584tk ✅ 1/1 1223tk ❌ 0/1 1422tk ❓ N/A ❓ N/A
anthropic/claude-sonnet-4-6 markdown ✅ 4/4 294tk ✅ 4/4 373tk ✅ 4/4 465tk ✅ 4/4 430tk ✅ 4/4 650tk ❓ N/A ❓ N/A
anthropic/claude-sonnet-4-6 tool ❌ 0/4 256tk ❌ 0/4 365tk ❌ 0/4 374tk ❌ 0/4 338tk ❌ 0/4 301tk ❓ N/A ❓ N/A
anthropic/claude-sonnet-4-6 xml ✅ 4/4 306tk ✅ 4/4 390tk ✅ 4/4 417tk ✅ 4/4 449tk 🔶 3/4 831tk ❓ N/A ❓ N/A
deepseek/deepseek-chat markdown ✅ 1/1 429tk ❓ N/A ❓ N/A ❓ N/A ❓ N/A ❓ N/A ❓ N/A
deepseek/deepseek-reasoner markdown ✅ 1/1 742tk ❓ N/A ❓ N/A ❓ N/A ❓ N/A ❓ N/A ❓ N/A
deepseek/deepseek-reasoner xml ✅ 1/1 680tk ❓ N/A ❓ N/A ❓ N/A ❓ N/A ❓ N/A ❓ N/A
gemini/gemini-1.5-flash-latest markdown ❌ 0/54 28tk ❌ 0/54 28tk ❌ 0/54 28tk ❌ 0/54 28tk ❌ 0/54 28tk ❓ N/A ❓ N/A
gemini/gemini-1.5-flash-latest tool ❌ 0/12 29tk ❌ 0/12 29tk ❌ 0/12 29tk ❌ 0/12 29tk ❌ 0/12 29tk ❓ N/A ❓ N/A
gemini/gemini-2.5-flash markdown ✅ 1/1 419tk ✅ 1/1 450tk ✅ 1/1 516tk ✅ 1/1 735tk ❓ N/A ❓ N/A ❓ N/A
gemini/gemini-2.5-flash xml ✅ 1/1 432tk ✅ 1/1 468tk ✅ 1/1 534tk ✅ 1/1 836tk ❓ N/A ❓ N/A ❓ N/A
groq/moonshotai/kimi-k2-instruct markdown ✅ 1/1 407tk ✅ 1/1 497tk ✅ 1/1 575tk ✅ 1/1 1079tk ❓ N/A ❓ N/A ❓ N/A
groq/moonshotai/kimi-k2-instruct xml ✅ 1/1 424tk ✅ 1/1 498tk ✅ 1/1 587tk ✅ 1/1 1150tk ❓ N/A ❓ N/A ❓ N/A
groq/qwen/qwen3-32b markdown ✅ 1/1 709tk ✅ 1/1 711tk ❌ 0/1 0tk ❌ 0/1 0tk ✅ 1/1 2609tk ❓ N/A ❓ N/A
groq/qwen/qwen3-32b tool ❌ 0/1 172tk ❌ 0/1 187tk ❌ 0/1 194tk ❌ 0/1 179tk ❌ 0/1 175tk ❓ N/A ❓ N/A
groq/qwen/qwen3-32b xml ✅ 1/1 1918tk ✅ 1/1 725tk ✅ 1/1 1115tk ❌ 0/1 5171tk ❌ 0/1 4367tk ❓ N/A ❓ N/A
openai/gpt-4-turbo markdown ✅ 3/3 255tk ✅ 3/3 312tk ✅ 3/3 376tk ✅ 3/3 527tk ✅ 4/4 590tk ✅ 4/4 784tk ✅ 6/7 819tk
openai/gpt-4o-mini markdown 🔶 61/104 263tk ✅ 103/104 319tk ✅ 103/104 375tk 🔶 65/104 502tk ✅ 92/104 766tk ✅ 6/6 813tk ✅ 6/6 951tk
openai/gpt-4o-mini tool ✅ 233/242 289tk ✅ 196/242 292tk ✅ 242/242 420tk 🔶 171/242 621tk ✅ 242/242 680tk ❓ N/A ❓ N/A
openai/gpt-4o markdown 🔶 167/333 299tk 🔶 180/333 356tk ✅ 287/333 382tk 🔶 69/333 512tk 🔶 177/333 654tk ✅ 5/5 663tk ✅ 5/5 1253tk
openai/gpt-4o tool ❌ 43/242 292tk 🔶 158/242 454tk 🔶 138/242 461tk 🔶 78/242 691tk ✅ 237/242 834tk ❓ N/A ❓ N/A
openai/gpt-4o xml ❌ 12/230 260tk 🔶 52/230 307tk ❌ 24/230 283tk ❌ 1/230 360tk ❌ 2/230 303tk ❓ N/A ❓ N/A
openai/gpt-5-mini markdown ✅ 1/1 379tk ✅ 1/1 358tk ✅ 1/1 884tk ❌ 0/1 0tk ❌ 0/1 715tk ❓ N/A ❓ N/A
openai/gpt-5-mini tool ✅ 1/1 286tk ✅ 1/1 294tk ✅ 1/1 1150tk ❌ 0/1 0tk ❌ 0/1 0tk ❓ N/A ❓ N/A
openai/gpt-5-mini xml ❌ 0/1 331tk ✅ 1/1 386tk ✅ 1/1 922tk ✅ 1/1 1369tk ✅ 1/1 1614tk ❓ N/A ❓ N/A
openai/gpt-5 markdown ✅ 1/1 228tk ✅ 1/1 273tk ✅ 1/1 367tk ✅ 1/1 695tk ❌ 0/1 0tk ❓ N/A ❓ N/A
openai/gpt-5 tool ✅ 1/1 306tk ✅ 1/1 299tk ✅ 1/1 348tk ❌ 0/1 0tk ❌ 0/1 0tk ❓ N/A ❓ N/A
openai/gpt-5 xml ✅ 1/1 237tk ❌ 0/1 218tk ❌ 0/1 217tk ❌ 0/1 678tk ❌ 0/1 255tk ❓ N/A ❓ N/A
openai/o1-mini markdown ✅ 3/3 354tk ✅ 3/3 431tk ✅ 3/3 460tk 🔶 2/3 567tk 🔶 3/5 2222tk 🔶 1/5 1412tk 🔶 2/3 813tk
openai/o1-preview markdown ✅ 2/2 308tk ✅ 2/2 570tk 🔶 1/2 549tk ✅ 2/2 490tk ✅ 3/3 823tk ✅ 1/1 656tk ✅ 1/1 1998tk
google/gemini-flash-1.5 markdown ✅ 2/2 225tk ✅ 2/2 401tk ✅ 2/2 430tk ❌ 0/2 296tk ✅ 1/1 686tk ❌ 0/1 661tk ✅ 1/1 1014tk
google/gemini-pro-1.5 markdown ✅ 1/1 341tk ✅ 1/1 419tk ✅ 1/1 456tk ✅ 1/1 676tk 🔶 2/3 431tk 🔶 1/2 1016tk ✅ 2/2 1308tk
google/gemma-2-27b-it markdown ✅ 1/1 288tk ✅ 1/1 384tk ✅ 1/1 446tk ✅ 1/1 714tk ✅ 1/1 570tk ❌ 0/1 535tk ❌ 0/1 235tk
google/gemma-2-9b-it markdown ❌ 0/2 186tk ✅ 2/2 370tk ✅ 2/2 368tk ❌ 0/2 545tk ✅ 1/1 492tk ❌ 0/1 1730tk ❌ 0/1 352tk
meta-llama/llama-3.1-405b-instruct markdown 🔶 80/103 253tk ✅ 84/103 400tk ✅ 83/103 364tk 🔶 79/103 449tk 🔶 71/103 501tk 🔶 2/5 255tk ❌ 0/5 85tk
meta-llama/llama-3.1-405b-instruct tool ❌ 0/12 184tk ❌ 0/12 198tk ❌ 0/12 214tk ❌ 0/12 190tk ❌ 0/12 189tk ❓ N/A ❓ N/A
meta-llama/llama-3.1-70b-instruct markdown ✅ 5/6 367tk ✅ 5/6 424tk ✅ 6/6 452tk 🔶 2/6 546tk ✅ 5/6 813tk 🔶 3/4 682tk 🔶 2/3 1461tk
meta-llama/llama-3.1-70b-instruct xml 🔶 182/230 273tk ❌ 9/230 355tk 🔶 56/230 412tk ❌ 21/230 373tk 🔶 136/230 474tk ❓ N/A ❓ N/A
meta-llama/llama-3.1-8b-instruct markdown ✅ 1/1 277tk ✅ 1/1 441tk ❌ 0/1 400tk ❌ 0/1 5095tk ✅ 1/1 2266tk ❓ N/A ❓ N/A
meta-llama/llama-3.2-11b-vision-instruct markdown ✅ 2/2 352tk ✅ 2/2 493tk ❌ 0/2 479tk ✅ 2/2 2643tk ❓ N/A ❓ N/A ❓ N/A
meta-llama/llama-3.2-90b-vision-instruct markdown 🔶 2/4 237tk 🔶 2/4 288tk 🔶 3/4 336tk 🔶 1/4 233tk ❓ N/A ❓ N/A ❓ N/A
mistralai/magistral-medium-2506 markdown ✅ 1/1 531tk ✅ 1/1 569tk ✅ 1/1 666tk ✅ 1/1 1106tk ❓ N/A ❓ N/A ❓ N/A
mistralai/magistral-medium-2506 tool ❌ 0/1 465tk ❌ 0/1 604tk ❌ 0/1 0tk ❌ 0/1 0tk ❓ N/A ❓ N/A ❓ N/A
mistralai/magistral-medium-2506 xml ✅ 1/1 516tk ✅ 1/1 568tk ❌ 0/1 552tk ✅ 1/1 1075tk ❓ N/A ❓ N/A ❓ N/A
moonshotai/kimi-k2-0905 markdown ✅ 2/2 464tk ✅ 2/2 590tk ✅ 2/2 613tk 🔶 1/2 650tk ❓ N/A ❓ N/A ❓ N/A
moonshotai/kimi-k2-0905 tool ❌ 0/1 397tk ❌ 0/1 483tk ❌ 0/1 592tk ✅ 1/1 990tk ❓ N/A ❓ N/A ❓ N/A
moonshotai/kimi-k2-0905 xml ✅ 1/1 441tk ✅ 1/1 563tk ✅ 1/1 848tk ❌ 0/1 598tk ❓ N/A ❓ N/A ❓ N/A
nousresearch/hermes-2-pro-llama-3-8b markdown ✅ 1/1 341tk ❌ 0/1 4274tk ❌ 0/1 3760tk ❌ 0/1 659tk ❓ N/A ❓ N/A ❓ N/A
nousresearch/hermes-3-llama-3.1-405b markdown ✅ 2/2 317tk ✅ 2/2 420tk ✅ 2/2 325tk ✅ 2/2 410tk ✅ 1/1 821tk ✅ 1/1 758tk ✅ 1/1 1039tk
nousresearch/hermes-3-llama-3.1-70b markdown ❌ 0/2 173tk ❌ 0/2 187tk ❌ 0/2 202tk ❌ 0/2 177tk ❓ N/A ❓ N/A ❓ N/A
nousresearch/hermes-4-70b markdown ❌ 0/1 439tk ✅ 1/1 612tk ✅ 1/1 1235tk ❌ 0/1 0tk ❓ N/A ❓ N/A ❓ N/A
nousresearch/hermes-4-70b tool ✅ 1/1 476tk ✅ 1/1 536tk ❌ 0/1 1310tk ❌ 0/1 0tk ❓ N/A ❓ N/A ❓ N/A
nousresearch/hermes-4-70b xml ✅ 1/1 466tk ✅ 1/1 516tk ✅ 1/1 631tk ❌ 0/1 1255tk ❓ N/A ❓ N/A ❓ N/A
qwen/qwen3-max markdown ✅ 1/1 422tk ✅ 1/1 492tk 🔶 1/2 615tk ✅ 1/1 949tk ❓ N/A ❓ N/A ❓ N/A
qwen/qwen3-max tool ✅ 1/1 436tk ✅ 1/1 519tk ✅ 1/1 546tk ✅ 1/1 663tk ❓ N/A ❓ N/A ❓ N/A
qwen/qwen3-max xml ✅ 1/1 437tk ✅ 1/1 530tk ✅ 1/1 580tk ✅ 1/1 901tk ❓ N/A ❓ N/A ❓ N/A
x-ai/grok-4-fast:free markdown ✅ 1/1 561tk 🔶 1/2 760tk ✅ 1/1 1326tk ✅ 1/1 1016tk ❓ N/A ❓ N/A ❓ N/A
x-ai/grok-code-fast-1 markdown ✅ 1/1 661tk ❌ 0/1 1385tk ✅ 1/1 829tk ✅ 1/1 955tk ❓ N/A ❓ N/A ❓ N/A
x-ai/grok-code-fast-1 tool ✅ 1/1 663tk ❌ 0/1 2590tk ❌ 0/1 1415tk ✅ 1/1 1807tk ❓ N/A ❓ N/A ❓ N/A
x-ai/grok-code-fast-1 xml ❌ 0/1 485tk ❌ 0/1 1652tk ✅ 1/1 759tk ✅ 1/1 1112tk ❓ N/A ❓ N/A ❓ N/A
We are working on making the evals more robust, informative, and challenging.
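Each cell in the results table above follows the pattern "status passed/total Ntk", or "❓ N/A" when an eval was not run. A minimal sketch for turning such a cell into numbers is shown below. It assumes that passed/total counts successful runs out of all runs and that the "tk" figure is a token count; the source does not define either, and the names Cell and parse_cell are invented for this example rather than taken from gptme.

```python
import re
from typing import NamedTuple, Optional

# Matches the numeric part of a cell, e.g. "66/69 456tk".
CELL_RE = re.compile(r"(?P<passed>\d+)/(?P<total>\d+)\s+(?P<tokens>\d+)tk")


class Cell(NamedTuple):
    passed: int
    total: int
    tokens: int  # assumed to be a token count

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def parse_cell(text: str) -> Optional[Cell]:
    """Parse a cell such as '✅ 66/69 456tk'; return None for '❓ N/A'."""
    match = CELL_RE.search(text)
    if match is None:
        return None
    return Cell(int(match["passed"]), int(match["total"]), int(match["tokens"]))


if __name__ == "__main__":
    cell = parse_cell("✅ 66/69 456tk")
    print(cell, f"{cell.pass_rate:.1%}")
```

Comparing pass rates rather than status icons is useful because the number of runs differs widely between rows: a 1/1 result and a 233/242 result both show a check mark but carry very different weight.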
Other evals#
We have considered running gptme on other evals, such as SWE-Bench, but this work is not finished (see PR #142).
If you are interested in running gptme on other evals, drop a comment in the issues!