Evals
gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?
To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.
Note
The evaluation suite is still tiny and under development, but the eval harness is fully functional.
Recommended Model
The recommended model is Claude Sonnet 4.5 (anthropic/claude-sonnet-4-5 and openrouter/anthropic/claude-sonnet-4-5) for its:
Strong agentic capabilities
Strong coding capabilities
Strong performance across all tool types and formats
Reasoning capabilities
Vision & computer use capabilities
Decent alternatives include:
Gemini 3 Pro (openrouter/google/gemini-3-pro-preview, gemini/gemini-3-pro-preview)
GPT-5, GPT-4o (openai/gpt-5, openai/gpt-4o)
Grok 4 (xai/grok-4, openrouter/x-ai/grok-4)
Qwen3 Coder 480B A35B (openrouter/qwen/qwen3-coder)
Kimi K2 (openrouter/moonshotai/kimi-k2-thinking, openrouter/moonshotai/kimi-k2)
MiniMax M2 (openrouter/minimax/minimax-m2)
Llama 3.1 405B (openrouter/meta-llama/llama-3.1-405b-instruct)
DeepSeek V3 (deepseek/deepseek-chat)
DeepSeek R1 (deepseek/deepseek-reasoner)
Note that some models may perform better or worse with different --tool-format options (markdown, xml, or tool for native tool-calling).
Note that many providers on OpenRouter have poor performance and reliability, so test your chosen model/provider combination before committing to it. This is especially true for open-weight models, which any provider can host at any quality. You can choose a specific provider by appending :provider to the model name, e.g. openrouter/qwen/qwen3-coder:alibaba/opensource.
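To make the :provider suffix concrete, here is a minimal Python sketch of how such a model string could be split into its parts. The ModelSpec type and the parse_model_spec function are hypothetical names introduced for illustration only; they are not part of gptme, and gptme's own parsing may differ.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    """Parts of a '<provider>/<model>[:<upstream>]' string (hypothetical type)."""

    provider: str            # first path segment, e.g. "openrouter"
    model: str               # remaining model path, e.g. "qwen/qwen3-coder"
    upstream: Optional[str]  # text after ":", or None if no suffix was given


def parse_model_spec(spec: str) -> ModelSpec:
    """Split a model string into provider, model, and optional upstream provider."""
    base, colon, upstream = spec.partition(":")
    provider, slash, model = base.partition("/")
    if not slash or not model:
        raise ValueError(f"expected '<provider>/<model>', got {spec!r}")
    return ModelSpec(provider, model, upstream if colon else None)


print(parse_model_spec("openrouter/qwen/qwen3-coder:alibaba/opensource"))
print(parse_model_spec("anthropic/claude-sonnet-4-5"))
```

A helper like this can be handy when scripting comparisons of the same model across several hosting providers.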
Note that pricing varies widely between providers once caching is taken into account, making some providers much cheaper than others. Anthropic is known and tested to cache well, which significantly reduces costs for conversations with many turns.
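The effect of caching on a long conversation can be estimated with some simple arithmetic. The sketch below assumes that the full history is re-sent on every turn, that cached history is billed at a reduced rate, and that newly added tokens pay a small premium to be written to the cache. The price of $3 per million input tokens and the 0.1 and 1.25 multipliers are illustrative assumptions, not quoted prices, and output tokens are ignored.

```python
def input_cost_usd(
    turns: int,
    tokens_per_turn: int,
    price_per_mtok: float,
    cached: bool,
    read_mult: float = 0.1,    # assumed discount for tokens read from cache
    write_mult: float = 1.25,  # assumed premium for tokens written to cache
) -> float:
    """Estimate total input cost when each turn re-sends the whole history."""
    total = 0.0
    for turn in range(1, turns + 1):
        history = (turn - 1) * tokens_per_turn  # tokens already seen
        new = tokens_per_turn                   # tokens added this turn
        if cached:
            billed = history * read_mult + new * write_mult
        else:
            billed = history + new
        total += billed * price_per_mtok / 1_000_000
    return total


uncached = input_cost_usd(20, 2000, 3.0, cached=False)
with_cache = input_cost_usd(20, 2000, 3.0, cached=True)
print(f"uncached: ${uncached:.3f}, cached: ${with_cache:.3f}")
```

Under these assumptions a 20-turn conversation costs about $1.26 in input tokens without caching and about $0.26 with it, and the gap widens as the number of turns grows.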
You can get an overview of actual model usage in the wild from the OpenRouter app analytics for gptme.
Usage
You can run the simple hello eval like this:
gptme-eval hello --model anthropic/claude-sonnet-4-5
However, we recommend running it in Docker to improve isolation and reproducibility:
make build-docker
docker run \
    -e "ANTHROPIC_API_KEY=<your api key>" \
    -v "$(pwd)/eval_results:/app/eval_results" \
    gptme-eval hello --model anthropic/claude-sonnet-4-5
Available Evals
The current evaluations test basic tool use in gptme, such as the ability to read, write, and patch files; run code in IPython and commands in the shell; use git; and create new projects with npm and cargo. The suite also includes basic tests for web browsing and data extraction.
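As a rough illustration of what a tool-use eval of this kind involves, the following Python sketch models a case as a prompt, a set of starting files, a command to run afterwards, and named checks on the outcome. All names here (EvalCase, RunResult, grade, and the "demo-patch" case) are hypothetical and chosen for this example; this is not gptme's actual eval definition format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class RunResult:
    """What a harness might collect after the model has finished (hypothetical)."""

    files: Dict[str, str]  # file name -> final contents
    stdout: str            # output of the case's run command


@dataclass
class EvalCase:
    """A single task plus the checks that decide whether it passed (hypothetical)."""

    name: str
    prompt: str
    files: Dict[str, str]
    run: str
    expect: Dict[str, Callable[[RunResult], bool]] = field(default_factory=dict)


def grade(case: EvalCase, result: RunResult) -> Dict[str, bool]:
    """Evaluate every named check against the collected result."""
    return {label: check(result) for label, check in case.expect.items()}


demo_patch = EvalCase(
    name="demo-patch",
    prompt="Patch hello.py so that it prints 'Hello, human!'",
    files={"hello.py": "print('Hello, world!')\n"},
    run="python hello.py",
    expect={
        "correct output": lambda r: r.stdout.strip() == "Hello, human!",
        "file was patched": lambda r: "human" in r.files.get("hello.py", ""),
    },
)

# Simulated outcome of a successful run, standing in for a real model session.
outcome = RunResult(files={"hello.py": "print('Hello, human!')\n"}, stdout="Hello, human!\n")
print(grade(demo_patch, outcome))
```

Keeping the checks as small named predicates makes it easy to see which expectation failed when a case does not pass.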
Results
Here are the results of the evals we have run so far:
$ gptme-eval eval_results/*/eval_results.csv
Model                                          hello             hello-patch       hello-ask         prime100          init-git          init-rust      whois-superuserlabs-ceo
---------------------------------------------  ----------------  ----------------  ----------------  ----------------  ----------------  -------------  -------------------------
anthropic/claude-3-5-haiku-20241022            ✅ 55/57 473tk    ✅ 56/57 386tk    ✅ 56/57 468tk    ✅ 56/57 912tk    ✅ 54/56 922tk    ❓ N/A         ✅ 1/1 2619tk
anthropic/claude-3-5-haiku-20241022@markdown   ✅ 11/12 373tk    ✅ 12/12 332tk    ✅ 12/12 407tk    ✅ 11/12 758tk    ✅ 11/12 830tk    ❓ N/A         ❓ N/A
anthropic/claude-3-5-haiku-20241022@tool       🔶 67/126 287tk   🔶 68/126 330tk   🔶 68/126 433tk   ❌ 15/126 595tk   🔶 73/126 643tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-haiku-20241022@xml        🔶 67/114 298tk   🔶 68/114 329tk   🔶 46/114 435tk   ❌ 11/114 715tk   🔶 62/114 638tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20240620           ✅ 34/34 630tk    ✅ 34/34 542tk    ✅ 34/34 710tk    ✅ 34/34 924tk    ✅ 34/34 1098tk   ✅ 4/4 1504tk  🔶 3/4 1306tk
anthropic/claude-3-5-sonnet-20241022           ✅ 54/56 378tk    ✅ 54/56 371tk    ✅ 54/56 416tk    ✅ 55/56 869tk    ✅ 55/56 754tk    ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20241022@markdown  🔶 92/224 269tk   🔶 132/224 291tk  🔶 136/224 337tk  🔶 136/224 514tk  🔶 136/224 503tk  ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20241022@tool      🔶 80/224 233tk   🔶 59/224 294tk   🔶 80/224 331tk   🔶 77/224 434tk   🔶 70/224 462tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20241022@xml       ❌ 10/212 256tk   🔶 80/212 275tk   🔶 76/212 315tk   ❌ 12/212 464tk   ❌ 41/212 485tk   ❓ N/A         ❓ N/A
anthropic/claude-3-haiku-20240307              ✅ 34/34 388tk    ✅ 34/34 375tk    ✅ 34/34 432tk    ❌ 6/34 781tk     🔶 24/34 903tk    🔶 3/4 670tk   ✅ 4/4 1535tk
anthropic/claude-haiku-4-5@tool                ✅ 91/98 358tk    ✅ 91/98 463tk    ✅ 91/98 538tk    🔶 39/98 1169tk   🔶 72/98 995tk    ❓ N/A         ❓ N/A
anthropic/claude-haiku-4-5@xml                 ✅ 93/98 383tk    ✅ 93/98 469tk    ✅ 94/98 555tk    ✅ 94/98 821tk    🔶 67/98 964tk    ❓ N/A         ❓ N/A
anthropic/claude-opus-4-1-20250805@markdown    🔶 1/2 393tk      ❌ 0/1 197tk      ❌ 0/1 204tk      ❌ 0/1 190tk      ❌ 0/1 186tk      ❓ N/A         ❓ N/A
anthropic/claude-opus-4-1-20250805@tool        ❌ 0/1 181tk      ❌ 0/1 196tk      ❌ 0/1 202tk      ❌ 0/1 188tk      ❌ 0/1 184tk      ❓ N/A         ❓ N/A
anthropic/claude-opus-4-1-20250805@xml         🔶 1/2 372tk      ❌ 0/1 197tk      ❌ 0/1 205tk      ❌ 0/1 191tk      ❌ 0/1 187tk      ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-20250514@markdown    ✅ 1/1 439tk      ✅ 1/1 480tk      ✅ 1/1 532tk      ✅ 1/1 1180tk     ✅ 1/1 2006tk     ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-20250514@tool        ✅ 1/1 283tk      ✅ 1/1 320tk      ✅ 1/1 471tk      ✅ 1/1 1133tk     ✅ 1/1 1070tk     ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-20250514@xml         ✅ 1/1 404tk      ✅ 1/1 495tk      ✅ 1/1 584tk      ✅ 1/1 1223tk     ❌ 0/1 1422tk     ❓ N/A         ❓ N/A
deepseek/deepseek-chat@markdown                ✅ 1/1 429tk      ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A         ❓ N/A
deepseek/deepseek-reasoner@markdown            ✅ 1/1 742tk      ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A         ❓ N/A
deepseek/deepseek-reasoner@xml                 ✅ 1/1 680tk      ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A         ❓ N/A
gemini/gemini-1.5-flash-latest                 ❌ 0/42 28tk      ❌ 0/42 28tk      ❌ 0/42 28tk      ❌ 0/42 28tk      ❌ 0/42 28tk      ❓ N/A         ❓ N/A
gemini/gemini-1.5-flash-latest@markdown        ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❓ N/A         ❓ N/A
gemini/gemini-1.5-flash-latest@tool            ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❓ N/A         ❓ N/A
gemini/gemini-2.5-flash@markdown               ✅ 1/1 419tk      ✅ 1/1 450tk      ✅ 1/1 516tk      ✅ 1/1 735tk      ❓ N/A            ❓ N/A         ❓ N/A
gemini/gemini-2.5-flash@xml                    ✅ 1/1 432tk      ✅ 1/1 468tk      ✅ 1/1 534tk      ✅ 1/1 836tk      ❓ N/A            ❓ N/A         ❓ N/A
groq/moonshotai/kimi-k2-instruct@markdown      ✅ 1/1 407tk      ✅ 1/1 497tk      ✅ 1/1 575tk      ✅ 1/1 1079tk     ❓ N/A            ❓ N/A         ❓ N/A
groq/moonshotai/kimi-k2-instruct@xml           ✅ 1/1 424tk      ✅ 1/1 498tk      ✅ 1/1 587tk      ✅ 1/1 1150tk     ❓ N/A            ❓ N/A         ❓ N/A
groq/qwen/qwen3-32b@markdown                   ✅ 1/1 709tk      ✅ 1/1 711tk      ❌ 0/1 0tk        ❌ 0/1 0tk        ✅ 1/1 2609tk     ❓ N/A         ❓ N/A
groq/qwen/qwen3-32b@tool                       ❌ 0/1 172tk      ❌ 0/1 187tk      ❌ 0/1 194tk      ❌ 0/1 179tk      ❌ 0/1 175tk      ❓ N/A         ❓ N/A
groq/qwen/qwen3-32b@xml                        ✅ 1/1 1918tk     ✅ 1/1 725tk      ✅ 1/1 1115tk     ❌ 0/1 5171tk     ❌ 0/1 4367tk     ❓ N/A         ❓ N/A
openai/gpt-4-turbo                             ✅ 3/3 255tk      ✅ 3/3 312tk      ✅ 3/3 376tk      ✅ 3/3 527tk      ✅ 4/4 590tk      ✅ 4/4 784tk   ✅ 6/7 819tk
openai/gpt-4o                                  ✅ 90/91 325tk    ✅ 90/91 313tk    🔶 65/91 337tk    🔶 35/91 441tk    🔶 67/91 626tk    ✅ 5/5 663tk   ✅ 5/5 1253tk
openai/gpt-4o-mini                             🔶 60/92 269tk    ✅ 91/92 321tk    ✅ 91/92 377tk    🔶 62/92 510tk    ✅ 80/92 759tk    ✅ 6/6 813tk   ✅ 6/6 951tk
openai/gpt-4o-mini@markdown                    ❌ 1/12 217tk     ✅ 12/12 302tk    ✅ 12/12 361tk    🔶 3/12 440tk     ✅ 12/12 823tk    ❓ N/A         ❓ N/A
openai/gpt-4o-mini@tool                        ✅ 215/224 288tk  ✅ 187/224 300tk  ✅ 224/224 419tk  🔶 162/224 623tk  ✅ 224/224 681tk  ❓ N/A         ❓ N/A
openai/gpt-4o@markdown                         🔶 76/224 289tk   🔶 88/224 369tk   ✅ 205/224 399tk  ❌ 30/224 536tk   🔶 93/224 650tk   ❓ N/A         ❓ N/A
openai/gpt-4o@tool                             ❌ 43/224 291tk   🔶 147/224 448tk  🔶 132/224 458tk  🔶 77/224 704tk   ✅ 220/224 833tk  ❓ N/A         ❓ N/A
openai/gpt-4o@xml                              ❌ 12/212 256tk   🔶 47/212 301tk   ❌ 17/212 277tk   ❌ 1/212 366tk    ❌ 2/212 302tk    ❓ N/A         ❓ N/A
openai/gpt-5-mini@markdown                     ✅ 1/1 379tk      ✅ 1/1 358tk      ✅ 1/1 884tk      ❌ 0/1 0tk        ❌ 0/1 715tk      ❓ N/A         ❓ N/A
openai/gpt-5-mini@tool                         ✅ 1/1 286tk      ✅ 1/1 294tk      ✅ 1/1 1150tk     ❌ 0/1 0tk        ❌ 0/1 0tk        ❓ N/A         ❓ N/A
openai/gpt-5-mini@xml                          ❌ 0/1 331tk      ✅ 1/1 386tk      ✅ 1/1 922tk      ✅ 1/1 1369tk     ✅ 1/1 1614tk     ❓ N/A         ❓ N/A
openai/gpt-5@markdown                          ✅ 1/1 228tk      ✅ 1/1 273tk      ✅ 1/1 367tk      ✅ 1/1 695tk      ❌ 0/1 0tk        ❓ N/A         ❓ N/A
openai/gpt-5@tool                              ✅ 1/1 306tk      ✅ 1/1 299tk      ✅ 1/1 348tk      ❌ 0/1 0tk        ❌ 0/1 0tk        ❓ N/A         ❓ N/A
openai/gpt-5@xml                               ✅ 1/1 237tk      ❌ 0/1 218tk      ❌ 0/1 217tk      ❌ 0/1 678tk      ❌ 0/1 255tk      ❓ N/A         ❓ N/A
openai/o1-mini                                 ✅ 3/3 354tk      ✅ 3/3 431tk      ✅ 3/3 460tk      🔶 2/3 567tk      🔶 3/5 2222tk     🔶 1/5 1412tk  🔶 2/3 813tk
openai/o1-preview                              ✅ 2/2 308tk      ✅ 2/2 570tk      🔶 1/2 549tk      ✅ 2/2 490tk      ✅ 3/3 823tk      ✅ 1/1 656tk   ✅ 1/1 1998tk
google/gemini-flash-1.5                        ✅ 2/2 225tk      ✅ 2/2 401tk      ✅ 2/2 430tk      ❌ 0/2 296tk      ✅ 1/1 686tk      ❌ 0/1 661tk   ✅ 1/1 1014tk
google/gemini-pro-1.5                          ✅ 1/1 341tk      ✅ 1/1 419tk      ✅ 1/1 456tk      ✅ 1/1 676tk      🔶 2/3 431tk      🔶 1/2 1016tk  ✅ 2/2 1308tk
google/gemma-2-27b-it                          ✅ 1/1 288tk      ✅ 1/1 384tk      ✅ 1/1 446tk      ✅ 1/1 714tk      ✅ 1/1 570tk      ❌ 0/1 535tk   ❌ 0/1 235tk
google/gemma-2-9b-it                           ❌ 0/2 186tk      ✅ 2/2 370tk      ✅ 2/2 368tk      ❌ 0/2 545tk      ✅ 1/1 492tk      ❌ 0/1 1730tk  ❌ 0/1 352tk
meta-llama/llama-3.1-405b-instruct             🔶 69/91 244tk    ✅ 74/91 417tk    ✅ 73/91 364tk    🔶 71/91 440tk    🔶 60/91 488tk    🔶 2/5 255tk   ❌ 0/5 85tk
meta-llama/llama-3.1-405b-instruct@markdown    ✅ 11/12 318tk    ✅ 10/12 269tk    ✅ 10/12 366tk    🔶 8/12 516tk     ✅ 11/12 600tk    ❓ N/A         ❓ N/A
meta-llama/llama-3.1-405b-instruct@tool        ❌ 0/12 184tk     ❌ 0/12 198tk     ❌ 0/12 214tk     ❌ 0/12 190tk     ❌ 0/12 189tk     ❓ N/A         ❓ N/A
meta-llama/llama-3.1-70b-instruct              ✅ 5/6 367tk      ✅ 5/6 424tk      ✅ 6/6 452tk      🔶 2/6 546tk      ✅ 5/6 813tk      🔶 3/4 682tk   🔶 2/3 1461tk
meta-llama/llama-3.1-70b-instruct@xml          🔶 164/212 271tk  ❌ 9/212 366tk    🔶 52/212 432tk   ❌ 20/212 396tk   🔶 126/212 483tk  ❓ N/A         ❓ N/A
meta-llama/llama-3.1-8b-instruct               ✅ 1/1 277tk      ✅ 1/1 441tk      ❌ 0/1 400tk      ❌ 0/1 5095tk     ✅ 1/1 2266tk     ❓ N/A         ❓ N/A
meta-llama/llama-3.2-11b-vision-instruct       ✅ 2/2 352tk      ✅ 2/2 493tk      ❌ 0/2 479tk      ✅ 2/2 2643tk     ❓ N/A            ❓ N/A         ❓ N/A
meta-llama/llama-3.2-90b-vision-instruct       🔶 2/4 237tk      🔶 2/4 288tk      🔶 3/4 336tk      🔶 1/4 233tk      ❓ N/A            ❓ N/A         ❓ N/A
mistralai/magistral-medium-2506@markdown       ✅ 1/1 531tk      ✅ 1/1 569tk      ✅ 1/1 666tk      ✅ 1/1 1106tk     ❓ N/A            ❓ N/A         ❓ N/A
mistralai/magistral-medium-2506@tool           ❌ 0/1 465tk      ❌ 0/1 604tk      ❌ 0/1 0tk        ❌ 0/1 0tk        ❓ N/A            ❓ N/A         ❓ N/A
mistralai/magistral-medium-2506@xml            ✅ 1/1 516tk      ✅ 1/1 568tk      ❌ 0/1 552tk      ✅ 1/1 1075tk     ❓ N/A            ❓ N/A         ❓ N/A
moonshotai/kimi-k2-0905@markdown               ✅ 2/2 464tk      ✅ 2/2 590tk      ✅ 2/2 613tk      🔶 1/2 650tk      ❓ N/A            ❓ N/A         ❓ N/A
moonshotai/kimi-k2-0905@tool                   ❌ 0/1 397tk      ❌ 0/1 483tk      ❌ 0/1 592tk      ✅ 1/1 990tk      ❓ N/A            ❓ N/A         ❓ N/A
moonshotai/kimi-k2-0905@xml                    ✅ 1/1 441tk      ✅ 1/1 563tk      ✅ 1/1 848tk      ❌ 0/1 598tk      ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-2-pro-llama-3-8b           ✅ 1/1 341tk      ❌ 0/1 4274tk     ❌ 0/1 3760tk     ❌ 0/1 659tk      ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-3-llama-3.1-405b           ✅ 2/2 317tk      ✅ 2/2 420tk      ✅ 2/2 325tk      ✅ 2/2 410tk      ✅ 1/1 821tk      ✅ 1/1 758tk   ✅ 1/1 1039tk
nousresearch/hermes-3-llama-3.1-70b            ❌ 0/2 173tk      ❌ 0/2 187tk      ❌ 0/2 202tk      ❌ 0/2 177tk      ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-4-70b@markdown             ❌ 0/1 439tk      ✅ 1/1 612tk      ✅ 1/1 1235tk     ❌ 0/1 0tk        ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-4-70b@tool                 ✅ 1/1 476tk      ✅ 1/1 536tk      ❌ 0/1 1310tk     ❌ 0/1 0tk        ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-4-70b@xml                  ✅ 1/1 466tk      ✅ 1/1 516tk      ✅ 1/1 631tk      ❌ 0/1 1255tk     ❓ N/A            ❓ N/A         ❓ N/A
qwen/qwen3-max@markdown                        ✅ 1/1 422tk      ✅ 1/1 492tk      🔶 1/2 615tk      ✅ 1/1 949tk      ❓ N/A            ❓ N/A         ❓ N/A
qwen/qwen3-max@tool                            ✅ 1/1 436tk      ✅ 1/1 519tk      ✅ 1/1 546tk      ✅ 1/1 663tk      ❓ N/A            ❓ N/A         ❓ N/A
qwen/qwen3-max@xml                             ✅ 1/1 437tk      ✅ 1/1 530tk      ✅ 1/1 580tk      ✅ 1/1 901tk      ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-4-fast:free@markdown                 ✅ 1/1 561tk      🔶 1/2 760tk      ✅ 1/1 1326tk     ✅ 1/1 1016tk     ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-code-fast-1@markdown                 ✅ 1/1 661tk      ❌ 0/1 1385tk     ✅ 1/1 829tk      ✅ 1/1 955tk      ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-code-fast-1@tool                     ✅ 1/1 663tk      ❌ 0/1 2590tk     ❌ 0/1 1415tk     ✅ 1/1 1807tk     ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-code-fast-1@xml                      ❌ 0/1 485tk      ❌ 0/1 1652tk     ✅ 1/1 759tk      ✅ 1/1 1112tk     ❓ N/A            ❓ N/A         ❓ N/A
We are working on making the evals more robust, informative, and challenging.
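For anyone who wants to post-process the results table, the sketch below shows one way to read a single cell such as "✅ 55/57 473tk" and a row label such as "openai/gpt-4o@tool". It assumes that a cell holds a passed/total count followed by a token figure ending in "tk" (the text above does not define that figure precisely), and that the part of a row label after "@" is the tool format. The Cell, parse_cell, and split_label names are hypothetical and not part of gptme.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Matches "<passed>/<total> <tokens>tk"; the leading status emoji is ignored.
CELL_RE = re.compile(r"(\d+)/(\d+)\s+(\d+)tk")


class Cell(NamedTuple):
    passed: int
    total: int
    tokens: int  # token figure as printed in the cell

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def parse_cell(text: str) -> Optional[Cell]:
    """Parse one table cell, returning None for cells without data (e.g. N/A)."""
    match = CELL_RE.search(text)
    if match is None:
        return None
    passed, total, tokens = (int(group) for group in match.groups())
    return Cell(passed, total, tokens)


def split_label(label: str) -> Tuple[str, Optional[str]]:
    """Split 'model@format' into (model, format); format is None if absent."""
    model, at, tool_format = label.partition("@")
    return model, (tool_format if at else None)


cell = parse_cell("✅ 55/57 473tk")
print(cell, round(cell.pass_rate, 3))
print(parse_cell("❓ N/A"))
print(split_label("openai/gpt-4o@tool"))
```

Comparing pass rates instead of the status icons alone makes it easier to tell apart rows with very different sample sizes, such as 1/1 and 215/224.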
Other evals
We have considered running gptme on other evals such as SWE-Bench, but that work is not yet finished (see PR #142).
If you are interested in running gptme on other evals, drop a comment in the issues!