Evals

gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?

To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs across a range of tasks.

Note

The evaluation suite is still tiny and under development, but the eval harness is fully functional.

Usage

You can run the simple hello eval like this:

gptme-eval hello --model anthropic/claude-sonnet-4-5

However, we recommend running it in Docker to improve isolation and reproducibility:

make build-docker
docker run \
    -e "ANTHROPIC_API_KEY=<your api key>" \
    -v "$(pwd)/eval_results:/app/eval_results" \
    gptme-eval hello --model anthropic/claude-sonnet-4-5
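
To run more than one eval or model, the command above can be repeated once per combination. The following is a minimal sketch, not part of gptme: `build_eval_commands` is a hypothetical helper, and it assumes only the `gptme-eval <eval> --model <model>` form shown above. It prints the commands instead of running them.

```python
import shlex
from itertools import product


def build_eval_commands(evals, models):
    """Return one argv list per (eval, model) pair.

    Hypothetical helper: it relies only on the
    `gptme-eval <eval> --model <model>` form shown in this section.
    """
    return [
        ["gptme-eval", name, "--model", model]
        for name, model in product(evals, models)
    ]


if __name__ == "__main__":
    commands = build_eval_commands(
        ["hello", "hello-patch"],
        ["anthropic/claude-sonnet-4-5", "openai/gpt-4o"],
    )
    # Dry run: print each command so it can be reviewed before running.
    for argv in commands:
        print(shlex.join(argv))
```

To execute the commands rather than print them, each `argv` list could be passed to `subprocess.run(argv, check=True)`, with the relevant API keys set in the environment.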

Available Evals

The current evaluations test basic tool use in gptme, such as the ability to read, write, and patch files; run code in IPython and commands in the shell; use git; and create new projects with npm and cargo. The suite also includes basic tests for web browsing and data extraction.

Results

Here are the results of the evals we have run so far:

$ gptme-eval eval_results/*/eval_results.csv
Model                                          hello             hello-patch       hello-ask         prime100          init-git          init-rust      whois-superuserlabs-ceo
---------------------------------------------  ----------------  ----------------  ----------------  ----------------  ----------------  -------------  -------------------------
anthropic/claude-3-5-haiku-20241022            ✅ 55/57 473tk    ✅ 56/57 386tk    ✅ 56/57 468tk    ✅ 56/57 912tk    ✅ 54/56 922tk    ❓ N/A         ✅ 1/1 2619tk
anthropic/claude-3-5-haiku-20241022@markdown   ✅ 11/12 373tk    ✅ 12/12 332tk    ✅ 12/12 407tk    ✅ 11/12 758tk    ✅ 11/12 830tk    ❓ N/A         ❓ N/A
anthropic/claude-3-5-haiku-20241022@tool       🔶 67/126 287tk   🔶 68/126 330tk   🔶 68/126 433tk   ❌ 15/126 595tk   🔶 73/126 643tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-haiku-20241022@xml        🔶 67/114 298tk   🔶 68/114 329tk   🔶 46/114 435tk   ❌ 11/114 715tk   🔶 62/114 638tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20240620           ✅ 34/34 630tk    ✅ 34/34 542tk    ✅ 34/34 710tk    ✅ 34/34 924tk    ✅ 34/34 1098tk   ✅ 4/4 1504tk  🔶 3/4 1306tk
anthropic/claude-3-5-sonnet-20241022           ✅ 54/56 378tk    ✅ 54/56 371tk    ✅ 54/56 416tk    ✅ 55/56 869tk    ✅ 55/56 754tk    ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20241022@markdown  🔶 92/208 275tk   🔶 132/208 298tk  🔶 136/208 347tk  🔶 136/208 538tk  🔶 136/208 527tk  ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20241022@tool      🔶 80/208 237tk   🔶 59/208 301tk   🔶 80/208 341tk   🔶 77/208 453tk   🔶 70/208 483tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20241022@xml       ❌ 10/196 261tk   🔶 80/196 281tk   🔶 76/196 323tk   ❌ 12/196 485tk   🔶 41/196 509tk   ❓ N/A         ❓ N/A
anthropic/claude-3-haiku-20240307              ✅ 34/34 388tk    ✅ 34/34 375tk    ✅ 34/34 432tk    ❌ 6/34 781tk     🔶 24/34 903tk    🔶 3/4 670tk   ✅ 4/4 1535tk
anthropic/claude-haiku-4-5@tool                ✅ 82/82 363tk    ✅ 82/82 474tk    ✅ 82/82 553tk    🔶 30/82 1186tk   🔶 65/82 1063tk   ❓ N/A         ❓ N/A
anthropic/claude-haiku-4-5@xml                 ✅ 81/82 391tk    ✅ 81/82 479tk    ✅ 82/82 568tk    ✅ 82/82 851tk    🔶 61/82 1008tk   ❓ N/A         ❓ N/A
anthropic/claude-opus-4-1-20250805@markdown    🔶 1/2 393tk      ❌ 0/1 197tk      ❌ 0/1 204tk      ❌ 0/1 190tk      ❌ 0/1 186tk      ❓ N/A         ❓ N/A
anthropic/claude-opus-4-1-20250805@tool        ❌ 0/1 181tk      ❌ 0/1 196tk      ❌ 0/1 202tk      ❌ 0/1 188tk      ❌ 0/1 184tk      ❓ N/A         ❓ N/A
anthropic/claude-opus-4-1-20250805@xml         🔶 1/2 372tk      ❌ 0/1 197tk      ❌ 0/1 205tk      ❌ 0/1 191tk      ❌ 0/1 187tk      ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-20250514@markdown    ✅ 1/1 439tk      ✅ 1/1 480tk      ✅ 1/1 532tk      ✅ 1/1 1180tk     ✅ 1/1 2006tk     ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-20250514@tool        ✅ 1/1 283tk      ✅ 1/1 320tk      ✅ 1/1 471tk      ✅ 1/1 1133tk     ✅ 1/1 1070tk     ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-20250514@xml         ✅ 1/1 404tk      ✅ 1/1 495tk      ✅ 1/1 584tk      ✅ 1/1 1223tk     ❌ 0/1 1422tk     ❓ N/A         ❓ N/A
deepseek/deepseek-chat@markdown                ✅ 1/1 429tk      ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A         ❓ N/A
deepseek/deepseek-reasoner@markdown            ✅ 1/1 742tk      ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A         ❓ N/A
deepseek/deepseek-reasoner@xml                 ✅ 1/1 680tk      ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A         ❓ N/A
gemini/gemini-1.5-flash-latest                 ❌ 0/42 28tk      ❌ 0/42 28tk      ❌ 0/42 28tk      ❌ 0/42 28tk      ❌ 0/42 28tk      ❓ N/A         ❓ N/A
gemini/gemini-1.5-flash-latest@markdown        ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❓ N/A         ❓ N/A
gemini/gemini-1.5-flash-latest@tool            ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❓ N/A         ❓ N/A
gemini/gemini-2.5-flash@markdown               ✅ 1/1 419tk      ✅ 1/1 450tk      ✅ 1/1 516tk      ✅ 1/1 735tk      ❓ N/A            ❓ N/A         ❓ N/A
gemini/gemini-2.5-flash@xml                    ✅ 1/1 432tk      ✅ 1/1 468tk      ✅ 1/1 534tk      ✅ 1/1 836tk      ❓ N/A            ❓ N/A         ❓ N/A
groq/moonshotai/kimi-k2-instruct@markdown      ✅ 1/1 407tk      ✅ 1/1 497tk      ✅ 1/1 575tk      ✅ 1/1 1079tk     ❓ N/A            ❓ N/A         ❓ N/A
groq/moonshotai/kimi-k2-instruct@xml           ✅ 1/1 424tk      ✅ 1/1 498tk      ✅ 1/1 587tk      ✅ 1/1 1150tk     ❓ N/A            ❓ N/A         ❓ N/A
groq/qwen/qwen3-32b@markdown                   ✅ 1/1 709tk      ✅ 1/1 711tk      ❌ 0/1 0tk        ❌ 0/1 0tk        ✅ 1/1 2609tk     ❓ N/A         ❓ N/A
groq/qwen/qwen3-32b@tool                       ❌ 0/1 172tk      ❌ 0/1 187tk      ❌ 0/1 194tk      ❌ 0/1 179tk      ❌ 0/1 175tk      ❓ N/A         ❓ N/A
groq/qwen/qwen3-32b@xml                        ✅ 1/1 1918tk     ✅ 1/1 725tk      ✅ 1/1 1115tk     ❌ 0/1 5171tk     ❌ 0/1 4367tk     ❓ N/A         ❓ N/A
openai/gpt-4-turbo                             ✅ 3/3 255tk      ✅ 3/3 312tk      ✅ 3/3 376tk      ✅ 3/3 527tk      ✅ 4/4 590tk      ✅ 4/4 784tk   ✅ 6/7 819tk
openai/gpt-4o                                  ✅ 90/91 325tk    ✅ 90/91 313tk    🔶 65/91 337tk    🔶 35/91 441tk    🔶 67/91 626tk    ✅ 5/5 663tk   ✅ 5/5 1253tk
openai/gpt-4o-mini                             🔶 60/92 269tk    ✅ 91/92 321tk    ✅ 91/92 377tk    🔶 62/92 510tk    ✅ 80/92 759tk    ✅ 6/6 813tk   ✅ 6/6 951tk
openai/gpt-4o-mini@markdown                    ❌ 1/12 217tk     ✅ 12/12 302tk    ✅ 12/12 361tk    🔶 3/12 440tk     ✅ 12/12 823tk    ❓ N/A         ❓ N/A
openai/gpt-4o-mini@tool                        ✅ 199/208 286tk  ✅ 179/208 308tk  ✅ 208/208 416tk  🔶 155/208 619tk  ✅ 208/208 680tk  ❓ N/A         ❓ N/A
openai/gpt-4o@markdown                         🔶 76/208 288tk   🔶 86/208 367tk   ✅ 191/208 391tk  ❌ 27/208 534tk   🔶 79/208 633tk   ❓ N/A         ❓ N/A
openai/gpt-4o@tool                             🔶 43/208 289tk   🔶 141/208 446tk  🔶 129/208 452tk  🔶 76/208 707tk   ✅ 205/208 828tk  ❓ N/A         ❓ N/A
openai/gpt-4o@xml                              ❌ 9/196 251tk    🔶 46/196 289tk   ❌ 12/196 271tk   ❌ 1/196 371tk    ❌ 2/196 299tk    ❓ N/A         ❓ N/A
openai/gpt-5-mini@markdown                     ✅ 1/1 379tk      ✅ 1/1 358tk      ✅ 1/1 884tk      ❌ 0/1 0tk        ❌ 0/1 715tk      ❓ N/A         ❓ N/A
openai/gpt-5-mini@tool                         ✅ 1/1 286tk      ✅ 1/1 294tk      ✅ 1/1 1150tk     ❌ 0/1 0tk        ❌ 0/1 0tk        ❓ N/A         ❓ N/A
openai/gpt-5-mini@xml                          ❌ 0/1 331tk      ✅ 1/1 386tk      ✅ 1/1 922tk      ✅ 1/1 1369tk     ✅ 1/1 1614tk     ❓ N/A         ❓ N/A
openai/gpt-5@markdown                          ✅ 1/1 228tk      ✅ 1/1 273tk      ✅ 1/1 367tk      ✅ 1/1 695tk      ❌ 0/1 0tk        ❓ N/A         ❓ N/A
openai/gpt-5@tool                              ✅ 1/1 306tk      ✅ 1/1 299tk      ✅ 1/1 348tk      ❌ 0/1 0tk        ❌ 0/1 0tk        ❓ N/A         ❓ N/A
openai/gpt-5@xml                               ✅ 1/1 237tk      ❌ 0/1 218tk      ❌ 0/1 217tk      ❌ 0/1 678tk      ❌ 0/1 255tk      ❓ N/A         ❓ N/A
openai/o1-mini                                 ✅ 3/3 354tk      ✅ 3/3 431tk      ✅ 3/3 460tk      🔶 2/3 567tk      🔶 3/5 2222tk     🔶 1/5 1412tk  🔶 2/3 813tk
openai/o1-preview                              ✅ 2/2 308tk      ✅ 2/2 570tk      🔶 1/2 549tk      ✅ 2/2 490tk      ✅ 3/3 823tk      ✅ 1/1 656tk   ✅ 1/1 1998tk
google/gemini-flash-1.5                        ✅ 2/2 225tk      ✅ 2/2 401tk      ✅ 2/2 430tk      ❌ 0/2 296tk      ✅ 1/1 686tk      ❌ 0/1 661tk   ✅ 1/1 1014tk
google/gemini-pro-1.5                          ✅ 1/1 341tk      ✅ 1/1 419tk      ✅ 1/1 456tk      ✅ 1/1 676tk      🔶 2/3 431tk      🔶 1/2 1016tk  ✅ 2/2 1308tk
google/gemma-2-27b-it                          ✅ 1/1 288tk      ✅ 1/1 384tk      ✅ 1/1 446tk      ✅ 1/1 714tk      ✅ 1/1 570tk      ❌ 0/1 535tk   ❌ 0/1 235tk
google/gemma-2-9b-it                           ❌ 0/2 186tk      ✅ 2/2 370tk      ✅ 2/2 368tk      ❌ 0/2 545tk      ✅ 1/1 492tk      ❌ 0/1 1730tk  ❌ 0/1 352tk
meta-llama/llama-3.1-405b-instruct             🔶 69/91 244tk    ✅ 74/91 417tk    ✅ 73/91 364tk    🔶 71/91 440tk    🔶 60/91 488tk    🔶 2/5 255tk   ❌ 0/5 85tk
meta-llama/llama-3.1-405b-instruct@markdown    ✅ 11/12 318tk    ✅ 10/12 269tk    ✅ 10/12 366tk    🔶 8/12 516tk     ✅ 11/12 600tk    ❓ N/A         ❓ N/A
meta-llama/llama-3.1-405b-instruct@tool        ❌ 0/12 184tk     ❌ 0/12 198tk     ❌ 0/12 214tk     ❌ 0/12 190tk     ❌ 0/12 189tk     ❓ N/A         ❓ N/A
meta-llama/llama-3.1-70b-instruct              ✅ 5/6 367tk      ✅ 5/6 424tk      ✅ 6/6 452tk      🔶 2/6 546tk      ✅ 5/6 813tk      🔶 3/4 682tk   🔶 2/3 1461tk
meta-llama/llama-3.1-70b-instruct@xml          🔶 148/196 268tk  ❌ 8/196 365tk    🔶 46/196 443tk   ❌ 16/196 398tk   🔶 114/196 479tk  ❓ N/A         ❓ N/A
meta-llama/llama-3.1-8b-instruct               ✅ 1/1 277tk      ✅ 1/1 441tk      ❌ 0/1 400tk      ❌ 0/1 5095tk     ✅ 1/1 2266tk     ❓ N/A         ❓ N/A
meta-llama/llama-3.2-11b-vision-instruct       ✅ 2/2 352tk      ✅ 2/2 493tk      ❌ 0/2 479tk      ✅ 2/2 2643tk     ❓ N/A            ❓ N/A         ❓ N/A
meta-llama/llama-3.2-90b-vision-instruct       🔶 2/4 237tk      🔶 2/4 288tk      🔶 3/4 336tk      🔶 1/4 233tk      ❓ N/A            ❓ N/A         ❓ N/A
mistralai/magistral-medium-2506@markdown       ✅ 1/1 531tk      ✅ 1/1 569tk      ✅ 1/1 666tk      ✅ 1/1 1106tk     ❓ N/A            ❓ N/A         ❓ N/A
mistralai/magistral-medium-2506@tool           ❌ 0/1 465tk      ❌ 0/1 604tk      ❌ 0/1 0tk        ❌ 0/1 0tk        ❓ N/A            ❓ N/A         ❓ N/A
mistralai/magistral-medium-2506@xml            ✅ 1/1 516tk      ✅ 1/1 568tk      ❌ 0/1 552tk      ✅ 1/1 1075tk     ❓ N/A            ❓ N/A         ❓ N/A
moonshotai/kimi-k2-0905@markdown               ✅ 2/2 464tk      ✅ 2/2 590tk      ✅ 2/2 613tk      🔶 1/2 650tk      ❓ N/A            ❓ N/A         ❓ N/A
moonshotai/kimi-k2-0905@tool                   ❌ 0/1 397tk      ❌ 0/1 483tk      ❌ 0/1 592tk      ✅ 1/1 990tk      ❓ N/A            ❓ N/A         ❓ N/A
moonshotai/kimi-k2-0905@xml                    ✅ 1/1 441tk      ✅ 1/1 563tk      ✅ 1/1 848tk      ❌ 0/1 598tk      ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-2-pro-llama-3-8b           ✅ 1/1 341tk      ❌ 0/1 4274tk     ❌ 0/1 3760tk     ❌ 0/1 659tk      ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-3-llama-3.1-405b           ✅ 2/2 317tk      ✅ 2/2 420tk      ✅ 2/2 325tk      ✅ 2/2 410tk      ✅ 1/1 821tk      ✅ 1/1 758tk   ✅ 1/1 1039tk
nousresearch/hermes-3-llama-3.1-70b            ❌ 0/2 173tk      ❌ 0/2 187tk      ❌ 0/2 202tk      ❌ 0/2 177tk      ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-4-70b@markdown             ❌ 0/1 439tk      ✅ 1/1 612tk      ✅ 1/1 1235tk     ❌ 0/1 0tk        ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-4-70b@tool                 ✅ 1/1 476tk      ✅ 1/1 536tk      ❌ 0/1 1310tk     ❌ 0/1 0tk        ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-4-70b@xml                  ✅ 1/1 466tk      ✅ 1/1 516tk      ✅ 1/1 631tk      ❌ 0/1 1255tk     ❓ N/A            ❓ N/A         ❓ N/A
qwen/qwen3-max@markdown                        ✅ 1/1 422tk      ✅ 1/1 492tk      🔶 1/2 615tk      ✅ 1/1 949tk      ❓ N/A            ❓ N/A         ❓ N/A
qwen/qwen3-max@tool                            ✅ 1/1 436tk      ✅ 1/1 519tk      ✅ 1/1 546tk      ✅ 1/1 663tk      ❓ N/A            ❓ N/A         ❓ N/A
qwen/qwen3-max@xml                             ✅ 1/1 437tk      ✅ 1/1 530tk      ✅ 1/1 580tk      ✅ 1/1 901tk      ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-4-fast:free@markdown                 ✅ 1/1 561tk      🔶 1/2 760tk      ✅ 1/1 1326tk     ✅ 1/1 1016tk     ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-code-fast-1@markdown                 ✅ 1/1 661tk      ❌ 0/1 1385tk     ✅ 1/1 829tk      ✅ 1/1 955tk      ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-code-fast-1@tool                     ✅ 1/1 663tk      ❌ 0/1 2590tk     ❌ 0/1 1415tk     ✅ 1/1 1807tk     ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-code-fast-1@xml                      ❌ 0/1 485tk      ❌ 0/1 1652tk     ✅ 1/1 759tk      ✅ 1/1 1112tk     ❓ N/A            ❓ N/A         ❓ N/A
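
Each cell in the table above has the form `<icon> <n>/<m> <k>tk`. The sketch below shows one way to turn a cell into numbers for comparison. It assumes that `n/m` is the number of passed runs out of total runs and that `tk` is a token count; neither is stated explicitly here. `parse_cell` and `CellResult` are hypothetical names, not part of gptme.

```python
import re
from typing import NamedTuple, Optional


class CellResult(NamedTuple):
    passed: int  # assumed: number of runs that passed
    total: int   # assumed: number of runs attempted
    tokens: int  # assumed: token count reported as "<k>tk"

    @property
    def pass_rate(self) -> float:
        # Guard against an empty run count, although the table has none.
        return self.passed / self.total if self.total else 0.0


# Matches e.g. "55/57 473tk"; the leading status icon is ignored.
CELL_RE = re.compile(r"(\d+)/(\d+)\s+(\d+)tk")


def parse_cell(cell: str) -> Optional[CellResult]:
    """Parse one table cell, returning None for cells such as "❓ N/A"."""
    match = CELL_RE.search(cell)
    if match is None:
        return None
    passed, total, tokens = map(int, match.groups())
    return CellResult(passed, total, tokens)


if __name__ == "__main__":
    for cell in ["✅ 55/57 473tk", "🔶 3/4 1306tk", "❓ N/A"]:
        result = parse_cell(cell)
        if result is None:
            print(f"{cell!r}: no data")
        else:
            print(f"{cell!r}: {result.pass_rate:.0%} passed")
```

Note that many rows are based on a single run (`1/1` or `0/1`), so a computed pass rate for those rows says little on its own.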

We are working on making the evals more robust, informative, and challenging.

Other evals

We have looked into running gptme on other evals such as SWE-Bench, but that work is not finished (see PR #142).

If you are interested in running gptme on other evals, drop a comment in the issues!