Evals

gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?

To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.

Note

The evaluation suite is still tiny and under development, but the eval harness is fully functional.

Usage

You can run the simple hello eval like this:

gptme-eval hello --model anthropic/claude-sonnet-4-5

However, we recommend running it in Docker to improve isolation and reproducibility:

make build-docker
docker run \
    -e "ANTHROPIC_API_KEY=<your api key>" \
    -v "$(pwd)/eval_results:/app/eval_results" \
    gptme-eval hello --model anthropic/claude-sonnet-4-5

Available Evals

The current evals test basic tool use in gptme, such as the ability to read, write, and patch files; run code in IPython and commands in the shell; use git; and create new projects with npm and cargo. There are also basic tests for web browsing and data extraction.

Results

Here are the results of the evals we have run so far:

$ gptme-eval eval_results/*/eval_results.csv
2026-02-22 12:04:33,637 - INFO - gptme.config - Using user configuration from ~/.config/gptme/config.toml
Model                                     Format    hello             hello-patch       hello-ask         prime100          init-git          init-rust      whois-superuserlabs-ceo
----------------------------------------  --------  ----------------  ----------------  ----------------  ----------------  ----------------  -------------  -------------------------
anthropic/claude-3-5-haiku-20241022       markdown  ✅ 66/69 456tk    ✅ 68/69 377tk    ✅ 68/69 458tk    ✅ 67/69 885tk    ✅ 65/68 905tk    ❓ N/A         ✅ 1/1 2619tk
anthropic/claude-3-5-haiku-20241022       tool      🔶 67/126 287tk   🔶 68/126 330tk   🔶 68/126 433tk   ❌ 15/126 595tk   🔶 73/126 643tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-haiku-20241022       xml       🔶 67/114 298tk   🔶 68/114 329tk   🔶 46/114 435tk   ❌ 11/114 715tk   🔶 62/114 638tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20240620      markdown  ✅ 34/34 630tk    ✅ 34/34 542tk    ✅ 34/34 710tk    ✅ 34/34 924tk    ✅ 34/34 1098tk   ✅ 4/4 1504tk  🔶 3/4 1306tk
anthropic/claude-3-5-sonnet-20241022      markdown  🔶 146/294 286tk  🔶 186/294 302tk  🔶 190/294 346tk  🔶 191/294 567tk  🔶 191/294 536tk  ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20241022      tool      🔶 80/238 230tk   🔶 59/238 288tk   🔶 80/238 324tk   🔶 77/238 420tk   🔶 70/238 446tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20241022      xml       ❌ 10/226 251tk   🔶 80/226 271tk   🔶 76/226 308tk   ❌ 12/226 447tk   ❌ 41/226 466tk   ❓ N/A         ❓ N/A
anthropic/claude-3-haiku-20240307         markdown  ✅ 34/34 388tk    ✅ 34/34 375tk    ✅ 34/34 432tk    ❌ 6/34 781tk     🔶 24/34 903tk    🔶 3/4 670tk   ✅ 4/4 1535tk
anthropic/claude-haiku-4-5                tool      🔶 91/116 350tk   🔶 91/116 456tk   🔶 91/116 524tk   🔶 39/116 1075tk  🔶 72/116 904tk   ❓ N/A         ❓ N/A
anthropic/claude-haiku-4-5                xml       ✅ 111/116 386tk  ✅ 110/116 470tk  ✅ 112/116 557tk  ✅ 112/116 803tk  🔶 73/116 944tk   ❓ N/A         ❓ N/A
anthropic/claude-opus-4-1-20250805        markdown  🔶 1/2 393tk      ❌ 0/1 197tk      ❌ 0/1 204tk      ❌ 0/1 190tk      ❌ 0/1 186tk      ❓ N/A         ❓ N/A
anthropic/claude-opus-4-1-20250805        tool      ❌ 0/1 181tk      ❌ 0/1 196tk      ❌ 0/1 202tk      ❌ 0/1 188tk      ❌ 0/1 184tk      ❓ N/A         ❓ N/A
anthropic/claude-opus-4-1-20250805        xml       🔶 1/2 372tk      ❌ 0/1 197tk      ❌ 0/1 205tk      ❌ 0/1 191tk      ❌ 0/1 187tk      ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-20250514        markdown  ✅ 1/1 439tk      ✅ 1/1 480tk      ✅ 1/1 532tk      ✅ 1/1 1180tk     ✅ 1/1 2006tk     ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-20250514        tool      ✅ 1/1 283tk      ✅ 1/1 320tk      ✅ 1/1 471tk      ✅ 1/1 1133tk     ✅ 1/1 1070tk     ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-20250514        xml       ✅ 1/1 404tk      ✅ 1/1 495tk      ✅ 1/1 584tk      ✅ 1/1 1223tk     ❌ 0/1 1422tk     ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-6               markdown  ✅ 4/4 294tk      ✅ 4/4 373tk      ✅ 4/4 465tk      ✅ 4/4 430tk      ✅ 4/4 650tk      ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-6               tool      ❌ 0/4 256tk      ❌ 0/4 365tk      ❌ 0/4 374tk      ❌ 0/4 338tk      ❌ 0/4 301tk      ❓ N/A         ❓ N/A
anthropic/claude-sonnet-4-6               xml       ✅ 4/4 306tk      ✅ 4/4 390tk      ✅ 4/4 417tk      ✅ 4/4 449tk      🔶 3/4 831tk      ❓ N/A         ❓ N/A
deepseek/deepseek-chat                    markdown  ✅ 1/1 429tk      ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A         ❓ N/A
deepseek/deepseek-reasoner                markdown  ✅ 1/1 742tk      ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A         ❓ N/A
deepseek/deepseek-reasoner                xml       ✅ 1/1 680tk      ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A            ❓ N/A         ❓ N/A
gemini/gemini-1.5-flash-latest            markdown  ❌ 0/54 28tk      ❌ 0/54 28tk      ❌ 0/54 28tk      ❌ 0/54 28tk      ❌ 0/54 28tk      ❓ N/A         ❓ N/A
gemini/gemini-1.5-flash-latest            tool      ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❌ 0/12 29tk      ❓ N/A         ❓ N/A
gemini/gemini-2.5-flash                   markdown  ✅ 1/1 419tk      ✅ 1/1 450tk      ✅ 1/1 516tk      ✅ 1/1 735tk      ❓ N/A            ❓ N/A         ❓ N/A
gemini/gemini-2.5-flash                   xml       ✅ 1/1 432tk      ✅ 1/1 468tk      ✅ 1/1 534tk      ✅ 1/1 836tk      ❓ N/A            ❓ N/A         ❓ N/A
groq/moonshotai/kimi-k2-instruct          markdown  ✅ 1/1 407tk      ✅ 1/1 497tk      ✅ 1/1 575tk      ✅ 1/1 1079tk     ❓ N/A            ❓ N/A         ❓ N/A
groq/moonshotai/kimi-k2-instruct          xml       ✅ 1/1 424tk      ✅ 1/1 498tk      ✅ 1/1 587tk      ✅ 1/1 1150tk     ❓ N/A            ❓ N/A         ❓ N/A
groq/qwen/qwen3-32b                       markdown  ✅ 1/1 709tk      ✅ 1/1 711tk      ❌ 0/1 0tk        ❌ 0/1 0tk        ✅ 1/1 2609tk     ❓ N/A         ❓ N/A
groq/qwen/qwen3-32b                       tool      ❌ 0/1 172tk      ❌ 0/1 187tk      ❌ 0/1 194tk      ❌ 0/1 179tk      ❌ 0/1 175tk      ❓ N/A         ❓ N/A
groq/qwen/qwen3-32b                       xml       ✅ 1/1 1918tk     ✅ 1/1 725tk      ✅ 1/1 1115tk     ❌ 0/1 5171tk     ❌ 0/1 4367tk     ❓ N/A         ❓ N/A
openai/gpt-4-turbo                        markdown  ✅ 3/3 255tk      ✅ 3/3 312tk      ✅ 3/3 376tk      ✅ 3/3 527tk      ✅ 4/4 590tk      ✅ 4/4 784tk   ✅ 6/7 819tk
openai/gpt-4o-mini                        markdown  🔶 61/104 263tk   ✅ 103/104 319tk  ✅ 103/104 375tk  🔶 65/104 502tk   ✅ 92/104 766tk   ✅ 6/6 813tk   ✅ 6/6 951tk
openai/gpt-4o-mini                        tool      ✅ 233/242 289tk  ✅ 196/242 292tk  ✅ 242/242 420tk  🔶 171/242 621tk  ✅ 242/242 680tk  ❓ N/A         ❓ N/A
openai/gpt-4o                             markdown  🔶 167/333 299tk  🔶 180/333 356tk  ✅ 287/333 382tk  🔶 69/333 512tk   🔶 177/333 654tk  ✅ 5/5 663tk   ✅ 5/5 1253tk
openai/gpt-4o                             tool      ❌ 43/242 292tk   🔶 158/242 454tk  🔶 138/242 461tk  🔶 78/242 691tk   ✅ 237/242 834tk  ❓ N/A         ❓ N/A
openai/gpt-4o                             xml       ❌ 12/230 260tk   🔶 52/230 307tk   ❌ 24/230 283tk   ❌ 1/230 360tk    ❌ 2/230 303tk    ❓ N/A         ❓ N/A
openai/gpt-5-mini                         markdown  ✅ 1/1 379tk      ✅ 1/1 358tk      ✅ 1/1 884tk      ❌ 0/1 0tk        ❌ 0/1 715tk      ❓ N/A         ❓ N/A
openai/gpt-5-mini                         tool      ✅ 1/1 286tk      ✅ 1/1 294tk      ✅ 1/1 1150tk     ❌ 0/1 0tk        ❌ 0/1 0tk        ❓ N/A         ❓ N/A
openai/gpt-5-mini                         xml       ❌ 0/1 331tk      ✅ 1/1 386tk      ✅ 1/1 922tk      ✅ 1/1 1369tk     ✅ 1/1 1614tk     ❓ N/A         ❓ N/A
openai/gpt-5                              markdown  ✅ 1/1 228tk      ✅ 1/1 273tk      ✅ 1/1 367tk      ✅ 1/1 695tk      ❌ 0/1 0tk        ❓ N/A         ❓ N/A
openai/gpt-5                              tool      ✅ 1/1 306tk      ✅ 1/1 299tk      ✅ 1/1 348tk      ❌ 0/1 0tk        ❌ 0/1 0tk        ❓ N/A         ❓ N/A
openai/gpt-5                              xml       ✅ 1/1 237tk      ❌ 0/1 218tk      ❌ 0/1 217tk      ❌ 0/1 678tk      ❌ 0/1 255tk      ❓ N/A         ❓ N/A
openai/o1-mini                            markdown  ✅ 3/3 354tk      ✅ 3/3 431tk      ✅ 3/3 460tk      🔶 2/3 567tk      🔶 3/5 2222tk     🔶 1/5 1412tk  🔶 2/3 813tk
openai/o1-preview                         markdown  ✅ 2/2 308tk      ✅ 2/2 570tk      🔶 1/2 549tk      ✅ 2/2 490tk      ✅ 3/3 823tk      ✅ 1/1 656tk   ✅ 1/1 1998tk
google/gemini-flash-1.5                   markdown  ✅ 2/2 225tk      ✅ 2/2 401tk      ✅ 2/2 430tk      ❌ 0/2 296tk      ✅ 1/1 686tk      ❌ 0/1 661tk   ✅ 1/1 1014tk
google/gemini-pro-1.5                     markdown  ✅ 1/1 341tk      ✅ 1/1 419tk      ✅ 1/1 456tk      ✅ 1/1 676tk      🔶 2/3 431tk      🔶 1/2 1016tk  ✅ 2/2 1308tk
google/gemma-2-27b-it                     markdown  ✅ 1/1 288tk      ✅ 1/1 384tk      ✅ 1/1 446tk      ✅ 1/1 714tk      ✅ 1/1 570tk      ❌ 0/1 535tk   ❌ 0/1 235tk
google/gemma-2-9b-it                      markdown  ❌ 0/2 186tk      ✅ 2/2 370tk      ✅ 2/2 368tk      ❌ 0/2 545tk      ✅ 1/1 492tk      ❌ 0/1 1730tk  ❌ 0/1 352tk
meta-llama/llama-3.1-405b-instruct        markdown  🔶 80/103 253tk   ✅ 84/103 400tk   ✅ 83/103 364tk   🔶 79/103 449tk   🔶 71/103 501tk   🔶 2/5 255tk   ❌ 0/5 85tk
meta-llama/llama-3.1-405b-instruct        tool      ❌ 0/12 184tk     ❌ 0/12 198tk     ❌ 0/12 214tk     ❌ 0/12 190tk     ❌ 0/12 189tk     ❓ N/A         ❓ N/A
meta-llama/llama-3.1-70b-instruct         markdown  ✅ 5/6 367tk      ✅ 5/6 424tk      ✅ 6/6 452tk      🔶 2/6 546tk      ✅ 5/6 813tk      🔶 3/4 682tk   🔶 2/3 1461tk
meta-llama/llama-3.1-70b-instruct         xml       🔶 182/230 273tk  ❌ 9/230 355tk    🔶 56/230 412tk   ❌ 21/230 373tk   🔶 136/230 474tk  ❓ N/A         ❓ N/A
meta-llama/llama-3.1-8b-instruct          markdown  ✅ 1/1 277tk      ✅ 1/1 441tk      ❌ 0/1 400tk      ❌ 0/1 5095tk     ✅ 1/1 2266tk     ❓ N/A         ❓ N/A
meta-llama/llama-3.2-11b-vision-instruct  markdown  ✅ 2/2 352tk      ✅ 2/2 493tk      ❌ 0/2 479tk      ✅ 2/2 2643tk     ❓ N/A            ❓ N/A         ❓ N/A
meta-llama/llama-3.2-90b-vision-instruct  markdown  🔶 2/4 237tk      🔶 2/4 288tk      🔶 3/4 336tk      🔶 1/4 233tk      ❓ N/A            ❓ N/A         ❓ N/A
mistralai/magistral-medium-2506           markdown  ✅ 1/1 531tk      ✅ 1/1 569tk      ✅ 1/1 666tk      ✅ 1/1 1106tk     ❓ N/A            ❓ N/A         ❓ N/A
mistralai/magistral-medium-2506           tool      ❌ 0/1 465tk      ❌ 0/1 604tk      ❌ 0/1 0tk        ❌ 0/1 0tk        ❓ N/A            ❓ N/A         ❓ N/A
mistralai/magistral-medium-2506           xml       ✅ 1/1 516tk      ✅ 1/1 568tk      ❌ 0/1 552tk      ✅ 1/1 1075tk     ❓ N/A            ❓ N/A         ❓ N/A
moonshotai/kimi-k2-0905                   markdown  ✅ 2/2 464tk      ✅ 2/2 590tk      ✅ 2/2 613tk      🔶 1/2 650tk      ❓ N/A            ❓ N/A         ❓ N/A
moonshotai/kimi-k2-0905                   tool      ❌ 0/1 397tk      ❌ 0/1 483tk      ❌ 0/1 592tk      ✅ 1/1 990tk      ❓ N/A            ❓ N/A         ❓ N/A
moonshotai/kimi-k2-0905                   xml       ✅ 1/1 441tk      ✅ 1/1 563tk      ✅ 1/1 848tk      ❌ 0/1 598tk      ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-2-pro-llama-3-8b      markdown  ✅ 1/1 341tk      ❌ 0/1 4274tk     ❌ 0/1 3760tk     ❌ 0/1 659tk      ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-3-llama-3.1-405b      markdown  ✅ 2/2 317tk      ✅ 2/2 420tk      ✅ 2/2 325tk      ✅ 2/2 410tk      ✅ 1/1 821tk      ✅ 1/1 758tk   ✅ 1/1 1039tk
nousresearch/hermes-3-llama-3.1-70b       markdown  ❌ 0/2 173tk      ❌ 0/2 187tk      ❌ 0/2 202tk      ❌ 0/2 177tk      ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-4-70b                 markdown  ❌ 0/1 439tk      ✅ 1/1 612tk      ✅ 1/1 1235tk     ❌ 0/1 0tk        ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-4-70b                 tool      ✅ 1/1 476tk      ✅ 1/1 536tk      ❌ 0/1 1310tk     ❌ 0/1 0tk        ❓ N/A            ❓ N/A         ❓ N/A
nousresearch/hermes-4-70b                 xml       ✅ 1/1 466tk      ✅ 1/1 516tk      ✅ 1/1 631tk      ❌ 0/1 1255tk     ❓ N/A            ❓ N/A         ❓ N/A
qwen/qwen3-max                            markdown  ✅ 1/1 422tk      ✅ 1/1 492tk      🔶 1/2 615tk      ✅ 1/1 949tk      ❓ N/A            ❓ N/A         ❓ N/A
qwen/qwen3-max                            tool      ✅ 1/1 436tk      ✅ 1/1 519tk      ✅ 1/1 546tk      ✅ 1/1 663tk      ❓ N/A            ❓ N/A         ❓ N/A
qwen/qwen3-max                            xml       ✅ 1/1 437tk      ✅ 1/1 530tk      ✅ 1/1 580tk      ✅ 1/1 901tk      ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-4-fast:free                     markdown  ✅ 1/1 561tk      🔶 1/2 760tk      ✅ 1/1 1326tk     ✅ 1/1 1016tk     ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-code-fast-1                     markdown  ✅ 1/1 661tk      ❌ 0/1 1385tk     ✅ 1/1 829tk      ✅ 1/1 955tk      ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-code-fast-1                     tool      ✅ 1/1 663tk      ❌ 0/1 2590tk     ❌ 0/1 1415tk     ✅ 1/1 1807tk     ❓ N/A            ❓ N/A         ❓ N/A
x-ai/grok-code-fast-1                     xml       ❌ 0/1 485tk      ❌ 0/1 1652tk     ✅ 1/1 759tk      ✅ 1/1 1112tk     ❓ N/A            ❓ N/A         ❓ N/A

We are working on making the evals more robust, informative, and challenging.
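
To compare rows of the results table above more easily, the cells can be turned into numbers. The following is a minimal sketch, not part of gptme: it assumes each cell reads as passed/total runs followed by a token count (for example `66/69 456tk`), that columns are separated by two or more spaces, and that `❓ N/A` means no data. The names `Cell`, `parse_cell`, and `parse_row` are hypothetical helpers introduced here for illustration.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Cell(NamedTuple):
    """One table cell, read as passed/total runs plus a token count (assumed meaning)."""

    passed: int
    total: int
    tokens: int

    @property
    def pass_rate(self) -> float:
        # Guard against an empty run count to avoid dividing by zero.
        return self.passed / self.total if self.total else 0.0


# Matches e.g. "66/69 456tk"; the leading status icon is ignored.
_CELL_RE = re.compile(r"(\d+)/(\d+)\s+(\d+)tk")


def parse_cell(text: str) -> Optional[Cell]:
    """Return a Cell, or None for cells without data such as "N/A"."""
    match = _CELL_RE.search(text)
    if match is None:
        return None
    passed, total, tokens = map(int, match.groups())
    return Cell(passed, total, tokens)


def parse_row(line: str) -> Tuple[str, str, List[Optional[Cell]]]:
    """Split one table row into (model, format, cells)."""
    # Assumption: columns are separated by at least two spaces,
    # while the text inside a cell only uses single spaces.
    parts = re.split(r"\s{2,}", line.strip())
    model, fmt, cells = parts[0], parts[1], parts[2:]
    return model, fmt, [parse_cell(c) for c in cells]


if __name__ == "__main__":
    row = (
        "anthropic/claude-3-5-haiku-20241022       markdown  "
        "✅ 66/69 456tk    ❓ N/A         ✅ 1/1 2619tk"
    )
    model, fmt, cells = parse_row(row)
    print(model, fmt, f"{cells[0].pass_rate:.0%}")
    # prints: anthropic/claude-3-5-haiku-20241022 markdown 96%
```

Returning `None` for missing cells keeps "not run" distinct from "0% pass rate", which matters here because many models have only been run on a subset of the evals.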

Other evals

We have considered running gptme on other evals such as SWE-Bench, but that work is not finished yet (see PR #142).

If you are interested in running gptme on other evals, drop a comment in the issues!