Evals#
gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?
To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.
Note
The evaluation suite is still tiny and under development, but the eval harness is fully functional.
Recommended Model#
The recommended model is Claude 3.7 Sonnet (anthropic/claude-3-7-sonnet-20250219) for its:
Strong coding capabilities
Strong performance across all tool types
Reasoning capabilities
Vision & computer use capabilities
Decent alternatives include:
GPT-4o (openai/gpt-4o)
Llama 3.1 405B (openrouter/meta-llama/llama-3.1-405b-instruct)
DeepSeek V3 (deepseek/deepseek-chat)
DeepSeek R1 (deepseek/deepseek-reasoner)
Usage#
You can run the simple hello eval with Claude 3.7 Sonnet like this:
gptme-eval hello --model anthropic/claude-3-7-sonnet-20250219
However, we recommend running it in Docker to improve isolation and reproducibility:
make build-docker
docker run \
    -e "ANTHROPIC_API_KEY=<your api key>" \
    -v "$(pwd)/eval_results:/app/eval_results" \
    gptme-eval hello --model anthropic/claude-3-7-sonnet-20250219
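To compare several models on the same eval, the command above can be generated once per model. The following is a minimal Python sketch, not part of gptme: `MODELS` and `build_eval_command` are hypothetical names introduced here, and the sketch assumes only the `gptme-eval <eval> --model <model>` form shown above. It prints the commands instead of running them so they can be reviewed first.

```python
import shlex

# Model identifiers copied from the "Recommended Model" section.
MODELS = [
    "anthropic/claude-3-7-sonnet-20250219",
    "openai/gpt-4o",
    "deepseek/deepseek-chat",
]


def build_eval_command(eval_name: str, model: str) -> list[str]:
    """Return the argv for one gptme-eval run (hypothetical helper)."""
    # Mirrors the documented form: gptme-eval <eval> --model <model>
    return ["gptme-eval", eval_name, "--model", model]


if __name__ == "__main__":
    for model in MODELS:
        # Print a shell-safe command line rather than executing it.
        print(shlex.join(build_eval_command("hello", model)))
```

Each printed line can be pasted into a shell, or the argv list can be passed to `subprocess.run` once the commands look right.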
Available Evals#
The current evals test basic tool use in gptme: reading, writing, and patching files; running code in IPython and commands in the shell; using git; and creating new projects with npm and cargo. The suite also includes basic tests for web browsing and data extraction.
Results#
Here are the results of the evals we have run so far:
$ gptme-eval eval_results/*/eval_results.csv
[11:03:28] Browser tool available (using playwright)
Model                                          hello           hello-patch     hello-ask       prime100        init-git         init-rust      whois-superuserlabs-ceo
---------------------------------------------  --------------  --------------  --------------  --------------  ---------------  -------------  -------------------------
anthropic/claude-3-5-haiku-20241022            ✅ 55/57 473tk  ✅ 56/57 386tk  ✅ 56/57 468tk  ✅ 56/57 912tk  ✅ 54/56 922tk   ❓ N/A         ✅ 1/1 2619tk
anthropic/claude-3-5-haiku-20241022@markdown   ✅ 11/12 373tk  ✅ 12/12 332tk  ✅ 12/12 407tk  ✅ 11/12 758tk  ✅ 11/12 830tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-haiku-20241022@tool       ❌ 4/61 253tk   ❌ 6/61 283tk   ❌ 6/61 340tk   ❌ 5/61 533tk   ❌ 11/61 491tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-haiku-20241022@xml        ❌ 4/49 264tk   ❌ 5/49 277tk   ❌ 5/49 329tk   ❌ 4/49 505tk   ❌ 4/49 431tk    ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20240620           ✅ 34/34 630tk  ✅ 34/34 542tk  ✅ 34/34 710tk  ✅ 34/34 924tk  ✅ 34/34 1098tk  ✅ 4/4 1504tk  🔶 3/4 1306tk
anthropic/claude-3-5-sonnet-20241022           ✅ 54/56 378tk  ✅ 54/56 371tk  ✅ 54/56 416tk  ✅ 55/56 869tk  ✅ 55/56 754tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20241022@markdown  ✅ 61/61 370tk  ✅ 61/61 316tk  ✅ 61/61 402tk  ✅ 61/61 767tk  ✅ 61/61 713tk   ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20241022@tool      ❌ 5/61 225tk   ❌ 5/61 251tk   ❌ 5/61 312tk   ❌ 5/61 423tk   ❌ 5/61 373tk    ❓ N/A         ❓ N/A
anthropic/claude-3-5-sonnet-20241022@xml       ❌ 5/49 248tk   ❌ 5/49 251tk   ❌ 5/49 307tk   ❌ 5/49 463tk   ❌ 5/49 383tk    ❓ N/A         ❓ N/A
anthropic/claude-3-haiku-20240307              ✅ 34/34 388tk  ✅ 34/34 375tk  ✅ 34/34 432tk  ❌ 6/34 781tk   🔶 24/34 903tk   🔶 3/4 670tk   ✅ 4/4 1535tk
gemini/gemini-1.5-flash-latest                 ❌ 0/42 28tk    ❌ 0/42 28tk    ❌ 0/42 28tk    ❌ 0/42 28tk    ❌ 0/42 28tk     ❓ N/A         ❓ N/A
gemini/gemini-1.5-flash-latest@markdown        ❌ 0/12 29tk    ❌ 0/12 29tk    ❌ 0/12 29tk    ❌ 0/12 29tk    ❌ 0/12 29tk     ❓ N/A         ❓ N/A
gemini/gemini-1.5-flash-latest@tool            ❌ 0/12 29tk    ❌ 0/12 29tk    ❌ 0/12 29tk    ❌ 0/12 29tk    ❌ 0/12 29tk     ❓ N/A         ❓ N/A
openai/gpt-4-turbo                             ✅ 3/3 255tk    ✅ 3/3 312tk    ✅ 3/3 376tk    ✅ 3/3 527tk    ✅ 4/4 590tk     ✅ 4/4 784tk   ✅ 6/7 819tk
openai/gpt-4o                                  ✅ 90/91 325tk  ✅ 90/91 313tk  🔶 65/91 337tk  🔶 35/91 441tk  🔶 67/91 626tk   ✅ 5/5 663tk   ✅ 5/5 1253tk
openai/gpt-4o-mini                             🔶 60/92 269tk  ✅ 91/92 321tk  ✅ 91/92 377tk  🔶 62/92 510tk  ✅ 80/92 759tk   ✅ 6/6 813tk   ✅ 6/6 951tk
openai/gpt-4o-mini@markdown                    ❌ 1/12 217tk   ✅ 12/12 302tk  ✅ 12/12 361tk  🔶 3/12 440tk   ✅ 12/12 823tk   ❓ N/A         ❓ N/A
openai/gpt-4o-mini@tool                        ✅ 52/61 269tk  ✅ 61/61 308tk  ✅ 61/61 383tk  ✅ 49/61 552tk  ✅ 61/61 657tk   ❓ N/A         ❓ N/A
openai/gpt-4o@markdown                         ✅ 60/61 266tk  ✅ 60/61 302tk  ✅ 59/61 358tk  ❌ 0/61 420tk   🔶 15/61 413tk   ❓ N/A         ❓ N/A
openai/gpt-4o@tool                             🔶 37/61 264tk  ✅ 61/61 303tk  ✅ 52/61 363tk  ✅ 59/61 653tk  ✅ 61/61 650tk   ❓ N/A         ❓ N/A
openai/gpt-4o@xml                              ❌ 5/49 202tk   ❌ 5/49 246tk   ❌ 4/49 269tk   ❌ 0/49 427tk   ❌ 2/49 295tk    ❓ N/A         ❓ N/A
openai/o1-mini                                 ✅ 3/3 354tk    ✅ 3/3 431tk    ✅ 3/3 460tk    🔶 2/3 567tk    🔶 3/5 2222tk    🔶 1/5 1412tk  🔶 2/3 813tk
openai/o1-preview                              ✅ 2/2 308tk    ✅ 2/2 570tk    🔶 1/2 549tk    ✅ 2/2 490tk    ✅ 3/3 823tk     ✅ 1/1 656tk   ✅ 1/1 1998tk
google/gemini-flash-1.5                        ✅ 2/2 225tk    ✅ 2/2 401tk    ✅ 2/2 430tk    ❌ 0/2 296tk    ✅ 1/1 686tk     ❌ 0/1 661tk   ✅ 1/1 1014tk
google/gemini-pro-1.5                          ✅ 1/1 341tk    ✅ 1/1 419tk    ✅ 1/1 456tk    ✅ 1/1 676tk    🔶 2/3 431tk     🔶 1/2 1016tk  ✅ 2/2 1308tk
google/gemma-2-27b-it                          ✅ 1/1 288tk    ✅ 1/1 384tk    ✅ 1/1 446tk    ✅ 1/1 714tk    ✅ 1/1 570tk     ❌ 0/1 535tk   ❌ 0/1 235tk
google/gemma-2-9b-it                           ❌ 0/2 186tk    ✅ 2/2 370tk    ✅ 2/2 368tk    ❌ 0/2 545tk    ✅ 1/1 492tk     ❌ 0/1 1730tk  ❌ 0/1 352tk
meta-llama/llama-3.1-405b-instruct             🔶 69/91 244tk  ✅ 74/91 417tk  ✅ 73/91 364tk  🔶 71/91 440tk  🔶 60/91 488tk   🔶 2/5 255tk   ❌ 0/5 85tk
meta-llama/llama-3.1-405b-instruct@markdown    ✅ 11/12 318tk  ✅ 10/12 269tk  ✅ 10/12 366tk  🔶 8/12 516tk   ✅ 11/12 600tk   ❓ N/A         ❓ N/A
meta-llama/llama-3.1-405b-instruct@tool        ❌ 0/12 184tk   ❌ 0/12 198tk   ❌ 0/12 214tk   ❌ 0/12 190tk   ❌ 0/12 189tk    ❓ N/A         ❓ N/A
meta-llama/llama-3.1-70b-instruct              ✅ 5/6 367tk    ✅ 5/6 424tk    ✅ 6/6 452tk    🔶 2/6 546tk    ✅ 5/6 813tk     🔶 3/4 682tk   🔶 2/3 1461tk
meta-llama/llama-3.1-70b-instruct@xml          ❌ 5/49 220tk   ❌ 5/49 249tk   ❌ 5/49 298tk   ❌ 0/49 284tk   ❌ 0/49 309tk    ❓ N/A         ❓ N/A
meta-llama/llama-3.1-8b-instruct               ✅ 1/1 277tk    ✅ 1/1 441tk    ❌ 0/1 400tk    ❌ 0/1 5095tk   ✅ 1/1 2266tk    ❓ N/A         ❓ N/A
meta-llama/llama-3.2-11b-vision-instruct       ✅ 2/2 352tk    ✅ 2/2 493tk    ❌ 0/2 479tk    ✅ 2/2 2643tk   ❓ N/A           ❓ N/A         ❓ N/A
meta-llama/llama-3.2-90b-vision-instruct       🔶 2/4 237tk    🔶 2/4 288tk    🔶 3/4 336tk    🔶 1/4 233tk    ❓ N/A           ❓ N/A         ❓ N/A
nousresearch/hermes-2-pro-llama-3-8b           ✅ 1/1 341tk    ❌ 0/1 4274tk   ❌ 0/1 3760tk   ❌ 0/1 659tk    ❓ N/A           ❓ N/A         ❓ N/A
nousresearch/hermes-3-llama-3.1-405b           ✅ 2/2 317tk    ✅ 2/2 420tk    ✅ 2/2 325tk    ✅ 2/2 410tk    ✅ 1/1 821tk     ✅ 1/1 758tk   ✅ 1/1 1039tk
nousresearch/hermes-3-llama-3.1-70b            ❌ 0/2 173tk    ❌ 0/2 187tk    ❌ 0/2 202tk    ❌ 0/2 177tk    ❓ N/A           ❓ N/A         ❓ N/A
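To compare rows numerically, the cells in the table above can be parsed into pass rates. The sketch below is illustrative only and is not part of gptme: `CellResult` and `parse_cell` are hypothetical names, and it assumes each cell reads as passed runs, total runs, then a token count (for example `55/57 473tk`), with `N/A` meaning the eval was not run. The exact meaning of the token figure is not stated here, so the sketch only extracts it.

```python
import re
from typing import NamedTuple, Optional


class CellResult(NamedTuple):
    """One parsed table cell (hypothetical structure, for illustration)."""

    passed: int
    total: int
    tokens: int

    @property
    def pass_rate(self) -> float:
        # Guard against a zero total so the property never raises.
        return self.passed / self.total if self.total else 0.0


# Assumed cell layout: "<status emoji> <passed>/<total> <tokens>tk".
# The status emoji is ignored; only the numbers are captured.
_CELL_RE = re.compile(r"(\d+)/(\d+)\s+(\d+)tk")


def parse_cell(cell: str) -> Optional[CellResult]:
    """Parse a cell such as '✅ 55/57 473tk'; return None for 'N/A' cells."""
    match = _CELL_RE.search(cell)
    if match is None:
        return None
    passed, total, tokens = (int(group) for group in match.groups())
    return CellResult(passed, total, tokens)
```

For example, `parse_cell("🔶 35/91 441tk").pass_rate` evaluates to 35/91, and `parse_cell("❓ N/A")` returns `None`.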
We are working on making the evals more robust, informative, and challenging.
Other evals#
We have considered running gptme on other evals such as SWE-Bench, but that work is not yet finished (see PR #142).
If you are interested in running gptme on other evals, drop a comment in the issues!