Choosing the best GPT model: benchmark on your own data
This guide walks you through comparing OpenAI's GPT-4o and GPT-4o-mini, two of its most capable and cost-effective GPT models. The testing framework lets you measure each model's reasoning ability, cost, and latency side by side.
New model releases often score well on benchmarks. But generic benchmarks are for generic use cases. If you're building an LLM app, you should evaluate these models on your own data and make an informed decision based on your specific needs.
The end result is a side-by-side comparison of how each model handles your test cases.
Prerequisites
To start, make sure you have:
- promptfoo CLI installed. If not, refer to the installation guide.
- An active OpenAI API key set as the `OPENAI_API_KEY` environment variable. See OpenAI configuration for details.
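For example, on macOS or Linux you can export the key for the current shell session (replace the placeholder with your own key):

```sh
export OPENAI_API_KEY=your-api-key-here
```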
Step 1: Setup
Create a dedicated directory for your comparison project:
```sh
npx promptfoo@latest init gpt-comparison
```
Edit `promptfooconfig.yaml` to include both models:
```yaml
providers:
  - openai:gpt-4o-mini
  - openai:gpt-4o
```
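By default, each provider runs with its standard settings. For a fairer comparison you may want to pin sampling parameters on both models. Here's a minimal sketch using promptfoo's per-provider `config` block; the `temperature` and `max_tokens` values are illustrative choices, not requirements:

```yaml
providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0    # make outputs more deterministic
      max_tokens: 256   # cap response length for both models
  - id: openai:gpt-4o
    config:
      temperature: 0
      max_tokens: 256
```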
Step 2: Crafting the prompts
For our comparison, we'll use a simple prompt:
```yaml
prompts:
  - 'Solve this riddle: {{riddle}}'
```
Feel free to add multiple prompts and tailor them to your use case, as sketched below.
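For instance, you could test whether asking for step-by-step reasoning changes the outcome. The second prompt here is a hypothetical variant, not part of the original setup:

```yaml
prompts:
  - 'Solve this riddle: {{riddle}}'
  - 'Solve this riddle. Think step by step, then give your final answer: {{riddle}}'
```

Every prompt is run against every test case, so adding a variant doubles the number of comparisons.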
Step 3: Create test cases
Above, we have a `{{riddle}}` placeholder variable. Each test case runs the prompts with a different riddle:
```yaml
tests:
  - vars:
      riddle: 'I speak without a mouth and hear without ears. I have no body, but I come alive with wind. What am I?'
  - vars:
      riddle: 'You see a boat filled with people. It has not sunk, but when you look again you don’t see a single person on the boat. Why?'
  - vars:
      riddle: 'The more of this there is, the less you see. What is it?'
```
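Test cases can also grade outputs automatically instead of leaving the judgment entirely to you. As one sketch, promptfoo's `icontains` assertion does a case-insensitive substring check; the first riddle's expected answer is "an echo", so you could assert on that:

```yaml
tests:
  - vars:
      riddle: 'I speak without a mouth and hear without ears. I have no body, but I come alive with wind. What am I?'
    assert:
      - type: icontains
        value: echo
```

With assertions in place, the results show a pass/fail for each model rather than just raw completions.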
Step 4: Run the comparison
Execute the comparison with the following command:
```sh
npx promptfoo@latest eval
```
This will process the riddles against both GPT-4o and GPT-4o-mini, providing you with side-by-side results in your command line interface. To open the interactive web viewer, run:

```sh
npx promptfoo@latest view
```
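Since cost and latency matter as much as accuracy in this comparison, you can also fail test cases that are too slow or too expensive. Here's a sketch using promptfoo's `defaultTest` block with its `latency` and `cost` assertion types; the thresholds (3000 ms and $0.001 per request) are arbitrary examples you should tune to your own budget:

```yaml
defaultTest:
  assert:
    - type: latency
      threshold: 3000   # fail if a response takes longer than 3 seconds
    - type: cost
      threshold: 0.001  # fail if inference costs more than $0.001
```

Note that if results are served from promptfoo's cache from a previous run, latency numbers may not reflect real API calls; running `npx promptfoo@latest eval --no-cache` is one way around that.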