How to evaluate OpenAI Assistants
OpenAI recently released an Assistants API that offers simplified handling for message state and tool usage. It also enables code interpreter and knowledge retrieval features, abstracting away some of the dirty work of implementing a RAG architecture.
Test-driven development allows you to compare prompts, models, and tools while measuring improvement and avoiding unexplained regressions. It's an example of systematic iteration vs. trial and error.
This guide walks you through using promptfoo to select the best prompt, model, and tools using OpenAI's Assistants API. It assumes that you've already set up promptfoo.
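If you haven't set it up yet, a quick way to start (assuming Node.js is installed) is to scaffold a config and export your OpenAI API key:
npx promptfoo@latest init
export OPENAI_API_KEY=your-api-key
The init command creates a starter promptfooconfig.yaml in the current directory, which we'll overwrite in Step 2.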
Step 1: Create an assistant
Use the OpenAI playground to create an assistant. The eval will use this assistant with different instructions and models.
Add your desired functions and enable the code interpreter and retrieval as desired.
After you create an assistant, record its ID. It will look similar to asst_fEhNN3MClMamLfKLkIaoIpgB.
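If you'd rather script this step than use the playground, the same assistant can be created directly through the Assistants API. Here's a rough sketch with curl, where the name, instructions, and tools are placeholders and the OpenAI-Beta header value depends on the Assistants API version you're targeting:
curl https://api.openai.com/v1/assistants \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -H "OpenAI-Beta: assistants=v2" \
  -d '{
    "model": "gpt-4o",
    "name": "Eval demo assistant",
    "instructions": "You are a helpful assistant.",
    "tools": [{"type": "code_interpreter"}]
  }'
The JSON response contains the asst_... ID you'll reference below.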
Step 2: Set up the eval
An eval config has a few key components:
- prompts: The user chat messages you want to test
- providers: The assistant(s) and/or LLM APIs you want to test
- tests: Individual test cases to try
Let's set up a basic promptfooconfig.yaml:
prompts:
  - 'Help me out with this: {{message}}'
providers:
  - openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
tests:
  - vars:
      message: write a tweet about bananas
  - vars:
      message: what is the sum of 38.293 and the square root of 30300300
  - vars:
      message: reverse the string "all dogs go to heaven"
Step 3: Run the eval
Now that we've set up the config, run the eval on your command line:
npx promptfoo@latest eval
This will produce a simple view of assistant outputs. Note that it records the conversation, as well as code interpreter, function, and retrieval inputs and outputs.
This is a basic view, but now we're ready to actually get serious with our eval. In the next sections, we'll learn how to compare different assistants or different versions of the same assistant.
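At any point, you can also open the results in promptfoo's local web viewer, which makes side-by-side comparisons easier to scan:
npx promptfoo@latest view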
Comparing multiple assistants
To compare different assistants, reference them in the providers section of your promptfooconfig.yaml. For example:
providers:
  - openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
  - openai:assistant:asst_another_assistant_id_123
This will run the same tests on both assistants and allow you to compare their performance.
Comparing different versions of the same assistant
If you want to override the configuration of an assistant (for example, to try a different model or different instructions), you can do so in the config block of the provider. For example:
providers:
  - id: openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
    config:
      model: gpt-4o
      instructions: 'Enter a replacement for system-level instructions here'
      tools:
        - type: code_interpreter
        - type: retrieval
      thread:
        messages:
          - role: user
            content: 'These messages are included in every test case before the prompt.'
          - role: assistant
            content: 'Okay'
In this example, the Assistants API is called with the parameters above, overriding the assistant's saved settings.
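The tools list can also carry your own function definitions. A minimal sketch follows; the get_weather function and its parameter schema are hypothetical, and it assumes tool definitions are passed through in the same format the OpenAI Assistants API expects:
providers:
  - id: openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
    config:
      tools:
        - type: function
          function:
            name: get_weather
            description: Look up the current weather for a city
            parameters:
              type: object
              properties:
                city:
                  type: string
              required: ['city']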
Here's an example that compares the saved Assistant settings against new potential settings:
providers:
  # Original
  - id: openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
  # Modified
  - id: openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
    config:
      model: gpt-4o
      instructions: 'Always talk like a pirate'
This eval will test both versions of the Assistant and display the results side-by-side.
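Since both providers reference the same assistant ID, it helps to give each one a label so the two columns are easy to tell apart in the results. The label values below are arbitrary:
providers:
  # Original
  - id: openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
    label: assistant-original
  # Modified
  - id: openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
    label: assistant-pirate
    config:
      model: gpt-4o
      instructions: 'Always talk like a pirate'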
Adding metrics and assertions
Metrics and assertions allow you to automatically evaluate the performance of your assistants. You can add them in the assert section of a test. For example:
tests:
  - vars:
      message: write a tweet about bananas
    assert:
      - type: contains
        value: 'banana'
      - type: similar
        value: 'I love bananas!'
        threshold: 0.6
In this example, the contains assertion checks if the assistant's response contains the word 'banana'. The similar assertion checks if the assistant's response is semantically similar to 'I love bananas!' with a cosine similarity threshold of 0.6.
There are many different assertions to consider, ranging from simple metrics (such as string matching) to complex metrics (such as model-graded evaluations). I strongly encourage you to set up assertions that are tailored to your use case.
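For instance, a model-graded rubric can check qualities that string matching can't capture. Here's a sketch using promptfoo's llm-rubric assertion type, where the rubric text is just an illustration:
tests:
  - vars:
      message: write a tweet about bananas
    assert:
      - type: llm-rubric
        value: 'Is short enough to fit in a tweet and has an upbeat, playful tone'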
Based on these assertions, promptfoo will automatically score the different versions of your assistants, so that you can pick the top performing one.
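To keep a record of each run while you iterate, you can also write results to a file with the eval command's output flag; .html, .json, and .csv outputs are supported:
npx promptfoo@latest eval -o results.html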
Next steps
Now that you've got a basic eval set up, you may also be interested in specific techniques for evaluating retrieval agents.