How to evaluate OpenAI Assistants

OpenAI recently released an Assistants API that offers simplified handling of message state and tool usage. It also enables code interpreter and knowledge retrieval features, abstracting away some of the dirty work of implementing a RAG architecture.

Test-driven development allows you to compare prompts, models, and tools while measuring improvement and avoiding unexplained regressions. It's an example of systematic iteration vs. trial and error.

This guide walks you through using promptfoo to select the best prompt, model, and tools using OpenAI's Assistants API. It assumes that you've already set up promptfoo.
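
If you haven't set promptfoo up yet, you can typically bootstrap a project with the CLI, which scaffolds a starter promptfooconfig.yaml for you:

npx promptfoo@latest init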

Step 1: Create an assistant

Use the OpenAI playground to create an assistant. The eval will use this assistant with different instructions and models.

Add any functions you need, and enable the code interpreter and retrieval tools if desired.

After you create an assistant, record its ID. It will look similar to asst_fEhNN3MClMamLfKLkIaoIpgB.

Step 2: Set up the eval

An eval config has a few key components:

  • prompts: The user chat messages you want to test
  • providers: The assistant(s) and/or LLM APIs you want to test
  • tests: Individual test cases to try

Let's set up a basic promptfooconfig.yaml:

prompts:
  - 'Help me out with this: {{message}}'
providers:
  - openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
tests:
  - vars:
      message: write a tweet about bananas
  - vars:
      message: what is the sum of 38.293 and the square root of 30300300
  - vars:
      message: reverse the string "all dogs go to heaven"

Step 3: Run the eval

Now that we've set up the config, run the eval on your command line:

npx promptfoo@latest eval

This will produce a simple view of assistant outputs. Note that it records the conversation, as well as code interpreter, function, and retrieval inputs and outputs:

[Screenshot: assistant eval results]

This is a basic view, but now we're ready to actually get serious with our eval. In the next sections, we'll learn how to compare different assistants or different versions of the same assistant.
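
If you'd rather browse results in a local web UI than in the terminal, promptfoo also includes a viewer you can launch after the eval completes:

npx promptfoo@latest view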

Comparing multiple assistants

To compare different assistants, reference them in the providers section of your promptfooconfig.yaml. For example:

providers:
  - openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
  - openai:assistant:asst_another_assistant_id_123

This will run the same tests on both assistants and allow you to compare their performance.
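
If the raw assistant IDs are hard to tell apart in the results, you can optionally give each provider a human-readable label (the label names below are just illustrative):

providers:
  - id: openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
    label: assistant-production
  - id: openai:assistant:asst_another_assistant_id_123
    label: assistant-experimental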

Comparing different versions of the same assistant

If you want to override the configuration of an assistant, you can do so in the config section of the provider. For example:

providers:
  - id: openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
    config:
      model: gpt-4-1106-preview
      instructions: 'Enter a replacement for system-level instructions here'
      tools:
        - type: code_interpreter
        - type: retrieval
      thread:
        messages:
          - role: user
            content: 'These messages are included in every test case before the prompt.'
          - role: assistant
            content: 'Okay'

In this example, the Assistants API is called with the parameters above, overriding the assistant's saved settings.

Here's an example that compares the saved Assistant settings against new potential settings:

providers:
  # Original
  - id: openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB

  # Modified
  - id: openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgB
    config:
      model: gpt-4-1106-preview
      instructions: 'Always talk like a pirate'

This eval will test both versions of the Assistant and display the results side-by-side.

Adding metrics and assertions

Metrics and assertions allow you to automatically evaluate the performance of your assistants. You can add them in the assert section of a test. For example:

tests:
  - vars:
      message: write a tweet about bananas
    assert:
      - type: contains
        value: 'banana'
      - type: similar
        value: 'I love bananas!'
        threshold: 0.6

In this example, the contains assertion checks if the assistant's response contains the word 'banana'. The similar assertion checks if the assistant's response is semantically similar to 'I love bananas!' with a cosine similarity threshold of 0.6.

There are many different assertions to consider, ranging from simple metrics (such as string matching) to complex metrics (such as model-graded evaluations). I strongly encourage you to set up assertions that are tailored to your use case.
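
For example, a model-graded check can be added with the llm-rubric assertion type, which asks an LLM grader to judge the output against criteria you write. Here's a minimal sketch; the rubric text is just an illustration:

tests:
  - vars:
      message: write a tweet about bananas
    assert:
      - type: llm-rubric
        value: 'Is short enough to fit in a tweet and portrays bananas positively'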

Based on these assertions, promptfoo will automatically score the different versions of your assistants, so that you can pick the top-performing one.
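
If you'd like those scores rolled up into named metrics across test cases, you can also tag each assertion with a metric label (the 'Accuracy' and 'Tone' names below are just illustrative):

tests:
  - vars:
      message: write a tweet about bananas
    assert:
      - type: contains
        value: 'banana'
        metric: Accuracy
      - type: similar
        value: 'I love bananas!'
        threshold: 0.6
        metric: Tone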

Next steps

Now that you've got a basic eval set up, you may also be interested in specific techniques for evaluating retrieval agents.