# Chat conversations / threads
The prompt file supports messages in OpenAI's JSON prompt format. This allows you to set multiple messages, including the system prompt. For example:
```json
[
  { "role": "system", "content": "You are a helpful assistant." },
  { "role": "user", "content": "Who won the world series in {{ year }}?" }
]
```
Equivalent YAML is also supported:
```yaml
- role: system
  content: You are a helpful assistant.
- role: user
  content: Who won the world series in {{ year }}?
```
## Multishot conversations
Most providers support full "multishot" chat conversations, including multiple assistant, user, and system prompts.
One way to do this, if you are using the OpenAI format, is by creating a list of `{role, content}` objects. Here's an example:
```yaml
prompts:
  - file://prompt.json

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      messages:
        - role: system
          content: Respond as a pirate
        - role: user
          content: Who founded Facebook?
        - role: assistant
          content: Mark Zuckerberg
        - role: user
          content: Did he found any other companies?
```
Then the prompt itself is just a JSON dump of `messages`:

```
{{ messages | dump }}
```
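With the test vars above, this renders to the full message array (shown here formatted for readability; `dump` emits compact JSON):

```json
[
  { "role": "system", "content": "Respond as a pirate" },
  { "role": "user", "content": "Who founded Facebook?" },
  { "role": "assistant", "content": "Mark Zuckerberg" },
  { "role": "user", "content": "Did he found any other companies?" }
]
```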
## Simplified chat markup
Alternatively, you may prefer to specify a list of `role: message` pairs, like this:
```yaml
tests:
  - vars:
      messages:
        - user: Who founded Facebook?
        - assistant: Mark Zuckerberg
        - user: Did he found any other companies?
```
This simplifies the config, but we need to work some magic in the prompt template:
```
[
  {% for message in messages %}
    {% set outer_loop = loop %}
    {% for role, content in message %}
      {
        "role": "{{ role }}",
        "content": "{{ content }}"
      }{% if not (loop.last and outer_loop.last) %},{% endif %}
    {% endfor %}
  {% endfor %}
]
```
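With the vars above, the template renders to (whitespace aside):

```json
[
  { "role": "user", "content": "Who founded Facebook?" },
  { "role": "assistant", "content": "Mark Zuckerberg" },
  { "role": "user", "content": "Did he found any other companies?" }
]
```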
## Creating a conversation history fixture
Using nunjucks templates, we can combine multiple chat messages. Here's an example in which the previous conversation is a fixture for all tests. Each case tests a different follow-up message:
```yaml
# Set up the conversation history
defaultTest:
  vars:
    system_message: Answer concisely
    messages:
      - user: Who founded Facebook?
      - assistant: Mark Zuckerberg
      - user: What's his favorite food?
      - assistant: Pizza

# Test multiple follow-ups
tests:
  - vars:
      question: Did he create any other companies?
  - vars:
      question: What is his role at Internet.org?
  - vars:
      question: Will he let me borrow $5?
```
In the prompt template, we construct the conversation history followed by a user message containing the `question`:
```
[
  {
    "role": "system",
    "content": {{ system_message | dump }}
  },
  {% for message in messages %}
    {% for role, content in message %}
      {
        "role": "{{ role }}",
        "content": {{ content | dump }}
      },
    {% endfor %}
  {% endfor %}
  {
    "role": "user",
    "content": {{ question | dump }}
  }
]
```
Variables containing multiple lines and quotes are automatically escaped in JSON prompt files. If the file is not valid JSON (such as in the case above, due to the nunjucks `{% for %}` loops), use the built-in nunjucks filter `dump` to stringify the object as JSON.
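For example, if `question` were the string `He said "hi"` (a hypothetical value), then:

```
{{ question | dump }}
```

renders as `"He said \"hi\""`, a properly escaped JSON string that can be embedded directly in the prompt.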
## Using the `_conversation` variable

A built-in `_conversation` variable contains the full prompt and previous turns of a conversation. Use it to reference previous outputs and test an ongoing chat conversation.

The `_conversation` variable has the following type signature:
```typescript
type Completion = {
  prompt: string | object;
  input: string;
  output: string;
};

type Conversation = Completion[];
```
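For instance, after the first turn of the chat example below, `_conversation` might look something like this (the `output` value is an illustrative model answer):

```json
[
  {
    "prompt": [{ "role": "user", "content": "Who founded Facebook?" }],
    "input": "Who founded Facebook?",
    "output": "Mark Zuckerberg"
  }
]
```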
In most cases, you'll loop through the `_conversation` variable and use each `Completion` object.
Use `completion.prompt` to reference the previous conversation. For example, to get the number of messages in a chat-formatted prompt:

```
{{ completion.prompt.length }}
```
Or to get the first message in the conversation:

```
{{ completion.prompt[0] }}
```
Use `completion.input` as a shortcut to get the last user message. In a chat-formatted prompt, `input` is set to the last user message, equivalent to `completion.prompt[completion.prompt.length - 1].content`.
Here's an example test config. Note how each question assumes context from the previous output:
```yaml
tests:
  - vars:
      question: Who founded Facebook?
  - vars:
      question: Where does he live?
  - vars:
      question: Which state is that in?
```
Here is the corresponding prompt:
```
[
  {% for completion in _conversation %}
    {
      "role": "user",
      "content": "{{ completion.input }}"
    },
    {
      "role": "assistant",
      "content": "{{ completion.output }}"
    },
  {% endfor %}
  {
    "role": "user",
    "content": "{{ question }}"
  }
]
```
The prompt inserts the previous conversation into the test case, creating a full turn-by-turn conversation:
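By the third test case, the rendered prompt would look something like this (the assistant contents are illustrative model outputs):

```json
[
  { "role": "user", "content": "Who founded Facebook?" },
  { "role": "assistant", "content": "Mark Zuckerberg" },
  { "role": "user", "content": "Where does he live?" },
  { "role": "assistant", "content": "Palo Alto, California" },
  { "role": "user", "content": "Which state is that in?" }
]
```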
Try it yourself by using the full example config.
When the `_conversation` variable is present, the eval will run single-threaded (concurrency of 1).
## Separating Chat Conversations
When running multiple test files or test sequences, you may want to maintain separate conversation histories in the same eval run. This can be achieved by adding a `conversationId` to the test metadata:
```yaml
# test1.yaml
- vars:
    question: 'Who founded Facebook?'
  metadata:
    conversationId: 'conversation1'
- vars:
    question: 'Where does he live?'
  metadata:
    conversationId: 'conversation1'
```

```yaml
# test2.yaml
- vars:
    question: 'Where is Yosemite National Park?'
  metadata:
    conversationId: 'conversation2'
- vars:
    question: 'What are good hikes there?'
  metadata:
    conversationId: 'conversation2'
```
Each unique `conversationId` maintains its own separate conversation history. If no `conversationId` is specified, all tests using the same provider and prompt will share a conversation history.
## Including JSON in prompt content
In some cases, you may want to send JSON within the OpenAI `content` field. In order to do this, you must ensure that the JSON is properly escaped.
Here's an example that prompts OpenAI with a JSON object of the structure `{query: string, history: {reply: string}[]}`. It first constructs this JSON object as the `input` variable. Then, it includes `input` in the prompt with proper JSON escaping:
```
{% set input %}
{
  "query": "{{ query }}",
  "history": [
    {% for completion in _conversation %}
      {"reply": "{{ completion.output }}"} {% if not loop.last %},{% endif %}
    {% endfor %}
  ]
}
{% endset %}

[{
  "role": "user",
  "content": {{ input | trim | dump }}
}]
```
Here's the associated config:
```yaml
prompts:
  - file://prompt.json

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      query: how you doing
  - vars:
      query: need help with my passport
```
This has the effect of including the conversation history within the prompt content. Here's what's sent to OpenAI for the first test case:
```json
[
  {
    "role": "user",
    "content": "{\n \"query\": \"how you doing\",\n \"history\": [\n \n ]\n}"
  }
]
```
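For the second test case, the shared conversation history now contains the first completion, so the payload includes one `reply` entry (the reply text here is illustrative):

```json
[
  {
    "role": "user",
    "content": "{\n \"query\": \"need help with my passport\",\n \"history\": [\n {\"reply\": \"I'm doing great, thanks for asking!\"}\n ]\n}"
  }
]
```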
## Using `storeOutputAs`

The `storeOutputAs` option makes it possible to reference previous outputs in multi-turn conversations. When set, it records the LLM output as a variable that can be used in subsequent chats.
Here's an example:
```yaml
prompts:
  - 'Respond to the user: {{message}}'

providers:
  - openai:gpt-4o

tests:
  - vars:
      message: "What's your favorite fruit? You must pick one. Output the name of a fruit only"
    options:
      storeOutputAs: favoriteFruit
  - vars:
      message: 'Why do you like {{favoriteFruit}} so much?'
    options:
      storeOutputAs: reason
  - vars:
      message: 'Write a snarky 2 sentence rebuttal to this argument for loving {{favoriteFruit}}: \"{{reason}}\"'
```
This creates `favoriteFruit` and `reason` vars on the fly, as the chatbot answers questions.
## Manipulating outputs with `transform`
Outputs can be modified before storage using the `transform` property:
```yaml
tests:
  - vars:
      message: "What's your favorite fruit? You must pick one. Output the name of a fruit only"
    options:
      storeOutputAs: favoriteFruit
      transform: output.split(' ')[0]
  - vars:
      message: 'Why do you like {{favoriteFruit}} so much?'
    options:
      storeOutputAs: reason
  - vars:
      message: 'Write a snarky 2 sentence rebuttal to this argument for loving {{favoriteFruit}}: \"{{reason}}\"'
```
Transforms can be JavaScript snippets, or they can be entire separate Python or JavaScript files. See the docs on transform.
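For example, the inline snippet above could live in its own JavaScript file and be referenced as `transform: file://getFruit.js`. A minimal sketch, assuming the file default-exports a `(output, context)` function (the filename is hypothetical; check the transform docs for the exact signature):

```javascript
// getFruit.js (hypothetical filename)
// Receives the raw LLM output; the return value is what gets stored.
module.exports = (output, context) => {
  // Keep only the first word, stripping trailing punctuation, e.g. "Mango." -> "Mango"
  return output.split(' ')[0].replace(/[.,!]/g, '');
};
```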