Model-graded metrics
promptfoo supports several types of model-graded assertions:
Output-based:
- `llm-rubric` - checks if the LLM output matches given requirements, using a language model to grade the output based on the rubric.
- `model-graded-closedqa` - similar to the above, a "criteria-checking" eval that ensures the answer meets a specific requirement. Uses an OpenAI-authored prompt from their public evals.
- `factuality` - a factual consistency eval which, given a completion A and reference answer B, evaluates whether A is a subset of B, A is a superset of B, A and B are equivalent, A and B disagree, or A and B differ but the difference doesn't matter from the perspective of factuality. It uses the prompt from OpenAI's public evals.
- `g-eval` - evaluates outputs using chain-of-thought prompting based on custom criteria, following the G-Eval framework.
- `answer-relevance` - ensures that the LLM output is related to the original query.
- `similar` - checks that the output is semantically similar to the expected value (uses an embedding model).
- `pi` - an alternative scoring approach that uses a dedicated model to evaluate inputs/outputs against criteria.
- `classifier` - see classifier grading docs.
- `moderation` - see moderation grading docs.
- `select-best` - compares outputs from multiple test cases and chooses a winner.
Context-based:
- `context-recall` - ensures that ground truth appears in context.
- `context-relevance` - ensures that context is relevant to the original query.
- `context-faithfulness` - ensures that LLM output is supported by context.
Context-based assertions are particularly useful for evaluating RAG systems. For complete RAG evaluation examples, see the RAG Evaluation Guide.
Examples (output-based)
Example of `llm-rubric` and/or `model-graded-closedqa`:
assert:
  - type: model-graded-closedqa # or llm-rubric
    # Make sure the LLM output adheres to this criteria:
    value: Is not apologetic
Example of factuality check:
assert:
  - type: factuality
    # Make sure the LLM output is consistent with this statement:
    value: Sacramento is the capital of California
Example of the `pi` scorer:
assert:
  - type: pi
    # Evaluate output based on this criteria:
    value: Is not apologetic and provides a clear, concise answer
    threshold: 0.8 # Requires a score of 0.8 or higher to pass
For more information on factuality, see the guide on LLM factuality.
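Assertions such as `similar` and `g-eval` follow the same pattern. A minimal sketch (the expected value, criteria, and threshold below are illustrative placeholders, not recommendations):

assert:
  # Semantic similarity against an expected answer (uses an embedding model)
  - type: similar
    value: Sacramento is the capital city of California
    threshold: 0.8
  # Chain-of-thought grading against custom criteria (G-Eval framework)
  - type: g-eval
    value: Is clearly structured and avoids unsupported claims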
Here's an example output that indicates PASS/FAIL based on LLM assessment (see example setup and outputs).
Using variables in the rubric
You can use test `vars` in the LLM rubric. This example uses the `question` variable to help detect hallucinations:
providers:
  - openai:gpt-4.1-mini
prompts:
  - file://prompt1.txt
  - file://prompt2.txt
defaultTest:
  assert:
    - type: llm-rubric
      value: 'Says that it is uncertain or unable to answer the question: "{{question}}"'
tests:
  - vars:
      question: What's the weather in New York?
  - vars:
      question: Who won the latest football match between the Giants and 49ers?
Examples (comparison)
The `select-best` assertion type is used to compare multiple outputs in the same TestCase row and select the one that best meets a specified criterion.
Here's an example of how to use `select-best` in a configuration file:
prompts:
  - 'Write a tweet about {{topic}}'
  - 'Write a very concise, funny tweet about {{topic}}'
providers:
  - openai:gpt-4
tests:
  - vars:
      topic: bananas
    assert:
      - type: select-best
        value: choose the funniest tweet
  - vars:
      topic: nyc
    assert:
      - type: select-best
        value: choose the tweet that contains the most facts
Overriding the LLM grader
By default, model-graded asserts use `gpt-4.1-2025-04-14` for grading. If you do not have access to `gpt-4.1-2025-04-14` or prefer not to use it, you can override the rubric grader. There are several ways to do this, depending on your preferred workflow:
- Using the `--grader` CLI option:

  promptfoo eval --grader openai:gpt-4.1-mini

- Using `test.options` or `defaultTest.options` on a per-test or testsuite basis:

  defaultTest:
    options:
      provider: openai:gpt-4.1-mini
  tests:
    - description: Use LLM to evaluate output
      assert:
        - type: llm-rubric
          value: Is spoken like a pirate

- Using `assertion.provider` on a per-assertion basis:

  tests:
    - description: Use LLM to evaluate output
      assert:
        - type: llm-rubric
          value: Is spoken like a pirate
          provider: openai:gpt-4.1-mini
Use the `provider.config` field to set custom parameters:
provider:
  - id: openai:gpt-4.1-mini
    config:
      temperature: 0
Note that custom providers are supported as well.
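For instance, a custom provider module can serve as the grader (a sketch; the file path is hypothetical):

defaultTest:
  options:
    # Hypothetical path to a custom provider module used as the grader
    provider: file://./graders/customGrader.js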
Multiple graders
Some assertions (such as `answer-relevance`) use multiple types of providers. To override both the embedding and text providers separately, you can do something like this:
defaultTest:
  options:
    provider:
      text:
        id: azureopenai:chat:gpt-4-deployment
        config:
          apiHost: xxx.openai.azure.com
      embedding:
        id: azureopenai:embeddings:text-embedding-ada-002-deployment
        config:
          apiHost: xxx.openai.azure.com
If you are implementing a custom provider, text providers require a `callApi` function that returns a `ProviderResponse`, whereas embedding providers require a `callEmbeddingApi` function that returns a `ProviderEmbeddingResponse`.
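If you need a skeleton to start from, the following sketch shows the two shapes (assuming a JavaScript/TypeScript custom provider module; the class names, file name, and return values are illustrative):

// customGraders.ts - minimal shapes for custom grading providers (illustrative)

// Text grader: implements callApi and returns a ProviderResponse-like object
class MyTextGrader {
  id() {
    return 'my-text-grader';
  }
  async callApi(prompt: string): Promise<{ output: string }> {
    // Call your own model or service here; the grading prompt arrives as `prompt`
    return { output: '{"pass": true, "score": 1, "reason": "Looks good"}' };
  }
}

// Embedding grader: implements callEmbeddingApi and returns an embedding vector
class MyEmbeddingGrader {
  id() {
    return 'my-embedding-grader';
  }
  async callEmbeddingApi(text: string): Promise<{ embedding: number[] }> {
    // Call your embedding model here
    return { embedding: [0.1, 0.2, 0.3] };
  }
}

export { MyTextGrader, MyEmbeddingGrader };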
Overriding the rubric prompt
For the greatest control over the output of `llm-rubric`, you may set a custom prompt using the `rubricPrompt` property of `TestCase` or `Assertion`.
The rubric prompt has two built-in variables that you may use:
- `{{output}}` - The output of the LLM (you probably want to use this)
- `{{rubric}}` - The `value` of the `llm-rubric` assert object
When `{{output}}` or `{{rubric}}` contain objects, they are automatically converted to JSON strings by default to prevent display issues. To access object properties directly (e.g., `{{output.text}}`), enable object property access:
export PROMPTFOO_DISABLE_OBJECT_STRINGIFY=true
promptfoo eval
For details, see the object template handling guide.
In this example, we set `rubricPrompt` under `defaultTest`, which applies it to every test in this test suite:
defaultTest:
  options:
    rubricPrompt: >
      [
        {
          "role": "system",
          "content": "Grade the output by the following specifications, keeping track of the points scored:\n\nDid the output mention {{x}}? +1 point\nDid the output describe {{y}}? +1 point\nDid the output ask to clarify {{z}}? +1 point\n\nCalculate the score but always pass the test. Output your response in the following JSON format:\n{pass: true, score: number, reason: string}"
        },
        {
          "role": "user",
          "content": "Output: {{ output }}"
        }
      ]
See the full example.
Image-based rubric prompts
`llm-rubric` can also grade responses that reference images. Provide a `rubricPrompt` in OpenAI chat format that includes an image and use a vision-capable provider such as `openai:gpt-4.1`.
defaultTest:
  options:
    provider: openai:gpt-4.1
    rubricPrompt: |
      [
        { "role": "system", "content": "Evaluate if the answer matches the image. Respond with JSON {reason:string, pass:boolean, score:number}" },
        {
          "role": "user",
          "content": [
            { "type": "image_url", "image_url": { "url": "{{image_url}}" } },
            { "type": "text", "text": "Output: {{ output }}\nRubric: {{ rubric }}" }
          ]
        }
      ]
select-best rubric prompt
For control over the `select-best` rubric prompt, you may use the variables `{{outputs}}` (a list of strings) and `{{criteria}}` (a string). It expects the LLM output to contain the index of the winning output.
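As a sketch, a custom prompt might iterate over the candidates like this (assuming the default nunjucks templating; the wording and indexing convention below are illustrative and should match how your grader reports the winning index):

defaultTest:
  options:
    rubricPrompt: |
      You are comparing candidate outputs against this criteria: {{ criteria }}

      {% for o in outputs %}
      Candidate {{ loop.index0 }}: {{ o }}
      {% endfor %}

      Respond with only the index of the best candidate.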
Classifiers
Classifiers can be used to detect tone, bias, toxicity, helpfulness, and much more. See classifier documentation.
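For instance, a classifier-based assertion might look like the following sketch (the HuggingFace model and expected label are assumptions; see the classifier docs for supported setups):

assert:
  - type: classifier
    # Hypothetical HuggingFace text-classification model and expected label
    provider: huggingface:text-classification:facebook/roberta-hate-speech-dynabench-r4-target
    value: nothate
    threshold: 0.9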
Context-based
Context-based assertions are a special class of model-graded assertions that evaluate whether the LLM's output is supported by context provided at inference time. They are particularly useful for evaluating RAG systems.
- `context-recall` - ensures that ground truth appears in context.
- `context-relevance` - ensures that context is relevant to the original query.
- `context-faithfulness` - ensures that LLM output is supported by context.
Defining context
Context can be defined in one of two ways: statically using test case variables or dynamically from the provider's response.
Statically via test variables
Set `context` as a variable in your test case:
tests:
  - vars:
      context: 'Paris is the capital of France. It has a population of over 2 million people.'
    assert:
      - type: context-recall
        value: 'Paris is the capital of France'
        threshold: 0.8
Dynamically via Context Transform
Defining `contextTransform` allows you to construct context from provider responses. This is particularly useful for RAG systems.
assert:
  - type: context-faithfulness
    contextTransform: 'output.citations.join("\n")'
    threshold: 0.8
The `contextTransform` property accepts a stringified JavaScript expression that receives two arguments, `output` and `context`, and must return a non-empty string.
/**
 * The context transform function signature.
 */
type ContextTransform = (output: Output, context: Context) => string;

/**
 * The provider's response output.
 */
type Output = string | object;

/**
 * Metadata about the test case, prompt, and provider response.
 */
type Context = {
  // Test case variables
  vars: Record<string, string | object>;
  // Raw prompt sent to LLM
  prompt: {
    label: string;
  };
  // Provider-specific metadata.
  // The documentation for each provider will describe any available metadata.
  metadata?: object;
};
For example, given the following provider response:
/**
 * A response from a fictional Research Knowledge Base.
 */
type ProviderResponse = {
  output: {
    content: string;
  };
  metadata: {
    retrieved_docs: {
      content: string;
    }[];
  };
};
assert:
  - type: context-faithfulness
    contextTransform: 'output.content'
    threshold: 0.8
  - type: context-relevance
    # Note: `ProviderResponse['metadata']` is accessible as `context.metadata`
    contextTransform: 'context.metadata.retrieved_docs.map(d => d.content).join("\n")'
    threshold: 0.7
If your expression might return `undefined` or `null`, for example because no context is available, add a fallback:

contextTransform: 'output.context ?? "No context found"'
If you expect your context to be non-empty but it comes back empty, you can debug your provider response by returning a stringified version of the response:

contextTransform: 'JSON.stringify(output, null, 2)'
Examples
Context-based metrics require a `query` and context. You must also set the `threshold` property on your test (all scores are normalized between 0 and 1).
Here's an example config using statically defined context (`test.vars.context`):
prompts:
  - |
    You are an internal corporate chatbot.
    Respond to this query: {{query}}
    Here is some context that you can use to write your response: {{context}}
providers:
  - openai:gpt-4
tests:
  - vars:
      query: What is the max purchase that doesn't require approval?
      context: file://docs/reimbursement.md
    assert:
      - type: contains
        value: '$500'
      - type: factuality
        value: the employee's manager is responsible for approvals
      - type: answer-relevance
        threshold: 0.9
      - type: context-recall
        threshold: 0.9
        value: max purchase price without approval is $500. Talk to Fred before submitting anything.
      - type: context-relevance
        threshold: 0.9
      - type: context-faithfulness
        threshold: 0.9
  - vars:
      query: How many weeks is maternity leave?
      context: file://docs/maternity.md
    assert:
      - type: factuality
        value: maternity leave is 4 months
      - type: answer-relevance
        threshold: 0.9
      - type: context-recall
        threshold: 0.9
        value: The company offers 4 months of maternity leave, unless you are an elephant, in which case you get 22 months of maternity leave.
      - type: context-relevance
        threshold: 0.9
      - type: context-faithfulness
        threshold: 0.9
Alternatively, if your system returns context in the response, as in a RAG system, you can use `contextTransform`:
prompts:
  - |
    You are an internal corporate chatbot.
    Respond to this query: {{query}}
providers:
  - openai:gpt-4
tests:
  - vars:
      query: What is the max purchase that doesn't require approval?
    assert:
      - type: context-recall
        contextTransform: 'output.context'
        threshold: 0.9
        value: max purchase price without approval is $500
      - type: context-relevance
        contextTransform: 'output.context'
        threshold: 0.9
      - type: context-faithfulness
        contextTransform: 'output.context'
        threshold: 0.9
Other assertion types
For more info on assertions, see Test assertions.