Deterministic metrics

These metrics are created by logical tests that are run on LLM output.

Assertion Type	Returns true if...
contains	output contains substring
contains-all	output contains all list of substrings
contains-any	output contains any of the listed substrings
contains-json	output contains valid json (optional json schema validation)
contains-sql	output contains valid sql
contains-xml	output contains valid xml
cost	Inference cost is below a threshold
equals	output matches exactly
icontains	output contains substring, case insensitive
icontains-all	output contains all list of substrings, case insensitive
icontains-any	output contains any of the listed substrings, case insensitive
is-json	output is valid json (optional json schema validation)
is-sql	output is valid SQL statement (optional authority list validation)
is-valid-openai-function-call	Ensure that the function call matches the function's JSON schema
is-valid-openai-tools-call	Ensure all tool calls match the tools JSON schema
is-xml	output is valid xml
javascript	provided Javascript function validates the output
latency	Latency is below a threshold (milliseconds)
levenshtein	Levenshtein distance is below a threshold
perplexity-score	Normalized perplexity
perplexity	Perplexity is below a threshold
python	provided Python function validates the output
regex	output matches regex
rouge-n	Rouge-N score is above a given threshold
starts-with	output starts with string
webhook	provided webhook returns {pass: true}

tip

Every test type can be negated by prepending not-. For example, not-equals or not-regex.

Assertion types

Contains

The contains assertion checks if the LLM output contains the expected value.

Example:

assert:
  - type: contains
    value: 'The expected substring'

The icontains is the same, except it ignores case:

assert:
  - type: icontains
    value: 'The expected substring'

Contains-All

The contains-all assertion checks if the LLM output contains all of the specified values.

Example:

assert:
  - type: contains-all
    value:
      - 'Value 1'
      - 'Value 2'
      - 'Value 3'

Contains-Any

The contains-any assertion checks if the LLM output contains at least one of the specified values.

Example:

assert:
  - type: contains-any
    value:
      - 'Value 1'
      - 'Value 2'
      - 'Value 3'

For case insensitive matching, use icontains-any.

For case insensitive matching, use icontains-all.

Regex

The regex assertion checks if the LLM output matches the provided regular expression.

Example:

assert:
  - type: regex
    value: "\\d{4}" # Matches a 4-digit number

Contains-JSON

The contains-json assertion checks if the LLM output contains a valid JSON structure.

Example:

assert:
  - type: contains-json

You may optionally set a value as a JSON schema in order to validate the JSON contents:

assert:
  - type: contains-json
    value:
      required: [latitude, longitude]
      type: object
      properties:
        latitude:
          minimum: -90
          type: number
          maximum: 90
        longitude:
          minimum: -180
          type: number
          maximum: 180

JSON is valid YAML, so you can also just copy in any JSON schema directly:

assert:
  - type: contains-json
    value:
      {
        'required': ['latitude', 'longitude'],
        'type': 'object',
        'properties':
          {
            'latitude': { 'type': 'number', 'minimum': -90, 'maximum': 90 },
            'longitude': { 'type': 'number', 'minimum': -180, 'maximum': 180 },
          },
      }

If your JSON schema is large, import it from a file:

assert:
  - type: contains-json
    value: file://./path/to/schema.json

Contains-Sql

This assertion ensure that the output is either valid SQL, or contains a code block with valid SQL.

assert:
  - type: contains-sql

See is-sql for advanced usage, including specific database types and allowlists for tables and columns.

Cost

The cost assertion checks if the cost of the LLM call is below a specified threshold.

This requires LLM providers to return cost information. Currently this is only supported by OpenAI GPT models and custom providers.

Example:

providers:
  - openai:gpt-4o-mini
  - openai:gpt-4
assert:
  # Pass if the LLM call costs less than $0.001
  - type: cost
    threshold: 0.001

Equality

The equals assertion checks if the LLM output is equal to the expected value.

Example:

assert:
  - type: equals
    value: 'The expected output'

You can also check whether it matches the expected JSON format.

assert:
  - type: equals
    value: { 'key': 'value' }

If your expected JSON is large, import it from a file:

assert:
  - type: equals
    value: 'file://path/to/expected.json'

Is-JSON

The is-json assertion checks if the LLM output is a valid JSON string.

Example:

assert:
  - type: is-json

You may optionally set a value as a JSON schema. If set, the output will be validated against this schema:

assert:
  - type: is-json
    value:
      required: [latitude, longitude]
      type: object
      properties:
        latitude:
          minimum: -90
          type: number
          maximum: 90
        longitude:
          minimum: -180
          type: number
          maximum: 180

JSON is valid YAML, so you can also just copy in any JSON schema directly:

assert:
  - type: is-json
    value:
      {
        'required': ['latitude', 'longitude'],
        'type': 'object',
        'properties':
          {
            'latitude': { 'type': 'number', 'minimum': -90, 'maximum': 90 },
            'longitude': { 'type': 'number', 'minimum': -180, 'maximum': 180 },
          },
      }

If your JSON schema is large, import it from a file:

assert:
  - type: is-json
    value: file://./path/to/schema.json

Is-XML

The is-xml assertion checks if the entire LLM output is a valid XML string. It can also verify the presence of specific elements within the XML structure.

Example:

assert:
  - type: is-xml

This basic usage checks if the output is valid XML.

You can also specify required elements:

assert:
  - type: is-xml
    value:
      requiredElements:
        - root.child
        - root.sibling

This checks if the XML is valid and contains the specified elements. The elements are specified as dot-separated paths, allowing for nested element checking.

How it works

The assertion first attempts to parse the entire output as XML using a parser (fast-xml-parser).
If parsing succeeds, it's considered valid XML.
If value is specified:
- It checks for a requiredElements key with an array of required elements.
- Each element path (e.g., "root.child") is split by dots.
- It traverses the parsed XML object following these paths.
- If any required element is not found, the assertion fails.

Examples

Basic XML validation:

assert:
  - type: is-xml

Passes for: <root><child>Content</child></root> Fails for: <root><child>Content</child></root (missing closing tag)

Checking for specific elements:

assert:
  - type: is-xml
    value:
      requiredElements:
        - analysis.classification
        - analysis.color

Passes for: <analysis><classification>T-shirt</classification><color>Red</color></analysis> Fails for: <analysis><classification>T-shirt</classification></analysis> (missing color element)

Checking nested elements:

assert:
  - type: is-xml
    value:
      requiredElements:
        - root.parent.child.grandchild

Passes for: <root><parent><child><grandchild>Content</grandchild></child></parent></root> Fails for: <root><parent><child></child></parent></root> (missing grandchild element)

Inverse assertion

You can use the not-is-xml assertion to check if the output is not valid XML:

assert:
  - type: not-is-xml

This will pass for non-XML content and fail for valid XML content.

Note: The is-xml assertion requires the entire output to be valid XML. For checking XML content within a larger text, use the contains-xml assertion.

Contains-XML

The contains-xml is identical to is-xml, except it checks if the LLM output contains valid XML content, even if it's not the entire output. For example, the following is valid.

Sure, here is your xml:
<root><child>Content</child></root>
let me know if you have any other questions!

Is-SQL

The is-sql assertion checks if the LLM output is a valid SQL statement.

Example:

assert:
  - type: is-sql

To use this assertion, you need to install the node-sql-parser package. You can install it using npm:

npm install node-sql-parser

You can optionally set a databaseType in the value to determine the specific database syntax that your LLM output will be validated against. The default database syntax is MySQL. For a complete and up-to-date list of supported database syntaxes, please refer to the node-sql-parser documentation.
The supported database syntax list:

Athena
BigQuery
DB2
FlinkSQL
Hive
MariaDB
MySQL
Noql
PostgresQL
Redshift
Snowflake(alpha)
Sqlite
TransactSQL

Example:

assert:
  - type: is-sql
    value:
      databaseType: 'MySQL'

You can also optionally set a allowedTables/allowedColumns in the value to determine the SQL authority list that your LLM output will be validated against.
The format of allowedTables:

{type}::{dbName}::{tableName} // type could be select, update, delete or insert

The format of allowedColumns:

{type}::{tableName}::{columnName} // type could be select, update, delete or insert

For SELECT *, DELETE, and INSERT INTO tableName VALUES() without specified columns, the .* column authority regex is required.

Example:

assert:
  - type: is-sql
    value:
      databaseType: 'MySQL'
      allowedTables:
        - '(select|update|insert|delete)::null::departments'
      allowedColumns:
        - 'select::null::name'
        - 'update::null::id'

is-valid-openai-function-call

This ensures that any JSON LLM output adheres to the schema specified in the functions configuration of the provider. Learn more about the OpenAI provider.

is-valid-openai-tools-call

This ensures that any JSON LLM output adheres to the schema specified in the tools configuration of the provider. Learn more about the OpenAI provider.

Javascript

See Javascript assertions.

Latency

The latency assertion passes if the LLM call takes longer than the specified threshold. Duration is specified in milliseconds.

Example:

assert:
  # Fail if the LLM call takes longer than 5 seconds
  - type: latency
    threshold: 5000

Note that latency requires that the cache is disabled with promptfoo eval --no-cache or an equivalent option.

Levenshtein distance

The levenshtein assertion checks if the LLM output is within a given edit distance from an expected value.

Example:

assert:
  # Ensure Levenshtein distance from "hello world" is <= 5
  - type: levenshtein
    threshold: 5
    value: hello world

value can reference other variables using template syntax. For example:

tests:
  - vars:
      expected: foobar
    assert:
      - type: levenshtein
        threshold: 2
        value: '{{expected}}'

Perplexity

Perplexity is a measurement used in natural language processing to quantify how well a language model predicts a sample of text. It's essentially a measure of the model's uncertainty.

High perplexity suggests it is less certain about its predictions, often because the text is very diverse or the model is not well-tuned to the task at hand.

Low perplexity means the model predicts the text with greater confidence, implying it's better at understanding and generating text similar to its training data.

To specify a perplexity threshold, use the perplexity assertion type:

assert:
  # Fail if the LLM is below perplexity threshold
  - type: perplexity
    threshold: 1.5

warning

Perplexity requires the LLM API to output logprobs. Currently only more recent versions of OpenAI GPT and Azure OpenAI GPT APIs support this.

Comparing different outputs from the same LLM

You can compare perplexity scores across different outputs from the same model to get a sense of which output the model finds more likely (or less surprising). This is a good way to tune your prompts and hyperparameters (like temperature) to be more accurate.

Comparing outputs from different LLMs

Comparing scores across models may not be meaningful, unless the models have been trained on similar datasets, the tokenization process is consistent between models, and the vocabulary of the models is roughly the same.

perplexity-score

perplexity-score is a supported metric similar to perplexity, except it is normalized between 0 and 1 and inverted, meaning larger numbers are better.

This makes it easier to include in an aggregate promptfoo score, as higher scores are usually better. In this example, we compare perplexity across multiple GPTs:

providers:
  - openai:gpt-4o-mini
  - openai:gpt-4o
tests:
  - assert:
      - type: perplexity-score
        threshold: 0.5 # optional
  # ...

Python

See Python assertions.

Starts-With

The starts-with assertion checks if the LLM output begins with the specified string.

This example checks if the output starts with "Yes":

assert:
  - type: starts-with
    value: 'Yes'

Webhook

The webhook assertion sends the LLM output to a specified webhook URL for custom validation. The webhook should return a JSON object with a pass property set to true or false.

Example:

assert:
  - type: webhook
    value: 'https://example.com/webhook'

The webhook will receive a POST request with a JSON payload containing the LLM output and the context (test case variables). For example, if the LLM output is "Hello, World!" and the test case has a variable example set to "Example text", the payload will look like:

{
  "output": "Hello, World!",
  "context": {
    "prompt": "Greet the user",
    "vars": {
      "example": "Example text"
    }
  }
}

The webhook should process the request and return a JSON response with a pass property set to true or false, indicating whether the LLM output meets the custom validation criteria. Optionally, the webhook can also provide a reason property to describe why the output passed or failed the assertion.

Example response:

{
  "pass": true,
  "reason": "The output meets the custom validation criteria"
}

If the webhook returns a pass value of true, the assertion will be considered successful. If it returns false, the assertion will fail, and the provided reason will be used to describe the failure.

You may also return a score:

{
  "pass": true,
  "score": 0.5,
  "reason": "The output meets the custom validation criteria"
}

Deterministic metrics

Assertion types​

Contains​

Contains-All​

Contains-Any​

Regex​

Contains-JSON​

Contains-Sql​

Cost​

Equality​

Is-JSON​

Is-XML​

How it works​

Examples​

Inverse assertion​

Contains-XML​

Is-SQL​

is-valid-openai-function-call​

is-valid-openai-tools-call​

Javascript​

Latency​

Levenshtein distance​

Perplexity​

Comparing different outputs from the same LLM​

Comparing outputs from different LLMs​

perplexity-score​

Python​

Starts-With​

Webhook​

Assertion types

Contains

Contains-All

Contains-Any

Regex

Contains-JSON

Contains-Sql

Cost

Equality

Is-JSON

Is-XML

How it works

Examples

Inverse assertion

Contains-XML

Is-SQL

is-valid-openai-function-call

is-valid-openai-tools-call

Javascript

Latency

Levenshtein distance

Perplexity

Comparing different outputs from the same LLM

Comparing outputs from different LLMs

perplexity-score

Python

Starts-With

Webhook