
Iterate on LLMs faster

Measure LLM quality and catch regressions

Used by 20,000+ developers at companies like

Shopify, Discord, Google, Microsoft, Salesforce, Carvana

... to rapidly improve prompts and evaluate models

Simple, declarative config


# Compare prompts...
prompts:
  - "Summarize this in {{language}}: {{document}}"
  - "Summarize this in {{language}}, concisely and professionally: {{document}}"

# And models...
providers:
  - openai:gpt-4-0125-preview
  - anthropic:claude-3-opus
  - mistral:mistral-large-latest

# ... using these tests
tests:
  - vars:
      language: French
      document: "To be or not to be, that is the question..."
    assert:
      - type: contains
        value: "Être ou ne pas être"
      - type: cost
        threshold: 0.01
      - type: latency
        threshold: 1000
      - type: llm-rubric
        value: does not apologize
  - # ...

Detailed, actionable results

How it works

Create a test dataset

Use a representative sample of user inputs to reduce subjectivity when tuning prompts.
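
One way to get a representative sample is to draw it at random from logged inputs with a fixed seed, so the dataset stays the same between runs. The sketch below is illustrative only: `logged_inputs`, `sample_test_cases`, and `to_csv` are hypothetical names, and the CSV layout (one column per template variable, here `language` and `document`) is an assumption about how you might store test cases, not a prescribed format.

```python
import csv
import io
import random


def sample_test_cases(rows, k, seed=0):
    """Return up to k rows chosen uniformly at random, reproducibly."""
    rng = random.Random(seed)  # fixed seed keeps the sample stable across runs
    if k >= len(rows):
        return list(rows)
    return rng.sample(rows, k)


def to_csv(rows, fieldnames):
    """Serialize rows to CSV text, one column per template variable."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical logged inputs; in practice these would come from real traffic.
logged_inputs = [
    {"language": "French", "document": "To be or not to be, that is the question..."},
    {"language": "German", "document": "All the world's a stage."},
    {"language": "Spanish", "document": "Brevity is the soul of wit."},
    {"language": "French", "document": "The rest is silence."},
]

sample = sample_test_cases(logged_inputs, k=2)
csv_text = to_csv(sample, fieldnames=["language", "document"])
```

Sampling from real inputs, rather than hand-writing cases, is what keeps the comparison between prompts from reflecting the author's own expectations.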

Set up evaluation metrics

Use built-in metrics, LLM-graded evals, or define your own custom metrics.
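
A custom metric can be as small as a function that takes the model output and returns a verdict. The sketch below assumes a Python grader named `get_assert(output, context)` that returns a dict with `pass`, `score`, and `reason`, and that `context["vars"]` holds the test variables. That name and shape reflect my understanding of promptfoo's Python assertion convention, so check the documentation before relying on them. The apology markers and the 0.5 partial score are arbitrary choices for illustration.

```python
# Phrases treated as an apology; an illustrative list, not an exhaustive one.
APOLOGY_MARKERS = ("i'm sorry", "i am sorry", "i apologize", "my apologies")


def get_assert(output, context):
    """Grade a summary: it must not apologize and must be shorter than the source."""
    # Assumption: test variables are available under context["vars"].
    document = context.get("vars", {}).get("document", "")
    text = output.strip().lower()

    if any(marker in text for marker in APOLOGY_MARKERS):
        return {"pass": False, "score": 0.0, "reason": "Output contains an apology"}

    if document and len(output) >= len(document):
        return {
            "pass": False,
            "score": 0.5,
            "reason": "Summary is not shorter than the document",
        }

    return {"pass": True, "score": 1.0, "reason": "No apology; shorter than the document"}
```

A deterministic check like this is cheaper and more repeatable than an LLM-graded rubric, so it is worth using wherever the criterion can be stated as a rule.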

Select the best prompt & model

Compare prompts and model outputs side-by-side, or integrate the library into your existing test/CI workflow.

Web Viewer

Command line

promptfoo is used by LLM apps serving over 10 million users