Evaluating factuality
What is factuality and why is it important?
Factuality is the measure of how accurately an LLM's response aligns with established facts or reference information. Simply put, it answers the question: "Is what the AI saying actually true?"
A concrete example:
Question: "What is the capital of France?"
AI response: "The capital of France is Paris, which has been the country's capital since 987 CE."
Reference fact: "Paris is the capital of France."
In this case, the AI response is factually accurate (it includes the correct capital) but adds additional information about when Paris became the capital.
As LLMs become increasingly integrated into critical applications, ensuring they provide factually accurate information is essential for:
- Building trust: Users need confidence that AI responses are reliable and truthful. For example, a financial advisor chatbot that gives incorrect information about tax laws could cause users to make costly mistakes and lose trust in your service.
- Reducing misinformation: Factually incorrect AI outputs can spread misinformation at scale. For instance, a healthcare bot incorrectly stating that a common vaccine is dangerous could influence thousands of patients to avoid important preventative care.
- Supporting critical use cases: Applications in healthcare, finance, education, and legal domains require high factual accuracy. A legal assistant that misrepresents case law precedents could lead to flawed legal strategies with serious consequences.
- Improving model selection: Comparing factuality across models helps choose the right model for your application. A company might discover that while one model is more creative, another has 30% better factual accuracy for technical documentation.
- Identifying hallucinations: Factuality evaluation helps detect when models "make up" information. For example, discovering that your product support chatbot fabricates non-existent troubleshooting steps 15% of the time would be a critical finding.
promptfoo's factuality evaluation enables you to systematically measure how well your model outputs align with reference facts, helping you identify and address issues before they reach users.
Quick Start: Try it today
The fastest way to get started with factuality evaluation is to use our pre-built TruthfulQA example:
```sh
# Initialize the example - this command creates a new directory with all necessary files
npx promptfoo@latest init --example huggingface-dataset-factuality

# Change into the newly created directory
cd huggingface-dataset-factuality

# Run the evaluation - this executes the factuality tests using the models specified in the config
npx promptfoo eval

# View the results in an interactive web interface
npx promptfoo view
```
What these commands do:
- The first command initializes a new project using our huggingface-dataset-factuality example template
- The second command navigates into the project directory
- The third command runs the factuality evaluation against the TruthfulQA dataset
- The final command opens the results in your browser for analysis
This example:
- Fetches the TruthfulQA dataset (designed to test model truthfulness)
- Creates test cases with built-in factuality assertions
- Compares model outputs against reference answers
- Provides detailed factuality scores and analysis
You can easily customize it by:
- Uncommenting additional providers in `promptfooconfig.yaml` to test more models (see the sketch after this list)
- Adjusting the prompt template to change how questions are asked
- Modifying the factuality scoring weights to match your requirements
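For example, the providers block of the generated `promptfooconfig.yaml` looks roughly like the sketch below (the exact contents of the example may differ); enabling another model is just a matter of removing the leading `#`:

```yaml
# promptfooconfig.yaml (illustrative sketch - the generated example file may differ)
providers:
  - openai:gpt-4.1-mini
  # Uncomment to compare additional models:
  # - anthropic:claude-3-7-sonnet-20250219
  # - google:gemini-2.0-flash
```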
How factuality evaluation works
promptfoo implements a structured factuality evaluation methodology based on OpenAI's evals, using the `factuality` assertion type.
The model-graded factuality check takes the following three inputs:
- Prompt: prompt sent to the LLM
- Output: text produced by the LLM
- Reference: the ideal LLM output, provided by the author of the eval
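In a promptfoo config, these inputs map onto familiar fields: the prompt template supplies the Prompt, the provider's response at eval time is the Output, and the assertion's `value` is the Reference. A minimal sketch (the question and reference here are placeholders chosen purely for illustration):

```yaml
prompts:
  - 'Answer concisely: {{question}}' # Prompt sent to the LLM

providers:
  - openai:gpt-4.1-mini # produces the Output that is graded

tests:
  - vars:
      question: At what temperature does water boil at sea level?
    assert:
      - type: factuality
        value: Water boils at 100 degrees Celsius (212 degrees Fahrenheit) at sea level. # Reference
```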
Key terminology explained
The evaluation classifies the relationship between the LLM output and the reference into one of five categories:
- A: Output is a subset of the reference and is fully consistent with it
  - Example: If the reference is "Paris is the capital of France and has a population of 2.1 million," a subset would be "Paris is the capital of France" — it contains less information but is fully consistent
- B: Output is a superset of the reference and is fully consistent with it
  - Example: If the reference is "Paris is the capital of France," a superset would be "Paris is the capital of France and home to the Eiffel Tower" — it adds accurate information while maintaining consistency
- C: Output contains all the same details as the reference
  - Example: If the reference is "The Earth orbits the Sun," and the output is "The Sun is orbited by the Earth" — same information, different wording
- D: Output and reference disagree
  - Example: If the reference is "Paris is the capital of France," but the output claims "Lyon is the capital of France" — this is a factual disagreement
- E: Output and reference differ, but differences don't affect factuality
  - Example: If the reference is "The distance from Earth to the Moon is 384,400 km," and the output says "The Moon is about 384,000 km from Earth" — the small difference doesn't materially affect factuality
By default, categories A, B, C, and E are considered passing (with customizable scores), while category D (disagreement) is considered failing.
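If you want to weight the passing categories differently (for example, giving less credit to answers that add extra, unverified detail), you can override the per-category scores. The sketch below uses `defaultTest` so the weights apply to every test; the option keys follow the promptfoo documentation at the time of writing, so verify them against the version you are running:

```yaml
defaultTest:
  options:
    factuality:
      subset: 1 # Category A: subset of the reference, full credit
      superset: 0.8 # Category B: adds extra information, slightly reduced credit
      agree: 1 # Category C: same details as the reference, full credit
      disagree: 0 # Category D: contradicts the reference, fails
      differButFactual: 1 # Category E: wording differs but facts hold, full credit
```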
Creating a basic factuality evaluation
To set up a simple factuality evaluation for your LLM outputs:
- Create a configuration file with a factuality assertion:
```yaml
providers:
  - openai:gpt-4.1-mini

prompts:
  - |
    Please answer the following question accurately:
    Question: What is the capital of {{location}}?

tests:
  - vars:
      location: California
    assert:
      - type: factuality
        value: The capital of California is Sacramento
```
- Run your evaluation:
```sh
npx promptfoo eval
npx promptfoo view
```
This will produce a report showing how factually accurate your model's responses are compared to the reference answers.
Comparing multiple models
Factuality evaluation is especially useful for comparing how different models perform on the same facts:
```yaml
providers:
  - openai:gpt-4.1-mini
  - openai:gpt-4.1
  - anthropic:claude-3-7-sonnet-20250219
  - google:gemini-2.0-flash

prompts:
  - |
    Question: What is the capital of {{location}}?
    Please answer accurately.

tests:
  - vars:
      location: California
    assert:
      - type: factuality
        value: The capital of California is Sacramento
  - vars:
      location: New York
    assert:
      - type: factuality
        value: Albany is the capital of New York
```
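Run this comparison with the same `npx promptfoo eval` and `npx promptfoo view` commands as before. The report lists each provider's answer to the same question side by side, along with whether its factuality assertion passed, so you can quickly see which model holds up best on your reference facts.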