# Intro
**promptfoo** is a CLI and library for evaluating LLM output quality.
With promptfoo, you can:
- Systematically test prompts, models, and RAG pipelines with predefined test cases
- Evaluate quality and catch regressions by comparing LLM outputs side-by-side
- Speed up evaluations with caching and concurrent tests
- Score outputs automatically by defining expectations
- Use it as a CLI, or integrate it into your workflow as a library
- Use APIs from OpenAI, Anthropic, Azure, Google, and HuggingFace, run open-source models like Llama, or integrate a custom provider for any LLM API
The goal: test-driven LLM development, not trial-and-error.
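As a concrete sketch, a minimal `promptfooconfig.yaml` could look like the following; the prompt text, model IDs, and test values here are illustrative placeholders rather than a recommended setup:

```yaml
# promptfooconfig.yaml: minimal illustrative example
prompts:
  - 'Summarize this in one sentence: {{text}}'
  - 'Write a tweet-length summary of: {{text}}'

providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      text: 'Paris is the capital of France and its largest city.'
    assert:
      # Case-insensitive substring check on the model output
      - type: icontains
        value: paris
```

Running `npx promptfoo@latest eval` in the same directory evaluates every prompt against every provider for each test case.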
promptfoo produces matrix views that let you quickly evaluate outputs across many prompts.
In the web viewer, you can compare outputs for multiple prompts and inputs side by side; the same comparison works on the command line too.
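Once an eval has run, `promptfoo view` opens that web viewer, and `promptfoo eval` itself prints a summary of results directly in the terminal.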
## Why choose promptfoo?
There are many different ways to evaluate prompts. Here are some reasons to consider promptfoo:
- Battle-tested: promptfoo was built to eval & improve LLM apps serving over 10 million users in production. Our tooling is flexible and can be adapted to many setups.
- Simple, declarative test cases: Define your evals without writing code or working with heavy notebooks.
- Language agnostic: Use Python, JavaScript, or whatever else you're working in.
- Share & collaborate: Built-in share functionality & web viewer for working with teammates.
- Open-source: LLM evals are a commodity and should be served by 100% open-source projects with no strings attached.
- Privacy: This software runs completely locally. Your evals run on your machine and talk directly with the LLM.
## Workflow and philosophy
Test-driven prompt engineering is much more effective than trial-and-error.
Serious LLM development requires a systematic approach to prompt engineering. Promptfoo streamlines the process of evaluating and improving language model performance.
1. Define test cases: Identify core use cases and failure modes. Prepare a set of prompts and test cases that represent these scenarios.
2. Configure evaluation: Set up your evaluation by specifying prompts, test cases, and API providers.
3. Run evaluation: Use the command-line tool or library to execute the evaluation and record model outputs for each prompt.
4. Analyze results: Set up automatic requirements (see the sketch below), or review results in a structured format/web UI. Use these results to select the best model and prompt for your use case.
As you gather more examples and user feedback, continue to expand your test cases.
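To make step 4 concrete, here is a sketch of how "automatic requirements" can be expressed as assertions. It uses promptfoo's `defaultTest` block so the assertions apply to every test case; the rubric wording and length limit are illustrative, not prescribed values:

```yaml
# Assertions applied to every test case in the suite
defaultTest:
  assert:
    # Model-graded check: an LLM grades the output against this rubric
    - type: llm-rubric
      value: Answers the question accurately and concisely
    # Deterministic check: a JavaScript expression evaluated on the output
    - type: javascript
      value: output.length < 500
```

Failing assertions show up as failures in the results, which turns a prompt or model change into a pass/fail regression signal rather than a judgment call.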