Intro

promptfoo is a CLI and library for evaluating LLM output quality.

With promptfoo, you can:

  • Systematically test prompts, models, and RAG pipelines with predefined test cases
  • Evaluate quality and catch regressions by comparing LLM outputs side-by-side
  • Speed up evaluations with caching and concurrent tests
  • Score outputs automatically by defining expectations
  • Use as a CLI, or integrate into your workflow as a library
  • Use OpenAI, Anthropic, Azure, Google, HuggingFace, open-source models like Llama, or integrate custom API providers for any LLM API
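
To illustrate why caching and concurrency speed up evaluations, here is a minimal Python sketch. It is a generic illustration, not promptfoo's implementation: `fake_llm`, `run_all`, and `call_count` are hypothetical names, and the "model" is a stub that uppercases its input.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

_lock = threading.Lock()
call_count = 0  # how many times the (pretend) model was actually invoked


@lru_cache(maxsize=None)
def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a slow, paid LLM API call."""
    global call_count
    with _lock:
        call_count += 1
    return prompt.upper()


def run_all(prompts: list[str], workers: int = 4) -> list[str]:
    """Evaluate prompts concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_llm, prompts))


prompts = ["hello", "world"]
first = run_all(prompts)   # two real calls, run in parallel
second = run_all(prompts)  # same prompts again: answered from the cache
print(first, second, call_count)
```

Rerunning an unchanged test suite costs no additional model calls in this sketch, which is the property that makes frequent re-evaluation practical.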

The goal: test-driven LLM development, not trial-and-error.

promptfoo produces matrix views that let you quickly evaluate outputs across many prompts.

Here's an example of a side-by-side comparison of multiple prompts and inputs:

[Image: Side-by-side evaluation of LLM prompt quality]

It works on the command line too.

[Image: LLM prompt quality evaluation with PASS/FAIL expectations]
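
The matrix idea described above can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not promptfoo's API: `fake_llm` is a hypothetical provider stub, and the single expectation (output no longer than 50 characters) is chosen only to make one prompt fail.

```python
def fake_llm(prompt: str) -> str:
    """Hypothetical provider stub; a real one would call an LLM API."""
    return f"Answer to: {prompt}"


prompt_templates = [
    "Summarize: {text}",
    "Please write a short, friendly summary of the following text: {text}",
]
inputs = [{"text": "cats"}, {"text": "quantum computing"}]


def expectation(output: str) -> bool:
    """Example expectation: the output stays within 50 characters."""
    return len(output) <= 50


def evaluate(templates, cases):
    """Return one row per input and one column per prompt template."""
    matrix = []
    for case in cases:
        row = []
        for template in templates:
            output = fake_llm(template.format(**case))
            row.append("PASS" if expectation(output) else "FAIL")
        matrix.append(row)
    return matrix


matrix = evaluate(prompt_templates, inputs)
for case, row in zip(inputs, matrix):
    print(case["text"], row)
```

Each row is one input and each column is one prompt, so a column full of FAIL entries points at a prompt that needs work.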

Why choose promptfoo?

There are many different ways to evaluate prompts. Here are some reasons to consider promptfoo:

  • Battle-tested: promptfoo was built to evaluate and improve LLM apps serving over 10 million users in production. Our tooling is flexible and can be adapted to many setups.
  • Simple, declarative test cases: Define your evals without writing code or working with heavy notebooks.
  • Language agnostic: Use Python, JavaScript, or whatever else you're working in.
  • Share & collaborate: Built-in share functionality & web viewer for working with teammates.
  • Open-source: LLM evals are a commodity and should be served by 100% open-source projects with no strings attached.
  • Privacy: This software runs completely locally. Your evals run on your machine and talk directly with the LLM.

Workflow and philosophy

Test-driven prompt engineering is much more effective than trial-and-error.

Serious LLM development requires a systematic approach to prompt engineering. promptfoo streamlines the process of evaluating and improving language model performance.

  1. Define test cases: Identify core use cases and failure modes. Prepare a set of prompts and test cases that represent these scenarios.
  2. Configure evaluation: Set up your evaluation by specifying prompts, test cases, and API providers.
  3. Run evaluation: Use the command-line tool or library to execute the evaluation and record model outputs for each prompt.
  4. Analyze results: Set up automatic requirements, or review results in a structured format or the web UI. Use these results to select the best model and prompt for your use case.
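
The four steps above can be sketched end to end in plain Python. This is a hedged illustration of the workflow, not promptfoo's configuration format or library API: `EvalCase`, `run`, and the two stub providers are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    vars: dict                    # values substituted into the prompt
    check: Callable[[str], bool]  # automatic requirement on the output


# Step 1: define test cases (a core use case and a failure mode).
tests = [
    EvalCase({"name": "Ada"}, lambda out: "Ada" in out),
    EvalCase({"name": "Bob"}, lambda out: out.endswith("!")),
]

# Step 2: configure prompts and providers (both providers are stubs).
prompts = ["Greet {name}", "Say hello to {name}!"]
providers = {
    "stub-echo": lambda p: p,
    "stub-shout": lambda p: p.upper(),
}


# Step 3: run every provider/prompt pair against every test case.
def run(prompts, providers, tests):
    results = {}
    for provider_name, provider in providers.items():
        for prompt in prompts:
            passed = sum(
                t.check(provider(prompt.format(**t.vars))) for t in tests
            )
            results[(provider_name, prompt)] = passed / len(tests)
    return results


# Step 4: analyze results and pick the best combination by pass rate.
results = run(prompts, providers, tests)
best = max(results, key=results.get)
print(best, results[best])
```

Ranking by pass rate is one simple selection rule; in practice you might also weigh cost, latency, or manual review of the outputs.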

[Image: test-driven LLM ops]

As you gather more examples and user feedback, continue to expand your test cases.