Iterate on LLMs faster
Measure LLM quality improvements and catch regressions
How it works
Create a test dataset
Use a representative sample of user inputs to reduce subjectivity when tuning prompts.
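A minimal sketch of what such a dataset can look like. The file name `tests.csv` and the column layout are illustrative assumptions, not a format the tool prescribes; the point is that every prompt and model variant gets judged against the same fixed inputs.

```python
# Hypothetical sketch: a test dataset is a list of representative user
# inputs, optionally paired with expected outputs for later scoring.
import csv

test_cases = [
    {"input": "Summarize: The meeting is moved to 3pm Friday.",
     "expected": "3pm friday"},
    {"input": "Summarize: Invoice #1042 is overdue by 12 days.",
     "expected": "overdue"},
]

# Persist the dataset so every prompt/model variant is evaluated
# on identical inputs across runs.
with open("tests.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected"])
    writer.writeheader()
    writer.writerows(test_cases)
```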
Set up evaluation metrics
Use built-in metrics, LLM-graded evals, or define your own custom metrics.
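The sketch below illustrates the three metric styles this step names. None of these function names come from the library; they are hypothetical stand-ins showing the general shape of a deterministic check, a custom metric, and an LLM-graded rubric.

```python
def call_grader_model(prompt: str) -> str:
    """Placeholder for a real model client (e.g., an HTTP call to your provider)."""
    raise NotImplementedError("wire this to your model provider")

def contains_expected(output: str, expected: str) -> bool:
    """Built-in-style deterministic assertion: expected substring is present."""
    return expected.lower() in output.lower()

def max_length(output: str, limit: int = 200) -> bool:
    """Custom metric: enforce a length budget on completions."""
    return len(output) <= limit

def llm_graded(output: str, rubric: str) -> float:
    """LLM-graded eval: ask a grader model to score the output against a rubric."""
    score = call_grader_model(
        f"Rate from 0 to 1 how well this output satisfies the rubric.\n"
        f"Rubric: {rubric}\nOutput: {output}"
    )
    return float(score)
```

Deterministic checks are cheap and reproducible, so they work well as a first gate; LLM-graded rubrics cover the subjective qualities that string matching can't.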
Select the best prompt & model
Compare prompts and model outputs side-by-side, or integrate the library into your existing test/CI workflow.
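A minimal harness sketch, assuming the dataset and metric shapes from the earlier steps: it runs every prompt-model combination over the same test cases, reports a pass rate per pair, and exits non-zero in CI when results regress below a threshold. The prompt templates, model IDs, and the 0.9 threshold are all illustrative assumptions.

```python
import sys
from itertools import product

prompts = {"v1": "Summarize briefly: {input}", "v2": "TL;DR: {input}"}
models = ["model-a", "model-b"]  # illustrative model IDs, not real ones

test_cases = [
    {"input": "The meeting is moved to 3pm Friday.", "expected": "3pm friday"},
    {"input": "Invoice #1042 is overdue by 12 days.", "expected": "overdue"},
]

def run_model(model: str, prompt: str) -> str:
    """Placeholder: call your model provider here."""
    raise NotImplementedError("wire this to your model provider")

def passes(output: str, case: dict) -> bool:
    """One simple metric for the sketch: expected substring is present."""
    return case["expected"] in output.lower()

def evaluate() -> dict:
    """Compute a pass rate for every prompt x model pair over the shared dataset."""
    results = {}
    for (name, template), model in product(prompts.items(), models):
        scored = [passes(run_model(model, template.format(input=c["input"])), c)
                  for c in test_cases]
        results[(name, model)] = sum(scored) / len(scored)
    return results

if __name__ == "__main__":
    results = evaluate()
    for (prompt, model), rate in sorted(results.items(), key=lambda kv: -kv[1]):
        print(f"{prompt} + {model}: {rate:.0%} pass")
    # Fail the CI build if even the best combination drops below the threshold.
    if max(results.values()) < 0.9:
        sys.exit(1)
```

Printing results sorted by pass rate gives the side-by-side comparison; the exit code is what lets the same run double as a regression gate in CI.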