G-Eval
G-Eval is a framework that uses LLMs with chain-of-thought (CoT) prompting to evaluate LLM outputs against custom criteria. It's based on the paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment".
How to use it
To use G-Eval in your test configuration:
```yaml
assert:
  - type: g-eval
    value: 'Ensure the response is factually accurate and well-structured'
    threshold: 0.7 # Optional, defaults to 0.7
```
You can also provide multiple evaluation criteria as an array:
```yaml
assert:
  - type: g-eval
    value:
      - 'Check if the response maintains a professional tone'
      - 'Verify that all technical terms are used correctly'
      - 'Ensure no confidential information is revealed'
```
How it works
G-Eval uses GPT-4o (by default) to evaluate outputs based on your specified criteria. The evaluation process:
- Takes your evaluation criteria
- Uses chain-of-thought prompting to analyze the output
- Returns a normalized score between 0 and 1
The assertion passes if the score meets or exceeds the threshold (default 0.7).
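For example, you can raise the threshold when a criterion should be met almost perfectly. This is a minimal sketch; the criterion text and threshold value are illustrative, not part of the default configuration:

```yaml
assert:
  - type: g-eval
    value: 'Backs every factual claim with a source or citation'
    threshold: 0.9 # stricter: the evaluator's normalized score must be >= 0.9 to pass
```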
Customizing the evaluator
Like other model-graded assertions, you can override the default GPT-4o evaluator:
```yaml
assert:
  - type: g-eval
    value: 'Ensure response is factually accurate'
    provider: openai:gpt-4o-mini
```
Or globally via test options:
```yaml
defaultTest:
  options:
    provider: openai:gpt-4o-mini
```
Example
Here's a complete example showing how to use G-Eval to assess multiple aspects of an LLM response:
```yaml
prompts:
  - |
    Write a technical explanation of {{topic}}
    suitable for a beginner audience.

providers:
  - openai:gpt-4

tests:
  - vars:
      topic: 'quantum computing'
    assert:
      - type: g-eval
        value:
          - 'Explains technical concepts in simple terms'
          - 'Maintains accuracy without oversimplification'
          - 'Includes relevant examples or analogies'
          - 'Avoids unnecessary jargon'
        threshold: 0.8
```