Similarity (embeddings)
The similar assertion checks if an embedding of the LLM's output
is semantically similar to the expected value,
using a configurable similarity or distance metric with a threshold.
By default, embeddings are computed via OpenAI's text-embedding-3-large model.
Example:
assert:
- type: similar
value: 'The expected output'
threshold: 0.8
If you provide an array of values, the test will pass if it is similar to at least one of them:
assert:
- type: similar
value:
- The expected output
- Expected output
- file://my_expected_output.txt
threshold: 0.8
Similarity Metrics
You can specify which metric to use by including it in the assertion type. The default is similar (cosine similarity).
Cosine Similarity (default)
Measures the cosine of the angle between two vectors. Range: -1 to 1 (higher is more similar), though text embeddings typically produce values between 0 and 1.
assert:
# Default - uses cosine similarity
- type: similar
value: 'The expected output'
threshold: 0.8
# Explicit cosine
- type: similar:cosine
value: 'The expected output'
threshold: 0.8
When to use: Best for semantic similarity where you care about the direction of the embedding vector, not its magnitude. This is the industry standard for embeddings.
Dot Product
Computes the dot product of two vectors. Range: unbounded, but typically 0 to 1 for normalized embeddings (higher is more similar).
assert:
- type: similar:dot
value: 'The expected output'
threshold: 0.8
When to use: Useful when you want to match the metric used in your production vector database (many use dot product for performance). For normalized embeddings, dot product is nearly equivalent to cosine similarity.
Euclidean Distance
Computes the straight-line distance between two vectors. Range: 0 to ∞ (lower is more similar).
assert:
- type: similar:euclidean
value: 'The expected output'
threshold: 0.5 # Maximum distance threshold
When to use: When you care about both the angle and magnitude differences between vectors. Note that the threshold represents the maximum distance (not minimum similarity), so lower values are stricter.
Important: For euclidean distance, the threshold semantics are inverted - it represents the maximum acceptable distance rather than minimum similarity.
Overriding the provider
By default similar will use OpenAI. To specify the model that creates the embeddings, do one of the following:
-
Use
test.optionsordefaultTest.optionsto override the provider across the entire test suite. For example:defaultTest:
options:
provider:
embedding:
id: azureopenai:embedding:text-embedding-ada-002
config:
apiHost: xxx.openai.azure.com
tests:
assert:
- type: similar
value: Hello world -
Set
assertion.provideron a per-assertion basis. For example:tests:
assert:
- type: similar
value: Hello world
provider: huggingface:sentence-similarity:sentence-transformers/all-MiniLM-L6-v2