Moderation

Use the moderation assert type to ensure that LLM outputs are safe.

Currently, the moderation assertion supports OpenAI's moderation model and Meta's LlamaGuard 2 model via Replicate.

In general, we encourage the use of Meta's LlamaGuard, as it substantially outperforms both OpenAI's moderation API and GPT-4. See benchmarks.

OpenAI moderation

By default, the moderation assertion uses OpenAI. Just make sure that the OPENAI_API_KEY environment variable is set:

tests:
  - vars:
      foo: bar
    assert:
      # Ensure that it passes OpenAI's moderation filters
      - type: moderation
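
The moderation check runs on the provider's output and can sit alongside any other assertions on the same test. As a sketch, here it is combined with a contains assertion (the bar value is just an illustrative placeholder):

tests:
  - vars:
      foo: bar
    assert:
      # The output must pass moderation...
      - type: moderation
      # ...and must also contain an expected substring (illustrative)
      - type: contains
        value: bar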

OpenAI monitors the following categories:

| Category | Description |
| -------- | ----------- |
| hate | Content that promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability, or caste. Hateful content aimed at non-protected groups is harassment. |
| hate/threatening | Hateful content that includes violence or serious harm towards the targeted group. |
| harassment | Content that promotes harassing language towards any target. |
| harassment/threatening | Harassment content that includes violence or serious harm towards any target. |
| self-harm | Content that promotes or depicts acts of self-harm, such as suicide, cutting, and eating disorders. |
| self-harm/intent | Content where the speaker expresses intent to engage in self-harm. |
| self-harm/instructions | Content that encourages or gives instructions on how to commit acts of self-harm. |
| sexual | Content meant to arouse sexual excitement or promote sexual services (excluding sex education and wellness). |
| sexual/minors | Sexual content involving individuals under 18 years old. |
| violence | Content that depicts death, violence, or physical injury. |
| violence/graphic | Content that depicts death, violence, or physical injury in graphic detail. |

Check specific categories

The assertion value lets you enable moderation only for specific categories:

tests:
  - vars:
      foo: bar
    assert:
      - type: moderation
        value:
          - harassment
          - harassment/threatening
          - sexual
          - sexual/minors
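
If every test in your suite should be moderated, you can avoid repeating the assertion on each test by placing it in a defaultTest block. This is a sketch assuming promptfoo's standard defaultTest merge behavior; the category list is illustrative:

defaultTest:
  assert:
    # Applied to every test in the suite
    - type: moderation
      value:
        - harassment
        - sexual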

Meta LlamaGuard moderation

This example uses the LlamaGuard model hosted on Replicate. Be sure to set the REPLICATE_API_KEY environment variable:

tests:
  - vars:
      foo: bar
    assert:
      - type: moderation
        # Use the latest Llama Guard on Replicate
        provider: 'replicate:moderation:meta/meta-llama-guard-2-8b:b063023ee937f28e922982abdbf97b041ffe34ad3b35a53d33e1d74bb19b36c4'

LlamaGuard monitors the following categories:

| Category | Description | Code |
| -------- | ----------- | ---- |
| Violent Crimes | Content that enables, encourages, or excuses violent crimes (e.g., terrorism, murder, child abuse, animal abuse) | S1 |
| Non-Violent Crimes | Content that enables, encourages, or excuses non-violent crimes (e.g., fraud, burglary, drug trafficking) | S2 |
| Sex Crimes | Content that enables, encourages, or excuses sex crimes (e.g., human trafficking, sexual assault, harassment) | S3 |
| Child Exploitation | Content depicting child nudity or sexual abuse of children | S4 |
| Specialized Advice | Content containing specialized financial, medical, or legal advice | S5 |
| Privacy | Content containing sensitive, personal information about private individuals | S6 |
| Intellectual Property | Content that violates intellectual property rights of third parties | S7 |
| Indiscriminate Weapons | Content that enables the creation of weapons of mass destruction (e.g., chemical, biological, nuclear weapons) | S8 |
| Hate | Content that is hateful toward people based on protected characteristics or perpetuates negative stereotypes | S9 |
| Self-Harm | Content that enables, encourages, or excuses acts of intentional self-harm (e.g., suicide, self-injury) | S10 |
| Sexual Content | Depictions of nude adults or content with erotic descriptions or explicit sex acts | S11 |

Check specific categories

The assertion value lets you enable moderation only for specific categories, referenced by their codes:

tests:
  - vars:
      foo: bar
    assert:
      - type: moderation
        provider: 'replicate:moderation:meta/meta-llama-guard-2-8b:b063023ee937f28e922982abdbf97b041ffe34ad3b35a53d33e1d74bb19b36c4'
        value:
          - S1
          - S3
          - S4
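
Putting it together, a minimal config might look like the following sketch. The prompt text, the openai:gpt-4o-mini provider id, and the test input are illustrative placeholders; substitute your own prompt and target model:

# promptfooconfig.yaml (sketch; prompt, provider, and vars are placeholders)
prompts:
  - 'Reply to the following user message: {{foo}}'

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      foo: Tell me about your weekend
    assert:
      - type: moderation
        # Grade with Llama Guard 2 on Replicate instead of OpenAI
        provider: 'replicate:moderation:meta/meta-llama-guard-2-8b:b063023ee937f28e922982abdbf97b041ffe34ad3b35a53d33e1d74bb19b36c4'
        value:
          - S1
          - S9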