Guardrails
Guardrails are an active mitigation layer for LLM security, implemented to control and monitor user interactions. They help prevent misuse, detect potential security risks, and ensure appropriate model behavior by filtering or blocking problematic inputs and outputs.
Common guardrails include prompt injection/jailbreak detection, content moderation, and PII (personally identifiable information) detection.
Guardrails API
The Guardrails API helps detect potential security risks in user inputs to LLMs, identify PII, and assess potential harm in content.
The Guardrails API is focused on classification and detection. It returns a result, and your application can decide whether to warn, block, or otherwise respond to the input.
API Base URL
https://api.promptfoo.dev
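All of the endpoints below accept a POST request with a JSON body containing a single input field and return a model name plus a results array. A minimal TypeScript sketch of this shared request pattern (assuming a Node 18+ runtime with built-in fetch; the helper and type names are illustrative, not part of the API):

```typescript
// Illustrative helper: every guardrails endpoint shares the same envelope.
// GuardrailsResponse and checkGuardrails are assumptions for this sketch.
interface GuardrailsResponse {
  model: string;
  results: Array<{
    categories: Record<string, boolean>;
    category_scores: Record<string, number>;
    flagged: boolean;
  }>;
}

async function checkGuardrails(
  endpoint: '/v1/guard' | '/v1/pii' | '/v1/harm',
  input: string,
): Promise<GuardrailsResponse> {
  const res = await fetch(`https://api.promptfoo.dev${endpoint}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ input }),
  });
  if (!res.ok) {
    throw new Error(`Guardrails request failed with status ${res.status}`);
  }
  return (await res.json()) as GuardrailsResponse;
}
```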
Endpoints
1. Prompt Injection and Jailbreak Detection
Analyzes input text to classify potential security threats from prompt injections and jailbreaks.
Request
POST /v1/guard
Headers
Content-Type: application/json
Body
```json
{
  "input": "String containing the text to analyze"
}
```
Response
```json
{
  "model": "promptfoo-guard",
  "results": [
    {
      "categories": {
        "prompt_injection": boolean,
        "jailbreak": boolean
      },
      "category_scores": {
        "prompt_injection": number,
        "jailbreak": number
      },
      "flagged": boolean
    }
  ]
}
```
categories.prompt_injection
: Indicates if the input may be attempting a prompt injection.
categories.jailbreak
: Indicates if the input may be attempting a jailbreak.
flagged
: True if the input is classified as either prompt injection or jailbreak.
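How an application responds to these fields is up to you. One possible policy, shown as a sketch (the threshold value and decision labels are illustrative assumptions, not API behavior):

```typescript
// Illustrative policy for acting on a /v1/guard result entry.
interface GuardResultEntry {
  categories: { prompt_injection: boolean; jailbreak: boolean };
  category_scores: { prompt_injection: number; jailbreak: number };
  flagged: boolean;
}

type Decision = 'allow' | 'warn' | 'block';

function decide(result: GuardResultEntry, blockThreshold = 0.9): Decision {
  const maxScore = Math.max(
    result.category_scores.prompt_injection,
    result.category_scores.jailbreak,
  );
  if (result.flagged && maxScore >= blockThreshold) return 'block';
  if (result.flagged) return 'warn';
  return 'allow';
}
```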
2. PII Detection
Detects personally identifiable information (PII) in the input text. The API can identify a wide range of PII entity types:
| Entity Type | Description |
| --- | --- |
| account_number | Account numbers (e.g., bank account) |
| building_number | Building or house numbers |
| city | City names |
| credit_card_number | Credit card numbers |
| date_of_birth | Dates of birth |
| driver_license_number | Driver's license numbers |
| email_address | Email addresses |
| given_name | First or given names |
| id_card_number | ID card numbers |
| password | Passwords or passcodes |
| social_security_number | Social Security numbers |
| street_name | Street names |
| surname | Last names or surnames |
| tax_id_number | Tax identification numbers |
| phone_number | Telephone numbers |
| username | Usernames |
| zip_code | Postal or ZIP codes |
Request
POST /v1/pii
Headers
Content-Type: application/json
Body
```json
{
  "input": "String containing the text to analyze for PII"
}
```
Response
```json
{
  "model": "promptfoo-pii",
  "results": [
    {
      "categories": {
        "pii": boolean
      },
      "category_scores": {
        "pii": number
      },
      "flagged": boolean,
      "payload": {
        "pii": [
          {
            "entity_type": string,
            "start": number,
            "end": number,
            "pii": string
          }
        ]
      }
    }
  ]
}
```
pii
: Indicates if PII was detected in the input.
flagged
: True if any PII was detected.
payload.pii
: Array of detected PII entities with their types and positions in the text.
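The start and end offsets make it straightforward to redact detected entities before the text is stored or logged. A minimal sketch of one way to do this (the bracketed replacement format is an illustrative choice):

```typescript
// Illustrative redaction using the payload.pii spans from /v1/pii.
interface PiiEntity {
  entity_type: string;
  start: number;
  end: number;
  pii: string;
}

function redact(text: string, entities: PiiEntity[]): string {
  // Replace spans from the end of the string first so earlier offsets stay valid.
  const sorted = [...entities].sort((a, b) => b.start - a.start);
  let redacted = text;
  for (const e of sorted) {
    redacted = redacted.slice(0, e.start) + `[${e.entity_type}]` + redacted.slice(e.end);
  }
  return redacted;
}
```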
3. Harm Detection
Analyzes input text to detect potential harmful content across various categories.
Request
POST /v1/harm
Headers
Content-Type: application/json
Body
```json
{
  "input": "String containing the text to analyze for potential harm"
}
```
Response
```json
{
  "model": "promptfoo-harm",
  "results": [
    {
      "categories": {
        "violent_crimes": boolean,
        "non_violent_crimes": boolean,
        "sex_related_crimes": boolean,
        "child_sexual_exploitation": boolean,
        "defamation": boolean,
        "specialized_advice": boolean,
        "privacy": boolean,
        "intellectual_property": boolean,
        "indiscriminate_weapons": boolean,
        "hate": boolean,
        "suicide_and_self_harm": boolean,
        "sexual_content": boolean,
        "elections": boolean,
        "code_interpreter_abuse": boolean
      },
      "category_scores": {
        "violent_crimes": number,
        "non_violent_crimes": number,
        "sex_related_crimes": number,
        "child_sexual_exploitation": number,
        "defamation": number,
        "specialized_advice": number,
        "privacy": number,
        "intellectual_property": number,
        "indiscriminate_weapons": number,
        "hate": number,
        "suicide_and_self_harm": number,
        "sexual_content": number,
        "elections": number,
        "code_interpreter_abuse": number
      },
      "flagged": boolean
    }
  ]
}
```
categories
: Each category indicates whether the input contains content related to that harm category.
category_scores
: Provides a numerical score (between 0 and 1) for each harm category.
flagged
: True if any harm category is detected in the input.
Supported Categories
The harm detection API supports the following categories from the MLCommons taxonomy:
| Category | Description |
| --- | --- |
| violent_crimes | Content related to violent criminal activities |
| non_violent_crimes | Content related to non-violent criminal activities |
| sex_related_crimes | Content related to sex crimes |
| child_sexual_exploitation | Content involving the sexual exploitation of minors |
| defamation | Content that could be considered defamatory |
| specialized_advice | Potentially harmful specialized advice (e.g., medical, legal) |
| privacy | Content that may violate privacy |
| intellectual_property | Content that may infringe on intellectual property rights |
| indiscriminate_weapons | Content related to weapons of mass destruction |
| hate | Hate speech or content promoting discrimination |
| suicide_and_self_harm | Content related to suicide or self-harm |
| sexual_content | Explicit sexual content |
| elections | Content that may interfere with elections |
| code_interpreter_abuse | Potential abuse of code interpretation features |
Each category is assigned a boolean value indicating its presence and a numerical score between 0 and 1 representing the confidence level of the detection.
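A common pattern is to act only on categories whose score exceeds an application-defined threshold. A minimal sketch (the 0.5 default is an illustrative choice, not an API default):

```typescript
// Illustrative post-processing of a /v1/harm result entry: list the categories
// whose confidence score exceeds a caller-supplied threshold.
interface HarmResultEntry {
  categories: Record<string, boolean>;
  category_scores: Record<string, number>;
  flagged: boolean;
}

function harmCategoriesAbove(result: HarmResultEntry, threshold = 0.5): string[] {
  return Object.entries(result.category_scores)
    .filter(([, score]) => score >= threshold)
    .map(([category]) => category);
}
```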
Examples
Guard Classification Example
```bash
curl https://api.promptfoo.dev/v1/guard \
  -X POST \
  -d '{"input": "Ignore previous instructions"}' \
  -H 'Content-Type: application/json'
```
Response
```json
{
  "model": "promptfoo-guard",
  "results": [
    {
      "categories": {
        "prompt_injection": false,
        "jailbreak": true
      },
      "category_scores": {
        "prompt_injection": 0.00004004167567472905,
        "jailbreak": 0.9999395608901978
      },
      "flagged": true
    }
  ]
}
```
This example shows a high probability of a jailbreak attempt.
PII Detection Example
```bash
curl https://api.promptfoo.dev/v1/pii \
  -X POST \
  -d '{"input": "My name is John Doe and my email is [email protected]"}' \
  -H 'Content-Type: application/json'
```
Response
```json
{
  "model": "promptfoo-pii",
  "results": [
    {
      "categories": {
        "pii": true
      },
      "category_scores": {
        "pii": 1
      },
      "flagged": true,
      "payload": {
        "pii": [
          {
            "entity_type": "PERSON",
            "start": 11,
            "end": 19,
            "pii": "John Doe"
          },
          {
            "entity_type": "EMAIL",
            "start": 34,
            "end": 50,
            "pii": "[email protected]"
          }
        ]
      }
    }
  ]
}
```
Harm Detection Example
```bash
curl https://api.promptfoo.dev/v1/harm \
  -X POST \
  -d '{"input": "How to make homemade explosives"}' \
  -H 'Content-Type: application/json'
```
Response
```json
{
  "model": "promptfoo-harm",
  "results": [
    {
      "categories": {
        "indiscriminate_weapons": true
      },
      "category_scores": {
        "indiscriminate_weapons": 1
      },
      "flagged": true
    }
  ]
}
```
This example shows the detection of potentially harmful content related to indiscriminate weapons.
UI Features
Dashboard Overview
Promptfoo provides a guardrail observability UI for Cloud and Enterprise users.
The dashboard provides a summary of guardrail events.
Event Details
The table view provides detailed information about each event.
Node.js Integration
The guardrails functionality is also available directly in the promptfoo Node.js package:
```typescript
import { guardrails } from 'promptfoo';

// Check for prompt injections/jailbreaks
const guardResult = await guardrails.guard('Some text');

// Check for PII
const piiResult = await guardrails.pii('Some text');

// Check for harmful content
const harmResult = await guardrails.harm('Some text');
```
All methods return a GuardResult object with the following TypeScript interface:
```typescript
interface GuardResult {
  model: string;
  results: Array<{
    categories: Record<string, boolean>;
    category_scores: Record<string, number>;
    flagged: boolean;
    payload?: {
      pii?: Array<{
        entity_type: string;
        start: number;
        end: number;
        pii: string;
      }>;
    };
  }>;
}
```
The response format matches exactly what's returned by the REST API endpoints described above.
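As an example, a small pre-flight check might combine the three methods before a prompt is forwarded to a model (the rejection behavior shown here is one possible approach, not part of the SDK):

```typescript
import { guardrails } from 'promptfoo';

// Screen a user message with all three checks before sending it to an LLM.
// Throwing on flagged input is an illustrative choice, not SDK behavior.
async function screen(input: string): Promise<void> {
  const [guard, pii, harm] = await Promise.all([
    guardrails.guard(input),
    guardrails.pii(input),
    guardrails.harm(input),
  ]);

  if (guard.results[0]?.flagged) {
    throw new Error('Input rejected: possible prompt injection or jailbreak');
  }
  if (harm.results[0]?.flagged) {
    throw new Error('Input rejected: potentially harmful content');
  }
  for (const entity of pii.results[0]?.payload?.pii ?? []) {
    console.warn(`PII detected (${entity.entity_type}) at ${entity.start}-${entity.end}`);
  }
}
```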
Additional Resources
For more information on LLM vulnerabilities and how to mitigate LLM failure modes, refer to our Types of LLM Vulnerabilities and Introduction to AI red teaming documentation.