CI/CD Integration for LLM Evaluation and Security
Integrate promptfoo into your CI/CD pipelines to automatically evaluate prompts, test for security vulnerabilities, and ensure quality before deployment. This guide covers modern CI/CD workflows for both quality testing and security scanning.
Why CI/CD for LLM Apps?
- Catch regressions early - Test prompt changes before they reach production
- Security scanning - Automated red teaming and vulnerability detection
- Quality gates - Enforce minimum performance thresholds
- Compliance - Generate reports for OWASP, NIST, and other frameworks
- Cost control - Track token usage and API costs over time
Quick Start
If you're using GitHub Actions, check out our dedicated GitHub Actions guide or the GitHub Marketplace action.
For other platforms, here's a basic example:
# Run eval (no global install required)
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
# Run security scan (red teaming)
npx promptfoo@latest redteam run
Prerequisites
- Node.js 18+ installed in your CI environment
- LLM provider API keys (stored as secure environment variables)
- A promptfoo configuration file (promptfooconfig.yaml)
- (Optional) Docker for containerized environments
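If you don't have a config yet, a minimal promptfooconfig.yaml might look like this (the provider, prompt, and assertion below are illustrative values, not requirements):
prompts:
  - 'Summarize in one sentence: {{text}}'
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      text: 'Promptfoo runs LLM evals and red teaming in CI.'
    assert:
      - type: icontains
        value: 'promptfoo'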
Core Concepts
1. Eval vs Red Teaming
Promptfoo supports two main CI/CD workflows:
Eval - Test prompt quality and performance:
npx promptfoo@latest eval -c promptfooconfig.yaml
Red Teaming - Security vulnerability scanning:
npx promptfoo@latest redteam run
See our red team quickstart for security testing details.
2. Output Formats
Promptfoo supports multiple output formats for different CI/CD needs:
# JSON for programmatic processing
npx promptfoo@latest eval -o results.json
# HTML for human-readable reports
npx promptfoo@latest eval -o report.html
# XML for enterprise tools
npx promptfoo@latest eval -o results.xml
# Multiple formats
npx promptfoo@latest eval -o results.json -o report.html
Learn more about output formats and processing.
SonarQube integration is available in Promptfoo Enterprise. Use the standard JSON output format and process it for SonarQube import.
3. Quality Gates
Fail the build when quality thresholds aren't met:
# Fail on any test failures
npx promptfoo@latest eval --fail-on-error
# Custom threshold checking
npx promptfoo@latest eval -o results.json
PASS_RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' results.json)
if (( $(echo "$PASS_RATE < 95" | bc -l) )); then
  echo "Quality gate failed: Pass rate ${PASS_RATE}% < 95%"
  exit 1
fi
See assertions and metrics for comprehensive validation options.
Platform-Specific Guides
GitHub Actions
name: LLM Eval
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'

      - name: Cache promptfoo
        uses: actions/cache@v4
        with:
          path: ~/.cache/promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('prompts/**') }}
          restore-keys: |
            ${{ runner.os }}-promptfoo-

      - name: Run eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PROMPTFOO_CACHE_PATH: ~/.cache/promptfoo
        run: |
          npx promptfoo@latest eval \
            -c promptfooconfig.yaml \
            --share \
            -o results.json \
            -o report.html

      - name: Check quality gate
        run: |
          FAILURES=$(jq '.results.stats.failures' results.json)
          if [ "$FAILURES" -gt 0 ]; then
            echo "❌ Eval failed with $FAILURES failures"
            exit 1
          fi
          echo "✅ All tests passed!"

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: |
            results.json
            report.html
For red teaming in CI/CD:
name: Security Scan
on:
  schedule:
    - cron: '0 0 * * *' # Daily
  workflow_dispatch:

jobs:
  red-team:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run red team scan
        uses: promptfoo/promptfoo-action@v1
        with:
          type: 'redteam'
          config: 'promptfooconfig.yaml'
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
See also: Standalone GitHub Action example.
GitLab CI
See our detailed GitLab CI guide.
image: node:20

evaluate:
  variables:
    OPENAI_API_KEY: ${OPENAI_API_KEY}
    PROMPTFOO_CACHE_PATH: .cache/promptfoo
  script:
    - |
      npx promptfoo@latest eval \
        -c promptfooconfig.yaml \
        --share \
        -o output.json \
        -o output.xml \
        -o report.html
  cache:
    key: ${CI_COMMIT_REF_SLUG}-promptfoo
    paths:
      - .cache/promptfoo
  artifacts:
    reports:
      junit: output.xml
    paths:
      - output.json
      - report.html
Jenkins
See our detailed Jenkins guide.
pipeline {
  agent any

  environment {
    OPENAI_API_KEY = credentials('openai-api-key')
    PROMPTFOO_CACHE_PATH = "${WORKSPACE}/.cache/promptfoo"
  }

  stages {
    stage('Evaluate') {
      steps {
        sh '''
          npx promptfoo@latest eval \
            -c promptfooconfig.yaml \
            --share \
            -o results.json
        '''
      }
    }

    stage('Quality Gate') {
      steps {
        script {
          // readJSON requires the Pipeline Utility Steps plugin
          def results = readJSON file: 'results.json'
          def failures = results.results.stats.failures
          if (failures > 0) {
            error("Eval failed with ${failures} failures")
          }
        }
      }
    }
  }
}
Other Platforms
The same npx commands run anywhere Node.js 18+ is available, including CircleCI, Azure Pipelines, Bitbucket Pipelines, and Travis CI: set your provider API keys as secrets and call npx promptfoo@latest eval in a script step.
Advanced Patterns
1. Docker-based CI/CD
Create a custom Docker image with promptfoo pre-installed:
FROM node:20-slim
WORKDIR /app
# Pre-install promptfoo so each run doesn't re-download it via npx
RUN npm install -g promptfoo
COPY . .
CMD ["promptfoo", "eval"]
2. Parallel Testing
Test multiple models or configurations in parallel:
# GitHub Actions example
strategy:
  matrix:
    model: [gpt-4, gpt-3.5-turbo, claude-3-opus]
steps:
  - name: Test ${{ matrix.model }}
    run: |
      npx promptfoo@latest eval \
        --providers.0.config.model=${{ matrix.model }} \
        -o results-${{ matrix.model }}.json
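A follow-up job can then compare the per-model result files. A rough sketch, assuming the files produced above are in the working directory:
for f in results-*.json; do
  # Derive the model name from the file name, e.g. results-gpt-4.json -> gpt-4
  MODEL="${f#results-}"; MODEL="${MODEL%.json}"
  RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' "$f")
  echo "${MODEL}: ${RATE}% pass rate"
done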
3. Scheduled Security Scans
Run comprehensive security scans on a schedule:
# GitHub Actions
on:
  schedule:
    - cron: '0 2 * * *' # 2 AM daily

jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Full red team scan
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npx promptfoo@latest redteam generate \
            --plugins harmful,pii,contracts \
            --strategies jailbreak,prompt-injection
          npx promptfoo@latest redteam run
4. SonarQube Integration
Direct SonarQube output is available in Promptfoo Enterprise. Open-source users can export results to JSON and transform them into SonarQube's generic issue format:
# Export results for SonarQube processing
- name: Run promptfoo security scan
  run: |
    npx promptfoo@latest eval \
      --config promptfooconfig.yaml \
      -o results.json

# Transform results for SonarQube (custom script required)
- name: Transform for SonarQube
  run: |
    node transform-to-sonarqube.js results.json > sonar-report.json

- name: SonarQube scan
  env:
    SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
  run: |
    sonar-scanner \
      -Dsonar.externalIssuesReportPaths=sonar-report.json
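promptfoo does not ship transform-to-sonarqube.js; a minimal sketch that maps failed outputs onto SonarQube's generic external issues format could look like the following (the rule ID, severity, and filePath are assumptions to adapt to your policy):
const fs = require('fs');

// Usage: node transform-to-sonarqube.js results.json > sonar-report.json
const results = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));

// Map each failed eval output to a SonarQube generic external issue
const issues = results.results.outputs
  .filter((o) => !o.pass)
  .map((o) => ({
    engineId: 'promptfoo',
    ruleId: 'llm-eval-failure', // assumption: one catch-all rule
    severity: 'MAJOR', // assumption: treat all failures as MAJOR
    type: 'CODE_SMELL',
    primaryLocation: {
      message: o.error || 'LLM eval assertion failed',
      filePath: 'promptfooconfig.yaml', // assumption: attribute issues to the config
    },
  }));

process.stdout.write(JSON.stringify({ issues }, null, 2));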
See our SonarQube integration guide for detailed setup.
Processing Results
Parsing JSON Output
The output JSON follows this schema:
interface OutputFile {
  evalId?: string;
  results: {
    stats: {
      successes: number;
      failures: number;
      errors: number;
    };
    outputs: Array<{
      pass: boolean;
      score: number;
      error?: string;
      // ... other fields
    }>;
  };
  config: UnifiedConfig;
  shareableUrl: string | null;
}
Example processing script:
const fs = require('fs');

const results = JSON.parse(fs.readFileSync('results.json', 'utf8'));

// Calculate metrics
const passRate =
  (results.results.stats.successes /
    (results.results.stats.successes + results.results.stats.failures)) *
  100;

console.log(`Pass rate: ${passRate.toFixed(2)}%`);
console.log(`Shareable URL: ${results.shareableUrl}`);

// Check for specific failures
const criticalFailures = results.results.outputs.filter(
  (o) => o.error?.includes('security') || o.error?.includes('injection'),
);

if (criticalFailures.length > 0) {
  console.error('Critical security failures detected!');
  process.exit(1);
}
Posting Results
Post eval results to PR comments, Slack, or other channels:
# Extract and post results
SHARE_URL=$(jq -r '.shareableUrl' results.json)
PASS_RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' results.json)
# Post to GitHub PR
gh pr comment --body "
## Promptfoo Eval Results
- Pass rate: ${PASS_RATE}%
- [View detailed results](${SHARE_URL})
"
Caching Strategies
Optimize CI/CD performance with proper caching:
# Set cache location
env:
  PROMPTFOO_CACHE_PATH: ~/.cache/promptfoo
  PROMPTFOO_CACHE_TTL: 86400 # 24 hours

# Cache configuration
cache:
  key: promptfoo-${{ hashFiles('prompts/**', 'promptfooconfig.yaml') }}
  paths:
    - ~/.cache/promptfoo
Security Best Practices
1. API Key Management
   - Store API keys as encrypted secrets
   - Use least-privilege access controls
   - Rotate keys regularly
2. Network Security
   - Use private runners for sensitive data
   - Restrict outbound network access
   - Consider on-premise deployments for enterprise
3. Data Privacy
   - Enable output stripping for sensitive data:
     export PROMPTFOO_STRIP_RESPONSE_OUTPUT=true
     export PROMPTFOO_STRIP_TEST_VARS=true
4. Audit Logging
   - Keep eval history
   - Track who triggered security scans
   - Monitor for anomalous patterns
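One lightweight way to keep eval history: promptfoo records evals locally, and the CLI's list command can enumerate them so you can archive the listing as a build artifact (a sketch; verify the command against your promptfoo version):
# List recent evals stored in the local promptfoo database
npx promptfoo@latest list evals | tee eval-history.txt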
Troubleshooting
Common Issues
| Issue | Solution |
| --- | --- |
| Rate limits | Enable caching; reduce concurrency with -j 1 |
| Timeouts | Increase provider timeout values; lower --max-concurrency |
| Memory issues | Use streaming mode; process results in batches |
| Cache misses | Check that the cache key includes all relevant files |
Debug Mode
Enable detailed logging:
LOG_LEVEL=debug npx promptfoo@latest eval -c config.yaml
Real-World Examples
Automated Testing Examples
- Self-grading example - Automated LLM evaluation
- Custom grading prompts - Complex evaluation logic
- Store and reuse outputs - Multi-step testing
Security Examples
- Red team starter - Basic security testing
- RAG poisoning tests - Document poisoning detection
- DoNotAnswer dataset - Harmful content detection
Integration Examples
- GitHub Action standalone - Custom GitHub workflows
- JSON output processing - Result parsing patterns
- CSV test data - Bulk test management
Related Documentation
Configuration & Testing
- Configuration Guide - Complete setup instructions
- Test Cases - Writing effective tests
- Assertions & Metrics - Validation strategies
- Python Assertions - Custom Python validators
- JavaScript Assertions - Custom JS validators
Security & Red Teaming
- Red Team Architecture - Security testing framework
- OWASP Top 10 for LLMs - Security compliance
- RAG Security Testing - Testing retrieval systems
- MCP Security Testing - Model Context Protocol security
Enterprise & Scaling
- Enterprise Features - Team collaboration and compliance
- Red Teams in Enterprise - Organization-wide security
- Service Accounts - Automated access