CI/CD Integration for LLM Evaluation and Security
Integrate promptfoo into your CI/CD pipelines to automatically evaluate prompts, test for security vulnerabilities, and ensure quality before deployment. This guide covers modern CI/CD workflows for both quality testing and security scanning.
Why CI/CD for LLM Apps?
- Catch regressions early - Test prompt changes before they reach production
- Security scanning - Automated red teaming and vulnerability detection
- Quality gates - Enforce minimum performance thresholds
- Compliance - Generate reports for OWASP, NIST, and other frameworks
- Cost control - Track token usage and API costs over time
Quick Start
If you're using GitHub Actions, check out our dedicated GitHub Actions guide or the GitHub Marketplace action.
For other platforms, here's a basic example:
# Run eval (no global install required)
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
# Run security scan (red teaming)
npx promptfoo@latest redteam run
Prerequisites
- Node.js 18+ installed in your CI environment
- LLM provider API keys (stored as secure environment variables)
- A promptfoo configuration file (promptfooconfig.yaml)
- (Optional) Docker for containerized environments
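If you don't have a config yet, a minimal promptfooconfig.yaml might look like this (the provider, prompt, and assertion below are illustrative values, not requirements):
prompts:
  - 'Summarize in one sentence: {{text}}'
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      text: 'Promptfoo runs LLM evals and red teaming in CI.'
    assert:
      - type: icontains
        value: 'promptfoo'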
Core Concepts
1. Eval vs Red Teaming
Promptfoo supports two main CI/CD workflows:
Eval - Test prompt quality and performance:
npx promptfoo@latest eval -c promptfooconfig.yaml
Red Teaming - Security vulnerability scanning:
npx promptfoo@latest redteam run
See our red team quickstart for security testing details.
2. Output Formats
Promptfoo supports multiple output formats for different CI/CD needs:
# JSON for programmatic processing
npx promptfoo@latest eval -o results.json
# HTML for human-readable reports
npx promptfoo@latest eval -o report.html
# XML for enterprise tools
npx promptfoo@latest eval -o results.xml
# Multiple formats
npx promptfoo@latest eval -o results.json -o report.html
Learn more about output formats and processing.
SonarQube integration is available in Promptfoo Enterprise. Use the standard JSON output format and process it for SonarQube import.
3. Quality Gates
Fail the build when quality thresholds aren't met:
# Fail on any test failures
npx promptfoo@latest eval --fail-on-error
# Custom threshold checking
npx promptfoo@latest eval -o results.json
PASS_RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' results.json)
if (( $(echo "$PASS_RATE < 95" | bc -l) )); then
  echo "Quality gate failed: Pass rate ${PASS_RATE}% < 95%"
  exit 1
fi
See assertions and metrics for comprehensive validation options.
Platform-Specific Guides
GitHub Actions
name: LLM Eval
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'

      - name: Cache promptfoo
        uses: actions/cache@v4
        with:
          path: ~/.cache/promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('prompts/**') }}
          restore-keys: |
            ${{ runner.os }}-promptfoo-

      - name: Run eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PROMPTFOO_CACHE_PATH: ~/.cache/promptfoo
        run: |
          npx promptfoo@latest eval \
            -c promptfooconfig.yaml \
            --share \
            -o results.json \
            -o report.html

      - name: Check quality gate
        run: |
          FAILURES=$(jq '.results.stats.failures' results.json)
          if [ "$FAILURES" -gt 0 ]; then
            echo "❌ Eval failed with $FAILURES failures"
            exit 1
          fi
          echo "✅ All tests passed!"

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: |
            results.json
            report.html
For red teaming in CI/CD:
name: Security Scan
on:
  schedule:
    - cron: '0 0 * * *' # Daily
  workflow_dispatch:

jobs:
  red-team:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run red team scan
        uses: promptfoo/promptfoo-action@v1
        with:
          type: 'redteam'
          config: 'promptfooconfig.yaml'
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
See also: Standalone GitHub Action example.
GitLab CI
See our detailed GitLab CI guide.
image: node:20

evaluate:
  variables:
    OPENAI_API_KEY: ${OPENAI_API_KEY}
    PROMPTFOO_CACHE_PATH: .cache/promptfoo
  script:
    - |
      npx promptfoo@latest eval \
        -c promptfooconfig.yaml \
        --share \
        -o output.json \
        -o output.xml \
        -o report.html
  cache:
    key: ${CI_COMMIT_REF_SLUG}-promptfoo
    paths:
      - .cache/promptfoo
  artifacts:
    reports:
      junit: output.xml
    paths:
      - output.json
      - report.html
Jenkins
See our detailed Jenkins guide.
pipeline {
  agent any

  environment {
    OPENAI_API_KEY = credentials('openai-api-key')
    PROMPTFOO_CACHE_PATH = "${WORKSPACE}/.cache/promptfoo"
  }

  stages {
    stage('Evaluate') {
      steps {
        sh '''
          npx promptfoo@latest eval \
            -c promptfooconfig.yaml \
            --share \
            -o results.json
        '''
      }
    }

    stage('Quality Gate') {
      steps {
        script {
          // readJSON requires the Pipeline Utility Steps plugin
          def results = readJSON file: 'results.json'
          def failures = results.results.stats.failures
          if (failures > 0) {
            error("Eval failed with ${failures} failures")
          }
        }
      }
    }
  }
}
Other Platforms
The same npx commands run anywhere Node.js 18+ is available, including CircleCI, Azure Pipelines, Bitbucket Pipelines, and Travis CI: set your provider API keys as secrets and call npx promptfoo@latest eval in a script step.
Advanced Patterns
1. Docker-based CI/CD
Create a custom Docker image with promptfoo pre-installed:
FROM node:20-slim
WORKDIR /app
# Pre-install promptfoo so each run doesn't re-download it via npx
RUN npm install -g promptfoo
COPY . .
CMD ["promptfoo", "eval"]
2. Parallel Testing
Test multiple models or configurations in parallel:
# GitHub Actions example
strategy:
  matrix:
    model: [gpt-4, gpt-3.5-turbo, claude-3-opus]
steps:
  - name: Test ${{ matrix.model }}
    run: |
      npx promptfoo@latest eval \
        --providers.0.config.model=${{ matrix.model }} \
        -o results-${{ matrix.model }}.json
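A follow-up job can then compare the per-model result files. A rough sketch, assuming the files produced above are in the working directory:
for f in results-*.json; do
  # Derive the model name from the file name, e.g. results-gpt-4.json -> gpt-4
  MODEL="${f#results-}"; MODEL="${MODEL%.json}"
  RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' "$f")
  echo "${MODEL}: ${RATE}% pass rate"
done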
3. Scheduled Security Scans
Run comprehensive security scans on a schedule:
# GitHub Actions
on:
  schedule:
    - cron: '0 2 * * *' # 2 AM daily

jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Full red team scan
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npx promptfoo@latest redteam generate \
            --plugins harmful,pii,contracts \
            --strategies jailbreak,prompt-injection
          npx promptfoo@latest redteam run
4. SonarQube Integration
Direct SonarQube output is available in Promptfoo Enterprise. Open-source users can export results to JSON and transform them into SonarQube's generic issue format:
# Export results for SonarQube processing
- name: Run promptfoo security scan
  run: |
    npx promptfoo@latest eval \
      --config promptfooconfig.yaml \
      -o results.json

# Transform results for SonarQube (custom script required)
- name: Transform for SonarQube
  run: |
    node transform-to-sonarqube.js results.json > sonar-report.json

- name: SonarQube scan
  env:
    SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
  run: |
    sonar-scanner \
      -Dsonar.externalIssuesReportPaths=sonar-report.json
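promptfoo does not ship transform-to-sonarqube.js; a minimal sketch that maps failed outputs onto SonarQube's generic external issues format could look like the following (the rule ID, severity, and filePath are assumptions to adapt to your policy):
const fs = require('fs');

// Usage: node transform-to-sonarqube.js results.json > sonar-report.json
const results = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));

// Map each failed eval output to a SonarQube generic external issue
const issues = results.results.outputs
  .filter((o) => !o.pass)
  .map((o) => ({
    engineId: 'promptfoo',
    ruleId: 'llm-eval-failure', // assumption: one catch-all rule
    severity: 'MAJOR', // assumption: treat all failures as MAJOR
    type: 'CODE_SMELL',
    primaryLocation: {
      message: o.error || 'LLM eval assertion failed',
      filePath: 'promptfooconfig.yaml', // assumption: attribute issues to the config
    },
  }));

process.stdout.write(JSON.stringify({ issues }, null, 2));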
See our SonarQube integration guide for detailed setup.
Processing Results
Parsing JSON Output
The output JSON follows this schema:
interface OutputFile {
  evalId?: string;
  results: {
    stats: {
      successes: number;
      failures: number;
      errors: number;
    };
    outputs: Array<{
      pass: boolean;
      score: number;
      error?: string;
      // ... other fields
    }>;
  };
  config: UnifiedConfig;
  shareableUrl: string | null;
}
Example processing script:
const fs = require('fs');

const results = JSON.parse(fs.readFileSync('results.json', 'utf8'));

// Calculate metrics
const passRate =
  (results.results.stats.successes /
    (results.results.stats.successes + results.results.stats.failures)) *
  100;

console.log(`Pass rate: ${passRate.toFixed(2)}%`);
console.log(`Shareable URL: ${results.shareableUrl}`);

// Check for specific failures
const criticalFailures = results.results.outputs.filter(
  (o) => o.error?.includes('security') || o.error?.includes('injection'),
);

if (criticalFailures.length > 0) {
  console.error('Critical security failures detected!');
  process.exit(1);
}
Posting Results
Post eval results to PR comments, Slack, or other channels:
# Extract and post results
SHARE_URL=$(jq -r '.shareableUrl' results.json)
PASS_RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' results.json)
# Post to GitHub PR
gh pr comment --body "
## Promptfoo Eval Results
- Pass rate: ${PASS_RATE}%
- [View detailed results](${SHARE_URL})
"
Caching Strategies
Optimize CI/CD performance with proper caching:
# Set cache location
env:
  PROMPTFOO_CACHE_PATH: ~/.cache/promptfoo
  PROMPTFOO_CACHE_TTL: 86400 # 24 hours

# Cache configuration
cache:
  key: promptfoo-${{ hashFiles('prompts/**', 'promptfooconfig.yaml') }}
  paths:
    - ~/.cache/promptfoo
Security Best Practices
1. API Key Management
   - Store API keys as encrypted secrets
   - Use least-privilege access controls
   - Rotate keys regularly
2. Network Security
   - Use private runners for sensitive data
   - Restrict outbound network access
   - Consider on-premise deployments for enterprise
3. Data Privacy
   - Enable output stripping for sensitive data:
     export PROMPTFOO_STRIP_RESPONSE_OUTPUT=true
     export PROMPTFOO_STRIP_TEST_VARS=true
4. Audit Logging
   - Keep eval history
   - Track who triggered security scans
   - Monitor for anomalous patterns
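One lightweight way to keep eval history: promptfoo records evals locally, and the CLI's list command can enumerate them so you can archive the listing as a build artifact (a sketch; verify the command against your promptfoo version):
# List recent evals stored in the local promptfoo database
npx promptfoo@latest list evals | tee eval-history.txt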
Troubleshooting
Common Issues
| Issue | Solution |
| --- | --- |
| Rate limits | Enable caching; reduce concurrency with -j 1 |
| Timeouts | Increase provider timeout values; lower --max-concurrency |
| Memory issues | Use streaming mode; process results in batches |
| Cache misses | Check that the cache key includes all relevant files |
Debug Mode
Enable detailed logging:
LOG_LEVEL=debug npx promptfoo@latest eval -c config.yaml
Real-World Examples
Automated Testing Examples
- Self-grading example - Automated LLM evaluation
- Custom grading prompts - Complex evaluation logic
- Store and reuse outputs - Multi-step testing
Security Examples
- Red team starter - Basic security testing
- RAG poisoning tests - Document poisoning detection
- DoNotAnswer dataset - Harmful content detection
Integration Examples
- GitHub Action standalone - Custom GitHub workflows
- JSON output processing - Result parsing patterns
- CSV test data - Bulk test management
Related Documentation
Configuration & Testing
- Configuration Guide - Complete setup instructions
- Test Cases - Writing effective tests
- Assertions & Metrics - Validation strategies
- Python Assertions - Custom Python validators
- JavaScript Assertions - Custom JS validators
Security & Red Teaming
- Red Team Architecture - Security testing framework
- OWASP Top 10 for LLMs - Security compliance
- RAG Security Testing - Testing retrieval systems
- MCP Security Testing - Model Context Protocol security
Enterprise & Scaling
- Enterprise Features - Team collaboration and compliance
- Red Teams in Enterprise - Organization-wide security
- Service Accounts - Automated access