Skip to main content

GOAT Technique for Jailbreaking LLMs

The GOAT (Generative Offensive Agent Tester) strategy is an advanced automated red teaming technique that uses an "attacker" LLM to dynamically generate multi-turn conversations aimed at bypassing a target model's safety measures.

It was introduced by Meta researchers in 2024 and achieves high success rates against modern LLMs by simulating how real users interact with AI systems, with an Attack Success Rate (ASR@10) of 97% against Llama 3.1 and 88% against GPT-4-Turbo on the JailbreakBench dataset.

Use it like so in your promptfooconfig.yaml:

strategies:
- name: goat
options:
maxTurns: 5 # Maximum conversation turns (default)
warning

This is a remote-only strategy and requires an connection to promptfoo's free grading API. Local grading is not supported. Your target model must be able to handle the OpenAI message format.

How It Works

GOAT LLM attack

GOAT uses an attacker LLM that engages in multi-turn conversations with a target model.

The attacker pursues multiple adversarial techniques: output manipulation, safe response distractors, and fictional scenarios. Unlike simpler approaches, GOAT adapts its strategy based on the target model's responses, similar to how human red teamers operate.

Each conversation turn follows a structured three-step reasoning process:

  1. Observation: Analyzes the target model's previous response and identifies triggered safety mechanisms
  2. Strategic Planning: Reflects on conversation progress and develops the next approach
  3. Attack Generation: Selects and combines appropriate techniques to generate the next prompt

This process is looped until we either achieve the goal or reach a maximum number of turns.

GOAT's effectiveness stems from its ability to simulate realistic user behavior while maintaining technical sophistication.

Instead of relying on brute-force approaches or static prompts, its dynamic conversation and reasoning make it particularly effective at identifying vulnerabilities in LLM applications.

For a comprehensive overview of LLM vulnerabilities and red teaming strategies, visit our Types of LLM Vulnerabilities page.