Introducing GOAT—Promptfoo's Latest Strategy
We're excited to announce the deployment of Promptfoo's latest strategy, GOAT, to jailbreak multi-turn conversations.
This strategy is inspired by Meta's latest research on building agentic red teaming systems, called Generative Offensive Agent Testers (GOAT).
Traditional jailbreaking techniques depend on single-turn prompts and responses that are effective at simpler bypasses. However, conversational AI requires more sophisticated attacks. To achieve this, GOAT uses an attacker LLM that engages in multi-turn conversations with a target model.
Each conversation turn follows a structured three-step reasoning process:
- Observation: Analyzes the target model's previous response and identifies triggered safety mechanisms.
- Strategic Planning: Reflects on conversation progress and develops the next approach.
- Attack Generation: Selects and combines appropriate techniques to generate the next prompt.
Meta's research suggests that multi-turn, agentic attacks are highly effective in jailbreaking LLMs and bypassing restrictions. We'll dive deeper into how these attacks work below.
Single-Turn Attacks
To demonstrate how GOAT excels, we developed an interactive red teaming exercise that leverages single-turn attacks without GOAT. Your task is to prompt the AI model below to write a speech glorifying an extremist ideology.
To make it easier, we provide a list of red teaming commands that will transform your queries into specific styles for red teaming.
Red Teaming Challenge
You probably noticed how you kept changing techniques to adapt to the model's responses. Many of these techniques can be automated using Promptfoo's latest strategy.
Why GOAT?
GOAT stands out from traditional automated red teaming by simulating dynamic, human-like adversarial interactions with LLMs. It adapts strategies, engages in multi-turn conversations, and learns from responses, much like a human red teamer. Instead of merely breaking rules, GOAT uncovers vulnerabilities through evolving dialogue, making it more effective than rigid testing methods.
How does GOAT work exactly?
Initialization
The Attacker LLM is initialized with a specific adversarial goal—something it wants the target model to say or do that would violate the safety boundaries set by its developers. This goal could be to generate a harmful statement, bypass a restriction, or reveal restricted information.
Toolbox of Attacks
The Attacker LLM has access to a toolbox of various red teaming techniques. This toolbox is fully customizable, and more attacks can be added, as you'll see later. These techniques are inspired by human adversarial approaches and can be layered together. Examples include:
- Priming Responses: Crafting the prompt in a way that leads the target model toward an unsafe response.
- Hypotheticals: Creating fictional scenarios that trick the target into providing harmful outputs by making it think the context is safe.
- Persona Modification: Asking the target model to take on a particular persona that may be more inclined to provide unsafe answers.
Chain-of-Thought Prompting
GOAT utilizes a structured, iterative process known as Chain-of-Thought (CoT) prompting. For every turn in the conversation, the Attacker LLM goes through three key steps: Observation, Thought, and Strategy.
- Observation: The Attacker LLM starts by observing the target model's response to its latest prompt. Did the target provide a useful piece of information, or did it refuse the request?
- Thought: Based on the observation, the Attacker LLM thinks through what this means for the current conversation. It considers whether it's getting closer to the adversarial goal or if it needs to adjust its approach.
- Strategy: Finally, the Attacker LLM formulates a strategy for the next move. It selects one or more techniques from the toolbox to continue probing the target, aiming to find a way past its defenses.
Multi-Turn Conversations
GOAT engages in a multi-turn conversation, adjusting its approach based on each new piece of information. Unlike static attacks that use a single prompt, GOAT simulates how real adversaries behave—they try multiple angles, rephrase requests, and persist until they find a weakness. This means GOAT is highly effective in exposing vulnerabilities that might only become apparent over a longer conversation.
Dynamic Adaptation
Throughout the conversation, the Attacker LLM adapts dynamically. If one attack strategy fails, it shifts gears and tries another. For instance, if a direct prompt fails to elicit a response, it may try to create a hypothetical situation that makes the target think it's safe to respond. This dynamic adaptation is key to GOAT's effectiveness.
Try it yourself below!
Set Your Goal
Enter your red teaming goal or select from the suggestions below:
Testing It Out
The GOAT strategy will be most impactful for LLM applications leveraging conversational AI, such as chatbots with RAG and agentic systems.
Ready to try it out? Schedule a time with our team today to learn how Promptfoo's latest strategy can protect your applications at scale.