GCG Suffix Data Exfiltration
Research Paper
WHITE PAPER: A Brief Exploration of Data Exfiltration using GCG Suffixes
View PaperDescription: A Cross-Prompt Injection Attack (XPIA) can be amplified by appending a Greedy Coordinate Gradient (GCG) suffix to the malicious injection. This increases the likelihood that a Large Language Model (LLM) will execute the injected instruction, even in the presence of a user's primary instruction, leading to data exfiltration. The success rate of the attack depends on the LLM's complexity; medium-complexity models show increased vulnerability.
Examples: See the white paper for experimental setup and results. The attack involves crafting a malicious injection targeting a specific function (e.g., a network request) and embedding it within third-party data presented to the LLM alongside a user prompt. Appending a GCG suffix to the injection significantly increases the probability of the LLM executing the injection. Specific examples from the dataset are not publicly available.
Impact: Successful exploitation leads to data exfiltration from the user's context, potentially exposing sensitive information such as credentials or Personal Identifiable Information (PII). The financial impact of a successful attack can be significant.
Affected Systems: LLMs vulnerable to XPIA and susceptible to manipulation by GCG suffixes. Specifically, the paper tested Phi-3-mini, GPT-3.5, and GPT-4, showing varying degrees of vulnerability. Other LLMs with similar architecture or training may also be affected.
Mitigation Steps:
- Improved Prompt Filtering: Implement more robust prompt filtering techniques to detect malicious injections, particularly those incorporating GCG suffixes.
- Model Complexity: Consider using more complex LLMs, as they exhibit higher resistance to this attack vector.
- GCG Suffix Detection: Develop methods to specifically identify and neutralize GCG suffixes.
- Function Call Sanitization: Sanitize and validate function calls generated by the LLM before execution, to prevent the misuse of tools that provide external access (e.g., network requests).
- Defense Variation: Implement different defense strategies for LLMs of varying complexities, recognizing that the effectiveness of certain defenses (such as prompt filtering) may vary drastically based on model complexity.
© 2025 Promptfoo. All rights reserved.