GPT-4v System Prompt Leakage
Research Paper
Jailbreaking gpt-4v via self-adversarial attacks with system prompts
View PaperDescription: A system prompt leakage vulnerability in GPT-4V allows extraction of internal system prompts through carefully crafted, incomplete conversations combined with image input. Extracted prompts can be used as highly effective jailbreak prompts, bypassing safety restrictions and leading to undesirable outputs, including revealing personally identifiable information from images.
Examples: See the paper for details on the "meta-theft prompt" used to extract the system prompt and subsequent jailbreak prompts derived from it. The paper includes specific examples of both (Section 3.1 and 3.3).
Impact: Successful exploitation allows attackers to bypass safety mechanisms in GPT-4V, enabling the model to generate unsafe or harmful outputs, including the identification of individuals in images against the model's intended safety constraints. This compromises user privacy and the integrity of the system.
Affected Systems: GPT-4V (and potentially other models using similar system prompt mechanisms).
Mitigation Steps:
- Implement robust prompt validation and filtering mechanisms to prevent the injection of adversarial prompts designed to extract system prompts.
- Regularly audit and update system prompts to minimize vulnerabilities.
- Employ techniques to minimize the information disclosed in system prompts, ensuring they are functionally necessary without revealing sensitive details.
- Explore methods to detect and prevent attempted extractions of system prompts.
- Develop mechanisms for detecting and mitigating the effects of jailbreak attacks, such as prompt analysis and output filtering.
© 2025 Promptfoo. All rights reserved.