LMVD-ID: 22e2ff7b
Published November 1, 2024

Zeroth-Order MLLM Jailbreak

Affected Models: MiniGPT-4, LLaVA1.5, INF-MLLM1, GPT-4o, Llama 2

Research Paper

Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models

View Paper

Description: A vulnerability in multi-modal large language models (MLLMs) allows attackers to bypass safety mechanisms and elicit harmful responses using a memory-efficient zeroth-order optimization technique. The attack, termed Zer0-Jack, combines simultaneous perturbation stochastic approximation (SPSA) with patch coordinate descent to craft malicious image inputs, estimating gradients solely from the model's outputs and therefore requiring no access to its internal parameters (black-box setting).
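At its core, the technique is standard zeroth-order optimization: perturb one image patch at a time, query the model, and estimate a gradient from the change in a scalar loss computed from the model's output probabilities. The sketch below illustrates SPSA-style patch-wise gradient estimation against an abstract black-box objective; the function and parameter names (`query_loss`, `patch_size`, `mu`, `lr`, `n_samples`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch of SPSA-style zeroth-order gradient estimation with
# patch coordinate descent. `query_loss` stands in for any scalar objective
# derived from a black-box model's API responses; names and hyperparameters
# are assumptions for illustration, not the paper's implementation.

def spsa_patch_step(image, query_loss, top, left,
                    patch_size=32, mu=0.01, lr=0.1, n_samples=4):
    """Estimate the gradient of query_loss w.r.t. one image patch and update it."""
    patch = image[top:top + patch_size, left:left + patch_size, :]
    grad_est = np.zeros_like(patch)

    for _ in range(n_samples):
        # Random +/-1 perturbation direction: SPSA perturbs all coordinates
        # simultaneously instead of issuing one query per coordinate.
        delta = np.random.choice([-1.0, 1.0], size=patch.shape)

        img_plus, img_minus = image.copy(), image.copy()
        img_plus[top:top + patch_size, left:left + patch_size, :] = patch + mu * delta
        img_minus[top:top + patch_size, left:left + patch_size, :] = patch - mu * delta

        # Two black-box queries yield a finite-difference estimate along delta.
        diff = query_loss(img_plus) - query_loss(img_minus)
        grad_est += (diff / (2.0 * mu)) * delta

    grad_est /= n_samples

    # Gradient step restricted to the current patch, clipped to valid pixel range.
    image[top:top + patch_size, left:left + patch_size, :] = np.clip(
        patch - lr * grad_est, 0.0, 1.0
    )
    return image
```

Working one patch at a time is what keeps the memory footprint small: only a single patch-sized perturbation is held at any point, rather than a full-image gradient.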

Examples: See the supplementary materials of the research paper, including the GitHub repository [link to be inserted]. Specific examples demonstrating successful jailbreaks of MiniGPT-4, LLaVA1.5, INF-MLLM1, and GPT-4o are provided.

Impact: Successful exploitation allows attackers to circumvent safety protocols and obtain harmful outputs from MLLMs, including instructions for illegal activities, disclosure of personal information, and generation of biased or toxic content. The attack's memory efficiency makes it feasible to target even very large models.

Affected Systems: Multi-modal Large Language Models (MLLMs), including, but not limited to, MiniGPT-4, LLaVA1.5, INF-MLLM1, and GPT-4o. Potentially affects any MLLM that accepts image inputs and reveals enough information through its API to enable zeroth-order gradient estimation.

Mitigation Steps:

  • Input Sanitization: Implement robust input sanitization and validation mechanisms to detect and filter malicious images. This could include analyzing image content for patterns associated with jailbreaking attacks.
  • API Restrictions: Limit or modify the API's responses to prevent leakage of information useful for zeroth-order gradient estimation. Restrict the types of responses returned and consider reducing the specificity of the probability scores provided (see the sketch after this list).
  • Output Filtering: Employ more sophisticated output filtering and moderation techniques to identify and suppress harmful responses, even those that would otherwise appear natural or plausible. This should incorporate techniques that are robust to paraphrasing and variations in wording.
  • Defense Models: Utilize secondary models to judge the safety of generated outputs before they are delivered to the user. This requires the development of robust safety classifiers that are specific to the type of harmful content being generated.
  • Regular Security Audits: Conduct regular security audits and penetration testing to identify and address potential vulnerabilities in MLLMs and their APIs. These audits should specifically include testing for novel adversarial attack techniques.
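As a concrete illustration of the API-restriction idea above, the sketch below coarsens the token probabilities an API returns, so that small input perturbations no longer produce a measurable change in the reported scores, degrading the signal available for finite-difference gradient estimation. The response structure, field names, and rounding/top-k parameters are assumptions for illustration, not a specific vendor's API.

```python
# Illustrative sketch of coarsening probability outputs before they leave an
# MLLM API. The response layout ("logprobs" / "top_logprobs") and the
# coarsening parameters are assumptions, not a specific vendor's schema.

def coarsen_logprobs(token_logprobs, top_k=5, decimals=1):
    """Keep only the top-k token log-probabilities, rounded to a coarse grid."""
    ranked = sorted(token_logprobs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return {token: round(logprob, decimals) for token, logprob in ranked}


def sanitize_response(response):
    """Apply coarsening to every generated token's log-probability table."""
    for step in response.get("logprobs", []):
        step["top_logprobs"] = coarsen_logprobs(step.get("top_logprobs", {}))
    return response
```

The trade-off is that coarser scores also reduce the usefulness of the API for legitimate calibration and debugging, so the rounding granularity and top-k cutoff should be tuned against benign use cases.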

© 2025 Promptfoo. All rights reserved.