End-of-Sentence MLP Jailbreak
Research Paper
Jailbreak Instruction-Tuned LLMs via End-of-Sentence MLP Re-weighting
Description: Instruction-tuned Large Language Models (LLMs) are vulnerable to jailbreaks that manipulate Multi-Layer Perceptron (MLP) neurons during end-of-sentence inferences. By selectively re-weighting these neurons' activations, an attacker can bypass the model's safety mechanisms and elicit harmful responses. The technique is not tied to any specific prompt and generalizes across models, supporting both prompt-specific and prompt-general attacks.
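The mechanism can be illustrated with a short PyTorch sketch (an assumption for illustration, not the paper's exact procedure): a forward hook on selected MLP modules scales their output at the final token position, approximating the per-layer re-weighting the paper visualizes as heatmaps. The module paths follow the Hugging Face LLaMA layout, and the layer indices and scale factors are hypothetical placeholders.

```python
# Conceptual sketch of end-of-sentence MLP re-weighting via forward hooks.
# Layer indices and scale factors below are hypothetical placeholders,
# not values taken from the paper.
import torch
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # one of the affected models
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def make_mlp_hook(scale: float):
    """Scale the MLP output at the last (end-of-sentence) token position only."""
    def hook(module, inputs, output):
        output = output.clone()
        output[:, -1, :] = output[:, -1, :] * scale
        return output  # returning a tensor replaces the module's output
    return hook

# Hypothetical per-layer scales; the paper reports modification scales per layer
# and inference step as heatmaps (figures 1, 2, and 8).
layer_scales = {10: 0.2, 11: 0.2, 12: 0.5}
handles = [
    model.model.layers[i].mlp.register_forward_hook(make_mlp_hook(s))
    for i, s in layer_scales.items()
]

# ... run generation here, then remove the hooks ...
for h in handles:
    h.remove()
```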
Examples: See figures 1, 2, and 8 in the paper. The figures demonstrate how modifying MLP weights during end-of-sentence processing can cause the model to generate harmful responses despite its built-in safety measures. The specific weight modifications are shown as heatmaps indicating the modification scale applied to each layer and inference step.
Impact: Successful exploitation allows attackers to bypass the safety restrictions of instruction-tuned LLMs, enabling the generation of harmful content such as misinformation, malicious code, or instructions for illegal and harmful activities. A prompt-general attack circumvents safety protocols for any prompt without requiring further modification.
Affected Systems: The vulnerability affects a range of open-source instruction-tuned LLMs, from 2B to 72B parameters, including models based on the LLaMA architecture. The paper specifically evaluates LLaMA-3 8B-Instruct and models from the Gemma family; other instruction-tuned models may also be vulnerable.
Mitigation Steps:
- Improved Safety Mechanisms: Develop more robust safety mechanisms that are less susceptible to manipulation of individual MLP weights. This could involve using more distributed safety checks throughout the model's architecture.
- Regularization and Robustness Training: Incorporate regularization during training that penalizes extreme weight adjustments in the end-of-sentence MLP layers, making them more robust to adversarial manipulation (see the sketch after this list).
- Alternative Safety Evaluation Mechanisms: Investigate architectures that use different, more resilient mechanisms for evaluating input safety, such as layer-wise attention or other distributed checks, rather than relying heavily on the end-of-sentence MLP layers for this task.
- Input Sanitization and Filtering: Implement comprehensive input sanitization and filtering techniques to prevent the injection of adversarial modifications. Although this alone cannot address the vulnerability completely, it adds a layer of defense.
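As a rough illustration of the regularization idea above (an assumption sketched here, not a method from the paper), a training-time penalty could anchor MLP parameters to a reference checkpoint so that the targeted weight adjustments the attack relies on become costly and easier to detect. The function name and the `.mlp.` parameter-name filter are hypothetical.

```python
# Illustrative anchor penalty on MLP parameters (hypothetical mitigation sketch).
import torch

def mlp_anchor_penalty(model: torch.nn.Module,
                       reference_state: dict,
                       weight: float = 1e-3) -> torch.Tensor:
    """L2 penalty on drift of MLP parameters away from a reference checkpoint."""
    device = next(model.parameters()).device
    penalty = torch.zeros((), device=device)
    for name, param in model.named_parameters():
        if ".mlp." in name and name in reference_state:
            penalty = penalty + (param - reference_state[name].to(device)).pow(2).sum()
    return weight * penalty

# Hypothetical usage inside a fine-tuning step:
#   loss = task_loss + mlp_anchor_penalty(model, reference_state)
#   loss.backward()
```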
Note: The specific mitigation methods remain a subject of ongoing research and development, as the paper highlights the need for a better understanding of LLMs' internal mechanisms.