Targeted Model Editing Jailbreak
Research Paper
Model-Editing-Based Jailbreak against Safety-aligned Large Language Models
Description: Targeted Model Editing (TME) is a white-box attack that bypasses the safety filters of large language models (LLMs) by minimally altering internal model structures, specifically MLP-layer weight matrices, without modifying inputs. The attack identifies safety-critical transformations (SCTs) within those matrices and removes them, after which the edited LLM responds to malicious queries with harmful outputs.
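To make the mechanism concrete, the following is a minimal, illustrative sketch of the kind of white-box MLP edit the description refers to: projecting a single direction out of one layer's down-projection matrix. It does not reproduce the paper's TME/SCT-identification procedure; the model name, layer index, and the `safety_direction` vector (which TME would derive from its own analysis) are placeholder assumptions.

```python
# Illustrative sketch only; not the paper's TME algorithm.
# Model name, layer index, and `safety_direction` are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def remove_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Rank-1 edit: zero the component of a weight matrix along one output direction.

    `direction` lives in the matrix's output (row) space; everything orthogonal
    to it is left untouched, so the edit is a minimal structural change.
    """
    d = direction / direction.norm()
    return weight - torch.outer(d, d @ weight)

# Hypothetical: a vector identified (by an SCT-style analysis not shown here)
# as carrying a safety-critical transformation in this layer's MLP.
layer_idx = 20
down_proj = model.model.layers[layer_idx].mlp.down_proj
safety_direction = torch.randn(down_proj.weight.shape[0], dtype=down_proj.weight.dtype)

with torch.no_grad():
    down_proj.weight.copy_(remove_direction(down_proj.weight, safety_direction))
```

Because the edit is a low-rank change to a single layer, the model's behavior on benign inputs can remain largely intact, which is what makes this class of modification hard to notice from outputs alone.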
Examples: See the paper's repository https://sites.google.com/view/d-llm for the code and dataset. The paper includes examples of malicious prompts and the resulting harmful responses from the attacked LLMs.
Impact: Successful exploitation allows attackers to circumvent an LLM's safety mechanisms and elicit harmful or undesired responses to unmodified malicious prompts. This compromises the model's intended safety and reliability, potentially leading to the generation of illegal, unethical, or harmful content. The attack is stealthy because it leaves the input prompts unchanged.
Affected Systems: Open-source LLMs with a decoder-only architecture, including but not limited to Llama-2-7b-chat, Llama-3-8b-Instruct, gemma-2-9b-it, and Mistral-7b-Instruct. The technique potentially applies to any LLM whose internal MLP-layer matrices an attacker can read and modify.
Mitigation Steps:
- Investigate and implement model architectures more resistant to model editing attacks, such as Mixture of Experts (MoE) models.
- Develop more robust safety mechanisms that are less susceptible to removal or alteration through targeted model modifications.
- Regularly audit LLMs for signs of unauthorized modification or tampering, for example by verifying parameter integrity against a trusted baseline (see the sketch after this list).
- Strengthen access control to prevent unauthorized access to model parameters.
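As a concrete starting point for the auditing step above, the sketch below compares per-tensor SHA-256 fingerprints of a deployed model against a trusted baseline recorded at deployment time. The baseline file name `baseline_fingerprints.json`, the model name, and the fingerprinting helper are illustrative assumptions, not part of any standard tool.

```python
# Minimal parameter-integrity audit sketch; adapt the baseline format and
# model name to your own deployment.
import hashlib
import json

import torch
from transformers import AutoModelForCausalLM

def tensor_fingerprints(model: torch.nn.Module) -> dict:
    """Return a SHA-256 digest of every tensor in the model's state dict."""
    fingerprints = {}
    for name, tensor in model.state_dict().items():
        # Reinterpret the raw bytes so the hash is dtype-agnostic.
        raw = tensor.detach().cpu().contiguous().view(-1).view(torch.uint8)
        fingerprints[name] = hashlib.sha256(raw.numpy().tobytes()).hexdigest()
    return fingerprints

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
current = tensor_fingerprints(model)

# Compare against fingerprints recorded when the model was first deployed.
with open("baseline_fingerprints.json") as f:
    baseline = json.load(f)

tampered = sorted(name for name, digest in current.items() if baseline.get(name) != digest)
if tampered:
    print(f"{len(tampered)} tensors differ from the trusted baseline:", tampered[:10])
else:
    print("No modifications detected.")
```

Per-tensor hashes also localize a detected edit to specific layers, which is useful when investigating a suspected TME-style modification of MLP weights.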