Unlearning Relearning Attack
Research Paper
Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond
Description: Standard Large Language Model (LLM) unlearning techniques, specifically Negative Preference Optimization (NPO), Gradient Difference (GradDiff), and Representation Misdirection for Unlearning (RMU), fail to sufficiently flatten the loss landscape around the unlearned model's weights. This sharp loss landscape enables a "Relearning Attack": an attacker can fully restore the unlearned capabilities (such as hazardous knowledge, sensitive data, or copyrighted material) through lightweight fine-tuning of the unlearned model. The restoration requires remarkably few samples (as few as 20 to 125) drawn from the original forget set, or even from unrelated datasets, effectively negating the unlearning process.
Examples:
- WMDP Bio Relearning:
- Take a Zephyr-7B-beta model that has undergone NPO-based unlearning to remove hazardous biological knowledge (using the WMDP Bio dataset).
- Perform fine-tuning on this unlearned model for a single epoch using only 20 samples from the forget dataset (a minimal sketch of this step follows the examples below).
- Query the model with a hazardous prompt (e.g., requesting instructions for synthesizing a pathogen). The model will resume generating harmful responses with an accuracy comparable to the original, pre-unlearned model.
- MUSE Copyright Recovery:
- Take an LLaMA-2 7B model unlearned on the MUSE News dataset.
- Fine-tune the model using 75 to 125 samples of news articles.
- The model recovers verbatim memorization of the copyrighted news articles that were previously unlearned.
(Code and reproduction details available at: https://github.com/OPTML-Group/Unlearn-Smooth)
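To make the relearning step concrete, the following is a minimal sketch of the attack, assuming a Hugging Face causal-LM checkpoint for the unlearned model; the checkpoint path, the placeholder forget-set passages, and the hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal relearning-attack sketch: one epoch of standard language-model
# fine-tuning on a handful of forget-set samples. The checkpoint path, sample
# text, and hyperparameters are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/unlearned-zephyr-7b-beta"  # hypothetical unlearned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
model.train()

# The paper reports that as few as ~20 forget-set passages suffice.
forget_samples = ["<passage drawn from the forget set>"] * 20  # placeholder text

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for text in forget_samples:  # a single epoch over the tiny subset
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

After such a loop, hazardous prompts that the unlearned model previously refused or answered incorrectly are again answered with accuracy comparable to the original, pre-unlearning model.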
Impact:
- Data Privacy Violation: Reverses compliance with data regulations (e.g., GDPR "Right to be Forgotten") by allowing supposedly deleted personal data to be recovered.
- Safety Bypass: Restores hazardous capabilities (e.g., biosecurity threats, chemical synthesis instructions) that were removed for safety alignment.
- Copyright Infringement: Recovers copyrighted text segments that were intended to be erased from the model's knowledge base.
Affected Systems:
- LLMs processed with optimization-based unlearning methods, specifically:
- Negative Preference Optimization (NPO)
- Gradient Difference (GradDiff)
- Representation Misdirection for Unlearning (RMU)
- RMU with Latent Adversarial Training (RMU-LAT)
- Validated on models including Zephyr-7B-beta, LLaMA-2 7B, LLaMA-3 8B, and ICLM 7B.
Mitigation Steps:
- Integrate Sharpness-Aware Minimization (SAM): Incorporate SAM into the unlearning optimization objective to minimize both the loss value and its sharpness, producing a flatter loss landscape that resists weight perturbations (see the first sketch after this list).
- Apply Randomized Smoothing (RS): Convolve the unlearning objective with a Gaussian distribution to smooth the loss function during the optimization process.
- Implement Curvature Regularization (CR): Explicitly penalize the curvature of the forget loss to reduce the sensitivity of the model weights to fine-tuning.
- Use Weight Averaging (WA): Average model weights across multiple checkpoints along the unlearning trajectory to enforce weight smoothness (see the second sketch after this list).
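To illustrate the SAM-based mitigation, here is a minimal sketch of a SAM-style unlearning step in PyTorch. It assumes a generic forget_loss_fn closure and a perturbation radius rho; these names and the default value are hypothetical placeholders, not the paper's exact objective or hyperparameters.

```python
# Sketch of one SAM-style unlearning step: minimize the forget loss at the
# worst-case nearby weights, which flattens the loss landscape around the
# unlearned solution. forget_loss_fn and rho are illustrative placeholders.
import torch

def sam_unlearn_step(model, forget_loss_fn, optimizer, rho=0.05):
    # 1) Gradient of the forget loss at the current weights.
    forget_loss_fn(model).backward()

    # 2) Ascend to the worst-case nearby weights: w <- w + rho * g / ||g||.
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    with torch.no_grad():
        eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
        for p, e in zip(params, eps):
            p.add_(e)
    optimizer.zero_grad()

    # 3) Gradient of the forget loss at the perturbed weights.
    loss = forget_loss_fn(model)
    loss.backward()

    # 4) Restore the original weights, then update with the perturbed gradient,
    #    minimizing both the loss value and its sharpness.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```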
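For the weight-averaging mitigation, a minimal sketch follows, assuming hypothetical checkpoint directories saved along the unlearning trajectory; the paths are placeholders.

```python
# Sketch of weight averaging (WA): average parameters across checkpoints saved
# during unlearning to obtain a smoother final model. Checkpoint paths are
# hypothetical placeholders.
from transformers import AutoModelForCausalLM

checkpoint_paths = ["unlearn-ckpt-1", "unlearn-ckpt-2", "unlearn-ckpt-3"]

averaged = None
for path in checkpoint_paths:
    state = AutoModelForCausalLM.from_pretrained(path).state_dict()
    if averaged is None:
        averaged = {k: v.clone().float() for k, v in state.items()}
    else:
        for k, v in state.items():
            averaged[k] += v.float()
averaged = {k: v / len(checkpoint_paths) for k, v in averaged.items()}

# Load the averaged weights into the final checkpoint to obtain the smoothed model.
model = AutoModelForCausalLM.from_pretrained(checkpoint_paths[-1])
reference = model.state_dict()
model.load_state_dict({k: v.to(reference[k].dtype) for k, v in averaged.items()})
```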