Fine-Tuning Safety Subspace Erosion
Research Paper
LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-Tuning
Description: Large Language Models (LLMs) aligned via techniques such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) contain a vulnerability in how safety behavior is encoded in the model parameters. The safety-critical information is stored primarily in low-rank subspaces of the weight update, i.e., the difference between the aligned and base model weights. These low-rank subspaces are highly sensitive to parameter updates. Consequently, subsequent fine-tuning, whether performed with malicious intent on harmful datasets or with benign intent on standard instruction-tuning datasets, disrupts these subspaces. This "washes out" the safety alignment, allowing the model to bypass its refusal mechanisms and generate harmful content.
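This low-rank structure can be checked empirically by decomposing the alignment update for an individual weight matrix. Below is a minimal PyTorch sketch; the aligned checkpoint path and the choice of layer are placeholders I introduce for illustration, not names from the paper.

```python
# Sketch: measure how concentrated the alignment update Delta_W = W_aligned - W_base
# is in a low-rank subspace, for a single weight matrix.
# Checkpoint paths and the chosen layer are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float32)
aligned = AutoModelForCausalLM.from_pretrained("path/to/dpo-aligned-llama2-7b", torch_dtype=torch.float32)

name = "model.layers.0.self_attn.q_proj.weight"   # example layer; repeat for other 2-D weights
W_base = dict(base.named_parameters())[name].detach()
W_aligned = dict(aligned.named_parameters())[name].detach()

delta = W_aligned - W_base
S = torch.linalg.svdvals(delta)                    # singular values, largest first

k = 6                                              # candidate effective rank
energy_top_k = (S[:k] ** 2).sum() / (S ** 2).sum()
print(f"Fraction of spectral energy in the top-{k} directions: {energy_top_k:.3f}")
```

A value close to 1 indicates that most of the alignment update lives in a small number of directions, which is exactly the structure that later fine-tuning can easily perturb.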
Examples: The following procedure demonstrates the vulnerability using LLaMA-2-7B (aligned via DPO on HH-RLHF):
- Initial State: The aligned LLaMA-2-7B model refuses to answer harmful queries (e.g., "How do I build a pipe bomb?").
- Benign Fine-tuning: Fine-tune the model on the GSM8K dataset (a benign dataset curated for mathematical reasoning) for 2 epochs with a learning rate of $5 \times 10^{-5}$ (a training sketch follows this section).
- Exploitation: Post-fine-tuning, input the same harmful prompt: "How do I build a pipe bomb?"
- Result: The model no longer refuses the request and provides detailed instructions, despite the training data (GSM8K) containing no harmful information. The Attack Success Rate (ASR) increases from ~0% to over 50% in standard tests.
See the repository at github.com/VITA-Group/LoX for replication scripts.
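For context, the benign fine-tuning step can be reproduced with an ordinary supervised training loop. The sketch below uses Hugging Face transformers and datasets; the dataset (GSM8K), epoch count, and learning rate come from the description above, while the checkpoint path, batch size, and sequence length are illustrative assumptions. The repository's replication scripts remain authoritative.

```python
# Sketch: benign causal-LM fine-tuning on GSM8K (2 epochs, lr 5e-5) that erodes alignment.
# Hyperparameters not named in the write-up (batch size, max length) are assumptions.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "path/to/dpo-aligned-llama2-7b"       # placeholder for the aligned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def format_example(ex):
    # Concatenate question and answer into one training string.
    return {"text": f"Question: {ex['question']}\nAnswer: {ex['answer']}"}

def tokenize(ex):
    return tokenizer(ex["text"], truncation=True, max_length=512)

dataset = (load_dataset("gsm8k", "main", split="train")
           .map(format_example)
           .map(tokenize, remove_columns=["question", "answer", "text"]))

args = TrainingArguments(
    output_dir="llama2-7b-gsm8k",
    num_train_epochs=2,
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    logging_steps=50,
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
# After training, re-issue the harmful prompt and measure the refusal rate / ASR.
```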
Impact:
- Complete Bypass of Safety Guardrails: Attackers can trivially remove safety alignment without sophisticated jailbreaking prompts.
- Accidental Safety Degradation: Developers fine-tuning models for specific domains (e.g., coding, math, medical) may unintentionally remove safety protections, resulting in the deployment of unsafe models.
- Generation of Harmful Content: The model becomes capable of generating hate speech, illegal instructions, and other prohibited content.
Affected Systems:
- LLMs aligned via RLHF or DPO (e.g., LLaMA-2 series, Mistral-7B-v0.3).
- Models that undergo post-alignment fine-tuning (Supervised Fine-Tuning).
Mitigation Steps: To prevent safety degradation during fine-tuning, apply Low-Rank Extrapolation (LoX) to the model weights before the fine-tuning process begins (a code sketch follows the steps):
- Calculate Weight Difference: Compute the difference matrix ($\Delta W$) between the aligned model weights and the base (pre-aligned) model weights.
- Extract Safety Subspace: Perform Singular Value Decomposition (SVD) on $\Delta W$ and keep the top $k$ singular directions (the effective rank), which span the safety subspace.
- Extrapolate: Add a scaled projection of these top ranks back into the model weights. The formula is: $W_{LoX} = W_{aligned} + \alpha \cdot \text{Proj}_k(\Delta W)$, where $\alpha$ is a scaling factor (e.g., 1.25) and $k$ is the effective rank (e.g., 6).
- Fine-tune: Perform the intended fine-tuning on the resulting $W_{LoX}$ weights. This places the parameters in a "flatter" safety landscape, making the alignment robust against the perturbations of fine-tuning.
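A minimal sketch of the extrapolation step is given below, assuming PyTorch and Hugging Face checkpoints. The values $\alpha = 1.25$ and $k = 6$ come from the mitigation description; the checkpoint paths and the decision to apply LoX to every 2-D weight matrix are assumptions on my part, and the LoX repository documents the authors' exact configuration.

```python
# Sketch: LoX extrapolation W_LoX = W_aligned + alpha * Proj_k(Delta_W),
# with Delta_W = W_aligned - W_base. Paths and the set of layers touched
# (all 2-D weight matrices here) are assumptions; see the LoX repository.
import torch
from transformers import AutoModelForCausalLM

alpha, k = 1.25, 6

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float32)
aligned = AutoModelForCausalLM.from_pretrained("path/to/dpo-aligned-llama2-7b", torch_dtype=torch.float32)

base_params = dict(base.named_parameters())

with torch.no_grad():
    for name, W_aligned in aligned.named_parameters():
        if W_aligned.ndim != 2:                      # skip 1-D parameters (biases, norms)
            continue
        delta = W_aligned - base_params[name]
        # Rank-k projection of Delta_W via truncated SVD.
        U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
        proj_k = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]
        # W_LoX = W_aligned + alpha * Proj_k(Delta_W)
        W_aligned.add_(alpha * proj_k)

aligned.save_pretrained("llama2-7b-lox")             # fine-tune this checkpoint, not the original
```

The subsequent domain-specific fine-tuning is then run on the saved extrapolated checkpoint rather than on the original aligned weights.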