CoT Jailbreak Mitigation Failure
Research Paper
Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?
Description: Chain-of-thought (CoT) reasoning, while intended to improve safety, can paradoxically increase the harmfulness of successful jailbreak attacks: the reasoning process enables the model to generate highly detailed and actionable instructions. Existing jailbreaking methods, when applied to LLMs that employ CoT, can elicit more precise and dangerous outputs than the same attacks against LLMs without CoT.
Examples: See the paper "Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?" for specific examples of the FICDETAIL jailbreaking attack against various LLMs with and without CoT. The paper documents cases where CoT-enabled LLMs produce significantly more detailed and actionable harmful outputs than their non-CoT counterparts.
Impact: Successful jailbreak attacks against CoT-enabled LLMs can yield highly detailed, step-by-step instructions for malicious activities, increasing the likelihood that harmful actions are carried out successfully. This added detail significantly amplifies the potential impact relative to attacks on LLMs without CoT.
Affected Systems: Large language models (LLMs) that utilize chain-of-thought (CoT) reasoning for improved response generation, including, but not limited to, those evaluated in the referenced paper.
Mitigation Steps:
- Pursue further research into mitigating the enhanced detail generation enabled by CoT reasoning.
- Develop and implement robust safety mechanisms that account for the increased detail CoT provides, evaluating not only whether a response is harmful but also how detailed and executable it is.
- Rigorously test and evaluate CoT-enabled LLMs, including adversarial testing against diverse jailbreaking techniques, to identify and address vulnerabilities.
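The second mitigation step, scoring a response on both harmfulness and its level of actionable detail, can be sketched as follows. This is a minimal illustrative example, not the paper's method: the keyword markers and weighting are hypothetical placeholders, and a real deployment would replace them with a trained classifier or an LLM-as-judge grader.

```python
# Minimal sketch of a dual-axis safety scorer: rates a response on
# (a) harmfulness and (b) actionability/detail, and flags it when the
# combined risk crosses a threshold. All marker lists and weights below
# are hypothetical placeholders for illustration only.

from dataclasses import dataclass

HARM_MARKERS = ("explosive", "malware", "poison")          # hypothetical
DETAIL_MARKERS = ("step 1", "first,", "then,", "finally")  # hypothetical


@dataclass
class SafetyVerdict:
    harm_score: float    # 0..1, likelihood the content is harmful
    detail_score: float  # 0..1, how actionable/step-by-step it is
    flagged: bool


def evaluate_response(text: str, threshold: float = 0.5) -> SafetyVerdict:
    lower = text.lower()
    harm = min(1.0, sum(m in lower for m in HARM_MARKERS) / 2)
    detail = min(1.0, sum(m in lower for m in DETAIL_MARKERS) / 2)
    # Detail amplifies harm: a vague answer on a harmful topic scores
    # lower than a precise step-by-step one on the same topic.
    combined = harm * (0.5 + 0.5 * detail)
    return SafetyVerdict(harm, detail, combined >= threshold)
```

Note the multiplicative combination: a harmful but vague response is penalized less than an equally harmful response laid out as executable steps, which mirrors the advisory's point that detail is what amplifies impact.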
© 2025 Promptfoo. All rights reserved.