LMVD-ID: 607997da
Published February 1, 2024

Stealthy LLM Cold Attack

Affected Models: llama-2-7b-chat-hf, mistral-7b-instruct-v0.2, vicuna-7b-v1.5, guanaco-7b-hf, gpt-3.5-turbo, gpt-4, vicuna-13b-v1.5, guanaco-13b-hf, llama-2-13b-chat-hf

Research Paper

COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

View Paper

Description: The COLD-Attack framework generates stealthy and controllable adversarial prompts that bypass safety mechanisms in a wide range of Large Language Models (LLMs). The attack leverages an energy-based constrained decoding method (Langevin dynamics over continuous token logits) to produce fluent, contextually coherent prompts designed to elicit harmful or unintended responses from the targeted LLM, even under additional constraints such as a specific sentiment or phrasing. Because the resulting prompts read naturally, the attack evades detection mechanisms that rely solely on prompt fluency (e.g., perplexity filters) or keyword matching.
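
To make the mechanics concrete, below is a minimal, heavily simplified sketch of energy-based constrained decoding in the spirit of COLD-Attack; it is not the authors' implementation (see the linked repository for that). It optimizes continuous "soft" logits for an adversarial suffix with Langevin-style updates on a weighted sum of an attack energy (likelihood of a desired continuation) and a fluency energy, then discretizes. GPT-2 as the stand-in model, the benign placeholder strings, and all hyperparameters are illustrative assumptions.

```python
# Simplified energy-based constrained decoding sketch (COLD-style Langevin updates).
# Assumptions: GPT-2 stands in for the target LLM; only two energy terms are used;
# the paper combines more constraints and decodes with guided sampling from the logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
embed = model.get_input_embeddings().weight           # (vocab_size, hidden_dim)

# Benign placeholder strings stand in for the jailbreak prompt/target used in the paper.
prompt = "The weather report for today says it is"
target = " absolutely wonderful outside"
prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(device)
target_ids = tok(target, return_tensors="pt").input_ids.to(device)

suffix_len, steps, step_size, noise_scale = 8, 100, 0.5, 0.01
# Continuous ("soft") logits for the suffix -- the variable being optimized.
y = torch.randn(1, suffix_len, embed.size(0), device=device, requires_grad=True)

for _ in range(steps):
    soft = torch.softmax(y, dim=-1)                    # soft token distribution
    suffix_emb = soft @ embed                          # differentiable suffix embeddings
    inputs = torch.cat([embed[prompt_ids], suffix_emb, embed[target_ids]], dim=1)
    logits = model(inputs_embeds=inputs).logits

    # Attack energy: negative log-likelihood of the desired target continuation.
    tgt_logits = logits[:, -target_ids.size(1) - 1:-1, :]
    attack_e = torch.nn.functional.cross_entropy(
        tgt_logits.reshape(-1, tgt_logits.size(-1)), target_ids.reshape(-1))

    # Fluency energy: the suffix itself should be likely under the language model.
    start = prompt_ids.size(1) - 1
    sfx_logits = logits[:, start:start + suffix_len, :]
    fluency_e = -(soft * torch.log_softmax(sfx_logits, dim=-1)).sum(-1).mean()

    energy = attack_e + 0.1 * fluency_e
    grad, = torch.autograd.grad(energy, y)
    with torch.no_grad():                              # Langevin step: descend + noise
        y -= step_size * grad + noise_scale * torch.randn_like(y)

suffix = tok.decode(y.argmax(-1)[0])                   # discretize the soft sequence
print(repr(prompt + suffix))
```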

Examples: See https://github.com/Yu-Fangxu/COLD-Attack for the code and examples provided by the research paper. Specific examples included in the paper demonstrate attacks with continuation constraints, paraphrasing constraints, and position constraints, showcasing the framework's versatility in evading various defense mechanisms.

Impact: Successful exploitation of this vulnerability can lead to LLMs generating harmful, biased, or otherwise undesired outputs, including:

  • Circumvention of safety filters and content moderation: The stealthy, fluent nature of the generated prompts allows them to bypass security measures that rely on analyzing the input prompt.
  • Elicitation of malicious behaviors: Adversarial prompts can trick LLMs into performing actions outside their intended design, such as generating harmful content or revealing sensitive information.
  • Misinformation generation and spread: The ability to control the content and style of generated text enables creation of convincing yet deceptive information.

Affected Systems: Various LLMs, including (but not limited to) Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5, and GPT-4. The vulnerability is demonstrably transferable across different model architectures and sizes.

Mitigation Steps:

  • Robust prompt filtering: Implement more sophisticated filtering mechanisms that go beyond simple keyword matching or perplexity checks, incorporating semantic analysis and, potentially, adversarially trained classifiers.
  • Adversarial training: Train LLMs on adversarial examples generated by techniques like COLD-Attack to improve their robustness against these types of attacks.
  • Output filtering & monitoring: Implement additional measures to filter and monitor LLM outputs for potentially harmful content, even if the input prompt appears innocuous.
  • Regular updates and patching: Stay up-to-date with security advisories and patches released by LLM providers to address newly discovered vulnerabilities.
  • Diverse safety mechanism integration: Implement multiple layers of security, combining input filtering, output analysis, and runtime monitoring; a minimal sketch of such layering follows this list.
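
As a complement to the list above, the sketch below shows what such layering could look like: a perplexity gate on the input (a weak signal on its own, precisely because COLD-Attack prompts are fluent) combined with an output-side check applied regardless of how benign the prompt appears. GPT-2 as the scoring model, the threshold, and the output_looks_harmful placeholder are assumptions; a production system should use a real moderation model or API at both layers.

```python
# Defense-in-depth sketch: input perplexity gate + output screening.
# Assumptions: GPT-2 is a stand-in scorer; the threshold and the keyword-based
# output check are placeholders for trained safety classifiers or moderation APIs.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
scorer = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

def perplexity(text: str) -> float:
    """Perplexity under the scorer; high values flag gibberish-style suffixes,
    but COLD-Attack prompts are fluent, so this check alone is insufficient."""
    ids = tok(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        loss = scorer(ids, labels=ids).loss
    return math.exp(loss.item())

def output_looks_harmful(text: str) -> bool:
    """Placeholder output check; swap in a real moderation model or API."""
    flagged_markers = ["how to build a weapon", "step-by-step instructions for"]
    return any(m in text.lower() for m in flagged_markers)

def guarded_generate(prompt: str, llm_call) -> str:
    # Layer 1: input screening (fluency is only one weak signal among several).
    if perplexity(prompt) > 1000.0:
        return "Request blocked by input filter."
    # Layer 2: generate, then screen the output even when the prompt looks benign,
    # since stealthy prompts are designed to pass input-side checks.
    response = llm_call(prompt)
    if output_looks_harmful(response):
        return "Response withheld by output filter."
    return response
```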
