Scientific Language Jailbreak
Research Paper
LLMs are Vulnerable to Malicious Prompts Disguised as Scientific Language
Description: Large Language Models (LLMs) are vulnerable to malicious prompts disguised as summaries of scientific papers, even when those papers are fabricated by the attacker. This allows attackers to manipulate LLMs into generating responses with significantly increased stereotypical bias and toxicity. The vulnerability is exacerbated in multi-turn interactions, where bias scores tend to increase with each subsequent response, and the inclusion of author names and publication venues in the fabricated summaries further increases the attack's effectiveness.
Examples: See the paper "LLMs are Vulnerable to Malicious Prompts Disguised as Scientific Language". The paper includes specific examples demonstrating the attack against various LLMs (GPT-4o, GPT-4o-mini, GPT-4, Llama3-405B-Instruct, Llama3-70B-Instruct, Cohere, Gemini) using both real and fabricated scientific papers as input. The examples show how neutral prompts are used as a springboard to elicit biased and toxic outputs.
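The multi-turn escalation reported in the paper can be observed with a simple measurement harness. The sketch below assumes a generic `chat(messages)` callable wrapping whichever target model is under test, and uses the open-source Detoxify classifier as a stand-in toxicity scorer rather than the paper's own metric.

```python
# Sketch: score each model reply in a multi-turn conversation for toxicity.
# Assumptions: `chat(messages)` wraps the target LLM and returns a string;
# Detoxify is a stand-in scorer, not the metric used in the paper.
from typing import Callable
from detoxify import Detoxify

scorer = Detoxify("original")

def score_conversation(chat: Callable[[list[dict]], str], turns: list[str]) -> list[float]:
    """Send each user turn in sequence and record the toxicity of every reply."""
    messages: list[dict] = []
    scores: list[float] = []
    for user_turn in turns:
        messages.append({"role": "user", "content": user_turn})
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        scores.append(float(scorer.predict(reply)["toxicity"]))
    return scores
```

A rising sequence of scores across turns would reflect the escalation behavior described above.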
Impact: The vulnerability allows attackers to manipulate LLMs into generating harmful content, including but not limited to:
- Reinforcement of harmful societal stereotypes.
- Generation of toxic and offensive language.
- Dissemination of misinformation disguised as scientific fact.
The impact is amplified in multi-turn scenarios, where each subsequent response can propagate increasingly harmful content, undermining the trustworthiness of LLM-powered applications and systems.
Affected Systems: Various LLMs, including but not limited to: GPT-4o, GPT-4o-mini, GPT-4, Llama3-405B-Instruct, Llama3-70B-Instruct, Cohere, and Gemini. The vulnerability is likely present in other LLMs as well.
Mitigation Steps:
- Develop improved prompt filtering mechanisms to detect and block malicious prompts disguised as scientific statements (a heuristic sketch follows this list).
- Enhance LLM training data to reduce biases and improve resistance to manipulation.
- Implement multi-turn dialogue safety measures that monitor and limit the escalation of bias across conversational turns (see the escalation-guard sketch below).
- Develop robust methods for distinguishing genuine scientific evidence from fabricated or misinterpreted citations within prompts (see the citation-check sketch below).
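For the first mitigation, one minimal approach is a heuristic pre-filter that flags prompts combining scientific-authority framing (citations, venues, "a recent study shows") with demographic generalizations. The cue lists below are illustrative assumptions, not values from the paper; a production system would use a trained classifier.

```python
import re

# Illustrative cue lists (assumptions for this sketch).
AUTHORITY_CUES = [
    r"\baccording to (a|the) (recent )?(study|paper)\b",
    r"\bpublished in\b",
    r"\bet al\.",
    r"\bpeer[- ]reviewed\b",
    r"\bfindings (show|demonstrate)\b",
]
GENERALIZATION_CUES = [
    r"\b(all|most|typically) (women|men|immigrants)\b",
    r"\bare (naturally|inherently) (less|more)\b",
]

def looks_like_science_primed_bias(prompt: str) -> bool:
    """Flag prompts that pair scientific-authority framing with group generalizations."""
    has_authority = any(re.search(p, prompt, re.IGNORECASE) for p in AUTHORITY_CUES)
    has_generalization = any(re.search(p, prompt, re.IGNORECASE) for p in GENERALIZATION_CUES)
    return has_authority and has_generalization
```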
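The multi-turn safety measure could be implemented as a conversation-level guard that tracks a per-reply bias score and interrupts the dialogue when the score rises for several consecutive turns. The scoring function and the escalation window below are assumptions for illustration.

```python
from collections import deque
from typing import Callable

class EscalationGuard:
    """Block a dialogue when bias scores rise for `window` consecutive replies."""

    def __init__(self, score_fn: Callable[[str], float], window: int = 3):
        self.score_fn = score_fn              # any bias/toxicity classifier (assumed)
        self.recent = deque(maxlen=window + 1)

    def allow(self, reply: str) -> bool:
        """Return False if the last `window` replies show strictly increasing scores."""
        self.recent.append(self.score_fn(reply))
        if len(self.recent) < self.recent.maxlen:
            return True
        scores = list(self.recent)
        increasing = all(b > a for a, b in zip(scores, scores[1:]))
        return not increasing
```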
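For the last mitigation, a first line of defense is checking whether a cited paper exists at all, for example by querying the public Crossref REST API. The similarity threshold is an illustrative assumption; a hit does not prove the prompt summarizes the paper faithfully, and a miss does not prove fabrication, so this only screens out citations with no matching record.

```python
import requests
from difflib import SequenceMatcher

def citation_appears_real(title: str, threshold: float = 0.9) -> bool:
    """Check whether a cited title closely matches any record in Crossref."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 5},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for candidate in item.get("title", []):
            if SequenceMatcher(None, title.lower(), candidate.lower()).ratio() >= threshold:
                return True
    return False
```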