Automated LLM Jailbreak Framework

Description: The MASTER KEY framework exploits timing-based characteristics of Large Language Model (LLM) chatbot responses to infer internal defense mechanisms and automatically generate jailbreak prompts. This allows bypassing safety restrictions and eliciting responses violating usage policies, including generation of illegal, harmful, privacy-violating, and adult content. The framework utilizes a three-step process: reverse-engineering defenses via time-based analysis, creating proof-of-concept jailbreak prompts, and fine-tuning an LLM to automatically generate effective prompts.

Examples: See the research paper for specific examples of time-based analysis and generated jailbreak prompts. Due to ethical considerations and responsible disclosure, the full dataset of jailbreak prompts is not publicly available.

Impact: Successful exploitation allows malicious actors to bypass LLM chatbot safety restrictions and obtain responses containing sensitive information, generate harmful content (e.g., instructions for creating malware), violate user privacy, and produce illegal or adult material. This undermines the security and intended functionality of the chatbots.

Affected Systems: OpenAI ChatGPT (GPT-3.5 and GPT-4), Google Bard, Microsoft Bing Chat, and Baidu Ernie. Potentially other LLMs employing similar defense mechanisms.

Mitigation Steps:

Strengthen ethical and policy-based alignment of LLMs through methods like supervised training.
Refine and rigorously test content moderation systems, including incorporating input sanitization.
Integrate contextual analysis to counter encoding strategies used to bypass keyword-based defenses.
Implement automated stress testing to identify and address vulnerabilities comprehensively.

Automated LLM Jailbreak Framework

Research Paper