Hackers are constantly seeking new methods to bypass the ethical and safety measures built into AI systems, allowing them to exploit these systems for various malicious purposes. This includes creating harmful content, spreading false information, and engaging in illegal activities by taking advantage of security flaws. Recently, Microsoft researchers discovered a new technique called the Skeleton Key, which can circumvent responsible AI guardrails in several generative AI models.
The Skeleton Key jailbreak involves a direct prompt injection attack, potentially defeating all safety precautions embedded in the AI models’ design. This method allows the AI to break policies, develop biases, or execute any malicious instructions. To combat this, Microsoft has shared their findings with other AI vendors and deployed Prompt Shields to detect and prevent such attacks within Azure AI-managed models. They have also updated their LLM technology to eliminate this vulnerability across their AI offerings, including Copilot assistants.
The multi-step approach used in the Skeleton Key jailbreak enables the evasion of AI model guardrails, allowing the model to be fully exploited despite its ethical limitations. This attack requires legitimate access to the AI model and can result in harmful content being produced or normal decision-making rules being overridden. Microsoft emphasizes the need for AI developers to consider such threats in their security models and suggests AI red teaming with software like PyRIT to enhance security.
Microsoft’s tests between April and May 2024 showed that base and hosted models from companies like Meta, Google, OpenAI, Mistral, Anthropic, and Cohere were all affected by this technique. The only exception was GPT-4, which showed resistance until the attack was formulated in system messages. These findings highlight the necessity of distinguishing between security systems and user inputs to mitigate vulnerabilities effectively. Microsoft has recommended several mitigations, including input filtering, system message adjustments, output filtering, and abuse monitoring to safeguard AI systems against such attacks.
Reference: