HiddenLayer exposes vulnerabilities in Google’s Gemini large language model (LLM), revealing potential risks of system prompt leaks, harmful content generation, and indirect injection attacks. These vulnerabilities affect both consumers using Gemini Advanced with Google Workspace and companies utilizing the LLM API. The findings highlight the critical need for robust security measures and continuous testing to defend against adversarial behaviors and ensure the safety of users interacting with language models.
One vulnerability involves bypassing security guardrails to leak system prompts, enabling the model to generate more useful responses based on conversation-wide instructions. Another class of vulnerabilities relates to “crafty jailbreaking” techniques, allowing the LLM to generate misinformation and output potentially illegal and dangerous information. Additionally, weaknesses in the system prompt mechanism could lead to information leakage by passing repeated uncommon tokens as input, posing further security concerns.
HiddenLayer’s discoveries shed light on the broader challenges faced by language models in mitigating adversarial attacks and ensuring user safety. These vulnerabilities underscore the importance of proactive security measures and ongoing vigilance in the development and deployment of AI technologies. Google acknowledges the findings and emphasizes its commitment to running red-teaming exercises and implementing safeguards to prevent harmful responses, while also enforcing restrictions on election-based queries to mitigate potential risks.