Researchers recently explored the capabilities of GPT-4, a large language model (LLM), in the cybersecurity domain, focusing on its ability to exploit one-day vulnerabilities: flaws that have been publicly disclosed but may not yet be patched on deployed systems. Their findings revealed that GPT-4 successfully exploited 87% of a benchmark of 15 real-world vulnerabilities, sourced from the CVE database and academic papers, spanning websites, container management software, and Python packages. Notably, GPT-4 could understand and carry out complex, multi-step exploits, setting it apart from other LLMs and open-source vulnerability scanners, all of which had a 0% success rate on the same benchmark.
The study demonstrated that GPT-4’s success depends heavily on access to detailed vulnerability descriptions from the CVE database. With the CVE description provided, the success rate reached 87%; without it, the rate dropped to 7%. In other words, GPT-4 is highly effective at exploiting vulnerabilities that have already been documented, but far less capable of identifying and exploiting them unaided. This positions the model, for now, as a tool for understanding and testing known security vulnerabilities rather than discovering new ones.
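To make the two experimental conditions concrete, the sketch below shows how an agent’s task prompt might differ with and without the CVE text. This is purely illustrative: the paper does not publish its prompts, and `BASE_PROMPT`, `build_task_prompt`, and the example target URL are hypothetical names invented here.

```python
# Illustrative only: the prompt wording and helper names are assumptions,
# not taken from the paper.

BASE_PROMPT = (
    "You are a security tester operating in an authorized lab environment. "
    "Target: {target}. Goal: demonstrate exploitation of the vulnerability."
)

def build_task_prompt(target: str, cve_description: str | None = None) -> str:
    """Build the agent's task prompt, optionally appending the CVE text."""
    prompt = BASE_PROMPT.format(target=target)
    if cve_description is not None:
        # "with CVE description" condition: 87% success in the study
        prompt += f"\n\nKnown vulnerability details:\n{cve_description}"
    # without the description (7% success), the agent must find the flaw itself
    return prompt

# The same target under the two experimental conditions:
with_cve = build_task_prompt("http://lab.example/app", "CSRF in ... allows ...")
without_cve = build_task_prompt("http://lab.example/app")
```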
The research also examined how GPT-4 achieves its high success rate. The researchers wrapped the model in the ReAct agent framework and gave it access to supporting tools; the entire agent was implemented in just 91 lines of code. This setup demonstrates the potential of LLMs like GPT-4 for automated security testing, particularly for simulating attacks to identify potential breaches and improve defenses against complex cyber threats.
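The paper does not release the agent’s source, but the heart of a ReAct-style agent is a short loop that alternates model reasoning with tool calls and feeds each observation back into the conversation, which helps explain how such an agent fits in roughly 91 lines. The sketch below is a minimal version of that pattern, not the authors’ implementation: it assumes the OpenAI Python client, and the `ACTION:`/`DONE` protocol, the single `run_terminal` tool, and the step limit are all illustrative choices.

```python
# Minimal ReAct-style agent loop (a sketch, not the paper's 91-line agent).
import subprocess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "You are a penetration-testing agent in an authorized lab. "
    "Think step by step. To run a shell command, reply with a line "
    "starting with 'ACTION:' followed by the command. "
    "Reply 'DONE' when the exploit succeeds."
)

def run_terminal(command: str) -> str:
    """Execute a shell command in the sandbox and return its output."""
    try:
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=60)
        return (result.stdout + result.stderr)[-2000:]  # truncate long output
    except subprocess.TimeoutExpired:
        return "Command timed out."

def react_loop(task: str, max_steps: int = 20) -> None:
    """Alternate model reasoning with tool execution until DONE or step limit."""
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-4", messages=messages).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply:
            break
        if "ACTION:" in reply:
            # extract the proposed command, run it, and feed back the result
            command = reply.split("ACTION:", 1)[1].strip().splitlines()[0]
            observation = run_terminal(command)
            messages.append({"role": "user",
                             "content": f"Observation:\n{observation}"})
```

Truncating tool output before appending it keeps the conversation within the model’s context window; a fuller agent would register additional tools, but the loop structure stays the same.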
Overall, this study advances our understanding of how LLMs can be applied in cybersecurity, highlighting both their strengths and limitations. GPT-4’s ability to handle complex, real-world exploitation tasks suggests a promising direction for further research and development in the field. At the same time, its reliance on detailed prior knowledge points to a crucial open challenge for future LLM applications in cybersecurity: tackling previously unknown threats.