GitHub Code Flaw Replicated By AI Models

A large-scale research study has identified a path traversal vulnerability (CWE-22) that currently affects 1,756 open-source GitHub projects, some of them highly influential. The flaw lives in a common Node.js code pattern used for creating static HTTP file servers. Attackers can exploit it to access restricted files outside the directory the server is meant to expose, potentially compromising system confidentiality and availability. Many of the affected projects carry critical vulnerabilities: their CVSS scores are often higher than 9.0, and they can be exploited remotely without any privileges.
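To make the flaw concrete, here is a minimal sketch of the kind of path handling described above. It is not the exact code from the study: the document root `/srv/www` and the function names `naiveResolve` and `containedResolve` are hypothetical, and the POSIX flavour of Node's `path` module is used only so the output is the same on every operating system.

```javascript
// Illustrative only: the root directory and function names are assumptions.
const path = require('node:path').posix;

// Hypothetical document root that the server is meant to expose.
const ROOT = '/srv/www';

// Naive mapping: the request path is joined onto the root with no check.
// path.join normalizes ".." segments, so the result can leave ROOT.
function naiveResolve(requestPath) {
  return path.join(ROOT, requestPath);
}

// Contained mapping: same join, but the result is rejected (null)
// unless it is ROOT itself or lies underneath ROOT.
function containedResolve(requestPath) {
  const resolved = path.join(ROOT, requestPath);
  if (resolved !== ROOT && !resolved.startsWith(ROOT + path.sep)) {
    return null;
  }
  return resolved;
}

console.log(naiveResolve('/index.html'));           // /srv/www/index.html
console.log(naiveResolve('/../../etc/passwd'));     // /etc/passwd
console.log(containedResolve('/../../etc/passwd')); // null
```

The containment check is one common way to close this class of bug, not necessarily the patch the researchers proposed. A real server would also need to decode the URL and handle symbolic links, which this sketch leaves out.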

The vulnerable code pattern first emerged around 2010 and has since propagated through popular developer resources, including GitHub Gist and Stack Overflow posts, as well as educational materials. It kept spreading even though developers sometimes raised security concerns, largely because of misconceptions about its safety. Many developers incorrectly assumed the code was secure because they tested it with standard HTTP clients such as browsers, which normalize URLs by default and thereby mask the flaw. Because developers frequently reuse code from such sources, this replication effect significantly increases the overall risk.
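The masking effect of URL normalization can be reproduced with Node's built-in WHATWG `URL` class, which removes dot segments much as a browser does before sending a request. This is a sketch under assumptions: the host `localhost:8080` and the helper name `asSentByNormalizingClient` are hypothetical.

```javascript
// Illustrative only: host, port and function name are assumptions.

// Returns the request path a URL-normalizing client would actually send
// for a given raw target typed into the address bar.
function asSentByNormalizingClient(rawTarget) {
  return new URL('http://localhost:8080' + rawTarget).pathname;
}

// The ".." segments are collapsed before the request leaves the client,
// so the server never sees the traversal attempt.
console.log(asSentByNormalizingClient('/../../etc/passwd')); // /etc/passwd
console.log(asSentByNormalizingClient('/a/./b/../c'));       // /a/c
```

A server that receives `/etc/passwd` looks for that file under its document root, finds nothing, and appears safe. A client that sends the path verbatim, for example curl with its `--path-as-is` option, delivers the `..` segments intact and exposes the flaw.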

To conduct the study at scale, the researchers built an automated pipeline. It scanned GitHub for the vulnerable pattern, confirmed exploitability through static and dynamic testing, assessed impact by calculating CVSS scores, and generated patches using GPT-4. The vulnerabilities were then responsibly reported to project maintainers using a staged notification approach. So far, 14% of the reported vulnerabilities have been remediated, with popular projects showing a higher remediation rate than less prominent repositories.
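As a worked example of the scoring step, the sketch below computes a CVSS v3.1 base score for one plausible vector. The vector is an assumption, not taken from the study: network attack vector, low complexity, no privileges, no user interaction, unchanged scope, with high impact on confidentiality and availability and none on integrity. The function name `baseScoreScopeUnchanged` is hypothetical, and only the unchanged-scope formula is implemented.

```javascript
// Illustrative only: the chosen vector and function names are assumptions.
// Metric weights are those of the CVSS v3.1 specification.
const EXPLOITABILITY = { AV_N: 0.85, AC_L: 0.77, PR_N: 0.85, UI_N: 0.85 };
const IMPACT_WEIGHT = { H: 0.56, L: 0.22, N: 0 };

// Roundup as defined in CVSS v3.1: the smallest number with one decimal
// place that is greater than or equal to the input.
function roundUp(x) {
  const i = Math.round(x * 100000);
  return i % 10000 === 0 ? i / 100000 : (Math.floor(i / 10000) + 1) / 10;
}

// Base score for AV:N/AC:L/PR:N/UI:N/S:U with the given C, I, A levels.
function baseScoreScopeUnchanged({ C, I, A }) {
  const iss =
    1 - (1 - IMPACT_WEIGHT[C]) * (1 - IMPACT_WEIGHT[I]) * (1 - IMPACT_WEIGHT[A]);
  const impact = 6.42 * iss;
  const exploitability =
    8.22 * EXPLOITABILITY.AV_N * EXPLOITABILITY.AC_L *
    EXPLOITABILITY.PR_N * EXPLOITABILITY.UI_N;
  if (impact <= 0) return 0;
  return roundUp(Math.min(impact + exploitability, 10));
}

// Confidentiality and availability high, integrity none.
console.log(baseScoreScopeUnchanged({ C: 'H', I: 'N', A: 'H' })); // 9.1
```

Under these assumed metrics the score lands just above 9.0, which is consistent with the critical ratings reported for many of the affected projects. The actual vectors assigned by the researchers may differ from project to project.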

Perhaps most concerning, the widespread vulnerable pattern has “poisoned” large language models (LLMs). When the tested LLMs were prompted to create static file servers, 95% of the code they produced was flawed. Even when they were asked for “secure” servers, 70% of the generated code remained vulnerable, which demonstrates how models propagate flaws from their training data. The researchers highlight an urgent need to secure open source through scalable, automated vulnerability management and greater developer awareness. The findings underscore a cascading security risk: vulnerable patterns spread easily through developer communities and now enter AI code generation tools.
