Introduction

With the recent surge of Large Language Models (LLMs), developers increasingly rely on them to assist with everyday code writing; well-known products include GitHub Copilot and ChatGPT. However, just like code written by human developers, code generated by LLMs can carry security risks. A common example is the “Out-of-bounds Write” weakness, which may allow an attacker to corrupt memory outside an intended buffer and, in the worst case, run malicious code.
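As a minimal illustration of this weakness (a hypothetical sketch in C, not code taken from any of the studies discussed below), consider copying user input into a fixed-size buffer without checking its length:

    #include <string.h>

    void store_username(const char *input) {
        char buf[16];
        /* Vulnerable: strcpy does not check the length of input, so any
           input longer than 15 characters writes past the end of buf. */
        strcpy(buf, input);
    }

    void store_username_safe(const char *input) {
        char buf[16];
        /* Safer: bound the copy to the buffer size and guarantee
           null-termination. */
        strncpy(buf, input, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';
    }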

In this post, we introduce some of the latest efforts to assess the security of code generated by Large Language Models, to give a sense of how reliable (or unreliable) such models are, both in general and on specific tasks.

Systematic experiments through CWEs

The Common Weakness Enumeration (CWE) is a system operated by MITRE that classifies software security weaknesses into over 400 categories. For example, the “Out-of-bounds Write” risk mentioned earlier is assigned to CWE-787. By using CWEs as the metric for evaluating generated code, one can analyse a whole group of security vulnerabilities in a systematic way.

2023 CWE Top 25 Most Dangerous Software Weaknesses: https://cwe.mitre.org/top25/archive/2023/2023_top25_list.html

In a recent study (Pearce et al., 2021), code generated by GitHub Copilot was evaluated against a subset of MITRE’s CWEs to gain insight into the question “Are GitHub Copilot’s suggestions commonly insecure?”. More specifically, Copilot’s behaviour was studied along three dimensions: diversity of weakness, diversity of prompt, and diversity of domain. GitHub Copilot is a code completion tool developed by GitHub and OpenAI that assists users of Visual Studio Code, Visual Studio, Neovim, and JetBrains integrated development environments (IDEs) by autocompleting code.

For diversity of weakness, three different scenarios were constructed for each applicable “top 25” CWE, each taking the form of a small, incomplete program to be filled out by Copilot. To evaluate the completions, CodeQL static scanning was used alongside manual inspection to assess whether the returned suggestions were vulnerable to that CWE.
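As a rough idea of what such a scenario might look like (a hypothetical sketch, not one of the paper’s actual prompts), a CWE-787-style scenario could leave the risky part of a small C program for Copilot to complete, and the suggested completion would then be scanned and inspected:

    #include <stdio.h>

    /* Incomplete scenario: the missing body is left for the model to
       suggest, and the completion is checked for out-of-bounds writes. */
    int main(void) {
        char username[16];
        printf("Enter username: ");
        /* <-- completion requested here: read user input into username */
        return 0;
    }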

Next, for diversity of prompt, the study assesses how different prompts (i.e. the input given to the model) affect how likely Copilot is to return suggestions that are vulnerable to a single selected CWE (e.g. SQL injection).
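To make that target weakness concrete, here is a minimal sketch in C using SQLite (an illustration, not code from the study) contrasting a completion that is vulnerable to SQL injection with one that uses a parameterised query:

    #include <stdio.h>
    #include <sqlite3.h>

    /* Vulnerable: the user-supplied name is pasted into the SQL string,
       so input like  x' OR '1'='1  changes the meaning of the query. */
    void find_user_unsafe(sqlite3 *db, const char *name) {
        char sql[256];
        snprintf(sql, sizeof(sql),
                 "SELECT id FROM users WHERE name = '%s';", name);
        sqlite3_exec(db, sql, NULL, NULL, NULL);
    }

    /* Safer: a prepared statement keeps user input as data, not SQL. */
    void find_user_safe(sqlite3 *db, const char *name) {
        sqlite3_stmt *stmt;
        sqlite3_prepare_v2(db,
            "SELECT id FROM users WHERE name = ?;", -1, &stmt, NULL);
        sqlite3_bind_text(stmt, 1, name, -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_finalize(stmt);
    }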

Finally, for diversity of domain, the security of Copilot’s generated code is assessed when it is tasked with a less common domain, in this case code written in another kind of language: hardware design code in the Verilog hardware description language.

Overall, Copilot’s responses to these scenarios were mixed from a security standpoint, with around 40% of the generated programs found to be vulnerable.

User Studies

Apart from general systematic testing, user studies have also been conducted to assess the security of generated code. One recent example is a security-focused user study of code written by student programmers when assisted by an LLM (Sandoval et al., 2022).

Students for this study were recruited through social media and split randomly into a ‘control’ group (no LLM access) and an ‘assisted’ group (with LLM access). They were asked to complete the implementation of a shopping list program in C. C was chosen because many severe security risks are memory-safety issues in low-level languages such as C and C++.
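To show why a small C task like this can surface memory-safety weaknesses, here is a hypothetical function of the kind a shopping list program might need (not one of the study’s actual functions), with a typical pitfall noted in the comments:

    #include <stdlib.h>
    #include <string.h>

    typedef struct item {
        char name[32];
        unsigned int quantity;
        struct item *next;
    } item_t;

    /* Adds an item to the front of the list. A careless completion might
       call strcpy(node->name, name) instead, reintroducing CWE-787 when
       the caller passes a name longer than 31 characters. */
    item_t *add_item(item_t *head, const char *name, unsigned int quantity) {
        item_t *node = malloc(sizeof(*node));
        if (node == NULL) {
            return head;  /* allocation failure: leave the list unchanged */
        }
        strncpy(node->name, name, sizeof(node->name) - 1);
        node->name[sizeof(node->name) - 1] = '\0';
        node->quantity = quantity;
        node->next = head;
        return node;
    }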

The assisted group was supported by an OpenAI model called code-cushman-001, chosen for its speed and for a response time similar to GitHub Copilot’s. After each suggestion was generated, the code was checked for successful compilation, and another completion was requested if it failed to compile. To analyse the results, standard statistical hypothesis tests were used to draw evidence-based conclusions: hypothesis testing asks whether the observed data are consistent with an assumption about a population parameter, for instance whether two groups really differ in the rate at which they produce vulnerable code.
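As an illustration of the kind of test that could be used here (not necessarily the exact test in the study), one could compare the proportion of vulnerable submissions in the assisted group, p̂₁ = x₁/n₁, with that of the control group, p̂₂ = x₂/n₂, using a two-proportion z-test:

    z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
    \qquad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}

where x_i is the number of vulnerable submissions out of n_i participants in group i; a large |z| indicates that the observed difference between the groups is unlikely to be due to chance alone.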

By examining the completed code for functionality and security, using both manual review and automated checks against a list of CWEs, the study found that the security impact of LLM assistance was minimal in this setting, while existing findings on the productivity benefits of AI assistance were confirmed. The AI-assisted group produced security-critical bugs at a rate no more than 10% higher than the non-assisted control group. When the origin of bugs among assisted users was investigated, 63% of the bugs originated in code written by the humans themselves, while 36% were present in the accepted suggestions.

Conclusion

Large Language Models have advanced significantly in many directions, particularly as assistants for code implementation, yet the code they generate can carry real security risks. In this post, we discussed the latest efforts in assessing the security of LLM-generated code, ranging from systematic CWE-based testing to user studies. Through continued investigation and refinement, we believe that Large Language Models can open up exciting opportunities for code generation.

References

  • Pearce, H.; Ahmad, B.; Tan, B.; Dolan-Gavitt, B.; and Karri, R., 2021. Asleep at the keyboard? Assessing the security of GitHub Copilot’s code contributions. https://arxiv.org/abs/2108.09293
  • Sandoval, G.; Pearce, H.; Nys, T.; Karri, R.; Garg, S.; and Dolan-Gavitt, B., 2022. Lost at C: A user study on the security implications of large language model code assistants. https://arxiv.org/abs/2208.09727
