Today, GitHub released an experimental new feature: the CodeQL code scanning service now uses machine learning to help developers find more security vulnerabilities. The feature is currently being tested on JavaScript and TypeScript repositories, and support for other languages will be added gradually.
During testing, CodeQL discovered over 20,000 security issues across 12,000 repositories, including remote code execution (RCE), SQL injection, and cross-site scripting (XSS) vulnerabilities.
How to use
- GitHub’s CodeQL code scanning is free for public repositories.
- The new JavaScript/TypeScript analysis is now available to all users of the security-extended and security-and-quality analysis suites.
- If you already use one of these suites, your code is analyzed automatically with the new machine learning technique.
- If you haven’t used code scanning before, you can enable CodeQL by following the steps below.
1. On your repository's main page, click Security.
2. To the right of Code scanning alerts, click Set up code scanning. If this option is missing, a repository administrator needs to enable GitHub Advanced Security.
3. Under “Get started with code scanning”, click Set up this workflow on the CodeQL Analysis workflow.
4. Use the Start commit drop-down menu and type a commit message.
5. Choose to commit directly to the default branch, or create a new branch and start a pull request.
6. Click Commit new file or Propose new file.
Once the code scanning analysis completes successfully, any security alerts appear in the repository's “Security” tab.
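The workflow file committed in the steps above is where the analysis suite is selected. As a rough sketch, the Python helper below renders a minimal CodeQL workflow that opts into one of the two suites mentioned earlier. The action names (`github/codeql-action/init`, `github/codeql-action/analyze`) and the `languages` and `queries` inputs are GitHub's published ones; the version tags, the branch name, and the `render_workflow` helper itself are assumptions made for illustration, and the workflow GitHub generates for you may differ.

```python
from string import Template

# Minimal CodeQL workflow. The version tags (@v3, @v4) and the branch name
# are assumptions; check the workflow GitHub generates for current values.
WORKFLOW_TEMPLATE = Template("""\
name: "CodeQL"
on:
  push:
    branches: [ "$branch" ]
  pull_request:
    branches: [ "$branch" ]
jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      security-events: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: github/codeql-action/init@v3
        with:
          languages: $language
          queries: $suite
      - uses: github/codeql-action/analyze@v3
""")

# The two suites the article says include the ML-powered analysis.
ALLOWED_SUITES = {"security-extended", "security-and-quality"}


def render_workflow(language="javascript", suite="security-extended", branch="main"):
    """Return the text of a workflow file for the given language and suite."""
    if suite not in ALLOWED_SUITES:
        raise ValueError(f"unsupported suite: {suite!r}")
    return WORKFLOW_TEMPLATE.substitute(language=language, suite=suite, branch=branch)


if __name__ == "__main__":
    # Save the output as .github/workflows/codeql-analysis.yml (path is the usual convention).
    print(render_workflow())
```

Restricting `suite` to the two named suites keeps the sketch aligned with the article; the default CodeQL suite is deliberately rejected here because the article does not say it includes the experimental analysis.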
Why using ML produces better results
To detect vulnerabilities in a repository, the CodeQL engine first builds a database that encodes a special relational representation of the code, and then runs a series of CodeQL queries against that database.
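To make the "database plus queries" idea concrete, here is a toy sketch. It is not CodeQL: real CodeQL databases use their own format and are queried with the QL language. This example instead stores a tiny, hand-written data-flow graph in SQLite and uses a recursive SQL query to find user-controlled sources that reach a dangerous sink. The table layout, node labels, and function names are all hypothetical.

```python
import sqlite3


def build_database(conn):
    """Create a toy relational representation of a small JavaScript program."""
    conn.executescript("""
        CREATE TABLE node (id INTEGER PRIMARY KEY, kind TEXT, label TEXT);
        CREATE TABLE flow (src INTEGER, dst INTEGER);
    """)
    conn.executemany("INSERT INTO node VALUES (?, ?, ?)", [
        (1, "source", "req.query.name"),        # user-controlled input
        (2, "expr", "'SELECT * FROM t WHERE n = ' + name"),
        (3, "sink", "db.query(sql)"),           # SQL execution
        (4, "source", "req.body.id"),
        (5, "expr", "parseInt(id)"),            # never reaches a sink
    ])
    # Each row means "data flows from src to dst".
    conn.executemany("INSERT INTO flow VALUES (?, ?)", [(1, 2), (2, 3), (4, 5)])
    conn.commit()


# Follow flow edges transitively from every source, then keep paths ending in a sink.
TAINT_QUERY = """
WITH RECURSIVE reach(origin_id, node_id) AS (
    SELECT id, id FROM node WHERE kind = 'source'
    UNION
    SELECT reach.origin_id, flow.dst
    FROM reach JOIN flow ON flow.src = reach.node_id
)
SELECT s.label, t.label
FROM reach
JOIN node AS s ON s.id = reach.origin_id
JOIN node AS t ON t.id = reach.node_id
WHERE t.kind = 'sink'
ORDER BY s.id, t.id
"""


def find_tainted_flows(conn):
    """Return (source label, sink label) pairs connected by data flow."""
    return conn.execute(TAINT_QUERY).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_database(conn)
    print(find_tainted_flows(conn))  # [('req.query.name', 'db.query(sql)')]
```

The query only knows about the sources and sinks recorded in the `node` table, which is exactly the limitation the following paragraphs describe: an API nobody has modeled is never flagged.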
With the rapid growth of the open source ecosystem, however, the long tail of less commonly used libraries keeps getting longer.
Security experts continue to extend and improve these queries by modeling additional common libraries and known patterns. However, manual modeling is time-consuming, and there will always be less common libraries and proprietary code that cannot be modeled by hand.
This is where machine learning comes in handy. Given a large training set of code snippets, each labeled as a positive or negative example for each query, features are extracted for each snippet and a deep learning model is trained to classify new examples.
Instead of treating each code snippet simply as a string of words or characters and directly applying standard NLP techniques to classify those strings, GitHub leverages CodeQL to access a wealth of information about the underlying source code and uses it to generate a rich set of features for each code snippet. These features are then tokenized and sub-tokenized, as is common in NLP.
A vocabulary is then generated from the training data, and each sample is fed as a list of vocabulary indices into a deep learning classifier, which outputs, for each query, the probability that the sample is a vulnerability.
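The preprocessing described above can be sketched in a few lines of Python. This is only an illustration of the general technique (sub-tokenize feature values, build a vocabulary, encode each sample as indices), not GitHub's implementation: the feature names such as `calleeName`, the splitting rules, and the `<pad>`/`<unk>` entries are assumptions, and the deep learning classifier itself is omitted.

```python
import re

PAD, UNK = 0, 1  # reserved indices for padding and out-of-vocabulary sub-tokens


def sub_tokenize(value):
    """Split a feature value into lowercase sub-tokens (camelCase, snake_case, digits)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", value)
    return [p.lower() for p in parts]


def build_vocabulary(samples):
    """Assign an index to every sub-token seen in the training samples."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for features in samples:
        for value in features.values():
            for tok in sub_tokenize(value):
                vocab.setdefault(tok, len(vocab))
    return vocab


def encode(features, vocab):
    """Turn one sample's features into the list of indices a classifier would consume."""
    return [vocab.get(tok, UNK)
            for value in features.values()
            for tok in sub_tokenize(value)]


if __name__ == "__main__":
    # Hypothetical features for two training snippets.
    training = [
        {"enclosingFunctionName": "renderUserProfile", "calleeName": "innerHTML"},
        {"enclosingFunctionName": "runUserQuery", "calleeName": "db.query"},
    ]
    vocab = build_vocabulary(training)
    new_sample = {"enclosingFunctionName": "renderComment", "calleeName": "outerHTML"}
    print(encode(new_sample, vocab))
```

Sub-tokenizing lets an unseen identifier such as `outerHTML` still share the `html` sub-token with identifiers from the training data, which is one plausible reason the approach generalizes to libraries nobody has modeled by hand.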
Although ML-based vulnerability scanning is currently available only for JavaScript/TypeScript, GitHub promises to support more languages in the future. CodeQL itself already supports many popular languages, including Python, Go, and C/C++.
Finally, GitHub also points out that while the new analysis can find more vulnerabilities, it may also increase the false positive rate (approximately 80% recall and 60% precision). The feature is expected to improve over time.
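To see what those two figures mean in practice, the short sketch below computes precision and recall from alert counts. The counts are made up so that the result matches the quoted percentages; they are not GitHub's data.

```python
def precision(true_positives, false_positives):
    """Fraction of reported alerts that are real vulnerabilities."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of real vulnerabilities that were reported."""
    return true_positives / (true_positives + false_negatives)


if __name__ == "__main__":
    # Hypothetical counts: 100 alerts raised, 60 of them real, 15 real issues missed.
    tp, fp, fn = 60, 40, 15
    print(f"precision = {precision(tp, fp):.0%}")  # precision = 60%
    print(f"recall    = {recall(tp, fn):.0%}")     # recall    = 80%
```

In this illustration, 4 out of every 10 alerts would be false positives, while 1 in 5 real issues would still be missed.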
Reference links:
[1] https://github.blog/2022-02-17-code-scanning-finds-vulnerabilities-using-machine-learning/
[2] https://github.blog/2022-02-17-leveraging-machine-learning-find-security-vulnerabilities/
[3] https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/setting-up-code-scanning-for-a-repository