To date, the most important security vulnerabilities have been found via laborious code auditing. Also, this is the only way vulnerabilities can be found and fixed during development. However, as software production rates increase, so does the need for a reliable, automated method for checking or classifying this code in order to prioritize and organize human efforts in manual checks. We’re living in an age where machine learning is performing well in several other technological fields, so how about applying it to our bug-finding appetite?
In this and upcoming articles, we are interested in the use of machine learning (ML) techniques to find security vulnerabilities in source code. It is important to specify this since, as we will see, there are many other related, but different, approaches such as:
Automatically fixing vulnerabilities
Vulnerability detection (VD) in binary code
ML-aided dynamic testing
Other automated techniques that don’t involve ML
The idea of using ML techniques for VD is not new. There are papers on the matter as old as 2001. Here we’ll try to describe in simple terms:
what has been done in this area,
what the current state-of-the-art is and
which new research paths remain open.
We will be following and building on top of two previous state-of-the-art papers: Abraham and de Vel (2017) and Ghaffarian and Shahriari (2017).
We feel that grouping approaches by the semantic features extracted from code makes more sense, as do Ghaffarian and Shahriari (2017). These approaches are further subdivided into:
Vulnerable code pattern recognition. Usually based on labeled data (samples of faulty and safe code), these methods determine patterns that distinguish one from the other, and
Anomaly detection. This means, based upon a large code base, extracting models of what "normal code" should look like and determining which pieces of code do not fit in with this model.
Anomaly detection approaches
Most of the papers in this category are not security-focused, but their ideas can be used for VD. Also, most of these works revolve around extracting features such as:
proper API usage patterns, e.g., the pair malloc and free,
missing checks, like ensuring a number is non-zero before dividing by it,
lack of input validation, leading to injections, buffer overflows, …
lack of access controls, which may lead to confidential information being leaked or altered, or to access being denied.
The system Chucky by Yamaguchi et al. (2013) interests us the most, since it is closest to our own goal of lightening the burden of manual code auditors. It also covers both of the aforementioned objectives: detecting missing checks in security logic (e.g., access control) and in secure API usage (e.g., checking buffer size). It uses the bag-of-words model to represent the code and the k-nearest-neighbors technique to analyze it. Chucky discovered 12 new vulnerabilities in high-profile projects such as Pidgin and LibTIFF. See our article on Chucky for details.
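To make the idea concrete, here is a minimal sketch of missing-check detection with bag-of-words vectors and k-nearest neighbors. It is a simplification under our own assumptions, not Chucky's actual implementation: the function records, the `symbols` and `checks` fields, and the `threshold` parameter are all hypothetical, and a real system would extract them with a parser.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def missing_checks(target, corpus, k=3, threshold=0.5):
    """Return checks that most of the k nearest neighbors perform but `target` lacks."""
    target_bag = Counter(target["symbols"])
    # Rank every other function by similarity of its API symbols to the target.
    neighbors = sorted(
        (f for f in corpus if f is not target),
        key=lambda f: cosine(target_bag, Counter(f["symbols"])),
        reverse=True,
    )[:k]
    if not neighbors:
        return {}
    # Count how many neighbors perform each check.
    counts = Counter(c for f in neighbors for c in f["checks"])
    return {
        check: n / len(neighbors)
        for check, n in counts.items()
        if n / len(neighbors) > threshold and check not in target["checks"]
    }


# Hypothetical toy corpus: API symbols used and variables checked per function.
corpus = [
    {"name": "read_header", "symbols": ["malloc", "memcpy", "len"], "checks": {"len"}},
    {"name": "read_body", "symbols": ["malloc", "memcpy", "len", "free"], "checks": {"len"}},
    {"name": "read_footer", "symbols": ["malloc", "memcpy", "len"], "checks": {"len"}},
    {"name": "log_msg", "symbols": ["printf", "time"], "checks": set()},
    {"name": "read_tag", "symbols": ["malloc", "memcpy", "len"], "checks": set()},
]

suspicious = missing_checks(corpus[4], corpus)
print(suspicious)
```

In this toy corpus, `read_tag` uses the same symbols as its three nearest neighbors but never checks `len`, so that check is reported as missing. The output is a hint for an auditor, not a verdict.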
A year later, Yamaguchi et al. (2014) reuse this idea of exploiting graph representations of code in order to find vulnerable code patterns. This time they propose automating the design of effective traversals, which might lead to vulnerability detection, using an unsupervised clustering approach. This resulted in the tool Joern, which was able to find 5 zero-day vulnerabilities in products like Pidgin.
The other papers in this category are not security-focused. All of them use frequent itemset mining, only with different features to mine and different targets to extract. We summarize them here for the sake of completeness:
Table 1. Other anomaly-seeking approaches
| Approach | Mined features | Extracted targets |
|---|---|---|
| Livshits and Zimmermann (2005) | Commit logs | App-specific patterns |
| Li and Zhou (2005) | Source code | Implicit coding rules |
| Wasylkowski et al. (2007) | Function call sequences | Object usage models |
| Acharya et al. (2007) | API usage traces | API usage orderings |
| Chang et al. (2008) | Neglected conditions | Implicit conditional rules |
| Thummalapenta et al. (2009) | Programming rules | Alternative patterns |
| Gruska et al. (2010) | Function calls | Cross-project anomalies |
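As a rough illustration of frequent itemset mining applied to API usage, the sketch below mines pairs of calls that frequently occur together, turns them into rules of the form "functions that call A usually call B", and flags the functions that break a rule. It is not taken from any of the papers in the table: the `calls` data, the function names and the `min_support` and `min_confidence` thresholds are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def mine_rules(transactions, min_support=3, min_confidence=0.75):
    """Return rules (lhs, rhs, confidence): functions calling lhs usually call rhs."""
    item_count = Counter()
    pair_count = Counter()
    for items in transactions.values():
        item_count.update(items)
        pair_count.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), n in pair_count.items():
        if n < min_support:
            continue  # the pair is too rare to be considered a pattern
        for lhs, rhs in ((a, b), (b, a)):
            confidence = n / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, confidence))
    return rules


def violations(transactions, rules):
    """List (function, lhs, rhs) where the function calls lhs but not rhs."""
    return sorted(
        (name, lhs, rhs)
        for name, items in transactions.items()
        for lhs, rhs, _ in rules
        if lhs in items and rhs not in items
    )


# Hypothetical call sets per function; in practice a parser would extract them.
calls = {
    "f1": {"malloc", "free", "memcpy"},
    "f2": {"malloc", "free"},
    "f3": {"malloc", "free", "printf"},
    "f4": {"malloc", "memcpy"},  # allocates but never frees
    "f5": {"printf"},
}

rules = mine_rules(calls)
print(violations(calls, rules))
```

Here `malloc` and `free` appear together in three of the four functions that call `malloc`, so the rule "malloc implies free" is mined and `f4` is reported as the deviant. If most functions forgot to call `free`, no rule would be mined at all.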
In general terms, anomaly detection approaches have the following limitations:
they only apply to mature software, where wrong API usages can be assumed to be rare occurrences,
the faulty usage must be relatively infrequent in the codebase to be identified as an anomaly (otherwise the faulty usage becomes the norm),
they generally cannot identify the type of the vulnerability, or even whether the anomaly is a security vulnerability at all, only that it is a deviant element, and
false-positive rates are generally high.
Pattern recognition approaches
The aim is to take a large dataset of vulnerability samples and extract vulnerable code patterns using (usually supervised) machine learning algorithms. The key is the technique used for extracting features, which ranges from conventional parsers to data-flow and control-flow analysis, and even direct text mining of the source code. Most of these papers use classification algorithms.
Once more Yamaguchi et al. (2011, 2012) take the lead, mimicking the mental process behind the daily grind of the code auditor: searching for similar instances of recently discovered vulnerabilities. They sensibly call this 'vulnerability extrapolation'. The gist: parse, embed into vector space via a bag-of-words-like method, perform semantic analysis to obtain particular matrices, and then compare to known-vulnerable code using standard distance functions.
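The gist can be sketched in a few lines, with heavy simplifications that are ours and not the authors': a regular expression stands in for the parser, the semantic analysis step that produces the matrices is omitted entirely, and cosine similarity plays the role of the distance function. The code snippets and function names are made up for the example.

```python
import re
from collections import Counter
from math import sqrt

# Crude stand-in for a parser: any identifier followed by "(" counts as an API call.
# It would also match keywords such as "if (", which a real parser would not.
CALL = re.compile(r"([A-Za-z_]\w*)\s*\(")


def embed(source):
    """Map source code to a bag-of-words vector over the API symbols it calls."""
    return Counter(CALL.findall(source))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extrapolate(known_vulnerable, candidates):
    """Rank candidate functions by similarity to a known-vulnerable one."""
    reference = embed(known_vulnerable)
    scored = [(name, cosine(reference, embed(src))) for name, src in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical snippets: one recently found vulnerability and three candidates.
known_vulnerable = "len = read_int(buf); dst = malloc(len); memcpy(dst, src, len);"
candidates = {
    "print_usage": "printf(usage); exit(1);",
    "copy_name": "dst = malloc(size); strcpy(dst, name);",
    "parse_chunk": "n = read_int(p); out = malloc(n); memcpy(out, p, n);",
}

ranking = extrapolate(known_vulnerable, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The auditor would then review the candidates from the top of the ranking down. In this toy case `parse_chunk` comes first because it calls exactly the same API symbols as the known-vulnerable function.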
Other approaches in this category are Scandariato et al. (2014) and Pang et al. (2015), who attempted to use techniques such as n-gram analysis using bag-of-words, but with limited results, probably due to shallow information and simple methods.
The binary analysis tool VDiscover doesn’t exactly fit our definition, but deserves a mention. Its authors treat each trace of calls to the standard C library as a text document, process these traces as n-grams and encode them with word2vec. They tested several ML techniques, such as logistic regression, multilayer perceptrons (MLP) and random forests.
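For readers unfamiliar with n-grams, the following sketch shows how a call trace can be treated as a document and split into n-grams. It only covers that first step; the word2vec encoding and the classifiers are left out, and the trace itself is invented for the example rather than taken from VDiscover.

```python
from collections import Counter


def ngrams(trace, n=2):
    """Count the n-grams (tuples of n consecutive calls) in a call trace."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))


# Hypothetical trace of standard C library calls observed in one execution.
trace = ["fopen", "fread", "fread", "strcpy", "fread", "fread", "fclose"]

bigrams = ngrams(trace)
for gram, count in bigrams.most_common():
    print(gram, count)
```

The resulting counts can be used as features for a classifier, in the same way word n-grams are used in text classification.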
In the last few months, some in-scope papers have appeared. Li et al. propose two systems: VulDeePecker (2018a) and SySeVR (2018b), which claim to extract both syntactic and semantic information from the code, thus also considering both data and control flow. They report good results with low false positives and 15 zero-day vulnerabilities in high-profile open source libraries. See our article on these systems.
Lin et al. (2017) propose a different variant which simplifies feature extraction, going back to just the abstract syntax tree (AST) with no semantic information, and uses deep learning in the form of bidirectional long short-term memory (BLSTM) networks. It also adds a completely new element: unlike the vast majority of previous works, which operate in the within-project domain, their approach involves software metrics (see below) in order to compare to other projects.
However interesting these approaches seem, they are not without limitations:
Most of these models aren’t able to identify the type of the vulnerability. They only recognize patterns of vulnerable code. This also means that most do not pinpoint the exact locations of the potential flaws.
Any work in machine learning for VD should take into account several aspects of the code for richer descriptions, such as syntax, semantics and the flow of data and control.
The quality of the results is believed to be mostly due to the features that are extracted and fed to the learning algorithms. Ghaffarian and Shahriari call this 'feature engineering'. Features extracted from graph representations, according to them, have not been fully exploited.
Unsupervised machine learning algorithms, especially deep learning, are underused, although this has started to change in recent years.
Software metrics such as:
size (logical lines of code),
code churn and
have been proposed as 'predictors' for the presence of vulnerabilities in software projects. These studies use mostly manual procedures based on publicly available vulnerability sources such as NVD. According to  and Walden et al. (2014), predicting the existence of vulnerabilities based on software engineering metrics could be thought of as a case of "confusing symptoms and causes":
Hence, most papers reviewed in this category present high false-positive rates, and hardly any of them has explored automated techniques.
That was the panorama of machine learning in software vulnerability research as of late 2018. Some common limitations:
The problem of finding vulnerabilities is 'undecidable' in view of Rice’s theorem, i.e., a universal algorithm for finding vulnerabilities cannot exist, since a program cannot decide non-trivial semantic properties of another program in the general case.
Coarse granularity and lack of explanations.
A higher degree of automation is desirable, not in order to replace, but to guide, manual code auditing. Purely automated approaches are, in view of Rice’s theorem, impossible or misguided.
Thus our good old pentest is not dead. Even at the level of cutting-edge research, vulnerability discovery, and especially confirmation and exploitation, remain tasks for human experts.
T. Abraham and O. de Vel (2017). A Review of Machine Learning in Software Vulnerability Research. DST-Group-GD-0979. Australian Department of Defence.
S. Ghaffarian and H. Shahriari (2017). Software Vulnerability Analysis and Discovery Using Machine-Learning and Data-Mining Techniques: A Survey. ACM Computing Surveys (CSUR) 50 (4).