Machine-learning to hack
Machine learning for vulnerability discovery
To date, the most important security vulnerabilities have been found via laborious code auditing. This is also the only way vulnerabilities can be found and fixed during development. However, as software production rates increase, so does the need for a reliable, automated method for checking or classifying this code in order to prioritize and organize the human effort spent on manual checks. We live in an age where machine learning is performing well in several other technological fields, so how about applying it to our bug-finding appetite?
In this and upcoming articles, we are interested in the use of machine learning (ML) to find security vulnerabilities in source code. It is important to specify this since, as we will see, there are many other related, but different, approaches:
Automatically fixing vulnerabilities
Vulnerability detection (VD) in binary code
ML-aided dynamic testing
Other automated techniques that don’t involve ML
The idea of using ML techniques for VD is not new. There are papers on the matter dating back to 2001.
Here we’ll try to:
describe in simple terms what has been done in this area,
describe what the current state of the art is, and
elucidate new research paths.
We will be following and building on top of two previous state-of-the-art papers: Abraham and de Vel (2017) and Ghaffarian and Shahriari (2017).
Like Ghaffarian and Shahriari (2017), we feel that grouping the approaches by the semantic features they extract from code makes more sense. These approaches are further subdivided into:
Vulnerable code pattern recognition. Usually based on labeled data (samples of faulty and safe code), these approaches extract patterns that characterize vulnerable code, and
Anomaly detection. Based on a large codebase, these approaches extract models of what "normal code" should look like and flag pieces of code that do not fit the model.
Anomaly detection approaches
Most of the papers in this category are not security-focused, but their ideas can be used for vulnerability detection. Also, most of these works revolve around extracting features such as:
API usage patterns, v.g. the pair
missing checks, like ensuring a number is non-zero before dividing by it,
lack of input validation, leading to injections, buffer overflows, …
lack of access controls, which may lead to confidential information being leaked, altered or denied access to.
The system Chucky by Yamaguchi et al. (2013) is the one that interests us the most, since its goal is closest to ours, i.e., lightening the burden of manual code auditors. It also covers both of the aforementioned objectives: detecting missing checks in security logic (v.g. access control) and in API usage (v.g. checking buffer size).
It uses the
model to represent the code and the
technique to analyze it.
'Chucky' discovered 12 new vulnerabilities in high-profile projects such as Pidgin and LibTIFF. See our article on Chucky for details.
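The missing-check idea can be illustrated with a toy sketch. This is not Chucky's implementation: the function names, the representation of each function as a set of checked symbols, and the simple majority score are all assumptions made for the example. A function is flagged when it omits a check that most of its neighbors (here, all other functions calling the same API) perform.

```python
from collections import Counter

# Hypothetical data: for each function that calls the same API,
# the set of symbols it tests in a condition before the call.
checks_by_function = {
    "read_header": {"len", "buf"},
    "read_body": {"len", "buf"},
    "read_footer": {"len", "buf"},
    "read_trailer": {"buf"},  # never checks `len`
}


def missing_checks(checks_by_function, threshold=0.5):
    """Return {function: {symbol: score}} for checks that more than
    `threshold` of the neighbors perform but the function itself omits."""
    report = {}
    for name, own in checks_by_function.items():
        # Neighbors are simply all the other functions (an assumption).
        neighbors = [c for other, c in checks_by_function.items() if other != name]
        if not neighbors:
            continue
        counts = Counter(sym for checks in neighbors for sym in checks)
        missing = {}
        for sym, n in counts.items():
            score = n / len(neighbors)  # fraction of neighbors doing the check
            if sym not in own and score > threshold:
                missing[sym] = score
        if missing:
            report[name] = missing
    return report


print(missing_checks(checks_by_function))
```

With the data above, only `read_trailer` is reported, with a score of 1.0 for the missing `len` check, since all three of its neighbors perform it.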
A year later, Yamaguchi et al. (2014) reuse this idea of exploiting graph representations of code in order to find vulnerable code patterns. This time they propose automating the design of effective traversals which might lead to vulnerability detection using the unsupervised clustering approach. This resulted in the tool 'Joern', which was able to find 5 zero-day vulnerabilities in products like Pidgin.
The remaining papers in this category are not security-focused. All of them use frequent itemset mining, differing only in the features they mine and the targets they extract. We summarize them here for the sake of completeness:
Implicit coding rules
Function call sequences
Object usage models
Implicit conditional rules
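To give a feel for how frequent itemset mining can expose implicit rules, here is a minimal sketch restricted to itemsets of size two. The call sets, the function names (`lock`, `unlock`, `log`) and the support and confidence thresholds are hypothetical, and none of the reviewed papers is being reproduced here. A rule "calls A, so it should also call B" is mined when it holds in most functions, and the functions that break it are reported as anomalies.

```python
from collections import Counter
from itertools import permutations

# Hypothetical call sets, one per function in a codebase.
call_sets = {
    "f1": {"lock", "unlock", "log"},
    "f2": {"lock", "unlock"},
    "f3": {"lock", "unlock", "log"},
    "f4": {"lock", "unlock"},
    "f5": {"lock", "log"},  # takes the lock but never releases it
}


def mine_rules(call_sets, min_support=3, min_confidence=0.75):
    """Find rules 'calls A => also calls B' that hold in most functions."""
    item_count = Counter()
    pair_count = Counter()
    for calls in call_sets.values():
        item_count.update(calls)
        # Ordered pairs (a, b) with a != b that occur in the same function.
        pair_count.update(permutations(sorted(calls), 2))
    rules = {}
    for (a, b), both in pair_count.items():
        confidence = both / item_count[a]
        if both >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def violations(call_sets, rules):
    """Functions that call A but not B for some mined rule A => B."""
    return sorted(
        (name, a, b)
        for name, calls in call_sets.items()
        for (a, b) in rules
        if a in calls and b not in calls
    )


rules = mine_rules(call_sets)
print(violations(call_sets, rules))
```

On this data the rule `lock => unlock` is mined with confidence 0.8, and `f5` is the only violation. Note that the rule is only found because the deviation is rare, which is exactly the limitation discussed for this family of approaches.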
In general terms, anomaly detection approaches have the following limitations:
they only apply to mature software, where we can assume that wrong API usages are rare occurrences,
that particular usage must be relatively infrequent in the codebase to be identified as an anomaly (otherwise the rule becomes the norm),
they generally cannot identify the type of the vulnerability, or even 'if' the anomaly is a security vulnerability, only that it is a deviant element, and
false-positive rates are generally high.
Pattern recognition approaches
The aim is to take a large dataset of vulnerability samples and extract vulnerable code patterns using (usually supervised) machine learning algorithms. The key is the technique used for extracting features, which ranges from conventional parsers through data-flow and control-flow analysis to text mining applied directly to the source code. Most of these papers use classification algorithms.
Once more Yamaguchi et al. (2011, 2012) take the lead, mimicking the mental process behind the daily grind of the code auditor: searching for similar instances of recently discovered vulnerabilities. They sensibly call this 'vulnerability extrapolation'. The gist: parse, embed into vector space via a bag-of-words-like method, perform semantic analysis to obtain particular matrices, and then compare to known-vulnerable code using standard distance functions.
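The gist of vulnerability extrapolation can be sketched in a few lines. This is a deliberately simplified illustration, not the authors' method: the tokenizer is a plain regular expression instead of a parser, the semantic-analysis step that produces the matrices is skipped entirely, and the C snippets and function names are made up for the example. What remains is the embed-and-compare core: bag-of-words vectors and cosine distance to a known-vulnerable function.

```python
import math
import re
from collections import Counter


def embed(source):
    """Bag-of-words vector: identifier or keyword token -> count."""
    return Counter(re.findall(r"[A-Za-z_]\w*", source))


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def rank_by_similarity(known_vulnerable, candidates):
    """Candidate names sorted from most to least similar to the known bug."""
    ref = embed(known_vulnerable)
    return sorted(
        candidates, key=lambda name: cosine_distance(ref, embed(candidates[name]))
    )


# Hypothetical inputs: one known-vulnerable function and two candidates.
known_vulnerable = "void f(char *s) { char buf[16]; strcpy(buf, s); }"
candidates = {
    "copy_name": "void g(char *name) { char buf[32]; strcpy(buf, name); }",
    "add": "int add(int a, int b) { return a + b; }",
}

print(rank_by_similarity(known_vulnerable, candidates))
```

Here `copy_name` is ranked first, because it shares the `strcpy` into a fixed-size `buf` pattern with the known-vulnerable function, while `add` shares no tokens with it at all. An auditor would then review candidates in that order.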
Other approaches in this category are Scandariato et al. (2014) and Pang et al. (2015), who attempted to use techniques such as n-gram analysis using bag-of-words, but with limited results, probably due to shallow information and simple methods.
The binary analysis tool VDiscover doesn’t exactly fit our definition, but deserves mentioning.
They identify each trace of a call to the standard
as a text document and process them
and encode them with
They have tested several
ML techniques such as
and random forests.
In the last few months, some in-scope papers have appeared. Li et al. propose two systems: VulDeePecker (2018a) and SySeVR (2018b), which claim to extract both syntactic and semantic information from the code, thus also considering both data and control flow. They report good results with low false positives and 15 zero-day vulnerabilities in high-profile open source libraries. See our article on these systems.
Lin et al. (2017) propose a different variant which simplifies the feature extraction, going back to just the AST with no semantic information, in the form of bidirectional long short-term memory (BLSTM) networks, plus a completely new element: unlike the vast majority of previous works, which work in the within-project domain, POSTER involves software metrics (see below) in order to compare to other projects.
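To illustrate what "just the AST" means as input for a sequence model, the following sketch serializes an abstract syntax tree into a depth-first sequence of node types and maps it to integer ids. It is an assumption-laden illustration: it uses Python's standard `ast` module on Python source only because that parser ships with the language, whereas the paper targets other code; the helper names are ours; and the BLSTM network itself is not shown.

```python
import ast


def ast_node_sequence(source):
    """Depth-first list of AST node type names for a piece of Python source."""
    tree = ast.parse(source)
    sequence = []

    def visit(node):
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return sequence


def encode(sequence, vocabulary):
    """Map node names to integer ids, growing the vocabulary as needed.
    Ids start at 1 so that 0 stays free for padding."""
    return [vocabulary.setdefault(name, len(vocabulary) + 1) for name in sequence]


vocabulary = {}
nodes = ast_node_sequence("x = a / b")
print(nodes)
print(encode(nodes, vocabulary))
```

The resulting sequence starts with `Module` and `Assign` and contains a `Div` node, but carries no identifier names and no data-flow or control-flow information, which is what makes this kind of feature extraction simple. Integer sequences like this one are the usual input format for recurrent networks.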
However interesting these approaches seem, they are not without limitations:
Most of these models aren’t able to identify the type of the vulnerability. They only recognize patterns of vulnerable code. This also means that most do not pinpoint the exact locations of the potential flaws.
Any work in machine learning for VD should take into account several aspects of the code for richer descriptions, such as syntax, semantics and the flow of data and control.
The quality of the results is believed to be mostly due to the features that are extracted and fed to the learning algorithms. Ghaffarian and Shahriari call this 'feature engineering'. Features extracted from graph representations, according to them, have not been fully exploited.
Unsupervised machine learning algorithms, especially deep learning, are underused, although this has started to change in recent years.
Software metrics such as:
have been proposed as 'predictors' for the presence of vulnerabilities in software projects. These studies use mostly manual procedures based on publicly available vulnerability sources such as NVD. According to  and Walden et al. (2014), predicting the existence of vulnerabilities based on software engineering metrics could be thought of as a case of "confusing symptoms and causes":
Hence, most papers reviewed in this category present high false-positive rates, and hardly any of them has explored automated techniques.
That was the panorama of machine learning in software vulnerability research as of late 2018. Some limitations are common to these approaches:
The problem of finding vulnerabilities is 'undecidable' in view of Rice’s theorem, i.e., a universal algorithm for finding vulnerabilities cannot exist, since no program can decide a non-trivial semantic property of arbitrary programs.
Coarse granularity and lack of explanations.
A higher degree of automation is desirable, not in order to replace, but to guide, manual code auditing. Purely automated approaches are, in view of Rice’s theorem, impossible or misguided.
Thus our good old pentest is not dead. Even at the level of cutting-edge research, automated vulnerability discovery, and especially confirmation and exploitation, are tasks for human experts.
T. Abraham and O. de Vel (2017). 'A Review of Machine Learning in Software Vulnerability Research'. DST-Group-GD-0979. Australian Department of Defence.
S. Ghaffarian and H. Shahriari (2017). Software Vulnerability Analysis and Discovery Using Machine-Learning and Data-Mining Techniques: A Survey. 'ACM Computing Surveys (CSUR)' 50 (4).