Can machines learn to hack?


Machine learning for vulnerability discovery
A bird's eye view of machine learning techniques applied to vulnerability discovery in source code, reviewing papers from 2011 to 2018. Approaches are broadly grouped as anomaly detection, meta-code analysis and code pattern recognition, which will be the most interesting for our purposes.

To date, the most important security vulnerabilities have been found through laborious code auditing. Moreover, this is the only way vulnerabilities can be found and fixed during development. However, as software production rates increase, so does the need for a reliable, automated method for checking or classifying this code in order to prioritize and organize human efforts in manual checks. We are living in an age where machine learning is performing well in many other technological fields, so why not apply it to satisfy our bug-finding appetite?

In this and upcoming articles, we are interested in the use of machine learning (ML) techniques to find security vulnerabilities in source code. It is important to specify this since, as we will see, there are many other approaches that may be related to it but do not fit this definition exactly, such as:

  • Automatically fixing vulnerabilities.

  • Vulnerability detection (VD) in binary code. We want to aid manual code review, which is done on human-readable source code.

  • ML-aided dynamic testing.

  • Other automated techniques that don’t involve ML.

  • Exploitability prediction. Again, we just want to discover vulnerabilities so they can be manually reviewed and exploited later.

The idea of using ML techniques for VD is not new. There are papers on the matter dating back to 2001. Here we'll try to describe in simple terms:

  • what has been done in this area,

  • what the current state of the art is, and

  • what new research paths might be elucidated.

We will be following and building on top of two previous state-of-the-art surveys by Abraham and de Vel (2017)[1] and Ghaffarian and Shahriari (2017)[2], adding the papers and projects that have appeared in the past year.

The approaches to applying ML to VD can be classified:

  • by the type of ML technique used,

  • by the degree of automation,

  • by the type of code they apply to, or

  • by the features that are extracted for input or pursued as output, i.e., what we want to look at in the code and what we want to be able to learn from it.

We feel this last approach makes the most sense, as do the authors of [2], who divide the approaches according to whether program syntax and semantics are analyzed. The ones that do are further subdivided into two main categories:

  • Vulnerable code pattern recognition: usually based on labeled data (samples of faulty and safe code), these techniques determine patterns that characterize vulnerable code, and

  • Anomaly detection: based upon a large code base, these techniques extract models of what "normal" code should look like and flag pieces of code that do not fit the model.

Anomaly detection approaches

Most of the papers in this category are not security-focused, but their ideas can be used for VD. Also, most of these works revolve around extracting features such as the following (a toy mining sketch follows the list):

  • proper API usage patterns, e.g., the pairing of malloc and free,

  • missing checks, like ensuring a number is non-zero before dividing by it,

  • lack of input validation, which may lead to injections, buffer overflows, …

  • lack of access controls, which may lead to confidential information being leaked, altered or denied access to.
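
As an illustration of the first of these features, here is a toy Python sketch of the frequent-itemset style mining behind many of these works: it counts which API calls co-occur across a made-up set of functions, derives association rules such as "functions that call malloc almost always also call free", and flags functions that violate them. Real systems mine parsed code bases with thousands of functions; the corpus, thresholds and plain token sets here are invented.

```python
from collections import defaultdict

# Toy corpus: each "function" is reduced to the set of API calls it makes.
# A real miner would extract these sets from a parsed code base.
functions = {
    "read_config":  {"fopen", "fread", "fclose"},
    "load_image":   {"fopen", "fread", "fclose", "malloc", "free"},
    "copy_file":    {"fopen", "fread", "fwrite", "fclose"},
    "parse_packet": {"malloc", "memcpy", "free"},
    "grow_buffer":  {"malloc", "realloc", "free"},
    "leaky_parse":  {"malloc", "memcpy"},   # suspicious: malloc without free
    "dump_state":   {"fopen", "fwrite"},    # suspicious: fopen without fclose
}

MIN_SUPPORT = 2       # a rule must be observed in at least this many functions
MIN_CONFIDENCE = 0.7  # P(b present | a present) must be at least this high

# First pass: count individual calls and co-occurring ordered pairs.
single = defaultdict(int)
pair = defaultdict(int)
for calls in functions.values():
    for a in calls:
        single[a] += 1
        for b in calls:
            if a != b:
                pair[(a, b)] += 1

# Second pass: turn frequent pairs into rules "if a is called, b usually is too".
rules = []
for (a, b), count in pair.items():
    confidence = count / single[a]
    if count >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
        rules.append((a, b, confidence))

# Finally, report the functions that deviate from the inferred rules.
for name, calls in functions.items():
    for a, b, confidence in rules:
        if a in calls and b not in calls:
            print(f"{name}: calls {a} but not {b} "
                  f"(rule holds for {confidence:.0%} of callers) -> possible anomaly")
```

Running this flags leaky_parse for missing free and dump_state for missing fclose; note that dump_state is also flagged for not calling fread, a spurious report that anticipates the false-positive problem listed at the end of this section.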

These approaches have also been used in software quality assurance, since they might provide, as a bonus, the automatic extraction of specifications, rules and patterns, a laborious and error-prone process if done by hand. Hence the mix with non-security-related papers in this area.

The system Chucky, by Yamaguchi et al. (2013), interests us the most since it is closest to our purpose, namely lightening the burden of manual code auditors; it also achieves both of the aforementioned objectives: detecting missing checks both in security logic (e.g., access control) and in secure API usage (e.g., checking buffer sizes). It uses the bag-of-words model to represent the code and the k-nearest-neighbors technique to analyze it. A special graph proposed by the same authors, called the code property graph, which combines other graph representations of code such as the abstract syntax tree, the dependency graph and the control flow graph, is stored in the graph-oriented database Neo4j and can then be traversed with the graph query language Gremlin. Chucky discovered 12 new vulnerabilities in high-profile projects such as Pidgin and LibTIFF.
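
Chucky's actual implementation is more involved, but the following Python sketch conveys the bag-of-words plus k-nearest-neighbors idea using scikit-learn: each function is represented by the API symbols and condition tokens ("checks") extracted from it, its nearest neighbors are found by cosine distance, and any check that the neighbors perform but the target does not is reported as possibly missing. The mini-corpus, the token lists and the 0.5 threshold are all invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented mini-corpus: each entry maps a function name to the API symbols
# and condition tokens ("checks") extracted from it. Real feature extraction
# would use a parser, not hand-written token lists.
functions = {
    "png_read_chunk":  "malloc memcpy CHECK_len CHECK_null free",
    "gif_read_block":  "malloc memcpy CHECK_len CHECK_null free",
    "bmp_read_header": "malloc memcpy CHECK_len free",
    "tiff_read_strip": "malloc memcpy free",            # target: is a check missing?
    "wav_read_chunk":  "malloc memcpy CHECK_len CHECK_null free",
}

names = list(functions)
vectorizer = CountVectorizer(binary=True, token_pattern=r"\S+")
X = vectorizer.fit_transform(functions.values()).toarray()
vocab = vectorizer.get_feature_names_out()

K = 3
target = names.index("tiff_read_strip")

# Find the K functions most similar to the target (cosine distance on
# bag-of-words vectors), excluding the target itself.
nn = NearestNeighbors(n_neighbors=K + 1, metric="cosine").fit(X)
_, idx = nn.kneighbors(X[target].reshape(1, -1))
neighbors = [i for i in idx[0] if i != target][:K]

# For every "check" feature, compare how often the neighbors perform it
# with whether the target does; a large gap suggests a missing check.
for j, feat in enumerate(vocab):
    if not feat.startswith("check_"):
        continue
    neighbor_rate = np.mean([X[i, j] for i in neighbors])
    if neighbor_rate - X[target, j] > 0.5:
        print(f"tiff_read_strip may be missing {feat} "
              f"(present in {neighbor_rate:.0%} of its nearest neighbors)")
```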

A year later, Yamaguchi et al. (2014) reuse this idea of exploiting graph representations of code in order to find vulnerable code patterns. This time they propose automating the design of effective traversals, which might lead to vulnerability detection, using an unsupervised clustering approach. This resulted in the tool Joern, which was able to find five zero-day vulnerabilities in products such as Pidgin.

As noted above, most of the papers in this category are not security-focused. All of them use frequent itemset mining, only with different features to mine and different targets to extract. We summarize them here for the sake of completeness:

Table 1. Other anomaly-seeking approaches

  Paper                           Mined elements            Target
  ------------------------------  ------------------------  --------------------------
  Livshits and Zimmermann (2005)  commit logs               app-specific patterns
  Li and Zhou (2005)              source code               implicit coding rules
  Wasylkowski et al. (2007)       function call sequences   object usage models
  Acharya et al. (2007)           API usage traces          API usage orderings
  Chang et al. (2008)             neglected conditions      implicit conditional rules
  Thummalapenta et al. (2009)     programming rules         alternative patterns
  Gruska et al. (2010)            function calls            cross-project anomalies

In general terms, anomaly detection approaches have the following limitations:

  • they only apply to mature software, where we can assume that incorrect API usage is a rare occurrence,

  • the correct usage must be relatively frequent in the code base for deviations from it to stand out as anomalies (otherwise the anomaly becomes the norm),

  • they generally cannot identify the type of the vulnerability, or even whether the anomaly is a security vulnerability at all, only that it is a deviant element,

  • false-positive rates are generally high.

Pattern recognition approaches

The aim is to take a large dataset of vulnerability samples and extract vulnerable code patterns using (usually supervised) machine learning algorithms. The key is the technique used for extracting features, which ranges from conventional parsers, through data-flow and control-flow analysis, to directly text-mining the source code. Most of these papers use classification algorithms.

Once more Yamaguchi et al. (2011, 2012) take the lead, mimicking the mental process behind the daily grind of the code auditor: searching for similar instances of previously (recently) discovered vulnerabilities. They sensibly call this vulnerability extrapolation. The gist: parse, embed into vector space via a bag-of-words-like method, perform semantic analysis to obtain particular matrices, and then compare to known-vulnerable code using standard distance functions.
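
The following rough sketch, using scikit-learn, shows what such an embed-and-compare pipeline can look like: functions are represented by the API symbols they use, projected onto a small latent space (a latent-semantic-analysis-style step standing in for the paper's semantic analysis), and ranked by cosine similarity to a known-vulnerable function. The corpus, the symbol lists and the number of latent dimensions are invented; the original work derives its features from API usage and abstract syntax trees, not hand-written token strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Invented corpus: each function is described by the API symbols it uses.
# A real pipeline would produce these documents from a parser.
corpus = {
    "known_vulnerable_fn": "memcpy malloc free",  # a previously discovered flaw
    "ssl_read_record":     "memcpy length_check malloc free decrypt_update",
    "ssl_write_record":    "memcpy length_check malloc free encrypt_update",
    "dtls_read_fragment":  "memcpy malloc free",  # structurally very similar
    "parse_http_header":   "strtok strcmp strlen",
    "render_template":     "snprintf strlen strcat",
}

names = list(corpus)
X = TfidfVectorizer(token_pattern=r"\S+").fit_transform(corpus.values())

# Latent-semantic-analysis-style step: project the sparse bag-of-words
# vectors onto a few latent dimensions before comparing them.
Z = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)

# Rank all other functions by cosine similarity to the known-vulnerable one,
# i.e., "extrapolate" the vulnerability to its closest relatives.
target = names.index("known_vulnerable_fn")
sims = cosine_similarity(Z[[target]], Z)[0]
for name, score in sorted(zip(names, sims), key=lambda p: -p[1]):
    if name != "known_vulnerable_fn":
        print(f"{score:+.2f}  {name}")
```

As expected, the function with the most similar symbol profile (here dtls_read_fragment) ranks first, which is exactly the kind of candidate a human auditor would want to review next.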

Other approaches in this category are Scandariato et al. (2014) and Pang et al. (2015), which attempt techniques such as n-gram analysis over bag-of-words representations, but with limited results, probably due to the shallowness of the information used and the simplicity of the methods.

VDiscover doesn't exactly fit our definition, since it analyzes binaries rather than source code, but it deserves a mention. Its authors treat each trace of calls to the standard C library as a text document, process these as n-grams and encode them with word2vec. They have tested several ML techniques such as logistic regression, multilayer perceptrons and random forests.
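
Not VDiscover's actual code, but a sketch of the general shape of that pipeline: each made-up trace of C standard library calls is treated as a document, call n-grams are extracted with scikit-learn, and a random forest, one of the classifiers the authors test, is trained to separate traces labelled as crash-prone from benign ones. The traces, labels and hyperparameters are invented, and the word2vec encoding of the original is omitted.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Invented training data: each sample is a trace of C standard library calls
# observed while running a test case, labelled 1 if the run ended in a
# memory-corruption crash and 0 otherwise.
traces = [
    "fopen fread strcpy strlen fclose",
    "malloc memcpy memcpy free",
    "fopen fread strncpy strlen fclose",
    "malloc strcpy strcat free",
    "calloc memset memcpy free",
    "fopen fgets sscanf fclose",
]
labels = [1, 0, 0, 1, 0, 0]

# Represent each trace by its call bigrams and trigrams (the "n-grams").
vectorizer = CountVectorizer(token_pattern=r"\S+", ngram_range=(2, 3))
X = vectorizer.fit_transform(traces)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# Score a new, unseen trace.
new_trace = ["fopen fread strcpy strcat fclose"]
prob = clf.predict_proba(vectorizer.transform(new_trace))[0, 1]
print(f"estimated probability of being crash-prone: {prob:.2f}")
```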

In the last few months, some in-scope papers have appeared. Li et al. propose two systems: VulDeePecker (2018a) and SySeVR (2018b), which claim to extract both syntactic and semantic information from the code in the form of program slices, thus considering both data and control flow. This information is then encoded as vectors using word2vec and fed to different neural networks (a minimal sketch of this kind of pipeline follows the list below). They report good results with low false positives and 15 zero-day vulnerabilities in high-profile open-source libraries. However, these systems:

  • need peer review, as they are preprints or conference papers,

  • are designed exclusively for C/C++ code bases, and

  • are subject to the limitations of other systems, such as coarse granularity.
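
To make the "slices to vectors to neural network" idea more tangible, here is a minimal Keras sketch of a bidirectional LSTM classifier over integer-encoded slice tokens. It is not the published architecture: the vocabulary size, sequence length and layer sizes are arbitrary, a trainable embedding layer stands in for the papers' word2vec vectors, and random data stands in for real, labelled program slices.

```python
import numpy as np
import tensorflow as tf

# Simplifying assumptions: program slices have already been tokenized,
# normalized and mapped to integer ids, then padded/truncated to MAX_LEN.
VOCAB_SIZE = 5000   # number of distinct tokens after normalization
MAX_LEN = 100       # tokens kept per slice

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # vulnerable / not vulnerable
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data standing in for encoded slices and their labels.
X_train = np.random.randint(1, VOCAB_SIZE, size=(256, MAX_LEN))
y_train = np.random.randint(0, 2, size=(256,))
model.fit(X_train, y_train, epochs=2, batch_size=32, verbose=0)

# Predicting on new slices yields one score per slice, not a line number:
# this is the coarse granularity limitation mentioned above.
scores = model.predict(X_train[:5], verbose=0)
print(scores.ravel())
```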

Lin et al. (2017) propose a different variant, POSTER, which simplifies feature extraction, going back to just the AST with no semantic information, and uses deep learning in the form of bidirectional long short-term memory (BLSTM) networks. It also adds a completely new element: unlike the vast majority of previous works, which operate within a single project (a limitation Ghaffarian and Shahriari repeatedly point out), POSTER involves software metrics (see below) in order to compare across projects.

However interesting these approaches seem, they are not without limitations:

  • Most of these models aren’t able to identify the type of the vulnerability. They only recognize patterns of vulnerable code. This also means that most do not pinpoint the exact locations of the potential flaws.

  • Any work in machine learning for VD should take into account several aspects of the code for a richer description, such as syntax, semantics and the flow of data and control.

  • The quality of the results is believed to be mostly due to the features that are extracted and fed to the learning algorithms; Ghaffarian and Shahriari call this feature engineering. According to them, features extracted from graph representations of code have not been fully exploited.

  • Unsupervised machine learning algorithms, especially deep learning, are underused, although this has started to change in recent years.

Other approaches

Software metrics have been proposed as predictors for the presence of vulnerabilities in software projects. These studies use mostly manual procedures based on publicly available vulnerability sources such as the NVD, with the exception of Moshtari et al. (2013), who propose a semi-automated, self-contained framework. Also noteworthy is VCCFinder by Perl et al. (2015), which works at the repository level to find vulnerability-contributing commits (VCCs).
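
As a sketch of what such a repository-level classifier can look like (not VCCFinder's implementation), the snippet below feeds a handful of invented per-commit features into a linear SVM, the same family of classifier the original paper reports using; the feature set, data and labels are made up.

```python
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Invented per-commit features: [lines added, lines deleted, files touched,
# occurrences of "len"/"size" in the diff, occurrences of "free"/"delete"].
X_train = [
    [350, 10,  7, 0, 4],
    [ 12,  3,  1, 2, 0],
    [800, 40, 15, 1, 9],
    [  5,  5,  1, 3, 0],
    [120,  8,  3, 0, 2],
    [ 20, 15,  2, 4, 1],
]
# 1 = later found to be a vulnerability-contributing commit (VCC), 0 = not.
y_train = [1, 0, 1, 0, 1, 0]

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(X_train, y_train)

# Score an incoming commit; a positive decision value means "looks like a VCC".
new_commit = [[500, 25, 9, 0, 6]]
score = clf.decision_function(new_commit)[0]
print(f"VCC score: {score:+.2f} (positive means the commit looks suspicious)")
```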

According to [2] and Walden et al. (2014), predicting the existence of vulnerabilities based on software engineering metrics could be thought of as a case of "confusing symptoms and causes":

XKCD on correlation

That is, there might be a correlation between certain metrics and the presence of vulnerabilities, but that doesn't tell us anything about the presence of vulnerabilities in general. Most of the papers reviewed in this category present high false-positive rates, and hardly any of them has explored automated techniques. Hence, we deem these the least interesting for our purposes.

Wijayasekara et al. (2012, 2014) focus on text-mining public vulnerability databases, which seems like a good idea, in order to find hidden-impact bugs, i.e., bugs which have been reported but whose security implications have gone unnoticed. Several other authors focus on the use of genetic algorithms and other techniques from "computational/artificial intelligence" which fall outside the scope of this article. Sparks et al. (2007), Alvares et al. (2013) and Medeiros et al. (2014) focused on classifying reported vulnerabilities using ML techniques, not on discovering them.


That was the panorama of machine learning in software vulnerability research as of late 2018. Some limitations common to most of these approaches:

  • The problem of finding vulnerabilities is undecidable in view of Rice’s theorem, i.e., a universal algorithm for finding vulnerabilities cannot exist, since a program cannot identify semantic properties of another program in the general case.

  • Limited applicability. Whether because a technique only applies to mature systems or to a particular language, it would be nice to have techniques with a broader spectrum.

  • Coarse granularity and lack of explanations. Most of the reviewed systems can only say "this program might have a vulnerability", but we would like to know the line or function where it appears, what type of vulnerability it is and what causes it in order to better allocate human resources for subsequent code review.

  • A higher degree of automation is desirable, not in order to replace, but to guide, manual code auditing. Purely automated approaches are, in view of Rice's theorem, impossible or misguided.

References

  1. T. Abraham and O. de Vel (2017). A Review of Machine Learning in Software Vulnerability Research. DST-Group-GD-0979. Australian Department of Defence.

  2. S. Ghaffarian and H. Shahriari (2017). Software Vulnerability Analysis and Discovery Using Machine-Learning and Data-Mining Techniques: A Survey. ACM Computing Surveys (CSUR), 50(4).



Rafael Ballestas
Mathematician with an itch for CS


