Table of contents

Title

Table of content

Title



Blog

Development

Machine learning for vulnerability discovery

cover-machine-learning-vulnerability-discovery (https://unsplash.com/photos/iar-afB0QQw)

Rafael Ballestas

Security analyst

Updated

Nov 7, 2018



6 min

To date the most important security vulnerabilities have been found via laborius code auditing. Also, this is the only way vulnerabilities can be found and fixed during development. However, as software production rates increase, so does the need for a reliable, automated method for checking or classifying this code in order to prioritize and organize human efforts in manual checks. We’re living in an age where machine learning is playing well in several other technological fields, how about applying it to our bug-finding appetite?

In this and upcoming articles, we are interested in the use of machine learning (ML) techniques to find security vulnerabilities in source code. It is important to specify this since, as we will see, there are many other related, but different, approaches such as:

Automatically fixing vulnerabilities
Vulnerability detection (VD) in binary code
ML-aided dynamic testing
Other automated techniques that don’t involve ML
Exploitability prediction

The idea of using ML techniques for VD is not new. There are papers on the matter as old as 2001. Here we’ll try to describe in simple terms:

what has been done in this area,
what the current state-of-the-art is and
try to elucidate new research paths.

We will be following and building on top of two previous state-of-the-art papers

We feel the grouping by semantic features extracted from code approach makes more sense, as do Ghaffarian and Shahriari (2017). These are further subdivided into:

Vulnerable code pattern recognition. Usually based on labeled data (samples of faulty and safe code) determine patterns that explain that, and
Anomaly detection. This means, based upon a large code base, extract models of what "normal code" should look like and determining pieces of code that do not fit in with this model.

Anomaly detection approaches

Most of the papers in this category are not security-focused, but their ideas can be used for VD. Also most of these works revolve around extracting features such as:

proper API usage patterns, v.g. the pair malloc and free,
missing checks, like ensuring a number is non-zero before dividing by it,
lack of input validation, leading to injections, buffer overflows, …
lack of access controls, which may lead to confidential information being leaked, altered or denied access to.

The system Chucky by Yamaguchi et al. (2013) is the one that interests us the most since it is more compatible with our interests, i.e., lightening the burden of manual code auditors; also, they achieve both the aforementioned objectives: detecting missing checks through security logic (v.g. access control) and secure API usage (v.g. checking buffer size). It uses the bag-of-words model to represent the code and the k-nearest-neighbors technique to analyze it. 'Chucky' discovered 12 new vulnerabilities in high-profile projects such as Pidgin and LibTIFF. See our article on Chucky for details.

A year later, Yamaguchi et al. (2014) reuse this idea of exploiting graph representations of code in order to find vulnerable code patterns. This time they propose automating the design of effective traversals which might lead to vulnerability detection using the unsupervised clustering approach. This resulted in the tool 'Joern', which was able to find 5 zero-day vulnerabilities in products like Pidgin.

Most of the papers in this category are not security focused. All of them use frequent itemset mining, only with different features to mine and different targets to extract. We summarize them here for the sake of completeness:

Table 1. Other anomaly-seeking approaches

Paper	Mined elements	Target
Livshits and Zimmermann (2005)	Commit logs	App-specific patterns
Li and Zhou (2005)	Source code	Implicit coding rules
Wasylowski et al. (2007)	Function call sequences	Object usage models
Acharya et al. (2007)	API usage traces	API usage orderings
Chang et al. (2008)	Neglected conditions	Implicit conditional rules
Thummalapenta et al (2009)	Programming rules	Alternative patterns
Gruska et al (2010)	Function calls	Cross-project anomalies

In general terms, anomaly detection approaches have the following limitations:

they only apply to mature software, where we assume wrong API usage are rare occurrences,
that particular usage must be relatively infrequent in the codebase to be identified as an anomaly (otherwise the rule becomes the norm),
they generally cannot identify the type of the vulnerability, or even 'if' the anomaly is a security vulnerability, only that it is a deviant element, and
false-positive rates are generally high.

Pattern recognition approaches

The aim is to take a large dataset of vulnerability samples and extract vulnerable code patterns using (usually supervised) machine learning algorithms. The key is the technique used for extracting features, which range from conventional parsers, data-flow and control-flow analysis, and even directly text mining the source code. Most of these papers use classification algorithms.

Once more Yamaguchi et al. (2011, 2012) take the lead, mimicking the mental process behind the daily grind of the code auditor: searching for similar instances of recently discovered vulnerabilities. They sensibly call this 'vulnerability extrapolation'. The gist: parse, embed into vector space via a bag-of-words-like method, perform semantic analysis to obtain particular matrices, and then compare to known-vulnerable code using standard distance functions.

Other approaches in this category are Scandariato et al. (2014) and Pang et al. (2015), who attempted to use techniques such as n-gram analysis using bag-of-words, but with limited results, probably due to shallow information and simple methods.

The binary analysis tool VDiscover doesn’t exactly fit our definition, but deserves mentioning. They identify each trace of a call to the standard C library as a text document and process them as n-grams and encode them with word2vec. They have tested several ML techniques such as logistic regression, MLP and random forests.

In the last few months, some in-scope papers have appeared. Li et al. propose two systems: VulDeePecker (2018a) and SySeVR (2018b), which claim to extract both syntactic and semantic information from the code, thus also considering both data and control flow. They report good results with low false positives and 15 zero-day vulnerabilities in high-profile open source libraries. See our article on these systems.

Lin et al. (2017) propose a different variant which simplifies the feature extraction, going back to just AST with no semantic information, using deep learning in the form of bidirectional long short-term memory (BLSTM) networks, plus a completely new element: unlike the vast majority of previous works, which work in the within-project domain, POSTER involves software metrics (see below) in order to compare to other projects.

However interesting these approaches seem, they are not without limitations:

Most of these models aren’t able to identify the type of the vulnerability. They only recognize patterns of vulnerable code. This also means that most do not pinpoint the exact locations of the potential flaws.
Any work in machine learning for VD should take into account several aspects of the code for richer descriptions, such as syntax, semantics and the flow of data and control.
The quality of the results is believed to be mostly due to the features that are extracted and fed to the learning algorithms. Ghaffarian calls this 'feature engineering'. Features extracted from graph representations, according to them, have not been fully exploited.
Unsupervised machine learning algorithms, especially deep learning, are underused, although this has started to change in recent years.

Other approaches

Software metrics such as:

size (logical lines of code),
cyclomatic complexity,
code churn and
developer activity

have been proposed as 'predictors' for the presence of vulnerabilities in software projects. These studies use mostly manual procedures based on publicly available vulnerability sources such as NVD. According to Ghaffarian and Shahriari (2017) and Walden et al. (2014), predicting the existence of vulnerabilities based on software engineering metrics could be thought of as a case of "confusing symptoms and causes":

Correlation vs. causation. Via XKCD.

Hence, most papers reviewed in this category present high false positive rates and hardly one of them has explored automated techniques.

That was the panorama of machine learning in software vulnerability research as of late 2018. Some limitations that are common:

The problem of finding vulnerabilities is 'undecidable' in view of Rice’s theorem, i.e., a universal algorithm for finding vulnerabilities cannot exist, since a program cannot identify semantic properties of another program in the general case.
Limited applicability.
Coarse granularity and lack of explanations.
A higher degree of automation is desirable, not in order to replace, but to guide, manual code auditing. Purely automated approaches are, in view of Rice’s theorem, impossible or misguided.

Thus, our good old pentest is not dead. Even at the level of cutting-edge research, automated vulnerability discovery, and especially confirmation and exploitation, are tasks for human experts.

References

T. Abraham and O. de Vel (2017). 'A Review of Machine Learning in Software Vulnerability Research'. DST-Group-GD-0979. Australian department of defence.
S. Ghaffarian and H. Shahriari (2017). Software Vulnerability Analysis and Discovery Using Machine-Learning and Data-Mining Techniques: A Survey. 'ACM Computing Surveys (CSUR)' 50 (4).

Get started with Fluid Attacks' application security solution right now

Tags:

machine-learning

vulnerability

security-testing







Subscribe to our newsletter

Stay updated on our upcoming events and latest blog posts, advisories and other engaging resources.

Coding with gen AI: Five best practices

Read post



cover-secure-coding-five-steps (https://unsplash.com/photos/zc9pWsPZd4Y)

Development

Felipe Ruiz

•

December 5, 2022

Secure coding in five steps? A simple approach to try out in cybersecurity training

Read post



Development

Felipe Ruiz

•

November 22, 2022

Go over and practice secure coding

Read post



cover-understand-program-semantics (https://unsplash.com/photos/j3dxI7CNYL0)

Development

Rafael Ballestas

•

February 14, 2020

Understanding program semantics with symbolic execution

Read post



cover-code-translate (https://unsplash.com/photos/r8H8K3w9AzA)

Development

Rafael Ballestas

•

January 31, 2020

Can code be translated? From code to words

Read post



cover-further-code2vec (https://unsplash.com/photos/FoiZoPtxSyA)

Development

Rafael Ballestas

•

January 24, 2020

Further down code2vec: Vector representations of code

Read post



Development

Rafael Ballestas

•

January 10, 2020

Embedding code into vectors: Vector representations of code

Read post



cover-vector-language (https://unsplash.com/photos/_E1PQXKUkMw)

Development

Rafael Ballestas

•

December 13, 2019

The vectors of language: Distributed representations of natural language

Read post



Start your 21-day free trial

Discover the benefits of our Continuous Hacking solution, which organizations of all sizes are already enjoying.

Try for free

Contact sales

Start your 21-day free trial

Discover the benefits of our Continuous Hacking solution, which organizations of all sizes are already enjoying.

Try for free

Contact sales

Start your 21-day free trial

Discover the benefits of our Continuous Hacking solution, which organizations of all sizes are already enjoying.

Try for free

Contact sales

Start your 21-day free trial

Discover the benefits of our Continuous Hacking solution, which organizations of all sizes are already enjoying.

Try for free

Contact sales

Fluid Attacks' solutions enable organizations to identify, prioritize, and remediate vulnerabilities in their software throughout the SDLC. Supported by AI, automated tools, and pentesters, Fluid Attacks accelerates companies' risk exposure mitigation and strengthens their cybersecurity posture.