Deep Hacking

Deep learning for vulnerability discovery

If we have learned anything so far in our quest to understand how machine learning (ML) can be used to detect vulnerabilities in source code, it’s that what matters most in this process is the representation of the source code that is later fed to the actual ML algorithms. In particular, these representations should include both semantic and syntactic information about the code.

We have also learned that one ML technique seems particularly promising but has hardly been exploited: deep learning. Methods such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Deep Belief Networks (DBN) have been successful in image and natural language processing, but have never been applied to vulnerability discovery in a systematic fashion.

The aim of the SySeVR project is to apply deep learning techniques to the discovery of software vulnerabilities in source code, considering not only the form (syntax) which might induce a vulnerability, but also the flow of data and control in the program. Its authors also try to produce results that are as fine-grained as possible, i.e., to tell us exactly at which line or function the flaw arises. If that’s not enough, they also promise to explain the cause of false positives, if there are any.

When working with images and pattern recognition, objects of interest have a natural representation as vectors, which are suitable for machine learning algorithms. In that case it is easy to propose where an object in the image might be: just take smaller pieces of the image and test their inherent features, such as texture and color, to determine whether or not they are what we want to detect.

In order to translate this idea to code, the authors leverage well-known patterns in previously identified vulnerabilities. These are simply patterns that might trigger dangerous situations, such as the use of malloc and pointers in C, concatenating user input, importing flawed libraries, etc.: anything that a regular static analysis tool might look for, though probably with false positives. They call these Syntactic Vulnerability Candidates (SyVC). A SyVC can be a single token (malloc) taken from the program’s Abstract Syntax Tree, a set of tokens (memset(dataBuffer…), or a whole statement which involves one of the dangerous situations mentioned above.
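To make the idea of a syntactic candidate more concrete, here is a minimal sketch in Python. It is not the SySeVR implementation, which works on the Abstract Syntax Tree: it only scans a crude token stream. The list `RISKY_CALLS`, the "name followed by a bracket" rule for array use, and the function name `find_syvcs` are all assumptions made for illustration.

```python
import re

# Hypothetical list of risky C library calls, chosen for illustration only.
RISKY_CALLS = {"malloc", "memset", "strcpy", "memcpy"}

# Crude lexer: identifiers, integer literals, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def find_syvcs(source):
    """Return (line_number, token, reason) tuples for candidate tokens."""
    candidates = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        tokens = TOKEN_RE.findall(line)
        for i, tok in enumerate(tokens):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if tok in RISKY_CALLS and nxt == "(":
                # A call to a function on the risky list.
                candidates.append((lineno, tok, "api-call"))
            elif IDENT_RE.fullmatch(tok) and nxt == "[":
                # A name that is indexed like an array.
                candidates.append((lineno, tok, "array-use"))
    return candidates


SAMPLE = """char *buf = malloc(n);
memset(buf, 0, n);
buf[i] = 'x';
"""

for hit in find_syvcs(SAMPLE):
    print(hit)
```

On the three-line sample this flags malloc on line 1, memset on line 2 and the indexed use of buf on line 3. Like any purely syntactic check, it says nothing about whether those uses are actually dangerous, which is where the false positives come from.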

Comparing image vs. code recognition.

In order to avoid false positives, the next logical step is to use semantic information about the program, i.e., how data and control flow in it, in order to expand our knowledge about what happens before and after executing a particular line of code. And where does this information lie? As we know by now, it can be found in the Control Flow and Program Dependence graphs. Armed with these two graphs, one can find the whole "influence zone" of a particular token or line, with a technique called program slicing. Basically, it means taking all nodes in the semantic graph representations that are reachable from the token of interest, the SyVC. In other words, all lines of code that are executed before and after this particular token, or that are somehow altered if its value were to change. The authors call this a Semantic Vulnerability Candidate (SeVC). Usually, if the SyVC is a whole function, then the corresponding SeVC will include all functions called by it and all the functions that call it.
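The slicing idea can be sketched as a plain graph traversal. The snippet below is a simplified illustration and not the authors' code: the dependence graph `DEPENDS` is written by hand for a hypothetical five-statement program (a real tool would derive it from the dependence graphs), and the slice is taken as the union of everything reachable forwards and backwards from the candidate.

```python
from collections import deque

# Hypothetical dependence edges: u -> v means "statement v depends on
# statement u" through data or control. Statement numbers are made up.
DEPENDS = {
    1: [3, 4],  # n = read_input()     feeds statements 3 and 4
    2: [5],     # log = open_log()     feeds statement 5
    3: [4],     # buf = malloc(n)      feeds statement 4
    4: [],      # memset(buf, 0, n)
    5: [],      # write(log, "done")
}


def reachable(graph, start):
    """Breadth-first search: every node reachable from `start`, inclusive."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge so that a forward search becomes a backward one."""
    rev = {node: [] for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev


def slice_around(graph, syvc):
    """Statements that influence `syvc` or are influenced by it."""
    forward = reachable(graph, syvc)
    backward = reachable(reverse(graph), syvc)
    return sorted(forward | backward)


print(slice_around(DEPENDS, 3))  # [1, 3, 4]
```

Taking statement 3 (the malloc call) as the SyVC, the slice keeps statements 1, 3 and 4 and drops the unrelated logging statements 2 and 5. That pruning is the point of the technique: the network only sees code that can actually affect, or be affected by, the candidate.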

The next problem to be solved is this: having already identified a piece of the program that might contain a vulnerability as a SeVC, how do we encode it as a vector or something else that can be understood by machine learning algorithms? The approach chosen by the authors is to first give generic names to all the functions and variables (thus lightly obfuscating the code), then perform a lexical analysis on it (i.e., break it up into symbols), and finally represent the resulting strings as a bag of words, a procedure we have already referred to in past articles. A fixed length must be chosen, and vectors that don’t fit must be padded or truncated, since the chosen neural networks take vectors of a fixed length as input. Here is a depiction of the process for a particular piece of code:

Illustration of the process.
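The first steps of that process (renaming, lexing, and fitting to a fixed length) can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the keyword and library-call lists, the generic names `VAR1` and `FUN1`, and the `<pad>` symbol are all hypothetical choices, and the final step of turning tokens into numbers is left out.

```python
import re

# Assumed lists of names that should NOT be renamed (illustrative only).
C_KEYWORDS = {"char", "int", "void", "if", "else", "for", "while", "return"}
LIBRARY_CALLS = {"malloc", "memset", "free", "strcpy"}

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")
PAD = "<pad>"  # hypothetical padding symbol


def normalise(code):
    """Tokenise code and replace user-defined names with generic ones."""
    tokens = TOKEN_RE.findall(code)
    names = {}
    counts = {"FUN": 0, "VAR": 0}
    out = []
    for i, tok in enumerate(tokens):
        is_ident = IDENT_RE.fullmatch(tok) is not None
        if not is_ident or tok in C_KEYWORDS or tok in LIBRARY_CALLS:
            out.append(tok)
            continue
        if tok not in names:
            # A new name followed by "(" is treated as a function,
            # anything else as a variable.
            is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
            kind = "FUN" if is_call else "VAR"
            counts[kind] += 1
            names[tok] = f"{kind}{counts[kind]}"
        out.append(names[tok])
    return out


def fit_length(tokens, length):
    """Truncate or right-pad a token list to exactly `length` items."""
    return tokens[:length] + [PAD] * max(0, length - len(tokens))


tokens = normalise("char *buf = malloc(n); memset(buf, 0, n);")
print(fit_length(tokens, 20))
```

Here buf becomes VAR1 and n becomes VAR2 everywhere they appear, while malloc and memset keep their names because they carry the meaning the network should learn. The 18 resulting tokens are padded with two `<pad>` symbols to reach the chosen length of 20; a longer slice would be cut off instead.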

All that remains is to train and test the neural networks. One of the goals of SySeVR was to be able to work with different types of networks. Six (!) different types were implemented in Python with the Theano library: CNN, DBN, and four types of RNN, namely (Bidirectional) Long Short-Term Memory ((B)LSTM) and (Bidirectional) Gated Recurrent Unit ((B)GRU). The authors validated their results against a vulnerability dataset combining NVD and SARD, whose samples are labeled either as vulnerable or not, ideally some with the corresponding diff and the vulnerability type.

But which syntactic patterns should be looked for? What will the syntactic vulnerability candidates be? To decide, the authors used standard static detection tools such as Checkmarx, Flawfinder and RATS. From these results, they decided to focus on four main vulnerability types, out of the 126 different kinds of vulnerabilities contained in the dataset:

  • Insecure API usage, e.g., malloc without free.

  • Array usage.

  • Pointer usage.

  • Improper arithmetic expressions.

For this particular "experiment", the graph code representations were obtained with the tool Joern by Yamaguchi et al., a sister project of Chucky of sorts. The SeVC-to-vector encoding was performed with word2vec.

The results of the experiment can be summarized as follows:

  • BGRU networks appear to be the best fit for vulnerability discovery, as long as the training data is good. In general, the effectiveness of deep neural networks is an open research problem.

  • For any kind of neural network used, it is better to tailor it to the specific kind of vulnerability that is sought, rather than to use a catch-all type of model.

  • SySeVR results are way better than those of current, commercial, well-established static detection tools such as the aforementioned Checkmarx.

SySeVR was able to identify 15 vulnerabilities new to NVD in open source projects like Thunderbird and Seamonkey, all of which were, as they should be, responsibly disclosed. Some of them were listed in CVE. Others were silently patched by their vendors. These are, of course, the most important product of this idea and are summarized in the following table:

New vulnerabilities found by SySeVR.


Thus, the idea of applying deep learning techniques to vulnerability discovery in source code apparently does deliver the promised results. However, as mentioned earlier, these are to be taken with a grain of salt until the results are peer-reviewed and cross-validated by the academic and security communities, or, at least, by us.

References

  1. Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu and Z. Chen (2018). SySeVR: A framework for using deep learning to detect software vulnerabilities. arXiv:1807.06756 [cs.LG]
