Deep Hacking
Deep learning for vulnerability discovery
If we have learned anything so far in our quest to understand how
machine learning (ML) can be used to detect vulnerabilities in source
code, it’s that what matters most in this process is the representation
of source code that is later fed to the actual algorithms, and in
particular that this representation should include both semantic and
syntactic information about the code. We have also learned that one ML
technique seems particularly promising but hardly exploited, namely
deep learning. Methods such as Recurrent Neural Networks (RNN),
Convolutional Neural Networks (CNN), and Deep Belief Networks (DBN)
have been successful in image and natural language processing, but have
never been applied to vulnerability discovery in a systematic fashion.
The aim of the SySeVR project is to apply deep learning techniques to
the discovery of software vulnerabilities in source code, considering
not only the form (syntax) which might induce a vulnerability, but also
the flow of data and control in the program. The authors also try to
produce results that are as fine-grained as possible, i.e., to tell us
exactly at which line or function the flaw arises. If that’s not enough,
they also promise to explain the cause of false positives, if there are
any.
When working with images and pattern recognition, objects of interest have a natural representation as vectors, which are suitable for machine learning algorithms. In that case it is easy to propose where an object in the image might be: just take smaller pieces of the image and test their inherent features, such as texture and color, to determine whether or not they are what we want to detect.
In order to translate this idea to code, the authors leverage
well-known patterns in previously identified vulnerabilities. These are
simply patterns that might trigger dangerous situations, such as the use
of malloc and pointers in C, concatenating user input, flawed libraries,
etc.: anything that a regular static analysis tool might look for, but
probably with false positives. They call these Syntactic Vulnerability
Candidates (SyVC). A SyVC can be a single token (e.g., malloc) taken
from the program’s Abstract Syntax Tree, a set of tokens (e.g.,
memset(dataBuffer…)), or a whole statement which involves one of the
dangerous situations mentioned above.
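To make the idea of a syntactic candidate concrete, here is a minimal
sketch in Python. It is not the authors' extraction method: it scans
raw lines with a regular expression instead of walking an Abstract
Syntax Tree, and the watch list RISKY_CALLS, the function find_syvc and
the C snippet are all hypothetical names chosen for illustration.

```python
import re

# Hypothetical watch list; the rule set actually used by the authors
# is not reproduced here.
RISKY_CALLS = {"malloc", "memset", "strcpy", "memcpy"}

# An identifier immediately followed by an opening parenthesis.
IDENTIFIER_BEFORE_PAREN = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def find_syvc(source):
    """Return (line_number, token, statement) for each risky call."""
    candidates = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in IDENTIFIER_BEFORE_PAREN.finditer(line):
            if match.group(1) in RISKY_CALLS:
                candidates.append((number, match.group(1), line.strip()))
    return candidates


# Illustrative C fragment, not taken from the paper's dataset.
C_SNIPPET = """\
char *dataBuffer = (char *)malloc(100);
memset(dataBuffer, 'A', 99);
printf("%s", dataBuffer);
"""

for candidate in find_syvc(C_SNIPPET):
    print(candidate)
```

Every hit is only a candidate: as with any purely syntactic match, most
of them will turn out to be harmless, which is why the semantic step
that follows is needed.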
Figure 1. Comparing image vs. code recognition
In order to avoid false positives, the next logical step is to use
semantic information about the program, i.e., how data and control flow
through it, in order to expand our knowledge about what happens before
and after executing a particular line of code. And where does this
information lie? As we know by now, this can be found in the
Control Flow and
Armed with these two graphs, one can find the whole "influence zone" of
a particular token or line with a technique they call program slicing.
Basically, it means taking all nodes in the semantic graph
representations that are reachable from the token of interest, the
SyVC. In other words, all lines of code that are executed before and
after this particular token, or that are somehow altered if its value
were to change. They call this a "Semantic Vulnerability Candidate"
(SeVC). Usually, if the SyVC is a whole function, then the corresponding
SeVC will include all functions called by it and all the functions that
call it.
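The slicing step can be sketched as plain graph reachability. The
following Python example is only an illustration under simplifying
assumptions: the dependency graph is a hand-written dictionary rather
than the output of a real analysis tool, and the names DEPENDENCIES,
reachable, reverse and slice_around are hypothetical.

```python
from collections import deque


def reachable(graph, start):
    """Breadth-first search: every node reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def reverse(graph):
    """Flip every edge so that we can walk dependencies backwards."""
    flipped = {}
    for source, targets in graph.items():
        for target in targets:
            flipped.setdefault(target, []).append(source)
    return flipped


def slice_around(dependencies, syvc_line):
    """Union of the forward and backward slices of one candidate line."""
    forward = reachable(dependencies, syvc_line)
    backward = reachable(reverse(dependencies), syvc_line)
    return sorted(forward | backward)


# Hypothetical edges: line -> lines that depend on it.
DEPENDENCIES = {
    1: [3],     # size = read_input()
    2: [6],     # log = open_log()      (unrelated to the buffer)
    3: [4, 5],  # buf = malloc(size)    <- the candidate
    4: [5],     # memset(buf, 0, size)
    5: [],      # use(buf)
    6: [],      # close(log)
}

print(slice_around(DEPENDENCIES, 3))  # [1, 3, 4, 5]
```

Lines 2 and 6 are left out of the slice because nothing connects them
to the candidate, which is exactly the kind of noise the semantic step
is meant to discard.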
The next problem to be solved is: having already identified a piece of
the program that might contain a vulnerability as a SeVC, how do we
encode that as a vector or something that can be understood by machine
learning algorithms? The approach chosen by the authors is to first give
generic names to all the functions and variables (thus sort of
obfuscating it lightly), then perform a lexical analysis on it (i.e.,
break it up into symbols), and finally represent those strings as a
bag of words, a procedure we have already referred to in past articles.
A fixed length must be chosen, and vectors that don’t fit must be padded
or truncated, since the chosen neural networks take vectors of a fixed
length as input. Here is a depiction of the process for a particular
piece of code:
Figure 2. Illustration of the process
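The three steps just described (renaming, lexical analysis, and fitting
to a fixed length) can be sketched in Python as follows. This is an
assumption-laden illustration, not the authors' encoder: the keyword and
library lists are tiny and hypothetical, the tokenizer ignores string
and character literals, and mapping each token to an integer index
merely stands in for whatever word-level encoding is actually used.

```python
import re

# Hypothetical, deliberately tiny lists of names that must not be renamed.
C_KEYWORDS = {"char", "int", "if", "else", "return", "sizeof", "void"}
KNOWN_CALLS = {"malloc", "memset", "free", "strcpy"}

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize_and_tokenize(code):
    """Split code into tokens, giving user-defined names generic ones."""
    tokens = TOKEN.findall(code)
    renamed = {}
    result = []
    for index, token in enumerate(tokens):
        user_defined = (
            IDENTIFIER.fullmatch(token) is not None
            and token not in C_KEYWORDS
            and token not in KNOWN_CALLS
        )
        if user_defined:
            if token not in renamed:
                is_call = index + 1 < len(tokens) and tokens[index + 1] == "("
                prefix = "FUN" if is_call else "VAR"
                count = sum(v.startswith(prefix) for v in renamed.values())
                renamed[token] = f"{prefix}{count + 1}"
            token = renamed[token]
        result.append(token)
    return result


def to_fixed_length(tokens, vocabulary, length, pad_id=0):
    """Map tokens to integer ids, then pad or truncate to `length`."""
    ids = [vocabulary.setdefault(t, len(vocabulary) + 1) for t in tokens]
    return (ids + [pad_id] * length)[:length]


SNIPPET = "char *dataBuffer = malloc(size);\nmemset(dataBuffer, 0, size);"
tokens = normalize_and_tokenize(SNIPPET)
vector = to_fixed_length(tokens, {}, length=20)
print(" ".join(tokens))
print(vector)
```

The id 0 is reserved for padding, so a network can tell real tokens
from filler. Truncation simply drops the tail here; which end to cut is
a design choice that this sketch does not try to settle.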
All that remains is to train and test the neural networks. One of the
goals of SySeVR was to be able to work with different types of
networks. Six (!) different types of networks were implemented in
Python with the Theano library: CNN, DBN, and four types of RNN, namely
(Bidirectional) Long Short-Term Memory ((B)LSTM) and (Bidirectional)
Gated Recurrent Units ((B)GRU).
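For readers unfamiliar with the bidirectional GRU, here is a minimal
forward pass written in plain Python. It is a sketch of the general
technique only: it does not use Theano, the weights are random and
untrained, there is no output layer or training loop, and all names
(init_params, gru_step, run_gru, bgru_encode) are hypothetical rather
than taken from the authors' code.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Compute W.x + U.h + b for small list-based matrices."""
    return [
        sum(w * xi for w, xi in zip(W[row], x))
        + sum(u * hi for u, hi in zip(U[row], h))
        + b[row]
        for row in range(len(b))
    ]


def init_params(input_size, hidden_size, rng):
    """Random (untrained) weights for the three GRU gates."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        gate: (matrix(hidden_size, input_size),
               matrix(hidden_size, hidden_size),
               [0.0] * hidden_size)
        for gate in ("update", "reset", "candidate")
    }


def gru_step(params, x, h):
    """One GRU time step: returns the new hidden state."""
    Wz, Uz, bz = params["update"]
    Wr, Ur, br = params["reset"]
    Wc, Uc, bc = params["candidate"]
    z = [sigmoid(v) for v in affine(Wz, x, Uz, h, bz)]
    r = [sigmoid(v) for v in affine(Wr, x, Ur, h, br)]
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    c = [math.tanh(v) for v in affine(Wc, x, Uc, gated_h, bc)]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]


def run_gru(params, sequence, hidden_size):
    h = [0.0] * hidden_size
    for x in sequence:
        h = gru_step(params, x, h)
    return h


def bgru_encode(forward_params, backward_params, sequence, hidden_size):
    """Read the sequence in both directions and concatenate the states."""
    forward_state = run_gru(forward_params, sequence, hidden_size)
    backward_state = run_gru(backward_params, list(reversed(sequence)), hidden_size)
    return forward_state + backward_state


rng = random.Random(0)
INPUT_SIZE, HIDDEN_SIZE = 2, 3
forward_params = init_params(INPUT_SIZE, HIDDEN_SIZE, rng)
backward_params = init_params(INPUT_SIZE, HIDDEN_SIZE, rng)
sequence = [[0.1, 0.2], [0.0, 1.0], [0.5, -0.3]]  # three toy token vectors
encoding = bgru_encode(forward_params, backward_params, sequence, HIDDEN_SIZE)
print(len(encoding))  # 6
```

The point of the bidirectional variant is visible in bgru_encode: the
final representation depends on what comes both before and after each
token, which matches the intuition that a vulnerable statement is
affected by earlier and later lines alike.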
They validated their results against a vulnerability dataset combining
samples from NVD and SARD, labeled either as vulnerable or not, ideally
some with the corresponding diff and the vulnerability type.
But which syntactic patterns to look for? Who will be the syntactic
vulnerability candidates? For this, they used standard static detection
tools such as
From these results, they decided to focus on four main vulnerability
types, out of the 126 different kinds of vulnerabilities contained in
Improper arithmetic expressions.
For this particular "experiment", the graph code representations were
obtained with the tool Joern by Yamaguchi et
al., a sister project of
SeVC to vector encoding was performed with
The results of the experiment can be summarized as follows:
BGRU networks appear to be the best fit for vulnerability discovery, as long as the training data is good. In general, the effectiveness of deep neural networks is an open research problem.
For any kind of neural network used, it is better to tailor them to the specific kind of vulnerability that is sought, rather than try to use a catch-all type of model.
SySeVR results are far better than those of current, commercial, well-established static detection tools such as those mentioned above.
SySeVR was able to identify 15 vulnerabilities new to NVD in open
source projects like Seamonkey, all of which were, as they should be,
responsibly disclosed. Some of them got listed in CVE. Others were
silently patched by their manufacturers. These are, of course, the most
important product of this idea and are summarized in the following
table:
Figure 3. New vulnerabilities found by SySeVR
Thus, the idea of applying deep learning techniques to vulnerability discovery in source code apparently does deliver the promised results. However, as mentioned earlier, these are to be taken with a grain of salt until the results are peer-reviewed and cross-validated by the academic and security communities, or, at least, by us.
- Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu and Z. Chen (2018).
SySeVR: A framework for using deep learning to detect software
vulnerabilities.