If we have learned anything so far in our quest to understand how machine learning (ML) can be used to detect vulnerabilities in source code, it's that what matters most in this process are the different representations of source code that are later fed to the actual ML algorithms. In particular, these representations should include both semantic and syntactic information about the code. Also, one ML technique seems particularly promising, yet hardly exploited: deep learning.
Methods such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Deep Belief Networks (DBN) have been successful in image and natural language processing, but never applied to vulnerability discovery in a systematic fashion.
The aim of the SySeVR project is to apply deep learning techniques to the discovery of software vulnerabilities in source code, considering not only the form (syntax) that might induce a vulnerability, but also the flow of data and control in the program. The authors also tried to produce results with the finest possible granularity, i.e., to tell us exactly at which line or function the flaw arises. If that's not enough, they also promise to explain the cause of false positives, if there are any.
When working with images and pattern recognition, objects of interest have a natural representation as vectors, which are suitable for machine learning algorithms. In that case, it is easy to propose where an object in the image might be: just take smaller pieces of the image and test their inherent features, such as texture and color, to determine whether or not they are what we want to detect.
In order to translate this idea to code, the authors leverage well-known patterns in previously identified vulnerabilities: simply patterns that might trigger dangerous situations, such as the use of malloc and pointers in C, concatenating user input, importing flawed libraries, etc. Anything that a regular static analysis tool might look for, but probably with false positives. They call these Syntactic Vulnerability Candidates (SyVC). A SyVC can be a single token (malloc) taken from the program's Abstract Syntax Tree, a set of tokens (memset(dataBuffer…), or a whole statement that involves one of the mentioned dangerous situations.
Comparing image vs. code recognition.
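To give a rough idea of what spotting SyVCs could look like, below is a minimal sketch in Python (our own toy illustration, not the paper's implementation) that scans C source lines for a few of these dangerous patterns:

```python
import re

# Toy patterns for syntactically dangerous constructs. These regexes
# are our own illustration, not the rules used by SySeVR.
DANGEROUS_PATTERNS = {
    "insecure API": re.compile(r"\b(malloc|strcpy|memset|gets)\s*\("),
    "array usage": re.compile(r"\w+\s*\[[^\]]*\]"),
    "pointer usage": re.compile(r"\*\s*\w+|\w+\s*->"),
}

def find_syvcs(c_source):
    """Yield (line number, category, line) for each suspicious line."""
    for lineno, line in enumerate(c_source.splitlines(), start=1):
        for category, pattern in DANGEROUS_PATTERNS.items():
            if pattern.search(line):
                yield lineno, category, line.strip()

code = '''
char *buf = malloc(64);
strcpy(buf, user_input);
'''
for lineno, category, line in find_syvcs(code):
    print(f"line {lineno}: {category}: {line}")
```

A real tool would work on the Abstract Syntax Tree rather than raw text, of course, which is precisely why candidates obtained this way come riddled with false positives.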
In order to avoid false positives, the next logical step is to use semantic information about the program, i.e., how data and control flow in it, so as to expand our knowledge about what happens before and after executing a particular line of code. And where does this information lie? As we know by now, it can be found in the Control Flow and Program Dependency graphs.
Armed with these two graphs, one can find the whole "influence zone" of a particular token or line with a technique called program slicing. Basically, it means taking all nodes in the semantic graph representations that are reachable from the token of interest, the SyVC. In other words, all lines of code that are executed before and after this particular token, or that are somehow altered if its value were to change. They call this a "Semantic Vulnerability Candidate" (SeVC). Usually, if the SyVC is a whole function, then the corresponding SeVC will include all functions called by it and all the functions that call it.
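To make the idea concrete, slicing boils down to reachability over the dependency graph. Here is a minimal sketch using the networkx library, with a made-up toy graph (node names and edges are our own illustration):

```python
import networkx as nx

# Toy program dependency graph: nodes are line numbers, and an edge
# u -> v means line v depends on (is influenced by) line u.
pdg = nx.DiGraph()
pdg.add_edges_from([(1, 2), (2, 4), (3, 4), (4, 5), (5, 6)])

def slice_around(graph, node):
    """Backward + forward slice: every line that influences the node
    of interest plus every line it influences."""
    backward = nx.ancestors(graph, node)
    forward = nx.descendants(graph, node)
    return sorted(backward | {node} | forward)

# The SeVC around a SyVC found at line 4 covers its whole influence zone
print(slice_around(pdg, 4))  # [1, 2, 3, 4, 5, 6]
```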
The next problem to be solved is: having already identified a piece of the program that might contain a vulnerability as a SeVC, how do we encode it as a vector, or something that can be understood by machine learning algorithms? The approach chosen by the authors is to first give generic names to all the functions and variables (thus sort of obfuscating the code lightly), then perform a lexical analysis on it (i.e., break it up into symbols), and finally represent those strings as a bag of words, a procedure we have already referred to in past articles. A fixed length must be chosen, and vectors that don't fit must be padded or truncated, since the chosen neural networks take fixed-length vectors as input. Here is a depiction of the process for a particular piece of code:
Illustration of the process.
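In code, the symbolize-and-pad step might look roughly like this (a simplified sketch of our own; the actual pipeline maps symbols to learned embeddings rather than toy integer IDs):

```python
import re

# Tokens we keep verbatim; everything else gets a generic name.
# This list is illustrative, not the paper's actual vocabulary.
KEEP = {"int", "char", "if", "return", "malloc", "free", "memset"}

def symbolize(sevc_lines):
    """Rename user identifiers to generic VAR_n names and split each
    line into lexical tokens (a toy version of the preprocessing)."""
    names, tokens = {}, []
    for line in sevc_lines:
        for tok in re.findall(r"\w+|\S", line):
            if tok.isidentifier() and tok not in KEEP:
                tok = names.setdefault(tok, f"VAR_{len(names)}")
            tokens.append(tok)
    return tokens

def pad_or_truncate(vec, length, pad=0):
    """Force a vector to the fixed length the networks require."""
    return (vec + [pad] * length)[:length]

toks = symbolize(["int n = get_size();", "buf = malloc(n);"])
ids = [hash(t) % 1000 for t in toks]  # stand-in for a real encoding
print(toks)
print(pad_or_truncate(ids, 16))
```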
All that remains is to train and test the neural networks. One of the goals of SySeVR was to be able to work with different types of networks. Six (!) different types of networks were implemented in Python with the Theano library: CNN, DBN, and four types of RNN: (Bidirectional) Long Short-Term Memory ((B)LSTM) and (Bidirectional) Gated Recurrent Unit ((B)GRU).
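For a flavor of what the best performer looks like, here is a minimal BGRU classifier sketched with Keras (the authors implemented theirs in Theano directly; all sizes below are arbitrary placeholders, not SySeVR's hyperparameters):

```python
from tensorflow.keras import layers, models

# A tiny bidirectional-GRU classifier over fixed-length token vectors.
# Vocabulary size, sequence length and layer widths are arbitrary
# choices for illustration, not the values used by SySeVR.
model = models.Sequential([
    layers.Input(shape=(500,)),             # one fixed-length SeVC vector
    layers.Embedding(input_dim=1000, output_dim=30),
    layers.Bidirectional(layers.GRU(64)),   # reads the sequence both ways
    layers.Dense(1, activation="sigmoid"),  # vulnerable or not
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```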
They validated their results against a vulnerability dataset combining NVD and SARD, with samples labeled either as vulnerable or not, ideally some with the corresponding diff and the vulnerability type.
But which syntactic patterns should one look for? Which will be the syntactic vulnerability candidates? For this, they used standard static detection tools such as Checkmarx, Flawfinder, and RATS.
From these results, they decided to focus on four main vulnerability types, out of the 126 different kinds of vulnerabilities contained in the dataset:

- Insecure API usage, e.g., malloc without free (see the toy check after this list).
- Array usage.
- Pointer usage.
- Improper arithmetic expressions.
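As a concrete (and deliberately naive) illustration of the first category, a toy check for malloc without a matching free could look like this; the regexes are our own and ignore casts, aliasing, and control flow:

```python
import re

def mallocs_without_free(c_source):
    """Report variables assigned via malloc() that never appear in a
    free() call. Purely illustrative: real tools track data flow."""
    allocated = set(re.findall(r"(\w+)\s*=\s*malloc\s*\(", c_source))
    freed = set(re.findall(r"free\s*\(\s*(\w+)\s*\)", c_source))
    return allocated - freed

code = """
char *a = malloc(16);
char *b = malloc(32);
free(a);
"""
print(mallocs_without_free(code))  # {'b'}
```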
For this particular "experiment," the graph code representations were obtained with the tool Joern by Yamaguchi et al., a sister project of Chucky of sorts. The SeVC-to-vector encoding was performed with word2vec.
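For reference, training such an encoding with the gensim implementation of word2vec could look like this (our own sketch; the corpus and parameters are illustrative):

```python
from gensim.models import Word2Vec

# Each "sentence" is the token sequence of one symbolized SeVC.
corpus = [
    ["VAR_0", "=", "malloc", "(", "VAR_1", ")", ";"],
    ["memset", "(", "VAR_0", ",", "0", ",", "VAR_1", ")", ";"],
]

# vector_size is called `size` in gensim versions before 4.0.
model = Word2Vec(corpus, vector_size=30, window=5, min_count=1)
print(model.wv["malloc"][:5])  # first components of the learned vector
```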
The results of the experiment can be summarized as follows:

- BGRU networks appear to be the best fit for vulnerability discovery, as long as the training data is good. In general, the effectiveness of deep neural networks is an open research problem.
- For any kind of neural network used, it is better to tailor it to the specific kind of vulnerability that is sought, rather than to use a catch-all type of model.
- SySeVR's results are way better than those of current, commercial, well-established static detection tools such as the aforementioned Checkmarx.
SySeVR was able to identify 15 vulnerabilities new to the NVD in open source projects like Thunderbird and SeaMonkey, all of which were, as it should be, responsibly disclosed. Some of them got listed in the CVE. Others were silently patched by their vendors. These are, of course, the most important product of this idea and are summarized in the following table:
New vulnerabilities found by SySeVR.
Thus, the idea of applying deep learning techniques to vulnerability discovery in source code apparently does deliver the promised results. However, as mentioned earlier, these are to be taken with a grain of salt until the results are peer-reviewed and cross-validated by the academic and security communities, or, at least, by us.
References

- Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, and Z. Chen (2018). SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities. arXiv:1807.06756 [cs.LG]