Deep Hacking
Deep learning for vulnerability discovery
If we have learned anything so far in our quest to understand how machine learning (ML) can be used to detect vulnerabilities in source code, it’s that what matters most in this process is how the source code is represented before it is fed to the actual ML algorithms. In particular, these representations should include both semantic and syntactic information about the code.
We have also learned that one ML technique seems particularly promising but hardly exploited: deep learning. Methods such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Deep Belief Networks (DBN) have been successful in image and natural language processing, but have never been applied to vulnerability discovery in a systematic fashion.
The aim of the SySeVR project is to apply deep learning techniques to the discovery of software vulnerabilities in source code, considering not only the form (syntax) which might induce a vulnerability, but also the flow of data and control in the program. The authors also try to produce results that are as fine-grained as possible, i.e., to tell us exactly at which line or function the flaw arises. If that’s not enough, they also promise to explain the cause of false positives, if there are any.
When working with images and pattern recognition, objects of interest have a natural representation as vectors, which are suitable for machine learning algorithms. In that case it is easy to propose where an object in the image might be: just take smaller pieces of the image and test their inherent features, such as texture and color, to determine whether or not they are what we want to detect.
In order to translate this idea to code, the authors leverage well-known patterns in previously identified vulnerabilities. These are simply patterns that might trigger dangerous situations, such as the use of malloc and pointers in C, concatenating user input, importing flawed libraries, etc. In short, anything that a regular static analysis tool might look for, though probably with false positives. They call these Syntactic Vulnerability Candidates (SyVC). A SyVC can be a single token (malloc) taken from the program’s Abstract Syntax Tree, a set of tokens (memset(dataBuffer...), or a whole statement which involves one of the dangerous situations mentioned above.
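To make the idea of a syntactic candidate concrete, here is a minimal sketch in Python. It is not the authors' method: they extract candidates from the Abstract Syntax Tree, while this sketch swaps in a plain regular expression over source lines. The watch list RISKY_CALLS, the function find_candidates and the sample C_SNIPPET are all hypothetical names made up for illustration.

```python
import re

# Hypothetical watch list (an assumption, not the paper's rule set).
RISKY_CALLS = {"malloc", "memset", "strcpy", "memcpy", "sprintf"}

# Matches an identifier immediately followed by an opening parenthesis.
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def find_candidates(source):
    """Return (line_number, function_name) pairs for calls on the watch list."""
    candidates = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in CALL_RE.finditer(line):
            name = match.group(1)
            if name in RISKY_CALLS:
                candidates.append((lineno, name))
    return candidates


# Toy C fragment used only to exercise the function.
C_SNIPPET = """\
char *buf = malloc(len);
memset(buf, 0, len);
strcpy(buf, user_input);
printf("%s", buf);
"""

for lineno, name in find_candidates(C_SNIPPET):
    print(f"line {lineno}: candidate token {name}")
```

Like any purely syntactic check, this flags every call on the list whether or not it is actually dangerous, which is exactly why the candidates need semantic context next.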
In order to avoid false positives, the next logical step is to use semantic information about the program, i.e., how data and control flow in it, in order to expand our knowledge about what happens before and after executing a particular line of code. And where does this information lie? As we know by now, it can be found in the Control Flow and Program Dependency graphs. Armed with these two graphs, one can find the whole "influence zone" of a particular token or line with a technique called program slicing. Basically, it means taking all nodes in the semantic graph representations that are reachable from the token of interest, the SyVC. In other words, all lines of code that are executed before or after this particular token, or that would somehow be altered if its value were to change. They call this a "Semantic Vulnerability Candidate" (SeVC). Usually, if the SyVC is a whole function, then the corresponding SeVC will include all functions called by it and all the functions that call it.
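The slicing step can be sketched as plain graph reachability. The following Python example is a simplification under stated assumptions: the dependency graph is a hand-written dictionary (DEPENDS, hypothetical) rather than a real Program Dependency Graph, and the slice is just the union of a forward and a backward breadth-first search from the candidate line.

```python
from collections import deque


def reachable(graph, start):
    """Breadth-first search: every node reachable from start, start included."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge so that a forward search walks the graph backwards."""
    rev = {}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, set()).add(src)
    return rev


def slice_around(graph, candidate):
    """Lines that influence the candidate or are influenced by it."""
    forward = reachable(graph, candidate)
    backward = reachable(reverse(graph), candidate)
    return sorted(forward | backward)


# Toy graph (an assumption): an edge a -> b means line b depends on line a.
DEPENDS = {
    1: {3},   # len = read_len();      feeds the allocation on line 3
    2: {5},   # flag = 0;              unrelated to the buffer
    3: {4},   # buf = malloc(len);     the candidate
    4: {6},   # strcpy(buf, input);    uses buf
    5: set(),
    6: set(),
}

print(slice_around(DEPENDS, 3))
```

For the candidate on line 3 the slice is lines 1, 3, 4 and 6; lines 2 and 5 are left out because no dependency path connects them to the allocation.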
The next problem to be solved is this: having already identified a piece of the program that might contain a vulnerability as a SeVC, how do we encode it as a vector or something that can be understood by machine learning algorithms? The approach chosen by the authors is to first give generic names to all the functions and variables (thus obfuscating the code lightly), then perform a lexical analysis on it (i.e., break it up into symbols), and finally represent the resulting strings as a bag of words, a procedure we have already referred to in past articles. A fixed length must be chosen, and vectors that don’t fit must be padded or truncated, since the chosen neural networks take vectors of a fixed length as input. Here is a depiction of the process for a particular piece of code:
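The renaming, tokenizing and padding steps described above can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the keyword and library-call lists are tiny hypothetical samples, the tokenizer is a single regular expression, and plain integer ids stand in for the real vector encoding.

```python
import re

# Small hypothetical lists; a real tool would use the full C grammar.
C_KEYWORDS = {"char", "int", "void", "if", "else", "for", "while", "return", "sizeof"}
KEEP_NAMES = {"malloc", "memset", "strcpy"}  # library calls are left untouched

IDENT_RE = re.compile(r"[A-Za-z_]\w*")
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(code):
    """Tokenize code and replace user-defined names with VAR1, FUN1, ..."""
    tokens = TOKEN_RE.findall(code)
    names = {}
    out = []
    for i, tok in enumerate(tokens):
        user_defined = (
            IDENT_RE.fullmatch(tok) is not None
            and tok not in C_KEYWORDS
            and tok not in KEEP_NAMES
        )
        if user_defined:
            if tok not in names:
                # A name followed by "(" is treated as a function.
                is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
                prefix = "FUN" if is_call else "VAR"
                count = sum(1 for v in names.values() if v.startswith(prefix)) + 1
                names[tok] = f"{prefix}{count}"
            tok = names[tok]
        out.append(tok)
    return out


def to_fixed_length(tokens, vocab, length, pad_id=0):
    """Map tokens to integer ids, then truncate or pad to exactly `length`."""
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens]
    ids = ids[:length]
    return ids + [pad_id] * (length - len(ids))


SAMPLE = "char *buf = malloc(n); copy_data(buf, src);"
sample_tokens = normalize(SAMPLE)
print(sample_tokens)
print(to_fixed_length(sample_tokens, {}, 20))
```

Renaming makes two fragments that differ only in identifier names produce the same token sequence, and the padding id 0 is reserved so that it never collides with a real token.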
All that remains is to train and test the neural networks. One of the goals of SySeVR was to be able to work with different types of networks. Six (!) different types of networks were implemented in Python with the Theano library: CNN, DBN, and four types of RNN: (Bidirectional) Long Short-Term Memory ((B)LSTM) and (Bidirectional) Gated Recurrent Unit ((B)GRU). The authors validated their results against a vulnerability dataset combining NVD and SARD, in which each sample is labeled as vulnerable or not and, ideally, comes with the corresponding diff and the vulnerability type.
But which syntactic patterns should we look for? Which ones will be the syntactic vulnerability candidates? For this, they used standard static detection tools such as Checkmarx, Flawfinder and RATS. From these results, they decided to focus on four main vulnerability types, out of the 126 different kinds of vulnerabilities contained in the dataset:
Insecure API usage, e.g., malloc without free.
Improper arithmetic expressions.
For this particular "experiment", the graph code representations were obtained with the tool Joern by Yamaguchi et al., a sister project of sorts to Chucky. The SeVC-to-vector encoding was performed with word2vec.
The results of the experiment can be summarized as follows:
BGRU networks appear to be the best fit for vulnerability discovery, as long as the training data is good. In general, the effectiveness of deep neural networks is an open research problem.
For any kind of neural network used, it is better to tailor it to the specific kind of vulnerability that is sought, rather than try to use a catch-all type of model.
SySeVR results are far better than those of current, commercial, well-established static detection tools such as the aforementioned Checkmarx.
SySeVR was able to identify 15 vulnerabilities new to NVD in open source projects like Thunderbird and Seamonkey, all of which were, as they should be, responsibly disclosed. Some of them got listed in CVE. Others were silently patched by their vendors. These are, of course, the most important product of this idea and are summarized in the following table:
Thus, the idea of applying deep learning techniques to vulnerability discovery in source code apparently does deliver the promised results. However, as mentioned earlier, these are to be taken with a grain of salt, until the results are peer-reviewed and cross-validated by the academic and security communities, or, at least, by.
Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu and Z. Chen (2018). SySeVR: A framework for using deep learning to detect software vulnerabilities. arXiv:1807.06756 [cs.LG]