The anomaly serial killer dollHunting missing checks with anomaly detection
In our previous article we focused on taint-style vulnerabilites, i.e., those that are essentially due to the lack of input sanitization which allows tainted, user-controlled data to reach sensitive functions. Some of these arise due to missing checks in code, such as:
failure to check authentication, authorization or other conditions related to the security logic of the application,
insecure API usage, like not checking the buffer size, leaving files open, not freeing memory after allocating, etc.
In production-grade systems, however, these checks are routinely performed since experienced developers are used to them. Still, they happen, as was the case with a Java 7 runtime vulnerability (CVE-2013-0422) which allowed to upload arbitrary classes, leading to a massive distribution of malware. But when they do, they are certainly anomalies. Thus searching for this kind of situation (in mature software systems) is a problem for which anomaly detection machine learning algorithms lend themselves well.
Given a function, Chucky finds all the sources and sinks referenced by it. If you’ve been following our latest articles, it should come as no surprise that the first step in the procedure is to perform a robust parsing.
Notice the use of the word robust above: as opposed to the strict parsing done by linters and compilers, which fails at the first mistake, this type of parsing does requires neither a proper build environment nor even complete or correct code.
Next, for each source or sink detected in the previous steps, the following steps are performed:
Neighborhood discovery: identify the element’s context.
Lightweight tainting: analyze the flow of data in the program.
Embed the functions into a vector space. Needed to apply any machine learning algorithm.
Perform anomaly detection via a simple score when comparing to the k nearest neighbors.
In order to determine functions similar to the current one, the bag of words approach from the field of natural language processing is used. A coarse way to represent a document is by just counting word frequencies, and forgetting about all structure, syntax, etc. In this fashion, this and other articles in this blog would have high frequencies for words like "security" and "vulnerability", but low ones for "coffee" and "recipe". This is a (perhaps too) simple, yet effective, way to distinguish security blogs from coffee-lovers blogs, but maybe it wouldn’t be able to distinguish sites which sell coffee from those who grow it.
In order to avoid a situation where very common words with little significance like "a", "in", etc, rank too high, the inverse document frequency is used. Multiplying the relative term frequency by the inverse document frequency gives a lower ranking to words like the above and higher ranks to the rare ones, so as to strike a balance.
The result of applying this to every function in the code to analyze is that every function is represented as a vector. Thus distances between them can be computed, and its k nearest neighbors chosen as that function’s neighborhood. Intuitively, this should couple every function with the k most similar functions in terms of the API symbols they use. Thus functions that use a lot of memory allocations would be put together, a method that has many return points would be paired with functions that also have many return points, etc.
Other than knowing which functions are similar to the one under escrutiny, we also need to know which checks or conditionals are really security-relevant and which ones are just noise. This, however, is not so easy since for that we need to take into account the flow of data between variables, which sound more like a job for symbolic execution and taint analysis, which is in fact what Chucky uses.
This graph is then traversed both in the source-to-sink and sink-to-source direction to determine tainted identifiers, highlighted in gray above.
Next we need to represent the function as a vector. Again, they use a bag-of-words-like approach, where instead of the term frequency or inverse document frequency, a simple indicator function is used: 1 if a term is contained and 0 if not. Furthermore, the "words" are taken from the abstract syntax tree and later normalized since we don’t care about particular operations (<, > are all treated as comparisons), particular numbers used or particular identifier names. Thus from the snippet above we would obtain a vector representation such as this one:
Now we can extract a model of "normality" for every source or sink of interest in every function. This is defined to be the center of mass (essentially an average) of all the vector representations taken over the set of neighbors of the function that contains the source or sink of interest. This is still a vector, and each of its components tell us the fraction of neighbors that contain a particular check.
Subtracting the vector representation of a function from this model of normality also gives us some insight. Negative entries in this difference vector correspond to expressions checked in this function but not so much in its neighbors, while positive entries show the opposite: checks that are commonly performed by syntactically similar functions but are missing in this particular function, i.e. the missing checks that we sought from the beggining. An anomaly score can be given to each function as the maximum element in the vector, given that functions which are mostly similar to their neighbors except for one missing check are more likely to contain security vulnerabilities than functions which are just altogether different from their neighbors.
Chucky was originally tested on five popular open source projects, versions of which contained published missing-check CVEs. Also the code was modified to introduce artificial vulnerabilites. This of course depends on the number of neighbors that should be taken into account. The results of this experiment to determine its capability to found known vulnerabilities are summarized below:
Notice that with 25 neighbors the results are perfect, i.e. no false positives and complete recall. For other reasonable values of k the results are also good.
But can it help in finding previously unknown vulnerabilites, as was set in their objectives? Indeed, Chucky was able to find 12 zero-day vulnerabilites in LibTIFF and Pidgin. One of them is particularly easy to exploit. In the function msn_message_parse_payload, Chucky reports a failure to check the value passed to strchr. An attacker can thus trigger this vulnerability by sending to the victim the message Content-Type\n\n, thus making one of the sensitive sinks on the victim side null, leading to an application crash.
Chucky has shown us how relatively simple, but "intelectually transparent" machine learning models can tell us a lot about where to look for vulnerabilities. However, let us remember that this is approach is not free of false positives and also that it doesn’t pretend to replace human auditing, only to accelerate it by aiding us in not having to review code that is most likely safe.
F. Yamaguchi, C. Wressneger, H. Gascon, K. Rieck (2013). Chucky: exposing missing checks in source code for vulnerability discovery. 20th ACM conference on computer and communications security.
with an itch for CS