As we saw in our previous review article, probably the most interesting and successful approach to automated vulnerability detection is the pattern-based approach. Since we expect to extract meaningful patterns from the code, we also need a "comprehensive and feature-rich representation"[1] of it. Other authors have tried to treat code as if it were merely a collection of words or documents, with no regard for syntax or inherent order, but with poor results.
Combining standard code representations
The basic idea is to parse the source code, represent it as a graph which extends the standard abstract syntax tree (AST) with information about data and control flow, and store it in a database which can be queried by the analyst.
Overview of the process.
Besides the AST, another couple of useful code representations are:
- the control flow graph (CFG), which represents the order in which statements are executed depending on the conditions, and
- the program dependence graph (PDG), which tells us how variables are modified by statements.
These are better understood by example. Consider the following C snippet:
void foo() {
  int x = source();
  if( x < MAX ) {
    int y = 2*x;
    sink(y);
  }
}
In this context, source is meant to be the place in the code where the variables might be tainted by the attacker, and sink is meant to be the function which might be vulnerable if the input from the source is not sanitized. The AST, CFG, and PDG for such a snippet would look like this:
Comparison of graph representations of code.
None of them alone can tell us the whole story about what is happening in that snippet, which is, of course, necessary to identify vulnerabilities, since these are generally very context-dependent.
Yamaguchi and his team put forward the idea of combining these three graphs into a single multigraph (basically a graph that allows multiple edges for a pair of nodes) which they call the code property graph:
A code property graph.
Essentially, it’s the AST with additional edges "borrowed" from the PDG and CFG, represented with different colors and labels. The authors provide detailed explanations of how these graphs are mathematically defined, operationally built from source code, and later traversed.
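To make this more tangible, here is a minimal sketch of such a multigraph for the foo() snippet above, using networkx as a stand-in for the actual graph database; the node names and the exact edge sets are only illustrative, not the output of the authors' tooling:

import networkx as nx

# Toy code property graph for foo(): one multigraph holding AST, CFG,
# and PDG edges side by side, distinguished by a "label" attribute.
cpg = nx.MultiDiGraph()
for node in ["func foo", "decl x = source()", "if (x < MAX)",
             "decl y = 2*x", "call sink(y)"]:
    cpg.add_node(node)

# AST edges: syntactic nesting of statements.
cpg.add_edge("func foo", "decl x = source()", label="AST")
cpg.add_edge("func foo", "if (x < MAX)", label="AST")
cpg.add_edge("if (x < MAX)", "decl y = 2*x", label="AST")
cpg.add_edge("if (x < MAX)", "call sink(y)", label="AST")

# CFG edges: order in which statements execute.
cpg.add_edge("decl x = source()", "if (x < MAX)", label="CFG")
cpg.add_edge("if (x < MAX)", "decl y = 2*x", label="CFG")
cpg.add_edge("decl y = 2*x", "call sink(y)", label="CFG")

# PDG edges: which definitions reach which uses.
cpg.add_edge("decl x = source()", "decl y = 2*x", label="PDG")  # x flows into y
cpg.add_edge("decl y = 2*x", "call sink(y)", label="PDG")       # y reaches the sink

# The same pair of nodes can be connected by several labeled edges,
# which is exactly what the multigraph structure allows.
for u, v, attrs in cpg.edges(data=True):
    print(f"{u} --{attrs['label']}--> {v}")

Being able to follow syntax, execution order, and data flow over the same set of nodes is what later lets us ask questions like "does attacker-controlled data reach this sink?" in a single query.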
Here, however, we will continue using only the intuition of how this works, looking at the discovery of a buffer overflow vulnerability in the Apple implementation of SSH which exposed many iOS applications. This is the culprit code:
if (channelp) {
  uint32_t namelen = _libssh2_ntohu32(data + 9 + sizeof("exit-signal"));
  channelp->exit_signal = LIBSSH2_ALLOC(session, namelen + 1);
  [...]
  memcpy(channelp->exit_signal, data + 13 + sizeof("exit_signal"), namelen);
  channelp->exit_signal[namelen] = '\0';
}
Memory is allocated in line 3 using the LIBSSH2_ALLOC function, with a size that depends on the variable namelen. The problem is that this variable is controlled by the user, so if it is made large enough, we have a buffer overflow which might give the attacker control of the program execution. The vulnerability was originally discovered using a regular expression which looks for a pattern where the use of functions containing ALLOC is combined with arithmetic operations:
ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;
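As a quick sketch of how that pattern behaves, here it is applied with Python's re module to two lines of the culprit code; only the allocation is flagged:

import re

# The regular expression quoted above: an *ALLOC-style call whose
# argument list contains arithmetic (+, - or *).
pattern = re.compile(r'ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;')

lines = [
    'channelp->exit_signal = LIBSSH2_ALLOC(session, namelen + 1);',
    'memcpy(channelp->exit_signal, data + 13 + sizeof("exit_signal"), namelen);',
]

for line in lines:
    verdict = "MATCH" if pattern.search(line) else "no match"
    print(f"{verdict}: {line}")
# Only the LIBSSH2_ALLOC line is reported; the memcpy that actually
# overflows the buffer is not part of the match.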
However, as we know, regular expressions fail to match the nested structure of code, miss the whole context behind a potentially interesting situation, and are prone to false positives. Instead, a traversal over the code property graph can be defined which selects third arguments to memcpy that have been tainted by first arguments to get_user and have not been sanitized.
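The authors express such traversals as queries over the graph database; the following is only a rough Python sketch of that logic on a toy data-flow graph, where the node names (get_user(len), memcpy(...)#arg3) and the sanitized edge attribute are invented for illustration:

import networkx as nx

# Toy slice of a code property graph: only PDG (data-flow) edges,
# annotated with whether a sanitization check dominates the use.
cpg = nx.MultiDiGraph()
cpg.add_edge("get_user(len)", "len", label="PDG", sanitized=False)
cpg.add_edge("len", "memcpy(dst, src, len)#arg3", label="PDG", sanitized=False)

def tainted_unsanitized_memcpy_args(graph):
    """Yield third arguments of memcpy reachable from a get_user argument
    through data-flow edges that never pass a sanitization check."""
    sources = [n for n in graph if str(n).startswith("get_user")]
    sinks = [n for n in graph if "memcpy" in str(n) and str(n).endswith("#arg3")]
    # Keep only unsanitized data-flow edges for the reachability question.
    flow = nx.DiGraph([(u, v) for u, v, attrs in graph.edges(data=True)
                       if attrs.get("label") == "PDG"
                       and not attrs.get("sanitized", False)])
    for sink in sinks:
        if any(src in flow and sink in flow and nx.has_path(flow, src, sink)
               for src in sources):
            yield sink

print(list(tainted_unsanitized_memcpy_args(cpg)))
# ['memcpy(dst, src, len)#arg3']

If the length were compared against a bound before the copy, the corresponding edge would be marked as sanitized, the path would be broken, and the call would no longer be reported.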
By manually crafting such graph traversals, the authors were able to identify 88 vulnerabilities in the Linux kernel. Out of those, these 18 were previously unknown:
Zero-day vulnerabilities found using CPG.
That’s 15 CVEs right there, obtained with a graph representation of code and four measly traversals.
Automating traversals
However good the results, they wouldn’t be pertinent for our research purposes if they couldn’t be automated using machine learning (ML). Since we already know how to traverse the CPG looking for potentially dangerous patterns, now we need a way to automatically infer what those patterns are.
The procedure proposed by Yamaguchi et al. is not universal, though: it does not apply to all kinds of vulnerabilities, but only to those of the taint style, like Heartbleed. Furthermore, it is not entirely automated, i.e., don’t expect to give it your code and obtain every bug. Rather, you have to feed it a sink of interest, such as memcpy, whose unsanitized use caused Heartbleed.
Having selected a sink, we find all of its invocations, slice the part of the program that involves them, and represent it as a graph. This graph is then divided into the individual calls. This information is then encoded as features, i.e., mapped to a vector space, which is a precondition for applying any ML algorithm. Then, these invocations are clustered to determine sets of invocations with similar argument initializers. However, this is not enough, as some of those arguments could be sanitized. Thus, the second-to-last step is to create a sanitization overlay in order to avoid false positives. Here is a depiction of the process:
Clustering to get traversals.
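As a rough sketch of the "map to a vector space and cluster" step, assume we have already extracted the code that initializes the length argument of each sink invocation; the initializer strings below are made up, and a plain bag-of-words embedding with greedy cosine-similarity grouping stands in for the embedding and clustering the authors actually use:

import math
import re
from collections import Counter

# Hypothetical argument initializers collected from different call sites.
initializers = [
    "len = ntohs(packet->length)",
    "len = ntohs(hdr->length)",
    "len = strlen(user_name)",
    "len = strlen(host_name)",
]

def bag_of_words(code):
    """Map an initializer to a sparse vector of identifier counts."""
    return Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Greedy clustering: join the first cluster whose representative is
# similar enough, otherwise start a new cluster.
clusters, threshold = [], 0.6
for code in initializers:
    vec = bag_of_words(code)
    for cluster in clusters:
        if cosine(vec, cluster["rep"]) >= threshold:
            cluster["members"].append(code)
            break
    else:
        clusters.append({"rep": vec, "members": [code]})

for i, cluster in enumerate(clusters):
    print(f"cluster {i}: {cluster['members']}")
# The two ntohs(...) initializers end up in one cluster and the two
# strlen(...) initializers in another; each cluster is a candidate
# description of how tainted data typically reaches the sink.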
The product of this process is a set of graph database traversals such as the ones described in the previous section. These traversals follow a template written in the Gremlin graph query language:
getCallsTo(sink)
  .taintedArgs([arg1Source, ..., argnSource])
  .unchecked([arg1Sanitizer, ..., argnSanitizer])
This method was checked against five well-known vulnerable sinks in five different open source projects with good results. In this case, the quality of the results is measured by the reduction in the amount of code that must be audited manually, which is above 90% for all of them. The authors also tried generating a traversal able to detect the Heartbleed vulnerability in OpenSSL, and ran a general test on all potentially taintable sinks in the VLC media player, discovering 8 previously unknown vulnerabilities.
This was an interesting combination of the code-as-data approach and machine learning techniques to find vulnerabilities in source code (see our secure code review solution). However, many details of the pattern-based approach to ML-guided vulnerability discovery are still missing. In particular, we have yet to discuss vulnerability extrapolation using dimensionality reduction and missing-check detection via anomaly detection. Stay tuned for more details on applying these and other ML techniques to security.
References
- Yamaguchi, F., Golde, N., Arp, D., and Rieck, K. (2014). Modeling and discovering vulnerabilities with code property graphs. In Proc. of IEEE Symposium on Security and Privacy (S&P).
- Yamaguchi, F., Maier, A., Gascon, H., and Rieck, K. (2015). Automatic inference of search patterns for taint-style vulnerabilities. In Proc. of IEEE Symposium on Security and Privacy (S&P).