Young hacker smiling

We hack your software

zero false positives

Expert intelligence + effective automation

Cartoonized dragon book cover

Exploiting code graphs

Mining graph representations for vulnerabilities
How to exploit graph representations of code in order to find security vulnerabilites. We introduce Yamaguchi's concept of code property graphs, which combines standard graph representations, how to traverse them, and how to guide a computer to traverse it on its own.

As we have seen in our previous revision article, probably the most interesting and successful approach to automated vulnerability detection is the pattern-based approach. Since we expect to extract meaningful patterns from the code we also need a "comprehensive and feature rich representation"[1] of it. Other authors have tried to work on the code as if it were merely a collection of words or documents with no regards to syntax or inherent order, but with poor results.

Combining standard code representations

The basic idea is to parse the source code, represent it as a graph which extends on the standard abstract syntax tree (AST) to include information about data and control flow, and store it in a database which can be queried by the analyst.

Overview of the process
Figure 1. Overview of the process

Besides the AST, another couple of useful code representations are:

  • the control flow graph (CFG), which represents the order in which statements are executed depending on the conditions, and

  • the program dependence graph (PDG), which tells us how variables are modified by statements.

These are better understood by example. Consider the following C snippet:

void foo() {
  int x = source();
  if( x < MAX ) {
    int y = 2*x;
    sink(y);
  }
}

In this context, source is meant to be the place in the code where the variables might be tainted by the attacker and sink is meant to be the function which might be vulnerable if the input from the source is not sanitized. The AST, CFG and PDG for such a snippet would be like this:

Code graph comparison
Figure 2. Comparison of graph representations of code

Neither of them alone can tell us the whole story about what is happening in that snippet, which is of course necessary to identify vulnerabilities, since these are generally very context-dependent.

Yamaguchi and his team put forward the idea of combining these three graphs into a single multigraph (basically a graph that allows multiple edges for a pair of nodes) which they call the code property graph:

A code property graph
Figure 3. A code property graph

Essentially, it’s like the AST with additional edges "borrowed" from the PDG and CFG, which are therefore represented with different colors and labels. The authors provide detailed explanations regarding how these graphs are mathematically defined, operationally built from source and later traversed. However here we will continue using only the intuition of how this works with the discovery of a Buffer Overflow vulnerability in the Apple implementation of SSH which exposed many iOS applications. This is the culprit code:

1
2
3
4
5
6
7
if channelp {
  uint32_t namelen = _libssh2_ntohu32(data + 9 + sizeof("exit-signal"));
  channelp->exit_signal = LIBSSH2_ALLOC(session, namelen + 1);
  [...]
  memcpy(channelp->exit_signal, data + 13 + sizeof("exit_signal"), namelen);
  channelp->exit_signal[namelen] = '\0';
}

Memory is allocated in line 3 using the LIBSSH2_ALLOC function depending on the variable namelen. The problem is that this variable is controlled by the user, so if it is made to be large enough, we have a buffer overflow which might give the attacker control of the program execution. The vulnerability was originally discovered using a regular expression which looks for a pattern where the use of functions containing ALLOC are combined with arithmetical operations:

ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;

However, as we know, regular expressions fail to match the nested structure of code, miss the whole context behind a potentially interesting situation and are prone to false positives. Instead, a traversal over the code property graph can be defined which selects third arguments to memcpy which have been tainted by first arguments to get_user that have not been sanitized.

By manually crafting such graph traversals, the authors were able to identify 88 vulnerabilities in the Linux kernel. Out of those, these 18 were previously unknown:

Zero-day vulnerabilities found
Figure 4. Zero-day vulnerabilites found using CPG

That’s 15 CVEs right there, made with a graph representation of code and four measly traversals.

Automating traversals

However good the results, they wouldn’t be pertinent for our research purposes if they couldn’t be automated using machine learning (ML). Since we already know how to traverse the CPG to look for potentially dangerous patterns, now we need a way to automatically infer what those patterns are.

The procedure proposed by Yamaguchi et al. is not universal, though, as it does not apply to all kinds of vulnerabilites, but to those of the taint-style, like Heartbleed. Furthermore, it is not entirely automated, i.e., don’t expect to give it your code and obtain every bug. Rather, you have to feed it a sink of interest, such as memcpy, whose unsanitized use caused Heartbleed.

Having selected a sink, we find all invocations of it and slice the whole part of the program that involves it and represent that as graph. This graph is then divided into individual calls. This information is then encoded as features i.e. mapped to a vector space, which is a precondition to apply any ML algorithm. Then these invocations are clustered to determine sets of invocations with similar argument initializers. However, this is not enough, as some of those arguments could be sanitized. Thus the second to last step is to create a sanitization overlay in order to avoid false positives. Here is a depiction of the process:

Clustering to get traversals
Figure 5. Clustering to get traversals

The product of this process is set of graph database traversals such as the ones described in the previous section. These traversals follow a template written in the Gremlin graph query language:

Template for traversals
getCallsTo(sink)
  .taintedArgs(
          [arg1Source, ..., argnSource]
  ).unchecked(
          [arg1Sanitizer, ..., argnSanitizer)
  )

The method was checked against five well-known vulnerable sinks in five different open source projects, with good results. In this case the quality of results is measured by the reduction made to manual code auditing, which is above 90% for all of them. They also tried to generate a traversal that could be able to detect the Heartbleed vulnerability in the OpenSSL and also a general test on all potentially taintable sinks in the VLC media player, discovering 8 previously unknown vulnerabilities.


This was an interesting combination of the code-as-data approach and machine learning techniques to find vulnerabilities in source code. However many details are missing in the pattern-based approach to ML-guided vulnerability discovery. In particular, we have yet to discuss vulnerability extrapolation using dimensionality reduction and missing check detection via anomaly detection. Stay tuned for more details on applying this and other ML techniques to security.

References

  1. Yamaguchi, F., Golde, N., Arp, D., and Rieck, K. (2014). Modeling and discovering vulnerabilities with code property graphs. In Proc. of IEEE Symposium on Security and Privacy (S&P).

  2. Yamaguchi, F., Maier, A., Gascon, H., and Rieck, K. (2015). Automatic inference of search patterns for taint-style vulnerabilities. In Proc. of IEEE Symposium on Security and Privacy (S&P).


Author picture

Rafael Ballestas

Mathematician

with an itch for CS



Related




Service status - Terms of Use