As we saw in our previous review article, probably the most interesting and successful approach to automated vulnerability detection is the pattern-based approach. Since we expect to extract meaningful patterns from the code, we also need a "comprehensive and feature-rich representation"[1] of it. Other authors have tried to work on code as if it were merely a collection of words or documents, with no regard to syntax or inherent order, but with poor results.
Combining standard code representations
The basic idea is to parse the source code, represent it as a graph which extends the standard abstract syntax tree (AST) with information about data and control flow, and store it in a database that can be queried by the analyst.
Overview of the process.
Besides the AST, two other useful code representations are:
- the control flow graph (CFG), which represents the order in which statements are executed depending on the conditions, and
- the program dependence graph (PDG), which tells us how statements depend on one another through the variables they define and use.
These are better understood by example. Consider the following C snippet:
void foo() {
  int x = source();
  if (x < MAX) {
    int y = 2 * x;
    sink(y);
  }
}
In this context, source is meant to be the place in the code where variables might be tainted by the attacker, and sink is meant to be the function which might be vulnerable if the input from the source is not sanitized. The AST, CFG, and PDG for such a snippet would look like this:
Comparison of graph representations of code.
None of these, on its own, can tell us the whole story about what is happening in that snippet, which is, of course, necessary to identify vulnerabilities, since these are generally very context-dependent.
Yamaguchi and his team put forward the idea of combining these three graphs into a single multigraph (basically, a graph that allows multiple edges between the same pair of nodes), which they call the code property graph:
A code property graph.
Essentially, it’s the AST with additional edges "borrowed" from the PDG and CFG, represented with different colors and labels. The authors provide detailed explanations of how these graphs are mathematically defined, operationally built from source code, and later traversed.
However, here we will stick to the intuition of how this works and illustrate it with the discovery of a buffer overflow vulnerability in the Apple implementation of SSH, which exposed many iOS applications. This is the culprit code:
if (channelp) {
  uint32_t namelen = _libssh2_ntohu32(data + 9 + sizeof("exit-signal"));
  channelp->exit_signal = LIBSSH2_ALLOC(session, namelen + 1);
  [...]
  memcpy(channelp->exit_signal, data + 13 + sizeof("exit_signal"), namelen);
  channelp->exit_signal[namelen] = '\0';
}
Memory is allocated in line 3 using the LIBSSH2_ALLOC function, with a size that depends on the variable namelen. The problem is that this variable is controlled by the user, so if it is made large enough, we have a buffer overflow which might give the attacker control of the program execution. The vulnerability was originally discovered using a regular expression which looks for a pattern where the use of functions containing ALLOC is combined with arithmetic operations:
ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;
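As a quick illustrative check (not part of the original analysis), this regex does flag the allocation line from the snippet above:

# Illustrative only: apply the published regex to the allocation line.
import re

pattern = r'ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;'
line = 'channelp->exit_signal = LIBSSH2_ALLOC(session, namelen + 1);'
print(bool(re.search(pattern, line)))  # True: an ALLOC call mixed with arithmetic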
However, as we know, regular expressions fail to match the nested structure of code, miss the whole context behind a potentially interesting situation, and are prone to false positives. Instead, a traversal over the code property graph can be defined which selects third arguments to memcpy which have been tainted by first arguments to get_user and have not been sanitized.
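Such a traversal is easy to picture on the toy multigraph sketched earlier. The following is only a hedged sketch, not the authors’ query language: it follows data-dependence (PDG) edges from nodes mentioning a taint source to nodes mentioning a sink and reports paths with no sanitizer along them. The function and node names are illustrative assumptions:

# Illustrative sketch of a taint-style traversal over the toy graph above.
import networkx as nx

def tainted_sink_calls(cpg, source_name, sink_name, sanitizer_names=()):
    """Report (source, sink, path) triples connected by unsanitized PDG edges."""
    # Keep only the data-dependence edges for taint propagation.
    pdg = nx.DiGraph([(u, v) for u, v, d in cpg.edges(data=True)
                      if d.get("label") == "PDG"])
    sources = [n for n in pdg if source_name in n]
    sinks = [n for n in pdg if sink_name in n]
    findings = []
    for src in sources:
        for snk in sinks:
            for path in nx.all_simple_paths(pdg, src, snk):
                # A path is reported only if no sanitizer appears along it.
                if not any(s in node for node in path for s in sanitizer_names):
                    findings.append((src, snk, path))
    return findings

# On the toy graph: source() taints x, which flows into sink(y) unchecked.
# print(tainted_sink_calls(cpg, "source", "sink"))

The real traversals work on far richer node and edge types, but the shape of the query (sink, tainted arguments, missing sanitization) matches the Gremlin template shown further down.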
By manually crafting such graph traversals, the authors were able to identify 88 vulnerabilities in the Linux kernel. Out of those, these 18 were previously unknown:
Zero-day vulnerabilities found using CPG.
That’s 15 CVEs right there, made with a graph representation of code and four measly traversals.
Automating traversals
However good the results, they wouldn’t be pertinent for our research purposes if they couldn’t be automated using machine learning (ML). Since we already know how to traverse the CPG to look for potentially dangerous patterns, now we need a way to automatically infer what those patterns are.
The procedure proposed by Yamaguchi et al. is not universal, though: it does not apply to all kinds of vulnerabilities, but only to those of the taint style, like Heartbleed. Furthermore, it is not entirely automated, i.e., don’t expect to give it your code and obtain every bug. Rather, you have to feed it a sink of interest, such as memcpy, whose unsanitized use caused Heartbleed.
Having selected a sink, we find all of its invocations and slice the part of the program that involves them, representing it as a graph. This graph is then decomposed into the individual calls. This information is then encoded as features, i.e., mapped to a vector space, which is a precondition for applying any ML algorithm. Next, these invocations are clustered to determine sets of invocations with similar argument initializers. However, this is not enough, as some of those arguments could be sanitized. Thus, the second-to-last step is to create a sanitization overlay in order to avoid false positives. Here is a depiction of the process:
Clustering to get traversals.
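The feature mapping details are the authors’; what follows is only a toy sketch of the "map to a vector space, then cluster" step, using made-up data and the scikit-learn library, where each invocation of the sink is summarized by the API symbols appearing in its argument initializers:

# Toy sketch of the embedding-and-clustering step; data and names are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering

# One "document" per sink invocation: symbols seen in its argument initializers.
invocations = [
    "ntohs len malloc",   # length read from the network, then allocated
    "ntohs size malloc",
    "strlen buf alloca",  # length taken from a local string
    "strlen src alloca",
]

vectors = CountVectorizer().fit_transform(invocations).toarray()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(vectors)
print(labels)  # invocations with similar argument initializers share a cluster

Roughly, each resulting cluster suggests a candidate source to plug into a traversal, and the sanitization overlay then supplies the corresponding checks.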
The product of this process is a set of graph database traversals such as the ones described in the previous section. These traversals follow a template written in the Gremlin graph query language:
Template for traversals.
getCallsTo(sink)
  .taintedArgs([arg1Source, ..., argnSource])
  .unchecked([arg1Sanitizer, ..., argnSanitizer])
This method was checked against five well-known vulnerable sinks in five different open source projects, with good results. In this case, the quality of the results is measured by the reduction in manual code auditing effort, which is above 90% for all of them. They also tried to generate a traversal that would be able to detect the Heartbleed vulnerability in OpenSSL, and ran a general test on all potentially taintable sinks in the VLC media player, discovering 8 previously unknown vulnerabilities.
This was an interesting combination of the code-as-data approach and machine learning techniques to find vulnerabilities in source code (see our secure code review solution). However, many details of the pattern-based approach to ML-guided vulnerability discovery are still missing from our discussion. In particular, we have yet to cover vulnerability extrapolation using dimensionality reduction and missing-check detection via anomaly detection.
Stay tuned for more details on applying this and other ML techniques to security.
References
- Yamaguchi, F., Golde, N., Arp, D., and Rieck, K. (2014). Modeling and discovering vulnerabilities with code property graphs. In Proc. of IEEE Symposium on Security and Privacy (S&P).
- Yamaguchi, F., Maier, A., Gascon, H., and Rieck, K. (2015). Automatic inference of search patterns for taint-style vulnerabilities. In Proc. of IEEE Symposium on Security and Privacy (S&P).