
By Rafael Ballestas | November 27, 2018
As we have seen in our previous revision article, probably the most interesting and successful approach to automated vulnerability detection is the pattern-based approach. Since we expect to extract meaningful patterns from the code, we also need a "comprehensive and feature-rich representation"[1] of it. Other authors have tried to work on the code as if it were merely a collection of words or documents with no regards to syntax or inherent order, but with poor results.
The basic idea is to parse the source code,
represent it as a graph which extends on the standard
abstract syntax tree
(AST
) to include information about data and control flow,
and store it in a database which can be queried by the analyst.
Besides the AST
, another couple of useful code representations are:
the control flow graph (CFG
),
which represents the order in which statements are executed
depending on the conditions, and
the program dependence graph (PDG
),
which tells us how variables are modified by statements.
These are better understood by example.
Consider the following C
snippet:
void foo() {
int x = source();
if( x < MAX ) {
int y = 2*x;
sink(y);
}
}
In this context, source is meant to be
the place in the code where the variables might be tainted by the attacker,
and sink is meant to be the function which might be vulnerable
if the input from the source is not sanitized.
The AST
, CFG
, and PDG
for such a snippet would look like this:
Neither of them, alone, can tell us the whole story about what is happening in that snippet, which is, of course, necessary to identify vulnerabilities, since these are generally very context-dependent
Yamaguchi and his team put forward the idea of combining these three graphs into a single multigraph (basically a graph that allows multiple edges for a pair of nodes) which they call the code property graph:
Essentially, it’s like the AST
with additional edges
"borrowed" from the PDG
and CFG
,
and are represented with different colors and labels.
The authors provide detailed explanations
regarding how these graphs are mathematically defined,
operationally built from source and later traversed.
However, here we will continue using only the intuition of how this works,
with the discovery of a Buffer Overflow
vulnerability in the Apple
implementation of SSH
which exposed many iOS
applications.
This is the culprit code:
if channelp {
uint32_t namelen = _libssh2_ntohu32(data ` 9 ` sizeof("exit-signal"));
channelp->exit_signal = LIBSSH2_ALLOC(session, namelen + 1);
[...]
memcpy(channelp->exit_signal, data ` 13 ` sizeof("exit_signal"), namelen);
channelp->exit_signal[namelen] = '\0';
}
Memory is allocated in line 3
using the LIBSSH2_ALLOC
function
depending on the variable namelen
.
The problem is that this variable is controlled by the user,
so if it is made large enough,
we have a buffer overflow which might give the attacker
control of the program execution.
The vulnerability was originally discovered
using a regular expression which looks for a pattern
where the use of functions containing ALLOC
is combined with arithmetical operations:
ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;
However, as we know,
regular expressions fail to match the nested structure of code,
miss the whole context behind a potentially interesting situation
and are prone to false positives.
Instead, a traversal over the code property graph can be defined
which selects third arguments to memcpy
which have been
tainted by first arguments to get_user
that have not been sanitized.
By manually crafting such graph traversals,
the authors were able to identify 88 vulnerabilities in
the Linux
kernel.
Out of those, these 18 were previously unknown:
CPG
That’s 15 CVEs
right there,
made with a graph representation of code and
four measly traversals.
However good the results, they wouldn’t be pertinent
for our research purposes if they couldn’t be
automated using machine learning (ML
).
Since we already know how to traverse the CPG
to look for potentially dangerous patterns,
now we need a way to automatically infer
what those patterns are.
The procedure proposed by Yamaguchi et al. is not universal though,
as it does not apply to all kinds of vulnerabilities,
but to those of the taint-style, like Heartbleed.
Furthermore, it is not entirely automated,
i.e., don’t expect to give it your code and obtain every bug.
Rather, you have to feed it a sink of interest,
such as memcpy
, whose unsanitized use caused Heartbleed
.
Having selected a sink,
we find all invocations of it and slice
the whole part of the program that involves the sink,
and we represent that as a graph.
This graph is then divided into individual calls.
This information is then encoded as features,
i.e., mapped to a vector space,
which is a precondition to applying any ML
algorithm.
Then, these invocations
are clustered
to determine sets of invocations with similar argument initializers.
However, this is not enough,
as some of those arguments could be sanitized.
Thus, the second to last step is to create a sanitization overlay,
in order to avoid false positives.
Here is a depiction of the process:
The product of this process is a set of graph database traversals such as the ones described in the previous section. These traversals follow a template written in the Gremlin graph query language:
getCallsTo(sink)
.taintedArgs(
[arg1Source, ..., argnSource]
).unchecked(
[arg1Sanitizer, ..., argnSanitizer)
)
This method was checked against five well-known
vulnerable sinks in five different open source projects with good results.
In this case, the quality of results
is measured by the reduction made to manual code auditing,
which is above 90%
for all of them.
They also tried to generate a traversal that would be able
to detect the Heartbleed vulnerability
in the OpenSSL
, and a general test on all potentially taintable sinks
in the VLC
media player,
discovering 8
previously unknown vulnerabilities.
This was an interesting combination of the
code-as-data approach
and machine learning techniques
to find vulnerabilities in source code.
However, many details are missing
in the pattern-based approach to ML
-guided vulnerability discovery.
In particular, we have yet to discuss
vulnerability extrapolation using dimensionality reduction
and missing check detection via anomaly detection.
Stay tuned for more details on applying this,
and other ML
techniques to security.
Yamaguchi, F., Golde, N., Arp, D., and Rieck, K. (2014). Modeling and discovering vulnerabilities with code property graphs. In Proc. of IEEE Symposium on Security and Privacy (S&P).
Yamaguchi, F., Maier, A., Gascon, H., and Rieck, K. (2015). Automatic inference of search patterns for taint-style vulnerabilities. In Proc. of IEEE Symposium on Security and Privacy (S&P).
Corporate member of The OWASP Foundation
Copyright © 2021 Fluid Attacks, We hack your software. All rights reserved.