By Rafael Ballestas | January 25, 2019
While our main focus, as stated before,
is to apply machine learning (ML) techniques to
the discovery of vulnerabilities in source code,
which is a white-box approach,
we’ve come across an interesting approach called VDiscover,
which is radically different in the following sense:
Works on binaries. No source code required.
Mixes dynamic and static detection.
Guides fuzzing campaigns.
Looks primarily for memory corruption.
Is very lightweight, hence scalable.
But perhaps the most distinguishing design feature of VDiscover
is that it is trained and validated with test cases working on the same program,
unlike other approaches which
need to be trained with labeled samples of vulnerable code.
In a nutshell, you tell VDiscover
what happens when you fuzz the program with a certain input,
that it crashes with some other input,
and so on for hundreds more inputs with their outcomes,
to complete its training phase.
Later, in the recall phase, it will be able to predict
which test cases are more likely to expose vulnerabilities.
This process can be depicted as follows:
VDiscover. Taken from their site.
In this diagram, vulnerability discovery procedure means any of the tests we use daily to find security flaws, but especially black-box fuzzing of binaries, concrete symbolic ("concolic") testing and static analysis tools which, while prone to false positives, can still be useful to generate test cases or guide processes like this one.
Why use this tool if
I still need to run my tool of choice to
generate the test cases?
Running these tools is expensive
in time, computing resources and human resources,
all of which translates to money as well.
Also, it doesn’t scale well to huge projects
like entire operating systems, which
consist of tens of thousands of packaged binaries.
Why not just execute your test only on
a thousand of them and let
VDiscover predict the rest,
to later focus only on the ones which are more likely
to contain vulnerabilities?
Sounds like a good deal to me!
Such a modus operandi is what makes
VDiscover stand out
among its peers, besides the fact that it is a proper,
relatively mature open-source project,
whereas other ML-guided vulnerability detectors
are still in development or only provide proof-of-concept programs.
Hence, in order to test VDiscover,
we need to choose:
A particular kind of vulnerability. They chose heap and stack memory corruptions.
A special vulnerability detection procedure.
They chose simple, one
byte at a time,
fuzzing of inputs.
A dataset. They chose one made up from 1039 taken from the Debian Bug Tracker.
The particular machine learning models to
apply to the dataset, since
VDiscover is designed to
work with more than one of those.
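To make the "one byte at a time" fuzzing idea concrete, here is a minimal sketch in Python. It is not the fuzzer the authors used: the function names, the seed input and the choice of replacing a single randomly picked byte per test case are all assumptions made for illustration.

```python
import random


def mutate_one_byte(data: bytes, rng: random.Random) -> bytes:
    """Return a copy of `data` with exactly one byte replaced by a different value."""
    if not data:
        return data
    pos = rng.randrange(len(data))
    old = data[pos]
    # Adding 1..255 modulo 256 guarantees the new value differs from the old one.
    new = (old + rng.randrange(1, 256)) % 256
    return data[:pos] + bytes([new]) + data[pos + 1:]


def fuzz_cases(seed_input: bytes, count: int, seed: int = 0) -> list:
    """Build `count` test cases, each differing from the seed input in one byte."""
    rng = random.Random(seed)
    return [mutate_one_byte(seed_input, rng) for _ in range(count)]


# Hypothetical seed input: the first bytes of a GIF header.
cases = fuzz_cases(b"GIF89a\x01\x00\x01\x00", count=5)
for case in cases:
    print(case)
```

Each generated case would then be fed to the target binary, and the observed outcome recorded as the label for that test case.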
This particular combination of vulnerability and detection procedure has several advantages:
Both implicit and explicit hints to determine whether
the vulnerability was triggered,
like the stack protections provided by the
which abort the execution, or the usage
of functions like
It is an important kind of vulnerability in itself, since it might allow the attacker to execute arbitrary code on the host machine.
However, in order to be able to recognize the hints to memory corruption mentioned above, first some features need to be extracted from the target of evaluation. Dynamic features are taken from the execution of test cases, while static features are extracted from the binary code itself. This is extra information to enrich the dataset, to "provide a redundant and robust similarity measure that a machine learning model can employ to predict whether a test case will be flagged as vulnerable or not".
For the static features, they avoid building
graph representations of code altogether,
and instead settle on reading the
disassembly of the code
at random, but many times,
which captures pretty much all possible
sequences of standard
C library calls.
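The random reading of the disassembly can be pictured with a toy example. The sketch below is only an illustration under stated assumptions: the `BLOCKS` table is a hypothetical stand-in for a disassembled binary, and the walk length and number of walks are arbitrary. It is not VDiscover's actual extraction code.

```python
import random

# Hypothetical toy "disassembly": each block lists the C library calls it
# makes and the blocks that execution may jump to next.
BLOCKS = {
    "entry": (["fopen"], ["read_loop", "fail"]),
    "read_loop": (["fread", "strcpy"], ["read_loop", "done"]),
    "fail": (["perror", "exit"], []),
    "done": (["fclose", "free"], []),
}


def random_walk(blocks, start, max_steps, rng):
    """Follow randomly chosen jumps from `start`, collecting library calls."""
    calls, current = [], start
    for _ in range(max_steps):
        block_calls, successors = blocks[current]
        calls.extend(block_calls)
        if not successors:
            break
        current = rng.choice(successors)
    return tuple(calls)


def sample_call_sequences(blocks, start, walks=200, max_steps=6, seed=0):
    """Repeat the random walk many times and keep the distinct call sequences."""
    rng = random.Random(seed)
    return {random_walk(blocks, start, max_steps, rng) for _ in range(walks)}


sequences = sample_call_sequences(BLOCKS, "entry")
for seq in sorted(sequences):
    print(" -> ".join(seq))
```

With enough walks, the set of sampled sequences approaches the set of all call sequences reachable within the step limit, which is the intuition behind reading "at random, but many times".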
On the other hand,
the dynamic features are simply a set consisting of
the function calls made to the
C standard library,
with their arguments, and the final state of the process,
which may be exit, crash, abort, or timeout.
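One possible way to represent such a dynamic trace is sketched below. The class names, the coarse argument descriptions (such as `stack_ptr`) and the token format are assumptions chosen for illustration, not the format VDiscover actually uses.

```python
from dataclasses import dataclass
from enum import Enum


class FinalState(Enum):
    """The four final process states mentioned in the text."""
    EXIT = "exit"
    CRASH = "crash"
    ABORT = "abort"
    TIMEOUT = "timeout"


@dataclass(frozen=True)
class LibCall:
    name: str
    args: tuple  # coarse argument descriptions (assumed), not raw values


@dataclass(frozen=True)
class DynamicTrace:
    calls: tuple
    final_state: FinalState

    def as_tokens(self):
        """Flatten the trace into string tokens a text-style model can consume."""
        tokens = [f"{c.name}({','.join(c.args)})" for c in self.calls]
        tokens.append(f"final:{self.final_state.value}")
        return tokens


# Hypothetical trace of a test case that ended in a crash.
trace = DynamicTrace(
    calls=(
        LibCall("fopen", ("ptr", "ptr")),
        LibCall("strcpy", ("stack_ptr", "heap_ptr")),
    ),
    final_state=FinalState.CRASH,
)
print(trace.as_tokens())
# ['fopen(ptr,ptr)', 'strcpy(stack_ptr,heap_ptr)', 'final:crash']
```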
Onward to training the machines!
They used three different models: a random forest,
a logistic regression model,
which can be thought of as a particular case of their third model, the multilayer perceptron.
The dataset was divided into three disjoint sets
for training, validation and testing,
and preprocessed with a combination of n-grams.
The training was adjusted to compensate for
class imbalance (an issue with data where
the interesting cases are too scarce amongst regular ones).
The concrete implementation was done in
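As a rough, standard-library-only illustration of those two ideas (and explicitly not the authors' implementation), the sketch below turns a sequence of library calls into a bag of n-grams and computes per-class weights that grow as a class gets rarer. The weighting formula, total / (number of classes × class count), is one common heuristic; whether the authors used it is an assumption. The trace and labels are made-up examples.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous runs of `n` tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, max_n=3):
    """Count every n-gram of length 1..max_n in the token sequence."""
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(tokens, n))
    return bag


def balanced_class_weights(labels):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * count) for label, count in counts.items()}


# Hypothetical call trace and a deliberately imbalanced set of labels.
call_trace = ["fopen", "fread", "strcpy", "free"]
features = bag_of_ngrams(call_trace, max_n=2)
labels = ["ok"] * 9 + ["vulnerable"]
weights = balanced_class_weights(labels)

print(features[("fread", "strcpy")])
print(weights)
```

With nine regular cases and one vulnerable case, the vulnerable class receives a weight nine times larger than the regular one, so a misclassified vulnerable case costs the model correspondingly more during training.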
The most accurate classifier was
the random forest trained with the dynamic features only,
with a prediction error of 31%.
This high error, while not critical,
shows that there is plenty of room for improvement.
Still, these are good results for what was
apparently the only
ML-guided tool for vulnerability research in binaries at the time.
On the other hand, the results are not as spectacular
in terms of producing previously unknown vulnerabilities.
They merely tell us about possible memory corruptions
in particular pieces of code and
how likely they are to be exploitable.
Probable paths that the authors would have liked to follow were to implement convolutional neural networks, try different vulnerability discovery procedures, and, perhaps more promising, use trees representing the possible sequences of library calls, the part that was done randomly in this study. However, as was their purpose, they managed to show that it is actually feasible to learn to search for vulnerabilities in binaries at the operating system scale.
G. Grieco, G. Grinblat, L. Uzal, S. Rawat, J. Feist, L. Mounier (2015).
Toward large-scale vulnerability discovery using machine learning.
Technical Report. French Argentine International Center
of Information Sciences and Systems (CIFASIS),
National Council for Science and Technology of Argentina (CONICET).