
Binary learning: Learning to exploit binaries


Rafael Ballestas

Security analyst

Updated

Jan 25, 2019




While our main focus, as stated previously, is applying machine learning (ML) techniques to the discovery of vulnerabilities in source code, that is, a white-box approach to ML-guided hacking, we’ve come across an interesting tool called VDiscover, which is radically different in that it:

Works on binaries. No source code required.
Mixes dynamic and static detection.
Guides fuzzing campaigns.
Looks primarily for memory corruption.
Is very lightweight, hence scalable.

But perhaps the most distinguishing design feature of VDiscover is that it is trained and validated with test cases working on the same program, unlike other approaches, which need to be trained with labeled samples of vulnerable code. In a nutshell, during the training phase you tell VDiscover what happens when you fuzz the program with a certain input, that it crashes with some other input, and so on for hundreds more inputs and their outcomes. Later, in the recall phase, it will be able to predict which test cases are more likely to expose vulnerabilities. This process can be depicted as follows:

Training (left) and recall (right) phases of VDiscover. Taken from their site.

In this diagram, vulnerability discovery procedure means any of the tests we use daily to find security flaws, but especially black-box fuzzing of binaries, concrete symbolic ("concolic") testing and static analysis tools which, while prone to false positives, can still be useful to generate test cases or guide processes like this one.
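The two phases can be illustrated with a toy sketch. It is not VDiscover's code and does not use any of its models: it swaps in a simple nearest-neighbour rule over sets of C library calls, and the class name, the call sets and the labels are all hypothetical. The point is only the shape of the workflow: train on test cases already judged by the discovery procedure, then predict the verdict for unseen ones.

```python
def jaccard(a, b):
    """Similarity between two sets of features (1.0 means identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class NearestNeighbourPredictor:
    """Toy stand-in for the trained model: remembers labeled test cases."""

    def __init__(self):
        self.examples = []  # list of (features, flagged) pairs

    def train(self, labeled_cases):
        # Training phase: features plus the verdict of the discovery procedure.
        self.examples.extend(labeled_cases)

    def predict(self, features):
        # Recall phase: reuse the verdict of the most similar known test case.
        if not self.examples:
            raise ValueError("train() must be called before predict()")
        best = max(self.examples, key=lambda ex: jaccard(ex[0], features))
        return best[1]


# Hypothetical training data: C library calls seen while running a test case,
# and whether the discovery procedure flagged it (True) or not (False).
training = [
    ({"fopen", "fread", "strcpy"}, True),
    ({"fopen", "fread", "malloc", "memcpy"}, True),
    ({"getenv", "printf"}, False),
    ({"fopen", "fgets", "printf"}, False),
]

model = NearestNeighbourPredictor()
model.train(training)

unseen = {"fopen", "fread", "strcpy", "strlen"}
print(model.predict(unseen))
```

In a real campaign the expensive procedure would only be run to produce the training pairs, and the prediction would decide which of the remaining test cases deserve that expense.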

Why use this tool if I still need to run my tool of choice to generate the test cases? Running these tools is expensive in time, computing resources and human resources, all of which translates to money as well. They also don’t scale well to huge projects like entire operating systems, which consist of tens of thousands of packaged binaries. Why not just run your tests on only a thousand of them and let VDiscover predict the rest, so that you can later focus only on the ones that are more likely to contain vulnerabilities? Sounds like a good deal to me!

Such a modus operandi is what makes VDiscover stand out among its peers, besides the fact that it is a proper, relatively mature open-source project, while other ML-guided vulnerability detectors are still in development or provide only proof-of-concept programs.

Hence, in order to test VDiscover, we need to choose:

A particular kind of vulnerability. They chose heap and stack memory corruptions.
A specific vulnerability detection procedure. They chose simple fuzzing of inputs, one byte at a time.
A dataset. They chose one made up of 1039 test cases taken from the Debian Bug Tracker.
The particular machine learning models to apply to the dataset, since VDiscover is designed to work with more than one of those.
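To make "one byte at a time" fuzzing concrete, here is a minimal sketch of such a mutator. It is an assumption about what the procedure looks like in spirit, not the fuzzer the authors used; the function name and the seed contents are hypothetical.

```python
import random


def one_byte_mutations(seed, count, rng=None):
    """Yield `count` copies of `seed`, each differing in exactly one byte."""
    if not seed:
        raise ValueError("seed input must not be empty")
    rng = rng or random.Random()
    for _ in range(count):
        position = rng.randrange(len(seed))
        # Pick a replacement value that differs from the original byte.
        new_byte = rng.choice([b for b in range(256) if b != seed[position]])
        mutated = bytearray(seed)
        mutated[position] = new_byte
        yield bytes(mutated)


seed_input = b"GIF89a\x01\x00\x01\x00"  # hypothetical seed file contents
for case in one_byte_mutations(seed_input, count=3, rng=random.Random(0)):
    print(case)
```

Each mutated input would then be fed to the target program, and what the program does with it becomes one test case in the dataset.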

This particular combination of vulnerability and detection procedure has several advantages:

There are both implicit and explicit hints for determining whether the vulnerability was triggered, like the stack protections provided by the GNU C library, which abort the execution, or the usage of functions like strcpy and fread.
It is an important kind of vulnerability in its own right, since it might allow the attacker to execute arbitrary code on the host machine.

However, in order to be able to recognize the hints to memory corruption mentioned above, first some features need to be extracted from the target of evaluation. Dynamic features are taken from the execution of test cases, while static features are extracted from the binary code itself. This is extra information to enrich the dataset, to "provide a redundant and robust similarity measure that a machine learning model can employ to predict whether a test case will be flagged as vulnerable or not".

They avoid building graph representations of code altogether, and instead settle on reading the disassembly of the code at random, many times over, so as to capture pretty much all possible sequences of standard C library calls. The dynamic features, on the other hand, are simply a set of calls to the C standard library, with their arguments, plus the final state of the process, which may be exit, crash, abort, or timeout.
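Both kinds of features can be sketched in a few lines. This is an illustration under stated assumptions, not VDiscover's extractor: the NEXT_CALLS map is a hypothetical hand-written summary of a disassembled binary, and the argument names in the trace are placeholders.

```python
import random
from dataclasses import dataclass

# Hypothetical summary of a disassembled binary: for each C library call,
# the calls that may come next. An empty list means the program may end there.
NEXT_CALLS = {
    "fopen": ["fread", "fclose"],
    "fread": ["strcpy", "fclose"],
    "strcpy": ["printf", "fclose"],
    "printf": ["fclose"],
    "fclose": [],
}

FINAL_STATES = ("exit", "crash", "abort", "timeout")


def random_walk(start, max_steps, rng):
    """Follow randomly chosen successors, returning one sequence of calls."""
    sequence = [start]
    while len(sequence) < max_steps and NEXT_CALLS[sequence[-1]]:
        sequence.append(rng.choice(NEXT_CALLS[sequence[-1]]))
    return sequence


def static_features(start, walks, max_steps, seed=0):
    """Collect the distinct call sequences seen over many random walks."""
    rng = random.Random(seed)
    return {tuple(random_walk(start, max_steps, rng)) for _ in range(walks)}


@dataclass
class DynamicTrace:
    """What one execution of a test case leaves behind."""
    calls: list       # (function name, arguments) pairs, in call order
    final_state: str  # one of FINAL_STATES

    def __post_init__(self):
        if self.final_state not in FINAL_STATES:
            raise ValueError(f"unknown final state: {self.final_state}")


sequences = static_features("fopen", walks=200, max_steps=5)
trace = DynamicTrace(
    calls=[("fopen", ("path", "mode")), ("fread", ("buf", "size", "count", "file"))],
    final_state="crash",
)
print(len(sequences), trace.final_state)
```

Repeating the walk many times is what stands in for a full graph: no structure is built, yet most reachable call sequences end up in the set.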

Onward to training the machines! They used three different models: a random forest, a logistic regression model, which can be thought of as a particular case of their third model, and the multilayer perceptron. The dataset was divided into three disjoint sets for training, validation and testing, and preprocessed with a combination of n-grams and word2vec encoding. The training was also adjusted to compensate for class imbalance (an issue with data where the interesting cases are too scarce amongst regular ones).
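Two of these preprocessing ideas are easy to show in isolation. The sketch below counts n-grams over a hypothetical sequence of library calls and computes class weights inversely proportional to class frequency. The weighting formula is a common heuristic chosen here as an assumption; the text above does not say which compensation scheme the authors applied, and the label names are made up.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return every run of n consecutive items in the sequence."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def class_weights(labels):
    """Weight each class inversely to its frequency so rare classes count more."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * count) for label, count in counts.items()}


# Hypothetical trace of C library calls from one test case.
trace = ["fopen", "fread", "strcpy", "printf", "fclose"]
print(Counter(ngrams(trace, 2)))

# Hypothetical imbalanced labels: nine uninteresting cases for each flagged one.
labels = ["robust"] * 9 + ["flagged"] * 1
print(class_weights(labels))
```

With these weights, a mistake on the single flagged case costs the model ten times as much as a mistake on any one of the nine regular cases, which keeps it from simply predicting the majority class.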

The concrete implementation was done in Python using the scikit-learn and pylearn2 libraries. The most accurate classifier was the random forest trained with the dynamic features only, with a prediction error of 31%. This high error, while not critical, shows that there is plenty of room for improvement. Still, these are good results for what was apparently the only ML-guided tool for vulnerability research in binaries at the time. On the other hand, the results are not as spectacular in terms of producing previously unknown vulnerabilities. They merely tell us about possible memory corruptions in particular pieces of code and how likely they are to be exploitable.

Paths the authors would have liked to explore include implementing convolutional neural networks, trying different vulnerability discovery procedures and, perhaps most promising, using trees that represent the possible sequences of library calls, the part that was done randomly in this study. However, as was their purpose, they managed to show that it is actually feasible to learn to search for vulnerabilities in binaries at the operating system scale.


Tags: machine-learning, vulnerability, exploit






