Understanding Program Semantics

With symbolic execution

Thus far, we have established the need to represent code as vectors before feeding it into a machine learning classifier that will help us sort through a messy project. We have also discussed at length why neural-network-based encoders are probably the best for this task. However, a new problem arises from this. In the case of natural language, the features and labels were clear: we try to predict neighboring words from a central word, as is done in word2vec. But with code we have more freedom, in a way: code has special properties and structure of its own. So we need to decide how to represent pieces of code before even feeding them to the network that will give us the vector representations (as a by-product of its middle layers, if you recall).
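As a quick, minimal sketch of that natural-language setup (the sentence and window size here are made up purely for illustration), this is how those (center, context) training pairs are generated:

```python
# Minimal sketch: generating (center, context) training pairs, as in
# word2vec's skip-gram objective. Sentence and window are made up.

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a fixed window."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps".split()
print(list(skipgram_pairs(sentence))[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```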

One possibility is to represent snippets of code as bags of words (tokens): a simple assignment of tokens to numbers, followed by building a vector that records which tokens occur in a snippet. Notice that this skips the neural network encoder altogether. This is what we did initially, but it is clear even from first principles that it leaves room for improvement.
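To make this concrete, here is a minimal sketch of such a bag-of-words vector; the vocabulary and snippet tokens are invented for illustration:

```python
# Minimal sketch: a bag-of-words (token counts) vector for a snippet.
# Vocabulary and snippet are made up for illustration.
from collections import Counter

vocabulary = ["if", "return", "malloc", "free", "NULL", "=="]

def bow_vector(snippet_tokens, vocab):
    """Count how often each vocabulary token occurs in the snippet."""
    counts = Counter(snippet_tokens)
    return [counts[token] for token in vocab]

snippet = ["if", "p", "==", "NULL", "return", "free", "p"]
print(bow_vector(snippet, vocabulary))  # [1, 1, 0, 1, 1, 1]
```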

Another possibility is to use the snippet's abstract syntax tree (AST), which, incidentally, is what tools such as code2vec and code2seq use. The issue here is that code is not just syntax: it also has rich semantics that we could exploit to obtain more useful vector representations and thus, hopefully, better results at the classifier level. Remember: garbage in, garbage out. The better the input to the classifier, the better the triaging overall.
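For a feel of the syntactic view, here is what an AST looks like for a toy snippet, using Python's standard ast module (the snippet itself is made up; Python 3.9+ for the indent argument):

```python
# Minimal sketch: the syntactic view of a snippet via its AST.
import ast

code = """
def download(url):
    data = fetch(url)
    return data
"""

tree = ast.parse(code)
print(ast.dump(tree, indent=2))
# Tools like code2vec work over paths between leaves of trees like this.
```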

One interesting approach I found that does take into account the semantics of the code to be represented works via intraprocedural symbolic execution. Let's see whether this improves accuracy and actually adds value to our toolchain.

As a refresher, recall that symbolic execution traverses all the paths a program could take with all possible inputs, never pinning those inputs down to concrete values (unlike fuzzing), but rather treating them symbolically, like algebraic indeterminates.
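A minimal sketch of that idea, using the z3-solver package (this is just to illustrate path constraints over a symbolic input; it does not reproduce the paper's tool):

```python
# Minimal sketch: the input x stays symbolic, and each branch of
# "if x > 0: y = 1 / x  else: y = 0" contributes a path constraint.
from z3 import Int, Solver, sat

x = Int("x")  # a symbolic integer: an algebraic indeterminate
for label, constraint in [("then-branch", x > 0), ("else-branch", x <= 0)]:
    solver = Solver()
    solver.add(constraint)        # the path condition for this path
    if solver.check() == sat:     # is any concrete input on this path?
        print(label, "is feasible, e.g., x =", solver.model()[x])
```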

In the work we are reviewing [1], these paths are known as traces: they are left behind during symbolic execution and are further abstracted or, if you will, simplified, so that a simple snippet becomes two abstracted traces like this:

Abstracted traces from a snippet.
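To give a flavor of what such an abstraction step could look like, here is a rough, hypothetical sketch; the event vocabulary (Call, Error, AccessPath) mimics the figure, but the raw events and the helper are my own invention:

```python
# Rough illustration: turning the raw events of one symbolic path into
# abstracted "words". The raw traces and event kinds are made up.

def abstract_trace(raw_events):
    """Map raw (kind, detail) events to abstracted trace words."""
    words = []
    for kind, detail in raw_events:
        if kind == "call":
            words.append(f"Call({detail})")
        elif kind == "deref":
            words.append(f"AccessPath({detail})")
        elif kind == "error":
            words.append(f"Error({detail})")
    return words

# One path where malloc succeeds, one where it fails:
path_ok   = [("call", "malloc"), ("deref", "p->field")]
path_fail = [("call", "malloc"), ("error", "ENOMEM")]
print(abstract_trace(path_ok))    # ['Call(malloc)', 'AccessPath(p->field)']
print(abstract_trace(path_fail))  # ['Call(malloc)', 'Error(ENOMEM)']
```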

Incidentally, these Call, Error, AccessPath, etc., abstractions subtly hint at an idea we've been considering: using an intermediate representation of code before performing the actual encoding. This could be one such candidate, but pseudocode, assembly, or even some other ad hoc intermediate representation would solve, or at least improve on, two issues at hand:

  • Code from different clients will come in different programming languages; a common intermediate representation would normalize them.

  • The customer's code needs protection from leakage; an intermediate representation would add another masking layer.

Back to the abstracted symbolic traces: the authors then encode these as words in order to use GloVe, a sort of big brother to word2vec. (I warned you: most code encoders either work with or are heavily inspired by it.) GloVe exploits the intuition that words appearing together frequently bear some relation to each other, which seems reasonable in the regular usage of language. It remains to be seen how well that idea translates to code, though.
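As a minimal sketch of the statistics GloVe is trained on, consider counting co-occurrences over abstracted trace words; the toy traces are the ones from the sketch above, and real corpora are, of course, vastly larger:

```python
# Minimal sketch of the co-occurrence counts GloVe is fit on, computed
# over abstracted trace words. GloVe then learns vectors whose dot
# products (plus bias terms) approximate the log of these counts.
from collections import Counter
from itertools import combinations

traces = [
    ["Call(malloc)", "AccessPath(p->field)", "Call(free)"],
    ["Call(malloc)", "Error(ENOMEM)"],
]

cooccur = Counter()
for trace in traces:
    for a, b in combinations(trace, 2):  # every pair within one trace
        cooccur[(a, b)] += 1

print(cooccur.most_common(2))
```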

That describes the entire architecture of this system for embedding code into vectors. The authors ran comprehensive experiments to weigh the added value of running symbolic execution against its overhead, in terms of the quality of the learned embeddings. One such experiment is the production of code analogies: much like "king - man + woman = queen" in natural language, similar relations arise here, such as "receive is to download as send is to upload," where each of those is a token naturally occurring in snippets of code. Of course, none of this is standardized yet, so there is no benchmark; the authors had to build their own benchmark and corresponding dataset, a task we will certainly be confronted with soon.
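Once embeddings exist, checking such an analogy is plain vector arithmetic; here is a minimal sketch with tiny, made-up vectors (real embeddings are learned and much higher-dimensional):

```python
# Minimal sketch: checking a code analogy by vector arithmetic, as in
# "king - man + woman = queen". The 3-dimensional embeddings below are
# made up purely so the arithmetic works out.
import numpy as np

emb = {
    "receive":  np.array([1.0, 0.0, 1.0]),
    "download": np.array([1.0, 1.0, 1.0]),
    "send":     np.array([0.0, 0.0, 1.0]),
    "upload":   np.array([0.0, 1.0, 1.0]),
    "parse":    np.array([1.0, 1.0, 0.0]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# receive : download :: send : upload, so receive - download + upload
# should land closest to "send" among the remaining words.
query = emb["receive"] - emb["download"] + emb["upload"]
print(max((w for w in emb if w not in {"receive", "download", "upload"}),
          key=lambda w: cosine(query, emb[w])))  # -> send
```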

The most eye-opening result is the comparison between syntactic and semantic based embeddings:

Syntactic vs. semantic embeddings.

Almost 300% better. This makes intuitive sense: considering more information, and especially the information that gives code its meaning, should give better results. But only the experiment proves it, and quite resoundingly at that. In the image above, OOV means out-of-vocabulary tests, i.e., analogies involving tokens absent from the vocabulary, which therefore cannot be evaluated. Even leaving these out, the semantic results are twice as good as the syntax-based ones. This means, on the one hand, that it is certainly worthwhile to include semantic information in the code representation fed to embedders, despite the overhead. On the other hand, to be fair, the syntax-only approach they compare against might not be as sophisticated as, say, code2vec. It could be that we are comparing the weakest machine gun to the weakest pistol instead of to the strongest, which might win (sorry, I'm short on metaphors today, but you get the idea).

The authors were nice enough to dedicate a section to the potential usage of their tool in downstream tasks, such as error-code misuse detection. That is, they don't just hand over the tool; they also show how to integrate it into a complete machine learning toolchain, which is nice for a research paper, where authors usually don't bother and leave that to the reader. More interestingly for us, they focused on bug finding (although not strictly security-focused) and on the automatic repair of such code. The results gave an accuracy of 76.5%, which, as we have seen throughout this series, is not bad in this context.
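As a hedged sketch of such a downstream integration, one could train a simple classifier on top of trace embeddings; the random data, embedding dimension, and labels below are placeholders, not the paper's actual pipeline:

```python
# Hedged sketch of a downstream task: a simple classifier on top of
# trace embeddings. All data here is a random stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))    # stand-in for learned trace embeddings
y = rng.integers(0, 2, size=200)  # stand-in labels: misuse or not

clf = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```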

This is, overall, a tool we might actually use for our purposes. More importantly, it shows us the importance of considering semantic information before even attempting to design the network that embeds code into vectors. More on that soon.

References

  1. J. Henkel, B. Liblit, S. K. Lahiri, and T. Reps. Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces. In Proceedings of ESEC/FSE 2018.
