By Rafael Ballestas | February 14, 2020
Thus far, we have established the need to represent code as vectors before feeding it into a machine learning classifier that will help us sort through a messy project. We have also discussed at length why neural network based encoders are probably the best for this task. However, a new problem arises from this. In the case of natural language, the features and labels were clear: we try to predict neighboring words from a central word, as is done in code2vec. But with code we have more freedom, in a way: code has several special properties and structure. So we need to decide how to represent pieces of code before even feeding them to the network that will give us the vector representations, if you recall, as a by-product in the middle layers.
One of these possible representations, is to represent snippets of code as bags of words (tokens), i.e., a simple assignment of words to numbers and then making up a vector around that saying which tokens occurr in a snippet. And, by the way, we skip the neural network encoder. This is what we did initially, but it is clear even from the foundations that this leaves room for improvement.
Another possibility is to use the snippet’s
Abstract Syntax Tree (
incidentally, is what they use in tools such as
However an issue that we could claim to be present here
is that code is not just about syntax,
but it also has rich semantics which could be
exploited by us to obtain more useful vector representations
and thus, hopefully, better results at the classifier level.
Remember: garbage in, garbage out.
The better the input to the classifier,
the better the triaging overall.
One interesesting approach I found which does take into account the semantics of the code to be represented works by using intraprocedural symbolic execution. Let’s see if this improves accuracy and actually adds value to our toolchain.
As a refresher, recall that symbolic execution. traverse all the paths a program could take with all possible inputs, never defining them (unlike fuzzing), but rather treating them symbolically like algebraic indeterminates.
In the work we are reviewing, these paths are known as traces which are left during the symbolic execution and are further abstracted or, if you will, simplified, so that a simple snippet like the following becomes two abstracted traces like this:
type of abstractions subtly hint at an idea
we’ve been considering, namely,
using an intermediate representation of code
before performing the actual encoding.
This could be one such candidate,
but pseudocode, assembly or even some other ad hoc
kind of intermediate representation would solve or improve on
two issues at hand:
Code from different clients will come in different programming languages.
Protecting the customer’s code from leakage by adding another masking layer.
Back to abstracted symbolic traces,
they then proceed to encode these as words
in order to use
GloVe, a sort of big brother to
(I warned you, most code encoders either work with
or are heavily inspired by
GloVe exploits the intutition that when
words appear together frequently,
they have some relation,
which seems reasonable in the regular usage of language.
It remains to be seen how well such an idea translates to code, though.
That describes the entire architecture of this
system of embedding code into vectors.
They ran comprehensive experiments to determine
the added value vs the overhead of running symbolic execution
regarding the quality of the learned embeddings.
One is the production of code analogies:
much like "king - man + woman = queen" for natural language,
similar relations arise here,
such as “receive` is to
send is to `upload”,
where each of those are tokens naturally occurring in snippets of code.
Of course, this is not yet standardized,
so the is no benchmark; the authors had to make their own
benchmark and corresponding dataset,
a task we will certainly be confronted with soon.
The most eye-opening result is the comparison between syntactic and semantic based embeddings:
Almost 300% better. This makes intuitive sense,
since considering more information, and
especially the information which gives code its meaning,
should give better results.
However only the experiment proves it.
And quite resoundingly at that.
In the image above, OOV means out-of-vocabulary tests,
i.e., the ones that can’t be tested.
Even leaving these out, the semantic results are twice
as good as the syntax-based ones.
This means, on the one hand, that
is is certainly worthwile to include semantic information
in the code representation for input to embedders,
despite the overhead.
On the other, to be fair,
the syntax-only approach they use might not be as sophisticated as,
It could be that we are comparing the weakest
machine gun to the weakest pistol,
instead of to the strongest, which might win
(sorry, I’m short on metaphors today, but you get the idea).
The authors were nice enough to dedicate a section to the potential usage of their tool in downstream tasks, such as error-code misuse detection. That is, they don’t just give the tool, but also show you how to integrate it into a complete machine learning toolchain, which is nice for a research paper, where they usually don’t bother and leave it up to the reader to figure that out. More interestingly for us, they focused on bug-finding (although not strictly security-focused), and the automatic repair of such code. The results gave an accuracy of 76.5%, which as we haven seen throughout the series, is not bad in this context.
This is, overall, a tool which we might actually use for our purposes but, more importantly, shows us the importance of considering semantic information before even attempting to design the network for embedding code into vectors. More on that soon.
J. Henkel, B. Liblit, S. Lahiri, T. Reps. Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces. ESEC/FSE '18.
Corporate member of The OWASP Foundation