Natural CodeNatural language processing for code security
Our return to the Machine Learning (
ML) for secure code
series is a bit of a
digression, but one too interesting to resist. It
is not too far a digression though, because the Natural Language
NLP) field is also part of what is
currently considered to be
And, as we will state in this article, it has great potential for
applications in information security.
Basically, every cell phone currently in use employs a predictive keyboard. Besides completing words for you based on the first few letters, they are also able to suggest entire words after you have written some. And some of these combinations just make sense because they are used more frequently in common phrasing. Certainly, "peanut" is more likely to be followed by "butter" than "wrench". Extending that idea to more words, such as "peanut butter and jelly" we see they are definitely more likely to be followed by "sandwich" than "salad". The same holds true for "star" followed by "trek", as seen in this demo for the Android Predictive Keyboard:
Figure 1. An n-gram based predictive keyboard at work
This is the basic idea behind n-gram analysis, a technique we have
mentioned before in passing. It has been applied to a couple of the
ML-powered vulnerability detectors we have discussed, most notably by
the binary static analysis tool VDiscover.
An n-gram is simply a sequence of n consecutive words occurring in a piece of real text, which we use as a basis for training. This text is called a corpus in the Natural Language Processing context. This training essentially consists of:
Extracting all the possible n-grams in the corpus taking punctuation into account, so that "now. But before" will not be considered a valid 3-gram.
Counting the occurrence of each n-gram vs the total, i.e., finding the relative frequency of each.
That’s it! Now if you see "peanut butter and jelly", we look at all the 5-grams that contain this 4-gram, and see which one has the highest relative frequency. Suppose the "peanut butter and jelly sandwich" occurs the most in our training corpus. Then the first suggested word to come after the given 4 is, of course, "sandwich", rather than "wrench".
If the corpus is good enough regarding the context in which such words appear, then the suggestions should be just as good. The quality of results, and hence the accuracy of our classifier, is highly dependent on the training corpus' quality. Cell phone predictive keyboards exploit this fact by learning from your typing habits. Depending on who you are "machine" might be more likely followed by "shop", "head" or "learning".
If all this can be done on natural language, which has all sorts of ambiguities, mistakes in the training corpus, irregularities, etc, imagine what could be done if we applied this same idea to code, which is highly regular, ordered and syntactically strict? The possible applications are promising.
Automatically complete code like the text above.
Finding bugs in code via n-gram analysis.
Make code more natural by enforcing coding conventions, i.e. a special kind of linting.
Generate pseudo-code or documentation automatically.
Of course, all these applications require, as do the ones we have
previously presented, a useful representation of code in a way that it
is always referred to as "machine learning" algorithms. This comes as no
surprise if you have been following our previous series. The methods
chosen for this particular application are Abstract Syntax Trees, and an
word2vec for code, aptly named
With representation out of the way, let’s dive into the actual methods.
The main idea behind bug finding via n-gram analysis is to decompose
every function into n-grams that represent their elements, such as
calls, variable names, etc. Then, compare them to one another for
similarity. If we find rare (with low-occurrence frequency) n-grams that
are highly similar to common (high-occurrence frequency) code, then the
rare ones are probably buggy and worthy of further analysis. Take for
example the following snippets from Apache
Figure 2. Snippets found by Bugram
The above snippet is buggy due to the lack of
toString. In fact, it is
exactly the same as the other snippet, only without
suggested it as a possible bug because it was so similar to a commonly
occurring snippet. The bug was reported to the
Pig team and confirmed.
In the test proposed by the paper,
Bugram was able to find 42
confirmed bugs plus 17 false positives across 16 well-known open source
Java projects such as
This approach, while simple and effective, is not without drawbacks, namely, that the weapon cannot be focused on security-related bugs or any specific kind of bug. The same authors later proposed an approach based on deep learning rather than n-grams, but again with the same aim of predicting sofware defects in general.
Another possible application of n-gram analysis that might indirectly
contribute to writing more secure code follows the idea that "cleaner
code leads to secure code". If a person’s writing style can be learned
by n-gram analysis, the same can be true of a particular coder’s
style, or even a whole software project. Take for example our very last
Asserts closure checker engine. Not only do we stick to the
guidelines when naming variables and methods, and separating words by
underscores, we also have a particular way of naming functions.
Sample function names from
fluidasserts.proto.http.can_brute_force fluidasserts.proto.http.has_dirlisting fluidasserts.proto.smb.is_anonymous_enabled fluidasserts.cloud.aws.iam.has_not_support_role
Do you see a tendency here? So did
Naturalize, a project that
tries to "learn natural coding conventions" in order to improve naming
suggestions. The goal is to infer a good name for a function given its
code. That is to say, if I know what it does, I should be able to know
what its name is, assuming that the names are not entirely random or
Behind the scenes
Naturalize uses natural language processing
techniques, such as n-gram analysis to suggest more natural-sounding
names to identifiers. This is the one place where developers can get
creative, perhaps affecting the overall readability or fitting into
project conventions. The package can be integrated in the development
pipeline such as a
pre-commit hook or during developing as an
Figure 3. Naturalize Eclipse plugin at work.
As can be seen here
each is not considered to be a very descriptive or
convention-conforming name, so
testClass is suggested as a better
Natural Language Processing has moved beyond the "natural language" line and is moving increasingly into the "machine learning" or "artificial intelligence" arena. Natural Language Processing will soon have a wider scope of purposes, such as static code analysis, bug finding, and potentially, vulnerability detection. In the future, we are more likely to encounter more applications of NLP in the least expected places.