
Photo by Andres Urena on Unsplash. Credits: https://unsplash.com/photos/k1osF_h2fzA

Natural code

Natural language processing for code security
Natural Language Processing has transcended the scope of natural language. Nowadays it has several applications in other realms such as static code analysis. In particular, we show applications to bug finding and coding conventions linting both based upon the n-gram model.

Our return to the Machine Learning (ML) for secure code series is a bit of a digression, but one too interesting to resist. At the same time, it is not really a digression, since the Natural Language Processing (NLP) field is also part of what, at least today, is considered Artificial Intelligence. And, as we will argue in this article, it has great potential for applications in information security.

Nowadays basically every cell phone uses a predictive keyboard. Besides completing words for you based on the first few letters, it can also suggest entire words after you have written some. Some of these combinations just make sense because they occur more frequently in common parlance. Certainly, "peanut" is more likely to be followed by "butter" than by "wrench". Extending that idea to more words, "peanut butter and jelly" is definitely more likely to be followed by "sandwich" than by "salad", just as "star" is by "trek", as in this demo of the Android Predictive Keyboard:

Android Predictive Keyboard demo animation
Figure 1. An n-gram-based predictive keyboard at work

This is the basic idea behind n-gram analysis, a technique we have mentioned before in passing. It has been applied in a couple of the ML-powered vulnerability detectors we have discussed, most notably in the binary static analysis tool VDiscover.

An n-gram is simply a sequence of n consecutive words occurring in a piece of real text, which we take as the basis for training. This text is called a corpus in the Natural Language Processing context. This training essentially consists of:

  • Extracting all the possible n-grams from the corpus, taking punctuation into account so that "now. But before" is not considered a valid 3-gram.

  • Counting the occurrences of each n-gram against the total, i.e., finding the relative frequency of each.

That’s it! Now if we see "peanut butter and jelly", we look at all the 5-grams that begin with this 4-gram and see which one has the highest relative frequency. Suppose "peanut butter and jelly sandwich" occurs the most in our training corpus. Then the first suggested word to follow the given four is, of course, "sandwich" rather than "wrench".
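To make all this concrete, here is a minimal sketch in Python of the training and suggestion steps just described. The toy corpus, the naive whitespace tokenization and the function names are ours, purely for illustration; a real implementation would handle punctuation and sentence boundaries as noted above:

from collections import Counter

def train_ngrams(corpus, n):
    # Naive whitespace tokenization; real training respects sentence
    # boundaries so that n-grams do not cross punctuation.
    words = corpus.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def suggest_next(prefix, ngrams):
    # Among the n-grams that begin with the given words,
    # pick the most frequent continuation.
    context = tuple(prefix.split())
    candidates = {gram[-1]: count for gram, count in ngrams.items()
                  if gram[:-1] == context}
    return max(candidates, key=candidates.get) if candidates else None

corpus = ("i made a peanut butter and jelly sandwich "
          "she made a peanut butter and jelly sandwich "
          "he bought a peanut butter and jelly salad")
fivegrams = train_ngrams(corpus, 5)
print(suggest_next("peanut butter and jelly", fivegrams))  # sandwich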

If the corpus is good enough regarding the context in which such words would appear, then the suggestions should be just as good. The quality of the results, and hence the accuracy of our classifier, is highly dependent on the quality of the training corpus. Cell phone predictive keyboards exploit this fact by learning from your typing habits. Depending on who you are, "machine" might be more likely to be followed by "shop", "head" or "learning".

Now, if all this can be done with natural language, which has all sorts of ambiguities, irregularities and mistakes in the training corpus, imagine what can be done if we apply the same idea to code, which is highly regular, ordered and syntactically strict. The possible applications are very promising:

  • Automatically completing code, as with the text above.

  • Finding bugs in code via n-gram analysis.

  • Making code more natural by enforcing coding conventions, i.e., a special kind of linting.

  • Generating pseudocode or documentation automatically.

Of course, all of these applications require, as did the ones we presented previously, a useful representation of code that machine learning algorithms can consume. This comes as no surprise by now if you have been following our series. The representations chosen for this particular application are Abstract Syntax Trees and an adaptation of word2vec for code, aptly named code2vec.
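As a small illustration of that first step, the sketch below uses Python's standard ast module to reduce a function to the sequence of calls it makes, a token stream that an n-gram model could consume. The sample function is hypothetical, and this is only a crude stand-in: code2vec goes much further, learning vector embeddings from paths in the AST.

import ast

# A hypothetical function whose call sequence we want as tokens:
source = '''
def process(record):
    data = record.get("value")
    return str(data).strip()
'''

tree = ast.parse(source)
calls = [node.func.attr if isinstance(node.func, ast.Attribute)
         else node.func.id
         for node in ast.walk(tree)
         if isinstance(node, ast.Call)]
print(calls)  # ['get', 'strip', 'str']; order depends on the traversal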

With representation out of the way, let us dive into the actual methods. The main idea behind bug finding via n-gram analysis is to decompose every function into n-grams that represent its elements, such as API calls and variable names, and then compare these n-grams to one another for similarity. If we find rare n-grams (i.e., with low occurrence frequency) that are highly similar to common (high-frequency) ones, then the rare ones are probably buggy and worthy of further analysis. Take, for example, the following snippets from Apache Pig:

Snippets found by Bugram
Figure 2. Snippets found by Bugram

The first snippet is buggy due to the missing call to toString. In fact, it is exactly the same as the second snippet, only without toString. Bugram flagged it as a possible bug because it was so similar to a commonly occurring snippet. The bug was reported to the Pig team and confirmed. In the evaluation reported in the paper, Bugram found 42 confirmed bugs plus 17 false positives across 16 well-known open-source Java projects such as Pig.
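The comparison step can be pictured with a crude sketch like the one below. Bugram actually ranks token sequences by their probability under an n-gram language model; here we approximate the idea with raw counts and a generic similarity measure, and the function names, thresholds and toy call sequences are all ours:

from collections import Counter
from difflib import SequenceMatcher

def flag_suspicious(call_sequences, n=3, rare=1, common=5, threshold=0.6):
    # Count every n-gram of consecutive calls across all functions.
    counts = Counter(tuple(seq[i:i + n])
                     for seq in call_sequences
                     for i in range(len(seq) - n + 1))
    frequent = [gram for gram, count in counts.items() if count >= common]
    for gram, count in counts.items():
        if count > rare:
            continue  # only rare n-grams are suspects
        for ref in frequent:
            similarity = SequenceMatcher(None, gram, ref).ratio()
            if similarity >= threshold:
                yield gram, ref, similarity

# Six functions call toString before append; one forgets it:
sequences = [["get", "toString", "append"]] * 6 + [["get", "append", "log"]]
for suspect, usual, sim in flag_suspicious(sequences):
    print(suspect, "resembles the common", usual, round(sim, 2))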

This approach, while simple and effective, is not without drawbacks: namely, it cannot be aimed specifically at security-related bugs or any other particular kind of bug. The same authors later proposed an approach based on deep learning rather than n-grams, but again with the same aim: predicting software defects in general.


Another possible application of n-gram analysis, one that might indirectly contribute to writing more secure code, follows the idea that "cleaner code leads to secure code". If a person’s writing style can be learned by n-gram analysis, the same can be true of a particular coder’s style, or even that of a whole software project. Take, for example, our very own Asserts closure checker engine: not only do we stick to the Python guidelines when naming variables and methods, separating words with underscores, but we also have a particular way of naming functions:

Sample function names from Asserts
fluidasserts.proto.http.can_brute_force
fluidasserts.proto.http.has_dirlisting
fluidasserts.proto.smb.is_anonymous_enabled
fluidasserts.cloud.aws.iam.has_not_support_role

Notice a tendency here? So could Naturalize, a project that aims to "learn natural coding conventions" and thus improve naming suggestions. The aim is, essentially, to infer a good name for a function from its code. That is, if I know what a function does, I should be able to tell what its name should be, assuming that names are not entirely random or humorously unmaintainable.

Behind the scenes, Naturalize uses natural language processing techniques, such as the n-gram analysis discussed above, to suggest more natural names for identifiers, which are the one place where developers can get creative, sometimes hurting overall readability or departing from project conventions. The tool can be integrated into the development pipeline, for example as a pre-commit hook, or used during development as an Eclipse plugin:

Naturalize Eclipse plugin at work
Figure 3. Naturalize Eclipse plugin at work.

As can be seen here, each is not considered a very descriptive or convention-conforming name, so testClass is suggested as a better alternative.
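The scoring behind such a suggestion can again be pictured with n-grams. In the sketch below, whose bigram counts and names we invented for illustration, a candidate identifier is scored by how often it appears between its surrounding tokens; Naturalize's actual model is considerably more sophisticated:

from collections import Counter

def score_name(candidate, before, after, bigrams):
    # A name fits the conventions if the bigrams it forms with its
    # neighbors are common in the project's code base.
    return bigrams[(before, candidate)] + bigrams[(candidate, after)]

# Hypothetical bigram counts learned from the project's sources:
bigrams = Counter({("Class", "testClass"): 12, ("testClass", "="): 12,
                   ("Class", "each"): 1, ("each", "="): 1})

for name in ("each", "testClass"):
    print(name, score_name(name, "Class", "=", bigrams))
# testClass scores higher, so it would be suggested over each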


Natural Language Processing has thus moved beyond the "natural language" line and increasingly into "machine learning" or "artificial intelligence" territory, serving a wider scope of purposes such as static code analysis, bug finding and, potentially, vulnerability finding. In the future, we are likely to encounter applications of NLP in the least expected places.

References

  1. S. Wang, D. Chollak, D. Movshovitz-Attias, and L. Tan. Bugram: bug detection with n-gram language models. ASE 2016.

  2. M. Allamanis, E. Barr, C. Bird, and C. Sutton. Learning natural coding conventions. arXiv.



Rafael Ballestas

Mathematician with an itch for CS


