Natural code: Natural language processing for code security


Rafael Ballestas

Security analyst

Updated

Jul 26, 2019



5 min

Our return to the Machine Learning (ML) for secure code series is a bit of a digression, but one too interesting to resist. It is not too far a digression, though, because Natural Language Processing (NLP) is also part of what is currently considered Artificial Intelligence. And, as we will see in this article, it has great potential for applications in information security.

Virtually every cell phone currently in use has a predictive keyboard. Besides completing words from their first few letters, these keyboards can also suggest entire words after you have written some. Some of these combinations just make sense because they are used more frequently in common phrasing. Certainly, "peanut" is more likely to be followed by "butter" than by "wrench". Extending that idea to more words, "peanut butter and jelly" is definitely more likely to be followed by "sandwich" than by "salad". The same holds true for "star" followed by "trek", as seen in this demo for the Android Predictive Keyboard:

An n-gram based predictive keyboard at work.

This is the basic idea behind n-gram analysis, a technique we have mentioned before in passing. It has been applied to a couple of the ML-powered vulnerability detectors we have discussed, most notably by the binary static analysis tool VDiscover.

An n-gram is simply a sequence of n consecutive words occurring in a piece of real text, which we use as a basis for training. This text is called a corpus in the Natural Language Processing context. This training essentially consists of:

Extracting all the possible n-grams in the corpus taking punctuation into account, so that "now. But before" will not be considered a valid 3-gram.
Counting the occurrence of each n-gram vs the total, i.e., finding the relative frequency of each.

That’s it! Now, if we see "peanut butter and jelly", we look at all the 5-grams that begin with this 4-gram and see which one has the highest relative frequency. Suppose "peanut butter and jelly sandwich" occurs the most in our training corpus. Then the first word suggested after the given four is, of course, "sandwich" rather than "wrench".
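The two training steps and the lookup can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the corpus is a made-up toy, the function names (sentences, train, suggest) are ours, and sentence punctuation is handled by splitting the text before extracting n-grams, so that no n-gram crosses a sentence boundary.

```python
import re
from collections import Counter, defaultdict


def sentences(text):
    """Split text at sentence punctuation so n-grams never cross it."""
    for chunk in re.split(r"[.!?;:]+", text.lower()):
        words = re.findall(r"[a-z']+", chunk)
        if words:
            yield words


def train(corpus, n):
    """Count, for every (n-1)-word context, the words that follow it."""
    followers = defaultdict(Counter)
    for words in sentences(corpus):
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            followers[context][words[i + n - 1]] += 1
    return followers


def suggest(followers, context, k=3):
    """Return up to k next words with their relative frequencies."""
    counts = followers.get(tuple(context.lower().split()), Counter())
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(k)]


# Toy corpus, made up for this example.
CORPUS = (
    "I had a peanut butter and jelly sandwich for lunch. "
    "She packed a peanut butter and jelly sandwich. "
    "He spilled the peanut butter and jelly jar now. But before that it was full."
)

model = train(CORPUS, 5)
print(suggest(model, "peanut butter and jelly"))
```

On this toy corpus, "sandwich" follows the 4-gram twice and "jar" once, so "sandwich" is suggested first with a relative frequency of 2/3. A real keyboard would train on a far larger corpus and fall back to shorter contexts when a 4-word context has never been seen.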

If the corpus is good enough regarding the context in which such words appear, then the suggestions should be just as good. The quality of the results, and hence the accuracy of our predictor, is highly dependent on the quality of the training corpus. Cell phone predictive keyboards exploit this fact by learning from your typing habits. Depending on who you are, "machine" might be more likely followed by "shop", "head" or "learning".

If all this can be done on natural language, which has all sorts of ambiguities, irregularities and mistakes in the training corpus, imagine what could be done if we applied the same idea to code, which is highly regular, ordered and syntactically strict. The possible applications are promising:

Automatically complete code, like the text above.
Find bugs in code via n-gram analysis.
Make code more natural by enforcing coding conventions, i.e. a special kind of linting.
Generate pseudo-code or documentation automatically.

Of course, all these applications require, as did the ones we presented previously, a useful representation of code that can be fed to machine learning algorithms. This comes as no surprise if you have been following our previous series. The methods chosen for this particular application are Abstract Syntax Trees and an adaptation of word2vec for code, aptly named code2vec.

With representation out of the way, let’s dive into the actual methods. The main idea behind bug finding via n-gram analysis is to decompose every function into n-grams that represent its elements, such as API calls, variable names, etc., and then compare them to one another for similarity. If we find rare (low-frequency) n-grams that are highly similar to common (high-frequency) ones, then the rare ones are probably buggy and worthy of further analysis. Take for example the following snippets from Apache Pig.

Snippets found by Bugram.

The above snippet is buggy due to the lack of toString. In fact, it is exactly the same as the other snippet, only without toString. Bugram suggested it as a possible bug because it was so similar to a commonly occurring snippet. The bug was reported to the Pig team and confirmed. In the test proposed by the paper, Bugram was able to find 42 confirmed bugs plus 17 false positives across 16 well-known open source Java projects such as Pig.
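The "rare but similar to common" idea can be sketched as follows. This is not Bugram's actual algorithm, only a minimal illustration of the heuristic as described here: the call sequences, method names and thresholds are all hypothetical, and similarity is measured with difflib.SequenceMatcher from the Python standard library.

```python
from collections import Counter
from difflib import SequenceMatcher


def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def suspicious(methods, n=3, rare_max=1, common_min=3, min_similarity=0.6):
    """Report rare n-grams that closely resemble a common n-gram."""
    counts = Counter(g for tokens in methods.values() for g in ngrams(tokens, n))
    common = [g for g, c in counts.items() if c >= common_min]
    findings = []
    for name, tokens in methods.items():
        for gram in ngrams(tokens, n):
            if counts[gram] > rare_max:
                continue  # seen often enough, nothing to report
            for ref in common:
                if SequenceMatcher(None, gram, ref).ratio() >= min_similarity:
                    findings.append((name, gram, ref))
                    break
    return findings


# Hypothetical call sequences, one list of API calls per method.
METHODS = {
    "load_a": ["getField", "toString", "equals", "add"],
    "load_b": ["getField", "toString", "equals", "add"],
    "load_c": ["getField", "toString", "equals", "add"],
    "load_d": ["getField", "equals", "add"],  # toString is missing here
    "close_all": ["flush", "close", "delete"],
}

findings = suspicious(METHODS)
for name, gram, ref in findings:
    print(f"{name}: {gram} looks like the common {ref}")
```

Only load_d is reported: its single 3-gram is rare yet shares two of three calls with a common sequence. The sequence in close_all is just as rare but resembles nothing common, so it is left alone.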

This approach, while simple and effective, is not without drawbacks. Namely, it cannot be targeted at security-related bugs or at any other specific kind of bug. The same authors later proposed an approach based on deep learning rather than n-grams, but again with the same aim of predicting software defects in general.

Another possible application of n-gram analysis that might indirectly contribute to writing more secure code follows the idea that "cleaner code leads to secure code". If a person’s writing style can be learned by n-gram analysis, the same can be true of a particular coder’s style, or even that of a whole software project. Take for example our very own Asserts closure checker engine. Not only do we stick to the Python guidelines when naming variables and methods, separating words by underscores, we also have a particular way of naming functions.

Sample function names from Asserts.

fluidasserts.proto.http.can_brute_force
fluidasserts.proto.http.has_dirlisting
fluidasserts.proto.smb.is_anonymous_enabled
fluidasserts.cloud.aws.iam.has_not_support_role

Do you see a tendency here? So did Naturalize, a project that tries to "learn natural coding conventions" in order to improve naming suggestions. The goal is to infer a good name for a function given its code. That is to say, if I know what it does, I should be able to know what its name is, assuming that the names are not entirely random or humorously unmaintainable.

Behind the scenes, Naturalize uses natural language processing techniques, such as n-gram analysis, to suggest more natural-sounding names for identifiers. Naming is the one place where developers can get creative, which may hurt overall readability or break project conventions. The package can be integrated into the development pipeline, for example as a pre-commit hook, or used during development as an Eclipse plugin.

Naturalize Eclipse plugin at work.

As can be seen here, the name "each" is not considered very descriptive or convention-conforming, so "testClass" is suggested as a better alternative.
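To get a feel for how naming conventions can be learned, here is a minimal sketch that trains a bigram model over the underscore-separated parts of the Asserts function names listed earlier and uses it to rank candidate names. It is not how Naturalize works internally, and it does not look at the function's code at all. The candidate names, the helper names and the add-one smoothing are our own assumptions for illustration.

```python
import math
from collections import Counter

START = "<s>"  # marks the beginning of a name


def subtokens(name):
    """Split a snake_case name into its parts, preceded by START."""
    return [START] + name.lower().split("_")


def train(names):
    """Count bigrams of name parts and collect the vocabulary size."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for name in names:
        tokens = subtokens(name)
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[prev, cur] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)


def naturalness(model, name):
    """Average log-probability per part, with add-one smoothing."""
    bigrams, contexts, vocab_size = model
    tokens = subtokens(name)
    logs = [
        math.log((bigrams[prev, cur] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    ]
    return sum(logs) / len(logs)


# Function names taken from the sample above (last path component only).
KNOWN = [
    "can_brute_force",
    "has_dirlisting",
    "is_anonymous_enabled",
    "has_not_support_role",
]

# Hypothetical candidates for a new check.
CANDIDATES = ["check_weak_cipher", "weak_cipher_check", "has_weak_cipher"]

model = train(KNOWN)
best = max(CANDIDATES, key=lambda name: naturalness(model, name))
print(best)
```

With only four training names the model has learned little more than "names tend to start with has, is or can", which is already enough for it to prefer has_weak_cipher over the other two candidates. A tool meant for real use would train on the whole code base and take the surrounding code into account.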

Natural Language Processing has moved beyond the "natural language" line and is moving increasingly into the "machine learning" or "artificial intelligence" arena. It will soon serve a wider range of purposes, such as static code analysis, bug finding and, potentially, vulnerability detection. In the future, we are likely to encounter applications of NLP in the least expected places.

References

S. Wang, D. Chollak, D. Movshovitz-Attias, and L. Tan. Bugram: Bug detection with N-gram Language Models. ASE 2016.
M. Allamanis, E. Barr, C. Bird, and C. Sutton. Learning Natural Coding Conventions. arXiv.


Tags:

machine-learning

vulnerability

code







