| 6 min read
``Most programs are too large to understand in complete detail''. This was written in the 80’s.[1] Imagine the situation today. Hence the need for automated tools to aid in the process of analyzing code. The solution, according to Oege de Moor from Semmle, is obvious: treat code 'as data'.
This idea is not really new: Already in 1983, Mark Linton discussed that programmers select 'views' to understand some 'aspects' of the code base. This is already database-talk. It was only natural to propose making a database out of the program structure, in order to be able to ask questions about the program’s behavior.
Mere text analysis is not enough to understand the way variables are used, the idioms and idiosyncrasies of the programming language, the efficiency and complexity of your code, etc. In other words, we want to understand the 'semantics' of your program, as well as its 'structure', not just the syntax.
The most common way to represent interrelated, structured data is via 'relational databases'. These are typically queried via the SQL
language, which features simple, almost english-like queries, but is very limited. As we’ll see, the database for a program’s source code will be even more so. Thus the query system must be as advanced, but not so much that it would hurt performance.
Databases out of programs
Languages like lisp
have an inherent 'code as data' mantra: every line of code is a piece of manipulable data and vice versa. Markup languages like XML
have their own `database'' of sorts, the `DOM
, which can be queried via XPath
.
But what about commonly used languages like Java
and Python
? To every program we can associate a tree, called the 'abstract syntax tree' (AST
) which sort of displays the syntactic structure of the code. Consider this Ada
snippet from Linton’s thesis:[1]
Sample Ada
code.
prevmax := max; if a > b then
max := a;
else
max := b;
end if;
Quite simple, but the same cannot be said about its AST
. We have three assignments (:=
), one if-then-else
conditional and five variables, all interrelated, and all that has to be shown on the tree. The AST
would be like this:
Adapted from Linton's original diagram.
Next, how do we make a database out of this? Neither Semmle
nor other players in the 'application intelligence' field such as CAST and Modern Systems tell us much about the process, since obviously the magic behind their products must be there, but let’s try to learn it ourselves from Linton’s ideas.
For the purposes of this article, it suffices to think of a database as a set of entities joined by some relations. Now, the relevant entities for source code analysis would be, as you expect, variables, conditionals, functions, but also statements, expressions and assignments. Further, some bits of code may actually be several kinds at the same time. So you get an idea of the complexity of such a database.
For the code snippet above, Linton shows the actual tables and relations cut to the most relevant parts. I’ll try to simplify that further by reducing it to an informal entity-relationship diagram. Here you can read unlabeled relations as `have' or `is' (generic relations).
ER diagram for a simple code snippet.
The most basic entity is the statement
, which is basically every line of code. It can be a conditional
, an assignment
or maybe a function call (fcalls
). An assignment
, in turn, sets the value of the left-hand-side variable to that of the right side. Variables
are related to the entity type
(v.g. int
, String
, etc) and name
.
This is only a subset of our already reduced model of the database. And, hey, it’s only a model, which barely begins to compare to compare with the actual database.
A typical 21st century Java
program takes around 77 tables (!), according to Oege de Moor. This justifies our previous claim that a powerful query language is needed to work with such a database. And the guys at Semmle
set out to do just that, and that is what differentiates them from competitors.
A query language to rule them all
The query language should be simple like SQL
, but more powerful, including Object-Oriented Programming (OOP
) constructs and be able to compose queries using logical connectors and quantifiers, like the logical programming language Prolog
. However, this would add too much overhead, so a smaller subset called Datalog
was chosen as the basis for the tailor-made QL
language.
This query language can be used to locate code defects. For example, in OOP
, when defining the notion of equality in a class, one usually needs to define a hash function, because object equality should imply hash equality. So, let’s look for classes that declare an equals()
method but not a hashCode()
method using QL
:
QL
example from [2].
from Class c
where c.declaresMethod("equals") and
not( c.declaresMethod("hashCode") ) and
c.fromSource()
select c.getPackage(), c
The clauses are similar to SQL
, but there are object-like constructs (Class c
) which have their own methods (c.declaresMethod()
) and the logical connectors work a bit differently from SQL
and have a larger scope. In QL
, one can:
-
define and use 'predicates' in queries (expressions that can be true or false depending on the parameters),
-
use logical quantifiers (for all, exists) in order to simplify aggregation and grouping (find the number of lines of code in a given package), which is complicated in
SQL
-
define generic queries that can be inherited and overridden, just like in
OOP
We cannot go further into the details of QL
here, but instead let’s focus on what we can do with it.
Applications
When you can ask questions about your code to an omniscient oracle, you can really bring the ``data age'' into your development flow.
You can use the 'code-as-data' approach to:
-
increase productivity by computing metrics about the development process,
-
ensure the following of coding standards and whichever development model your team has chosen,
-
objectively determine the quality of the code, and
-
find security bugs and vulnerabilities.
This is what interests us most. Semmle
maintains a public queries repository and a website with general rules that should be followed for some of the supported languages, namely, Java
, C
, Python
and some of their derivatives. Included are some security guidelines, with their corresponding CWE
. For example, we can detect XSS
in Java
with this query:
Java XSS
detection query.
import semmle.code.java.security.XSS
from XssSink sink, RemoteUserInput source
where source.flowsTo(sink)
select sink, "Cross-site scripting vulnerability due to $@.",
source, "user-provided value"
And it would detect this kind of vulnerable code, which does not properly validate user input:
XSS
-vulnerable Java
code. Via Semmle.
public class XSS extends HttpServlet {
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
// BAD: a request parameter is written directly to an error response page
response.sendError(HttpServletResponse.SC_NOT_FOUND,
"The page \"" ` request.getParameter("page") ` "\" was not found.");
}
}
No query for the vulnerability you’re testing for? That’s what the QL
language is for. You just write your own query.
To wrap this up with more spectacular examples, here is a query to find Heartbleed
-like vulnerabilities:
QL
to detect Heartbleed
.
from FunctionCall memcpy, Struct s, Field f, Field g, float perc
where f = s.getAField() and g = s.getAField() and
memcpy(memcpy, f) and
memcpy_usually_guarded(f, g, perc) and
not guarded_memcpy(memcpy, f, g) and
forall (Field gg, float pperc | memcpy_usually_guarded(f, gg, pperc) | pperc <= perc)
select memcpy, "memcpy from " ` s.toString() ` "::" ` f `
" is guarded by comparison against " ` s.toString() ` "::" ` g `
" in " ` perc ` "% of all cases, but not here."
Notice the universal quantifier (forall
) we mentioned earlier, and also that this is not the full query, since it is based upon predicates that have to be defined 'ad hoc' in addition to built-in ones. See the full query and a discussion at Semmle.
The Apache Struts
vulnerability CVE-2017-9805
— related but not to be confused with CVE-2017-5638
— the one that was exploited in the Equifax
breach, was found and announced by lgtm.com
. Through this service FOSS
projects can take advantage of Semmle’s technologies in application intelligence, as long as their repository is open on `GitHub
.
The basic idea is simple enough: look for deserialization of untrusted (i.e., user-controlled) data. In this particular case, we’re interested in the flow of data from a ContentTypeHandler
which gets the input to an unsafe deserialization method. The query text reflects just this idea:
from ContentTypeHandlerInput source, UnsafeDeserializationSink sink
where source.flowsTo(sink)
select source, sink
Again, this is not the full query. See the lgtm
blog entry on this discovery.
Semmle
has thus made into a reality what was deemed impossible time and again for 30 years: bring data analysis techniques and source code analysis together. This powerful combination has already paid off for users like NASA
and Google
, as well as countless FOSS
projects. Only the Pythia knows what the future of the code-as-data approach will bring.
References
Share
Recommended blog posts
You might be interested in the following related posts.
Our new testing architecture for software development
Be more secure by increasing trust in your software
How it works and how it improves your security posture
Sophisticated web-based attacks and proactive measures
The importance of API security in this app-driven world
Protecting your cloud-based apps from cyber threats
Details on this trend and related data privacy concerns