Young hacker smiling

We hack your software

zero false positives

Expert intelligence + Specialized technology
DXST - SAST - IAST - SCA - DevSecOps
White Box - Gray Box - Black Box
Attacking Web Applications, APIs, Mobile Apps
Client-Server, Servers, Networks, IoT Devices
ICS: Industrial Control System

Python proofreading a document

Pars orationis non est secura

Using parser combinators to detect flaws
How to use Python and pyparsing to parse application code in order to find SQL injections in bWAPP. A SQL injection is defined as a SQL query with concatenations to PHP variables or function calls. The parser detects most of the SQL injections despite its limitations, and no false positives.

We like bWAPP around here, because it’s very buggy!. We have shown here how to find and exploit vulnerabilities like SQL injection, directory traversal, XPath injection, and UNIX command injection. All of these have one thing in common, namely: they could have been prevented with a little Input Validation. In PHP this could have been done with functions such as fgetss, str_replace or strpos.

Taking some ideas from static code analysis and the code-as-data approach, what if we could use some sort of code or syntax analysis tool in order to gain intelligence about where an apps’s weaknesses lie? That’s what parsers are for.

Manual detection

Let us consider, for example, this site in our favorite buggy web app:

Add entry to blog screenshot
Figure 1. Adding an entry to the “blog.”

Every time we load the page, the current entries in the blog are SELECTed from a MySQL database.

The page itself is pretty simple: a header, menu, sidebar and footer which are pretty much uniform across the site, with the real stuff inside a <div> of its own. The source code for such a page is like this:

Adapted from bWAPP. Braces and loads of lines removed.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
<?php
include("functions_external.php"); // more includes
$entry = ""; // more variables
function sqli($data)
  $data = sqli_check_1($data); ... return $data;
?>
<html>...<header><h1>bWAPP</h1> <h2>an extremely buggy web app !</h2> </header>
<div id="menu">  ... </div> // other divs
<div id="main">
  <h1>SQL Injection - Stored (Blog)</h1>
  <form action="<?php echo($_SERVER["SCRIPT_NAME"]);?>" method="POST">
      <?php
        $entry = sqli($_POST["entry"]);
        $owner = $_SESSION["login"];
        $sql = "INSERT INTO blog (date, entry, owner) VALUES (now(),'" . $entry . "','" . $owner . "')";
        $recordset = $link->query($sql);
      ?>
  </form>
  <?php
    $sql = "SELECT * FROM blog";
    $recordset = $link->query($sql);
    while($row = $recordset->fetch_object())
       <tr height="40">
            <td align="center"> <?php echo $row->id; ?></td>
            <td><?php echo $row->owner; ?></td>
            <td><?php echo $row->date; ?></td>
            <td><?php echo xss_check_3($row->entry); ?></td>
        </tr>
  ?>
</div>
<div id="side">...</div> // other divs
</body></html>

Of course, we’re mainly interested in the PHP and HTML mixed in the <div id="main">, which is just what we cropped here, because that’s where the SQL is. Looking at a few more sources, we see we always exploit the same weakness:

An SQL query is made up by concatenating literal values, PHP function calls and PHP variables like $entry above. That variable comes from a POST request and passed through the sanitizing function SQLi. After concatenating and building the query, it is sent to the database for processing.

Thus we could type

a','b'; drop table blog;--

into the entry field to turn the query into a dangerous, blog-deleting one.

In order to successfully identify these SQL injections, we need to look for strings which contain SQL code, and also use the PHP concatenation (first . last). That’s not enough, because we also need to relate the concatenated variable with the input or parameter where we are going to place the malicious SQL code.

To hunt SQL injections in bWAPP, our tool of choice will be a set of parsers, i.e., a piece of software used to scan a string or file to look for parts that conform to a specific set of rules.

Specifying the targets

Before going into parsing and grammar issues, let us first reflect about what we want to find. We want to detect pieces of text in the code that conform to the syntax of an SQL SELECT or INSERT statement. But also they must have concatenations, because a simple statement like

SELECT * FROM blog;

is perfectly safe. Where could we possibly inject anything?

So we need SELECT or INSERT with concatenations. Also, we want our tool to be able to identify

  • which variable is at risk,

  • where and how it is defined,

  • whether or not it is protected by some function, and

  • everything else we can find out about it, really.

The INSERT syntax, according to MySQL, goes as follows:

INSERT [LOW_PRIORITY | DELAYED | HIGH_PRIORITY] [IGNORE]
    [INTO] tbl_name
    [PARTITION (partition_name [, partition_name] ...)]
    [(col_name [, col_name] ...)]
    {VALUES | VALUE} (value_list) [, (value_list)] ...
    [ON DUPLICATE KEY UPDATE assignment_list]

This works as an excuse for us to introduce the Backus-Naur form (BNF) language syntax notation. Essentially, inside square brackets is [optional], the pipe stands for “or” and means optionally repeat.

For our purposes, the INSERT statement has this form:

INSERT INTO tbl_name [column-names] VALUES (value-list)

We’ll use subparsers to define what each of these elements mean. For example tbl_name is any SQL identifier, which can consist of alphanumeric symbols and underscores (_). And value-list is either a single value or a comma-separated list of values.

But what is a value? That’s where things get interesting. Consider this rich example:

INSERT INTO table VALUES ('1', NOW(), '" . $var1 . "','" . clean($var2, 1) . "'

A value can thus be:

  • a fixed number or string,

  • a MySQL function like NOW(),

  • a concatenation of a string or number obtained from

    • a PHP variable ($var1) or

    • a PHP function (clean()), which may also take arguments.

This is where parsers shine and the alternative approach, regular expressions, fail. Imagine trying to write a regex to match such an INSERT with concatenations. It would be humongous, not to mention very hard to understand. Other disadvantages of regular expressions are that they have to deal with white space explicitly and are hard to maintain when there are any changes to the language syntax. As the famous saying goes:

Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems.
— Jamie Zawinski

Learning the parser-tongue

Now that we’ve opted out of regular expressions, we have to choose a parser. Apparently, the standard for a time used to be the combination of lex and yacc, which was pretty complicated. Nowadays we have two solid choices: ANTLR for the Java people, and pyparsing for us.

Comic about pyparsing and Harry Potter
Figure 2. You don’t need to be a wizard to use pyparsing!

Some nice features about pyparsing:

  • uses a simple syntax that makes the grammar transparent

  • fits well in your Python code,

  • uses standard class constructs and plain language instead of cryptic odd symbols.

  • is tolerant to change and easy to adapt to different input or targets to match,

  • includes a few nice helper functions, like parsing actions (v.g. convert a string of digits to an actual integer)

In pyparsing, the outermost parser for the INSERT above translates to:

sql_insert = CaselessKeyword("INSERT INTO") + sql_identifier + Optional(column_names)
       + CaselessKeyword("VALUES") + values

The functions in SentenceCase are built into pyparsing, and their names are pretty self-explanatory. The + operator is overloaded to mean "followed by" and is insensitive to white space. The snake_case identifiers are just identifiers and we have to define them:

sql_identifier = Word(alphanums + "_")
values = Group( Literal("(") + delimitedList(value) + Literal(")") ).setResultsName("values")

Take Word to mean any combination of the given characters. Thus sql_identifier is just a combination of alphanumeric characters and the underscore. values is just a delimited list of values, enclosed in parentheses. We Group that list into a single entity so that we may refer to it by name later.

PHP identifiers are like SQL names, but must start with the symbol $. We also define function calls:

php_identifier = Combine( Literal("$") + sql_identifier ).setResultsName("php identifier")
php_funcall = Combine( sql_identifier + Literal("(") +
             + Optional(delimitedList(php_identifier)) + Literal(")") )

Unlike Group, Combine squashes all matched tokens into one. We do that because we don’t really care about every single part of a function call, only the php_identifiers inside, and we can access that by the name with which we baptized PHP identifiers above. Note that, although sql_identifiers don’t belong here logically, they have the same syntax as function names and the tails of identifiers, so we might as well reuse that parser.

Finally, we get to the heart of the matter: a value to be inserted is either a literal word or number, the result of a function, or a dangerous concatenation:

value = varchar ^ php_funcall ^ danger_concat
danger = ( ... + (php_identifier ^ php_funcall) + ... ).setResultsName("danger identifier")

Here ^ is the logical connector or, and we’ve omitted a bunch of Literal parsers for all the quotes and dots. By the way, notice that all these named parts of our big parser are parsers themselves, and we can use them on their own.

One way to use a parser is the parseString method. This will return the structure of tokens, if it is a match, or throw a ParseException if not.

>>> print(test_values)
('1', now(), '" . $var1 "')
>>> result = values.parseString(test_values)
>>> print(result)
[['1', 'now()', '$var', '3']]
>>> print(result["values"]["danger identifier"])
['$var']
>>> print(values.parseString("not a list of values")
...
pyparsing.ParseException: Expected "(" (at char 0), (line:1, col:1)

Notice how we refer to the list of returned values as result["values"], and from that we can get the identifiers which take part in the dangerous concatenation.

The function scanString looks for substrings that match the grammar. Quite useful. It also tells you where the substring was found. We use it to tell the user the line and column where the potential SQL injection was found:

How to use pyparsing.scanString
1
2
3
4
5
for tokens, start, end in sql_injection.scanString(content):
    sqli_line = line(start, content)
    print("In file {0}, line {1}, col {2}:\n{3:^}"
          .format(path.split("/")[-1], lineno(start,content),
          col(start,content), sqli_line.strip() ))

These are just some of the pyparsing built-in helper functions mentioned earlier: scanString returns an iterator which gives tokens, just like parseString(), but also starting and ending characters. To convert them to line and column numbers, we use the functions lineno() and colno(), respectively.

Thus far, this is nothing out of this world: nothing we couldn’t have done with grep or Python regular expressions (re). Where parsers actually beat regular expressions is in extracting information and structure from the input, as we did above to identify the inserted values and from those, which are the variables where we can inject SQL. For that, sadly, we need to parse again because we don’t know beforehand whether the inserted value is a function call or a PHP identifier:

injectable_variables = tokens["values"]
for injectable_variable in injectable_variables:
    res = (php_identifier ^ php_funcall).parseString(injectable_variable)
    injectable_variable = res["phpvar"]
    print(" Injectable variable {0}. Other occurrences:".format(injectable_variable))

Remember we need to detect lines with SQL queries that contain dangerously concatenated variables, but also where those variables are taken from user input and whether they are protected. But since we already have the “injectable variable” as a regular string, we can create yet another parser on-the-fly to find the lines where that variable is mentioned. This one is simple:

tpar = Literal(injectable_variable)
for tokens, start, end in tpar.scanString(content):
    print("  L{0:<3} {1}".format(lineno(start2, content),
                                 line(start2,content).strip()))

At any point, if a ParseException or KeyError is thrown, that just means a parseString failed, or we tried to access an element that wasn’t previously parsed. In plain language, it means that particular query was not really dangerous.

Finally, we run this code for every PHP file in the bWAPP server root. The output we get is very long (see the full report) Here is part of it:

In file sqli_4.php, line 131, col 17:
$sql = "SELECT * FROM movies WHERE title = '" . sqli($title) . "'";
 Injectable variable $title. Other ocurrences:
  L129 $title = $_REQUEST["title"];
  L131 $sql = "SELECT * FROM movies WHERE title = '" . sqli($title) . "'";
Found 1 SQL injection in bWAPP/sqli_4.php.
...
In file sqli_1.php, line 143, col 13:
$sql = "SELECT * FROM movies WHERE title LIKE '%" . sqli($title) . "%'";
 Injectable variable $title. Other ocurrences:
  L141 $title = $_GET["title"];
  L143 $sql = "SELECT * FROM movies WHERE title LIKE '%" . sqli($title) . "%'";
Found 1 SQL injection in bWAPP/sqli_1.php.
...
In file xss_stored_1.php, line 253, col 31:
$sql = "SELECT * FROM blog WHERE owner = '" . $_SESSION["login"] . "'";
 No dangerous concatenations in this query.
...
Total SQL injections found: 56

Boy, that’s a load of SQL injections! However, we do not expect every single one of these matches to be an actual SQL injection (false positives) and maybe some files contain effective SQL injections have escaped our scrutiny.

To find the ratio of discovered vulnerabilities to existing ones (the yield), let us use the Netsparker report on bWAPP as a reference. This report lists 57 SQL injections. Compared to our 56, that gives us a yield of 98%. Not too shabby for our simple parser. Hence the escapes, the ratio of what we did not find to the number of existing vulnerabilities, is 2% in this case.

These numbers should be taken with a grain of salt, though, since some of these vulnerabilities may be considered XSS or CSRF depending on the source. Despite that our criterion is clear: all those we found are SQL queries with concatenations, and all of those can be injected, however hard or easy.

Of course, there are bound to be some false positives. However, given the parser design, and checking the script output, we see that only really dangerous concatenations are reported. Some of those might be protected by good input sanitizers like mysqli_real_escape_string, but none is bullet-proof.

Thus we might say, with a statistically sound 95% confidence, that our pyparsing parser reports

zero false positives.

References

  • McGuire, Paul (2008). Getting started with pyparsing. O’Reilly shortcuts.

Appendix: Full SQLi parser

Download code and test cases. Run from the root of the tested PHP server.

sqli_parser.py
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
"""
Module sqli-parser.py
Module to detect SQL insertions and selections that include
PHP variables or function calls in concatenations,
which is considered dangerous. Also finds other occurrences
of the dangerously concatenated variable in the same file.
"""

import os
from pyparsing import (Combine, Literal, CaselessKeyword, Word, alphanums,
                       delimitedList, Group, Optional, ZeroOrMore, line,
                       lineno, col, ParseException)

KW_INSERT = Combine(Literal('"') + CaselessKeyword("INSERT INTO"))
KW_VALUES = CaselessKeyword("VALUES")
KW_END = Literal('"')
KW_SELECT = CaselessKeyword("SELECT")
KW_FROM = CaselessKeyword("FROM")
KW_WHERE = CaselessKeyword("WHERE")
# $var_name_3
PHP_IDENTIFIER = Combine(Literal('$')
                         + Word(alphanums+"_")).setResultsName("phpvar")
# db2_table_1, column_1
SQL_IDENTIFIER = Word(alphanums + "_")
COLUMN_NAMES = Literal("(") + delimitedList(SQL_IDENTIFIER) + Literal(")")
VARCHAR = Literal("'").suppress() + Word(alphanums) + Literal("'").suppress()
# now(), mysqli_real_escape_string($link, $secret)
PHP_FUNCALL = Combine(Word(alphanums + "_") + Literal("(")
                      + Optional(delimitedList(PHP_IDENTIFIER)) + Literal(")"))
# '" . validate($user_input) . "' or LIKE '%" . $var . "%'
DANGER_CONCAT = (Literal("'").suppress() + Optional("%")
                 + Literal('"').suppress() + Literal(".").suppress()
                 + (PHP_IDENTIFIER | PHP_FUNCALL)
                 + Literal(".").suppress() + Literal('"').suppress()
                 + Optional("%") + Literal("'").suppress()
                ).setResultsName("danger identifier")
VALUE = VARCHAR ^ PHP_FUNCALL ^ DANGER_CONCAT
VALUES = Group(Literal("(").suppress() + delimitedList(VALUE)
               + Literal(")").suppress()).setResultsName("values")
WHAT_TO_SELECT = Literal("*") ^ delimitedList(SQL_IDENTIFIER)
# name='" . $name . "'
COND_COL_EQ_VAL = SQL_IDENTIFIER + Literal("=") + VALUE
# name LIKE '%".$name."%'
COND_COL_LIKE_VAL = SQL_IDENTIFIER + Literal("LIKE") + DANGER_CONCAT
ATOM_COND = COND_COL_EQ_VAL | COND_COL_LIKE_VAL
CONNECTOR = CaselessKeyword("OR") ^ CaselessKeyword("AND")
# col_user = '" . $user . "' AND password = '" . hash($password) . "'
CONDITION = Group(ATOM_COND + ZeroOrMore(CONNECTOR + ATOM_COND)
                 ).setResultsName("values")
# "INSERT INTO blog VALUES (now(), 'a', '" . $inj1 . "','" . $inj2 . "')";
SQL_INSERT = (KW_INSERT + SQL_IDENTIFIER + Optional(COLUMN_NAMES)
              + KW_VALUES + VALUES + KW_END)
# "SELECT * FROM ta_ble WHERE name = '" . $name . "'";
SQL_SELECT = (KW_SELECT + WHAT_TO_SELECT + KW_FROM
              + SQL_IDENTIFIER + Optional(KW_WHERE + CONDITION))
SQL_INJECTION = SQL_INSERT ^ SQL_SELECT

def scan_test_file(path='test-cases.lst'):
    """ Scan an entire file to test for occurrences of SQL_INJECTION parser,
    determine what the injectable variables are and parse again looking for
    other occurrences of that variable."""
    with open(path) as test_file:
        content = test_file.read()
        print("\nScanning file {}".format(path))
        queries = 0
        for tokens, start, _end in SQL_INJECTION.scanString(content):
            sqli_line = line(start, content)
            print("In file {0}, line {1}, col {2}:\n{3:^}"
                  .format(path.split("/")[-1], lineno(start, content),
                          col(start, content), sqli_line.strip()))
            try:
                injectable_variables = tokens["values"]
                for injectable_variable in injectable_variables:
                    try:
                        # Re-parsing is ugly but necessary since a PHP_FUNCALL
                        # might not contain any PHP_IDENTIFIERs.
                        res = (PHP_IDENTIFIER ^ PHP_FUNCALL
                              ).parseString(injectable_variable)
                        injectable_variable = res["phpvar"]
                        print(" Injectable variable {0}. Other ocurrences:"
                              .format(injectable_variable))
                        # Make a parser on-the-fly to detect other
                        # occurrences of the injectable variable
                        tpar = Literal(injectable_variable)
                        for _tokens2, start2, _end2 in tpar.scanString(content):
                            print("  L{0:<3} {1}"
                                  .format(lineno(start2, content),
                                          line(start2, content).strip()))
                        queries += + 1
                    except (ParseException, KeyError):
                        # If parse fails or key "phpvar" does not exist,
                        # there is simply no danger variable.
                        pass
            except (ParseException, KeyError): # Same as above
                print(" No dangerous concatenations in this query.")
        print("Found {0} SQL injections in {1}.".format(queries, path))
        return queries

def test_directory():
    """ Iterate scan_test_file() for all source files in a directory
    and report total number of SQL injection vulnerabilities."""
    path = "./"
    vulns = 0
    for filename in sorted(os.listdir(path)):
        if filename.split('.')[-1] == "php":
            vulns += scan_test_file(path + filename)
    print("\nTotal SQL injections found: {}".format(vulns))

scan_test_file()
#test_directory()

Author picture

Rafael Ballestas

Mathematician

with an itch for CS



Related




Service status - Terms of Use