Markup languages are "systems for annotating a document in a way that is syntactically distinguishable from the text." What does that really mean? I reckon it is better understood with examples. But first, a warning: if you use them to store sensitive information, you should be really careful about how they are manipulated.
Perhaps the best example is the ubiquitous
HTML, the language of the
web. When you visit a webpage, you download a file with plain text
mixed with a bunch of 'tags' which make the text look the way it does
when rendered in a web browser. The tags are used to define
page structure (such as division into sections),
general page style, and
the inclusion of media in the page (images, videos, etc.).
A tag looks like this:
<h1> some big text </h1> and it would be
rendered like a title in your browser. If you’re interested in learning
HTML check out
Figure 1. Some tags and their results.
There are other markup languages for different purposes.
A trait of all these markup languages is that their main goal should be to explicitly state the structure and hierarchy of a document, separating content from appearance.
When a markup language is not enough
Now, some clever guys liked the 'tags' and structure idea. But not so
much the restricted set of tags in
HTML or the specific purposes of
others. So they took it upon themselves to design a markup language they
could use for 'anything'. And thus was born the
"eXtensible Markup Language" (XML).
You can really use it for anything. For example:
Office suites like LibreOffice use it in their document formats.
Vector images (the ones you can zoom in on indefinitely without pixelating them).
Atom feeds, which are ways of keeping up to date with a website without going there, are XML-based too.
But you can also use it as your own on-the-fly file format or information exchange protocol. Say, if you want to exchange a person’s information with someone else, you could do it like this:
H James Wyatt 65789 37498 1101 1014 W Broadway
But then what is what? Among all those numbers, which one is the post code, which the street number, and which the phone?
OK, you might say you can just agree upon the order of the columns. And that would work for a while, but it would be difficult to maintain, not to mention messy.
What if we could do it like
HTML, with some new tags?
<people>
  <person>
    <name> Wyatt </name>
    <initial> H </initial>
    <last> James </last>
    <home> 37498 </home>
    <mobile> 65789 </mobile>
    <address type="US">
      <street> W Broadway </street>
      <number> 1101 </number>
      <postcode> 1014 </postcode>
    </address>
  </person>
  <person> ... </person>
</people>
OK, maybe that is a little verbose, but it does have structure, it is
readable even for a person who does not know the format, and it has the
advantage of being machine-readable. Your website can easily read
XML files with a few lines of code.
XML has rapidly become a
web standard, even a
W3C recommendation, because it makes it easy to
share data in a structured way.
Given a structure like the one above, you can think of such an
XML document as a 'tree' made up of 'nodes'. One way a program can read from
XML is by using this tree-like structure to navigate it.
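To make the tree idea concrete, here is a minimal Python sketch that parses the person record from above and walks it node by node. The choice of Python and its standard `xml.etree.ElementTree` module is an assumption of this sketch, not something the post prescribes, and the `walk` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# The person record from the example above (whitespace inside tags trimmed).
PEOPLE_XML = """
<people>
  <person>
    <name>Wyatt</name>
    <initial>H</initial>
    <last>James</last>
    <home>37498</home>
    <mobile>65789</mobile>
    <address type="US">
      <street>W Broadway</street>
      <number>1101</number>
      <postcode>1014</postcode>
    </address>
  </person>
</people>
""".strip()


def walk(node, depth=0):
    """Return one indented line per node, visiting the tree depth first."""
    text = (node.text or "").strip()
    lines = ["  " * depth + node.tag + (": " + text if text else "")]
    for child in node:  # iterating an element yields its direct children
        lines.extend(walk(child, depth + 1))
    return lines


root = ET.fromstring(PEOPLE_XML)
outline = walk(root)
print("\n".join(outline))
```

Each level of indentation in the printed outline corresponds to one step down the tree, which is the same hierarchy a program follows when it navigates the document.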
Suppose we have two more people in our file. You could access the streets where all of them live with a path expression such as /people/person/address/street.
These "queries", which are not unlike SQL queries, are part of the XPath language.
What they return is an ordered list: you can access the individual
streets by their position or by asking questions about them (select the
people who live on Broadway). These are called 'predicates'. For example, an expression like /people/person/address[number>1000]/street
selects all street names from people whose address number is larger than 1000.
You can even do math with the results of your queries. You can mix and match those queries with logical operators, and you can even use wildcards and refer to nodes by their position relative to other nodes in the tree.
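The queries and predicates described above can be sketched in Python with the standard `xml.etree.ElementTree` module. Two assumptions to keep in mind: the two extra people (Ann and Bob) are made up for illustration, and ElementTree only implements a subset of XPath, so the numeric comparison is done in plain Python rather than inside the path expression.

```python
import xml.etree.ElementTree as ET

# First person is from the example above; Ann and Bob are hypothetical.
PEOPLE_XML = """<people>
  <person><name>Wyatt</name>
    <address type="US"><street>W Broadway</street><number>1101</number></address>
  </person>
  <person><name>Ann</name>
    <address type="US"><street>Elm St</street><number>12</number></address>
  </person>
  <person><name>Bob</name>
    <address type="US"><street>Main St</street><number>2040</number></address>
  </person>
</people>"""

root = ET.fromstring(PEOPLE_XML)  # root is the <people> element

# Path query: every street, returned in document order.
streets = [s.text for s in root.findall("./person/address/street")]

# Positional access: the street of the first person only (positions start at 1).
first_street = [s.text for s in root.findall("./person[1]/address/street")]

# Predicate on a child's text: addresses whose <street> is 'W Broadway'.
on_broadway = root.findall("./person/address[street='W Broadway']")

# Numeric predicate, filtered in Python because ElementTree's XPath subset
# has no '>' comparison.
over_1000 = [
    a.findtext("street")
    for a in root.findall("./person/address")
    if int(a.findtext("number")) > 1000
]

print(streets, first_street, over_1000)
```

A full XPath engine would let you write the last filter as a predicate inside the expression itself; the list comprehension is only a stand-in for that.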
It gets better: you don’t need to learn a dedicated language to run
these queries. These kinds of queries can be made, as with SQL, from
pretty much any programming language. But even this apparently good
neutrality has its dark side: being implementation independent also
implies that attacks could be automated.
What? Attacks? Like databases,
XML files can be a useful tool for
storing and sharing data, but they can also be made into an attack
surface by malicious users. They can take advantage of a website that
uses XPath in order to inject malicious queries which may do something
as innocent as listing the entire file or as harmful as deleting the
files and even elevating their privileges on the website.
XPath injections are particularly dangerous when
XML files are used to store
passwords, authentication details or other sensitive information.
Injecting XPath into a vulnerable app
bWAPP? It’s vulnerable to
XPath injection, too! Here we have a website where superheroes can
log in. Assume we don’t know that this authentication uses
XML. If we
try normal text or empty fields, we just get "invalid credentials" as
a response. But we do know that the site is
PHP-based, and in that
language strings can be single (') or double (") quoted. If we try
just a quote character, we get the following response:
Figure 2. Login form response to testing query
The important bit is what is hiding behind the bee:
Warning: SimpleXMLElement::xpath(): Invalid predicate in /app/xmli_1.php on line 78
Warning: SimpleXMLElement::xpath(): xmlXPathEval: evaluation failed in /app/xmli_1.php on line 78
So now we know they are using the
xpath() function to run an
XPath query on
XML data. Since we don’t know the structure of the
file, we may never know the exact
XPath, but we may guess that it ends with something like
login='<input1>' and password='<input2>'
Thus if we type anything like
x' to close the quote, and append
or 'a'='a, then the expression evaluates to true. Let’s do that in both
the login and password fields, so that the end of the expression becomes:
login='x' or 'a'='a' and password='x' or 'a'='a'
In XPath, and binds more tightly than or, so this reads as
login='x' or ('a'='a' and password='x') or 'a'='a'.
The last 'a'='a' comparison is always true, so the whole expression
is true for every node. In
that case the
XPath will select all entries in the tree. However, the
page is designed to give this response to a successful login:
Welcome Neo, how are you today? Your secret: Oh why didn't I took that BLACK pill?
So Neo must be the first node in the
XML authentication file tree.
We now know they are using
XML for authentication because of the two
injections: the one that only produced warnings and the one that logged us in.
The source of the problem
This is the actual line that runs the XPath query:
$result = $xml->xpath("/heroes/hero[login='" . $login . "' and password='" . $password . "']");
And in effect, the
XML file has a structure like this:
<heroes>
  <hero>
    <id>1</id>
    <login>neo</login>
    <password>trinity</password>
    <secret>Oh why didn't I took that BLACK pill?</secret>
    <movie>The Matrix</movie>
    <genre>action sci-fi</genre>
  </hero>
  <hero> ... </hero>
</heroes>
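To see why the concatenation lets everyone in, here is a rough Python simulation. It is not the application's code: `build_predicate` mirrors the string concatenation of the PHP line, while `holds` is a toy evaluator written only for this sketch that understands quoted literals, child element names, `=`, `and` and `or`. The second hero and its values are hypothetical placeholders, since the post elides the rest of the file.

```python
import xml.etree.ElementTree as ET

# neo/trinity come from the file shown above; the second hero is hypothetical.
HEROES_XML = """<heroes>
  <hero><login>neo</login><password>trinity</password></hero>
  <hero><login>hypothetical_hero</login><password>hypothetical_pw</password></hero>
</heroes>"""


def build_predicate(login, password):
    """Concatenate user input into the predicate, as the vulnerable line does."""
    return "login='" + login + "' and password='" + password + "'"


def term_value(hero, term):
    """A quoted term is a string literal; anything else names a child element."""
    term = term.strip()
    if len(term) >= 2 and term[0] == "'" and term[-1] == "'":
        return term[1:-1]
    return hero.findtext(term, default="")


def holds(hero, predicate):
    """Toy evaluation: 'and' binds tighter than 'or', as in XPath."""
    def comparison(text):
        left, right = text.split("=", 1)
        return term_value(hero, left) == term_value(hero, right)

    return any(
        all(comparison(c) for c in conjunction.split(" and "))
        for conjunction in predicate.split(" or ")
    )


def select(root, login, password):
    """Return the logins of every hero the predicate selects."""
    predicate = build_predicate(login, password)
    return [h.findtext("login") for h in root.iter("hero") if holds(h, predicate)]


root = ET.fromstring(HEROES_XML)
payload = "x' or 'a'='a"
print(build_predicate(payload, payload))
print(select(root, "neo", "trinity"))   # honest login
print(select(root, payload, payload))   # injected login
```

With honest input the predicate matches one hero at most. With the payload, the trailing `'a'='a'` makes it true for every hero, and the page simply greets the first one in the list.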
It’s generally not a good idea to store users and passwords (and in this
case "secrets") in plain text files, even with the XML structure.
And it’s even worse to use them to check authentications, especially with
XML files since, as we’ve just shown, they can be vulnerable to the
XPath injection attack.
This goes to show once more the importance of input validation: never take input from users as-is, because then you’re opening a window attackers will try to get in through.
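As one possible illustration of input validation, the Python sketch below checks the login against an allowlist and never places user input inside a query string: the comparison happens in application code instead. The pattern, the function names and the sample data are assumptions of this sketch, and it keeps the plain-text XML file only to mirror the example, not because that is a good way to store credentials.

```python
import hmac
import re
import xml.etree.ElementTree as ET

# Hypothetical allowlist: letters, digits and a few separators, 1 to 32 chars.
LOGIN_PATTERN = re.compile(r"[A-Za-z0-9_.-]{1,32}")

# Sample data based on the file shown in the post.
HEROES_XML = """<heroes>
  <hero><login>neo</login><password>trinity</password></hero>
</heroes>"""


def same(a, b):
    """Compare two strings without leaking where they first differ."""
    return hmac.compare_digest(a.encode("utf-8"), b.encode("utf-8"))


def authenticate(root, login, password):
    """Return the matching login, or None. User input never reaches a query."""
    if LOGIN_PATTERN.fullmatch(login) is None:
        return None  # reject quotes, spaces and anything else off the list
    for hero in root.iter("hero"):
        if same(hero.findtext("login", default=""), login) and same(
            hero.findtext("password", default=""), password
        ):
            return hero.findtext("login")
    return None


root = ET.fromstring(HEROES_XML)
payload = "x' or 'a'='a"
print(authenticate(root, "neo", "trinity"))
print(authenticate(root, payload, payload))
```

Because the payload is treated as plain data, it either fails the allowlist or simply fails to equal any stored value.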