
By Rafael Ballestas | February 16, 2018
Markup languages are "systems for annotating a document in a way that is syntactically distinguishable from the text." [1] What does that really mean? I reckon that’d be better understood with examples. But first, a warning: if you use them to store sensitive information, you should be really careful about how they are manipulated.
Perhaps the best example is the ubiquitous HTML, the language of the web. When you visit a webpage, you download a plain-text file mixed with a bunch of 'tags' which make the text look the way it does when rendered in a web browser.
The tags are used to define:
page structure, like the division into sections
text formatting
general page style
the inclusion of media in the page (images, videos, etc).
A tag looks like this:
<h1> some big text </h1>
and it would be rendered as a title in your browser.
If you’re interested in learning more HTML, check out W3Schools.
There are other markup languages for different purposes as well.
A trait of all these markup languages is that their main goal should be to explicitly state the structure and hierarchy of a document, separating content from appearance.
Now, some clever guys liked the 'tags' and structure idea, but not so much the restricted set of tags in HTML or the specific purposes of the others. So they took it upon themselves to design a markup language they could use for 'anything'. And thus was born the 'eXtensible Markup Language' (XML).
You can really use it for anything. For example:
Office suites like LibreOffice use it in their document formats.
Vector images (the ones you can zoom in on indefinitely without pixelating them).
RSS and Atom feeds, which are ways of keeping up to date with a website without going there, are XML-based.
But you can also use it as your own on-the-fly file format or information exchange protocol. Say, if you want to exchange a person’s information with someone else, you could do it like this:
H James Wyatt 65789 37498 1101 1014 W Broadway
But then what is what? Among all those numbers, which one is the post code, which the street number, and which the phone?
OK, you might say you can just agree upon the order of the columns. And that would work for a while, but it would be difficult to maintain, not to mention messy.
What if we could do it like HTML, with some new tags?
<people>
  <person>
    <name> Wyatt </name>
    <initial> H </initial>
    <last> James </last>
    <home> 37498 </home>
    <mobile> 65789 </mobile>
    <address type="US">
      <street> W Broadway </street>
      <number> 1101 </number>
      <postcode> 1014 </postcode>
    </address>
  </person>
  <person>
    ...
  </person>
</people>
OK, maybe that is a little verbose, but it does have structure, it is readable even for a person who does not know the format, and it has the advantage of being machine-readable. Your website can easily read XML files with a few lines of JavaScript.
Thus, XML rapidly became a web standard, and even a W3C recommendation, thanks to how easy it makes sharing data in a structured way.
Given a structure like the one above, you can think of such an XML document as a 'tree' made up of 'nodes'. One way a program can read an XML document is by using this tree-like structure to navigate it.
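To make the 'tree of nodes' idea concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The choice of Python and the helper name walk are assumptions made for illustration; the XML is the first person from the example above, without the elided second entry.

```python
import xml.etree.ElementTree as ET

# The single <person> entry shown in the article (the elided one is left out).
PEOPLE_XML = """
<people>
  <person>
    <name> Wyatt </name>
    <initial> H </initial>
    <last> James </last>
    <home> 37498 </home>
    <mobile> 65789 </mobile>
    <address type="US">
      <street> W Broadway </street>
      <number> 1101 </number>
      <postcode> 1014 </postcode>
    </address>
  </person>
</people>
"""


def walk(node, depth=0):
    """Return one (depth, tag, text) tuple per node, parents before children."""
    text = (node.text or "").strip()
    rows = [(depth, node.tag, text)]
    for child in node:
        rows.extend(walk(child, depth + 1))
    return rows


root = ET.fromstring(PEOPLE_XML.strip())
rows = walk(root)

# Print the tree with two spaces of indentation per level.
for depth, tag, text in rows:
    print("  " * depth + tag + (": " + text if text else ""))
```

Each element becomes a node whose children are the elements nested inside it, which is what lets a program 'navigate' from people down to street.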
Suppose we have two more people in our file. You could access the streets where all of them live by saying
/people/person/address/street
These 'queries', which are not unlike SQL queries, are part of the XPath language.
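As a rough sketch, the same path can be tried from Python's standard library, which supports a limited subset of XPath. Two assumptions apply: the second and third people are made-up placeholders (the article elides them), and because ElementTree does not accept absolute paths on an element, the path is written relative to the people root.

```python
import xml.etree.ElementTree as ET

# Wyatt's address is from the article; the other two people are hypothetical.
PEOPLE_XML = """
<people>
  <person>
    <name> Wyatt </name>
    <address type="US">
      <street> W Broadway </street>
      <number> 1101 </number>
    </address>
  </person>
  <person>
    <name> Alice </name>
    <address type="US">
      <street> Elm Street </street>
      <number> 742 </number>
    </address>
  </person>
  <person>
    <name> Bob </name>
    <address type="US">
      <street> Main Street </street>
      <number> 2500 </number>
    </address>
  </person>
</people>
"""

root = ET.fromstring(PEOPLE_XML.strip())

# Equivalent of /people/person/address/street, relative to the <people> root.
streets = [node.text.strip() for node in root.findall("./person/address/street")]

print(streets)     # all three streets, in document order
print(streets[0])  # individual results are accessible by position
```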
What they return is an ordered list: you can access the individual streets by their position or by asking questions about them (select the people who live on Broadway). Such questions are called 'predicates', for example:
/people/person/address[number>1000]/street
selects all street names from people whose address number is larger than 1000.
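The effect of that predicate can be sketched in Python as well. The standard library's ElementTree does not support numeric comparisons such as number>1000 inside a path, so this example applies the same filter in plain code instead of evaluating the XPath itself. The function name streets_with_number_above and the second and third addresses are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The first address is from the article; the other two are hypothetical.
PEOPLE_XML = """
<people>
  <person>
    <address type="US">
      <street> W Broadway </street>
      <number> 1101 </number>
    </address>
  </person>
  <person>
    <address type="US">
      <street> Elm Street </street>
      <number> 742 </number>
    </address>
  </person>
  <person>
    <address type="US">
      <street> Main Street </street>
      <number> 2500 </number>
    </address>
  </person>
</people>
"""


def streets_with_number_above(root, threshold):
    """Emulate /people/person/address[number>threshold]/street in plain Python."""
    result = []
    for address in root.findall("./person/address"):
        number = int(address.findtext("number").strip())
        if number > threshold:
            result.append(address.findtext("street").strip())
    return result


root = ET.fromstring(PEOPLE_XML.strip())
matching = streets_with_number_above(root, 1000)
print(matching)
```

A full XPath engine would evaluate the predicate directly; the loop only mirrors what the predicate means: keep the address nodes whose number child exceeds the threshold, then take their street.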
You can even do math with the results of your queries. You can mix and match those queries with logical operators, and you can even use wildcards and refer to nodes by virtue of their position relative to other nodes in the tree.
It gets better: you don’t need to know JavaScript in order to make these queries. These kinds of queries can be made, as with SQL, from pretty much any programming language.
But even this apparently good neutrality has its dark side: being implementation-independent also implies that attacks could be automated.
What? Attacks?
Like databases, XML files can be a useful tool for storing and sharing data, but they can also be made into an attack surface by malicious users.
They can take advantage of a website that uses XPath in order to inject malicious queries which may do something as innocent-looking as listing the entire file or as harmful as bypassing authentication and elevating their privileges on the website.
XPath injections are particularly dangerous when XML files are used to store passwords, authentication details or other sensitive information.
Remember bWAPP? It’s vulnerable to XPath injection, too!
Here we have a website where superheroes can log in.
Assume we don’t know that this authentication uses XML. If we try normal text or empty fields, we just get "invalid credentials" as a response. But we do know that the site is PHP-based, and in that language strings can be single (') or double (") quoted.
If we try just that, we get the following response:
The important bit is what is hiding behind the bee:
Warning: SimpleXMLElement::xpath(): Invalid predicate in /app/xmli_1.php on line 78
Warning: SimpleXMLElement::xpath(): xmlXPathEval: evaluation failed in /app/xmli_1.php on line 78
So now we know they are using the PHP xpath() function to run an XPath query on XML data.
Since we don’t know the structure of the file, we may never know the exact XPath, but we may guess that it ends like this:
login='<input1>' and password='<input2>'
Thus, if we type something like x', closing the quote, and append or 'a'='a (the query itself supplies the final closing quote), then the expression evaluates to true.
Let’s do that in both the login and password fields, so that the end of the expression becomes:
login='x' or 'a'='a' and password='x' or 'a'='a'
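Since 'and' binds more tightly than 'or' in XPath, this is read as login='x' or ('a'='a' and password='x') or 'a'='a'. The last operand, 'a'='a', is always true, so the whole expression is true no matter what the stored login and password are.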
In that case, the XPath will select all entries in the tree.
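The evaluation can be checked with a small emulation. Python's 'and' and 'or' have the same relative precedence as XPath's, so the injected predicate can be transcribed almost literally and applied to each entry. This is only a sketch of the boolean logic, not an XPath engine. The second hero and the function names are hypothetical placeholders; only neo/trinity comes from the article.

```python
# neo/trinity is from the article; the second entry is a made-up placeholder.
HEROES = [
    {"login": "neo", "password": "trinity"},
    {"login": "hero2", "password": "secret2"},
]


def both_fields_injected(hero):
    # Transcribes: login='x' or 'a'='a' and password='x' or 'a'='a'
    return (hero["login"] == "x" or "a" == "a" and hero["password"] == "x"
            or "a" == "a")


def login_field_only(hero):
    # Transcribes: login='x' or 'a'='a' and password='x'
    # (payload in the login field, a plain x in the password field)
    return hero["login"] == "x" or "a" == "a" and hero["password"] == "x"


selected = [h["login"] for h in HEROES if both_fields_injected(h)]
login_only = [h["login"] for h in HEROES if login_field_only(h)]

print(selected)    # every entry matches
print(login_only)  # nothing matches unless some stored password is really x
```

The second function suggests why the payload goes in both fields: with the payload only in the login field, the 'and' still ties the always-true 'a'='a' to a password comparison that fails.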
However, the page is designed to give this response to a successful login:
Welcome Neo, how are you today? Your secret: Oh why didn't I took that BLACK pill?
Since only one user is greeted even though every entry matched, Neo must be the first node in the XML authentication file tree.
We now know they are using XML for authentication because of the two injections: the bad one (the lone quote, which triggered the warnings) and the good one (which logged us in).
This is the actual line that runs the XPath:
$result = $xml->xpath("/heroes/hero[login='" . $login . "' and password='" . $password . "']");
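To see why that line is injectable, the concatenation can be mirrored in a few lines of Python. This is a sketch of the string building only. The helper name build_query is hypothetical, and the code is deliberately unsafe because it copies what the PHP line does.

```python
def build_query(login, password):
    """Mirror the PHP concatenation: user input is pasted straight into the XPath."""
    return "/heroes/hero[login='" + login + "' and password='" + password + "']"


# A normal login produces the query the developer intended.
benign = build_query("neo", "trinity")

# A lone quote leaves an unbalanced string literal, hence an invalid predicate.
broken = build_query("'", "")

# The payload used above rewrites the logic of the predicate itself.
payload = "x' or 'a'='a"
injected = build_query(payload, payload)

print(benign)
print(broken)
print(injected)
```

The last query ends with exactly the expression guessed earlier, which is why the guess worked without ever seeing the source.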
And indeed, the XML file has a structure like this:
<heroes>
  <hero>
    <id>1</id>
    <login>neo</login>
    <password>trinity</password>
    <secret>Oh why didn't I took that BLACK pill?</secret>
    <movie>The Matrix</movie>
    <genre>action sci-fi</genre>
  </hero>
  <hero>
    ...
  </hero>
</heroes>
It’s generally not a good idea to store users and passwords (and in this case, 'secrets') in plain text files, even with the XML structure. And it’s even worse to use them to check authentication, especially with XML files, since, as we’ve just shown, they can be vulnerable to the XPath injection attack.
This goes to show once more the importance of input validation: never take input from users as-is, because then you’re opening a window attackers will try to get in through.
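One possible form of such validation is an allow-list check applied before any input reaches the query. The sketch below is an assumption rather than a description of how bWAPP or any particular site does it: the function name is_safe_credential, the permitted character set, and the 64-character limit are all illustrative choices.

```python
import re

# Hypothetical policy: letters, digits, underscore, dot and hyphen, 1 to 64 chars.
ALLOWED = re.compile(r"[A-Za-z0-9_.-]{1,64}")


def is_safe_credential(value):
    """Accept only values made entirely of allow-listed characters."""
    return ALLOWED.fullmatch(value) is not None


for candidate in ["neo", "x' or 'a'='a", "", '"']:
    verdict = "accepted" if is_safe_credential(candidate) else "rejected"
    print(repr(candidate), verdict)
```

An allow-list this strict would also reject legitimate passwords containing quotes or spaces, so in practice it may suit login names better than passwords; the point is only that quotes never reach the query unexamined.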