Three Ways To Parse XML in Qt

Update 1/12/2015: I’ve written a follow-up to this post comparing the performance of the different parsers, and I’ve fixed a few mistakes in the code and text below. I’ve also changed my mind about QXmlStreamReader now that I’ve found a simple way to use it.

For this past February’s CoderNight meetup, I thought I would write the solution using Qt and take the time to explore qdoc, Qt’s excellent documentation tool. So, of course, I spent all my time figuring out the three native ways to parse XML using Qt and completely ran out of time for qdoc. While researching XML parsing, I couldn’t find any web pages that addressed and compared the methods all at once. Since then, I’ve discovered that the Qt 5 documentation has an XBEL bookmarks example for each of the methods, and you can compare those examples to get a feel for the differences, but there’s still no Rosetta Stone page for Qt XML parsing. Here’s a shorter, incomplete comparison of the three methods.

The Problem

We’ve got to read in a table of currency conversions stored in an xml file that looks like this:

There are three Qt ways to parse this:

  1. Use QXmlStreamReader from QtCore to parse the xml linearly
  2. Use QDomDocument from the QtXml module to model the entire xml as a tree data structure of XML objects
  3. Use QXmlQuery from the QtXmlPatterns module to create an XQuery to search the xml and return a formatted result set

Note: All of the code in this post and the problem description can be found in this github repository. Later, there’s some code from the Qt git repository for 5.3.

1. Stream Reader

Qt describes the stream reader as designed for “[building] recursive descent parsers, allowing XML parsing code to be split into different methods or classes.” This means that there’s usually a separate function to process each level of elements, with the functions getting more detailed the deeper you get. Qt has a good example for parsing XBEL (web browser bookmark) files, but we’ll use our own problem to illustrate.

Our stream reader class is XmlRateReader which parses the rates from a given xml file.

The mapping from the xml works like this:

We start by looking at the top level start tag in the xml stream, calling processRates() when we find rates.

From within processRates(), we use readNextStartElement() to loop over tags calling processRate() when we find a rate start tag and skipping all other elements. We call skipCurrentElement() for start elements that do not match because readNextStartElement() always descends the tree. Calling skipCurrentElement() jumps to the corresponding end element and keeps us moving on the same level.

Lastly, processRate() iterates over the start elements and grabs the values we want as they’re seen. It also shows an alternative implementation using readElementText().

QXmlStreamReader can be made to function more simply. I’ll cover that in a follow-up. The biggest issue I’ve found with this approach is that the main two methods for moving around, readNextStartElement() and skipCurrentElement(), are confusing to use in practice. I’ll let this code explain it for me:

 Pros & Cons

  • ++ fastest parser
  • ++ for simple parsing, can be made as simple as QDomDocument
  • + easy to understand and follow
  • + parsing is linear and follows the code
  • + low memory
  • + can be parsed incrementally (chunks of xml at a time)
  • - extremely verbose for simple parsing
  • - very easy to mess up traversal using readNextStartElement() and skipCurrentElement()
  • - lots of extra boilerplate that junks up your parsing

2. QDomDocument

QDomDocument parses an XML document into an in-memory tree of XML objects that can be read and manipulated through a single interface. Here we read the entire XML file into our QDomDocument, doc, and ask for a list of all of the rate elements. Then, it’s just a couple more steps to parse out the child elements.

Note: If there are other types of rate elements in the xml file, we might get the wrong types as well. There might be a way to be more specific using namespaces or to just first find the rates element(s) and then find the rate elements within those.

 Pros & Cons

  • ++ very easy to use and understand
  • + less code
  • ? simplicity might break for complex files
  • - potential for high memory usage (entire xml is parsed and stored in memory)
  • - not actively maintained anymore [1]

3. XQuery/XPath

From the QtXmlPatterns module, XQuery/XPath is a language for parsing XML and formatting the parsed results. To be more specific, XPath is the syntax specification that XQuery uses and greatly extends to indicate what and how to parse. Both of them are standardized and are therefore not unique to Qt. Qt has two good introductions to XQuery: there is a detailed introduction to the XQuery language itself, but their [XQuery documentation] does a better job of introducing XQuery by showing it used in practice. I found it rather difficult to figure out how to pull out my rate information. As you’ll see below, I found two methods of doing it, but there’s still room for improvement.

Before moving on to the solutions, you should know that Qt comes with a command line utility called xmlpatterns. If you put your XQuery into a text file, you can test it from the command line saving you lots of time re-compiling. The tool was my primary method of testing and learning XQuery. You can find my tests in this folder on github.

I started by pulling out all of the from, to, and conversion elements as separate lists and then assuming the three lists would be in the same order. I ran the same query for each element name. Note: I should have used variable binding instead of string substitution here. It doesn’t matter here, but it’s a good habit to form because XQuery is subject to injection attacks.

Next, I tried to jazz up the code a bit by storing the lists and element names in a list themselves and writing generic code to do the rest. For my three elements, the code ended up being one line longer and more complex than the simple approach above. Given lots more terms, this would’ve paid for itself. But there’s something to be said for simplicity.

Eventually, I returned to the problem and found a single query to pull all three elements out at once. The result is a list of rate elements separated by commas, with each rate grouping separated by a space. XQuery has support for types, but it’s pointless to use them here because Qt’s XQuery code just converts them back to strings before you get them, so the effort is wasted. [2]
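Turning that result string back into rates is then a matter of splitting twice. The sketch below assumes the format just described (fields joined by commas, groupings joined by spaces) and that each grouping holds from, to, and conversion in that order; the ordering, the Rate struct, and parseRateList() are assumptions made for illustration. It uses plain standard C++ so it runs without Qt; in a Qt program QString::split() would do the same job.

```cpp
#include <cassert>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical value type for this sketch.
struct Rate {
    std::string from;
    std::string to;
    double conversion;
};

// Parses e.g. "AAA,BBB,1.5 CCC,DDD,0.25" into two Rate values.
// Groupings that do not have three fields, or whose third field is not
// a number, are skipped.
std::vector<Rate> parseRateList(const std::string &result)
{
    std::vector<Rate> rates;
    std::istringstream groups(result);
    std::string group;
    while (groups >> group) { // one whitespace-separated grouping at a time
        std::istringstream fields(group);
        std::string from, to, conversion;
        if (std::getline(fields, from, ',') &&
            std::getline(fields, to, ',') &&
            std::getline(fields, conversion, ',')) {
            try {
                rates.push_back({from, to, std::stod(conversion)});
            } catch (const std::exception &) {
                // not a number: ignore this grouping
            }
        }
    }
    return rates;
}
```

Because the query hands everything back as text, the numeric conversion has to happen on this side regardless of any types used inside the query.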

I think for applications like this, where you want to pull some information out of XML, XQuery is the way to go. It’s significantly harder to figure out (at first), but the potential is great, and the fact that it’s a standard means you can use the same technique elsewhere. Also, as the queries get more complex, your XQuery string may get more complex, but your code remains relatively short.

Pros & Cons

  • ++ less code with potential for greater savings as the parsing problem gets more complex
  • ++ very powerful
  • + standardized, use it other places where XQuery is supported
  • + very easy to test using xmlpatterns from the commandline
  • + can parse data that “looks like XML”
  • + can convert all or part of one XML format to a different XML format in only a few lines of code
  • - extremely hard to understand for all but very simple queries
  • - makes regular expressions look simple
  • - most of the online tutorials and documentation google returns only show extremely simple examples
  • - potential for bugs due to complexity
  • -- slowest parsing method (4 times slower in one test)

4. SAX2

Wait, what? I thought you said THREE methods.

Well, I did… but that’s because I intentionally skipped SAX.

Why?

Honestly… I mistakenly thought it was deprecated. [1] [3]

SAX is an event-based XML parser. You create a class with callback methods and give your XML and your event-handling class to the SAX parser. The parser then calls methods in your class for each event, such as start tag, character data, end tag, and error. XML data is parsed serially and not kept in memory, so it’s very similar to the stream reader in that respect. The difference is that stream reader implementations typically mirror the XML structure in code, with a separately named method handling each level of the document, whereas with SAX every event arrives through a single handler interface. I think you could make SAX almost identical to the stream reader, though, by having your event methods use the strategy pattern. You could then have a different strategy class for each XML element, similar to the stream reader’s methods. To give you an idea of what SAX looks like, here are some excerpts from the bookmarks SAX example.

In the example, the XML is parsed into a tree widget and therefore, the xml handler has intimate knowledge of the tree. You can see here in the open() method how the parsing is established.

Now look at the class declaration for the handler. Notice how the handler has to maintain state versus the stream reader where the state was implicitly defined by the code.

Looking at the declaration for XbelHandler::startElement(), you can see how the handler must use a combination of element name matching and saved state in order to take action.

Pros & Cons

  • + can be parsed incrementally (chunks of xml at a time)
  • + based on a well-established Java pattern of parsing XML so perhaps porting code is easier
  • - a little more difficult to understand and code
  • - probably more prone to encapsulation problems and keeping tons of state in the handler as the parsing gets more complex
  • - not actively maintained anymore [1]

Conclusions & Caveats

Searching XML, Minimal Parsing

If I wanted to quickly pull a selection of data out of xml, I would use XQuery. If I know the source XML is guaranteed to be small, I might choose QDomDocument for its ease of use.

Update: I’d actually use QXmlStreamReader along with a helper method that makes it about as simple as QDomDocument to use. This can be seen in the follow-up post.

Extensive Parsing

If I were processing an entire XML file (like a configuration file where every line is important), I’d use QXmlStreamReader or maybe SAX. I think by now you can see why we might use a stream reader, but why SAX? Both support partial input, but with the stream reader you have to detect PrematureEndOfDocumentError. Here’s the kicker, though: according to the docs, once you’ve determined that it’s safe to resume, you have to resume from the code position where you left off, since the stream reader’s state is stored in the code position. SAX, in contrast, stores its state in variables, so it knows exactly where it left off. Furthermore, since SAX is a push-based parser rather than pull-based like the stream reader (your SAX handler has events handed to it, while the stream reader requests data as needed), there’s no need to catch the end-of-document error. You let the parser run until it finishes, errors out (for real), or times out, and then you signal that parsing is finished.

Also, I believe that if you used the strategy pattern for the different element types of your XML, you could get the benefits of the stream reader’s recursive descent design with SAX. In fact, if the strategy classes keep their own state, you could have a generic, reusable SAX handler that works with any set of strategy classes. The Qt documentation says “QXmlStreamReader is a faster and more convenient replacement for Qt’s own SAX parser”, but it doesn’t indicate how much faster, so your mileage may vary. Additionally, since the stream reader was written after SAX, the developers may have more interest in it and it may be better maintained, but that’s just speculation.
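To make the strategy idea a little more concrete, here is a minimal sketch in plain standard C++. Nothing in it is Qt API: the method names startElement(), characters(), and endElement() only mirror the kind of callbacks a SAX handler receives, with simplified signatures, and ElementStrategy, DispatchingHandler, and TextCollector are hypothetical names. I have not tried wiring this into Qt's SAX classes.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// One strategy object per element name; each keeps its own state.
class ElementStrategy {
public:
    virtual ~ElementStrategy() = default;
    virtual void start() {}
    virtual void characters(const std::string &) {}
    virtual void end() {}
};

// Generic handler: knows nothing about rates, it only routes each event
// to the strategy registered for the element the event belongs to.
class DispatchingHandler {
public:
    void addStrategy(const std::string &name,
                     std::shared_ptr<ElementStrategy> strategy)
    {
        strategies_[name] = std::move(strategy);
    }

    void startElement(const std::string &name)
    {
        open_.push_back(name);
        if (ElementStrategy *s = find(name))
            s->start();
    }

    void characters(const std::string &text)
    {
        if (open_.empty())
            return;
        if (ElementStrategy *s = find(open_.back()))
            s->characters(text);
    }

    void endElement(const std::string &name)
    {
        if (ElementStrategy *s = find(name))
            s->end();
        if (!open_.empty())
            open_.pop_back();
    }

private:
    ElementStrategy *find(const std::string &name) const
    {
        auto it = strategies_.find(name);
        return it == strategies_.end() ? nullptr : it->second.get();
    }

    std::map<std::string, std::shared_ptr<ElementStrategy>> strategies_;
    std::vector<std::string> open_; // stack of currently open element names
};

// Example strategy: collects the text content of every matching element.
class TextCollector : public ElementStrategy {
public:
    void start() override { current_.clear(); }
    void characters(const std::string &text) override { current_ += text; }
    void end() override { values_.push_back(current_); }
    const std::vector<std::string> &values() const { return values_; }

private:
    std::string current_;             // text seen so far for the open element
    std::vector<std::string> values_; // one entry per completed element
};
```

Because all state lives in member variables, character data can arrive in several pieces, or parsing can pause between any two events, and the next event simply picks up where the last one left off. Registering one TextCollector each for from, to, and conversion would give three parallel lists, much like the first XQuery attempt.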

Caveat Emptor

Finally, I don’t use XML much in my work, and my knowledge is limited. I assume that I’ve made mistakes in this document. Please leave me any comments, corrections, or suggestions below. I’ll do my best then to fix problems here.

Wrap Up

Hopefully, this document will help you quickly evaluate all of the different Qt native XML methods in one place and save you some time. Use the excellent Qt documentation to refine your choice.


  1. The Qt documentation says that the QtXml module is not actively maintained anymore here and here, but maybe that’s because it worked well enough. Here’s a mailing list e-mail stating that it’s really slow and proposing an alternative library with a similar interface, pugixml. It’s only 2 headers and 1 source file. That said, many people anecdotally claim to use the DOM interface with small XML documents and have never noticed performance problems warranting attention. According to QTBUG-32926, QtXml is probably not going to be removed; it just won’t be improved. Critical bug fixes, if any, may still occur, but don’t expect it to ever change.

  2. Although, maybe I’m wrong. There’s a lot about types just underneath the section on variable binding.

  3. The stream reader is apparently faster, easier to code, and more Qt-like. Additionally, SAX is modeled after the Java SAX2 interface for parsing XML, so the main reason you’d use it is to port a Java parser to Qt. Unless you’re porting, it’s effectively deprecated.

  • Scottamus

    Thanks! This is the most useful, concise explanation for parsing xml in qt I’ve found all day. I’m maintaining some software that uses xQuery and found that the parsing takes over 130 seconds for processing a single xml file! (it’s only about 130k) Looking online, a lot of the xQuery calls seem to take a long time to do cleanups. When the xQuery destructor or bindvariable is called it takes seconds to resolve. So not only is xQuery extremely confusing, it’s horribly inefficient. I don’t know if you’ve run into the same thing but I’d definitely add that to the minuses. I plan to rewrite it using your QDomDocument example as a template. It looks 1000 times more readable and may actually finish in a sensible amount of time. Thanks again.

    Scott.

    • I’ve added a large-file-test project to the git repo. Empirically, I’ve found the timing ratio to be roughly 1:2:4 (StreamReader:DomDocument:XQuery). I also wrote a small helper class EasyXmlStreamReader that makes it easier to use QXmlStreamReader like QDomDocument. I plan to revise this blog post a little and post a followup soon. Would you mind looking at the updates and making suggestions? I’m not sure of how to muck with the XQuery code to make it more realistic. I’m not sure how XQuery gets used in practice.

    • I posted a follow-up post with some numbers: http://3gfp.com/2015/01/qt-xml-parsing-continued/

  • I’m glad it helped you. You make a good point regarding performance tradeoffs. My tests were very small and contrived. I never ran any of the techniques at scale. I have another project I’m working on now in Python that uses a much larger XML file. I’ll try to run some example parsing against it and capture some arm-chair stats.

  • Michele

    this post is really helpful, thanks!!

    There seems to be a bug in your QXmlStreamReader example. In processRate() you need xml.readNext() before extracting xml.text().toString() , otherwise you will get an empty text.

    • Thank you for spotting that. I introduced that bug while switching from readElementText() to text().toString(). I ended up adding both methods to the example. I’ve updated the post and the code. I’m glad the post has helped.

      • Michele

        cool, now it’s ok 🙂 personally I like the xml.readElementText() better.

        Another tiny bug, the last line of XmlRateReader::read() should be:

        qDebug() << xml.errorString();

        • I originally used xml.readElementText(). I thought that it might be better to have an error occur if someone gave you a malformed XML document with child elements inside of the “from,” “to,” and “currency” elements.

          There’s an XmlRateReader::errorString() method at the bottom of the file that calls xml.errorString().