In a [previous post], I compared the different methods for parsing XML in Qt. After a comment about XQuery’s performance, I [added some code] to test performance using a simple, but large (304MB) XML file. Times are sorted and normalized to the smallest value.
| Run Time | Method |
| :---: | :--- |
| 1 | QXmlStreamReader – processElementsByTagNameHierarchy – text() |
| 1.1 | QXmlStreamReader – processElementsByTagNameHierarchy |
| 1.1 | QXmlStreamReader – processElementsByTagName |
| 2.2 | QDomDocument |
| 4.0 | XQuery |
[previous post]: http://3gfp.com/2014/07/three-ways-to-parse-xml-in-qt/
[added some code]: https://github.com/sr105/international_trade/
## What was the test? ##
I found a 300kB XML file that was part of a software update manifest and duplicated its contents 1000 times to get the large test file. The test adds up the sizes of all files whose hash begins with a digit. I ran that test against each method 40 times, dropped the highest and lowest times, and averaged the rest.
Don’t give too much credit to the test. I didn’t try to reduce the effects of reading from disk, make the test complex, or any of a number of other things. It’s just anecdotal and a conversation starter.
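For anyone who wants to reproduce the numbers in the table, the averaging and normalization steps can be sketched in plain C++. This is only an illustration, not the code from the repository: `trimmedMean` and `normalizeToFastest` are names I made up for it, and it assumes all timings are positive.

```cpp
#include <algorithm>
#include <numeric>
#include <stdexcept>
#include <vector>

// Mean of the samples after discarding the single highest and single
// lowest value (hypothetical helper, not part of the test repository).
double trimmedMean(std::vector<double> samples)
{
    if (samples.size() < 3)
        throw std::invalid_argument("need at least three samples");
    std::sort(samples.begin(), samples.end());
    const double sum =
        std::accumulate(samples.begin() + 1, samples.end() - 1, 0.0);
    return sum / static_cast<double>(samples.size() - 2);
}

// Divide every mean by the smallest one so the fastest method reads as 1.
// Assumes every value is positive.
std::vector<double> normalizeToFastest(const std::vector<double> &means)
{
    if (means.empty())
        return {};
    const double fastest = *std::min_element(means.begin(), means.end());
    std::vector<double> normalized;
    normalized.reserve(means.size());
    for (double mean : means)
        normalized.push_back(mean / fastest);
    return normalized;
}
```

Each method's 40 timings would go through `trimmedMean`, and the resulting per-method means through `normalizeToFastest`.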
# Improving QXmlStreamReader #
You may have noticed `processElementsByTagName` and `processElementsByTagNameHierarchy` above. After trying to clean up my stream reader code and check for bugs, I realized that some helper methods would make things a bit easier. So, I wrote a class named `EasyXmlStreamReader`. It uses QXmlStreamReader like `QDomDocument::elementsByTagName()`. Let’s look at how it’s used first.
Just like that, you get the performance of QXmlStreamReader with the near-simplicity of QDomDocument. The reader finds all XML elements with the given name and passes them to a callback. So, how does it work?
## processElementsByTagName(ElementName, CallbackFunction, UserData) ##
Lines 90-91 visit every start tag at every level, while lines 92-93 send the stream to a callback method if the start element name matches. Lines 116-117 are a convenience that spares the callback author from having to consume the entire element passed to them. Those lines attempt to find the end element corresponding to the start element passed to the method. The code comments go into more detail.
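The control flow described above can be modeled without Qt. The sketch below is not the `EasyXmlStreamReader` code: `Token`, `Cursor`, and `forEachElementNamed` are hypothetical stand-ins for a pull parser and the helper, and the depth counter is my own assumption about one way to find the matching end element.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the tokens a pull parser hands out.
struct Token {
    enum Kind { StartElement, EndElement, Characters };
    Kind kind;
    std::string text; // element name, or character data
};

// Hypothetical cursor that tracks how many elements are currently open.
class Cursor {
public:
    explicit Cursor(const std::vector<Token> &tokens) : tokens_(tokens) {}
    bool atEnd() const { return pos_ >= tokens_.size(); }
    const Token &current() const { return tokens_[pos_]; }
    int depth() const { return depth_; }
    // Step past the current token, updating the open-element count.
    void next()
    {
        if (atEnd())
            return;
        if (current().kind == Token::StartElement)
            ++depth_;
        else if (current().kind == Token::EndElement)
            --depth_;
        ++pos_;
    }

private:
    const std::vector<Token> &tokens_;
    std::size_t pos_ = 0;
    int depth_ = 0;
};

using Callback = std::function<void(Cursor &)>;

// Visit every start tag; for each one named `name`, hand the cursor
// (positioned just inside the element) to the callback, then skip
// whatever the callback left unread up to and including the end tag.
void forEachElementNamed(Cursor &cursor, const std::string &name,
                         const Callback &callback)
{
    while (!cursor.atEnd()) {
        const Token &token = cursor.current();
        const bool match =
            token.kind == Token::StartElement && token.text == name;
        const int outerDepth = cursor.depth();
        cursor.next();
        if (!match)
            continue;
        callback(cursor);
        while (!cursor.atEnd() && cursor.depth() > outerDepth)
            cursor.next();
    }
}
```

One consequence of skipping to the end tag is that a matching element nested inside another match is never visited on its own. In this model, a callback that reads past the end tag of its element cannot be recovered from.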
## QHash getTextElements(QStringList) ##
I glossed over `getTextElements()` above. It looks at each of the current element’s children and returns as a hash table the text for any that match names in the passed QStringList.
The code uses `readElementText(SkipChildElements)`, which would normally return the concatenation of all text for an element and its children, except that I told it to skip the children. It leaves the stream pointer at the end element. Because the children are skipped anyway, there's another version of the method that calls `text()` instead, which is just a tiny bit faster. See the comments for optimization ideas.
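To make the behavior concrete, here is a Qt-free model of the idea: collect the text of each direct child whose name is wanted, ignore text that belongs to grandchildren, and stop at the parent's end tag. `XmlToken` and `collectChildText` are hypothetical names, and a `std::unordered_map` stands in for the hash table; the real method works on a QXmlStreamReader instead.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical token type standing in for a pull parser's output.
struct XmlToken {
    enum Kind { Start, End, Text };
    Kind kind;
    std::string value; // element name for Start/End, character data for Text
};

// `pos` must index the first token after the parent's start tag. On return
// it indexes the parent's end tag, or tokens.size() if the input is truncated.
std::unordered_map<std::string, std::string>
collectChildText(const std::vector<XmlToken> &tokens, std::size_t &pos,
                 const std::unordered_set<std::string> &wanted)
{
    std::unordered_map<std::string, std::string> result;
    std::string child; // name of the direct child currently open
    int depth = 0;     // 0 means directly inside the parent
    for (; pos < tokens.size(); ++pos) {
        const XmlToken &token = tokens[pos];
        if (token.kind == XmlToken::Start) {
            if (depth == 0)
                child = token.value;
            ++depth;
        } else if (token.kind == XmlToken::End) {
            if (depth == 0)
                break; // the parent's own end tag
            --depth;
        } else if (depth == 1 && wanted.count(child) > 0) {
            result[child] += token.value; // text owned by the child itself
        }
    }
    return result;
}
```

In this model, two wanted children with the same name have their text concatenated under one key. Whether that is desirable depends on the document being parsed.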
## processElementsByTagNameHierarchy(ElementNameList, CallbackFunction, UserData) ##
This method is similar to `processElementsByTagName()` except that instead of matching _any_ element with a given name, it only matches elements with a given name and hierarchy. So instead of matching all _file_ elements, it only matches the ones that are children of _install_ elements which themselves are children of an _update_ element. Notice in the XML file above that there’s also a _file_ element inside the _dependencies_ element, which the hierarchy method skips. In my test, using the hierarchy method is only marginally faster than processing all nodes, but it might save quite a bit of time if your data is just a small piece of your XML.
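One way to picture the hierarchy check is a stack of open element names compared against the wanted path. The sketch below is an assumption about the approach, not the actual implementation: `Event`, `endsWithPath`, and `countMatches` are hypothetical, and it matches the path as a suffix of the open elements rather than anchoring it at the document root.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical start/end event, standing in for a pull parser's tokens.
struct Event {
    bool isStart;
    std::string name;
};

// True when the innermost open elements, listed outermost first,
// equal `wanted`.
bool endsWithPath(const std::vector<std::string> &openElements,
                  const std::vector<std::string> &wanted)
{
    if (wanted.size() > openElements.size())
        return false;
    return std::equal(wanted.rbegin(), wanted.rend(), openElements.rbegin());
}

// Count the start tags whose chain of ancestors ends with `wanted`.
std::size_t countMatches(const std::vector<Event> &events,
                         const std::vector<std::string> &wanted)
{
    std::vector<std::string> open; // names of the currently open elements
    std::size_t matches = 0;
    for (const Event &event : events) {
        if (event.isStart) {
            open.push_back(event.name);
            if (endsWithPath(open, wanted))
                ++matches;
        } else if (!open.empty()) {
            open.pop_back();
        }
    }
    return matches;
}
```

With the path `{"update", "install", "file"}`, a _file_ element under _dependencies_ is not counted, while the single-name path `{"file"}` counts every _file_ element.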
## Some QXmlStreamReader thoughts ##
Anyone making extensive use of QXmlStreamReader would probably benefit from subclassing it and using saner stream processing methods. It would also be beneficial to have the subclass maintain state about the current position in the document, i.e. the current element hierarchy. The `EasyXmlStreamReader` code would be much simpler with these changes. It would be nice to have a method to iterate over the children of the current element similar to how `processElementsByTagNameHierarchy()` tries to do internally.
### Wrap Up ###
In a future post, I may create the subclass above. I may also add a test using [pugixml]. For now, I know a bit more about the QXmlStreamReader parser and have some performance numbers to bandy about over coffee.
[pugixml]: http://pugixml.org/