Qt XML Parsing Continued

In a previous post, I compared the different methods for parsing XML in Qt. After a comment about XQuery’s performance, I added some code to test performance using a simple but large (304MB) XML file. Times are sorted from fastest to slowest and normalized to the fastest, so 1 is the best result and 2.2 means 2.2 times as long.

Relative run time   Method
1                   QXmlStreamReader – processElementsByTagNameHierarchy – text()
1.1                 QXmlStreamReader – processElementsByTagNameHierarchy
1.1                 QXmlStreamReader – processElementsByTagName
2.2                 QDomDocument
4.0                 XQuery

What was the test?

I found a 300kB XML file which was part of a software update manifest and duplicated the file 1000 times. The test adds up the sizes of all files whose hash begins with a digit. I ran that test against each method 40 times, dropped the highest and lowest times, and averaged the rest.
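The averaging and normalizing described here are easy to get subtly wrong, so here is a minimal sketch of the arithmetic in plain C++. The function names trimmedMean and normalizeToFastest are made up for illustration and are not taken from the benchmark code.

```cpp
#include <algorithm>
#include <numeric>
#include <stdexcept>
#include <vector>

// Mean of the samples after discarding the single highest and the single
// lowest value. At least three samples are needed so something is left.
double trimmedMean(std::vector<double> samples)
{
    if (samples.size() < 3)
        throw std::invalid_argument("need at least three samples");
    std::sort(samples.begin(), samples.end());
    const double sum =
        std::accumulate(samples.begin() + 1, samples.end() - 1, 0.0);
    return sum / static_cast<double>(samples.size() - 2);
}

// Divide every time by the smallest one, so the fastest method reads 1.0.
std::vector<double> normalizeToFastest(const std::vector<double>& times)
{
    if (times.empty())
        return {};
    const double fastest = *std::min_element(times.begin(), times.end());
    std::vector<double> normalized;
    normalized.reserve(times.size());
    for (double t : times)
        normalized.push_back(t / fastest);
    return normalized;
}
```

Each method's 40 timings would go through trimmedMean, and the resulting per-method means through normalizeToFastest to produce a table like the one at the top of the post.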

Don’t give too much credit to the test. I didn’t try to reduce the effects of reading from disk, make the test complex, or any of a number of other things. It’s just anecdotal and a conversation starter.

Improving QXmlStreamReader

You may have noticed processElementsByTagName and processElementsByTagNameHierarchy above. After trying to clean up my stream reader code and check for bugs, I realized that some helper methods would make things a bit easier. So, I wrote a class named EasyXmlStreamReader. It lets you use QXmlStreamReader much like QDomDocument::elementsByTagName(). Let’s look at how it’s used first.

Just like that, you get the performance of QXmlStreamReader with the near-simplicity of QDomDocument. The reader finds all XML elements with the given name and passes them to a callback. So, how does it work?

processElementsByTagName(ElementName, CallbackFunction, UserData)

Lines 90-91 will visit every start tag at every level while lines 92-93 will send the stream to a callback method if the start element name matches. Lines 116-117 are a convenience that tries to spare the callback author from having to consume all of the element passed to them. Those lines will attempt to find the end element corresponding to the start element passed to the method. The code comments go into more detail.

QHash<QString, QString> getTextElements(QStringList)

I glossed over getTextElements() above. It looks at each of the current element’s children and returns as a hash table the text for any that match names in the passed QStringList.

The code uses readElementText(SkipChildElements), which returns the element’s own text and skips over any child elements rather than including their text. It leaves the stream positioned at the element’s end tag. Since the children are skipped anyway, there’s another version of the method that calls text() instead, which is just a tiny bit faster. See the comments for optimization ideas.

processElementsByTagNameHierarchy(ElementNameList, CallbackFunction, UserData)

This method is similar to processElementsByTagName() except that instead of matching any element with a given name, it only matches elements with a given name and hierarchy. So instead of matching all file elements, it only matches the ones that are children of install elements which themselves are children of an update element. Notice in the XML file above that there’s a file element inside the dependencies element. Using the hierarchy method is marginally faster than just processing all nodes in my test, but it might save quite a bit of time if your data is just a small piece of your XML.
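The hierarchy check itself comes down to comparing the tail of the list of currently open elements against the wanted names. Here is a minimal sketch in plain C++, assuming the caller keeps that list up to date. The name matchesHierarchy is hypothetical, and whether the original matches only the innermost ancestors, as this does, or the whole path from the root is also an assumption.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// True when the innermost open elements (the tail of `openElements`) are
// exactly `wanted`, outermost name first, e.g. {"update", "install", "file"}.
bool matchesHierarchy(const std::vector<std::string>& openElements,
                      const std::vector<std::string>& wanted)
{
    if (wanted.empty() || wanted.size() > openElements.size())
        return false;
    // Compare from the innermost element outwards.
    return std::equal(wanted.rbegin(), wanted.rend(), openElements.rbegin());
}
```

A stream reader would push a name onto openElements at each start element, run this check, and pop the name again at the corresponding end element.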

Some QXmlStreamReader thoughts

Anyone making extensive use of QXmlStreamReader would probably benefit from subclassing it and using saner stream processing methods. It would also be beneficial to have the subclass maintain state about the current position in the document, i.e. the current element hierarchy. The EasyXmlStreamReader code would be much simpler with these changes. It would be nice to have a method to iterate over the children of the current element similar to how processElementsByTagNameHierarchy() tries to do internally.

Wrap Up

In a future post, I may create the subclass above. I may also add a test using pugixml. For now, I know a bit more about the QXmlStreamReader parser and have some performance numbers to bandy about over coffee.

  • apater

    I have what is likely a much simpler xml string to parse and convert. Unfortunately I have zero skill in coding such things. Hopefully I can make sense of your post, if not perhaps you could take a look?

    // turning this: 6 Level6.2 Level6.3 Level6.4 Level6.5 Level

    // into this: 6 Level|6.2 Level|6.3 Level|6.4 Level|6.5 Level, 6 Level|6.2 Level

    • That’s a very incomplete spec. Can your category trees nest? What do you do then?

      • apater

        Thanks for taking a look Harvey.

        I understand that the xml is incomplete, but that is all the output provided by ACDSee. The problem is to convert the incomplete xml into a “|” separated list.

        What is needed is that every time one encounters 'assigned="1"', generate a separated list up to that point. Continue until no more 'assigned="1"' are found.

  • Mike

    In the processElementsByTagName() declaration you have fn_type method. My compiler is balking at fn_type – where is it defined?