Qt XML Parsing Continued

In a [previous post], I compared the different methods for parsing XML in Qt. After a comment about XQuery’s performance, I [added some code] to test performance using a simple, but large (304MB) XML file. Times are sorted and normalized to the smallest value.

| Relative Run Time | Method |
| :------: | :------ |
| 1.0 | QXmlStreamReader – processElementsByTagNameHierarchy – text() |
| 1.1 | QXmlStreamReader – processElementsByTagNameHierarchy |
| 1.1 | QXmlStreamReader – processElementsByTagName |
| 2.2 | QDomDocument |
| 4.0 | XQuery |

[previous post]: http://3gfp.com/2014/07/three-ways-to-parse-xml-in-qt/
[added some code]: https://github.com/sr105/international_trade/

## What was the test? ##

I found a 300kB XML file that was part of a software update manifest and duplicated it 1000 times. The test adds up the sizes of all files whose hash begins with a digit. I ran that test against each method 40 times, dropped the highest and lowest times, and averaged the rest.
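The averaging and normalization steps can be sketched in a few lines of standard C++. This is a minimal illustration rather than the benchmark code from the repository; `trimmedMean` and `normalizeToFastest` are hypothetical names, and the sketch assumes every timing is positive.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: mean of the samples after dropping the single
// highest and the single lowest value.
inline double trimmedMean(std::vector<double> samples)
{
    assert(samples.size() >= 3); // need something left after trimming
    std::sort(samples.begin(), samples.end());
    const double sum =
        std::accumulate(samples.begin() + 1, samples.end() - 1, 0.0);
    return sum / static_cast<double>(samples.size() - 2);
}

// Hypothetical helper: divide every time by the smallest one so that the
// fastest method reads as 1. Assumes all times are positive.
inline std::vector<double> normalizeToFastest(const std::vector<double> &times)
{
    assert(!times.empty());
    const double fastest = *std::min_element(times.begin(), times.end());
    std::vector<double> relative;
    relative.reserve(times.size());
    for (double t : times)
        relative.push_back(t / fastest);
    return relative;
}
```

Feeding the 40 timings of each method through `trimmedMean` and then passing the five resulting means to `normalizeToFastest` would produce a column like the one in the table.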

Don’t give too much credit to the test. I didn’t try to reduce the effects of reading from disk, make the test complex, or any of a number of other things. It’s just anecdotal and a conversation starter.

# Improving QXmlStreamReader #

You may have noticed `processElementsByTagName` and `processElementsByTagNameHierarchy` above. After trying to clean up my stream reader code and check for bugs, I realized that some helper methods would make things a bit easier. So, I wrote a class named `EasyXmlStreamReader`. It drives QXmlStreamReader through an interface modeled on `QDomDocument::elementsByTagName()`. Let’s look at how it’s used first.

Just like that, you get the performance of QXmlStreamReader with the near-simplicity of QDomDocument. The reader finds all XML elements with the given name and passes them to a callback. So, how does it work?

## processElementsByTagName(ElementName, CallbackFunction, UserData) ##

Lines 90-91 visit every start tag at every level, while lines 92-93 send the stream to a callback method if the start element name matches. Lines 116-117 are a convenience that tries to spare the callback author from having to consume the entire element passed to them: they attempt to find the end element corresponding to the start element passed to the method. The code comments go into more detail.

## QHash getTextElements(QStringList) ##

I glossed over `getTextElements()` above. It looks at each of the current element’s children and returns as a hash table the text for any that match names in the passed QStringList.

The code uses `readElementText(SkipChildElements)` which returns the concatenation of all text for an element and its children (except that I told it to skip the children). It leaves the stream pointer at the end element. Given that I tell it to skip the children, there’s another version of the method that calls `text()` which is just a tiny bit faster. See the comments for optimization ideas.

## processElementsByTagNameHierarchy(ElementNameList, CallbackFunction, UserData) ##

This method is similar to `processElementsByTagName()` except that instead of matching _any_ element with a given name, it only matches elements with a given name and hierarchy. So instead of matching all _file_ elements, it only matches the ones that are children of _install_ elements which themselves are children of an _update_ element. Notice in the XML file above that there’s a _file_ element inside the _dependencies_ element. Using the hierarchy method is marginally faster than just processing all nodes in my test, but it might save quite a bit of time if your data is just a small piece of your XML.
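The hierarchy test itself can be shown without any parser. The sketch below keeps a stack of open element names and checks whether the innermost names equal the requested list. `ElementPath` is a hypothetical helper, and treating the hierarchy as a suffix of the current path (rather than a path anchored at the document root) is an assumption about how the original behaves.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: tracks the names of the currently open elements.
// Call enter() on every start element and leave() on every end element.
class ElementPath
{
public:
    void enter(const std::string &name) { open_.push_back(name); }

    void leave()
    {
        assert(!open_.empty());
        open_.pop_back();
    }

    // True when the innermost open elements are exactly `hierarchy`,
    // listed outermost first, e.g. {"update", "install", "file"}.
    bool endsWith(const std::vector<std::string> &hierarchy) const
    {
        if (hierarchy.size() > open_.size())
            return false;
        const auto offset = static_cast<std::ptrdiff_t>(hierarchy.size());
        return std::equal(hierarchy.begin(), hierarchy.end(),
                          open_.end() - offset);
    }

private:
    std::vector<std::string> open_;
};
```

With the wanted hierarchy `{"update", "install", "file"}`, a _file_ element reached through _update_ and _install_ matches, while one reached through _dependencies_ does not.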

## Some QXmlStreamReader thoughts ##

Anyone making extensive use of QXmlStreamReader would probably benefit from subclassing it and using saner stream processing methods. It would also be beneficial to have the subclass maintain state about the current position in the document, i.e. the current element hierarchy. The `EasyXmlStreamReader` code would be much simpler with these changes. It would be nice to have a method to iterate over the children of the current element similar to how `processElementsByTagNameHierarchy()` tries to do internally.

### Wrap Up ###

In a future post, I may create the subclass above. I may also add a test using [pugixml]. For now, I know a bit more about the QXmlStreamReader parser and have some performance numbers to bandy about over coffee.

[pugixml]: http://pugixml.org/

Comments

6 responses to “Qt XML Parsing Continued”

March 26, 2015

apater
I have what is likely a much simpler xml string to parse and convert. Unfortunately I have zero skill in coding such things. Hopefully I can make sense of your post, if not perhaps you could take a look?

https://bugs.kde.org/show_bug.cgi?id=345220

// turning this:
6 Level6.2 Level6.3 Level6.4 Level6.5 Level

// into this:
6 Level|6.2 Level|6.3 Level|6.4 Level|6.5 Level, 6 Level|6.2 Level
1. March 26, 2015
  
  Harvey
  
  That’s a very incomplete spec. Can your category trees nest? What do you do then?
  
  1. March 30, 2015
    
    apater
    
    Thanks for taking a look Harvey.
    
    I understand that the xml is incomplete, but that is all the output provided by ACDSee. The problem is to convert the incomplete xml into a “|” separated list.
    
    What is needed is that every time one encounters 'assigned="1"', generate a separated list up to that point. Continue until no more 'assigned="1"' are found.
    
June 10, 2015

Mike

In the processElementsByTagName() declaration you have fn_type method. My compiler is balking at fn_type – where is it defined?

1. June 10, 2015
  
  Harvey
  
  It’s in the header: https://github.com/sr105/international_trade/blob/master/large-file-test/easyxmlstreamreader.h#L30
  
  1. June 11, 2015
    
    Mike
    
    Thank you
    
