In a previous post, I compared the different methods for parsing XML in Qt. After a comment about XQuery’s performance, I added some code to test performance using a simple, but large (304MB) XML file. Times are sorted and normalized to the smallest value.
Run Time | Method |
---|---|
1 | XQmlStreamReader – processElementsByTagNameHierarchy – text() |
1.1 | XQmlStreamReader – processElementsByTagNameHierarchy |
1.1 | XQmlStreamReader – processElementsByTagName |
2.2 | QDomDocument |
4.0 | XQuery |
What was the test?
I found a 300kB XML file which was part of a software update manifest. I duplicated the file 1000 times. I decided to add up all of the file sizes where the file hash began with a digit. I then ran that test against each method 40 times, dropped the highest and lowest values, averaging the rest.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
<update version="3"> <targetVersion>1.0.13.222-ff029016</targetVersion> <platform>macosx-x86_64</platform> <dependencies> <file>updater</file> </dependencies> <install> <file> <name>Contents/_CodeSignature/CodeResources</name> <size>271172</size> <permissions>0644</permissions> <hash>9e4ebf7520e484a8b45af123fc4bfbca846191d7</hash> <package>PlexHomeTheater-1.0.13.222-ff029016-macosx-x86_64</package> </file> |
Don’t give too much credit to the test. I didn’t try to reduce the effects of reading from disk, make the test complex, or any of a number of other things. It’s just anecdotal and a conversation starter.
Improving QXmlStreamReader
You may have noticed processElementsByTagName
and processElementsByTagNameHierarchy
above. After trying to clean up my stream reader code and check for bugs, I realized that some helper methods would make things a bit easier. So, I wrote a class named EasyXmlStreamReader
. It uses QXmlStreamReader like QDomDocument::elementsByTagName()
. Let’s look at how it’s used first.
59 60 61 62 63 64 |
void readUsingEasyStreamReader1(const QString filename) { EasyXmlStreamReader reader(filename); qulonglong total = 0; reader.processElementsByTagName("file", processFile, &total); // qDebug("total = %llu", total); } |
35 36 37 38 39 40 41 42 43 44 45 |
void processFile(EasyXmlStreamReader &reader, void *data) { QHash<QString, QString> results = reader.getTextElements(QStringList() << "size" << "hash"); QString size = results.value("size", ""); QString hash = results.value("hash", ""); if (hash.isEmpty() || size.isEmpty()) return; qulonglong &total = *(qulonglong *)data; if (hash[0].isDigit()) total += size.toULongLong(); } |
Just like that, you get the performance of QXmlStreamReader with the near-simplicity of QDomDocument. The reader finds all XML elements with the given name and passes them to a callback. So, how does it work?
processElementsByTagName(ElementName, CallbackFunction, UserData)
85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 |
// Acts like QDomDocument::elementsByTagName(name), applying method to each one void EasyXmlStreamReader::processElementsByTagName(QString name, fn_type method, void *data) { if (name.isEmpty() || method == 0) return; while (!_xml.atEnd()) { while (_xml.readNextStartElement()) { if (_xml.name() == name) { method(*this, data); // Ensure that we are at the end of the element // matching name or else we might match a nested // element with the same name: // <name> ... <name>...<name/> ... </name> // // skipCurrentElement() moves forward to the nearest // end element which moves the current element up a // level. Using <0><1></1><2></2></0>: if we start at // <1>, we'd skip to </1>, then </0>. // // Edge case that we can't do anything about: // <0> <1> <0></0> <0></0> </1> </0> // ^- start here // If we start from the second <0>, skip moves to the // nearby </0>. This will drop us out of the skip loop. // Then readNextStartElement() will put as at the next // <0> which will trigger a call to method(). The only // way to avoid this is to build our own local readNext // and skip methods keeping track of what level we're on. // Then, you'd need to ensure that method() only used // our methods to move around. QXmlStreamReader doesn't // expose any way of knowing where we are in the tree. while(!(_xml.isEndElement() && _xml.name() == name)) _xml.skipCurrentElement(); } } } } |
Lines 90-91 will visit every start tag at every level while lines 92-93 will send the stream to a callback method if the start element name matches. Lines 116-117 are a convenience that try to insulate the callback author from ensuring that they consume all of the element passed to them. Those lines will attempt to find the end element corresponding to the start element passed to the method. The code comments go into more detail.
QHash<QString, QString> getTextElements(QStringList)
I glossed over getTextElements()
above. It looks at each of the current element’s children and returns as a hash table the text for any that match names in the passed QStringList.
162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 |
// Find each child element of the current element with a name found in names. // Return a hash table of string results. // The code can be made to return early once all names have been found. // One method is to stop when the number of results equals the number of names. // The other is to remove keys from names as they're found and return when // names is empty. Performance may differ based on the number of child elements // and the number of keys in names. Regardless of which approach is used, sorting // names by the order that the keys can expect to be seen will yield the best results. QHash<QString, QString> EasyXmlStreamReader::getTextElements(QStringList names) { QHash<QString, QString> results; // readNextStartElement() descends the XML tree which // is undesirable, but we skip over descendants below // which keeps us on the same level. while (!names.isEmpty() && _xml.readNextStartElement()) { if (!names.contains(_xml.name().toString())) { _xml.skipCurrentElement(); continue; } // readElementText() internally skips the current element results.insert(_xml.name().toString(), _xml.readElementText(_xml.SkipChildElements)); names.removeOne(_xml.name().toString()); } return results; } |
The code uses readElementText(SkipChildElements)
which returns the concatenation of all text for an element and its children (except that I told it to skip the children). It leaves the stream pointer at the end element. Given that I tell it to skip the children, there’s another version of the method that calls text()
which is just a tiny bit faster. See the comments for optimization ideas.
processElementsByTagNameHierarchy(ElementNameList, CallbackFunction, UserData)
This method is similar to processElementsByTagName()
except that instead of matching any element with a given name, it only matches elements with a given name and hierarchy. So instead of matching all file elements, it only matches the ones that are children of install elements which themselves are children of an update element. Notice in the XML file above that there’s a file element method inside the dependencies element. Using the hierarchy method is marginally faster than just processing all nodes in my test, but it might save quite a bit of time if your data is just a small piece of your XML.
123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 |
// Like processElementsByTagName() above but with a defined hierarchy. // method will only be applied to elements with a hierarchy that matches names. // Example: names = QStringList() << "root element" << "next level" << ... << "name"; // This is a little faster than processElementsByTagName() void EasyXmlStreamReader::processElementsByTagNameHierarchy(QStringList names, fn_type method, void *data) { if (names.isEmpty() || method == 0) return; QString currentElementName = _xml.name().toString(); QString name = names.first(); names.removeFirst(); while (!_xml.atEnd()) { while (_xml.readNextStartElement()) { if (_xml.name() != name) { _xml.skipCurrentElement(); continue; } if (names.isEmpty()) method(*this, data); else processElementsByTagNameHierarchy(names, method, data); // The just called level may not have left us // at the end of the element named name. This // explicitly moves us there. Otherwise we might // recurse unexpectedly. Example: if "name" has // a child element named "name" and we've returned // here before seeing the child, we'll end up // processing the child as if it were the parent. while(!(_xml.isEndElement() && _xml.name() == name)) _xml.skipCurrentElement(); } if (_xml.isEndElement() && _xml.name() == currentElementName) break; } } |
Some QXmlStreamReader thoughts
Anyone making extensive use of QXmlStreamReader would probably benefit from subclassing it and using saner stream processing methods. It would also be beneficial to have the subclass maintain state about the current position in the document, i.e. the current element hierarchy. The EasyXmlStreamReader
code would be much simpler with these changes. It would be nice to have a method to iterate over the children of the current element similar to how processElementsByTagNameHierarchy()
tries to do internally.
Wrap Up
In a future post, I may create the subclass above. I may also add a test using pugixml. For now, I know a bit more about the QXmlStreamReader parser and have some performance numbers to bandy about over coffee.