0
0

QHtmlParser: writing an HTML parser with your brain switched off

Posted on 2021-10-10 19:57:28 UTC.

While developing MiTubo I've recently felt the need of parsing HTML pages: the first problem I wanted to solve was implementing proper RSS feed detection when the user entered a website URL into MiTubo's search box, so that MiTubo would parse the site's HTML, look for <link rel="alternate"...> URLs in the HEAD section, and let the user subscribe to any video feeds found there.

A quick search in the internet did not provide a clear answer: I found a Qt HTML parser in (stalled) development, and a few other C++ or C parsers (among the latters, lexbor is the most inspiring), but all of them seem to take the approach of parsing the HTML file into a DOM tree, while I was hoping to find a lightweight SAX-like parser. Pretty much like Python's html.parser.

Anyway, I don't remember how it happened, but at a certain point I found myself looking at html.parser source code, and I was surprised to see how compact it was (apart, of course, for the long list of character references for the HTML entities!). Upon a closer look, it also appeared that the code was not making much use of Python's dynamic typing, so, I thought, maybe I could give it a try to rewrite that into a Qt class. And a few hours later QHtmlParser was born.

As this post's title suggests, the process of rewriting html.parser with Qt was quite straightforward, and the nice thing about it is that I didn't have to spend any time reading the HTML standard or trying to figure out how to implement the parser: I just had to translate Python code into C++ code, and thanks to the nice API of QString (which in many ways resembles Python's — or vice versa) this was not too hard. I even left most of the original code comments untouched, and reused quite a few tests from the test suite.

It was time well spent. :-)

If you think you might need an HTML parser for your Qt application, you are welcome to give it a try. It's not a library, just a set of files that you can import into your project; for the time being I only have a build file for QBS, but I'll happily accept contributions to make it easier to use QHtmlParser with projects built using other build systems. You can see here the changes I made in MiTubo to start using it and detect RSS feed in a webpage's HEAD.

That's all for now. And in case you missed the link before, you can find QHtmlParser here.

Back