Parsing HTML Easily in C#
Nope.
There was a time when I was all over building my own kernel and tweaking things. I’m past that now and I just want stuff to work. God, just work (I have a whole rant about printers and printing not working but it descends into language which would melt your face so quickly that I dare not post it). So that went out the Window and I switched to C#. Actually, first I switched to using a JavaScript scheduled task in Azure Mobile Services. However, there were limited node modules available and I didn’t feel like writing or including a whole HTTP library.
Right. I chose to use the HTML Agility Pack, which is a pretty good parser for HTML. Sites frequently don’t have well-formed HTML, which makes parsing them with a full-featured XML parser impossible. You don’t want to get involved in XML anyway - it is a gateway drug to proprietary file formats.
The Agility Pack has a built-in HTTP client but I wasn’t too impressed with it, as it just returned the document and not the status code. Without knowing whether the page was a 404, my particular parsing job was difficult. Instead I made use of the HTTP client in System.Net.Http to pull the page. In this case I’m pulling down a page from the webcomic XKCD.
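Here is a minimal sketch of that fetch, so the status code stays visible before any parsing happens. The names `PageFetcher`, `FetchAsync` and `ReadBodyOrNullAsync` are made up for illustration, and returning null for a 404 is just one way to signal a missing page.

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageFetcher
{
    // Downloads the page at the given URL; hands back null when the server says 404.
    public static async Task<string> FetchAsync(HttpClient client, string url)
    {
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            return await ReadBodyOrNullAsync(response);
        }
    }

    // Hypothetical helper: inspects the status code before touching the body.
    public static async Task<string> ReadBodyOrNullAsync(HttpResponseMessage response)
    {
        if (response.StatusCode == HttpStatusCode.NotFound)
        {
            return null; // no such page, so there is nothing to parse
        }

        // Throws HttpRequestException for any other non-success status.
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```

Calling it would look something like `string html = await PageFetcher.FetchAsync(new HttpClient(), "https://xkcd.com/");`, where the exact URL is my assumption about which page is being pulled.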
All this stuff is asynchronous in .NET 4.5, which is great. For far too long we’ve laboured in code bases which assume that network operations run at local speeds. I’m looking at you, Windows file system.
Parsing the returned document is also very simple.
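Loading the downloaded markup into the Agility Pack might look roughly like this. `PageParser` and `Parse` are illustrative names of mine; `HtmlDocument` and `LoadHtml` come from the HtmlAgilityPack package.

```csharp
using HtmlAgilityPack;

public static class PageParser
{
    // Builds an Agility Pack document from markup that has already been downloaded.
    public static HtmlDocument Parse(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html); // copes with unclosed or badly nested tags
        return document;
    }
}
```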
Now that the document is loaded into the Agility Pack, we can perform simple XPath queries against it.
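A sketch of two such queries follows, one for the comic image and one for its title. The XPath strings are my reconstruction from the element ids `comic` and `ctitle`, and `ComicQueries` is a hypothetical wrapper class.

```csharp
using HtmlAgilityPack;

public static class ComicQueries
{
    // First <img>, at any depth, inside a div whose id is "comic".
    public static HtmlNode FindComicImage(HtmlDocument document)
    {
        return document.DocumentNode.SelectSingleNode("//div[@id='comic']//img");
    }

    // First div whose id is "ctitle".
    public static HtmlNode FindTitle(HtmlDocument document)
    {
        return document.DocumentNode.SelectSingleNode("//div[@id='ctitle']");
    }
}
```

`SelectSingleNode` returns null when nothing matches, so check for that before reading `image.GetAttributeValue("src", "")` or `title.InnerText`.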
The first query there selects the first image in the first div with an id attribute with value comic. Using a double // means that I’m not concerned about how deep in the tree the node appears. The second query just grabs the first div with an id of ctitle. You can also use a SelectNodes function to select multiple matching nodes.
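For the multiple-node case, something along these lines should work. `ImageLister` and `ImageSources` are illustrative names, and the query for every image with a `src` attribute is just an example I picked.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ImageLister
{
    // Collects the src attribute of every <img> in the document.
    public static List<string> ImageSources(HtmlDocument document)
    {
        var sources = new List<string>();
        HtmlNodeCollection images = document.DocumentNode.SelectNodes("//img[@src]");

        // By default SelectNodes returns null, not an empty collection, when nothing matches.
        if (images == null)
        {
            return sources;
        }

        foreach (HtmlNode image in images)
        {
            sources.Add(image.GetAttributeValue("src", ""));
        }

        return sources;
    }
}
```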
I’m not crazy about using XPath for this sort of thing. It would be great to have a CSS-selector-based library in C#. As it turns out, there is a project called Fizzler. I guess it is still in beta, but the few things I read about it suggest it works quite well. I’ll have to play with it for another post. It certainly would be nice to only have to know one HTML query language.