[Giagnocavo]Michael::Write()

 Monday, March 01, 2004
Processing HTML into safe HTML with .NET - Part 2

Following up from part 1, I reviewed three different libraries: SgmlReader, HtmlAgilityPack, and HTMLDocument.

HTMLDocument is a commercial component ($249 per developer, including source code). The other two are libraries written by some cool people at Microsoft and include source code.

SgmlReader is basically an XmlReader that can handle HTML. To write the result back out we'd need an XmlWriter, which can mess up the HTML, and we don't want that. SgmlReader seems like it'd be OK if all we wanted to do was detect unsafe content and return false, but that's not what we need.

However, both HtmlAgilityPack and HTMLDocument read HTML and create a DOM out of it, allowing you to modify it and write the HTML back out.  This is what we need.  I briefly looked over both libraries to see which one I want to program against.  I gave them both an equal rating to start off with, but the scales rapidly tipped in favour of one library.

HTMLDocument definitely loses as far as API niceness and robustness go. Some problems:
  • Inconsistency when loading data into the HtmlDocument.  If you have a string, it needs to go in the constructor, otherwise, use an instance method.
  • Enums (both of them) are prefixed with “e”.  Why?
  • Lack of types.  There are four types total.  That's all.  No HtmlAttribute.  No HtmlElementCollection.  Nothing like that.
  • Weakly typed collections. ArrayLists and Hashtables are used as the collections, instead of strongly typed collections. So you must cast, and if you insert an unsupported object, it will throw an exception when writing the HTML. Not very robust.
  • And the silliest thing of all: No encoding support.  Worse than that, FORCED ASCII.  If you open a file, their code opens a stream, manually passing ASCII encoding.  No BOM detection, no system default, just ASCII.  Ouch.
These things made me seriously doubt how professional a library HTMLDocument is. Most of these things are ultra-simple to fix. If I were forced to use this, I'd have to buy the source code just to make it right. It seems like its purpose is to demonstrate how not to construct a class library.
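To make the encoding complaint concrete, here is a minimal sketch using only the base class libraries. It is not HTMLDocument's code; the file contents and the class name are made up for illustration. It decodes the same UTF-8 file twice: once as forced ASCII, and once through a StreamReader that is allowed to detect the byte order mark.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class; not taken from any of the reviewed libraries.
class EncodingSketch
{
    static void Main()
    {
        string path = Path.GetTempFileName();
        string original = "<p>caf\u00e9</p>";

        // Write the sample as UTF-8 with a BOM, like many editors do.
        File.WriteAllText(path, original, new UTF8Encoding(true));

        // Forced ASCII: every byte above 0x7F (the BOM and the two bytes
        // of the accented character) is replaced with '?'.
        string forcedAscii = Encoding.ASCII.GetString(File.ReadAllBytes(path));

        // BOM detection: the third argument tells StreamReader to inspect the
        // byte order mark and switch encodings; Encoding.Default is only the fallback.
        string detected;
        using (StreamReader reader = new StreamReader(path, Encoding.Default, true))
        {
            detected = reader.ReadToEnd();
        }

        Console.WriteLine("Forced ASCII: " + forcedAscii);
        Console.WriteLine("BOM detected: " + detected);
        Console.WriteLine("Round-trips intact: " + (detected == original));

        File.Delete(path);
    }
}
```

The forced-ASCII string comes back with question marks where the accented character used to be, which is exactly the sort of silent damage you don't want from an HTML library.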

What's more, HtmlAgilityPack doesn't have any of these flaws. In fact, it feels like a missing piece of the base class libraries. Superbly done. Writing code against it was so easy and natural. I'm extremely impressed. Even the documentation is much more complete (it comes with a 180KB HTML Help file, compared to HTMLDocument's 36KB HTML Help file).
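For a sense of what the read, modify, and write-back cycle looks like, here is a rough sketch against HtmlAgilityPack. It assumes a current release of the library (for example from NuGet) rather than the 2004 build reviewed here, and the sample markup, the class name, and the choice of what to strip are mine, purely for illustration. It is nowhere near a complete sanitizer.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical demo class; the input string is made up for illustration.
class SanitizeSketch
{
    static void Main()
    {
        string html = "<p onclick=\"evil()\">Hello <script>alert(1)</script><b>world</b></p>";

        // Parse the markup into a DOM.
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts != null)
        {
            foreach (HtmlNode script in scripts)
            {
                // Detach the <script> element from its parent.
                script.ParentNode.RemoveChild(script);
            }
        }

        // Strip one kind of inline event handler from any element that has it.
        HtmlNodeCollection withOnclick = doc.DocumentNode.SelectNodes("//*[@onclick]");
        if (withOnclick != null)
        {
            foreach (HtmlNode node in withOnclick)
            {
                node.Attributes.Remove("onclick");
            }
        }

        // Write the modified DOM back out as HTML.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

The nodes and attributes are real types rather than entries in an ArrayList, so there is no casting anywhere in the loop. A real filter would more likely work from a whitelist of allowed tags and attributes than chase individual bad ones like this.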

Hands-down-winner: HtmlAgilityPack.
Code | Security
Monday, March 01, 2004 6:54:09 PM UTC

