In an application I'm currently writing, we allow users to write messages with HTML markup in them, to deliver a rich experience. The obvious problem is making this secure. We don't want UserA to write a malicious script and steal some of UserB's data. IE provides some cross-site scripting defense, but defense-in-depth (well, not even that deep in this case) means we should make sure that the HTML doesn't contain anything executable. I've seen some samples that claim to clean the HTML with not much code at all. They check a few tags, and they think they're done. Of course, they aren't.

The problem is that IE is extremely powerful. While this is great when developing an intranet application, it makes finding all the attack vectors nearly impossible. For instance, we might think that a style attribute is ok, right? Wrong. There are two problems that I can think of (without thinking too hard). First, someone could use styles to “overwrite” links on the page by using absolute positioning. They could then cover the “My Account” link with a link that goes to their own server, and steal the user's information. Second, the style attribute can be used to load an HTML Component (.HTC), which can contain lots of script. That's bad. And this is just in one little attribute!

Needless to say, there are many, many more attack vectors. Even if we could find them all, a blacklist doesn't help users when they get a new browser with upgraded and different capabilities. So, we're going to have to resort to a “safe” HTML subset. We'll go through the MSDN reference and pick out the tags and attributes that we consider safe, and anything else will simply get deleted.

Sounds easy enough, except we've got to parse the HTML. Not fun. Fortunately, I've found two libraries that do this: the HtmlAgilityPack, written in C# by Simon Mourier from Microsoft (source included), and DevComponents.com's HTMLDocument, a commercial but inexpensive library.
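
To make the “safe subset” idea concrete, here is a minimal sketch of a whitelist filter. It is not the code from our application, and it uses Python's standard-library `html.parser` rather than either of the libraries above, purely to keep the example self-contained. The tag list, the attribute list, the allowed URL schemes, and the names `WhitelistSanitizer` and `sanitize` are all assumptions made up for illustration; a real whitelist would be chosen by going through the HTML reference.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical whitelist, for illustration only.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href", "title"}}   # note: no style, no on* handlers
VOID_TAGS = {"br"}                         # tags that never get a closing tag
DROP_CONTENT = {"script", "style"}         # drop the text inside these too
SAFE_SCHEMES = {"http", "https", "mailto"}


def is_safe_url(value):
    """Accept only absolute URLs whose scheme is explicitly whitelisted."""
    try:
        scheme = urlparse(value.strip()).scheme.lower()
    except ValueError:
        return False
    return scheme in SAFE_SCHEMES


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # pieces of the cleaned markup
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: delete the tag, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            value = value or ""
            if name == "href" and not is_safe_url(value):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # re-escape all text content


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p style="behavior:url(x.htc)">Hi <b>there</b><script>alert(1)</script></p>'))
# prints: <p>Hi <b>there</b></p>
```

The important design choice is that nothing passes through unless it is on the list: the `style` attribute, event handlers, and `javascript:` links are removed not because the code knows about them, but because it doesn't. This sketch does not balance unclosed tags or handle every parser quirk, so treat it as an outline of the approach rather than something to deploy.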
If anyone knows of other HTML parsing libraries, please leave a comment. In part 2, I'm going to review the APIs of the different libraries.