Strict feed parsers are useless

Erich,
I’m not entirely sure what you did to break Planet, but using a strict
feed parser will just result in you missing a significant number of
entries. People sadly don’t produce valid feeds and will blame your
software rather than their feeds. It doesn’t help that a number of
validators aren’t entirely strict and that RSS doesn’t have a very
comprehensive spec. RSS feeds are a lot worse than Atom feeds, in part
because Atom has a good validator and a very well thought out spec. It’s
for this reason that I ended up writing Eddie rather than using ROME, as
it was a DOM parser and just failed to get any information out of a
non-wellformed feed. Eddie, on the other hand, is a SAX-based parser. In
a recent comparison, an Eddie-based aggregator managed to correctly
parse several more entries than a ROME-based aggregator on one
particular day.
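
To make the difference concrete, here is a minimal sketch in Python
using only the standard library. It is not Eddie’s or ROME’s code; the
feed, the TitleCollector class and both helper functions are made up
for illustration. The event-based reader keeps whatever it saw before
the error, while the tree builder gives up on the whole document.

```python
import xml.sax
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical feed: the third entry leaves <b> unclosed, so the
# document as a whole is not well-formed.
BROKEN_FEED = b"""<feed>
  <entry><title>First</title></entry>
  <entry><title>Second</title></entry>
  <entry><title>Third</title><content><b>oops</content></entry>
</feed>"""


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element as events arrive."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._text = None  # list of text chunks while inside <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "title" and self._text is not None:
            self.titles.append("".join(self._text))
            self._text = None


def titles_via_sax(data):
    """Event-based: keep everything reported before the fatal error."""
    handler = TitleCollector()
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException:
        pass  # the titles collected so far are still usable
    return handler.titles


def titles_via_dom(data):
    """Tree-based: a parse error means there is no tree to query."""
    try:
        doc = minidom.parseString(data)
    except ExpatError:
        return []
    return [t.firstChild.data for t in doc.getElementsByTagName("title")]
```

Calling titles_via_sax(BROKEN_FEED) returns all three titles, because
the error is only hit after the third title has been closed, while
titles_via_dom(BROKEN_FEED) returns an empty list. A real liberal
parser goes further and tries to recover after the error as well.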

You also have major aggregators being liberal. Sam Ruby discussed this
recently, with Bloglines becoming the de facto validator: if Bloglines
parses it, then it’s valid. We had the same problem with HTML, with
people making sure their pages worked in a browser rather than met the
spec.

I suspect the problem you had with Planet is that you failed to close
a tag, causing the rest of the page to be in bold or be a link etc.
This is fairly easily solvable and in fact has been solved in
FeedParser, which is the feed parsing library Planet uses. It has
support for using HTMLTidy and similar libraries for fixing unbalanced
elements. Eddie uses TagSoup to do a similar thing. As a result I’ve not
noticed any particular entry leaking markup and breaking the page.
Perhaps Planet Debian just needs to install one of the markup cleaning
libraries.
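
As a rough illustration of what such a cleaning step does, here is a
small Python sketch built on the standard library’s html.parser. It is
only a stand-in for HTMLTidy or TagSoup, which handle far more cases;
the Balancer class, the balance function and the VOID set are names I
made up for this example.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (a deliberately short list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Balancer(HTMLParser):
    """Re-emits a fragment, closing any tags the author left open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # currently open, non-void elements

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{rendered} />")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def balance(fragment):
    """Return the fragment with every opened element closed."""
    parser = Balancer()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)
```

With this, an entry body such as '<p>some <b>bold text</p> and more'
comes out with the b element closed before the paragraph ends, so the
bold no longer leaks into whatever follows it on the page.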

I agree that people should use XML tools where possible.
Unfortunately, most blogging tools use text-based templating systems,
which make producing non-wellformed XML too easy. To deal with this I
pass all my output through an XSLT filter, which means that everything
is either well formed or isn’t output at all. Unfortunately I don’t
think everyone would be capable of using XSLT, or willing to.
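
The “well formed or nothing” idea does not strictly need XSLT. As a
sketch of the same gate, assuming a templating system that hands you
its output as a string, the Python standard library is enough. The
function name render_or_nothing is hypothetical, and this is not the
filter I actually run.

```python
import xml.etree.ElementTree as ET


def render_or_nothing(templated_xml):
    """Return re-serialized XML if the input parses, otherwise None."""
    try:
        root = ET.fromstring(templated_xml)
    except ET.ParseError:
        return None  # refuse to publish broken markup
    # Serializing from the parsed tree guarantees well-formed output.
    return ET.tostring(root, encoding="unicode")
```

Anything a text template gets wrong, such as an unclosed element, makes
the function return None instead of a broken page or feed.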

3 thoughts on “Strict feed parsers are useless”

  1. Hi,
    well, dropping invalid entries in my own feed parser of course doesn’t make sense; but if you did it at a planet level, people would notice that some of their posts are being lost because of bad markup.
    Maybe they’ll then install some validator for their blog.

    Actually that is how HTML evolved… think of planet being the browser. People will fix their pages to be readable by the “browser”.

    Another way would be to convert any invalid markup to plain text. Then people will notice their blog posts appearing “unformatted”, but still appearing.

    Oh, there is no reason why a DOM parser should be less fault-tolerant than a SAX parser. Often DOM builders are actually fed by a SAX parser. I’m not aware of any DOM parser actually using random access.

    The problem I had with planet was a mistyped /ul instead of the /li I intended… the result was still rather easy to parse (in fact browsers would still display it mostly correctly). So I didn’t boldface or enumerate the rest of the planet; it was just the rss2email tool which started choking on the planet RSS feed.

    Oh, and you don’t need to use an XSLT stylesheet to verify it. xmllint should be sufficient. Well, it probably doesn’t handle encoded content, but I doubt you can easily verify that in XSLT either… and my “breakage” was in the encoded parts IIRC.

  2. Erich, all my pages are transformed when they are served. I don’t encode my feeds. I use type="xhtml" in my Atom feeds, and RSS is a lost cause, but I recommend people use my Atom feeds. I know my Atom feeds are valid because the HTML view is.

    Usually DOM parsers don’t make any attempt at correcting errors. Using SAX allows you slightly more control over your parsing. Running ROME over the FeedParser test suite resulted in none of the ill-formed tests being parsed, let alone being valid.
