feed parser Archives - David Pashley.comDavid Pashley.com

Erich,
I’m not entirely sure what you did to break Planet, but using a strict
feed parser will just result in you missing a significant number of
entries. People sadly don’t produce valid feeds and will blame your
software rather than their feeds. It doesn’t help that a number of
validators aren’t entirely strict and that RSS doesn’t have a very
comprehensive spec. RSS is a lot worse than Atom, in part thanks to the
Atom validator and very well thought out spec. It’s for this reason that
I ended up writing Eddie rather
than using ROME as it was a DOM
parser and just failed to get any information out of a non-wellformed
feed. Eddie on the other hand is a SAX-based parser. In a recent
comparison, an Eddie based aggregator managed to correctly parse several
more entries than a ROME based aggregator one particular day.

You also have major
aggregators being liberal. Sam Ruby discussed
this recently with Bloglines becoming the defacto validator; if
bloglines parses it, then it’s valid. We had the same problem with HTML
with people making sure their pages worked in a browser rather than met
the spec.

I suspect the problem you had with Planet is that you failed to close
a tag, causing the rest of the page to be in bold or be a link etc.
This is fairly easily solvable and in fact has been with FeedParser,
which is the feed parsing library Planet uses. It has support for using
HTMLTidy and similar libraries for fixing unbalanced elements. Eddie
uses TagSoup to do a similar thing. As a result I’ve not noticed any
particular entry leaking markup and breaking the page. Parhaps Planet
Debian just needs to install one of the markup cleaning libraries.

I agree that people should use XML tools where possible.
Unfortunately, most blogging tools use text based templating systems,
which makes producing non-wellformed XML too easy. To deal with this I
pass all my output through an XSLT filter, which means that everything
is either well formed or doesn’t output at all. Unfortunately I don’t
think everyone would be capable or willing to use XSLT.

Tag Archives: feed parser

Strict feed parsers are useless