XHTML instead of RSS

Lots of people (Anil, Scott Andrew LePera, and Tantek among them, luminaries all) are talking about using structured XHTML to format weblogs in place of a separate syndication format such as RSS. My opinion is that they’re going about it the wrong way. Bear in mind here that kryogenix.org is HTML 4.01 Transitional, and doesn’t always quite manage to validate as that, so I probably don’t know what I’m talking about.
Anil’s suggestion is that we mark up each part of a weblog post with a particular tag and class combination (so a post title would be <h3 class="title">, for example). Scott complains that that’s overloading the class attribute with semantic meaning, which you’re not supposed to do. Tantek thinks that we shouldn’t try to specify a format explicitly, but have everyone make one up and see which wins; market forces in action.
My problem with all of these options is that they assume that aggregators and other RSS readers are forced to be stupid, that an aggregator should have handed to it, on a plate, what everything means. What’s wrong with having our aggregator’s parser show a bit of intelligence when parsing? For example, people are complaining that mandating <h3> for a post title renders a page semantically meaningless if there are no <h1> or <h2> tags, which is true. Moreover, suggesting that <h3 class="title"> means “title of a weblog posting” is attempting to layer semantic meaning over and above that which XHTML gives to the tag, which you shouldn’t really do, and that’s what I think Scott was complaining about.
I see no problem with a custom XML format where you define the semantic meaning of each tag (a <blogposttitle> tag, for example) and then either syndicate from this and display in the browser with XML stylesheets (as Mozquito XForms does, superbly), or instead just XSLT it into HTML and RSS.
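To make the custom-format idea concrete, here is a rough sketch of one source document being turned into both an HTML fragment and an RSS feed. It uses Python’s ElementTree as a stand-in for the XSLT step, purely to keep the example self-contained. Only <blogposttitle> comes from the paragraph above; the <weblog>, <blogpost>, <blogpostlink> and <blogpostbody> tags, the example.org URL, and the function names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom format: only <blogposttitle> is named in the post;
# every other tag and the URL are invented for this sketch.
SOURCE = """
<weblog>
  <blogpost>
    <blogposttitle>XHTML instead of RSS</blogposttitle>
    <blogpostlink>http://example.org/posts/1</blogpostlink>
    <blogpostbody>Machines are meant to make life easier for us.</blogpostbody>
  </blogpost>
</weblog>
"""


def to_html(weblog):
    """Build a <div> holding one linked heading and one paragraph per post."""
    root = ET.Element("div")
    for post in weblog.iter("blogpost"):
        heading = ET.SubElement(root, "h2")
        link = ET.SubElement(heading, "a", href=post.findtext("blogpostlink"))
        link.text = post.findtext("blogposttitle")
        ET.SubElement(root, "p").text = post.findtext("blogpostbody")
    return ET.tostring(root, encoding="unicode")


def to_rss(weblog):
    """Build a bare-bones RSS channel with one <item> per post.

    Channel-level metadata (title, link, description) is left out for brevity.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for post in weblog.iter("blogpost"):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post.findtext("blogposttitle")
        ET.SubElement(item, "link").text = post.findtext("blogpostlink")
        ET.SubElement(item, "description").text = post.findtext("blogpostbody")
    return ET.tostring(rss, encoding="unicode")


weblog = ET.fromstring(SOURCE.strip())
html_output = to_html(weblog)
rss_output = to_rss(weblog)
```

The point is only that the author writes the post once, in tags that mean what they say, and both outputs are generated from it. A real setup would more likely use an XSLT stylesheet per output, as described above.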
But, failing all that, we could have the aggregator parser show intelligence when parsing. Is it unreasonable to assume, for instance, that the “lowest level” header is the one used for post titles, so that a page with h1, h2, and h3 headers uses h3 as the post title? That a hyperlink preceded by “Posted by” is an author link? That a hyperlink containing a text node containing only a # symbol is a permalink? These aren’t the only possibilities, and they’re not necessarily true in all cases. But you could come up with a set of reasonable heuristics for this sort of thing that wouldn’t dictate a(nother) custom XML format or a hackish overloading of XHTML semantics.
The issue here is that we’re either contorting our HTML or providing an extra syndication format file (both of which are, as Scott points out, “compromises on the part of the author”) in order to make life easier for machines. Nuh-uh. Machines are meant to make life easier for us. Look at, for example, Mark Pilgrim’s ultra-liberal RSS parser, which tries all kinds of ways of parsing (not finding; sorry, Mark) your RSS file: the machine works hard so you don’t have to. That’s the way it’s supposed to be. If we’re working hard to make the machine’s life easier, then we’re doing something wrong.
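For what it’s worth, the three heuristics suggested above (deepest header level is the post title, a link preceded by “Posted by” is the author, a link whose only text is # is the permalink) can be sketched in a few dozen lines with Python’s standard html.parser module. This is a guess at what such an aggregator might do, not a description of any real one. The class and function names, the sample page, and its URLs are all invented, and the sketch ignores plenty of real-world mess such as nested markup inside links.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class WeblogGuesser(HTMLParser):
    """Collect headings and links so their meaning can be guessed afterwards."""

    def __init__(self):
        super().__init__()
        self.headings = []    # (tag, text) pairs in document order
        self.links = []       # (href, link text, text just before the link)
        self._heading = None  # heading tag currently open, if any
        self._href = None     # href of the link currently open, if any
        self._buffer = []     # text collected inside the open heading or link
        self._before = ""     # most recent non-blank text outside any link

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._heading, self._buffer = tag, []
        elif tag == "a" and self._heading is None:
            # Links without an href are not tracked (_href stays None).
            self._href, self._buffer = dict(attrs).get("href"), []

    def handle_data(self, data):
        if self._heading or self._href is not None:
            self._buffer.append(data)
        elif data.strip():
            self._before = data.strip()

    def handle_endtag(self, tag):
        text = "".join(self._buffer).strip()
        if tag == self._heading:
            self.headings.append((tag, text))
            self._heading = None
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, text, self._before))
            self._href, self._before = None, ""


def guess_posts(html):
    """Apply the three heuristics to a page and return the guesses."""
    parser = WeblogGuesser()
    parser.feed(html)
    parser.close()
    used = {tag for tag, _ in parser.headings}
    lowest = max(used) if used else None  # "h3" sorts after "h1" and "h2"
    return {
        "titles": [text for tag, text in parser.headings if tag == lowest],
        "authors": [(href, text) for href, text, before in parser.links
                    if before.endswith("Posted by")],
        "permalinks": [href for href, text, _ in parser.links if text == "#"],
    }


# Invented sample page, in the shape the heuristics expect.
PAGE = """
<h1>Example Weblog</h1>
<h2>Tuesday</h2>
<h3>XHTML instead of RSS</h3>
<p>Machines are meant to make life easier for us.</p>
<p>Posted by <a href="/about">Author Name</a> <a href="/posts/1">#</a></p>
"""
guesses = guess_posts(PAGE)
```

Here the page author did nothing special beyond writing ordinary HTML; the parser does the guessing. It will guess wrong on some pages, which is exactly the trade-off: the machine works harder and is occasionally mistaken, instead of the author contorting their markup.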
