RSS feeds with changing enclosure URLs
On the LugRadio site we publish an RSS feed of episodes, and each episode includes an enclosure tag which references the downloadable MP3 for that episode. So, one snippet from the feed looks like:
<item>
<title>Bars with coconut in</title>
<link>http://www.lugradio.org/episodes/31</link>
<description>Jono Bacon, Stuart Langridge (Aq), Matt Revell, and <span title="On Call
Bald"
style="border-bottom: 1px dotted #ccc">Ade Bradshaw</span>
talk about Linux and whatever else comes along, including:
</p>
<ul>
<li>An interview with Yannick and Carlos from Nokia about the Nokia
Internet Tablet and the company's approach to open-source software</li>
<li>Bounties for writing code: are they a good idea?</li>
<li>Ian Brown, head of FIPR, on ID cards in the UK, and whether they
should happen or not</li>
<li>Samba: should we be inventing our own open protocols rather than
chasing the tail-lights of closed competitors?</li>
</ul>
</description>
<enclosure url="http://lug.mtu.edu.nyud.net:8090/lugradio/lugradio-s02e19-040705-high.mp3?podcast" type="audio/mpeg" length="17659904" />
</item>
The problem is this: if the enclosure URL changes, people’s podcast clients and RSS aggregators download stuff again. How can I avoid this happening?
A few suggested wrong solutions:
1. Never change the URL
Can’t do that. The URL points to a mirrored copy of the episode’s mp3 file. If that mirror goes offline (our mirrors are run by volunteers without payment), we have to change the URL to point to another.
2. Use a redirect
Lots of people say “just make the URL be http://www.lugradio.org/mirrors/season2/episode19/mp3” and have that be a CGI script that redirects to a mirror. That would be great if podcast clients were compliant HTTP clients, and followed redirects. In practice, they are not and do not. This means that, if we implemented the redirect anyway, we’d be secure in our integrity but lots of people couldn’t download the show the way they want. Knowing that we are right and they are wrong is cold comfort when we’re annoying our listeners; we can’t use redirects.
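For reference, the redirect itself is tiny. Here is a minimal sketch in Python of what such a script might do; the `MIRRORS` table, the `mirror1.example.org` host and the `redirect_target` helper are made up for illustration and are not how lugradio.org is actually set up.

```python
# Sketch only: MIRRORS and redirect_target are hypothetical names.
from http.server import BaseHTTPRequestHandler

# Hypothetical table mapping the stable feed path to whichever mirror is alive.
MIRRORS = {
    "/mirrors/season2/episode19/mp3":
        "http://mirror1.example.org/lugradio-s02e19-040705-high.mp3",
}

def redirect_target(path):
    """Return the current mirror URL for a stable path, or None if unknown."""
    return MIRRORS.get(path.split("?", 1)[0])  # ignore any query string

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = redirect_target(self.path)
        if target is None:
            self.send_error(404, "Unknown episode")
            return
        # 302 rather than 301, because the mirror may change again later.
        self.send_response(302)
        self.send_header("Location", target)
        self.end_headers()

    do_HEAD = do_GET  # answer HEAD requests with the same redirect
```

To try it, pass `RedirectHandler` to `http.server.HTTPServer`. When a mirror dies, only the table changes; the URL in the feed stays put. The catch is exactly the one described above: the client has to follow the `Location` header.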
3. Coralize the podcast enclosure URLs
Use the Coral distribution network to take the pressure off the archive that the feed points at. We’re already doing this, but if that archive goes away, the Coralized URL won’t point to anything, and we’ll have to change the URL. The Coral people are very cool, but they won’t cache all our mp3 files indefinitely.
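Coralizing is just a hostname rewrite. The sketch below is inferred from the enclosure URL in the snippet above, where `.nyud.net:8090` is tacked onto the mirror's hostname; the `coralize` function name is my own, and the un-Coralized origin host is an assumption read back from that URL.

```python
from urllib.parse import urlsplit, urlunsplit

# Suffix as it appears in the enclosure URL in the feed snippet.
CORAL_SUFFIX = ".nyud.net:8090"

def coralize(url):
    """Rewrite a plain http URL so it is fetched through the Coral cache."""
    parts = urlsplit(url)
    if parts.scheme != "http" or parts.hostname is None or parts.port not in (None, 80):
        raise ValueError("only plain http URLs on the default port are handled")
    # Path and query string are left untouched; only the host changes.
    return urlunsplit(parts._replace(netloc=parts.hostname + CORAL_SUFFIX))
```

Note that the rewritten URL still names the origin host, which is why a dead mirror still means a changed URL.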
4. Set up an archive that never goes away
Essentially, this is a suggestion that goes with suggestion 1: make sure that old mirror URLs don’t break by setting up the One Canonical Archive that the podcast feed points at. The problem is that this one archive gets hit pretty hard for bandwidth, because every podcast client uses it. This means that it has to be an archive with lots of bandwidth to cover the initial download spike when an episode is released. We could use archive.org, and we do upload episodes there, but their upload process is long and laborious and would delay an episode’s release by quite some time, which we’d rather not do.
5. Do a ‘redirect’ by streaming the data through our URL
No-one’s suggested this, but I’ve thought of it. We could put the URL from the “redirect” suggestion above in the feed, but instead of having that URL redirect to a mirror, that URL points to a CGI script which downloads the data from a mirror and streams it out on the fly to the consumer. I don’t want to do this because it puts a horrific bandwidth requirement on the lugradio.org machine; every downloaded byte of a LugRadio episode will go through that machine, and we can’t afford the bandwidth for that.
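To put a rough number on that, here is a back-of-the-envelope calculation. The episode size is the `length` attribute from the enclosure above; the download count is an assumed round figure for illustration, not a measurement.

```python
EPISODE_BYTES = 17_659_904       # length attribute from the enclosure above
DOWNLOADS_PER_EPISODE = 10_000   # assumed round number, not a measured figure

def proxied_gigabytes(episode_bytes, downloads):
    """Traffic through the proxying machine, in decimal gigabytes."""
    return episode_bytes * downloads / 1e9

print(round(proxied_gigabytes(EPISODE_BYTES, DOWNLOADS_PER_EPISODE), 1))  # 176.6
```

That only counts the outbound side; a proxy that does not cache would also have to pull the same data in from the mirror.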
At the moment, we Coralize URLs, and we’re trying to set up a Canonical Archive based on a very useful donation (about which more in a few days). But is there a better way? I can’t help but think that there should be a better way around this; something with tag URIs or guids or something. Remember that we need something which works with the podcast clients that currently exist; I don’t want to hear “clients should support XXXX and if they don’t then you should ignore them”, because we don’t have the luxury of doing that, sadly.
Any help will be greatly, greatly appreciated; I’ve been wrestling with this problem for months now and I’m still not sure how to solve it. Thanks!
Coralize the link to podcasts.lugradio.org, which is a CNAME to a mirror whose admin can actually control his/her own httpd.conf.
I’m pretty sure there are quite a few of them out there – most decent hosting companies let you, and I suspect in most cases the box admin is your mirror admin anyway.
That way, if a mirror goes away you just point podcasts.lugradio.org to someone else. And it’s not like that particular mirror will get overhammered, ‘cos Coral takes the strain.
The “crux” here is the ability to create a URL that will never change – and a CNAME, while not perfect, will give you a pretty good chance.
-ttfn, Xalior
-5461 seconds later
Some thoughts:
(a) As per Wrong Suggestion #4, I am under the impression that one can use ourmedia.org to submit content to archive.org in a way that bypasses the review process and waiting limit. It might be worth investigating.
(b) Rely on coral and perhaps some provider like libsyn.com for the initial burst of downloads for the first two or four weeks of the episode’s life. Rely on archive.org for archives of old shows. (The RSS feeds should only really contain one or two shows to protect against people accidentally downloading your whole catalogue when they first subscribe.)
(c) Follow Dave Slusher’s example (evilgeniuschronicles.org) and have your default RSS feed serve only the torrent files. He puts a plain mp3 file in the feed that will only show up if a person’s podcatcher doesn’t do BitTorrent. The mp3 file explains that BitTorrent didn’t work for them and they need to subscribe to the alternative mp3 feed. It’s a compromise solution that encourages tech-savvy people to use his BitTorrent feed, but holds the hands of non-geeks.
4 hours later
I’m inclined towards going with the CNAME approach for those who can control their mirror config. It is the first thing I thought of and obviously seems to have come up before.
Jon.
11 hours later
The CNAME approach means that all mirrors not only have to mirror files, but mirror a specific directory structure as well, which they don’t currently do…
18 hours later
#2: Do you actually know of any clients that don’t handle redirects? Or are you just assuming that a significant number get confused by them? That wouldn’t be an assumption I would make.
24 hours later
Jim; when I first set up the LugRadio podcast feed I used redirects, and we had complaints from lots of people that their clients failed to download episodes; I eventually fixed it by not doing the redirect. It’s possible that that has since been fixed, but it was a real problem in the early days and I haven’t revisited it.
24 hours later
The CNAME approach is obviously opt-in – not all mirrors can modify their httpd.conf, and there’s not a great disadvantage in reorganising the directories to a standard layout once (except for the need to resubmit all the old archives).
28 hours later
httpd-triggered redirects, rather than CGI redirects, should be honoured by every client, surely? i.e. put a ‘Redirect’ statement in the Apache config for each episode … ?
3 days later
davee: There’s no difference between “httpd redirects” and “cgi redirects”: they both just return a status code of 301 or 302…
3 days later
In the enclosure tag, specify a guid=“” where guid is a unique identifier (perhaps the day of the broadcast or something); this way the URL can change, but the file does not get downloaded again. It’s what that property is there for.
4 days later
William: that’s what I was hoping for! Thanks. Do clients pay attention to that? If I change the URLs in the enclosure tags in a feed but keep the GUIDs the same, will RSS aggregators redisplay the feed as new or not?
4 days later
William seems to be talking about a nonexistent guid attribute on the enclosure element, not the guid element which does actually exist. Any client which treats a change in URL for the enclosure when the guid doesn’t change as being a new item to download is broken, plain and simple, but remember, you’re talking about catering to clients so broken that they can’t follow a redirect.
I wish I had the sort of answer you want, a cunning plan to sneak around the broken parts, but the only one I have is the one that’s always worked for us before: shame.
Use a guid element: it’s the only thing that gives aggregators a fighting chance at recognizing an item they’ve seen before. Then, change the enclosure URL, and examine the logfile for both mirrors to see what UAs are downloading it again. Tell the developers that they are broken, loudly. Tell everyone you know that they are broken, and ask them to tell everyone they know.
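As a sketch of what "recognizing an item they’ve seen before" could look like on the client side, here is a minimal Python example that keys downloads on the item’s guid element rather than on the enclosure URL. The `new_enclosures` function is hypothetical and does not describe how any particular podcast client works; it only illustrates the behaviour a guid makes possible.

```python
import xml.etree.ElementTree as ET

def new_enclosures(feed_xml, seen_guids):
    """Return enclosure URLs for items not seen before; updates seen_guids."""
    urls = []
    for item in ET.fromstring(feed_xml).iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # nothing to download for this item
        guid = item.findtext("guid")
        # Fall back to the enclosure URL only when the item carries no guid.
        key = guid.strip() if guid else enclosure.get("url")
        if key not in seen_guids:
            seen_guids.add(key)
            urls.append(enclosure.get("url"))
    return urls
```

With this logic, a feed whose enclosure URL moves to another mirror while the guid stays the same yields nothing new to download. A client that keys on the URL instead would fetch the file again.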
Then, go back to a redirect. It’s dead simple basic HTTP, and anything that can’t follow one has no business poking its nose out of its local machine. Compare your local logfile with the logfile on the redirected mirror, to see which UAs are requesting the URL in the enclosure element and not requesting the redirected file. Tell their developers that they are broken, loudly. Tell everyone you know that they are broken, and ask them to tell everyone they know.
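The logfile comparison can be sketched in a few lines of Python. This assumes both servers write Apache’s "combined" log format, with the user agent as the last quoted field; the function names and the paths passed in are hypothetical.

```python
import re

# Matches the request, status, size, referer and user-agent fields of a
# "combined" format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def agents_requesting(log_lines, path_fragment):
    """Collect the user agents that requested a path containing path_fragment."""
    agents = set()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and path_fragment in match.group("path"):
            agents.add(match.group("agent"))
    return agents

def agents_not_following(origin_log, mirror_log, feed_path, mirror_path):
    """User agents that hit the redirecting URL but never fetched the file."""
    return (agents_requesting(origin_log, feed_path)
            - agents_requesting(mirror_log, mirror_path))
```

Matching on the user agent string alone is crude, since one well-behaved and one broken listener using the same client version would cancel each other out, but it is enough to produce a first list of suspects.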
A technical solution on your end is more the sort of thing that I would choose, and the sort of thing I’d generally expect to work, but do you remember how RSS clients almost all used to ignore ETags, and fail to request with If-Modified-Since? The reason they started to behave properly was nothing more than shame, and not all that much of it. No reason why it won’t work again, with a new generation and a new response code.
13 days later
Oh, and Darren’s (b) is absolutely right: having your feed contain every single episode ever is just begging to have them all downloaded multiple times, even if you don’t change URLs (“ooh, shiny new podcatcher client, I’ll import all my subscriptions to see how it works…”).
13 days later
Ah-ha! I was hoping for a response from you, Phil. So much so that this was almost an email to you, actually :)
So, we think a guid on the item plus a redirect, eh? I need to find out how well podcast clients handle redirects now; I’m led to believe that they might be better than before. If they are (i.e., if redirecting cuts out a minority rather than a majority) then I may go for that.
13 days later
On the “Set up an archive that never goes away” approach, we used archive.org to host our audio files. If you use a creative commons licence they seem to be happy, and they seem like they are there for the long haul. They are also free, pretty reliable, and have good permanent links.
66 weeks later
[...] Interestingly, though, I’ve been poking through my logs for lugradio.org and associated mirrors. In the past, people have asked (repeatedly) for an estimate of how many people listen to the show, and I normally quote some figures I worked out a long time ago showing that between eight and twelve thousand people grab it every two weeks. For various reasons it’s quite difficult to get figures across our mirror network and RSS and so on, but I believe I’ve put these together properly. It looks like, since the old days when I worked those numbers out, it’s changed rather a lot. As far as I can tell, an “average” episode (if there is such a thing) of LugRadio has about 20,000 people listen to it, and the most popular episodes get somewhere around 30,000 listeners. Thirty thousand! Blimey. So, I would like to say thanks to those thirty thousand people: we love it, yes we do. Keep on doing what you do, and we’ll keep on doing what we do. [...]
117 weeks later