Distributed backups to friends

It ought to be possible to have a backup system with the following characteristics:

  1. You download a backup client and run it. It asks for a backup group name and a password. It is cross-platform, or at least ported to Linux, Mac OS X, and Windows. It also asks how much space you’re prepared to devote to backups.
  2. You choose directories and files to back up by finding them in your file manager and tagging them as “For backup”.
  3. That’s all the user interaction there is.

The way the backup actually works is that:

  • It takes the stuff you want to back up, and creates a big backup file out of it, every night.
  • It breaks the file up into bits, using the PAR stuff from parchive.sourceforge.net. This means that to recover your backup, you need some but not all of the bits, so if some bits get lost it doesn’t matter.
  • It then ships the bits out to other people in your backup group and stores them on their systems, not on yours, giving you off-site backups.
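To make the “some but not all of the bits” idea concrete, here is a minimal sketch. It is not what PAR actually does (PAR uses Reed-Solomon coding and can survive several lost pieces); this is plain XOR parity, which only survives one lost piece. The function names are made up for illustration.

```python
from __future__ import annotations

from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data: bytes, n: int) -> tuple[list[bytes], int]:
    """Split data into n equal-sized pieces plus one XOR parity piece.

    Returns (pieces, original_length); pieces has n + 1 entries and the
    last one is the parity piece.
    """
    size = max(-(-len(data) // n), 1)      # ceiling division, at least 1 byte
    padded = data.ljust(size * n, b"\0")   # pad so every piece is the same size
    pieces = [padded[i * size:(i + 1) * size] for i in range(n)]
    pieces.append(reduce(xor_bytes, pieces))
    return pieces, len(data)


def recover(pieces: list[bytes | None], original_length: int) -> bytes:
    """Rebuild the data when at most one piece (marked None) is missing."""
    missing = [i for i, piece in enumerate(pieces) if piece is None]
    if len(missing) > 1:
        raise ValueError("single parity can only replace one lost piece")
    if missing:
        present = [piece for piece in pieces if piece is not None]
        pieces = list(pieces)
        # XOR of all surviving pieces equals the missing one.
        pieces[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity piece and strip the padding.
    return b"".join(pieces[:-1])[:original_length]
```

With four data pieces and one parity piece shipped to five friends, any one of them can be switched off and the backup is still recoverable. Surviving more than one missing friend needs a real erasure code, which is the job PAR is meant to do.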

That would make it very easy for a group of people to do mutual backups without having to think very hard about it.

Implementation thoughts

You’d need a server somewhere, to store password details for backup groups and to co-ordinate shipping the data around (since everyone’s likely to be behind a firewall). No-one should ever see or know about this server, though. There is no “sign-up procedure”; to create a new backup group, you just run the client and provide a backup group name and password. That’s all. You don’t have to sign up on the web or explicitly invite anyone into the group; anyone with the group name and password can join.
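One way the no-sign-up flow might work, sketched under assumptions: every client derives the same opaque group ID from the group name and password, and the server simply creates the group the first time it sees an unknown ID. Nothing here is specified above; `derive_group_secrets`, the salt prefix, and the idea of deriving a second key that never leaves the client are all hypothetical.

```python
import hashlib
import hmac


def derive_group_secrets(group_name: str, password: str,
                         iterations: int = 200_000) -> tuple[str, bytes]:
    """Derive (group_id, client_key) from a group name and password.

    group_id is what a client would show the co-ordinating server.
    client_key is a separate secret that stays on the client (an assumption
    of this sketch, e.g. for encrypting backups later).
    """
    # The group name acts as the salt, so the same password used for two
    # different groups yields unrelated secrets.
    salt = b"backup-group:" + group_name.encode("utf-8")
    master = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )
    # Two independent values from one master secret: knowing the group ID
    # does not reveal the client key.
    group_id = hmac.new(master, b"group-id", hashlib.sha256).hexdigest()
    client_key = hmac.new(master, b"client-key", hashlib.sha256).digest()
    return group_id, client_key
```

Anyone typing the same group name and password lands in the same group, and the server never needs to see the password itself.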

There’s nothing in the above about how to restore from a backup, I know. That needs some kind of UI, but I’m not sure what that should be.

It needs to warn you if there’s not enough space out there on the group to back up all the stuff you’re trying to back up. Perhaps there should be some kind of rule which demands that if you want to back up N megabytes you have to offer 3N megabytes of space to the group, or something like that.
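A minimal sketch of that rule, assuming the 3N figure and a per-member check; the `Member` type and function names are invented for the example.

```python
from dataclasses import dataclass

OFFER_RATIO = 3  # offer 3N megabytes to back up N megabytes (the figure above)


@dataclass
class Member:
    name: str
    offered_mb: int            # space this member gives to the group
    wants_to_back_up_mb: int   # size of this member's own backup


def max_backup_mb(offered_mb: int, ratio: int = OFFER_RATIO) -> int:
    """Largest backup a member may make for the space they offer."""
    return offered_mb // ratio


def space_warnings(members: list[Member], ratio: int = OFFER_RATIO) -> list[str]:
    """Return one warning per member whose backup exceeds their allowance."""
    warnings = []
    for member in members:
        allowed = max_backup_mb(member.offered_mb, ratio)
        if member.wants_to_back_up_mb > allowed:
            warnings.append(
                f"{member.name}: wants to back up {member.wants_to_back_up_mb} MB "
                f"but offering {member.offered_mb} MB only allows {allowed} MB"
            )
    return warnings
```

The client would only ever surface this as a single warning, not as a setting to tune.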

There should be some rsyncness in it. If not much has changed, it shouldn’t need to send much out to the group. However, this might be complex, because the previous backup is in scattered bits, and you don’t want to do incremental backups because then you need the full backup as well.
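One possible way round the scattered-bits problem, as a rough sketch: the client keeps a small local manifest of block hashes from the last backup, so it can work out what changed without fetching anything back from the group. This is fixed-size block hashing, not rsync’s rolling-checksum algorithm, so inserting a byte near the start of the file shifts every later block and makes them all look changed. The block size and function names are assumptions.

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # arbitrary choice for the sketch


def block_manifest(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each fixed-size block of the backup file."""
    return [
        hashlib.sha256(data[start:start + block_size]).hexdigest()
        for start in range(0, len(data), block_size)
    ]


def changed_blocks(old_manifest: list[str], new_data: bytes,
                   block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Return {block index: block bytes} for blocks that are new or different."""
    changed = {}
    for index, digest in enumerate(block_manifest(new_data, block_size)):
        if index >= len(old_manifest) or old_manifest[index] != digest:
            start = index * block_size
            changed[index] = new_data[start:start + block_size]
    return changed
```

Only the manifest (a few dozen bytes per block) has to stay on your own machine; the blocks themselves still live with the group.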

Backups must be encrypted, because they’re stored on someone else’s machine. There will probably need to be some kind of UI to provide a passphrase or similar. This also makes rsyncness difficult.

I think this would be a really useful project. The key point, the absolutely critical point, is that the client must be as described above: it just asks you which backup group you’re in and that’s all. No ten pages of options, no need for you to tell it who else is in the group or to maintain a list of who that is or where you want backups to go. If it’s in any way difficult, it won’t get used, and then no-one has backups.

Wish I had time to write this. The big problem that needs solving is how to have the rsyncness in it, so that it only ships changes around rather than a full backup every night. Other than that, it’s all doable, and not all that difficult.

25 Responses to “Distributed backups to friends”

  1. Sounds really cool.

    Presumably you would also want a way to limit the amount of storage available from the client side?

    mrben
  2. mrben: sorry? not sure I understood that…

    sil
  3. Sounds good, when are you going to write it? ;)

    David Reynolds
  4. I was thinking of something similar a while back, but with a more P2P angle on it (i.e. the people in the network are entirely unknown):

    http://strugglers.net/wiki/P2P_Backup_Network

    Andy Smith
  5. I had similar thoughts at SXSW 2005. ;)

    I’m not sure I understand what the point of the PAR file is. Is this so that when a backup node goes missing, you can still recover your data? Doesn’t that pre-suppose that you have a large # of nodes (so that PAR is still useful in recovery)?

    I, too, had thought of a more P2P angle, though a reflector server is a good idea. My thought was a system where groups can agree upon a certain backup multiple, e.g. if you are in a 3x group, you can backup 1 MB for each 3 MB you’re willing to give up for backup.

    One other thing that struck me was that your backup group names would be in a global namespace if there’s no group registration, right? This is something that struck me as an odd thing about S3: that bucket names are not namespaced.

    I’m still mulling how that is useful. I guess if ACLs get stronger in S3, different apps could share the same bucket…

    Jeremy Dunck
  6. Jeremy: yes, they’d be a global namespace. You encourage people to make their namespace a domain name to get around that, I suspect.

    The PAR file stuff is precisely so that if nodes go missing you can still recover. Nodes can go missing in two ways, permanently or temporarily. A temporary missing node occurs when I want to do my recovery at 3am and one of the people in my group doesn’t have their computer switched on. A permanent missing node occurs when someone pulls out of the backup group. It’s critical that permanent missing nodes don’t break everything, because you should be able to pull out of the backup group without doing anything other than uninstalling the client; a pulling-out user should not have to tell everyone in their backup group that they’re pulling out!

    The PAR stuff, as I understand it, lets me say “split this file into N bits such that I only need P bits to recover it all”. Tuning what N-P is is the complicated bit; I’m inclined to say that it should be about 3 or so, so if three people pull out of the group it doesn’t break everything. I suspect that the overnight backup thing should work out how many people there are in the group (N) and then calculate N and P accordingly for that night’s backup.
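A rough way to put numbers on the N and P trade-off, assuming (and it is a big assumption) that each machine is independently switched on with the same probability at the moment you try to restore. The function name and the probability are made up for illustration.

```python
from math import comb


def recovery_probability(n: int, p: int, online: float) -> float:
    """Chance that at least p of n nodes are reachable.

    Each node is assumed to be online independently with probability
    `online`, so the number of reachable nodes is binomially distributed.
    """
    return sum(
        comb(n, k) * online ** k * (1 - online) ** (n - k)
        for k in range(p, n + 1)
    )
```

Calling `recovery_probability(n, n - 3, online)` for a few group sizes shows how much an N-P of 3 actually buys as the group grows.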

    sil
  7. Duplicity
    or rdiff-backup might be interesting
    to you. They don’t do all that you want, but they do have the
    rsyncness you’re after. I believe that they do this by using
    librsync; in any case, it should be relatively simple to use librsync
    to add rsync-ness (the rsync nature?).

    Saint Aardvark the Carpeted
  8. sil, my understanding is that for PAR to work, you need a PAR file for each missing data file. PAR files are much smaller than data files, but it’s not a force-multiplier, unless you mean bits in the literal sense. ;-)

    Why not just have the whole backup set on each node, and have the group’s multiplier be the insurance that your files are available?

    Jeremy Dunck
  9. Jeremy: I don’t want to back up the whole set onto every node because, as a member of a group with 5 members, I have to allocate enough space to have 4 people’s complete backups on my machine. That’s a lot of space required…

    sil
  10. Sounds feasible and useful for small amounts of data only. Which I guess is fine. Most people would just want to back up “my documents” or whatever.

    Given that it needs to be encrypted and distributed, I think you can abandon the concept of rsyncability. This doesn’t need to be a terrible thing however. Instead of one big file to backup daily, why not backup each file individually (with metadata about paths, timestamp, etc)? Then you only need to backup files that are changed since the last backup. You can have the client (or the central “where everyone is” server) delete older versions of the file as newer versions become available. Say have two copies of every file on different machines in the group. PAR type files are most useful for very big files, not for granny’s home-made recipe book document.
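    A minimal sketch of the per-file approach, assuming “changed” means a different modification time or size since the last run. The function names and the shape of the index are invented for the example.

```python
import os


def snapshot(root: str) -> dict[str, tuple[int, int]]:
    """Map each file's path (relative to root) to (mtime in ns, size)."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            index[os.path.relpath(path, root)] = (info.st_mtime_ns, info.st_size)
    return index


def files_to_send(previous: dict[str, tuple[int, int]],
                  current: dict[str, tuple[int, int]]) -> list[str]:
    """Files that are new, or whose mtime or size differs from last time."""
    return sorted(path for path, meta in current.items()
                  if previous.get(path) != meta)


def files_to_expire(previous: dict[str, tuple[int, int]],
                    current: dict[str, tuple[int, int]]) -> list[str]:
    """Files that existed last time but are gone now."""
    return sorted(path for path in previous if path not in current)
```

    The client would save the snapshot after each run and compare against it the next night, sending only what `files_to_send` returns.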

    If you write it, I’ll use it :)

    What I’d be even more interested in (it probably already exists, but I’m having a massive attack of “can’t be arsed googling for it right now”) is a similar idea, but for a revision control system. Being able to cvs (or svn or whatever) from multiple different servers which all keep themselves in sync somehow. I don’t have any specific use for it, but it would be Very Cool.

    ssta
  11. Encryption with something like gpg is quite easy to do, but would obviously break most of the advantage of rsync - in that a small change in a source file would result in all of the compressed/encrypted files changing.

    Assuming you can find perhaps 4 friends to form a “group”, there is still a very high chance that files necessary for a restoration immediately are unavailable.

    If you have a far larger group, say perhaps 30 machines, the chance of anyone being temporarily unavailable is going to be far higher, and it’s going to be almost certain that 2+ people will be offline. Therefore .par files (or similar) may not be the answer to the problem - you need to distribute the same files to multiple nodes to get redundancy, which means everyone would have to offer far more backup space to others than they themselves use from the pool.

    However, a node being temporarily unavailable may not be a problem - considering the bandwidth ratio used (upstream/downstream) it’s likely that restoration would take a long time anyway.

    I suspect, you’d need a ‘formal’ process to go through should a machine wish to gracefully leave a group - e.g. you have to notify your peers, and then wait while the backups you have are redistributed. Otherwise you’re at risk of data loss should other group members decide to rely on files that machine still “has”.

    But in the end - why bother? Google and Amazon are both offering online storage facilities - which are going to be far more reliable than Bob’s computer. Use them, but make sure you encrypt the data first so they can’t crawl it.

    I suspect the bandwidth implications of your proposal could be nasty - imagine someone emailing their friends an email with a 1 MB attachment. Come the evening/night, all of those computers are going to try and back up that 1 MB attachment to each other. Ergh.

    DG
  12. ssta: what you want is a distributed version control system. There are lots of them.
    http://en.wikipedia.org/wiki/List_of_revision_control_software#Software_using_a_distributed_approach

    DG hit on several of the points I was glancing at, but I’ll try to write a proper contribution sometime soon.

    Jeremy Dunck
  13. Sounds like this could be a target for Brackup.

    Decklin Foster
  14. Decklin: perhaps, although Brackup fails the test on how much you need to set up. (brackup.conf? No chance.) The GUI system could be a shell around Brackup, I suppose, but it could also be a shell around darnnear any other backup system that exists too.

    sil
  15. Aq(/sil - how many nicks does 1 guy need?) - what I meant was thus:

    This is a distributed system, whereby there is information stored on my machine that relates to other people’s backups. If my mate down the road decides to back up his 1 terabyte partition, then I would presume that the amount of data on my system would be significantly greater than if I back up my 2 gig /home directory. Thus I want to be able to say, via my client, ‘only use 30GB for backups’, rather than just letting it take over my drive with other people’s backup data.

    Did I understand right, and does this help explain?

    mrben
  16. mrben: yes. This is why it said “It also asks how much space you’re prepared to devote to backups.” in the initial spec :)

    sil
  17. Erm - shit. I can’t believe I didn’t see that.

    Sorry :(

    mrben
  18. Interestingly, there is an existing project I’ve just come across, called DIBS - which is written in Python: http://www.csua.berkeley.edu/~emin/source_code/dibs/

    I’ve not had the chance to run it yet, or investigate very much, but on a quick glance it could be an initial boost to get something like this working.

    the_angry_angel
  19. the_angry_angel: yes. That looks like it’s done a good proportion of the hard backup work. What would be needed is the UI work so that it all happens seamlessly (it should never mention the word “gpg”, for example) and some way of putting a central server into the equation for when both parties are behind firewalls (rather than the communication-by-email approach that it currently uses).

    sil
  20. Hm. Remind me to mail the DIBS guy…

    sil
  21. Mail the DIBS guy.

    erm
  22. DG
  23. You might also think about how to buffer the data stream. If it requires a staging area for the data to create the backup image file, it could be an issue. (I have 3GB to back up so I need at least 2GB space to create the image). This is based on 75% space for compressed data. This is why CDs and DVDs suck as backup media.

    bigredradio
  24. See this on slashdot? http://www.cleversafe.org/

    mrben
  25. mail the dibs guy

    Anonymous
