A few days ago I got an email from Google saying “hey, did you know we’ve just added Gmail and Google Calendar to Google Takeout?”
I did not.
Google Takeout is the entirely laudable effort by Google to make it possible to get all the data you have stored in a particular Google service out of that Google service, whether because you want to leave or just because backups are a good idea. I’ve been a Gmail user for quite a long time, and have quite a lot of mail in there, and it’d be nice to have a backup of it. So, I click “create an archive”, and then some hours later[1] I get a nudge from Google saying “we’ve created an archive of all your mail, and now you can download it”. So that’s exactly what I did.
what shall we do with the drunken mailbox, err-lie in the mornin’
OK, so I’ve got a 4GB .mbox file of all my mail since 2004.[2] It’s good to have a backup. What else can I do with it?
One obvious thing is to point a search engine at it. Gmail is pretty good at searching mail, don’t get me wrong, but it’s nice to be able to search locally without needing internet access, especially since sometimes Gmail goes down[3] or my cable connection decides that connections to gmail and twitter should be slow today.[4] The clear leader for this seems to be notmuch, which bills itself as “the mail indexer”. Notmuch doesn’t fetch mail, it doesn’t send mail; it just indexes and searches it.
a brief digression into mail storage formats
First step, though, is to put the mail in Maildir format. Google has you download the mail in the standard mbox format: one file, with all your mail in it. Mbox format has been around pretty much exactly as long as there has been email at all: here it is in a man page from 1975. Maildir was invented as a better format in 1995; instead of having one epic file with all your mail in it, you have one folder and each email is a separate file in that folder. This is approximately thirteen billion times easier to deal with for applications, especially those trying to deal with a lot of mail, which notmuch is. So we need to convert the Gmail export mbox into a Maildir. I dropped the Gmail mbox into a folder ~/gmail-backup, and then did mb2md -s gmail-backup. That creates ~/Maildir and puts your stuff in it.
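If mb2md isn't to hand, Python's standard library can do roughly the same conversion. The following is a minimal sketch, assuming Python 3 and the stdlib mailbox module; the function name mbox_to_maildir and the paths in the usage note are made up for illustration, and this is not a description of what mb2md does internally. On a multi-gigabyte mbox it will probably be slow.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    # create=False: fail loudly if the mbox file is missing instead of making an empty one.
    source = mailbox.mbox(str(mbox_path), create=False)
    # create=True: make the Maildir (with cur/, new/ and tmp/) if it does not exist yet.
    dest = mailbox.Maildir(str(maildir_path), create=True)
    count = 0
    try:
        for message in source:
            # Each message is written out as its own file inside the Maildir.
            dest.add(message)
            count += 1
    finally:
        dest.close()
        source.close()
    return count
```

Something like mbox_to_maildir("/path/to/export.mbox", "/path/to/Maildir") then returns the number of messages copied; the parent directory of the Maildir has to exist already.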
Next, install notmuch, and run notmuch setup, which walks you through a few basic questions about your mail. Then notmuch new reads and indexes it all. This takes a little while.
a sidebar: “Ignoring non-mail file”
Either gmail or mb2md did something weird: notmuch rejected a whole bunch of my mails because they had a blank second line. If you get the same thing, notmuch will print a bunch of lines like
Note: Ignoring non-mail file: /home/myself/Maildir/.All mail Including Spam and Trash_mbox/cur/1234567890.123456.mbox:2,.
If that happens, take a look at the file it says it’s ignoring. If it looks like a legitimate email but it’s got a blank second line, then you’ve hit the same problem. I needed something to walk through my mail folder and patch these up, so as usual in these situations I wrote an ultranoddy Python script.
an ultranoddy Python script
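A minimal sketch of a script along these lines, not the original listing, might look like the following. It assumes Python 3, and that "blank second line" means one header line, then an empty line, then the rest of the headers. The names fix_blank_second_line and fix_maildir, and the header-matching pattern, are illustrative choices. The check is a heuristic: it only deletes the blank line when the third line looks like another header, so a genuine one-header mail whose body happens to start with "Word: ..." would be altered too. Run it on a copy first.

```python
import os
import re

# A header field name: printable ASCII without spaces or colons, followed by a colon.
HEADER_RE = re.compile(rb"^[!-9;-~]+:")


def fix_blank_second_line(path):
    """Remove a blank second line if the line after it looks like a header.

    Returns True if the file was rewritten, False if it was left alone.
    """
    with open(path, "rb") as f:
        lines = f.readlines()
    if len(lines) < 3:
        return False
    if lines[1].strip() != b"" or not HEADER_RE.match(lines[2]):
        return False
    del lines[1]
    with open(path, "wb") as f:
        f.writelines(lines)
    return True


def fix_maildir(root):
    """Patch every file in the cur/ and new/ directories under root; return how many changed."""
    fixed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) not in ("cur", "new"):
            continue
        for name in filenames:
            if fix_blank_second_line(os.path.join(dirpath, name)):
                fixed += 1
    return fixed
```

Calling fix_maildir(os.path.expanduser("~/Maildir")) walks the whole mail folder and returns the number of files it patched.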
Once you’ve done that, run touch ~/Maildir; touch ~/Maildir/.All* to let things know that you changed something, and then notmuch new again should read in all the fixed mail (and keep the previously-read lot).
There’ll still be a bunch that notmuch ignores: gmail (handily) stores chat logs as emails, but (unhandily) these are not actually emails, and notmuch will dislike them. That’s fine.
seek and ye shall find
Now all your mail is searchable. Try notmuch search whatever and, lo, you get all the matching mails. Very cool. Notmuch can handle some pretty complicated searches: check their website for details.
ultranoddy II: this time it’s personal
Of course, I don’t want to ssh into my home server (which is where this stuff is) and type commands to search my mail. So instead I wrote the world’s simplest notmuch web search UI in Python. It is ugly, it doesn’t do formatting properly, it hates foreigners and so smashes Unicode down to question marks, and I don’t care because all I need is to get search results over the web, and it does that fine. There’s notmuch-web, which seems very nice[5] but requires notmuch v0.15 or better, and Ubuntu 12.04 only has 0.12. So, once more forth into noddy Python scripts.
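As a rough idea of how little such a thing needs, here is a minimal sketch of a notmuch web search page. It is not the original listing. It assumes Python 3.7 or later and a notmuch binary on the PATH; the names run_search, render_page, SearchHandler and serve are invented for illustration. It shells out to notmuch search and dumps the output lines into a list, and, unlike the script described above, it escapes the output and serves UTF-8 rather than flattening Unicode. It has no authentication, so it only belongs on a network you trust.

```python
import html
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def run_search(query):
    """Run `notmuch search QUERY` and return its output lines (one per matching thread)."""
    # An argument list (no shell) means the query is never interpreted by a shell.
    result = subprocess.run(
        ["notmuch", "search", query],
        capture_output=True, text=True, errors="replace", check=False,
    )
    return result.stdout.splitlines()


def render_page(query, lines):
    """Build a bare-bones HTML page: a search box plus one list item per result line."""
    items = "".join("<li>%s</li>" % html.escape(line) for line in lines)
    return (
        "<!DOCTYPE html><html><body>"
        '<form method="get" action="/">'
        '<input name="q" value="%s"><input type="submit" value="Search">'
        "</form><ul>%s</ul></body></html>"
    ) % (html.escape(query, quote=True), items)


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read the search terms from ?q=... and only call notmuch if there are any.
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        lines = run_search(query) if query else []
        body = render_page(query, lines).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8411):
    """Serve forever on the given port (8411 matches the URL used in this post)."""
    HTTPServer(("", port), SearchHandler).serve_forever()
```

Nothing runs until serve() is called, so a script would end with a call to serve(); after that, opening the server's address on port 8411 in a browser gives a search box.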
To be clear, this is pretty horrid. All the HTML is baked into it; it does the absolute bare minimum required. It does what I need it to, though. I just did crontab -e to edit my list of scheduled apps and added @reboot python /home/me/noddy-search-server.py, and now I can just connect to http://homeserver:8411 and search my mail. Nice.
- modulo that it weirdly didn’t work the first time, as per me mithering on Google+ about it ↩
- yes, I know I could have been doing this with offlineimap. I never got around to setting it up, and gmail’s imap implementation is odd because it treats labels as folders, meaning that a message with two labels appears in two imap folders. I might set it up now, though, since I don’t care about the offline imap Maildir other than so that notmuch can index it, and notmuch is clever about finding two mails with the same message ID ↩
- vanishingly rare ↩
- nowhere near as rare ↩
- except for being written in Haskell, but I’m not bigoted ↩