Not moving to Sofa (or moving from the sofa)

This weekend I’ve been playing with Sofa, jchris’s CouchDB blog software. I like CouchDB, and I’m increasingly starting to think that although Wordpress is pretty good, it’s the wrong direction; it’s serious evolution on a dead-end evolutionary path. Like an electric-powered go-faster-striped penny farthing. So a move away is something worth investigating. Sofa is pretty feature-poor by comparison with Wordpress, but I actually don’t care about almost all of the Wordpress features anyway. Sofa’s not too hard to set up, but then I work with CouchDB all the time anyway. It’d likely be harder for others.

Importing posts from Wordpress took the longest to do, because I had to write a script to do it and overcome a Wordpress problem. When exporting posts from Wordpress (under Tools > Export), the downloaded export XML file ended up being truncated because the export script took too long to run (I’ve been posting here for seven years; it’s built up). So I added php_value max_execution_time 600 to my Wordpress .htaccess file and that made it work. After that, and after installing Sofa to CouchDB with CouchApp (as per Sofa instructions), a script:

#!/usr/bin/python
import sys, datetime, re
from xml.etree import ElementTree as ET
import couchdb.client
s = couchdb.client.Server("http://localhost:5984")
s.resource.http.add_credentials("sil","couchdb")
print "Trying to connect to database...",
while 1:
    try:
        print "#",
        d = s["blogdb"]
        print "succeeded."
        break
    except:
        pass

def parse_items(tree):
    things = []
    total_things = 0
    for item in tree.findall("channel/item"):
        parts = item.findtext("link").split("/")
        if len(parts) != 8: continue # skip ones without proper URLs
        created = item.findtext("{http://wordpress.org/export/1.0/}post_date_gmt")
        if created == "0000-00-00 00:00:00": created = "2000-01-01 00:00:00"
        created = datetime.datetime.strptime(created,"%Y-%m-%d %H:%M:%S")
        pid = "%s-%s" % (created.strftime("%Y-%m-%d"), parts[7])
        post = {
            "_id": pid,
            "format": "html",
            "body": item.findtext("{http://purl.org/rss/1.0/modules/content/}encoded"),
            "html": item.findtext("{http://purl.org/rss/1.0/modules/content/}encoded"),
            "author": "sil",
            "created_at": created.strftime("%Y/%m/%d %H:%M:%S +0000"),
            "type":"post",
            "title": item.findtext("title"),
            "slug": pid,
            "tags": dict([(x.findtext("."),"") for x in item.findall("category")]).keys()
        }
        if not post["title"]: post["title"] = post["slug"]
        things.append(post)
        for comment in item.findall("{http://wordpress.org/export/1.0/}comment"):
            approved = comment.findtext("{http://wordpress.org/export/1.0/}comment_approved")
            if approved != "1": continue
            c_created = comment.findtext("{http://wordpress.org/export/1.0/}comment_date_gmt")
            if not c_created or c_created == "0000-00-00 00:00:00": 
                c_created = "2000-01-01 00:00:00"
            comment_data = {
                "_id": "comment%s" % comment.findtext("{http://wordpress.org/export/1.0/}comment_id"),
                "commenter": {
                    "name": comment.findtext("{http://wordpress.org/export/1.0/}comment_author"),
                    "email": comment.findtext("{http://wordpress.org/export/1.0/}comment_author_email"),
                    "url": comment.findtext("{http://wordpress.org/export/1.0/}comment_author_url"),
                },
                "created_at": datetime.datetime.strptime(c_created,"%Y-%m-%d %H:%M:%S").strftime("%Y/%m/%d %H:%M:%S +0000"),
                "format": "html",
                "html": comment.findtext("{http://wordpress.org/export/1.0/}comment_content"),
                "comment": comment.findtext("{http://wordpress.org/export/1.0/}comment_content"),
                "post_id": pid,
                "type": "comment"
            }
            if not comment_data["commenter"]["name"]:
                comment_data["commenter"]["name"] = "Anonymous Coward"
            if not comment_data["commenter"]["email"]:
                comment_data["commenter"]["email"] = "unknown@mailinator.com"
            if not comment_data["commenter"]["url"]:
                del(comment_data["commenter"]["url"])
            else:
                # keep only URLs that look like http(s)://host.tld/...
                if not re.match(r"^https?://[^.]*\..*/", comment_data["commenter"]["url"]):
                    del(comment_data["commenter"]["url"])
            if not comment_data["comment"]: continue
            things.append(comment_data)
        if len(things) > 500:
            total_things += len(things)
            print "Saving %s things (so far %s in total)" % (len(things), total_things)
            for i in things: 
                try:
                    d.create(i)
                except:
                    print "Couldn't update", i["_id"]
                    raise
            things = []
    if things:
        print "Saving final %s things" % len(things)
        for i in things: d.create(i)

if __name__ == "__main__":
    print "Loading export file"
    wp_file = sys.argv[1]
    fp = open(wp_file)
    data = fp.read()
    fp.close()
    print "Parsing export file"
    tree = ET.fromstring(data)
    print "Creating posts and comments"
    parse_items(tree)

This script is totally, totally hardcoded to do the things that I want it to do, so you can’t just run it; you’ll need to change it. It doesn’t recover from errors much, either. Anyway, to run the script:
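The one piece of the script worth understanding before changing it is how the document ID gets built, because the proxy rules later on depend on it. Here is a minimal sketch of just that step, pulled out as a standalone function. It is written for Python 3 (the script above is Python 2), the function name make_post_id is mine, and example.com stands in for the real blog address:

```python
import datetime

def make_post_id(link, post_date_gmt):
    """Return 'YYYY-MM-DD-slug' for a Wordpress permalink.

    Returns None when the link does not split into exactly 8 parts,
    which is the same "skip ones without proper URLs" check the script uses.
    """
    parts = link.split("/")
    if len(parts) != 8:
        return None
    # Wordpress uses an all-zero date for some posts; the script
    # substitutes a fixed placeholder date for those.
    if post_date_gmt == "0000-00-00 00:00:00":
        post_date_gmt = "2000-01-01 00:00:00"
    created = datetime.datetime.strptime(post_date_gmt, "%Y-%m-%d %H:%M:%S")
    return "%s-%s" % (created.strftime("%Y-%m-%d"), parts[7])

# Hypothetical permalink with one path segment before the date
print(make_post_id("http://example.com/blog/2009/09/13/slug-name",
                   "2009-09-13 10:00:00"))
```

If your permalinks have a different number of path segments, the length check and the parts[7] index are the things to change.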

curl -X DELETE http://sil:couchdb@localhost:5984/blogdb # delete existing DB 
couchapp push . http://sil:couchdb@127.0.0.1:5984/blogdb # install Sofa
python wordpress-import.py wordpress.2009-09-12.xml # import all Wordpress posts
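Since a truncated export was what bit me in the first place, it can be worth checking the file before importing it. A rough sketch, assuming Python 3 and that a truncated export shows up as XML that no longer parses (the function name export_is_complete is made up for this example):

```python
import xml.etree.ElementTree as ET

def export_is_complete(xml_text):
    """Return True if the text parses as well-formed XML.

    An export cut off partway through is left with unclosed
    elements, so the parser raises ParseError on it.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

# A complete document parses; one cut off mid-way does not.
print(export_is_complete("<rss><channel><item/></channel></rss>"))
print(export_is_complete("<rss><channel><item>"))
```

This only catches a file that stops partway through. It says nothing about whether every post actually made it into the export.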

Then, of course, there’s a problem. If you look at jchris’s blog, the URLs are all horrible: the index page is http://jchrisa.net/drl/_design/sofa/_list/index/recent-posts?descending=true&limit=5, and an individual post is http://jchrisa.net/drl/_design/sofa/_show/post/Book-progress. I want my nice Wordpress URLs: / for the index, and /2009/09/13/slug-name for a post. This is, of course, doable by putting CouchDB behind some sort of proxy. Some people would use nginx for this; since I have Apache running on our server anyway, I went with mod_rewrite and mod_proxy:

RewriteEngine on
# Front page
RewriteRule ^$ http://127.0.0.1:5984/blogdb/_design/sofa/_list/index/recent-posts?descending=true&limit=5 [P]
# assets
RewriteRule assets/script/(.*) http://127.0.0.1:5984/_utils/script/$1 [P]
RewriteRule assets/(.*) http://127.0.0.1:5984/blogdb/_design/sofa/$1 [P]
# feed
RewriteRule feed/atom http://127.0.0.1:5984/blogdb/_design/sofa/_list/index/recent-posts?descending=true&limit=5&format=atom [P]
# /2009
RewriteRule ^([0-9][0-9][0-9][0-9])/?$ http://127.0.0.1:5984/blogdb/_design/sofa/_list/index/recent-posts?descending=true&limit=500&startkey="$1/12/32"&endkey="$1/01/00" [P]
# /2009/08
RewriteRule ^([0-9][0-9][0-9][0-9])/([0-9][0-9])/?$ http://127.0.0.1:5984/blogdb/_design/sofa/_list/index/recent-posts?descending=true&limit=500&startkey="$1/$2/32"&endkey="$1/$2/00" [P]
# /2009/08/12
RewriteRule ^([0-9][0-9][0-9][0-9])/([0-9][0-9])/([0-9][0-9])/?$ http://127.0.0.1:5984/blogdb/_design/sofa/_list/index/recent-posts?descending=true&limit=500&startkey="$1/$2/$3 24"&endkey="$1/$2/$3 00" [P]
# /2009/08/12/post.html
RewriteRule ^([0-9][0-9][0-9][0-9])/([0-9][0-9])/([0-9][0-9])/post\.html$ http://127.0.0.1:5984/blogdb/_design/sofa/_show/post/post.html [P]
# /2009/08/12/slug
RewriteRule ^([0-9][0-9][0-9][0-9])/([0-9][0-9])/([0-9][0-9])/?(.*)$ http://127.0.0.1:5984/blogdb/_design/sofa/_show/post/$1-$2-$3-$4 [P]
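To make it easier to see what those rules do, here is the same mapping sketched as a Python 3 function. It covers only the front page, the date archives and individual posts (not the assets, feed or post.html rules), and the names SOFA, RECENT and backend_url are mine. The archive rules appear to rely on the view keys being created_at strings like "2009/08/12 10:00:00 +0000", which is the format the import script writes, so "2009/08/32" sorts after every real day in August.

```python
import re

# Assumed to match the rules above: database "blogdb", design doc "sofa".
SOFA = "http://127.0.0.1:5984/blogdb/_design/sofa"
RECENT = SOFA + "/_list/index/recent-posts?descending=true&limit="

def backend_url(path):
    """Map a public path (no leading slash) to the CouchDB URL it is proxied to.

    Returns None for paths this sketch does not cover.
    """
    if path == "":
        return RECENT + "5"  # front page
    m = re.match(r"^([0-9]{4})/?$", path)  # /2009
    if m:
        year = m.group(1)
        return RECENT + '500&startkey="%s/12/32"&endkey="%s/01/00"' % (year, year)
    m = re.match(r"^([0-9]{4})/([0-9]{2})/?$", path)  # /2009/08
    if m:
        ym = "%s/%s" % m.groups()
        return RECENT + '500&startkey="%s/32"&endkey="%s/00"' % (ym, ym)
    m = re.match(r"^([0-9]{4})/([0-9]{2})/([0-9]{2})/?$", path)  # /2009/08/12
    if m:
        ymd = "%s/%s/%s" % m.groups()
        return RECENT + '500&startkey="%s 24"&endkey="%s 00"' % (ymd, ymd)
    m = re.match(r"^([0-9]{4})/([0-9]{2})/([0-9]{2})/?(.*)$", path)  # /2009/08/12/slug
    if m:
        # Rebuilds the "YYYY-MM-DD-slug" document ID the import script created
        return SOFA + "/_show/post/%s-%s-%s-%s" % m.groups()
    return None

print(backend_url("2009/09/13/slug-name"))
```

As with the Apache rules, order matters: the day archive has to be tried before the slug rule, because the slug pattern also matches a bare /2009/08/12.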

All well and good. But…then we hit the problem, and it’s not a very resolvable problem. You see, Sofa writes out HTML, and it writes out links and so on in the nasty Sofa format. You can poke Sofa’s JS files to write them out differently, which I’ve done (edit indexPath, feedPath, and post.link in sofa/lists/index.js), but changing things like the URL that comments are fetched from is harder. The Couch JavaScript files assume they are running on Couch, and can therefore do things like inspect the current URL to work out which database they’re in. Take one of jchris’s URLs, for example: in http://jchrisa.net/drl/_design/sofa/_show/post/Book-progress, drl is the database name, sofa is the design document name, and Book-progress is the post ID. If you’ve rewritten your URL to /2009/09/13/book-progress, that doesn’t work.

Fixing this is awkward, because view URLs are calculated by jquery.couch.js, which is a stock part of Couch, and so a fix would involve forking rather a lot of Sofa. Apparently there’s work going on by the Couch team to do native URL rewrites inside Couch itself. My migration to Sofa will likely have to wait until then, unless I write my own CouchDB blogging engine, which is more than a weekend’s job, I fear. Shame. At least now I don’t have to worry about how I get CouchDB 0.10 running on ancient Debian…
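For the curious, here is roughly what that URL inspection amounts to, sketched in Python 3 rather than JavaScript. This is not the actual jquery.couch.js or Sofa code, just an illustration of the idea; the function name couch_context and the exact path shape it expects are my assumptions.

```python
def couch_context(path):
    """Guess the database, design document and post ID from a URL path.

    Expects the CouchDB-style shape <db>/_design/<ddoc>/_show/post/<id>
    and returns None for anything else.
    """
    parts = path.strip("/").split("/")
    if (len(parts) == 6
            and parts[1] == "_design"
            and parts[3:5] == ["_show", "post"]):
        return {"db": parts[0], "design": parts[2], "post_id": parts[5]}
    return None

# Works on a native Couch path...
print(couch_context("/drl/_design/sofa/_show/post/Book-progress"))
# ...but a rewritten Wordpress-style path carries none of that information.
print(couch_context("/2009/09/13/book-progress"))
```

The rewritten path simply doesn’t contain the database or design document names, so anything that derives them from the URL has nothing to work with.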
