Brad Wilson - The .NET Guy

Technologist. Agile Evangelist. Poker Player. Amateur Neologist. Metalhead.

Blog Importer in Ruby

Introduction

This year, I'm learning Ruby and Rails in an effort to keep my mind wrapped around newer technologies. My first major Ruby project was to write an import script to move my content from MovableType to .Text. This article documents the process and results.

The Old Content

I figured the easiest way to get the old content off my box was in some kind of XML format. I already had one of those in RSS, so I just needed a MovableType template that exported my entire blog to RSS, rather than just the newest few posts. I used RSS 2.0, and ended up with a giant XML file that looks like this:

<rss>
   <channel>
      ...irrelevant stuff describing the blog...
   </channel>
   <item>
      <title>Welcome!</title>
      <link>http://dotnetguy.techieswithcats.com/archives/001082.shtml</link>
      <description><![CDATA[...post content...]]></description>
      <category>This Blog</category>
      <pubDate>2001-11-30T10:29:00-08:00</pubDate>
   </item>
   ...many more items...
</rss>

One significant challenge is that the content needs to be cleaned up in a few ways:

  • There are links to the existing blog posts and content
  • There are old broken links to previous sites that never got cleaned up
  • The HTML is in various states of disrepair, and could do with a good tidying

The New Blog

The new blog runs .Text, and its primary remote API for posting is the MetaWeblog API. The hardest part of the project was picking through the spec and guessing how .Text expected a post to look. I ended up writing a simple Ruby wrapper around the MetaWeblog API:

require 'xmlrpc/client'

module MetaWebLogAPI
  class Client
    def initialize(server, urlPath, blogid, username, password)
      @client = XMLRPC::Client.new(server, urlPath)
      @blogid = blogid
      @username = username
      @password = password
    end
   
    def newPost(content, publish)
      @client.call('metaWeblog.newPost', @blogid, @username,
          @password, content, publish)
    end

    def getPost(postid)
      @client.call('metaWeblog.getPost', postid, @username,
          @password)
    end

    def editPost(postid, content, publish)
      @client.call('metaWeblog.editPost', postid, @username,
          @password, content, publish)
    end
  end
end

You can see that the existing Ruby XML-RPC library makes writing this wrapper easy. The library represents XML-RPC structs as Hash objects, so making a new post takes only a few lines:

client = MetaWebLogAPI::Client.new('1.2.3.4', '/path/to/weblogapi',
    'blogid', 'username', 'password')

blogpost = {
  'title' => 'New post!',
  'description' => 'This is the body of my new post',
  'dateCreated' => Time.gm(2005, 5, 31, 15, 0, 0) # May 31, 2005 at 3:00 PM UTC
}

client.newPost(blogpost, true)

A little testing and it's clear that I have a good handle on what I need to be able to post to .Text from Ruby. The beginnings of our application become clear:

require 'metaweblogapi'

class MetaWebLogImport
  def initialize
    # MetaWebLogAPI configuration items
    metaBlogServer = 'www.agileprogrammer.com'
    metaBlogApi = '/dotnetguy/services/metablogapi.aspx'
    metaBlogId = 'dotnetguy'
    metaBlogUser = 'myuser'
    metaBlogPassword = 'mypassword'

    @metaBlogClient = MetaWebLogAPI::Client.new(metaBlogServer,
        metaBlogApi, metaBlogId, metaBlogUser, metaBlogPassword)
  end

  def run
  end
end

MetaWebLogImport.new.run

The intention is clear: fill in run as we move from task to task.

Parsing

My existing content is sitting in an RSS file that needs to be parsed. I lucked out here, since there's an RSS reader library available for Ruby, which means I can read the RSS in a single step.

require 'rss/2.0'

class MetaWebLogImport
  def read_original_rss
    File.open('export.xml') do |file|
      @originalRss = RSS::Parser.parse(file.read, false)
    end
  end

  def run
    read_original_rss
  end
end

The RSS::Parser class returns an object that has attributes for all the members of an RSS feed (notably, we'll take advantage of link, title, description, category, and pubDate). 

Posting

Now we have all the old content, easily accessible. However, we know that the content contains links to itself, and we don't know what the new links will be; the blog engine only assigns a link once a post has been created. The one thing that uniquely identifies each piece of existing content is its existing link, so we'll use that as our key.

We'll keep a couple of hashes to help us keep track of this. The first is @newPostIdsByOldLink, which maps each old link to the post ID given to us by the blog engine. The second is @linksRedirects, a general bucket mapping old URLs to new URLs. We add all the posts to the bucket, so we can eventually scour the old post content and fix up all the links. The benefit here is that you can pre-populate the bucket with anything you might move over by hand (in my case, images and binaries), and you can find old broken links and fix them up later (we'll produce a report later that can help you do this).

Here's the code I wrote to support this next step.

class MetaWebLogImport
  def make_post_content(item, description = "[place-holder content]")
    return {
      "title" => item.title,
      "description" => description,
      "dateCreated" => item.pubDate,
      "categories" => [ item.category.content ]
    }
  end

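  def post_id_to_permalink(postId)
    # getPost returns the XML-RPC struct as a Hash, so index it by key
    @metaBlogClient.getPost(postId)['link']
  end

  def post_items
    # Create the lookup tables unless they were pre-populated elsewhere
    @newPostIdsByOldLink ||= {}
    @linksRedirects ||= {}

    @originalRss.items.each do |item|
      newPostId = @metaBlogClient.newPost(make_post_content(item), false)
      @newPostIdsByOldLink[item.link] = newPostId
      @linksRedirects[item.link] = post_id_to_permalink(newPostId)
    end
  end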

  def run
    read_original_rss
    post_items
  end
end
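
To make the two-hash bookkeeping concrete, here is a minimal sketch of the first pass. It assumes plain Hash objects and swaps in a hypothetical FakeClient for the real server, so the post IDs and all example.com URLs are invented for illustration only.

```ruby
# Stand-in for MetaWebLogAPI::Client so the sketch runs without a server.
# The ids and permalinks it hands out are made up.
class FakeClient
  def initialize
    @posts = {}
  end

  def newPost(content, publish)
    id = (@posts.size + 1).to_s
    @posts[id] = content.merge('link' => "http://new.example.com/archive/#{id}.aspx")
    id
  end

  def getPost(postid)
    @posts.fetch(postid)
  end
end

client = FakeClient.new
new_post_ids_by_old_link = {}

# Pre-populated with a file moved over by hand (hypothetical URLs).
links_redirects = {
  'http://old.example.com/images/cat.jpg' => 'http://new.example.com/images/cat.jpg'
}

old_links = [
  'http://old.example.com/archives/000001.shtml',
  'http://old.example.com/archives/000002.shtml'
]

# First pass: create a place-holder post per old item and record where it landed.
old_links.each do |old_link|
  post_id = client.newPost({ 'title' => 'placeholder' }, false)
  new_post_ids_by_old_link[old_link] = post_id
  links_redirects[old_link] = client.getPost(post_id)['link']
end
```

After this pass, every old link has both a post ID (for the later edit) and a new permalink (for link rewriting), alongside whatever was seeded by hand.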

Cleaning Up and Re-Posting

I mentioned earlier that the content needs to be cleaned up, and we now have all the data we need to make that happen. I didn't realize at the time I wrote my importer that a tidy library exists for Ruby; I left my original implementation in place to show how easy it is to shell out to an application, feed its standard input, and consume its standard output. Tidy cleans up our HTML and turns it into XHTML, so I can easily parse it using REXML (an XML parser for Ruby).

There's a big bunch of stuff going on here, so I'll explain it after we see the code.

require 'rexml/document'

class MetaWebLogImport
  def tidy_to_xhtml(input)
    # Pipe the HTML through tidy: write to its stdin, then read its stdout
    IO.popen("tidy -q -b -c -asxml -f /dev/null", "w+") do |cmd|
      cmd.puts input
      cmd.close_write
      cmd.read
    end
  end

  def cleansed_content(element)
    content = "";
    element.elements.each { |e| content += e.to_s }
    return content
  end

  def replace_links(xml, xpath, attributeName, oldLink)
    xml.elements.each(xpath) do |e|
      href = e.attributes[attributeName]
      newlink = @linksRedirects[href]

      if newlink
        e.attributes[attributeName] = newlink
      else
        if @oldSiteUrlRegex.match(href) then
          (@unmatchedLinks[href] ||= []) << oldLink
        end
      end
    end
  end

  def scan_content_for_links
    # Collects old-site links that have no replacement yet
    @unmatchedLinks ||= {}

    @originalRss.items.each do |item|
      xml = REXML::Document.new(tidy_to_xhtml(item.description))
      replace_links(xml, '//a', 'href', item.link)
      replace_links(xml, '//img', 'src', item.link)
      post = make_post_content(item, cleansed_content(xml.elements["//body"]))
      @metaBlogClient.editPost(@newPostIdsByOldLink[item.link], post, true)
    end
  end

  def run
    read_original_rss
    post_items
    scan_content_for_links
  end
end

Starting at the bottom with scan_content_for_links, you can see that we have just a few things to do. For each of our posts, we run the content through tidy to turn it into XML. We use XPath queries to discover the links we want to clean up (href attributes of a tags, and src attributes of img tags). Then we extract that cleansed content and push it back up to the blog server.

In tidy_to_xhtml, you can see that we shell out to run tidy, and feed the content to standard input while consuming standard output. Ruby makes things like this remarkably easy.
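
The same write-to-stdin, read-from-stdout pattern can also be written with the standard library's Open3 module. This is only a sketch: filter_through is a hypothetical helper name, and because tidy may not be installed everywhere, a child Ruby process that upcases its input stands in for it.

```ruby
require 'open3'
require 'rbconfig'

# Runs a command, writes `input` to its standard input, and returns
# everything the command wrote to standard output.
def filter_through(input, *command)
  output, _status = Open3.capture2(*command, stdin_data: input)
  output
end

# Stand-in filter: a child Ruby process that upcases whatever it reads.
# For the importer, the command would be tidy with the flags shown above.
stand_in = [RbConfig.ruby, '-e', 'print STDIN.read.upcase']
result = filter_through('<p>hello</p>', *stand_in)
```

Passing the command as separate arguments avoids going through a shell, which matters once the command line is built from anything other than fixed strings.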

The purpose of cleansed_content is to extract the content out of the <body> tag that tidy gives me back. It feels wrong, but since I'm new to Ruby I'm not sure whether there's a more compact and expressive way to do it. (You can see how much Ruby changes your thinking, when you're concerned that a method body of 3 lines feels too long!)
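
One possibly more compact form, offered as a sketch and not as the canonical way: on a current Ruby, the loop can become a map and a join. The name inner_xml is hypothetical. Like cleansed_content, it only serializes child elements, so text sitting directly inside the body is dropped.

```ruby
require 'rexml/document'

# Serializes the child elements of `element` and concatenates them,
# which is what cleansed_content does with an explicit loop.
def inner_xml(element)
  element.elements.to_a.map(&:to_s).join
end

doc = REXML::Document.new('<html><body><p>one</p><p>two <b>bold</b></p></body></html>')
body_markup = inner_xml(doc.elements['//body'])
```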

Finally, replace_links. This method also feels wrong because of its size; it may be doing too much, but I didn't bother to refactor it. It searches the XML for all links, using the @linksRedirects bucket we made earlier to replace old links with new ones. Additionally, it stashes away any link that looks like a link to the old site but has no replacement.
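
To show the three outcomes for a link (rewritten, recorded as unmatched, or left alone), here is a small stand-alone sketch of the same logic run against a tiny document. The redirect table, the old-site pattern, and every example.com URL are assumptions made up for the example.

```ruby
require 'rexml/document'

# Hypothetical data: one known redirect and a pattern for the old site.
links_redirects = {
  'http://old.example.com/archives/000001.shtml' => 'http://new.example.com/archive/1.aspx'
}
old_site_url_regex = %r{\Ahttp://old\.example\.com/}
unmatched_links = {}
post_link = 'http://old.example.com/archives/000002.shtml'

xml = REXML::Document.new(<<~XHTML)
  <body>
    <p><a href="http://old.example.com/archives/000001.shtml">first post</a></p>
    <p><a href="http://old.example.com/files/tool.zip">a download</a></p>
    <p><a href="http://elsewhere.example.org/">someone else</a></p>
  </body>
XHTML

xml.elements.each('//a') do |a|
  href = a.attributes['href']
  if links_redirects.key?(href)
    # Known old URL: rewrite it in place.
    a.attributes['href'] = links_redirects[href]
  elsif href =~ old_site_url_regex
    # Points at the old site but has no new home yet: remember who links to it.
    (unmatched_links[href] ||= []) << post_link
  end
  # Anything else is an external link and is left alone.
end

hrefs = xml.elements.to_a('//a').map { |a| a.attributes['href'] }
```

Only the first link changes; the second ends up in the unmatched table keyed by its URL, with the referring post's old link as the value.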

Apache Redirects

My old site runs on a Unix host running Apache, so I used my redirect bucket to generate the redirect file for me automatically.

class MetaWebLogImport
  def write_htaccess
    open('.htaccess', 'w+') do |file|
      @linksRedirects.each do |key, value|
        match = @oldBlogUrlRegex.match(key)
        file.puts 'RedirectPermanent ' + match.post_match + ' ' + value if match
      end
    end
  end

  def run
    read_original_rss
    post_items
    scan_content_for_links
    write_htaccess
  end
end
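
For a sense of what ends up in the file, here is a sketch that builds the same lines in memory. The pattern standing in for @oldBlogUrlRegex and the URLs are hypothetical; the real regex only needs to match the scheme and host of the old blog, so that post_match yields the URL path Apache expects.

```ruby
require 'stringio'

# Hypothetical values standing in for @oldBlogUrlRegex and @linksRedirects.
old_blog_url_regex = %r{\Ahttp://old\.example\.com}
links_redirects = {
  'http://old.example.com/archives/000001.shtml' => 'http://new.example.com/archive/1.aspx',
  'http://other.example.org/page.html'           => 'http://new.example.com/page.html'
}

out = StringIO.new
links_redirects.each do |old_url, new_url|
  match = old_blog_url_regex.match(old_url)
  next unless match # not served by the old blog's host, so no redirect line
  # post_match is everything after the matched prefix, i.e. the URL path.
  out.puts "RedirectPermanent #{match.post_match} #{new_url}"
end

htaccess = out.string
```

Entries whose old URL lives on some other host are skipped, since the old server never receives requests for them.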

Finding What We Missed

The first time you run the script, you will discover that a lot of things are still badly linked. A blog is rarely just its posts: at a minimum, people will generally upload photos and other ancillary binary files. Having a list of such broken links would be very helpful. Remember when we stashed our unresolved links into the @unmatchedLinks bucket? That's why.

class MetaWebLogImport
  def write_unresolved_links_list
    open('unresolved_links.txt', 'w+') do |file|
      @unmatchedLinks.each do |link, references|
        file.puts link + ":"
        references.each { |ref| file.puts "    " + ref }
        file.puts
      end
    end
  end

  def run
    read_original_rss
    post_items
    scan_content_for_links
    write_htaccess
    write_unresolved_links_list
  end
end
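
As an illustration of the report's layout, this sketch renders a made-up @unmatchedLinks table into a string: each unresolved URL, followed by the indented old links of the posts that reference it, then a blank line. The URLs are hypothetical.

```ruby
require 'stringio'

# Hypothetical contents of @unmatchedLinks after a run:
# unresolved URL => old links of the posts that reference it.
unmatched_links = {
  'http://old.example.com/files/tool.zip' => [
    'http://old.example.com/archives/000002.shtml',
    'http://old.example.com/archives/000007.shtml'
  ]
}

report = StringIO.new
unmatched_links.each do |link, references|
  report.puts "#{link}:"
  references.each { |ref| report.puts "    #{ref}" }
  report.puts
end

text = report.string
```

Once a listed file has been moved by hand, adding its old and new URLs to the redirect bucket before the next run should make the entry disappear from the report.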

Closing Thoughts

Being a Ruby newbie, I learned a lot putting it to a well-defined task. It took me about 8 hours to write (and today, I could probably re-write it in about 2 hours). Another bunch of time was spent trying to understand how XML-RPC worked inside of Ruby, and what .Text (the blog engine I'm moving to) expected when I was adding and updating posts.

posted on Saturday, June 25, 2005 12:10 PM