Migrating Squarespace to Drupal - Part 1: Giant XML File of Doom

This is Part 1 in a series of posts on migrating a site from Squarespace 5 to Drupal 7. It's an overview of what we're doing and why, mostly covering the format of the Squarespace 5 export file, which is undocumented. My hope is to make the process of importing Squarespace data easier for everyone else.

The Mission

I'm working on a fairly large site (ApertureExpert.com), currently running on Squarespace 5, which we're migrating to Drupal 7. Squarespace is an excellent platform with many virtues, but it doesn't allow the level of customization my client is looking for, so we're moving the site to Drupal. Of course, you lose the built-in, high-availability hosting and the polished UI that Squarespace gives you, so this move is not for everyone. But if you want control, and you want your blog, forums, and e-commerce wrapped up in a single package, Drupal is your friend. One example: my client wants users to have all their purchased products, forum posts, and comments available through their user profile. Squarespace can't do that; Drupal can.

The site has been running for several years and has a very active community, so the first big task was to get all of that existing content into Drupal. Fortunately, Squarespace 5 will let you export your entire website as XML. Note: it doesn't look like Squarespace 6 (the current version) supports full site export. There is a WordPress-compatible export that you could import into Drupal using WordPress Migrate (or, obviously, into WordPress), but it seems to cover only blog posts. This site is older and runs Squarespace 5, so we can get everything.

Now that we have exported data, what can we do with it?

Migrate

Enter Migrate module, maintained by Mike Ryan. My friend and colleague Ashok Modi has talked about Migrate many times at LA Drupal meetups and camps, and this was one of the first projects I've had where I really needed it. To move data using Migrate, you write a fairly simple PHP class for each thing (node, comment, user, taxonomy term, etc) and wrap them in a Drupal module. Migrate provides all the essentials of data migration: importing (duh), rollback (deleting imported items so you can try again, especially important if there's existing content on your Drupal site), updating previously imported items, easy mapping of fields, data massaging functions, dependency management, and more. As I'll now say to anyone who will listen: it's the best. Migrate can import from nearly anything: any SQL database Drupal supports (MySQL, Postgres, Oracle, MSSQL, etc), CSV and XML files, probably more. It has very good documentation and excellent example code. The example code is so good you can probably write many migrations with copy and paste, and a few modifications. It's been a while since I've had to do this kind of work manually, but if there's a framework or CMS out there with a better migration tool than this, I'll eat my hat.
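To give you a taste before the next post, here's a minimal sketch of the shape one of these classes takes, modeled on the XML example that ships with Migrate. Everything in it (class name, file path, field names) is illustrative, untested, and it assumes the user accounts have already been extracted into their own XML file, which is exactly what the rest of this post is about:

<?php
// Sketch only: this would live in a custom module that implements
// hook_migrate_api(). It assumes a users.xml file shaped like the
// samples later in this post, wrapped in a <data> root element.
class SquarespaceUserMigration extends XMLMigration {
  public function __construct() {
    parent::__construct();
    $this->description = t('Import Squarespace member accounts.');

    // Describe the available source fields (used by the Migrate UI).
    $fields = array(
      'login' => t('Login name'),
      'email' => t('Email address'),
    );

    // One row per <website-member-account> element, keyed by <website-id>.
    $items = new MigrateItemsXML(
      'public://migrate/users.xml',
      '/data/website-member-account',
      'website-id'
    );
    $this->source = new MigrateSourceMultiItems($items, $fields);
    $this->destination = new MigrateDestinationUser();

    // Track source IDs against Drupal user IDs; this is what makes
    // rollback and updates possible.
    $this->map = new MigrateSQLMap($this->machineName,
      array('website-id' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE)),
      MigrateDestinationUser::getKeySchema()
    );

    // Each mapping pulls its value out of the current element with XPath.
    $this->addFieldMapping('name', 'login')->xpath('login');
    $this->addFieldMapping('mail', 'email')->xpath('email');
    $this->addFieldMapping('status')->defaultValue(1);
  }
}

Once Migrate knows about the class, it shows up in the UI and in drush alongside any other migrations. Part 2 will cover the real classes in detail.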

Giant XML File of Doom (GXFD)

It's great that Squarespace will export all your data, but unfortunately the file is completely undocumented. There's no schema, no comments, no web page explaining anything about it. You can't even use it to back up a site and import it back into Squarespace! It's a lot better than nothing, and at least it's XML (i.e. not some weird proprietary format), but it's just barely sufficient.

Another problem: it's One. Giant. File. I can't share the actual export file from this website, but I can say this: the site has been live since the beginning of 2010, and it has hundreds of blog entries with thousands of comments, plus over ten thousand forum posts, all by several thousand users. Its export file is about 50 MB of ugly XML. That's a lot of text, and it makes the file difficult to deal with. With no documentation, I had to open the file and inspect it myself to figure out how it was put together. I tried several dedicated XML editors, thinking their outlining tools would help, but none of them could handle a file this large efficiently. At least not on my computer (which is just a couple of years old, and quite a champ in my opinion). I thought I might finally need to upgrade to an SSD to make this work! BBEdit, my go-to editor, was able to open the file, probably because it doesn't do outlining or structure scanning on most XML files, so I kept searching for something more helpful.

XMLstarlet

Enter XMLstarlet. This command-line utility (I installed it with Homebrew) is an XML powerhouse. I used it primarily to run XPath queries while figuring out how the GXFD was put together, and to extract its element structure. I strongly recommend it. It doesn't support XSLT 2.0, which would have helped me a time or two, but overall it's the bee's knees. Once installed, the command is xml, with lots of options (see its documentation for everything). Here are a couple of examples:

To dump the element structure of an XML file (the -u switch prints every element path, with duplicates removed):

% xml el -u ss-data.xml

The output looks like this, with attributes marked by @:

squarespace-wireframe
squarespace-wireframe/@websiteId
squarespace-wireframe/@websiteIdentifier
squarespace-wireframe/@createdOn
squarespace-wireframe/audiences
squarespace-wireframe/audiences/audience
squarespace-wireframe/audiences/audience/name
squarespace-wireframe/audiences/audience/id
squarespace-wireframe/audiences/audience/display-name
squarespace-wireframe/audiences/audience/description
squarespace-wireframe/audiences/audience/description/@encoding
...

I usually save these directly into a file using a redirect:

% xml el -u ss-data.xml > filename.txt

Or pipe the output right into BBEdit:

% xml el -u ss-data.xml | bbedit

You can download the full structure of the Squarespace 5 site I'm working on here. It's a zipped text file.

As you can see, there's a lot of stuff in there, and trying to sort out this structure visually would have been basically impossible. I imagine, but don't know, that this is pretty similar across Squarespace 5 sites. Without any data, it's 71 KB. That's a lot of elements!

Segmenting the Export File with XPath

To build the migration, I needed to find the XPath queries that would return the data I wanted. This took a while, but using XMLstarlet in a BBEdit shell worksheet, I was able to experiment easily and save the commands as I went. Shell worksheets are like a scratchpad for the terminal, and they were very helpful for this work. In each of the following cases, I used XMLstarlet to extract the queried items into their own file, ideally one file per migration. That means some extra work up-front to produce the files, but it could be automated if needed, and each migration uses less memory because it processes a smaller file (or files, in some cases).

The commands I used to create these smaller files look like this:

% xml sel -I -t -m "/xpath/query/element" -c . -n squarespace-data-file.xml > squarespace-subset.xml

The command sel tells XMLstarlet to select a portion of the document using XPath. You can see what all the switches mean in the documentation, but in short: this command matches every /xpath/query/element node (-m) in the source document squarespace-data-file.xml, copies each one (-c .) followed by a newline (-n), and indents the output (-I), with the result redirected to a file called squarespace-subset.xml.

After you've done this, you end up with a file full of elements, like this:

<element>(lots of data)</element>
<element>(lots of data)</element>
<element>(lots of data)</element>
<element>(lots of data)</element>
<element>(lots of data)</element>

To make this a valid XML file, you need to add a containing root element and an XML declaration:

<?xml version="1.0" encoding="utf-8"?>
<data>
  <element>(lots of data)</element>
  <element>(lots of data)</element>
  <element>(lots of data)</element>
  <element>(lots of data)</element>
  <element>(lots of data)</element>
</data>

Having done that prep work, you'll be able to use the file with Migrate.
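That wrapping is easy to script, too. Something like this (filenames are just examples) prints the declaration and root element around the extracted output:

% { printf '<?xml version="1.0" encoding="utf-8"?>\n<data>\n'; cat squarespace-subset.xml; printf '</data>\n'; } > squarespace-subset-wrapped.xml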

Finding the Goodies with XPath

Here are the XPath queries I used to extract the relevant data, and samples (trimmed and modified for public consumption) of the results.

User accounts:

/squarespace-wireframe/website-member-accounts/website-member-account

These are fairly straightforward, thankfully.

<website-member-account>
  <website-id>XXXXXXX</website-id>
  <email>[email protected]</email>
  <first-name>Mister</first-name>
  <last-name>Person</last-name>
  <login>misterperson</login>
  <deleted>false</deleted>
  <is-enabled>true</is-enabled>
  <is-confirmed>true</is-confirmed>
  <encrypted-password>cbjRTKkQDuQcHpjdQR2MqQS3JlE=</encrypted-password>
  <display-name>Mister Personality</display-name>
</website-member-account>

Everything I need is in these elements. Users will need to reset their passwords, but their email addresses, usernames, and everything else is preserved. Nice!
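Plugged into the sel invocation from the previous section, the extraction command for user accounts looks like this (the output filename is arbitrary):

% xml sel -I -t -m "/squarespace-wireframe/website-member-accounts/website-member-account" -c . -n ss-data.xml > users.xml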

Blog posts:

/squarespace-wireframe/sections/section/modules/module[id='0000000']/content/journal-entry[not(parent-id)]

This query says "give me all the blog posts in module 0000000 that do not have a parent-id element". parent-id indicates a Followup post, explained below.

The biggest issue here is finding the Module ID. View the HTML source of your blog's main page, and you should see something like this in some initial JavaScript:

Squarespace.Constants.CURRENT_MODULE_ID = "0000000";

Or this in the page's body tag:

<body id="modulePage0000000">

Once you have that ID, plug it into that XPath query, and you should be all set.
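The full extraction command then looks like this (module ID and filenames are placeholders, as always):

% xml sel -I -t -m "/squarespace-wireframe/sections/section/modules/module[id='0000000']/content/journal-entry[not(parent-id)]" -c . -n ss-data.xml > blog-posts.xml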

Sample data:

<journal-entry>
  <title>My first post!</title>
  <url-id>2010/1/1/my-first-post.html</url-id>
  <body encoding="base64">R29vZCB0aW1lcyE=</body>
  <registered-author-id>1111111</registered-author-id>
  <added-on>2010-01-01 18:28:07.0</added-on>
  <updated-on>2010-01-12 01:05:58.287</updated-on>
  <excerpt/>
  <module-id>0000000</module-id>
  <published>true</published>
  <allow-comments>true</allow-comments>
  <id>9999999</id>
</journal-entry>

Once again, pretty much everything I need to create a node is here. I can even preserve URL paths using that url-id element. +1 for Squarespace here. -100 for making me dig through XML to find it on my own, but still, thanks. Also note the base64-encoded body, which will be dealt with later.
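As a quick preview of "later": recent versions of Migrate can run a PHP callback on a source value as it's mapped, so the decoding can be a one-liner in the migration class. The field names here are illustrative:

// Hypothetical mapping for the blog post body; decode it on the way in.
$mapping = $this->addFieldMapping('body', 'body');
$mapping->xpath('body');
$mapping->callbacks('base64_decode');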

Blog comments:

/squarespace-wireframe/sections/section/modules/module[id='0000000']/content/comment
/squarespace-wireframe/sections/section/modules/module[id='0000000']/content/journal-entry-comment

Yes, two queries here. Unfortunately, comments do not contain a reference to their parent post. The first query returns something like this:

<comment>
  <body>I have a comment!</body>
  <registered-author-id>123456</registered-author-id>
  <module-id>0000000</module-id>
  <approved>true</approved>
  <id>8675309</id>
</comment>

The second:

<journal-entry-comment>
  <journal-entry-id>654321</journal-entry-id>
  <module-id>0000000</module-id>
  <comment-id>8675309</comment-id>
  <id>1111111</id>
</journal-entry-comment>

Life would be a lot easier if that journal-entry-id in the second query were in the first query's comment element instead of in this separate element. I wasn't able to get an XSLT transformation doing that merge working myself. If I were going to spend my life migrating Squarespace sites to Drupal, I would sit down and write that code properly, in XSLT or some other language. If you're reading this and have done it, let's talk!
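In the meantime, the textbook XSLT 1.0 tool for this kind of join is an xsl:key lookup. Here's a rough, untested sketch of the idea; the element names come from the samples above, and it assumes the comment and journal-entry-comment elements are in the same input file:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Index the journal-entry-comment elements by their comment-id -->
  <xsl:key name="entry-for-comment" match="journal-entry-comment" use="comment-id"/>

  <!-- Identity transform: copy everything through unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- For each comment, also emit the journal-entry-id found via the key -->
  <xsl:template match="comment">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
      <journal-entry-id>
        <xsl:value-of select="key('entry-for-comment', string(id))/journal-entry-id"/>
      </journal-entry-id>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

XMLstarlet can run it with something like xml tr merge-comments.xsl ss-data.xml > merged.xml. Treat it as a starting point, not a solution: forum post content also lives in comment elements, so you'd want to scope that second match more tightly.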

Forum posts:

/squarespace-wireframe/sections/section/modules/module/modules/module/content/discussion-post[not(parent-id)]

This query says "give me all the forum posts that do not have a parent-id element". (I didn't include a module ID because I wanted all forum posts at once, so I could segment them by container in the migration code.) Unlike blog posts, the content of these posts is not included in the returned data. Example:

<discussion-post>
  <locked>false</locked>
  <module-id>0000000</module-id>
  <comment-id>7777777</comment-id>
  <id>1111111</id>
</discussion-post>

As with blog comments, you need a separate query to retrieve the content.

/squarespace-wireframe/sections/section/modules/module/modules/module/content/comment

The returned data looks like the output of the first blog comment query above. In this case, it would be nice to insert the content of the forum post right into that discussion-post element, but as with the comment issue, I didn't spend the time to figure out how to do that (the xsl:key sketch above would be the place to start).

Forum comments:

In Drupal's core Forum module, original forum posts/topics are nodes, and all follow-ups are comments. Each Migrate class creates a single type of object, so I had a separate class for forum comments. Here's the XPath to retrieve those:

/squarespace-wireframe/sections/section/modules/module/modules/module/content/discussion-post[boolean(parent-id)]

This is very similar to the forum posts above, but instead of not, I use boolean, which means "give me all the discussion-post elements that do contain a parent-id element". So it's just like the content above, but with one extra element:

<discussion-post>
  <locked>false</locked>
  <module-id>0000000</module-id>
  <comment-id>7777777</comment-id>
  <parent-id>1111111</parent-id>
  <id>2222222</id>
</discussion-post>

As with blog comments, I need a separate query to retrieve the content. In this case, I used the same file generated by the second query for forum posts above, and let my migration code pull out the right ones.

Followup posts:

Squarespace lets you create separate posts that are displayed above comments, but below the original post's content. These are just like the Blog posts above, but include a parent-id element.

/squarespace-wireframe/sections/section/modules/module[id='0000000']/content/journal-entry[boolean(parent-id)]

The data looks exactly the same as regular blog posts. The trick here was figuring out what to do with them, since Drupal doesn't offer an equivalent object. For this first pass, we opted to update the bodies of the parent posts with the followup content, but you could also create special comments, or even a custom Drupal entity.

Next steps

After working through all of this, you should have a set of smaller, more readable XML files for your migrations. Next up, we'll look at how to write the PHP code that powers these migrations, and how to run them using every Drupalero's friend: drush.

Special thanks to Joseph Linaschke, fine photographer and proprietor of ApertureExpert.com, who hired me to do all this work.
