Using the Migrate module to handle big data imports
We recently migrated a large, popular radio site from a bespoke .NET application to Drupal. Migrating large amounts of content, users, tags, relationships and files is made easier with the migrate module. Here's how we did it.
There is always a fear surrounding the decision to move a site or application to a new platform, new host, new database engine. Nobody can truly predict, at least with any absolute certainty that it will go smoothly, or that it will work out well. Then, once the decision to make the move has been made, the bigger task of working out how the move will be made must be played out.
We recently moved a very large digital-media focused site, NovaFM, to Drupal and these were the concerns that we were faced with. How do we move all that content, media, users, tags, relationships, to Drupal? Especially considering the client had a custom CMS that used a MS-SQL database.
This was the task I was facing and naive, positively charged me thought, no problem. Initially I thought it would be best to write some custom shell-based PHP for copying the data over from the existing database, how quickly I was proven wrong.
In came the Migrate module.
Lets have a look at a very simple Migration for this project I was working on, and we can build on that Migration through the rest of this post.
class ArticlesMigration extends Migration { public function __construct() { parent::__construct(MigrateGroup::getInstance()); $this->description = 'Imports articles'; $fields = array( 'Title', 'Description', ); $query = db_select('articles', 'a') ->fields('a', $fields); $this->source = new MigrateSourceSQL($query); $this->destination = new MigrateDestinationNode('article'); $pkmap = array('article_id' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'alias' => 'a')); $this->map = new MigrateSQLMap($this->machineName, $pkmap, MigrateDestinationNode::getKeySchema()); $this->addSimpleMappings(array('title')); $this->addFieldMapping('body', 'description') ->arguments(array('format' => 'full_html')); } }
This is as simple as it gets. Create a module, have a dependency of the 'Migrate' module, create a class and define your migration.
Lets go through what is happening above.
- Give your migration class a name unique to your project.
- In your constructor, construct the parent and you can pass a reference here to a specific instance, but we'll leave this blank.
- Give your migration a description - this can help if you are building many migrations. The Migrate module UI can easily become confusing.
- Define your source fields, you dont need to do it separately from your query, but when you are dealing with lots of columns, it can be much easier to read if they are kept separate.
- Define your destination. This is the node type to create with your articles. The destination can also be an instance of several other classes to accomodate for Terms (MigrateDestinationTerm) or Users (MigrateDestinationUser) or Media (MigrateDestinationMedia provided by Migrate Extras,) but we wont be covering those in this post.
- Define your primary key map. This is the primary key in the source table, aptly named 'article_id,' we use this primary key to keep track of the articles we have already migrated (for accessing or relating Migrations - more on that below).
- Lastly, we tell Migrate what field goes where - ie, we write our map. We use two different methods here. The first 'addSimpleMappings' is for simple mappings, ie, the column name doesn't change between source and destination and we dont need to add any other data. The second, addFieldMapping is the more utilised method and allows us to change the name (put into the 'body' field, the contents of the 'description' column.) It also allows us to specify on that field, that the content should be treated as full_html.
Thats it, your first Migration has been written. Super easy! But, highly doubtful that this handles everything you need, including all your edge cases.
Lets build on that migration by first addressing some potential edge cases.
What if that integer or boolean column is sometimes empty or null, whereas we want to ensure there is always a value?
No problems, we can specify a default value.
$this->addFieldMapping('my_boolean_field', 'my_boolean_column')
->defaultValue('0');
What if that wont cut it? For example, we have a datetime column but we really need a timestamp?
We can add an expression after the query, with a column name (this is more the Drupal 7 database layer, but I thought it was worth mentioning.)
$query->addExpression("UNIX_TIMESTAMP(DatePosted)", 'created');
What if we need to add an author to the node?
No problems, we can specify a 'Source Migration' for the value of that field. For example, because Migrate keeps a copy of the original source id (seen as 'article_id' in the example above) we can reference those original ids.
Firstly, we need to ensure that the Source Migration has a 'createStub' method (explained in more detail below) and then we can get Migrate to create us an empty user (to be populated later) if they don't already exist (because, we dont know their new Drupal issued user id) and then provide our current migration with the user's id.
$this->addFieldMapping('uid', 'user_id')
->sourceMigration('Users');
What if we need to do some form of preprocessing on the MySQL/PGSQL row between the query and the migration?
No problems, we can prepare the row ready for the migration using the prepareRow method.
public function prepareRow($row) {
// $row is the row from the source query.
// We might want to append something to the title
$row->article_title = $row->article_title . ' - THE GREATEST POST IN THE WORLD!';
}
Or, we might want to do something before we even query the source table. Such as append to the log
public function prepare() {
watchdog('my_migration', 'Attempting to migrate a new row', WATCHDOG_DEBUG);
}
Or something after we've imported the entity?
public function complete($entity, $row) {
// $entity is our new node/user/media etc
// $row is the same db row we saw in prepareRow()
watchdog('my_migration', 'The new ' . $entity->type . ' migrated successfully with an id of ' . $entity->entity_id, WATCHDOG_DEBUG);
}
But the single greatest part of the Migrate module was its ability to create stubs. This doesn't sound like much, but when you are importing tens of thousands of nodes and you need to maintain a relationship to a node that may or may not yet exist, it's a god send.
Using our example above, we have an article with an author, but that author may or may not exist in our database yet and rather than do a bunch of ugly pre processing to find that user, we can use the SourceMigration method (mentioned above) to either locate the user in the new Drupal db, or just create a stub.
In this scenario, we would have another migration and it would be called 'Users' (ie 'class UsersMigration extends Migration') and within the Users migration class, we would have a method named 'createStub' - this method is called when we utilise 'SourceMigration' and the source record does not yet exist (ie, our user has not yet been imported.) We create an empty placeholder (stub) and associate that stub with the user's id (user_id) so that when the user is migrated, we can update the existing stub with the user's details, rather than add a new user. Make sense?
So, in our UsersMigration class, we create a new method.
protected function createStub($migration) {
// $migration is our Migrate module object.// Create a new object
$user = new stdClass;
$user->name = 'Stub User';
$user->mail = 'me@example.com';
$user->roles[DRUPAL_AUTHENTICATED_RID] = 'authenticated user';
// Save our buddy Stub.
user_save($user);if (isset($user->uid)) {
return array($user->uid);
}
else {
return FALSE;
}
}
This gives our Article migration a uid (user id) it can use for its author requirements and allows us to continue with no user (yet.)
To put all these examples (minus the stub) together, we get a class that looks like this;
class ArticlesMigration extends Migration {
public function __construct() { parent::__construct(MigrateGroup::getInstance());
$this->description = 'Imports articles';
$fields = array(
'Title',
'Description',
'my_boolean_column',
'user_id',
);$query = db_select('articles', 'a')
->fields('a', $fields);// Our expression
$query->addExpression("UNIX_TIMESTAMP(DatePosted)", 'created');$this->source = new MigrateSourceSQL($query);
$this->destination = new MigrateDestinationNode('article');
$pkmap = array('article_id' => array('type' => 'int',
'unsigned' => TRUE,
'not null' => TRUE,
'alias' => 'a'));$this->map = new MigrateSQLMap($this->machineName, $pkmap, MigrateDestinationNode::getKeySchema());
// Note column names are all lower case - see Notes at the bottom
$this->addSimpleMappings(array('title', 'created'));
$this->addFieldMapping('body', 'description')
->arguments(array('format' => 'full_html'));
// Our default value
$this->addFieldMapping('my_boolean_field', 'my_boolean_column')
->defaultValue('0');
// Our source migration
$this->addFieldMapping('uid', 'user_id')
->sourceMigration('Users');
}
/**
* Prepare our row
*/
public function prepareRow($row) {
$row->article_title = $row->article_title . ' - THE GREATEST POST IN THE WORLD!';
}
/**
* Our pre processing method
*/
public function prepare() {
watchdog('my_migration', 'Attempting to migrate a new row', WATCHDOG_DEBUG);
}
/**
* Our post processing method
*/
public function complete($entity, $row) {
watchdog('my_migration', 'The new ' . $entity->type . ' migrated successfully with an id of ' . $entity->entity_id, WATCHDOG_DEBUG);
}
}
And that is how to write a relatively simple migration using the Migrate module.
Notes and Gotchas;
- Ensure you specify each of your Migrate class files as a 'file[]' in your modules .info. Or they wont be seen by Drupal and Migrate (unless you add them straight into your .module file.)
- All column names are converted to lower case by Migrate. Really helpful, but it can throw you when you're not expecting it, one column will be blank (because it was camel cased) and all the others will be working beautifully.
- When creating stubs, there are several Drupal enforcements you may face, eg, in the createStub method above, the user gets an email address of 'me@example.com' but more than one of these will throw errors and not save, leave it blank and it'll throw errors and not save, so we'd need to generate a random string to populate that field with so that its both unique and not empty.