Migrate from AWS RDS/SQL using ADF

Recently, I came across a requirement from my customer to migrate data from the AWS RDS/SQL service to Azure for some Big Data analysis. The obvious choice for this sort of activity in Azure is Azure Data Factory (ADF). There are many examples of ADF on MSDN covering various data sources and destinations, but a few are missing, one of which is AWS RDS.

So how do you achieve it? Simple: treat AWS RDS/SQL as an on-premises SQL Server and follow the guidance for that specific scenario using Data Management Gateway.

Essentially, you need to do the following from a very high-level perspective:

  1. Create an EC2 instance in AWS and configure the relevant firewall rules (as specified in the guidance).
  2. Deploy Data Management Gateway on the above instance.
  3. Test the RDS/SQL access via the Data Management Gateway tool from the above instance.
  4. Create an ADF data factory that reads from a SQL Server linked service via the gateway (see the sketch after this list).
  5. Map the data.
  6. Store it in the destination of your choice (e.g. Blob storage).
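
For reference, here is a minimal sketch of what the linked service in step 4 could look like in ADF (v1) JSON. The service name, gateway name and connection string values below are placeholders I have assumed, not values from the actual setup:

{
    "name": "RdsSqlLinkedService",
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<your-rds-endpoint>,1433;Initial Catalog=<your-database>;User ID=<user>;Password=<password>;",
            "gatewayName": "AwsEc2Gateway"
        }
    }
}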

Azure Data Factory Table Storage Partition Key

To use a source column as the partition key for the destination Table storage table (the AzureTableSink sink type), you will have to use the azureTablePartitionKeyName property in the pipeline definition, as below:

"sink": {
 "type": "AzureTableSink",
 "azureTablePartitionKeyName": "PostcodeFull",
 "writeBatchSize": 100,
 "writeBatchTimeout": "01:00:00"
 }

Simple, right? Well, it is, but you do have to remember that if you are mapping specific columns to the destination table, the column you want to use as the partition key must also be mapped in the translator section, as below; otherwise you won't get the output you want.

"translator": {
 "type": "TabularTranslator",
 "columnMappings": "PostTown: PostTown, PostcodeFull: PostcodeFull"
 }

The good thing is that this additional mapping will not create an extra column in the destination table.
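
To show how the two pieces fit together, here is a rough sketch of the copy activity with both the sink and the translator in place. The activity name, dataset names and source query are assumptions for illustration, not taken from the original pipeline:

"activities": [
    {
        "name": "CopyToAzureTable",
        "type": "Copy",
        "inputs": [ { "name": "SqlSourceDataset" } ],
        "outputs": [ { "name": "AzureTableDataset" } ],
        "typeProperties": {
            "source": {
                "type": "SqlSource",
                "sqlReaderQuery": "SELECT PostTown, PostcodeFull FROM dbo.Postcodes"
            },
            "sink": {
                "type": "AzureTableSink",
                "azureTablePartitionKeyName": "PostcodeFull",
                "writeBatchSize": 100,
                "writeBatchTimeout": "01:00:00"
            },
            "translator": {
                "type": "TabularTranslator",
                "columnMappings": "PostTown: PostTown, PostcodeFull: PostcodeFull"
            }
        }
    }
]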

ADF with Batch Compute Linked Service: Error with no description

Azure Data Factory (ADF) is one of the more promising services provided by Azure. Bear in mind that it is still in public preview at the moment, so the points mentioned here can become outdated pretty soon (which is also why you don't see many books on Azure). I recently had a requirement to process a large amount of incoming data for analytics, which forced me to dig a bit deeper into this service along with other related services (SQL DW, Event Hubs, HDInsight, etc.).

The problem I was facing was related to recent changes in Azure Batch when it went GA. If you don't specify the region along with the Batch account name in the compute linked service, it will fail without a very descriptive error message:

“ENTITY PROVISIONING FAILED: AZURE BATCH OPERATION FAILED. CODE: '' MESSAGE: ''”

The JSON script I was using was:

{
    "name": "HostLinkedService",
    "properties": {
        "type": "AzureBatch",
        "typeProperties": {
            "accountName": "bigdata",
            "accessKey": "**********",
            "poolName": "xmltoavropool",
            "linkedServiceName": "BatchStorageLinkedService"
        }
    }
}

Pay attention to the accountName value above ("bigdata"), which was causing this error. Change it to bigdata.northeurope and it will start working. This was pointed out to me as a recent breaking change in Azure Batch by Harish Agarwal, one of the MS program managers, on the Disqus forum.
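
For completeness, here is the corrected linked service with the region appended to the account name (the other values are unchanged from the snippet above):

{
    "name": "HostLinkedService",
    "properties": {
        "type": "AzureBatch",
        "typeProperties": {
            "accountName": "bigdata.northeurope",
            "accessKey": "**********",
            "poolName": "xmltoavropool",
            "linkedServiceName": "BatchStorageLinkedService"
        }
    }
}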

As a side note, using the -Debug flag in PowerShell is very useful when you are deploying new artefacts in Azure; it shows you each step of the command along with a lot of chatty AD interactions. You've been warned.