Azure Data Factory – Self Hosted Integration Runtime Sharing

July 30, 2020 | Blogs, data

Written by Tejaswee Das, Software Engineer, Powerupcloud Technologies

Contributors: Sagar Gupta, Senior Software Engineer | Amruta Despande, Azure Associate Architect | Suraj R, Azure Cloud Engineer

Introduction

This post continues our discussion of Azure Data Factory (ADF) from our previous blogs. Earlier we covered ADF and the configuration steps for a high-availability self-hosted integration runtime (IR). You can read more about that here: Azure Data Factory – Setting up Self-Hosted IR HA enabled

This is a quick post on sharing an IR across ADFs for better cost optimization and resource utilization. It also covers a common pitfall when creating ADFs using Terraform and/or SDKs.

Use Case

This is again part of a major data migration assignment from AWS to Azure. We are using ADF extensively to set up ETL pipelines and migrate both historical and incremental data.

Problem Statement

Since the data migration activity involves different types of databases and complex data operations, we are using multiple ADFs to achieve this. Handling private production data required self-hosted IRs to be configured to connect to the production environment. The general best practice for a self-hosted IR is a high-availability architecture. An IR can have a maximum of 4 nodes and needs a minimum of 2 nodes for high availability. So here arises the problem: with multiple ADFs, how many such self-hosted IRs would one need to power them?

Solution

This is where IR sharing comes into the picture. ADF has a feature called IR sharing, wherein many ADFs can share the same IR. The advantage is lower cost and fewer resources to manage. Suppose you had to run 2 ADFs: one to perform various heavy migrations from AWS RDS MySQL to Azure, and the other for AWS RDS PostgreSQL. Without sharing, we would create 2 separate IRs, one connecting to MySQL and one to PostgreSQL. For a production-level implementation, this would mean 2 × 4 = 8 nodes (Windows VMs). Using IR sharing, we can create one self-hosted IR with 4 nodes and share it with both ADFs, saving the cost of 4 extra nodes. Please note: the IR node sizing depends on your workloads and is a separate calculation. This is only a high-level comparison.

Steps to enable IR sharing between ADFs

Step 1: Log in to the Azure Portal.

Step 2: Search for Data Factories in the main search bar.

Step 3: Select your Data Factory. Click on Author & Monitor.

Click on the Pencil icon to edit.

Step 4: Click on Connections. Open Management Hub.

Step 5: Click on Integration runtimes to view all your IRs. Select the self-hosted IR for which you want to enable sharing.

Refer to https://www.powerupcloud.com/azure-data-factory-setting-up-self-hosted-ir-ha-enabled/ for detailed information on creating self-hosted IRs.

Step 6: This opens the Edit integration runtime tab on the right side. Go to Sharing and click on + Grant permission to another Data Factory.

Copy the Resource ID shown on the Sharing tab. We will use it in Step 9.

Clicking + Grant permission to another Data Factory lists all the ADFs with which you can share this IR.

Step 7: You can either search for your ADF or manually enter its service identity application ID. Click on Add.

Note: You may sometimes be unable to find the ADF in this dropdown list. Even though your ADF is listed on the Data Factories page, it does not show up here, which can be puzzling. Such a case might arise when you create ADFs programmatically using the Azure APIs or through Terraform. Don’t forget to add the optional identity parameter while creating the ADF. This assigns a system-generated identity to the resource.

Sample Terraform for ADF

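# Configure the AzureRM provider (2.x series)
provider "azurerm" {
  version = "~>2.0"
  features {}
}

# Data factory with a system-assigned managed identity,
# so that it can be selected when sharing an IR
resource "azurerm_data_factory" "adf-demo" {
  name                = "adf-terraform-demo"
  location            = "East US 2"
  resource_group_name = "DEMO-ADF-RG"

  identity {
    type = "SystemAssigned"
  }
}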

To locate the service identity ID of an ADF, go to the Data Factories page, select the ADF and click on Properties.
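If the factory is managed with Terraform, the identity can also be read from the configuration instead of the portal. The following is a minimal sketch, assuming it is added to the same configuration as the sample ADF resource above. The output name is hypothetical. Note that principal_id is the object ID of the managed identity, which is not necessarily the same value as the application ID shown in the portal.

```hcl
# Sketch only: the output name is an illustrative assumption.
# principal_id is the object ID of the factory's system-assigned identity.
output "adf_demo_identity_principal_id" {
  description = "Object ID of the managed identity assigned to adf-terraform-demo"
  value       = azurerm_data_factory.adf-demo.identity[0].principal_id
}
```

After terraform apply, the value can be printed with terraform output adf_demo_identity_principal_id.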

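Step 8: Click on Apply for the changes to take effect.

In case you do not have the required permissions, you might get the following error. Granting access creates a role assignment on the IR, which needs the Microsoft.Authorization/roleAssignments/write permission. Built-in roles such as Owner or User Access Administrator include it.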

Error occurred when grant permission to xxxxxxxx-xxxx-xxxx-xxx-xxxxxxxxx. Error: {"error":{"code":"AuthorizationFailed","message":"The client 'xxxxxxxx@powerupcloud.com' with object id 'xxxxxxxx-xxxx-xxxx-xxx-xxxxxxxxx' does not have authorization to perform action 'Microsoft.Authorization/roleAssignments/write' over scope '/subscriptions/xxxxxxxx-xxxx-xxxx-xxx-xxxxxxxxx/resourcegroups/DEMO-ADF-RG/providers/Microsoft.DataFactory/factories/adf-terraform-demo-Postgres-to-MySQL/integrationRuntimes/integrationRuntime3/providers/Microsoft.Authorization/roleAssignments/xxxxxxxx-xxxx-xxxx-xxx-xxxxxxxxx' or the scope is invalid. If access was recently granted, please refresh your credentials."}}

Step 9: Now go to the ADF with which the IR is to be shared (the one added to the sharing list: adf-terraform-demo). Go to Connections → Integration runtimes → + New → Azure, Self-Hosted.

Here you will find the type Self-Hosted (Linked). Enter the Resource ID from Step 6 and click Create.

After successful creation, you can find the new IR with the sub-type Linked.

The IR sharing setup is complete. Your ADF pipelines can now run on the shared IR.
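For teams that manage their factories in Terraform, the portal steps above can be approximated in code. The following is a sketch under stated assumptions, not the exact setup used in this migration: all resource and factory names are hypothetical, the resource group is assumed to exist already, and the arguments (data_factory_id, rbac_authorization) assume a recent azurerm provider (3.x or later) rather than the 2.x version pinned in the earlier sample. The Contributor role is the one Microsoft's IR sharing guide grants to the consuming factory's identity, and it is assumed here to match what the portal's Grant permission button does.

```hcl
# Sketch only: names are hypothetical; assumes azurerm provider 3.x or later.
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

# Factory that owns the self-hosted IR to be shared.
resource "azurerm_data_factory" "host" {
  name                = "adf-host-demo" # factory names must be globally unique
  location            = "East US 2"
  resource_group_name = "DEMO-ADF-RG"
}

resource "azurerm_data_factory_integration_runtime_self_hosted" "host" {
  name            = "shared-ir"
  data_factory_id = azurerm_data_factory.host.id
}

# Factory that consumes the shared IR. It needs a managed identity.
resource "azurerm_data_factory" "target" {
  name                = "adf-target-demo"
  location            = "East US 2"
  resource_group_name = "DEMO-ADF-RG"

  identity {
    type = "SystemAssigned"
  }
}

# Counterpart of "Grant permission to another Data Factory" (Steps 6 to 8).
resource "azurerm_role_assignment" "share_ir" {
  scope                = azurerm_data_factory_integration_runtime_self_hosted.host.id
  role_definition_name = "Contributor"
  principal_id         = azurerm_data_factory.target.identity[0].principal_id
}

# Counterpart of creating the Self-Hosted (Linked) IR in Step 9.
resource "azurerm_data_factory_integration_runtime_self_hosted" "linked" {
  name            = "linked-ir"
  data_factory_id = azurerm_data_factory.target.id

  rbac_authorization {
    resource_id = azurerm_data_factory_integration_runtime_self_hosted.host.id
  }

  # Create the linked IR only after the permission has been granted.
  depends_on = [azurerm_role_assignment.share_ir]
}
```

The account running terraform apply needs permission to write role assignments, otherwise the same AuthorizationFailed error shown in Step 8 is likely to appear. The nodes of the shared IR still have to be installed and registered separately, as described in the earlier HA post.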

Conclusion

Sharing IRs between ADFs can significantly reduce infrastructure costs, and it is simple to set up. We will come up with more ADF use cases and share our problem statements, approaches and solutions.

Hope this was informative. Do leave your comments below for any questions.

Read the series here
