Azure Data Factory – Setting up Self-Hosted IR HA enabled

June 25, 2020 | Blogs, data

Written by Tejaswee Das, Software Engineer, Powerupcloud Technologies

Introduction

In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage systems. However, on its own, raw data doesn’t have the proper context or meaning to provide meaningful insights to analysts, data scientists, or business decision-makers.

Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of raw data into actionable business insights. Azure Data Factory (ADF) is a managed cloud service that’s built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.

This is how Azure introduces you to ADF. You can refer to the Azure documentation on ADF to know more.

Simply said, ADF is an ETL tool that helps you connect to various data sources to load data, perform transformations as per your business logic, and store the results in different types of storage. It is a powerful tool that can solve a variety of use cases.

In this blog, we will create a self-hosted integration runtime (IR) with two nodes for high availability.

Use Case

A reputed OTT client was building an entire Content Management System (CMS) application on Azure and had to migrate its old, historical data from AWS, which hosts its current production environment. That’s when ADF with self-hosted IRs comes to your rescue: they are what you need when you have to connect to a different cloud, a different VPC, a private network, or on-premises data sources.

Our use case here was to read data from a production AWS RDS MySQL Server inside a private VPC from ADF. To make this happen, we set up a two-node self-hosted IR with high availability (HA).

Pre-requisites

  • Windows Server VMs (minimum 2 – Node1 & Node2)
  • .NET Framework 4.6.1 or later
  • For working with Parquet, ORC, and Avro formats, you will require:
    • Visual C++ 2010 Redistributable Package (x64)
    • Java

Installation Steps

Step 1: Log in to the Azure Portal at https://portal.azure.com

Step 2: Search for Data Factory in the Search bar. Click on + Add to create a new Data Factory.

Step 3: Enter a valid name for your ADF.

Note: The name can contain only letters, numbers, and hyphens. The first and last characters must be a letter or number. Spaces are not allowed.

Select the Subscription & Resource Group you want to create this ADF in. It is usually good practice to enable Git for your ADF. Apart from storing all your code safely, this also helps when you have to migrate your ADF to a production subscription, because you can take all your pipelines with you.

Step 4: Click Create

You will need to wait a few minutes until your deployment is complete. If you get any error messages here, check your Subscription & Permission level to make sure you have the required permissions to create data factories.

Click on Go to resource

Step 5:

Click on Author & Monitor

Next, click on the Pencil button on the left side panel

Step 6: Click on Connections

Step 7: Under the Connections tab, click on Integration runtimes, then click on + New to create a new IR

Step 8: On clicking New, you will be taken to the IR set-up wizard.

Select Azure, Self-Hosted and click on Continue

Step 9: Select Self-Hosted and click on Continue

Step 10: Enter a valid name for your IR, and click Create

Note: Integration runtime Name can contain only letters, numbers and the dash (-) character. The first and last characters must be a letter or number. Every dash (-) character must be immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted in integration runtime names.
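If you script your setup, it can help to check a candidate name before submitting it. The snippet below is a minimal sketch of the rules quoted in the note above, written in Python. The function name is our own, we assume "letters" means ASCII letters, and any length limit the portal may enforce is not checked here.

```python
import re

# Groups of letters/digits separated by single dashes. This one pattern covers
# all the quoted rules: allowed characters, first/last character, every dash
# surrounded by a letter or number, and no consecutive dashes.
_IR_NAME_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def is_valid_ir_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rules quoted in the note."""
    return _IR_NAME_PATTERN.fullmatch(name) is not None
```

With this sketch, names such as integrationRuntime1 or ir-node-1 pass, while -ir, ir--1, and names containing spaces are rejected.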

Step 11:

On clicking Create, your IR will be created.
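The portal is all we used here, but the same IR definition can also be created through the Azure Resource Manager REST API. The following is a rough sketch that only builds the request (method, URL, and JSON body) for a self-hosted IR. The function name and all argument values are placeholders of our own, and actually sending the request requires an Azure AD bearer token, which is not shown.

```python
import json
from urllib.parse import quote

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2018-06-01"  # Data Factory (V2) REST API version


def build_create_ir_request(subscription_id, resource_group, factory_name,
                            ir_name, description=""):
    """Return (method, url, body) for creating a self-hosted IR definition."""
    url = (
        f"{ARM_ENDPOINT}/subscriptions/{quote(subscription_id, safe='')}"
        f"/resourceGroups/{quote(resource_group, safe='')}"
        f"/providers/Microsoft.DataFactory"
        f"/factories/{quote(factory_name, safe='')}"
        f"/integrationRuntimes/{quote(ir_name, safe='')}"
        f"?api-version={API_VERSION}"
    )
    # "SelfHosted" is the IR type that matches the Self-Hosted option in Step 9.
    body = {"properties": {"type": "SelfHosted", "description": description}}
    return "PUT", url, json.dumps(body)
```

Note that this creates only the IR definition in ADF. The runtime still has to be installed and registered on each Windows node, as described in the following steps.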

Next, you will need to install the IR on your Windows VMs. At this point, you should log in to your VM (Node1) or wherever you want to install your IR.

You are provided with two options for installation:

  • Express Setup – This is the easiest way to install and configure your IR, and it is the option we follow in this setup. Connect to the Windows Server where you want to install it.

Log in to the Azure Portal in your browser (inside your VM) → Data Factory → select your ADF → Connections → Integration Runtimes → integrationRuntime1 → click Express Setup → click on the link to download the setup files.

  • Manual Setup – You can download the integration runtime and add the authentication keys to validate your installation.

Step 12: Express Setup

Click on the downloaded file. Your installation will start automatically.

Step 13:

Once the installation and authentication are successfully completed, go to the Start Menu → Microsoft Integration Runtime → Microsoft Integration Runtime

Step 14: You will need to wait till your node is able to connect to the cloud service. If, for any reason, you get an error at this step, you can troubleshoot by referring to the self-hosted integration runtime troubleshooting guide.

Step 15: High availability 

The one-node setup is complete. For high availability, we will need to set up at least 2 nodes. An IR can have a maximum of 4 nodes.

Note: Before setting up other nodes, you need to enable remote access. Make sure you do this on your very first node, i.e., while the IR still has a single node. If you forget this step, you might face connectivity issues later.

Go to the Settings tab and click on Change under Remote access from intranet

Step 16:

Select Enable without TLS/SSL certificate (Basic) for dev/test purposes, or use TLS/SSL for a more secure connection.

You can select a different TCP port; otherwise, use the default, 8060.

Step 17:

Click on OK. Your IR will need to be restarted for this change to take effect. Click OK again.

You will notice remote access enabled for your node.

Step 18:

Log in to your other VM (Node2). Repeat Steps 11 to 17. At this point, you will probably get a Connection Limited message stating that your nodes are not able to connect to each other. Guess why? We need to enable inbound access to port 8060 on both nodes.

Go to Azure Portal → Virtual Machines → Select your VM (Node1) → Networking.

Click on Add inbound port rule

Step 19:

Select Source → IP Addresses → Set Source IP as the IP of your Node2. Node2 will need to connect to port 8060 of Node1. Click Add.

Node1 IP – 10.0.0.1 & Node2 IP – 10.0.0.2. You can use either private or public IP addresses.

We will need to do a similar exercise for Node2.

Go to the VM page of Node2 and add an inbound rule for port 8060, with the IP of Node1 as the source. Node1 & Node2 need to be able to communicate with each other via port 8060.
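If you would rather keep these rules in a script than click through the portal for every node, the sketch below builds one inbound rule per peer node. The property names follow the Azure network security rule schema. The rule names, the starting priority of 1000, and the function names are our own assumptions. The code only builds the rule definitions and does not call Azure.

```python
import ipaddress

REMOTE_ACCESS_PORT = 8060  # default remote-access port from Step 16


def build_inbound_rule(peer_name, peer_ip, priority, port=REMOTE_ACCESS_PORT):
    """Build one inbound 'allow' rule letting peer_ip reach the given TCP port."""
    ipaddress.IPv4Address(peer_ip)  # raises ValueError if the address is malformed
    return {
        "name": f"Allow-IR-{port}-From-{peer_name}",  # hypothetical naming scheme
        "properties": {
            "protocol": "Tcp",
            "sourceAddressPrefix": peer_ip,
            "sourcePortRange": "*",
            "destinationAddressPrefix": "*",
            "destinationPortRange": str(port),
            "access": "Allow",
            "direction": "Inbound",
            "priority": priority,  # assumed not to clash with existing rules
        },
    }


def rules_for_nodes(node_ips, base_priority=1000):
    """Return {node name: [inbound rules]}, with one rule per peer node."""
    rules = {}
    for node in node_ips:
        peers = [(name, ip) for name, ip in node_ips.items() if name != node]
        rules[node] = [
            build_inbound_rule(name, ip, base_priority + offset)
            for offset, (name, ip) in enumerate(peers)
        ]
    return rules
```

Calling rules_for_nodes({"Node1": "10.0.0.1", "Node2": "10.0.0.2"}) gives Node1 a rule that admits 10.0.0.2 and Node2 a rule that admits 10.0.0.1, which mirrors the two portal rules described above.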

Step 20:

If you go to your IR inside your Node1 and Node2, you will see the green tick implying your nodes are successfully connected to each other and also to the cloud. This sync can take some time, so give it a few minutes. If, for some reason, you get an error at this step, you can view the integration runtime logs in Windows Event Viewer to troubleshoot further. If the error persists, restart both of your nodes.
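When the nodes still show a limited connection, a quick way to rule out the network is to try opening a TCP connection from one node to the other on the remote-access port. The helper below is a small sketch using only the Python standard library. The function name is our own, and it assumes Python is available on the node.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, running can_connect("10.0.0.1", 8060) on Node2 should return True once the inbound rule on Node1 is in place and the IR service is listening. A False result points to a firewall or security rule rather than to ADF itself.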

To verify this connection, you can also check in the ADF Console.

Go to your Data Factory → Monitor (Watch symbol on the left panel, below Pencil symbol – Check Step 5) → Integration runtimes

Here you can see the number of registered nodes and their resource utilization. The HIGH AVAILABILITY ENABLED feature is turned ON now.

Step 21: Test Database connectivity from your Node

If you want to test database connectivity from your Node, make sure you have whitelisted the Public IP of your Node at the Database Server inbound security rules.

For example, if your Node1 has the public IP address 203.0.113.10 and needs to connect to an AWS RDS MySQL Server, go to your RDS security group and add an inbound rule on your MySQL port for this IP.

To test this, log in to your Node1 → Start → Microsoft Integration Runtime → Diagnostics → add your RDS connection details → click on Test
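The security group rule described above can also be expressed in code. The sketch below builds the ingress permission for a single node IP in the shape the EC2 security group API expects. The function name, the default port 3306, the description text, and the address 203.0.113.10 (a documentation-range placeholder) are assumptions for illustration.

```python
import ipaddress

MYSQL_PORT = 3306  # default MySQL port; change it if your RDS instance differs


def build_mysql_ingress(node_public_ip, port=MYSQL_PORT,
                        description="ADF self-hosted IR node"):
    """Build an ingress permission allowing one node IP to reach MySQL."""
    # Raises ValueError if the address is not a valid IPv4 address.
    address = ipaddress.IPv4Address(node_public_ip)
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            # /32 restricts the rule to exactly this one address.
            "IpRanges": [{"CidrIp": f"{address}/32", "Description": description}],
        }
    ]
```

If you use boto3, the returned list can be passed as the IpPermissions argument of the EC2 client's authorize_security_group_ingress call, together with the GroupId of your RDS security group. Repeat this for the public IP of every node in the IR.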

Conclusion

This brings you to the end of successfully setting up a self-hosted IR with high availability enabled.

Hope this was informative. Do leave your comments below. Thanks for reading.
