Written by Tejaswee Das, Software Engineer, Powerupcloud Technologies
In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage systems. However, on its own, raw data doesn’t have the proper context or meaning to provide meaningful insights to analysts, data scientists, or business decision-makers.
Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of raw data into actionable business insights. Azure Data Factory (ADF) is a managed cloud service that’s built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
This is how Azure introduces you to ADF. You can refer to the Azure documentation on ADF to learn more.
Simply put, ADF is an ETL tool that helps you connect to various data sources to load data, perform transformations as per your business logic, and store the results in different types of storage. It is a powerful tool and helps solve a variety of use cases.
In this blog, we will create a self-hosted integration runtime (IR) with two nodes for high availability.
A well-known OTT client was building an entire Content Management System (CMS) application on Azure and had to migrate their old, historical data from AWS, which hosts their current production environment. That’s when ADF with self-hosted IRs came to our rescue – self-hosted IRs let you connect to a different cloud, a different VPC, a private network, or on-premises data sources.
Our use case here was to read data from a production AWS RDS MySQL server inside a private VPC from ADF. To make this happen, we set up a two-node self-hosted IR with high availability (HA).
Prerequisites:
- Windows Server VMs (minimum 2 – Node1 & Node2)
- .NET Framework 4.6.1 or later
- For working with Parquet, ORC, and Avro formats, you will also require the Visual C++ 2010 Redistributable Package (x64)
Step 1: Log in to the Azure Portal at https://portal.azure.com
Step 2: Search for Data Factory in the Search bar. Click on + Add to create a new Data Factory.
Step 3: Enter a valid name for your ADF.
Note: The name can contain only letters, numbers, and hyphens. The first and last characters must be a letter or number. Spaces are not allowed.
Select the Subscription & Resource Group you want to create this ADF in. It is usually good practice to enable Git for your ADF. Apart from storing all your code safely, this also helps when you have to migrate your ADF to a production subscription – you can carry all your pipelines over.
Step 4: Click Create
You will need to wait a few minutes until your deployment is complete. If you get any error messages here, check your Subscription & Permission levels to make sure you have the required permissions to create data factories.
Step 5: Click on Go to resource
Click on Author & Monitor
Next, click on the Pencil button on the left side panel
Step 6: Click on Connections
Step 7: Under the Connections tab, click on Integration runtimes, then click on + New to create a new IR
Step 8: On clicking New, you will be taken to the IR set-up wizard.
Select Azure, Self-Hosted and click on Continue
Step 9: Select Self-Hosted and Continue
Step 10: Enter a valid name for your IR, and click Create
Note: Integration runtime Name can contain only letters, numbers and the dash (-) character. The first and last characters must be a letter or number. Every dash (-) character must be immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted in integration runtime names.
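As a quick sanity check, the naming rules above can be captured in a short regular expression. This is a hypothetical helper for illustration, not part of ADF:

```python
import re

# One run of alphanumerics, optionally followed by groups of a single
# dash plus more alphanumerics: this enforces every rule quoted above
# (no leading/trailing dash, no consecutive dashes, no other symbols).
IR_NAME_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def is_valid_ir_name(name: str) -> bool:
    """Return True if `name` satisfies the IR naming rules."""
    return bool(IR_NAME_RE.fullmatch(name))

print(is_valid_ir_name("selfhosted-ir-01"))  # True
print(is_valid_ir_name("-bad-start"))        # False
print(is_valid_ir_name("double--dash"))      # False
```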
On clicking Create, your IR will be created.
Step 11: Next, you will need to install the IR on your Windows VMs. At this point, log in to your VM (Node1), or wherever you want to install your IR.
You are provided with two options for installation:
- Express Setup – This is the easiest way to install and configure your IRs, and it is the approach we follow here. Connect to the Windows Server where you want to install it.
Login to Azure Portal in your browser (inside your VM) → Data Factory → select your ADF → Connections → Integration Runtimes → integrationRuntime1 → Click Express Setup → Click on the link to download setup files.
- Manual Setup – You can download the integration runtime and add the authentication keys to validate your installation.
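For the Manual Setup route, node registration can also be scripted with the Integration Runtime's dmgcmd.exe command-line tool. The sketch below only builds the command; the install path (the "5.0" version segment in particular) and the authentication key shown are placeholders that will differ on your machine:

```python
from pathlib import Path

# Default install location of the IR command-line tool; the version
# segment ("5.0") may differ depending on the installed release.
DMGCMD = Path(r"C:\Program Files\Microsoft Integration Runtime\5.0\Shared\dmgcmd.exe")

def build_register_command(auth_key: str) -> list:
    # dmgcmd.exe -RegisterNewNode "<AuthenticationKey>" registers this
    # machine as a new node using a key copied from the ADF portal.
    return [str(DMGCMD), "-RegisterNewNode", auth_key]

cmd = build_register_command("IR@placeholder-key")  # key from the portal
print(cmd[1])  # -RegisterNewNode
# subprocess.run(cmd, check=True)  # run this on the VM itself
```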
Step 12: Express Setup
Click on the downloaded file; your installation will start automatically.
Step 13: Once the installation and authentication are successfully completed, go to the Start Menu → Microsoft Integration Runtime → Microsoft Integration Runtime
Step 14: You will need to wait until your node is able to connect to the cloud service. If you get any error at this step, you can troubleshoot by referring to the self-hosted integration runtime troubleshooting guide.
Step 15: High availability
The one-node setup is complete. For high availability, we will need to set up at least 2 nodes; an IR can have a maximum of 4 nodes.
Note: Before setting up the other nodes, you need to enable remote access. Make sure you do this on your very first node, i.e., while the IR still has a single node – you might face connectivity issues later if you miss this step.
Go to the Settings tab and click on Change under Remote access from intranet.
Select Enable without TLS/SSL certificate (Basic) for dev/test purposes, or use a TLS/SSL certificate for a more secure connection.
You can select a different TCP port – else use the default 8060.
Click on OK. Your IR will need to be restarted for this change to take effect. Click OK again.
You will notice remote access enabled for your node.
Log in to your other VM (Node2). Repeat Steps 11 to 17. At this point you will probably get a Connection Limited message stating your nodes are not able to connect to each other. This is because we need to enable inbound access to port 8060 for both nodes.
Go to Azure Portal → Virtual Machines → Select your VM (Node1) → Networking.
Click on Add inbound port rule
Select Source → IP Addresses → set the Source IP to the IP of your Node2. Node2 will need to connect to port 8060 of Node1. Click Add.
In this example, Node1's IP is 10.0.0.1 and Node2's IP is 10.0.0.2. You can use either private or public IP addresses.
We will need to do a similar exercise for Node2.
Go to the VM page of Node2 and add Inbound rule for Port 8060. Node1 & Node2 need to be able to communicate with each other via port 8060.
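Before relying on the IR's own status screen, you can verify that the nodes can reach each other on port 8060 with a small TCP check. This is a stdlib-only sketch using the example IPs above – substitute your own addresses and run it from each node:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example IPs from this walkthrough: run from Node2 against Node1
# and vice versa; both should print True once the NSG rules are in.
for host in ("10.0.0.1", "10.0.0.2"):
    print(host, can_reach(host, 8060))
```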
If you go to your IR inside Node1 and Node2, you will see a green tick implying your nodes are successfully connected to each other and to the cloud. You may need to wait some time for this sync to happen. If you get an error at this step, you can view the integration runtime logs in the Windows Event Viewer to troubleshoot further. Restart both of your nodes.
To verify this connection, you can also check in the ADF Console.
Go to your Data Factory → Monitor (Watch symbol on the left panel, below Pencil symbol – Check Step 5) → Integration runtimes
Here you can see the number of registered nodes and their resource utilization. The HIGH AVAILABILITY ENABLED feature is now turned ON.
Step 21: Test Database connectivity from your Node
If you want to test database connectivity from your Node, make sure you have whitelisted the Public IP of your Node at the Database Server inbound security rules.
For example, if your Node1 has the IP address 66.66.66.66 and needs to connect to an AWS RDS MySQL server, go to your RDS security group and add an inbound rule for your MySQL port for this IP.
To test this, log in to your Node1 → Start → Microsoft Integration Runtime → Diagnostics → add your RDS connection details → Click on Test
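If you prefer a scriptable check from the node instead of the Diagnostics tab, a minimal stdlib-only sketch is to open a TCP connection to the MySQL port and read the server's greeting bytes (a MySQL server sends its handshake packet immediately on connect). The RDS hostname below is a placeholder – substitute your own instance's endpoint:

```python
import socket

def mysql_reachable(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """True if host:port accepts a TCP connection and sends data back
    (as a MySQL server does with its initial handshake packet)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            return len(s.recv(5)) > 0
    except OSError:
        return False

# Placeholder endpoint -- substitute your RDS instance's hostname.
print(mysql_reachable("mydb.placeholder.us-east-1.rds.amazonaws.com"))
```

Note this only proves network reachability through the security group; credentials are still validated by the IR's Diagnostics test or your linked service.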
This brings you to the end of successfully setting up a self-hosted IR with high availability enabled.
Hope this was informative. Do leave your comments below. Thanks for reading.