Written by Anjali Sharma, Software Engineer at Powerupcloud Technologies
What is a Cloud Data Fusion Pipeline? –
Cloud Data Fusion is an enterprise data integration service from Google for quickly building and managing pipelines. It is a fully managed, cloud-native platform where a source (MySQL) and a sink (BigQuery) can be connected easily without writing any code.
Since it is a code-free environment, anyone can use it without being held back by a lack of technical or coding skills.
Cloud Data Fusion is built on the open-source project CDAP, and this open core ensures data pipeline portability for users. CDAP’s broad integration with on-premises and public cloud platforms gives Cloud Data Fusion users the ability to break down silos and deliver insights that were previously inaccessible.
How it looks on the Cloud Data Fusion platform
Why Cloud Data Fusion?
Now the question arises: why use Cloud Data Fusion when there are other ways to connect a MySQL database to BigQuery for ETL/ELT?
Data Fusion provides a fully managed, easy-to-use visual interface on a fully scalable, distributed platform that lets you connect to many different data sources easily.
Data Fusion also gives you the flexibility to keep all of your pipelines as code and to use REST API calls to create and trigger them. This makes Cloud Data Fusion a complete package for developing data pipelines easily and efficiently.
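As a sketch of the pipelines-as-code idea, a deployed batch pipeline can be started through the CDAP REST API that a Data Fusion instance exposes. The endpoint, pipeline name, and namespace below are placeholders; `DataPipelineWorkflow` is the default workflow name for batch pipelines:

```python
import urllib.request


def start_url(endpoint: str, pipeline: str, namespace: str = "default") -> str:
    """Build the CDAP REST URL that starts a deployed batch pipeline."""
    return (f"{endpoint}/v3/namespaces/{namespace}/apps/{pipeline}"
            f"/workflows/DataPipelineWorkflow/start")


def start_pipeline(endpoint: str, pipeline: str, token: str) -> None:
    """POST to the start endpoint, authenticating with a Google OAuth bearer token."""
    req = urllib.request.Request(
        start_url(endpoint, pipeline),
        data=b"{}",  # runtime arguments could be passed here as JSON
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)
```

The endpoint is the instance's API URL (visible on the instance details page), and a token can be obtained with `gcloud auth print-access-token`.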
How do we create the data pipeline? –
Creating a Data Fusion pipeline on the Google Cloud Platform is quite easy; we can get it done in a few steps-
Step1- Go to the GCP console, find Cloud Data Fusion, and click on 'Create Instance'.
Step2- Fill in the instance name and region, and click on Create.
Step3- It takes 10-15 minutes to create an instance. Once it is ready, go to View Instance and click on the redirect URL.
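Steps 1-3 can also be scripted instead of clicked through. A minimal sketch against the Cloud Data Fusion REST API (the project, region, and instance ID below are hypothetical, and a BASIC edition instance is assumed):

```python
import json
import urllib.request


def create_instance_url(project: str, region: str, instance_id: str) -> str:
    """Build the Data Fusion REST URL that creates a new instance."""
    return (f"https://datafusion.googleapis.com/v1/projects/{project}"
            f"/locations/{region}/instances?instanceId={instance_id}")


def create_instance(project: str, region: str, instance_id: str, token: str) -> None:
    """POST the create request; instance provisioning runs asynchronously."""
    body = json.dumps({"type": "BASIC"}).encode()
    req = urllib.request.Request(
        create_instance_url(project, region, instance_id),
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)
```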
Step4- You are now inside the Cloud Data Fusion instance. Click on HUB and choose a pipeline (Import Data from MySQL).
Step5- Along with pipelines, the HUB offers several other options. Choose Import Data from MySQL. The next task is to install the driver.
Step6- Install the MySQL JDBC driver, which makes the connection that lets the MySQL database communicate with BigQuery. The driver can be found here in the HUB itself, but make sure it is the latest version.
Step7- Now go to the Navigation Bar and click on the Control Center.
Step8- Click the green encircled plus symbol and upload the latest version of the JDBC driver.
Step9- Give the driver a name and a class name in a valid format (for the MySQL Connector/J driver this is com.mysql.jdbc.Driver), and click on Finish.
Step10- Go to the HUB again, click on the Import Data from MySQL pipeline, and click on Create. Give the pipeline a name and click Finish. You can now customize your pipeline.
Here in the Cloud Data Fusion Studio we can change the source and sink as needed; in this case we connect a database (source) to BigQuery (sink).
Step11- Go to the database properties and fill in the plugin name and type. After filling in the details, browse the database and click on Add Connection.
Step12- Here you will find the installed MySQL driver. Click on it and enter the connection name, host, port, database name, username, and password.
Step13- Test the connection and add it.
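The connection details entered in steps 11-13 correspond to a standard MySQL JDBC connection string. A small sketch (the host and database name are placeholders):

```python
def mysql_jdbc_url(host: str, database: str, port: int = 3306) -> str:
    """Build the JDBC connection string the MySQL plugin uses under the hood."""
    return f"jdbc:mysql://{host}:{port}/{database}"


# The values typed into the Add Connection form map onto this URL.
print(mysql_jdbc_url("10.0.0.5", "sales_db"))  # prints jdbc:mysql://10.0.0.5:3306/sales_db
```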
Step14- You can now import your query. Deploy the pipeline.
Step15- Your Data Fusion pipeline has been deployed successfully.
Cloud Data Fusion takes care of most of the ETL/ELT work for you. And since it is part of Google Cloud, you get built-in security benefits when using Cloud Data Fusion rather than self-managed CDAP servers:
- Cloud-native security control with Cloud IAM—Identity management and authentication efforts are taken care of by Cloud Identity.
- Full observability with Stackdriver Logging and Monitoring—logs include pipeline logs and audit logs.
- Reduced exposure to the public internet with private networking.
Cloud Data Fusion offers both preconfigured transformations from an OSS library as well as the ability to create an internal library of custom connections and transformations that can be validated, shared, and reused across an organization. It lays the foundation for collaborative data engineering and improves productivity. That means less waiting for data engineers and, importantly, less sweating about code quality.