Data lake implementation improved processing time by 4X for India’s largest media company


Customer: India’s largest media company

Summary

The customer is one of India’s leading media and entertainment companies. They were looking to improve ad placements across channels for better conversion, while also factoring in parameters such as social media feedback. With the push towards digital content, shifting their on-premise infrastructure to the cloud was necessary to optimize costs and manage high volumes of data. Powerup was also engaged to support and maintain the deployed AWS services.

About Customer

The customer is one of the largest news network houses in India. With rights to more than 3,818 movie titles, it entertains over 1 billion viewers across 172+ countries through 80+ channels. The customer has always provided quality entertainment across the globe and is committed to maximizing the efficiency of its ad conversion rates through strategic advertisement placements across its multiple channels.

Problem Statement

The customer’s existing on-premise infrastructure was proving expensive due to the volume of data being generated, and the shift to the cloud was the need of the hour. Existing software provided TRP information on a weekly basis, whereas some reports needed to be generated every 6 to 12 minutes from source to destination. All players in the business generated these reports to aid critical decisions, and the customer, with major failures in its existing schedule-based processes, was facing further delays. The time taken to generate reports while making changes to promos and ad placements was therefore highly critical.

The customer was looking for a fully managed and scalable infrastructure setup and configuration on AWS. We proposed to create a data lake on AWS, with all the source data pushed to a common location. All data warehouse objects were to be created as in the existing system, the existing SQL Server Integration Services (SSIS) extract, transform, load (ETL) jobs were to be migrated to Talend so that new data would start moving to the data warehouse, and the Tableau dashboards were to be moved to AWS and pointed to Redshift and Redshift Spectrum.

Proposed Solution

A plan was drafted to make the shift from a tight-knit synchronous architecture to an event-based loosely coupled asynchronous architecture in order to ensure accurate and on-time report generation as per the user’s requirement.

The entire process transformation took a cloud-first approach: the client gathered on-premise data from multiple sources such as SAP, Chrome feeds, Twitter feeds and social media feedback in Excel files, and then piped it to the cloud. This data was extracted and consolidated into data warehouses, moving it from physical to digital form.

An AWS Landing Zone was to be set up with the following accounts: organization, centralized logging, shared services, security and production. The shared services account was used to deploy common applications such as the Bastion host and Tableau server, whereas the security account was created purely for audit purposes.
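
As a rough illustration of this account structure, the sketch below creates the member accounts of a landing zone with boto3 and AWS Organizations. The account names and e-mail addresses are placeholders, not the customer's actual values.

```python
# Illustrative sketch only: creating landing-zone member accounts
# (shared services, security, production) under the organization account.
import boto3

org = boto3.client("organizations")

accounts = {
    "shared-services": "aws-shared@example.com",
    "security":        "aws-security@example.com",
    "production":      "aws-production@example.com",
}

for name, email in accounts.items():
    # Each call starts an asynchronous account-creation request under the
    # organization's management account.
    resp = org.create_account(AccountName=name, Email=email)
    print(name, resp["CreateAccountStatus"]["State"])
```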

Appropriate users, groups and permissions were created using the Identity and Access Management (IAM) service to access the different AWS services, along with Multi-Factor Authentication (MFA) activation. The network was set up using a Virtual Private Cloud (VPC) with an appropriate Classless Inter-Domain Routing (CIDR) range, subnets and route tables.
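
A minimal sketch of this network and IAM bootstrap is shown below using boto3. The CIDR ranges, group name and policy ARN are illustrative assumptions, not values from the actual deployment.

```python
# Hedged sketch: VPC with one subnet and a route table, plus an IAM group.
import boto3

ec2 = boto3.client("ec2")
iam = boto3.client("iam")

# VPC with an assumed /16 range, one private subnet and a route table.
vpc = ec2.create_vpc(CidrBlock="10.10.0.0/16")["Vpc"]
subnet = ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.10.1.0/24")["Subnet"]
ec2.create_route_table(VpcId=vpc["VpcId"])

# An example IAM group for the analytics team with read-only Redshift access.
iam.create_group(GroupName="analytics-team")
iam.attach_group_policy(
    GroupName="analytics-team",
    PolicyArn="arn:aws:iam::aws:policy/AmazonRedshiftReadOnlyAccess",
)
```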

A VPN tunnel was set up between AWS and the customer location. A one-time data transfer would be done directly to Amazon Simple Storage Service (S3), after which the backup file on S3 would be restored into Amazon Relational Database Service (RDS) and then moved entirely to Redshift. Ultimately, S3 would hold the entire dump, which would be deployed on Amazon Elastic Compute Cloud (EC2) on AWS.
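
One way to picture the S3-to-Redshift load step is the sketch below, which issues a COPY through the Redshift Data API. The cluster name, database, table, bucket and IAM role ARN are placeholders for illustration only.

```python
# Hedged sketch of bulk-loading data from S3 into Redshift.
import boto3

rsd = boto3.client("redshift-data")

copy_sql = """
    COPY analytics.trp_reports
    FROM 's3://example-datalake/curated/trp_reports/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

resp = rsd.execute_statement(
    ClusterIdentifier="example-dw-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
print("statement id:", resp["Id"])
```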

Once collated into a single repository, the data could be transformed from raw to columnar format using Lambda functions and then pushed to and visualized in Tableau.
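
A minimal sketch of such a raw-to-columnar transformation, written as a Lambda handler, is shown below. It assumes pandas/pyarrow are available (for example via a Lambda layer); the bucket names and environment variable are placeholders.

```python
# Hedged sketch: convert a raw CSV object to Parquet on an S3 put event.
import os
import boto3
import pandas as pd

s3 = boto3.client("s3")
CURATED_BUCKET = os.environ.get("CURATED_BUCKET", "example-datalake-curated")

def handler(event, context):
    # Triggered by an S3 put notification on the raw bucket.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    raw_path = f"/tmp/{os.path.basename(key)}"
    s3.download_file(bucket, key, raw_path)

    # Convert the raw CSV to columnar Parquet for Redshift Spectrum/Athena.
    df = pd.read_csv(raw_path)
    parquet_path = raw_path.rsplit(".", 1)[0] + ".parquet"
    df.to_parquet(parquet_path, index=False)

    out_key = key.rsplit(".", 1)[0] + ".parquet"
    s3.upload_file(parquet_path, CURATED_BUCKET, out_key)
    return {"converted": out_key}
```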

An extract, transform, load (ETL) tool like Talend can be used to transfer incremental data to S3. An SSH File Transfer Protocol (SFTP) server can be used to upload the Excel files to S3; alternatively, Talend can extract the data from the Excel files and load it into S3. Active Directory Federation Services (ADFS) is configured to provide federated access to the Tableau server, as the on-premise AD has employees as well as third-party vendors added. Glue Crawlers will run periodically and scan the S3 data lake to automatically catalog the structured as well as unstructured data in S3, which in turn can be connected with Amazon Redshift and all other data warehouses being used.
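
The sketch below shows what such a periodic Glue crawler might look like in boto3. The crawler name, role, database, S3 path and schedule are illustrative assumptions.

```python
# Hedged sketch of a scheduled Glue crawler over the curated S3 data lake.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="datalake-curated-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="media_datalake",
    Targets={"S3Targets": [{"Path": "s3://example-datalake-curated/"}]},
    # Run every 6 hours so new partitions appear in Redshift Spectrum/Athena.
    Schedule="cron(0 */6 * * ? *)",
)
glue.start_crawler(Name="datalake-curated-crawler")
```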

CloudWatch service will be used for monitoring and Amazon Simple Notification Service (SNS) will be used to notify the users in case of alarms or metrics crossing thresholds. All snapshot backups will be regularly taken and automated based on best practices.
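
A hedged example of this monitoring pattern is given below: a CloudWatch alarm that notifies an SNS topic when a metric crosses its threshold. The topic name, subscriber address, metric and threshold are assumptions for illustration.

```python
# Hedged sketch: SNS topic plus a CloudWatch alarm on Redshift CPU usage.
import boto3

sns = boto3.client("sns")
cw = boto3.client("cloudwatch")

topic_arn = sns.create_topic(Name="etl-alerts")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="ops-team@example.com")

cw.put_metric_alarm(
    AlarmName="redshift-cpu-high",
    Namespace="AWS/Redshift",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "example-dw-cluster"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[topic_arn],  # notify the ops team via SNS
)
```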

Security and Logging

It was ensured that the system also had a built-in centralized logging system, which kept a check on various parameters such as the time taken for each process, its success or failure, and the reason for any failure. Data will be secured, and security groups will be used to control traffic at the Virtual Machine (VM) level.
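
The kind of structured log record such a system tracks for each ETL step is sketched below. The field names and logger setup are assumptions, not the actual logging schema used.

```python
# Hedged sketch: emit one structured record per ETL step
# (duration, status, failure reason).
import json
import logging
import time

logger = logging.getLogger("etl")
logging.basicConfig(level=logging.INFO)

def run_step(name, fn):
    start = time.time()
    try:
        fn()
        status, reason = "success", None
    except Exception as exc:  # capture the failure reason for the log record
        status, reason = "failure", str(exc)
    logger.info(json.dumps({
        "step": name,
        "duration_sec": round(time.time() - start, 2),
        "status": status,
        "failure_reason": reason,
    }))
```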

Network Access Control Lists (NACLs) are used to control traffic at the subnet level, and VPC flow logs will be enabled to capture the entire network traffic. CloudTrail will be enabled to capture all Application Program Interface (API) activity. These logs will be analyzed by Amazon GuardDuty for threat detection and identification of malicious activity in the account, and AWS Config will be enabled.
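
A minimal sketch of the flow-log and CloudTrail setup is shown below. The VPC ID and bucket names are placeholders, not the actual resources.

```python
# Hedged sketch: VPC flow logs to S3 and a multi-region CloudTrail trail.
import boto3

ec2 = boto3.client("ec2")
cloudtrail = boto3.client("cloudtrail")

# Capture all traffic (accepted and rejected) for the workload VPC.
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::example-central-logs/vpc-flow-logs/",
)

# Record API activity in every region to the central logging bucket.
cloudtrail.create_trail(
    Name="org-trail",
    S3BucketName="example-central-logs",
    IsMultiRegionTrail=True,
)
cloudtrail.start_logging(Name="org-trail")
```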

To further ease operations, the EC2 auto-recovery feature is enabled to address instance failures, if any, so that data is not lost.
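
One common way to enable this is a CloudWatch alarm wired to the built-in EC2 recover action, as sketched below. The instance ID and region are illustrative.

```python
# Hedged sketch: recover an instance when its system status check fails.
import boto3

cw = boto3.client("cloudwatch", region_name="ap-south-1")

cw.put_metric_alarm(
    AlarmName="ec2-auto-recover-tableau",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    # The "recover" action restarts the instance on healthy hardware while
    # preserving the instance ID, private IPs and EBS volumes.
    AlarmActions=["arn:aws:automate:ap-south-1:ec2:recover"],
)
```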

The solution was designed in a modular manner keeping in mind the possibility of the addition of new channels and scalability in future, where components could be added or removed without any code changes.

The solution architecture

Application Support

Our support involved components that were developed or modified as a part of the project implementation process.

  • Support for Broadcast Audience Research Council (BARC) sequences related to ETL (extract, transform, load) pipeline functionality.
  • Data lake support on S3.
  • For the Amazon Redshift data warehouse, support for data issues in the stored procedures migrated as part of the project.
  • Tableau dashboard support for data links to Redshift.

The customer was to submit requests, prioritize defects and provide technical inputs to Powerup before planned releases. Based on these inputs, we maintained a backlog of defects reported by client stakeholders. Powerup cloud resources planned and scheduled each release jointly with the customer’s team, conducted unit tests and supported acceptance testing by the customer while fixing business-critical issues, if any. They deployed releases to production, provided warranty support for any defects found in production releases, and conducted weekly project status meetings with client stakeholders to review work progress, planned activities, risks and issues, dependencies and action items, if any.

Opportunities

After the shift to the cloud, the customer was able to derive sentiment analysis from TRPs, which were based on social media data for new as well as existing shows. It also paved the way for them to conduct gap analysis, comparing the current infrastructure and process improvements with potential or expected performance, which helped enhance their efficiency.

Cloud platform

AWS.

Technologies used

Tableau, Redshift, DMS, Glue, Athena.

Business Benefits

  • The customer gained a fully managed and scalable infrastructure setup and configuration on AWS.
  • 120 dashboards were created, and data processing time was reduced from 2 hours to 30 minutes, a 4X improvement.
  • The immediate business impact was that the modular solution enabled management to take improved and timely business-critical decisions.
  • Migration to the cloud enabled swift generation of critical reports in an end-to-end time span of 3 minutes, down from 6-12 minutes, which significantly improved the decision-making capability of business leaders. Going forward, TRPs are anticipated to increase further due to this digital shift.
