Customer: An e-commerce Company-Running Websites at Scale on App Service.
One of India’s largest media companies uses various SaaS platforms to run its OTT streaming application, resulting in data being stored across several disparate sources. With around 20 of these data sources, the overall daily raw data aggregates to ~600 GB. This made extracting customer metadata complex and made search and building recommendations difficult.
The solution was to build a data lake that brings all of their customer and operations data into one place so they can understand their business better. Powerupcloud built real-time and batch ETL jobs to bring the data from the varied data sources to S3, where the raw data was stored. The data was then loaded into Redshift for further reporting, while advanced analytics was run using Hadoop-based ML engines on EMR. Reporting was done using QuickSight.
Written by: Nagarjun K, Software Engineer at Powerupcloud Technologies
Given the cloud imperative, many organizations migrate their workloads from on-premises environments or other clouds to AWS. However, while migrating old data into AWS S3, organizations find it hard to enable date-based partitioning. Because this is difficult to implement retrospectively, organizations usually end up with disparate storage sources within their AWS environment. This blog shares some best practices for implementing date-based partitioning on historical data and provides key guidelines for converting CSV/JSON files to Parquet format before migrating your data.
It is common knowledge that the Parquet file format is desirable because of its size and cost benefits. A recommended approach for converting old data to Parquet is therefore crucial to migration success. To enable this, organizations often explore AWS EMR and Dataproc clusters. However, these approaches introduce other challenges, such as large cluster sizes and the associated cost of running the clusters. A solution that addresses these concerns and also frees the organization from cluster administration chores is therefore valuable. For these reasons, AWS Glue seems to be a prudent choice. Below is the list of interchangeable format conversions supported by Glue:
Data is usually constrained by storage, which has a bearing on cost. Parquet is a columnar file format and allows significant storage optimization due to its size benefits. Additionally, many options are available for the compression and encoding of Parquet files. Data warehousing services such as BigQuery and Snowflake support the Parquet file format, enabling granular control over performance and cost.
As discussed above, partitioning files by date directly limits the amount of data that needs to be processed and therefore optimizes reads. While unpartitioned data can also be queried, that approach introduces performance and cost inefficiencies. In essence, partitioning reduces the data that must be scanned for a query, enabling higher throughput.
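To make the read-optimization point concrete, here is a minimal sketch of how a date filter narrows the set of objects that need to be read. It assumes Hive-style partition folders (year=/month=/day=); the object keys and helper names are hypothetical and only simulate what a query engine does when it prunes partitions.

```python
from datetime import date


def partition_prefix(d: date) -> str:
    """Return a Hive-style partition prefix, e.g. 'year=2019/month=01/day=05/'."""
    return f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"


def keys_to_scan(all_keys, wanted_day: date):
    """Keep only the objects stored under the partition for wanted_day."""
    prefix = partition_prefix(wanted_day)
    return [key for key in all_keys if key.startswith(prefix)]


# Hypothetical object keys in a date-partitioned location.
keys = [
    "year=2019/month=01/day=04/part-0000.parquet",
    "year=2019/month=01/day=05/part-0000.parquet",
    "year=2019/month=01/day=05/part-0001.parquet",
    "year=2019/month=02/day=01/part-0000.parquet",
]

selected = keys_to_scan(keys, date(2019, 1, 5))
print(f"{len(selected)} of {len(keys)} objects need to be read")
```

Without the partition folders, a query for a single day would have to open every object to find the matching rows.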
Steps to convert the files into Parquet
Step 1: Extracting Old Data
As a first step, extract the historical data from the source database, along with headers, in CSV format. To make the data easier to read, you may also use the pipe separator (|). After structuring the data with the pipe separator, store the CSV file in an S3 bucket.
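As a rough illustration of this step, the sketch below writes rows with a header line using the pipe separator. The column names and sample rows are made up for the example; in practice the rows would come from your source database.

```python
import csv
import io


def rows_to_pipe_csv(header, rows):
    """Serialize a header and rows as pipe-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="|", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical columns and rows standing in for the extracted data.
header = ["order_id", "customer", "order_date"]
rows = [(1, "Asha", "2019/01/05"), (2, "Ravi, K", "2019/01/06")]

text = rows_to_pipe_csv(header, rows)
print(text)
```

One practical benefit of the pipe separator is that values containing commas, such as the second customer name here, do not need quoting.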
Step 2: Creating Crawlers for Fetching File Meta-data
To identify the schema of the CSV files, you need to create and run a crawler. The steps are as follows:
Go to the AWS Glue home page.
Select the Crawlers section and click “Add crawler”.
Name your crawler.
Select the path of your CSV folder in S3 (do not select specific CSV files). As a prerequisite, create a folder that contains all your CSV files.
As demonstrated below, we give a folder path instead of selecting a file name: s3://Bucketname/foldername
You may add additional data sources; otherwise, click “No”.
Since the crawler needs to read the source files, and the same role is reused later by the ETL job to write the Parquet files back to S3, create an IAM role that allows both read and write access.
Set the crawler to run on demand.
Enter the database name in which to create the table schema for the CSV files.
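The console steps above can also be scripted. The following is a minimal sketch using boto3's Glue client (create_crawler and start_crawler); the crawler name, role ARN and database name are placeholders, and the AWS calls are wrapped in a function that is not invoked here, so the snippet runs without credentials.

```python
def crawler_request(name, role_arn, database, s3_folder):
    """Build the keyword arguments for Glue's create_crawler call."""
    if not s3_folder.endswith("/"):
        s3_folder += "/"  # point at the folder, not at an individual file
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_folder}]},
        # No "Schedule" key is set, so the crawler runs only on demand.
    }


def create_and_start(request):
    """Create the crawler and start it (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without AWS libraries

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Placeholder values; replace them with your own names and role.
request = crawler_request(
    name="historical-csv-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCsvToParquetRole",
    database="historical_data",
    s3_folder="s3://Bucketname/foldername",
)
print(request["Targets"])
```

Calling create_and_start(request) would then perform the equivalent of creating the crawler and running it from the console.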
Step 3: Running the Crawler
After you successfully create the crawler, click “Run it Now” and wait a few minutes. Shortly afterwards, you will see a new table with the same schema as your CSV file in the Data Catalog section.
Here, we see the CSV file table created by the crawler.
Step 4: Adding the partition columns to Historical data using Athena
Once the table has been created by the crawler, open Athena and click “Run query”.
As illustrated in the figure below, the date column is in yyyy/mm/dd format. As part of the partitioning procedure, you can derive separate columns for year, month and day by running a partitioning query.
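The Athena query itself is not reproduced here. As a stand-in, the sketch below shows the same column-splitting logic in plain Python: a yyyy/mm/dd value is split into year, month and day fields. The field name order_date and the sample row are assumptions for illustration only.

```python
def add_partition_columns(row, date_field="order_date"):
    """Return a copy of row with year, month and day derived from a yyyy/mm/dd value."""
    year, month, day = row[date_field].split("/")
    return {**row, "year": year, "month": month, "day": day}


# Hypothetical row; the date field name is an assumption.
row = {"order_id": "1", "order_date": "2019/01/05"}
partitioned = add_partition_columns(row)
print(partitioned)
```

The three derived columns are the ones later used as partition keys when the data is written out as Parquet.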
Step 5: Running ETL for converting to Parquet format
Select the ETL section, go to Jobs and click “Add Job”.
Name your job and select the IAM role (the role you created in the earlier step).
Select the data source created by the crawler.
Choose S3 as your data target.
The next screen allows column mapping. If you need to remap or remove any column from the CSV, you may modify it from this screen.
The following screen shows the diagram and source code for the job. As a next step, add partitionKeys and list the column names for year, month and day, in that order, to enable partitioning. See the example below: "partitionKeys": ["year", "month", "day"]
Save the changes and click the “Run Job” button. Wait a few minutes (depending on your total data size) for the job to complete. You can see the logs at the bottom of the screen.
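For orientation, here is a small sketch of where the partitionKeys setting sits in a Glue job script. The code only builds the options dictionary; the output path and the frame name in the comment are placeholders, and the commented call shows how such options are typically passed to the S3 sink in a Glue PySpark script.

```python
def parquet_sink_options(output_path, partition_keys=("year", "month", "day")):
    """Build connection options for writing partitioned output to S3."""
    return {"path": output_path, "partitionKeys": list(partition_keys)}


# Placeholder output location.
options = parquet_sink_options("s3://Bucketname/parquet-output/")

# Inside the generated Glue job script, the options would typically be used
# like this (glueContext and the frame are defined by the script itself):
#
#   glueContext.write_dynamic_frame.from_options(
#       frame=mapped_frame,
#       connection_type="s3",
#       connection_options=options,
#       format="parquet",
#   )
print(options)
```

The order of the keys matters: it determines the nesting of the folders written to S3, with year at the top level.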
Step 6: Verifying the files in S3
Go to the S3 bucket where you saved the Parquet files. You will see new folders structured by year, month and day.
As organizations continue to move workloads to the cloud, there will be a considerable increase in the volume, velocity and variety of data. To maintain a healthy trade-off between cost and performance, measures such as converting to Parquet format and date-based partitioning can help organizations manage their data requirements more effectively.
Siva S from Tirupur in Tamil Nadu used newspapers to bind his school books as his family could not afford to buy the brown paper. He also designed his own labels for the books. Impressed with his artwork, his classmates started paying him money for these labels.
The seeds of entrepreneurship were very much there in Siva right from his childhood, which eventually led to an exciting journey for the engineering graduate.
Siva’s four-year-old bootstrapped technology startup PowerupCloud was acquired by Larsen & Toubro Infotech (LTI) for $15 million (approx Rs 105 crore) in October this year. The startup had grown without any external funding.
Filings made with the stock exchange reveal that PowerupCloud registered revenue of $3.5 million for FY 2018-19, which values the startup at almost five times its topline.
“My parents did not want me to start anything on my own since they wanted me to secure a job in a well-known MNC or go abroad,” says Siva.
Call it design or destiny, there were other plans for Siva. Though he did work with a couple of large technology companies in cushy, well-paid jobs, he eventually started out on his own. While working, Siva realised that cloud was going to be the next big thing in technology, with companies like Amazon Web Services entering India. He worked with a couple of smaller companies engaged with cloud technology to get a first-hand feel of this segment.
Written by Jeremiah Peter, Solution Specialist, Advanced Services Group, Powerupcloud Technologies
A Not So Distant Future
As we enter an era dominated by technological innovation, Artificial Intelligence continues to draw heated debate for its unparalleled ability to automate tasks and eliminate human dependency. The growing cynicism about automation has captured our imagination in cinematic marvels such as ‘2001: A Space Odyssey’ and ‘Terminator’. Painting a bleak future, these movies induce fears of a machine-led holocaust sparked by AI’s transcendence into Singularity, a point where AI supersedes human intelligence. Veering away from the dystopian narrative and objectively analyzing the realm of AI, it seems apparent that we can leverage AI for social good without descending into chaos. The call for beneficence in intelligent design is best captured by the words of American computer scientist Alan Kay: “The best way to predict the future is to invent it”.
This blog presents a new frame of analysis for the responsible use of Artificial Intelligence technologies to augment human and social development. It also outlines key Machine Learning and Computer Vision (Object Detection) concepts used to solve a real-world problem, opening a discussion of pragmatic solutions with broad social impact.
The Next Frontier
Under the dense canopy of commercial AI clutter, there are several AI initiatives that continue to garner both awe and adulation in the social sciences and humanities spectrum. Cancer detection algorithms, disaster forecast systems and voice-enabled navigation for the visually impaired are a few notable mentions. Although socially-relevant applications have achieved a fair degree of implementational success, they fail to attain outreach at a global level due to a lack of data accessibility and the dearth of AI talent.
Alternatively, innovative technologies that supplement large scale human-assistance programs in enhancing efficacy could be considered a worthwhile undertaking. Infusing computer vision technology in human-centered programs can dramatically improve last-mile coverage, enhance transparency, mitigate risks and measure the overall impact of assistance programs. In the next section, we delve into some core issues that afflict large-scale human assistance programs and underscore the need for technological intervention.
The State of Human-assistance Programs
According to The State of Food Security and Nutrition in the World report (2019), around 820 million people in the world are hungry and over 66 million of them are under the age of 5. With numbers increasing steeply in most parts of Africa, South America, and Asia, the fate of the Sustainable Development Goal of Zero Hunger by 2030 hangs by a thread. Perturbed by the growing scourge, some nations have responded with a slew of corrective measures.
One such initiative, the Midday Meal Scheme (MDMS), launched by the government of India in 1995, serves around 120 million children across government and government-aided schools. Recognized as one of the largest food assistance programs in the world, MDMS was laid out with a bold vision to enhance enrolment, retention, and attendance, with the overarching aim of improving nutritional status among children in India. However, even setting aside its logistical and infrastructural shortcomings, the initiative loses significant funds to pilferage each year (Source: CAG Report 2015). Shackled by a lack of resources, the program struggles to counter aberrant practices with institutional measures and seeks remediation through innovative solutions.
Given their unprecedented magnitude, large-scale human assistance schemes such as MDMS require well-crafted solutions that can instill governance and accountability into their dispensation process. In our constant endeavor to design Intelligent Apps with profound social impact, Powerup experts examined a few libraries and models to carve out a computer vision model that could bring governance to such programs. The following section explores the prerequisites for formulating a desirable solution.
Empowering Social Initiatives with Object Detection
Evidently, the success of an AI application depends heavily on a well-conceived algorithm that can handle varying degrees of complexity. However, developing a nuanced program from scratch is often cumbersome and time-intensive. To accelerate application development, programmers usually rely on pre-compiled libraries, which are frequently used code routines that can be called repeatedly in a program. After evaluating several open-source image processing libraries (VXL, AForge.Net, LTI-Lib), the Powerup team narrowed its choice down to OpenCV for its image processing functions and algorithms.
Besides a good library, the solution also requires a robust image classification engine to parse objects and scenes within images. However, despite several key advances in vision technology, most classification systems fail to cope with the complexities of the physical world, in that they can only identify a limited set of objects in a controlled environment. To develop advanced object detection capabilities, the application needs to be powered by an intelligent model that can produce a more refined outcome: to make sense of what it sees.
To develop a broad understanding of the real world, computer vision systems require comprehensive datasets consisting of a vast array of labeled images to facilitate object detection within acceptable bounds of accuracy. Apart from identifying the position of the objects in the image, the engine should also be able to discern the relationship between objects and stitch together a coherent story. ImageNet is a diverse, openly available dataset that has over 14 million labeled images and serves as a foundation for many similar computer vision explorations.
Moreover, computer vision systems also hinge on neural networks for developing self-learning capabilities. Though popular deep-learning models such as R-CNN, R-FCN, and SSD offer ground-breaking features, YOLO (You Only Look Once) stands out for its real-time object detection, clocking an impressive 45 FPS on a GPU. This processing speed enables the application not only to work with images but also to process videos in real time. In addition, YOLO9000 is trained on both the ImageNet classification dataset and the COCO detection dataset, which enables the model to handle a diverse set of object classes. We labeled and annotated local food image sets containing items such as rice, eggs, beans, etc. to sensitize the model toward domain-specific data.
As demonstrated in the image above, the model employs bounding boxes to identify individuals and food items in the picture. Acting as a robust deterrent against pilferage, the application can help induce more accountability, better scalability, and improved governance.
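As a rough idea of how such detections could be consumed downstream, the sketch below counts people and food items from a list of detections after dropping low-confidence boxes. The detection format, labels and the 0.5 threshold are assumptions for illustration and do not represent the output of any particular model or the application's actual logic.

```python
from collections import Counter


def count_detections(detections, min_confidence=0.5):
    """Count detected labels, ignoring boxes below the confidence threshold."""
    return Counter(
        d["label"] for d in detections if d["confidence"] >= min_confidence
    )


# Hypothetical detections: label, confidence and (x, y, width, height) box.
detections = [
    {"label": "person", "confidence": 0.92, "box": (12, 30, 110, 240)},
    {"label": "person", "confidence": 0.88, "box": (150, 28, 105, 236)},
    {"label": "rice", "confidence": 0.81, "box": (60, 200, 40, 30)},
    {"label": "egg", "confidence": 0.34, "box": (190, 205, 20, 18)},
]

counts = count_detections(detections)
print(counts["person"], counts["rice"], counts["egg"])
```

Comparing the number of people with the number of food items detected in a photo is one simple signal that could be recorded for accountability purposes.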
A New Reckoning
While a seventh of the population goes hungry every day, a third of the food in the world is wasted. The figure serves as a cause for deep contemplation of the growing disparities in a world spawned by industrialization and capitalism. As we stand on the cusp of modern society, phenomena such as unbridled population growth, disease control, climate change and unequal distribution of resources continue to present grave new challenges. Seeking innovative and sustainable solutions, therefore, becomes not just a moral obligation, but also the 21st century imperative.
Aside from the broad benefits, the domain of AI also presents a few substantive concerns that need oversight. The evolving technological landscape presents two inherent risks: wilful misuse (e.g., the Cambridge Analytica case) and unintended consequences (e.g., COMPAS, a biased parole-granting application).
While concerns such as wilful misuse raise moral questions pertaining to data governance and preservation of user self-determination, risks such as algorithmic bias and inexplicability of decision-making expose design loopholes. However, these apprehensions can be largely mitigated through a commonly accepted framework that is vetted by civil society organizations, academia, the tech and business community, and policymakers around the globe. Owing to this pressing need, AI4People was launched at the European Parliament in February 2018 to design a unified charter for intelligent AI-based design. Upholding values such as protection of human self-determination, privacy, and transparency, the initiative is aimed at proposing recommendations, subsequently leading to policies, for ethical and socially preferable development of AI.
Governed by ethical tenets, innovative solutions such as Object Detection can operate within the purview of the proposed framework to alleviate new-age challenges. With reasonable caution and radical solution-seeking, AI promises to be an indispensable vector of change that can transform society by amplifying human agency (what we can do).
The customer is Asia’s leading communications group and provides a diverse range of services including fixed, mobile, data, internet, TV, infocomms technology (ICT) and digital solutions. It is headquartered in Singapore, has 140 years of operating experience and has played a pivotal role in Singapore’s development as a major communications hub.
The customer is one of the largest telecom providers, with a subscriber base across Asia, Australia and Africa and 25,000+ employees spread across the globe. It was a nightmare for the HR department to respond to employee queries about policies and to provide real-time updates from its HRMS, hosted on Success Factors on SAP. This created the need for a comprehensive solution that caters to needs across geographies by automating the New Hire Onboarding and Leave Application processes. A chatbot that understands the user’s exact question and answers appropriately helps resolve user requests quickly, avoiding delays in response from the HR function.
Powerup integrated with Success Factors on SAP and deployed a chatbot on the customer’s HR Central website, which allows employees to query policies and get real-time updates on HRMS data. The bot, built on Botzer, also allows employees to apply for leave and approve pending requests in SAP.
The customer is a global Information Technology and consulting company that harnesses the power of digital, cloud, analytics and other emerging technologies. It is the 3rd largest IT services player in India, with an employee base of over 1.7 lakh. It serves clients across the US, Canada, Latin America, Continental Europe, India and the Middle East, and the Asia Pacific.
The customer, which serves clients across six continents, has a complex IT landscape to manage. The underlying infrastructure supports a huge employee base and all critical applications, including a digital self-service platform that gives employees a seamless experience across various processes and workflows. The application enables all company employees and contractors to manage business transactions and to access productivity tools, news, videos, communications, and other content via a single application interface. Tens of thousands of employees worldwide depend on their “company application” and an associated suite of 150+ applications for their day-to-day activities. But the existing approval-based systems for requests made it difficult to handle higher numbers of transactions and larger volumes of data, resulting in delays in approvals and decreased employee satisfaction. Our customer needed a smart Artificial Intelligence (AI) solution that uses advanced decision-making and machine learning not only to resolve this but also to customize the process for each request while reducing the number of inputs required from the user.
Powerup conducted an in-depth study of customer application systems and interacted with the users to understand the challenges. The major bottleneck was not the sheer number of requests being received on the portal, but the systems’ inability to understand user context and the number of steps involved in getting simple issues resolved.
Powerup designed a solution for the customer that integrates with their company application portal as a voice engine to automate the user journey on the system.
It also had to be a voice-first solution that executes actions based on the user’s voice inputs. The engine, backed by strong neural networks, understands the user context and personalizes itself for the user. It is built on an unsupervised learning model and personalizes the conversation based on the user’s past interactions, thus providing a unique and easy-to-navigate journey for each user.
With this approach, users can bypass the transactional system and get issues resolved, from approval to task submission, within 2-3 steps. Powerup also implemented the Botzer chatbot platform with Amazon Lex & Polly. Customer calls get diverted from IVR to the chatbot, which takes customers’ requests as voice input, does entity matching, triggers workflows and answers back immediately. The voice engine supports 2 languages today: English and Hindi.
Customers can get details like
Statement of Account,
Balance Due etc.
The intelligence built into the system allows it to behave differently with different users at different times of the day. If a user accesses different applications in the morning than in the evening, the engine responds accordingly during the respective hours.
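To illustrate the idea of time-of-day personalization, here is a deliberately simple sketch that records which application a user opens at which hour and suggests the most frequent one for the current part of the day. The class, the two-way morning/evening split and the application names are all hypothetical; the actual engine relies on a learning model rather than plain counting.

```python
from collections import Counter, defaultdict


def part_of_day(hour):
    """Map an hour (0-23) to a coarse bucket; a simplification for the sketch."""
    return "morning" if hour < 12 else "evening"


class UsageProfile:
    """Track a user's past interactions per part of the day."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def record(self, hour, app):
        """Remember that the user opened app at the given hour."""
        self._counts[part_of_day(hour)][app] += 1

    def suggest(self, hour):
        """Return the most used app for this part of the day, or None."""
        counts = self._counts.get(part_of_day(hour))
        if not counts:
            return None
        return counts.most_common(1)[0][0]


profile = UsageProfile()
profile.record(9, "timesheet")
profile.record(10, "timesheet")
profile.record(11, "leave")
profile.record(18, "expenses")

print(profile.suggest(8), profile.suggest(19))
```

The same user gets a different suggestion in the morning than in the evening, which mirrors the behaviour described above.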
Below is a high-level Solution workflow of the engine, being developed on AWS Lex & Polly, utilizing Botzer APIs at the backend.
Following is the high-level technical architecture of the implementation. The engine is hosted on the customer’s AWS VPC, ensuring data integrity & security.
The current architecture is capable of supporting 1 lakh+ employees and 150+ applications.
Faster ticket resolution and better communication with third-party application providers led to an increase in the number of tickets resolved. At the same time, the number of false positives decreased.