How to Convert Historical Data into Parquet Format with Date Partitioning


Written by: Nagarjun K, Software engineer at powerupcloud technologies

Given the cloud imperative, many organizations migrate their workloads from on-premises or other clouds to AWS. However, while migrating old data into Amazon S3, organizations find it hard to enable date-based partitioning. Because this feature cannot be implemented retrospectively, organizations usually end up with disparate storage layouts within their AWS environment. This blog shares best practices for implementing date-based partitioning on historical data and provides key guidelines for converting CSV/JSON files to Parquet format before migrating your data.


It is common knowledge that the Parquet file format is desirable because of its size and cost benefits. Hence, a sound approach for converting old data to Parquet format is crucial to migration success. To enable this, organizations often explore AWS EMR and Dataproc clusters. However, these approaches introduce other challenges, such as large cluster sizes and the associated cost of running the clusters. A solution that addresses these concerns and also frees the organization from cluster administration chores is therefore deeply valuable. For these reasons, AWS Glue is a prudent choice. Below is the list of interchangeable format conversions supported by Glue:

  • CSV
  • JSON
  • Parquet
  • Avro
  • ORC

Why Parquet?

Data is usually constrained by storage, which has a direct bearing on cost. Parquet is a columnar file format and allows significant storage optimization because of its compact, compressed footprint. Additionally, a wide range of compression and encoding options is available for Parquet files. Data warehousing services such as BigQuery and Snowflake support the Parquet file format, enabling granular control over performance and cost.

Why Partition?

As discussed above, partitioning files by date directly confines the amount of data that needs to be processed and therefore allows read optimization. While unpartitioned data can also be queried, that approach introduces performance and cost inefficiencies. In essence, partitioning limits the data a query needs to scan, enabling higher performance at lower cost.
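To make the scan reduction concrete, here is a minimal sketch, assuming the partitioned data has already been catalogued as a hypothetical historical_db.historical_parquet table; filtering on the partition columns lets Athena read only the matching year=/month=/day= prefixes in S3:

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

# Filtering on the partition columns prunes the scan to the matching
# year=/month=/day= prefixes instead of reading the whole table.
athena.start_query_execution(
    QueryString="""
        SELECT *
        FROM historical_parquet          -- hypothetical table name
        WHERE year = '2019' AND month = '12' AND day = '30'
    """,
    QueryExecutionContext={"Database": "historical_db"},   # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)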

Steps to convert the files into Parquet

 

Step 1: Extracting the Old Data

As a first step, extract the historical data from the source database, along with headers, in CSV format. For better readability, you may also use the pipe separator (|). After structuring the data with the pipe separator, store the CSV files in an S3 bucket.
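As a rough sketch of this step, assuming a PostgreSQL source and hypothetical table, bucket, and connection details, the extract-and-upload could look like this:

import boto3
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical source database connection string
engine = create_engine("postgresql://user:password@source-host:5432/sales_db")

# Extract the historical table along with headers
df = pd.read_sql("SELECT * FROM transactions", engine)   # hypothetical table

# Write a pipe-separated CSV and upload it to the S3 landing folder
df.to_csv("transactions.csv", sep="|", index=False)
boto3.client("s3").upload_file(
    "transactions.csv", "your-bucket", "historical-csv/transactions.csv"
)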

Step 2: Creating Crawlers for Fetching File Meta-data

To identify the schema of the CSV files, you need to create and run a crawler. Find the steps below:

  • Go to AWS Glue home page.
  • In the Crawlers section, click “Add crawler”.
  • Name your crawler.
  • Select the path of your CSV folder in S3 (do not select specific CSV files). As a prerequisite, create a folder that contains all your CSV files.
  • Give a folder path instead of a file name, for example s3://Bucketname/foldername.
  • You may add additional data sources; otherwise, click “No”.
  • Since the crawler needs both read and write access in order to read the source files and write the Parquet files back to S3, create an IAM role that allows both.
  • Set the crawler’s frequency to “Run on demand”.
  • Enter the database name in which the crawler should create the table schema for the CSV file. (A boto3 sketch of the equivalent crawler setup is shown after this list.)
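If you prefer to script this setup instead of using the console, the following boto3 sketch creates an equivalent crawler; the crawler name, IAM role, database, and S3 path are hypothetical placeholders:

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

# Point the crawler at the CSV folder, not at individual files
glue.create_crawler(
    Name="historical-csv-crawler",        # hypothetical crawler name
    Role="GlueS3ReadWriteRole",           # IAM role with S3 read and write access
    DatabaseName="historical_db",         # catalog database for the new table
    Targets={"S3Targets": [{"Path": "s3://Bucketname/foldername/"}]},
)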

Step 3: Running the Crawler

After you successfully create the crawler, click “Run it now” and wait for a few minutes. Shortly, you will see a new table with the same schema as your CSV file in the Data Catalog section.

  • Here, you can see the CSV file table created by the crawler. (A boto3 sketch for starting the crawler from code is shown below.)
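The same run can be triggered from code; this sketch assumes the hypothetical crawler name used above:

import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

glue.start_crawler(Name="historical-csv-crawler")     # hypothetical crawler name

# Poll until the crawler finishes and the table shows up in the Data Catalog
while glue.get_crawler(Name="historical-csv-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)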

Step 4: Adding the partition columns to Historical data using Athena

  • Once the table is created by the crawler, open Athena and click “Run query”.

  • The date column is in yyyy/mm/dd format. As part of the partitioning procedure, you can derive separate year, month, and day columns by running a partitioning query along the lines of the sketch below:
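The original query appears only as a screenshot, so the following is an assumed reconstruction: an Athena CTAS statement that splits the yyyy/mm/dd string into year, month, and day columns and writes a new pipe-delimited table that the Glue job can use as its source. The table, column, and bucket names are hypothetical:

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

# CTAS: add year/month/day columns derived from the yyyy/mm/dd date string
partitioning_query = """
    CREATE TABLE historical_db.historical_csv_with_partitions
    WITH (
        external_location = 's3://your-bucket/csv-with-partition-cols/',
        format = 'TEXTFILE',
        field_delimiter = '|'
    ) AS
    SELECT
        t.*,
        split_part(order_date, '/', 1) AS year,
        split_part(order_date, '/', 2) AS month,
        split_part(order_date, '/', 3) AS day
    FROM historical_db.historical_csv t   -- hypothetical source table and date column
"""

athena.start_query_execution(
    QueryString=partitioning_query,
    QueryExecutionContext={"Database": "historical_db"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)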

Step 5: Running ETL for converting to Parquet format

  • In the ETL section, go to Jobs and click “Add job”.
  • Name your job and select the IAM role (the role you created in the earlier step).
  • Select the data source created by the crawler.
  • Choose S3 as your data target.
  • The next screen allows column mapping. If you need to remap or remove any column from the CSV, you can modify it on this screen.
  • The following screen shows the diagram and generated source code for the job. As a next step, add partitionKeys and specify the year, month, and day column names, in that order, to enable partitioning. For example: "partitionKeys": ["year", "month", "day"]. (A sketch of the complete script is shown after this list.)
  • Save the changes and click the “Run job” button. Wait a few minutes (depending on your total data size) for the job to complete. You can see the logs at the bottom.
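For reference, here is a minimal sketch of what the generated Glue script looks like once the partition keys are added. The database, table, and output path below are hypothetical placeholders, and the column-mapping transforms that the wizard generates are omitted for brevity:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the CSV table registered by the crawler
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="historical_db",                     # hypothetical database
    table_name="historical_csv_with_partitions",  # table with year/month/day columns
)

# Write Parquet back to S3, partitioned by year, month, and day (in that order)
glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={
        "path": "s3://your-bucket/parquet/",      # hypothetical output path
        "partitionKeys": ["year", "month", "day"],
    },
    format="parquet",
)

job.commit()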

Step 6: Verifying the files in S3

Go to the S3 bucket where you saved the Parquet files. You will see new folders structured in year/month/day format, one prefix per partition.
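As a small optional check from code (bucket and prefix are hypothetical), you can list the partition prefixes with boto3:

import boto3

s3 = boto3.client("s3")

# List the top-level partition prefixes under the Parquet output path
resp = s3.list_objects_v2(
    Bucket="your-bucket",      # hypothetical bucket
    Prefix="parquet/",         # hypothetical output prefix
    Delimiter="/",
)
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])    # e.g. parquet/year=2019/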

Conclusion:

As organizations continue to move workloads to the cloud, there will be a considerable increase in the volume, velocity, and variety of data. In order to maintain a healthy trade-off between cost and performance, measures such as converting to Parquet format and date-based partitioning can help organizations manage their data more effectively.
