Easy CSV to Parquet conversion

By Maik Wiesmüller
How to use AWS Athena for quick conversions to a more efficient data format.

Why Apache Parquet?

Apache Parquet is an open-source, column-oriented data format and part of the Hadoop ecosystem. The column-based approach, combined with support for efficient compression, makes Apache Parquet a perfect match for storing and querying large amounts of data.

When working with third-party software exports or other metrics providers, it is not uncommon to end up with data in CSV or line-delimited JSON (LDJSON) format. These formats work fine and can easily be used with services like AWS Glue and AWS Athena.

A human-readable data format has its advantages, but these formats are rather inefficient when it comes to storing and querying large data sets.

Parquet is more efficient when it comes to the amount of scanned data. While a scan over CSV data will need to consider all columns, Parquet enables AWS Athena to read only the columns you need.

Whenever you have a large amount of data and want to analyse it, you should think about converting your data to Parquet.

Example scenario

To showcase this, we have downloaded the 2018 New York City Taxi Trip Record data and uploaded it into an S3 Bucket:

s3://incs-blog-taxidata/csv/
--
~8.45 GB of CSV data
102'804'262 rows

You can get the TLC Trip Record Data from the New York City Taxi & Limousine Commission.

The data contains all kinds of information about taxi trips, like the passenger count per trip or pick-up locations. A Glue Crawler has already classified the data and stored the schema in a Glue table: taxidata.csv.
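
If you prefer not to run a crawler, the table can also be declared by hand with an Athena DDL statement. The following is only an illustrative sketch, not the crawler's actual output: the table name csv_manual is made up, and only the first five trip-record columns are listed.

CREATE EXTERNAL TABLE IF NOT EXISTS taxidata.csv_manual (
    vendorid              INT,
    tpep_pickup_datetime  TIMESTAMP,
    tpep_dropoff_datetime TIMESTAMP,
    passenger_count       INT,
    trip_distance         DOUBLE
    -- remaining trip-record columns omitted for brevity
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://incs-blog-taxidata/csv/'
TBLPROPERTIES ('skip.header.line.count' = '1');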

To get the average number of passengers per ride, we use AWS Athena and run the following query:

SELECT AVG(passenger_count) FROM "taxidata"."csv";

The Results:

  • 4.18 seconds
  • 8.45 GB scanned
  • an average of ~1.59 passengers per trip

Even though we only query the passenger_count column, AWS Athena still needs to scan the entire data set. So let’s use the advantages of Parquet!

Converting data with Glue ETL Jobs

To be more cost efficient, we want to convert the data to Parquet. One common way to do this is to write an ETL job with AWS Glue: you create a Glue Spark job that maps data sources and destinations, set up triggers and tweak the DPU settings for your job.

While this may still be the go-to approach for an automated, recurring conversion, it is not the most convenient or quickest way. Instead of Glue ETL, we will look for an easy-to-use method for these “let me check that quickly” moments.

Using AWS Athena for quick conversion

Instead of building an ETL job and mapping all the columns, we can use AWS Athena’s CREATE TABLE AS SELECT (CTAS) statement. We set the new location to s3://incs-blog-taxidata/parquet and specify PARQUET as the new format.

Given that the data schema is already classified in AWS Glue, we can simply SELECT * from the existing table, and AWS Athena will convert everything and write it to the new location.

CREATE TABLE "taxidata"."parquet"
WITH (
    external_location='s3://incs-blog-taxidata/parquet',  
    format='PARQUET'
) AS
SELECT * FROM "taxidata"."csv";

After ~29 seconds, AWS Athena has converted all the data to Parquet and stored it in our bucket.

We also now have a new Glue table "taxidata"."parquet".
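
If you want to double-check that nothing was lost along the way, one way is a quick row-count comparison of both tables (note that the count over the CSV table still scans the full CSV data set):

SELECT
    (SELECT COUNT(*) FROM "taxidata"."csv")     AS csv_rows,
    (SELECT COUNT(*) FROM "taxidata"."parquet") AS parquet_rows;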

Size in S3:

  • CSV 8.45 GB (uncompressed¹)
  • Parquet 1.3 GB

Now let’s execute the example query again to get the average number of passengers over all rows, but this time against the Parquet data:

SELECT AVG(passenger_count) FROM "taxidata"."parquet";

The Results:

  • 3.01 seconds
  • 21.43 MB scanned (instead of 8.45 GB)
  • an average of ~1.59 passengers per trip
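
As a side note: if most of your analyses filter on a time range, the same CTAS mechanism can also partition the output, so that Athena only scans the months you actually query. Below is a rough sketch, assuming the crawler classified tpep_pickup_datetime as a timestamp (if it came out as a string, you would have to parse it first); the table name parquet_by_month and the target prefix are made up for this example.

CREATE TABLE "taxidata"."parquet_by_month"
WITH (
    external_location = 's3://incs-blog-taxidata/parquet-by-month',
    format = 'PARQUET',
    partitioned_by = ARRAY['pickup_month']
) AS
SELECT
    *,
    month(tpep_pickup_datetime) AS pickup_month
FROM "taxidata"."csv";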

Conclusion

If we run our query 3 times - with CSV data - we scan (and get charged for) 25.35 GB (3 x 8.45 GB).

If we convert the data to Parquet first, we only pay for scanning 64.29 MB (3 x 21.43 MB) plus the initial conversion, which sums up to ~8.51 GB.

With this easy-to-use conversion of CSV or line-delimited JSON data to Parquet, you can save significant cost with very little effort.


¹ CSV can also be compressed with good results.
