When it comes to storing data in files, CSV, or more generally the delimited text file, has been the most common format so far. Parquet, however, is the first-choice format for big data analytics. Let’s look at a few points to build a quick understanding of this modern file format:
- This is an open source file format
- Stores data in a columnar format, i.e. rather than storing the data row by row, it stores it column by column, so all values from a column are kept together
- Data is stored in a compressed fashion
- Best suited for use cases where queries read a limited set of columns rather than all of them; because the data from each column is stored together, such queries are fast thanks to reduced disk I/O
- Because homogeneous data is stored together, it compresses very well
- Data compression and the columnar format result in reduced storage and improved query performance
- Data is stored in a binary format, so you can’t read the contents of a Parquet file in a text editor. Tools such as Spark typically write a Parquet dataset as a directory of part files, so if you need to check its size, check the size of the directory
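Since tools such as Spark usually write a Parquet dataset as a directory of part files, its size is the sum of the files inside that directory. Here is a minimal sketch using only the Python standard library. The function name `dataset_size_bytes` is made up for illustration and is not part of any Parquet library.

```python
import os

def dataset_size_bytes(path):
    """Return the total size in bytes of a Parquet dataset.

    Works for a single file as well as for a directory of part files
    (hypothetical helper, not part of any Parquet library).
    """
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    # Walk sub-directories too, since partitioned datasets nest their files.
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

You would call it as `dataset_size_bytes("sales.parquet")`, where `sales.parquet` is a placeholder for your own dataset path.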
e.g. in row-oriented storage, a table is written one complete row after another, whereas in a columnar storage format, all values of the first column are written, then all values of the second column, and so on.
i.e. data from a column is stored together, which means homogeneous data sits side by side, and that supports excellent compression.
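The difference between the two layouts can be sketched in a few lines of Python. The sample table below is made up for illustration. Run-length encoding stands in here for the kind of encoding that benefits from homogeneous data; Parquet's real encodings and compression codecs are more elaborate, so treat this as a toy model rather than how Parquet lays out bytes.

```python
from itertools import groupby

# Hypothetical sample table: (id, country, status)
rows = [
    (1, "IN", "active"),
    (2, "IN", "active"),
    (3, "IN", "inactive"),
    (4, "US", "inactive"),
]

# Row-oriented layout: the values of each row sit next to each other.
row_layout = [value for row in rows for value in row]

# Column-oriented layout: the values of each column sit next to each other.
columns = list(zip(*rows))
column_layout = [value for column in columns for value in column]

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(column_layout)

print(len(row_runs))     # 12: no two neighbouring values repeat
print(len(column_runs))  # 8: repeated countries and statuses collapse
```

In the row layout, a country code is always followed by a status and then by the next id, so nothing repeats. In the column layout, the three `"IN"` values and the repeated statuses sit next to each other and collapse into single runs.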
According to a case study published on the Databricks site for a particular dataset, Parquet is clearly the winner compared to CSV files.
So beyond reduced storage and faster queries, Parquet also brings cost savings.
So, in a nutshell, if you work with big data analytics tools such as Apache Spark or Databricks and need to store data in files that are not meant to be human-readable but instead serve as a data transport mechanism for faster processing, consider using the Parquet file format. Bringing data from a relational database system into a data lake in Parquet format is a commonly recommended approach.