Introduction. When we store data we face a trade-off between flexibility and storage efficiency. Old-school formats like CSV, JSON and XML give you high flexibility and human readability as a benefit, but they do not hold up well when you have a lot of data. For big data processing, where data is stored in an archival fashion, we can use binary formats with storage tricks like indexes and compression. That is why we cannot ignore file formats like Apache Parquet, Apache Iceberg, Apache Avro, Apache ORC, etc. In this post we take a deeper look at the Apache Parquet format.
Storage layout. The first trick is the physical storage layout. At the logical level we work with tables: rows with a fixed set of columns. But on disk we can store tables in three ways:
- Row-wise. Store each row sequentially, with all of its column values together. For example: [(ROW_1 COL_1), (ROW_1 COL_2), (ROW_2 COL_1), …]. This layout is used by CSV, JSON and XML.
- Columnar. Store each column sequentially: the values of each column are grouped together. For example: [(ROW_1 COL_1), (ROW_2 COL_1), …]. This layout lets a reader pick up only the columns it needs, which helps with wide tables (tables with many columns).
- Hybrid. Split the table horizontally into chunks of rows and store each chunk column by column.
For OLTP (many small operations involving whole rows) the row-wise layout is a better fit, while for OLAP (few large operations involving a subset of columns) the hybrid layout wins. For this reason relational databases like PostgreSQL / MySQL prefer the row-wise layout and struggle with wide tables, while analytical databases like ClickHouse / Vertica prefer the hybrid layout.
Parquet, as an archival storage format, uses the hybrid physical storage layout.
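To make the difference concrete, here is a minimal sketch in plain Python; the tiny table and the chunk size of two rows are made up for illustration, and this is of course not how Parquet itself lays out bytes:

```python
# A tiny logical table: rows with a fixed set of columns.
rows = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]
columns = ["id", "name", "score"]

# Row-wise: each row is stored contiguously (CSV/JSON/XML style).
row_wise = [row[col] for row in rows for col in columns]

# Columnar: all values of a column are stored contiguously.
columnar = [row[col] for col in columns for row in rows]

# Hybrid: split rows into horizontal chunks ("row groups"),
# then store each chunk column by column (Parquet style).
chunk_size = 2
hybrid = [
    row[col]
    for start in range(0, len(rows), chunk_size)
    for col in columns
    for row in rows[start:start + chunk_size]
]

print(row_wise)  # [1, 'a', 10, 2, 'b', 20, 3, 'c', 30]
print(columnar)  # [1, 2, 3, 'a', 'b', 'c', 10, 20, 30]
print(hybrid)    # [1, 2, 'a', 'b', 10, 20, 3, 'c', 30]
```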
Format at the dataset level. On disk a Parquet dataset is usually split into multiple files in one or more subdirectories, where each subdirectory can serve partitioning needs.
Format at the file level. Data in each file is organized according to a fixed schema. A file contains:
- Row groups (default size is 128 MB, matching the Hadoop/HDFS block size).
- Column chunks, one per column within each row group.
- Pages (default size is 1 MB).
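One way to see this hierarchy is to inspect a file's footer, for example with pyarrow; the file name, columns and sizes below are placeholders for illustration, not recommendations:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table; row_group_size and data_page_size control the
# row-group and page granularity described above.
table = pa.table({
    "id": list(range(1_000_000)),
    "value": [i % 100 for i in range(1_000_000)],
})
pq.write_table(table, "example.parquet",
               row_group_size=100_000, data_page_size=1 << 20)

# Inspect the footer: row groups -> column chunks (-> pages).
md = pq.ParquetFile("example.parquet").metadata
print(md.num_row_groups, md.num_columns, md.num_rows)
chunk = md.row_group(0).column(0)   # first column chunk of the first row group
print(chunk.compression, chunk.total_compressed_size, chunk.statistics)
```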
Metadata is stored at the file, column (chunk) and page level. Remember that if the file metadata is corrupt, the file is lost. If the column metadata is corrupt, that column chunk is lost (but column chunks for this column in other row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. A file with smaller row groups is therefore more resilient to corruption.
There are three types of metadata: file metadata, column (chunk) metadata and page header metadata. All metadata structures are Apache Thrift structs serialized using the TCompactProtocol.
The main advantage of metadata is quick navigation over chunks without reading each one. For example, it enables predicate pushdown: to evaluate a filter like column > X, a reader can use the statistics in the metadata to skip chunks that cannot possibly match.
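A sketch of the idea, reusing the hypothetical example.parquet file from the previous snippet: the reader checks the min/max statistics of each row group's column chunk and skips chunks that cannot satisfy the predicate; pyarrow's filters argument applies the same trick for you.

```python
import pyarrow.parquet as pq

# Decide which row groups can be skipped for a predicate like value > 90,
# using only footer metadata (no data pages are read here).
pf = pq.ParquetFile("example.parquet")
value_idx = pf.schema_arrow.get_field_index("value")
for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(value_idx).statistics
    if stats is not None and stats.has_min_max and stats.max <= 90:
        print(f"row group {i}: skipped (max={stats.max})")
    else:
        print(f"row group {i}: must be read")

# Higher-level readers do this automatically:
filtered = pq.read_table("example.parquet", filters=[("value", ">", 90)])
```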
Encoding. Another optimization is dictionary encoding: when a column has a small number of unique values, the values are stored once in a dictionary and the data holds only small integer indexes into it. On top of that, Parquet uses bit packing, which stores such small integers in just a few bits each. RLE (run-length encoding) optimizes storage of multiple occurrences of the same value: a value is stored once along with the number of occurrences.
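A simplified illustration of the idea in plain Python (not Parquet's actual encoder; the country codes are made up):

```python
# Dictionary encoding replaces repeated values with small integer codes,
# and run-length encoding collapses runs of the same code into (code, count).
values = ["RU", "RU", "RU", "US", "US", "RU", "DE", "DE", "DE", "DE"]

# Dictionary encoding: few unique values -> small integer codes.
dictionary = sorted(set(values))               # ['DE', 'RU', 'US']
codes = [dictionary.index(v) for v in values]  # [1, 1, 1, 2, 2, 1, 0, 0, 0, 0]

# Run-length encoding of the codes: [code, run_length] pairs.
# The remaining small codes could then be bit-packed into a few bits each.
runs = []
for code in codes:
    if runs and runs[-1][0] == code:
        runs[-1][1] += 1
    else:
        runs.append([code, 1])

print(dictionary, runs)  # ['DE', 'RU', 'US'] [[1, 3], [2, 2], [1, 1], [0, 4]]
```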
Compression. Apache Parquet lets you choose a compression algorithm applied to entire pages. In Spark, the configuration spark.sql.parquet.compression.codec accepts none, uncompressed, snappy, gzip, lzo, brotli, lz4 and zstd. Note that `zstd` requires `ZStandardCodec` to be installed before Hadoop 2.9.0, and `brotli` requires `BrotliCodec` to be installed.
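For example, in PySpark the codec can be set globally for the session or per write; the paths and toy data here are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compression").getOrCreate()
df = spark.range(1_000_000)  # toy data

# Set the codec for all Parquet writes in this session...
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
df.write.mode("overwrite").parquet("/tmp/data_zstd")

# ...or override it for a single write.
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/data_snappy")
```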
Partitioning. With partitioning we can, for example, split the data by date (YYYY-MM-DD) and open only the target files. In Spark, use df.write.partitionBy("col1", "col2", ...).parquet(filename) for partitioning.
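A small PySpark sketch, assuming a hypothetical events table with an event_date column as the partition key:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parquet-partitioning").getOrCreate()

# Hypothetical events table with a date column used as the partition key.
df = spark.createDataFrame(
    [("2024-01-01", "click"), ("2024-01-01", "view"), ("2024-01-02", "click")],
    ["event_date", "event_type"],
)
df.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events")

# On disk this produces Hive-style subdirectories:
#   /tmp/events/event_date=2024-01-01/part-*.parquet
#   /tmp/events/event_date=2024-01-02/part-*.parquet

# A filter on the partition column opens only the matching subdirectory.
one_day = spark.read.parquet("/tmp/events").where(F.col("event_date") == "2024-01-01")
```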
Parquet for streaming. Parquet was created for archival storage and carries metadata overhead, which makes it a poor choice for streaming. Prefer Apache Avro, which uses a row-based layout without heavy metadata overhead.
Value Types. Supported value types:
- BOOLEAN: 1 bit boolean
- INT32: 32 bit signed ints
- INT64: 64 bit signed ints
- INT96: 96 bit signed ints (Spark uses this type for storing timestamps; the flag spark.sql.parquet.int96AsTimestamp tells Spark SQL to interpret INT96 data as a timestamp, to provide compatibility with such systems)
- FLOAT: IEEE 32-bit floating point values
- DOUBLE: IEEE 64-bit floating point values
- BYTE_ARRAY: arbitrarily long byte arrays.
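As a rough sketch of how these physical types surface in practice, here is a pyarrow schema whose Arrow types map onto them; the field names are made up, and INT96 is legacy so it is omitted:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Each Arrow type below maps onto one of the Parquet physical types above.
schema = pa.schema([
    ("flag", pa.bool_()),       # BOOLEAN
    ("small_id", pa.int32()),   # INT32
    ("big_id", pa.int64()),     # INT64
    ("ratio", pa.float32()),    # FLOAT
    ("amount", pa.float64()),   # DOUBLE
    ("payload", pa.binary()),   # BYTE_ARRAY
])
table = pa.table(
    {"flag": [True], "small_id": [1], "big_id": [2],
     "ratio": [0.5], "amount": [1.5], "payload": [b"raw"]},
    schema=schema,
)
pq.write_table(table, "typed.parquet")
print(pq.ParquetFile("typed.parquet").schema)  # prints the Parquet physical types
```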
View Parquet file data. Use parquet-tools with cat to scan a file, e.g. parquet-tools cat --json <file>. Also, parquet-tools meta <file> shows the file metadata.
General tips and best practices for Parquet (and not only Parquet), as a summary:
- Avoid many small files
- Avoid overly large files
- Pick the right partition key
- Denormalize and pre-compute: replace future computations / joins with a simple filter
- Use Apache Parquet for archival storage and Apache Avro for streaming
References:
- https://parquet.apache.org/docs/
- https://www.youtube.com/watch?v=1j8SdS7s_NY
- https://www.youtube.com/watch?v=rVC9F1y38oU
- https://www.youtube.com/watch?v=_0Wpwj_gvzg