Is it better to have one large parquet file or many smaller parquet files?

I understand that HDFS splits files into roughly 64 MB blocks. We have data streaming in, and we can store it in large or medium-sized files. What is the optimal size for columnar file storage? Would storing files where the smallest column is 64 MB save any computation time over, say, 1 GB files?





1 answer


Aim for around 1 GB per file (Spark partition) (1).

Ideally, you would use snappy compression (the default), because snappy-compressed parquet files are splittable (2).

Using snappy instead of gzip will significantly increase the file size, so if storage space is an issue, that needs to be considered.

.option("compression", "gzip")

is the ability to override the default instant compression.
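For illustration, here is a minimal sketch of both variants. The input path, the SparkSession and the DataFrame names are hypothetical assumptions, not taken from the question:

import org.apache.spark.sql.SparkSession

// hypothetical session and input; substitute your own source
val spark = SparkSession.builder().appName("parquet-compression-demo").getOrCreate()
val df = spark.read.json("hdfs:///data/events_json")

// the default parquet codec is snappy, so no option is needed here
df.write.parquet("hdfs:///data/events_snappy")

// override the codec per write; gzip gives smaller files at the cost of more CPU
df.write
  .option("compression", "gzip")
  .parquet("hdfs:///data/events_gzip")

The codec can also be set session-wide through the spark.sql.parquet.compression.codec configuration instead of per write.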



If you need to resize/repartition your Dataset/DataFrame/RDD, call .coalesce(<num_partitions>) or, in the worst case, .repartition(<num_partitions>). Warning: repartition especially, but also coalesce, can cause a shuffle of the data, so use them with some caution.
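As a rough sizing sketch, reusing the hypothetical df from above: the 50 GB input size and the 1 GB target are illustrative assumptions, not measurements.

// target roughly 1 GB per output file, per the guideline above
val targetFileBytes = 1024L * 1024 * 1024

// assumed on-disk size of the dataset; in practice you would measure this
val estimatedBytes = 50L * 1024 * 1024 * 1024

val numFiles = math.max(1, (estimatedBytes / targetFileBytes).toInt)

// coalesce reduces the partition count without a full shuffle;
// df.repartition(numFiles) would shuffle, but can also even out skewed partitions
df.coalesce(numFiles)
  .write
  .parquet("hdfs:///data/events_1gb_files")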

In addition, parquet files (and files in general) should usually be larger than the HDFS block size (128 MB by default).
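If you want to confirm the block size your cluster actually uses rather than assuming the 128 MB default, one way, sketched here via the Hadoop configuration exposed by the SparkSession above, is:

// dfs.blocksize is the standard HDFS property; fall back to 128 MB if unset
val blockSize = spark.sparkContext.hadoopConfiguration
  .getLongBytes("dfs.blocksize", 128L * 1024 * 1024)
println(s"HDFS block size: ${blockSize / (1024 * 1024)} MB")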

1) https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html
2) http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/









