Creating parquet files - differences between R and Python
We created a Parquet file in Dask (Python) and in Drill (from R, using the sergeant package). We noticed several problems:
- The Dask (i.e. fastparquet) output has the files _metadata and _common_metadata, while the Parquet output from R / Drill does not have these files and has .parquet.crc files instead (which can be deleted). What is the difference between these Parquet implementations?
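For context, the two layouts described above look roughly like this (file and directory names are illustrative, not taken from the actual output):

```text
Dask / fastparquet          R (sergeant) / Drill
mydata/                     mydata/
├── _common_metadata        ├── part-0.parquet
├── _metadata               ├── .part-0.parquet.crc
├── part.0.parquet          ├── part-1.parquet
└── part.1.parquet          └── .part-1.parquet.crc
```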
(This answers only the first question; please post the others as separate questions to make them easier to answer.)
_metadata and _common_metadata are auxiliary files that are not required for a Parquet dataset. They are used by Spark / Dask / Hive / ... to expose the metadata of all the Parquet files in the dataset without having to read the footer of every file. In contrast, Apache Drill creates a similar file in each folder (on demand) that contains the footers of all the Parquet files there. Only the first query on the dataset reads all the files; subsequent queries read only the file that caches all the footers.
Tools that use _metadata and _common_metadata should be able to use them for faster execution, but should not depend on them to operate. If the files don't exist, the query engine simply has to read all the footers.
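That fallback logic can also be sketched with stdlib stand-ins: prefer the optional summary file when it exists, otherwise scan every data file's footer. The _metadata file here is a hypothetical JSON substitute for the real Parquet summary file written by Dask/Spark.

```python
import json
import os
import tempfile

def dataset_footers(dataset_dir):
    """Prefer the optional _metadata file; fall back to scanning footers."""
    summary = os.path.join(dataset_dir, "_metadata")
    if os.path.exists(summary):
        # Fast path: one read instead of one read per data file.
        with open(summary) as f:
            return json.load(f)
    # Slow path: read every data file's footer.
    footers = {}
    for name in sorted(os.listdir(dataset_dir)):
        if name.endswith(".parquet"):
            with open(os.path.join(dataset_dir, name)) as f:
                footers[name] = json.load(f)
    return footers

with tempfile.TemporaryDirectory() as d:
    for i in range(2):
        with open(os.path.join(d, f"part{i}.parquet"), "w") as f:
            json.dump({"num_rows": 5}, f)
    no_summary = dataset_footers(d)       # slow path: two footer reads
    with open(os.path.join(d, "_metadata"), "w") as f:
        json.dump(no_summary, f)          # now write the summary file
    with_summary = dataset_footers(d)     # fast path: one read
    assert no_summary == with_summary     # same result either way
```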