Creating Parquet files - differences between R and Python

We created a Parquet dataset in Dask (Python) and in Drill (R, using the sergeant package). We noticed several problems:

  • The Dask output (i.e. fastparquet) contains _metadata and _common_metadata files, while the Parquet dataset
    written from R / Drill does not have these files and has .parquet.crc files instead (which can be deleted).
    What is the difference between these Parquet implementations? (A sketch of the Dask write is shown below.)
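For reference, a minimal sketch of how such a dataset can be produced on the Dask side. The path, the example data and the write_metadata_file flag are assumptions (not taken from the question), and the flag name is from recent Dask versions:

    import os
    import pandas as pd
    import dask.dataframe as dd

    # Hypothetical example data; the original question does not show the input.
    pdf = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # With the fastparquet engine, Dask writes one part file per partition
    # plus the _metadata / _common_metadata summary files.
    ddf.to_parquet("mydata.parquet", engine="fastparquet", write_metadata_file=True)

    print(sorted(os.listdir("mydata.parquet")))
    # e.g. ['_common_metadata', '_metadata', 'part.0.parquet', 'part.1.parquet']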




1 answer


(Answering only question 1; please post the other points as separate questions to make them easier to answer.)

_metadata and _common_metadata are auxiliary files that are not required for a Parquet dataset. They are used by Spark / Dask / Hive / ... to expose the metadata of all Parquet files in the dataset without having to read the footer of every file. In contrast, Apache Drill creates a similar file in each folder (on demand) that contains all the footers of the Parquet files in it. Only the first query against the dataset reads all the files; subsequent queries only read the file that caches all the footers.



Tools that use _metadata and _common_metadata should be able to exploit them for faster execution, but must not depend on them to operate. If the files don't exist, the query engine simply has to read all the footers itself.
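A sketch of that fallback (paths hypothetical, fastparquet engine assumed to be available in your Dask version): deleting the auxiliary files does not break the dataset, the engine just reads every part file's own footer.

    import os
    import dask.dataframe as dd

    path = "mydata.parquet"
    for aux in ("_metadata", "_common_metadata"):
        candidate = os.path.join(path, aux)
        if os.path.exists(candidate):
            os.remove(candidate)

    # Still readable: without the summary files the engine falls back
    # to reading the footer of each part file.
    df = dd.read_parquet(path, engine="fastparquet")
    print(df.head())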









