What controls the number of partitions when reading Parquet files?

My setup:

Two Spark clusters: one on EC2 and one on Amazon EMR, both running Spark 1.3.1.

The EMR cluster was installed with emr-bootstrap-actions. The EC2 cluster was installed with Spark's default EC2 scripts.

Code:

Read a folder containing 12 Parquet files and count the number of partitions:

val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length

      

Observations:

  • On EC2, this code gives me 12 partitions (one per file, which makes sense).
  • On EMR, this code gives me 138 (!) partitions.

Question:

What controls the number of partitions when reading Parquet files?

I'm reading the same folder on S3 with the same Spark release, which leads me to think there are configuration settings that govern how the splits are computed. Does anyone have more information on this?
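For comparison, here is a minimal sketch (assuming the spark-shell's sc is available, and assuming the s3n properties below are the relevant ones) of how to inspect the Hadoop-level settings on each cluster:

val hconf = sc.hadoopConfiguration
// Which filesystem implementation handles s3n:// paths (null if not set)
println(hconf.get("fs.s3n.impl"))
// Block size used when computing splits for s3n:// input (null if not set)
println(hconf.get("fs.s3n.block.size"))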

Any insights would be greatly appreciated.

Thanks.

UPDATE:

It looks like the large number of partitions is a result of EMR's S3 filesystem implementation ( com.amazon.ws.emr.hadoop.fs.EmrFileSystem ).

When I remove

<property><name>fs.s3n.impl</name><value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value></property>

from core-site.xml (falling back to the standard Hadoop S3 filesystem), I get 12 partitions.
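As a possible alternative to editing core-site.xml cluster-wide, the implementation could perhaps be overridden per application at runtime. This is an untested sketch; whether EMR honours such an override is an assumption on my part:

// org.apache.hadoop.fs.s3native.NativeS3FileSystem is the stock Hadoop s3n implementation
sc.hadoopConfiguration.set("fs.s3n.impl",
  "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
val logs = sqlContext.parquetFile("s3n://mylogs/")
println(logs.rdd.partitions.length)  // hoping for 12, as with the edited core-site.xml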

When working with EmrFileSystem, it seems the number of partitions can be controlled with:

<property><name>fs.s3n.block.size</name><value>xxx</value></property>
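Here is a rough sketch of setting that property per application instead of in core-site.xml; 128 MB is just an example value, and I'm assuming EmrFileSystem picks the property up at runtime:

// Set the s3n block size (in bytes) before reading; fewer, larger blocks
// should mean fewer splits and therefore fewer partitions
sc.hadoopConfiguration.setLong("fs.s3n.block.size", 128L * 1024 * 1024)
val logs = sqlContext.parquetFile("s3n://mylogs/")
println(logs.rdd.partitions.length)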

      

Is there a cleaner way to control the number of partitions when using EmrFileSystem?
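One filesystem-agnostic workaround (a sketch, not something I have benchmarked) would be to coalesce right after the read, so the downstream job sees a fixed partition count regardless of how the splits were computed:

val logs = sqlContext.parquetFile("s3n://mylogs/")
// Collapse to 12 partitions without a shuffle; pick whatever count suits the job
val compacted = logs.rdd.coalesce(12, shuffle = false)
println(compacted.partitions.length)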
