What controls the number of partitions when reading Parquet files?

My setup:

Two Spark clusters: one on EC2 and one on Amazon EMR, both running Spark 1.3.1.

The EMR cluster was installed with emr-bootstrap-actions. The EC2 cluster was installed with Spark's default EC2 scripts.

Code:

Read a folder containing 12 Parquet files and count the number of partitions:

val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length

      

Observations:

  • On EC2, this code gives me 12 partitions (one per file, which makes sense).
  • On EMR, this code gives me 138 (!) partitions.

Question:

What controls the number of partitions when reading Parquet files?

I'm reading the same folder on S3 with the same Spark release, which leads me to think there are configuration settings that govern how the splits are computed. Does anyone have more information on this?
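For comparison, here is a minimal sketch (assuming the spark-shell's sc is available, and assuming the s3n properties below are the relevant ones) of how to inspect the Hadoop-level settings on each cluster:

val hconf = sc.hadoopConfiguration
// Which filesystem implementation handles s3n:// paths (null if not set)
println(hconf.get("fs.s3n.impl"))
// Block size used when computing splits for s3n:// input (null if not set)
println(hconf.get("fs.s3n.block.size"))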

Any insights would be greatly appreciated.

Thanks.

UPDATE:

It looks like the large number of partitions is a result of EMR's S3 filesystem implementation ( com.amazon.ws.emr.hadoop.fs.EmrFileSystem ).

When I remove

<property><name>fs.s3n.impl</name><value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value></property>

from core-site.xml (falling back to the standard Hadoop S3 filesystem), I get 12 partitions.
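As a possible alternative to editing core-site.xml cluster-wide, the implementation could perhaps be overridden per application at runtime. This is an untested sketch; whether EMR honours such an override is an assumption on my part:

// org.apache.hadoop.fs.s3native.NativeS3FileSystem is the stock Hadoop s3n implementation
sc.hadoopConfiguration.set("fs.s3n.impl",
  "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
val logs = sqlContext.parquetFile("s3n://mylogs/")
println(logs.rdd.partitions.length)  // hoping for 12, as with the edited core-site.xml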

When working with EmrFileSystem, it seems the number of partitions can be controlled with:

<property><name>fs.s3n.block.size</name><value>xxx</value></property>
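Here is a rough sketch of setting that property per application instead of in core-site.xml; 128 MB is just an example value, and I'm assuming EmrFileSystem picks the property up at runtime:

// Set the s3n block size (in bytes) before reading; fewer, larger blocks
// should mean fewer splits and therefore fewer partitions
sc.hadoopConfiguration.setLong("fs.s3n.block.size", 128L * 1024 * 1024)
val logs = sqlContext.parquetFile("s3n://mylogs/")
println(logs.rdd.partitions.length)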

      

Is there a cleaner way to control the number of partitions when using EmrFileSystem?
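One filesystem-agnostic workaround (a sketch, not something I have benchmarked) would be to coalesce right after the read, so the downstream job sees a fixed partition count regardless of how the splits were computed:

val logs = sqlContext.parquetFile("s3n://mylogs/")
// Collapse to 12 partitions without a shuffle; pick whatever count suits the job
val compacted = logs.rdd.coalesce(12, shuffle = false)
println(compacted.partitions.length)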
