PySpark: partitionBy, repartition, or nothing?

So what I did was:

    rdd.flatMap(lambda x: enumerate(x))

This generates keys 0-49 for my data. Then I decided to do:

    rdd.flatMap(lambda x: enumerate(x)).partitionBy(50)

I noticed something odd going up to the next file size: a 10 GB file took 46 seconds to complete my calculations, but a 50 GB file took 10 minutes 31 seconds. I checked the file, and for some reason it was in only 4 blocks.

So I changed the load to:

sc.textFile("file", 100)

      

I removed the partitionBy, and the 50 GB job dropped to about 1 minute. I was wondering: does it make sense to repartition the data after loading it? Maybe by key?
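For reference, here is a minimal sketch of the steps described above, assuming a local SparkContext, a placeholder path, and rows that enumerate() splits into 50 keyed elements:

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-question")

    # Ask for more input partitions up front ("/path/to/file" and 100 are
    # placeholders; by default the file only came in as 4 blocks).
    rdd = sc.textFile("/path/to/file", 100)

    # enumerate(x) turns each row x into (index, element) pairs, which gives
    # keys 0-49 when every row has 50 elements.
    keyed = rdd.flatMap(lambda x: enumerate(x))

    # Hash-partition the keyed data into 50 partitions...
    partitioned = keyed.partitionBy(50)

    # ...or skip the keys entirely and just repartition after loading.
    repartitioned = rdd.repartition(100)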



1 answer


If I understand your question correctly, you are asking when you need an additional repartition. First, remember that repartition is an expensive operation; use it wisely. Second, there is no hard answer, it comes with experience. But here are some common cases:



  • You can try calling repartition on your data before a join, leftOuterJoin, cogroup and the like. Sometimes it can speed up the computation (see the sketch after this list).

  • Your flatMap turns your data into much "heavier" data and you hit java.lang.OutOfMemoryError: Java heap space. In that case you should definitely make your partitions smaller, so that the data fits after the flatMap.

  • You load data into a database / MongoDB / Elasticsearch and so on. You call repartition on your data and then, inside a foreachPartition block, do a bulk insert of the whole partition into the database, so the size of these chunks must be reasonable.
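Here is a minimal sketch of these cases, assuming a local SparkContext; the toy RDDs, the partition counts, and the save_batch helper are placeholders for illustration, not anything from the question:

    from pyspark import SparkContext

    sc = SparkContext(appName="repartition-cases")

    # Toy pair RDDs standing in for real data.
    left = sc.parallelize([(i, "L%d" % i) for i in range(1000)])
    right = sc.parallelize([(i, "R%d" % i) for i in range(1000)])

    # Case 1: partition both sides the same way before the join;
    # sometimes this speeds up the subsequent join or cogroup.
    joined = left.partitionBy(50).join(right.partitionBy(50))

    # Case 2: a flatMap that fans each record out into many heavier records
    # can hit java.lang.OutOfMemoryError: Java heap space; raising the
    # partition count first keeps each partition small enough after the flatMap.
    heavy = left.repartition(200).flatMap(lambda kv: [kv] * 10)

    # Case 3: repartition before foreachPartition so every bulk insert handles
    # a reasonably sized chunk. save_batch is a hypothetical stand-in for a
    # real bulk-insert call (MongoDB, Elasticsearch, a JDBC batch, ...).
    def save_batch(rows):
        pass  # e.g. open one connection and write all rows in a single request

    joined.repartition(20).foreachPartition(lambda rows: save_batch(list(rows)))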
