PySpark: partitionBy, repartition, or nothing?

So what I did was:

    rdd.flatMap(lambda x: enumerate(x))

This generates keys 0-49 for my data. Then I decided to do:

    rdd.flatMap(lambda x: enumerate(x)).partitionBy(50)

I noticed something odd going up to the next file size: a 10 GB file took 46 seconds to complete my calculations, but a 50 GB file took 10 minutes 31 seconds. I checked the file, and for some reason it was in only 4 blocks.

So I changed the load to:

sc.textFile("file", 100)

      

I removed the partitionBy, and the 50 GB job dropped to about 1 minute. I was wondering: does it make sense to repartition the data after loading it? Maybe by key?
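For reference, here is a minimal sketch of the steps described above, assuming a local SparkContext, a placeholder path, and rows that enumerate() splits into 50 keyed elements:

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-question")

    # Ask for more input partitions up front ("/path/to/file" and 100 are
    # placeholders; by default the file only came in as 4 blocks).
    rdd = sc.textFile("/path/to/file", 100)

    # enumerate(x) turns each row x into (index, element) pairs, which gives
    # keys 0-49 when every row has 50 elements.
    keyed = rdd.flatMap(lambda x: enumerate(x))

    # Hash-partition the keyed data into 50 partitions...
    partitioned = keyed.partitionBy(50)

    # ...or skip the keys entirely and just repartition after loading.
    repartitioned = rdd.repartition(100)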



1 answer


If I understand your question correctly, you are asking when you need an additional repartition. First, remember that repartition is an expensive operation; use it wisely. Second, there is no hard answer, it comes with experience. But here are some common cases:



  • You can try calling repartition on your data before a join, leftOuterJoin, cogroup and the like. Sometimes it can speed up the computation (see the sketch after this list).

  • Your flatMap turns your data into much "heavier" data and you hit java.lang.OutOfMemoryError: Java heap space. In that case you should definitely make your partitions smaller, so that the data fits after the flatMap.

  • You load data into a database / MongoDB / Elasticsearch and so on. You call repartition on your data and then, inside a foreachPartition block, do a bulk insert of the whole partition into the database, so the size of these chunks must be reasonable.
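Here is a minimal sketch of these cases, assuming a local SparkContext; the toy RDDs, the partition counts, and the save_batch helper are placeholders for illustration, not anything from the question:

    from pyspark import SparkContext

    sc = SparkContext(appName="repartition-cases")

    # Toy pair RDDs standing in for real data.
    left = sc.parallelize([(i, "L%d" % i) for i in range(1000)])
    right = sc.parallelize([(i, "R%d" % i) for i in range(1000)])

    # Case 1: partition both sides the same way before the join;
    # sometimes this speeds up the subsequent join or cogroup.
    joined = left.partitionBy(50).join(right.partitionBy(50))

    # Case 2: a flatMap that fans each record out into many heavier records
    # can hit java.lang.OutOfMemoryError: Java heap space; raising the
    # partition count first keeps each partition small enough after the flatMap.
    heavy = left.repartition(200).flatMap(lambda kv: [kv] * 10)

    # Case 3: repartition before foreachPartition so every bulk insert handles
    # a reasonably sized chunk. save_batch is a hypothetical stand-in for a
    # real bulk-insert call (MongoDB, Elasticsearch, a JDBC batch, ...).
    def save_batch(rows):
        pass  # e.g. open one connection and write all rows in a single request

    joined.repartition(20).foreachPartition(lambda rows: save_batch(list(rows)))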
