PySpark partitionBy, repartition, or nothing?
So what I did was:
rdd.flatMap(lambda x: enumerate(x))
which generates keys 0-49 for my data. Then I decided to do:
rdd.flatMap(lambda x: enumerate(x)).partitionBy(50)
I noticed something odd: a 10GB file took 46 seconds to complete my calculations, but a 50GB file took 10 minutes 31 seconds. I checked, and for some reason the 50GB file had been split into only 4 partitions.
So I changed the load to:
sc.textFile("file", 100)
I removed the partitionBy, and the 50GB file finished in about 1 minute. My question: does it make sense to repartition the data after loading? Maybe by key?
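To be concrete, by "repartitioning after loading" I mean something like this (just a sketch; the partition counts are placeholders, and sc is the SparkContext from above):

rdd = sc.textFile("file", 100)
# Option 1: plain repartition to a fixed number of partitions.
rdd = rdd.repartition(100)
# Option 2: key the records first and partition by key instead.
keyed = rdd.flatMap(lambda x: enumerate(x)).partitionBy(50)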
If I understand your question correctly, you are asking when you need an additional repartition. First, you should remember that repartition is an expensive operation; use it wisely. Second, there is no hard and fast answer, and it comes with experience. But here are some common cases:
- You can try calling repartition on your data before join, leftOuterJoin, cogroup... Sometimes it can speed up the computation (see the first sketch after this list).
- Your flatMap turns your data into much "heavier" data and you hit java.lang.OutOfMemoryError: Java heap space. Then you should definitely make your partitions smaller, so the data still fits after the flatMap (see the second sketch below).
- You load data into a database / MongoDB / Elasticsearch... You call repartition on your data and then, inside a foreachPartition block, you do a bulk insert of the whole partition into the database. Therefore, the size of these chunks must be reasonable (see the last sketch below).
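For the first case, here is a minimal sketch of repartitioning before a join; the two RDDs and the partition count 8 are invented for illustration:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Two hypothetical (key, value) RDDs.
users = sc.parallelize([(1, "alice"), (2, "bob")])
orders = sc.parallelize([(1, "book"), (1, "lamp"), (2, "pen")])

# Partition both sides with the same partitioner, so the join
# does not have to re-shuffle both inputs.
users = users.partitionBy(8)
orders = orders.partitionBy(8)

joined = users.join(orders)  # (key, (user, order)) pairs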
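For the second case, a sketch of shrinking partitions before a heavy flatMap; the counts and the split function are made up, the point is simply more, smaller partitions:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

lines = sc.textFile("file", 100)

# Each line expands into many records after flatMap, so raise the
# partition count first to keep each post-flatMap partition small.
expanded = lines.repartition(400).flatMap(lambda line: line.split())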
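And for the last case, a sketch of the foreachPartition bulk-insert pattern; save_bulk is a hypothetical stand-in for your database client's bulk API (e.g. pymongo's insert_many or the Elasticsearch bulk helper):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def save_bulk(records):
    # Hypothetical: replace with a real bulk-insert call.
    print("inserting %d records" % len(records))

rdd = sc.parallelize(range(1000))

# Repartition so each chunk handed to the database is a reasonable size.
rdd = rdd.repartition(20)

def insert_partition(partition):
    # Materialize the partition and write it in one bulk call,
    # so we pay connection/insert overhead once per partition.
    batch = list(partition)
    if batch:
        save_bulk(batch)

rdd.foreachPartition(insert_partition)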