Spark DataFrame - groupBy - memory spill
I am working on a cluster of 4 EC2 r3.2xlarge instances, running Spark 1.3.
```scala
val test = clt.rdd.groupBy { r: Row =>
  r.get(0)  // group by the first column
}
```
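The cost of this pattern can be sketched with plain Scala collections (no Spark involved, purely illustrative): a `groupBy` materializes *every* element of a key into one collection, whereas a reduce-style combine keeps only one running value per key, which is why the grouped form spills so heavily at scale. All names below are made up for the illustration.

```scala
// Plain-Scala sketch (no Spark, purely illustrative) of what groupBy implies.
object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val rows = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4))

    // Like rdd.groupBy: ALL elements of a key are buffered together,
    // so large keys mean large in-memory collections (and spill at scale).
    val grouped: Map[String, Seq[(String, Int)]] = rows.groupBy(_._1)
    println(grouped("a")) // List((a,1), (a,3))

    // A reduce-style combine keeps one running value per key instead:
    val reduced = rows.foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
      case (acc, (k, v)) => acc.updated(k, acc(k) + v)
    }
    println(reduced("a")) // 4
  }
}
```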
clt is a DataFrame loaded from an 8.5 GB CSV file with 200 million lines.
In the Spark UI, I can see that my groupBy stage has over 220 partitions, and that "Shuffle spill (memory)" exceeds 4 TB. JVM parameters: -Xms80g -Xmx80g.
My questions:
- Why is the shuffle spill so large?
- How can I optimize this?
I've already tried clt.rdd.repartition(1200) and I get the same result, only this time in the repartition task (the spill is still very large and the job is really slow).
EDIT
I found something "strange":
I have a DataFrame named test that contains 5 columns.
This code runs in 5-10 minutes:
```scala
val test1 = test.rdd.map { row =>
  val a = row.get(0)
  val b = row.get(1)
  val c = row.get(2)
  val d = row.get(3)
  val e = row.get(4)  // with 5 columns the last valid index is 4, not 5
  (a, Array(a, b, c, d, e))
}.groupByKey
```
This code takes 3-5 hours (and generates a huge shuffle spill):
```scala
val test1 = test.rdd.map { row =>
  val a = row.get(0)
  (a, row)
}.groupByKey
```
Any idea why?
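One guess (my assumption, not something confirmed above): in the slow version every shuffle record is a full generic Row, which carries more per-record serialization overhead than the flat array of extracted values in the fast version. A crude plain-Scala stand-in comparing Java-serialized sizes, using a HashMap of named fields as a rough proxy for a generic row (the proxy is an assumption, not Spark's actual Row layout):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object RecordSizeSketch {
  // Serialize any object with plain Java serialization and return its byte size.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray.length
  }

  def main(args: Array[String]): Unit = {
    // Crude stand-in for a generic row: field names carried along with values.
    val rowLike = new java.util.HashMap[String, Any]()
    rowLike.put("a", 1L); rowLike.put("b", 2L); rowLike.put("c", 3L)
    rowLike.put("d", 4L); rowLike.put("e", 5L)

    // Stand-in for the fast version: just the extracted values.
    val arrayLike: Array[Any] = Array(1L, 2L, 3L, 4L, 5L)

    println(s"row-like:   ${serializedSize(rowLike)} bytes")
    println(s"array-like: ${serializedSize(arrayLike)} bytes")
  }
}
```

On a typical JVM the row-like record serializes noticeably larger than the bare array, and at 200 million shuffle records that per-record overhead adds up.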