Spark DataFrame - groupBy - memory overflow

I am working on a cluster of 4 EC2 r3.2xlarge instances, using Spark 1.3.

// group every Row by the value of its first column
val test = clt.rdd.groupBy { r: Row => r.get(0) }

clt is a DataFrame built from an 8.5 GB CSV file containing 200 million lines.

In the Spark UI, I can see that the groupBy runs with more than 220 partitions, and that "Shuffle spill (memory)" exceeds 4 TB. JVM options: -Xms80g -Xmx80g

My questions:

  • Why is the spill (memory) so large?

  • How can I optimize this?

I've already tried clt.rdd.repartition(1200) and I get the same result, this time in the repartition task (the spill is still huge and the job is really slow).
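On the second question: if all I needed were an aggregate per key, I know I could combine map-side instead of grouping, along these lines (a sketch assuming a plain count per key, which is not my real use case; aggregateByKey shuffles only partial results instead of every Row):

import org.apache.spark.sql.Row

// Sketch: count rows per key instead of materializing every Row per key.
// aggregateByKey combines values on the map side, so far less data is
// shuffled (and spilled) than with groupBy.
val counts = clt.rdd
  .map { r: Row => (r.get(0), 1L) }   // key by the first column
  .aggregateByKey(0L)(_ + _, _ + _)   // add within a partition, then merge partitions

But I actually need the grouped rows themselves, so this alone does not solve it.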


EDIT

I found something "strange":

I have a DataFrame named test that contains 5 columns.

This code runs in 5-10 minutes:

// key by the first column, shipping only the plain values (not the Row)
val test1 = test.rdd.map { row =>
  val a = row.get(0)
  val b = row.get(1)
  val c = row.get(2)
  val d = row.get(3)
  val e = row.get(4) // the 5 columns are at indices 0-4; get(5) would be out of range
  (a, Array(a, b, c, d, e))
}.groupByKey

This code takes 3-5 hours (and generates a lot of shuffle spill (memory)):

// same key, but this time the entire Row object is shuffled
val test1 = test.rdd.map { row =>
  val a = row.get(0)
  (a, row)
}.groupByKey

Any idea why?
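One unverified guess: the slow version shuffles full Row objects, and a Row coming out of test.rdd may be much heavier to serialize than the bare values it wraps. If that is the cause, something like the following should behave like the fast version while keeping all the columns (untested sketch; row.toSeq simply extracts the column values):

import org.apache.spark.sql.Row

// Sketch, based on the unverified assumption that the Row wrapper
// (not the data itself) is what makes the shuffle expensive:
// ship plain value sequences instead of Row objects.
val test1 = test.rdd
  .map { row: Row => (row.get(0), row.toSeq) }
  .groupByKey()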

Tags: shuffle, apache-spark, apache-spark-sql



