Problems with collect() or take(n) on MB-sized data in Spark

We are a group of people working with Spark (Spark 2.1.0 standalone with 2 workers; programming is done in Scala and everything runs inside multiple Docker containers). We are faced with the problem that "collect" or "take(n)" becomes very slow once the collected data exceeds some size limit.

We ran into the problem on several occasions, but we boiled it down to a simple example: it reads a file (either from the local filesystem or from HDFS; we tested both) and then collects the result. It works fine up to a certain file size (about 2 MB) and then becomes very slow (at about 3 MB it breaks completely). If it doesn't collect (for example, it just does a saveAsTextFile), the setup can handle files up to 200 GB in size. We tested increasing the driver memory tenfold (from 2 GB of RAM to 20 GB of RAM), but this did not solve the problem; in fact, our tests show that our little experiment slows down at the same file size no matter how much RAM we give the driver or the workers.

I've summarized my experiment here: https://pastebin.com/raw/6yXztq0H

In this experiment, the program reads the file "s" and calls "take(n)" with "n" increasing incrementally. As the timestamped output shows, it works almost instantaneously for "n ≤ 104145" (the timing varies only slightly despite big changes in the setup) and then becomes comparatively slow. For large "n" (see the second run), the driver crashes with a "TaskResultLost" error. The last experiment (third run) shows that this does not seem to be a memory problem (which seems logical, since the file is relatively small, about 2 MB).
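For readers who can't open the pastebin, the experiment can be sketched roughly like this. This is a hypothetical reconstruction, not the exact code from the link: the file path and the list of n values are illustrative, and running it requires an actual Spark cluster.

```scala
// Sketch of the take(n) timing experiment (assumed setup, not the original code).
import org.apache.spark.sql.SparkSession

object TakeExperiment {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("take-experiment")
      .getOrCreate()

    // "s" is the input file; the path here is a placeholder.
    val s = spark.sparkContext.textFile("hdfs:///data/sample.txt")

    // Increase n step by step and time each take(n).
    for (n <- Seq(50000, 100000, 104145, 110000, 200000)) {
      val t0 = System.nanoTime()
      val rows = s.take(n) // slows down sharply past a certain result size
      val elapsedMs = (System.nanoTime() - t0) / 1e6
      println(f"n=$n%d took $elapsedMs%.1f ms (got ${rows.length}%d rows)")
    }

    spark.stop()
  }
}
```

The point of the loop is that only the size of the data pulled back to the driver changes; the input file and the cluster configuration stay fixed, which is what isolates the result-transfer path as the culprit.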

(This is not shown in the experiment, but we also played with SPARK_DAEMON_MEM, but that doesn't change anything.)

Has anyone had the same problem? Does anyone have an idea to help us look further?

2 answers


Okay, we managed to figure out what's going on. Here's a description of the problem for future reference:



  • When the size of the collected data is large enough, the driver communicates directly with the executors instead of going through the master; that's why our problem only appeared after a certain size.

  • Our setup had a networking issue between some executors and the driver, which caused some of those direct connections to fail.


If you've already tried increasing spark.driver.memory, try increasing spark.driver.maxResultSize.
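For reference, both settings can be put in spark-defaults.conf (or passed as --conf flags to spark-submit). The values below are illustrative placeholders to tune for your workload; spark.driver.maxResultSize defaults to 1g, and collects larger than that are aborted:

```
# conf/spark-defaults.conf — illustrative values, tune for your workload
spark.driver.memory        4g
spark.driver.maxResultSize 4g   # default is 1g; 0 means unlimited (risky)
```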




