Posted to user@spark.apache.org by Christopher Brady <ch...@oracle.com> on 2016/02/24 22:31:25 UTC

coalesce executor memory explosion

Short: Why does coalesce use huge amounts of memory? How does it work 
internally?

Long version:
I asked a similar question a few weeks ago, but I have a simpler test 
with better numbers now. I have an RDD created from some HDFS files. I 
want to sample it and then coalesce it into fewer partitions. For some 
reason coalesce uses huge amounts of memory. From what I've read, 
coalesce does not require full partitions to be in memory at once, so I 
don't understand what's causing this. Can anyone explain to me why 
coalesce needs so much memory? Are there any rules for determining the 
best number of partitions to coalesce into?
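
For reference, here is a tiny self-contained illustration of the two 
coalesce modes (the default shuffle = false vs. shuffle = true); the 
numbers here are arbitrary and not from my real job:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("coalesce-modes").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000, numSlices = 100)

    // default shuffle = false: merges parent partitions, no shuffle
    val narrow   = rdd.coalesce(10)
    // shuffle = true: same as rdd.repartition(10)
    val shuffled = rdd.coalesce(10, shuffle = true)

    println(narrow.partitions.length)   // 10
    println(shuffled.partitions.length) // 10

My job uses the default (no shuffle). As I understand it, each output 
task then just iterates over several parent partitions in turn, which is 
why I expected no extra memory cost.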

Spark version:
1.5.0

Test data:
241 GB of compressed Parquet files

Executors:
27 executors
16 GB memory each
3 cores each

In my tests I'm reading the data from HDFS, sampling it, coalescing into 
fewer partitions, and then doing a count just to have an action.
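
In code, the test looks roughly like this (a minimal sketch against the 
Spark 1.5 Scala API; the HDFS path is a placeholder, and I'm showing the 
DataFrame Parquet reader for brevity instead of my raw hadoopFile call):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(
      new SparkConf().setAppName("coalesce-memory-test"))
    val sqlContext = new SQLContext(sc)

    // Read the compressed Parquet files (14,844 partitions in my test)
    val rdd = sqlContext.read.parquet("hdfs:///placeholder/path").rdd

    // Sample a tiny fraction, then shrink the partition count
    val sampled   = rdd.sample(withReplacement = false, fraction = 0.00075)
    val coalesced = sampled.coalesce(668)  // shuffle defaults to false

    // count() is just an action to force evaluation
    println(coalesced.count())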

Without coalesce there is no memory issue. The size of the data makes no 
difference:
hadoopFile (creates 14,844 partitions) -> sample (fraction 0.00075) -> 
count()
Per executor memory usage: 0.4 GB

Adding coalesce increases the memory usage substantially, even though 668 
is still more partitions than I'd like:
hadoopFile (creates 14,844 partitions) -> sample (fraction 0.00075) -> 
coalesce (to 668 partitions) -> count()
Per executor memory usage: 3.1 GB

Going down to 201 partitions uses most of the available memory just for 
the coalesce:
hadoopFile (creates 14,844 partitions) -> sample (fraction 0.00075) -> 
coalesce (to 201 partitions) -> count()
Per executor memory usage: 9.8 GB

Any number of partitions smaller than this crashes all the executors with 
out-of-memory errors. I don't really understand what is happening inside 
Spark: with that sample fraction, the coalesced partitions should still 
be far smaller than the original partitions.
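
Back-of-the-envelope, assuming the sample is spread evenly across 
partitions, the coalesced partitions should be tiny:

    val sampledGB      = 241.0 * 0.00075        // ~0.18 GB left after sampling
    val perPartMB668   = sampledGB * 1024 / 668 // ~0.28 MB per partition at 668
    val perPartMB201   = sampledGB * 1024 / 201 // ~0.92 MB per partition at 201
    val parentsPerTask = 14844.0 / 201          // ~74 parent partitions per task

So even at 201 partitions, each coalesced partition should hold well 
under a megabyte of compressed data, which is why the 9.8 GB per executor 
surprises me.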

I've gone through the Spark documentation, YouTube videos, and the 
Learning Spark book, but I haven't seen anything about this. Thanks.
