Posted to user@spark.apache.org by Tw UxTLi51Nus <Tw...@posteo.co> on 2017/06/23 07:07:00 UTC

OutOfMemoryError

Hi,

I have a dataset with ~5M rows x 20 columns, containing a groupID and a 
rowID. My goal is to check whether (some) columns contain more than a 
fixed fraction (say, 50%) of missing (null) values within a group. If 
so, the entire column is set to missing (null) for that group.
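
To make this concrete, here is a minimal Scala sketch of the logic (the 
names "nullOutSparseColumns" and "columnsToCheck" and the hard-coded 
"groupID" column are placeholders; the actual code is in the gist [0]):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Sketch only: for each column to check, null it out in every group
    // where the fraction of null values exceeds the threshold.
    def nullOutSparseColumns(df: DataFrame,
                             columnsToCheck: Seq[String],
                             threshold: Double = 0.5): DataFrame = {
      columnsToCheck.foldLeft(df) { (current, c) =>
        // per-group fraction of null values in column c
        val flagged = current.groupBy("groupID")
          .agg(avg(when(col(c).isNull, 1.0).otherwise(0.0)).alias("nullFrac"))
          .filter(col("nullFrac") > threshold)
          .select(col("groupID"), lit(true).alias("_exceeds"))

        // set column c to null for every row whose group exceeds the threshold
        current.join(flagged, Seq("groupID"), "left_outer")
          .withColumn(c, when(col("_exceeds"), lit(null)).otherwise(col(c)))
          .drop("_exceeds")
      }
    }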

The Problem:
The loop runs like a charm during the first iterations, but around the 
6th or 7th iteration, CPU utilization drops (only 1 of 6 cores is used) 
and the execution time per iteration increases significantly. At some 
point, I get an OutOfMemoryError:

* spark.driver.memory < 4G: at collect() (FAIL 1)
* 4G <= spark.driver.memory < 10G: at the count() step (FAIL 2)

Enabling a HeapDump on OOM (and analyzing it with Eclipse MAT) showed 
two classes taking up lots of memory:

* java.lang.Thread
       - char (2G)
       - scala.collection.IndexedSeqLike
           - scala.collection.mutable.WrappedArray (1G)
       - java.lang.String (1G)

* org.apache.spark.sql.execution.ui.SQLListener
       - org.apache.spark.sql.execution.ui.SQLExecutionUIData
         (several instances, up to 1G each)
           - java.lang.String
       - ...

Turning off the SparkUI and/or setting spark.ui.retainedXXX to something 
low (e.g. 1) did not solve the issue.
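
For reference, this is roughly how I passed those settings (the exact 
retained* keys I used may have differed; shown only to illustrate):

    spark-submit \
      --conf spark.ui.enabled=false \
      --conf spark.ui.retainedJobs=1 \
      --conf spark.ui.retainedStages=1 \
      --conf spark.ui.retainedTasks=1 \
      --conf spark.sql.ui.retainedExecutions=1 \
      ...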

Any idea what I am doing wrong? Or is this a bug?

My code can be found as a GitHub Gist [0]. More details are in the 
StackOverflow question [1] I posted, which has not received any answers 
so far.

Thanks!

[0] 
https://gist.github.com/TwUxTLi51Nus/4accdb291494be9201abfad72541ce74
[1] 
http://stackoverflow.com/questions/43637913/apache-spark-outofmemoryerror-heapspace

PS: As a workaround, I have been using "checkpoint" after every few 
iterations (sketched below).
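
Concretely, something along these lines (assuming a SparkSession named 
"spark" and the helper sketched above; the checkpoint directory and the 
interval of 5 iterations are placeholders, and Dataset.checkpoint() 
needs Spark >= 2.1):

    // placeholder checkpoint directory
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    var current = df
    for ((c, i) <- columnsToCheck.zipWithIndex) {
      current = nullOutSparseColumns(current, Seq(c))
      if ((i + 1) % 5 == 0) {
        // materialize to the checkpoint dir and truncate the lineage,
        // so the query plan does not keep growing across iterations
        current = current.checkpoint()
      }
    }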


--
Tw UxTLi51Nus
Email: TwUxTLi51Nus@posteo.co

