Posted to issues@spark.apache.org by "Liwei Lin (JIRA)" <ji...@apache.org> on 2016/04/15 09:53:25 UTC

[jira] [Commented] (SPARK-14652) pyspark streaming driver unable to cleanup metadata for cached RDDs leading to driver OOM

    [ https://issues.apache.org/jira/browse/SPARK-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242604#comment-15242604 ] 

Liwei Lin commented on SPARK-14652:
-----------------------------------

hi [~weideng], this problem is probably caused by the missing {{DataFrame.unpersist()}} call -- please call {{readingsDataFrame.unpersist()}} at the end of {{foreachRDD}}.


If you do need {{cache()}}, I think a better way is to call {{cache()}} on the {{DStream}} rather than on the {{DataFrame}}, i.e. {{readings.cache()}} followed by {{readings.foreachRDD}}. The reason is that Spark Streaming will call {{unpersist()}} automatically for you after each batch if you use {{DStream.cache()}}, but won't do the same if you use {{DataFrame.cache()}}.
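To make this concrete, here is a minimal pyspark sketch of both options. It assumes a {{DStream}} named {{readings}}, a {{DataFrame}} named {{readingsDataFrame}} (as in the reporter's code), and an existing {{sqlContext}}; it is an illustration, not a drop-in fix:

{code:python}
# Option 1: keep caching the DataFrame, but unpersist it at the end of each batch.
def process(time, rdd):
    if rdd.isEmpty():
        return
    readingsDataFrame = sqlContext.createDataFrame(rdd)  # assumes an existing sqlContext
    readingsDataFrame.cache()
    # ... run queries / writes against readingsDataFrame here ...
    readingsDataFrame.unpersist()  # releases the cached RDD so the driver can clean up its metadata

readings.foreachRDD(process)

# Option 2: cache the DStream instead; Spark Streaming unpersists each batch's RDD
# automatically (spark.streaming.unpersist defaults to true), so no manual call is needed.
readings.cache()
readings.foreachRDD(process)  # process() can then skip DataFrame.cache()/unpersist()
{code}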

> pyspark streaming driver unable to cleanup metadata for cached RDDs leading to driver OOM
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-14652
>                 URL: https://issues.apache.org/jira/browse/SPARK-14652
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Streaming
>    Affects Versions: 1.6.1
>         Environment: pyspark 1.6.1
> python 2.7.6
> Ubuntu 14.04.2 LTS
> Oracle JDK 1.8.0_77
>            Reporter: Wei Deng
>
> ContextCleaner was introduced in SPARK-1103 and according to its PR [here|https://github.com/apache/spark/pull/126]:
> {quote}
> RDD cleanup:
> {{ContextCleaner}} calls {{RDD.unpersist()}} to clean up persisted RDDs. Regarding metadata, the {{DAGScheduler}} automatically cleans up all metadata related to an RDD after all jobs have completed. Only {{SparkContext.persistentRDDs}} keeps strong references to persisted RDDs. The {{TimeStampedHashMap}} used for that has been replaced by a {{TimeStampedWeakValueHashMap}}, which keeps only weak references to the RDDs, allowing them to be garbage collected.
> {quote}
> However, we have observed that this is not the case for a cached RDD in pyspark streaming code with the latest Spark release (1.6.1). This is reflected in the ever-growing number of RDDs shown in the {{Storage}} tab of the Spark Streaming application's UI once a pyspark streaming job starts running. We used the [writemetrics.py|https://github.com/weideng1/energyiot/blob/f74d3a8b5b01639e6ff53ac461b87bb8a7b1976f/analytics/writemetrics.py] code to reproduce the problem: every time, after running for 20+ hours, the driver's JVM starts to show signs of OOM, with the old generation filling up and the JVM stuck in full GC cycles without freeing any old-gen space, until the driver eventually crashes with an OOM.
> We have collected a heap dump right before the OOM happened and can make it available for analysis if that would be useful. However, it might be easier to simply monitor the growth of the number of RDDs in the {{Storage}} tab of the Spark application's UI to confirm this is happening. To illustrate the problem, we also set {{--conf spark.cleaner.periodicGC.interval=10s}} on the spark-submit command line of the pyspark code and enabled DEBUG-level logging in the driver's logback.xml, and confirmed that even when the cleaner is triggered as frequently as every 10 seconds, none of the cached RDDs are unpersisted automatically by {{ContextCleaner}}.
> Currently we have resorted to calling {{unpersist()}} manually to work around the problem. However, this goes against the spirit of SPARK-1103, i.e. automated garbage collection in the {{SparkContext}}.
> We also conducted a simple test with Scala code, again setting {{--conf spark.cleaner.periodicGC.interval=10s}}, and found that the Scala code cleaned up the RDDs every 10 seconds as expected, so this appears to be a pyspark-specific issue. We suspect it has something to do with python not exposing those out-of-scope RDDs as weak references to the {{ContextCleaner}}. (A sketch of the periodic-GC setting used in these tests follows below.)
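For reference, a minimal sketch of the periodic-GC setting used in the tests above, set programmatically instead of via {{--conf}} on spark-submit; the app name is hypothetical and for illustration only:

{code:python}
from pyspark import SparkConf, SparkContext

# spark.cleaner.periodicGC.interval forces a driver-side System.gc() on this interval,
# which is what allows ContextCleaner's weak references to be enqueued and cleaned.
conf = (SparkConf()
        .setAppName("writemetrics")  # hypothetical app name
        .set("spark.cleaner.periodicGC.interval", "10s"))
sc = SparkContext(conf=conf)
{code}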


