Posted to user@spark.apache.org by Tarun Rajput <Ta...@microsoft.com.INVALID> on 2020/09/15 21:05:01 UTC

Query/Bug: Spark Streaming / ContextCleaner / GC question

Hi spark-community,

Can someone please advise on the question below, concerning a Spark Streaming query / ContextCleaner / garbage collection issue we are facing? We suspect a bug is causing a memory leak.

We have a Spark 2.3 cluster running a streaming query. We observe that no matter how much memory we allocate to the executors, the JVM heap eventually grows to its limit and the JVM's GC starts to cause frequent timeouts, until the executor is eventually marked "lost" or "dead". GC logging is enabled; it takes about 30-45 minutes to fill the heap, after which full GCs become much more frequent. We have tried increasing executor memory, the periodic GC interval, and other relevant memory parameters, but the behavior stays the same.
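For reference, the kind of tuning we attempted looks roughly like the following. This is a sketch only; the values shown are illustrative, not our exact production settings:

# Illustrative submit-time settings; actual values varied across runs.
spark-submit \
  --conf spark.executor.memory=16g \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.cleaner.periodicGC.interval=10min \
  --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  <rest of the job arguments>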

We enabled ContextCleaner debug logs and observe only broadcast/accumulator-related cleaning messages. We do not see any RDDs being received for cleanup with the message "Cleaning RDD .." (ref: ContextCleaner.scala#L213<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L213>). An excerpt of the ContextCleaner logs is included below for reference, followed by the line we used to enable them.


2020-09-14 20:00:12,270: DEBUG [org.apache.spark.ContextCleaner] Cleaned broadcast 538
2020-09-14 20:00:12,271: DEBUG [org.apache.spark.ContextCleaner] Got cleaning task CleanBroadcast(540)
2020-09-14 20:00:12,271: DEBUG [org.apache.spark.ContextCleaner] Cleaning broadcast 540
2020-09-14 20:00:21,915: DEBUG [org.apache.spark.ContextCleaner] Cleaned broadcast 540
2020-09-14 20:00:21,915: DEBUG [org.apache.spark.ContextCleaner] Got cleaning task CleanBroadcast(536)
2020-09-14 20:00:21,915: DEBUG [org.apache.spark.ContextCleaner] Cleaning broadcast 536
2020-09-14 20:00:21,922: DEBUG [org.apache.spark.ContextCleaner] Cleaned broadcast 536
2020-09-14 20:00:21,922: DEBUG [org.apache.spark.ContextCleaner] Got cleaning task CleanBroadcast(537)
2020-09-14 20:00:21,922: DEBUG [org.apache.spark.ContextCleaner] Cleaning broadcast 537
2020-09-14 20:00:21,926: DEBUG [org.apache.spark.ContextCleaner] Cleaned broadcast 537
2020-09-14 20:00:21,926: DEBUG [org.apache.spark.ContextCleaner] Got cleaning task CleanAccum(14783)
2020-09-14 20:00:21,926: DEBUG [org.apache.spark.ContextCleaner] Cleaning accumulator 14783
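
For completeness, the DEBUG output above was enabled by adding the following single line to Spark's conf/log4j.properties (Spark 2.3 ships with log4j 1.x, so the properties syntax applies):

# Enable ContextCleaner debug logging
log4j.logger.org.apache.spark.ContextCleaner=DEBUG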

We also see plenty of executor storage memory still available, as shown in the screenshot below.

[Screenshot: executor storage memory in the Spark UI, showing ample free space]
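
If it helps, the same figures can be cross-checked against the monitoring REST API of the running application; the driver host and application id below are placeholders:

curl http://<driver-host>:4040/api/v1/applications/<app-id>/executors

which reports memoryUsed and maxMemory for each executor.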

Any input or suggestions would be much appreciated!

Thanks
Tarun