Posted to issues@spark.apache.org by "Fu Chen (Jira)" <ji...@apache.org> on 2021/04/29 10:44:00 UTC

[jira] [Comment Edited] (SPARK-35262) Memory leak when dataset is being persisted

    [ https://issues.apache.org/jira/browse/SPARK-35262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335343#comment-17335343 ] 

Fu Chen edited comment on SPARK-35262 at 4/29/21, 10:43 AM:
------------------------------------------------------------

[~iamelin] This should be a duplicate of SPARK-34087, which has been fixed by PR-31919. Spark 3.1.1 has a memory leak when the SparkSession is cloned.

When you disable `spark.sql.sources.bucketing.autoBucketedScan.enabled` and `spark.sql.adaptive.enabled`, the CacheManager caches the query using the original SparkSession (i.e. Spark does not clone the session).
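A possible workaround sketch until the fix is picked up (assuming you can accept losing adaptive query execution and auto bucketed scan) is to disable both settings, e.g. in `spark-defaults.conf`:

```properties
# Workaround sketch: with both settings off, CacheManager does not clone
# the SparkSession on persist, so no extra listeners accumulate.
spark.sql.sources.bucketing.autoBucketedScan.enabled  false
spark.sql.adaptive.enabled                            false
```

The same pair can also be set per application via `SparkSession.builder().config(...)` before any dataset is persisted.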


was (Author: fchen):
[~iamelin] This should be a duplicate of SPARK-34087, which has been fixed by PR-31919. Spark 3.1.1 has a memory leak when the SparkSession is cloned.

When you disable `spark.sql.sources.bucketing.autoBucketedScan.enabled` and `spark.sql.adaptive.enabled`, the CacheManager caches the query using the original SparkSession (i.e. Spark does not clone the session).

> Memory leak when dataset is being persisted
> -------------------------------------------
>
>                 Key: SPARK-35262
>                 URL: https://issues.apache.org/jira/browse/SPARK-35262
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Igor Amelin
>            Priority: Critical
>
> If a Java or Scala application with a SparkSession runs for a long time and persists a lot of datasets, it can crash because of a memory leak.
>  I've noticed the following. When we persist a dataset, the SparkSession used to load it is cloned in CacheManager, and the clone is added as a listener to `listenersPlusTimers` in `ListenerBus`. But the clone isn't removed from the list of listeners afterwards, e.g. when the dataset is unpersisted. If we persist a lot of datasets, the SparkSession is cloned and added to `ListenerBus` many times. This leads to a memory leak, since the `listenersPlusTimers` list becomes very large.
> I've found out that the SparkSession is cloned in CacheManager when the parameters `spark.sql.sources.bucketing.autoBucketedScan.enabled` and `spark.sql.adaptive.enabled` are true. The first one is true by default, and this default behavior leads to the problem. When auto bucketed scan is disabled, the SparkSession isn't cloned, there are no duplicates in ListenerBus, and the memory leak doesn't occur.
> Here is a small Java application to reproduce the memory leak: [https://github.com/iamelin/spark-memory-leak]
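The accumulation pattern described in the issue can be modeled with a small self-contained sketch. This is NOT Spark's actual code; `ListenerBus`, `Session`, and the method names below are simplified stand-ins that only illustrate why the listener list grows: each persist registers a cloned session as a listener, and unpersist never deregisters it.

```scala
import scala.collection.mutable.ListBuffer

// Simplified stand-in for Spark's ListenerBus: listeners are only ever added.
class ListenerBus {
  val listenersPlusTimers = ListBuffer.empty[AnyRef]
  def addListener(l: AnyRef): Unit = listenersPlusTimers += l
}

// Simplified stand-in for SparkSession + CacheManager behavior.
class Session(val bus: ListenerBus) {
  def cloneSession(): Session = {
    val clone = new Session(bus)
    bus.addListener(clone) // clone is registered but never deregistered
    clone
  }
  // Models CacheManager cloning the session on persist.
  def persist(): Unit = { cloneSession(); () }
  // Unpersisting removes no listeners, so the list keeps growing.
  def unpersist(): Unit = ()
}

object LeakDemo {
  // Persist/unpersist n datasets and report how many listeners remain.
  def leakedListeners(n: Int): Int = {
    val session = new Session(new ListenerBus)
    (1 to n).foreach { _ => session.persist(); session.unpersist() }
    session.bus.listenersPlusTimers.size
  }

  def main(args: Array[String]): Unit =
    println(leakedListeners(1000)) // grows linearly with the number of persists
}
```

Running the sketch shows one leaked listener per persist call, which is the linear growth the reporter observed in `listenersPlusTimers`.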



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org