Posted to issues@spark.apache.org by "Gabor Somogyi (Jira)" <ji...@apache.org> on 2020/03/10 08:50:00 UTC

[jira] [Updated] (SPARK-30443) "Managed memory leak detected" even with no calls to take() or limit()

     [ https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Somogyi updated SPARK-30443:
----------------------------------
    Affects Version/s: 3.0.0

> "Managed memory leak detected" even with no calls to take() or limit()
> ----------------------------------------------------------------------
>
>                 Key: SPARK-30443
>                 URL: https://issues.apache.org/jira/browse/SPARK-30443
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2, 2.4.4, 3.0.0
>            Reporter: Luke Richter
>            Priority: Major
>         Attachments: a.csv.zip, b.csv.zip, c.csv.zip
>
>
> Our Spark code is causing a "Managed memory leak detected" warning to appear, even though we are not calling take() or limit().
> According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168), managed memory leaks should only be caused by not reading an iterator to completion, e.g. by calling take() or limit().
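> For contrast, the pattern SPARK-14168 describes looks like the hypothetical snippet below (not our code): take() stops consuming a partition's iterator early, which is the documented cause of the warning.
> {code:python}
> # Hypothetical illustration of the SPARK-14168 pattern: take() abandons the
> # partition iterator after 5 rows, leaving task memory unreleased.
> # (Assumes the same `spark` session as the repro program below.)
> first_rows = spark.read.option("header", "true").csv("a.csv").take(5)
> {code}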
> Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed memory leak detected; size = 2097152 bytes, TID = 118"
> The reported leak size is always the same: 2097152 bytes (2 MB).
> I have created a minimal test program that reproduces the warning: 
> {code:python}
> import pyspark.sql
> import pyspark.sql.functions as fx
>
> def main():
>     builder = pyspark.sql.SparkSession.builder
>     builder = builder.appName("spark-jira")
>     spark = builder.getOrCreate()
>
>     # All three inputs are read with the same CSV reader settings.
>     reader = spark.read
>     reader = reader.format("csv")
>     reader = reader.option("inferSchema", "true")
>     reader = reader.option("header", "true")
>     table_c = reader.load("c.csv")
>     table_a = reader.load("a.csv")
>     table_b = reader.load("b.csv")
>
>     # Take the ids of table_a rows with a null some_code, restrict
>     # table_b to those ids, then left-join the result onto table_c.
>     primary_filter = fx.col("some_code").isNull()
>     new_primary_data = table_a.filter(primary_filter)
>     new_ids = new_primary_data.select("some_id")
>     new_data = table_b.join(new_ids, "some_id")
>     new_data = new_data.select("some_id")
>     result = table_c.join(new_data, "some_id", "left")
>
>     # The warning is logged while this write runs; note there is no
>     # take() or limit() anywhere in the job.
>     result.repartition(1).write.json("results.json", mode="overwrite")
>     spark.stop()
>
> if __name__ == "__main__":
>     main()
> {code}
> Our code isn't anything out of the ordinary, just some filters, selects and joins.
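> If it helps with triage, the physical plan can be dumped just before the write; the explain() call below is a possible follow-up step, not something the job above does:
> {code:python}
> # Optional triage aid (not part of the repro): print the physical plan so
> # the operators running in the leaking task can be identified.
> result.explain(True)  # True also includes the parsed/analyzed/optimized plans
> {code}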
> The input data is made up of 3 CSV files, roughly 2.6 GB in total uncompressed. When I reduced the number of rows in the input files, the warning no longer appeared, so I compressed the full files and attached them below.
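> For anyone who wants to try smaller inputs anyway, a minimal downsampling sketch (hypothetical; the fraction and file names are illustrative), with the caveat that reduced inputs no longer reproduce the warning:
> {code:python}
> # Hypothetical downsampling (not our exact tooling): keep ~10% of rows and
> # write a single smaller CSV. With inputs this small the warning goes away.
> small = spark.read.option("header", "true").csv("a.csv").sample(fraction=0.1, seed=42)
> small.coalesce(1).write.csv("a_small.csv", header=True, mode="overwrite")
> {code}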



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org