Posted to issues@spark.apache.org by "Nick Pentreath (JIRA)" <ji...@apache.org> on 2016/12/08 07:29:59 UTC

[jira] [Commented] (SPARK-18274) Memory leak in PySpark StringIndexer

    [ https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731418#comment-15731418 ] 

Nick Pentreath commented on SPARK-18274:
----------------------------------------

Went ahead and re-marked fix version to {{2.1.0}} since RC2 has been cut.

> Memory leak in PySpark StringIndexer
> ------------------------------------
>
>                 Key: SPARK-18274
>                 URL: https://issues.apache.org/jira/browse/SPARK-18274
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
>            Reporter: Jonas Amrich
>            Assignee: Sandeep Singh
>            Priority: Critical
>             Fix For: 2.0.3, 2.1.0, 2.2.0
>
>
> StringIndexerModel won't get collected by the Java GC even when it is deleted in Python. The leak can be reproduced with the following code, which fails after a couple of iterations (around 7 with driver memory set to 600MB):
> {code}
> import random, string
> from pyspark.ml.feature import StringIndexer
> l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
> df = spark.createDataFrame(l, ['string'])
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
> {code}
> An explicit call to the Python GC fixes the issue; the following code runs fine:
> {code}
> import gc
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
>     gc.collect()
> {code}
> The issue is similar to SPARK-6194 and can probably be fixed by calling detach on the JVM object in the model's destructor. This is implemented in pyspark.mllib.common.JavaModelWrapper but missing from pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be affected by this memory leak.
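The destructor-based fix described above can be sketched as follows. This is a minimal, self-contained illustration, not Spark's actual implementation: FakeGateway stands in for Py4J's JavaGateway, and all class and attribute names here are hypothetical.

```python
import gc


class FakeGateway:
    """Stand-in for Py4J's JavaGateway: tracks which JVM-side
    objects are currently referenced from Python."""

    def __init__(self):
        self.attached = set()

    def attach(self, obj_id):
        self.attached.add(obj_id)

    def detach(self, obj_id):
        self.attached.discard(obj_id)


class JavaWrapperSketch:
    """Python wrapper that holds a reference to a JVM object.

    Without the __del__ below, deleting the Python wrapper would
    leave the JVM-side object attached -- the leak in question."""

    def __init__(self, gateway, obj_id):
        self._gateway = gateway
        self._obj_id = obj_id
        gateway.attach(obj_id)

    def __del__(self):
        # Release the JVM-side reference when the Python wrapper
        # is garbage-collected, so the Java GC can reclaim it.
        self._gateway.detach(self._obj_id)


gw = FakeGateway()
w = JavaWrapperSketch(gw, "string-indexer-model-1")
assert "string-indexer-model-1" in gw.attached

del w          # destructor fires and detaches the JVM object
gc.collect()   # make collection deterministic across interpreters
print(len(gw.attached))  # 0
```

The workaround in the ticket (calling gc.collect() in the loop) forces the same detach to happen promptly; adding it to the destructor makes it automatic whenever a model wrapper goes out of scope.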



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
