Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/09/16 01:36:00 UTC

[jira] [Updated] (SPARK-29035) unpersist() ignoring cache/persist()

     [ https://issues.apache.org/jira/browse/SPARK-29035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-29035:
---------------------------------
    Component/s:     (was: Input/Output)
                 SQL

> unpersist() ignoring cache/persist()
> ------------------------------------
>
>                 Key: SPARK-29035
>                 URL: https://issues.apache.org/jira/browse/SPARK-29035
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3
>         Environment: Amazon EMR - Spark 2.4.3
>            Reporter: Jose Silva
>            Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Calling {{unpersist()}} on a {{DataFrame}}, even at a point where it is no longer used, removes every InMemoryTableScan from the DAG.
> Here's a simplified version of the code I'm using:
> {code}
> df = spark.read(...).where(...).cache()
> df_a = union(df.select(...), df.select(...), df.select(...))
> df_b = df.select(...)
> df_c = df.select(...)
> df_d = df.select(...)
> df.unpersist()
> join(df_a, df_b, df_c, df_d).write()
> {code}
> I've created an [album|https://imgur.com/a/c1xGq0r] with the two DAGs, with and without the {{unpersist()}} call.
> I call {{unpersist()}} to prevent an OOM during the join. From what I understand, even though all the DataFrames derive from {{df}}, unpersisting {{df}} after the selects shouldn't cause the {{cache()}} call to be ignored, right?
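
One possible reading of the behaviour described above, offered as a sketch and not as a statement about Spark internals: if {{cache()}} only marks a plan and the cached data is looked up when an action such as {{write()}} runs, then an {{unpersist()}} issued before the action removes the marker before any derived DataFrame is planned. The toy model below is plain Python, not Spark code; {{Registry}}, {{Frame}}, and {{plan()}} are hypothetical names invented for illustration.

```python
# Toy model (NOT Spark code): a lazy dataset whose plan is resolved only
# when plan() is called, the way an action would resolve it.


class Registry:
    """Hypothetical stand-in for a session-wide registry of cached plans."""

    def __init__(self):
        self.cached = set()


class Frame:
    """Hypothetical lazy dataset: it records lineage and computes nothing."""

    def __init__(self, registry, name, parent=None):
        self.registry = registry
        self.name = name
        self.parent = parent

    def cache(self):
        # Assumption: caching is only a marker; nothing is materialized here.
        self.registry.cached.add(self.name)
        return self

    def unpersist(self):
        # Removing the marker affects every plan resolved afterwards.
        self.registry.cached.discard(self.name)
        return self

    def select(self, name):
        # A derived dataset keeps a reference to its parent, not a snapshot.
        return Frame(self.registry, name, parent=self)

    def plan(self):
        """Walk the lineage now and stop at the first cached ancestor."""
        steps = []
        node = self
        while node is not None:
            if node.name in self.registry.cached:
                steps.append(f"InMemoryTableScan({node.name})")
                break
            steps.append(f"Compute({node.name})")
            node = node.parent
        return list(reversed(steps))


registry = Registry()
df = Frame(registry, "df").cache()
df_b = df.select("df_b")

plan_while_cached = df_b.plan()      # resolved while df is still marked
df.unpersist()
plan_after_unpersist = df_b.plan()   # resolved after the marker is gone

print(plan_while_cached)
print(plan_after_unpersist)
```

In this model the second plan no longer contains an InMemoryTableScan, which matches the two DAGs in the album. If Spark behaves the same way, calling {{unpersist()}} only after the {{write()}} has finished would keep the scans, at the cost of holding the cached data in memory during the join.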



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
