Posted to issues@spark.apache.org by "Jose Silva (Jira)" <ji...@apache.org> on 2019/09/10 08:46:00 UTC
[jira] [Created] (SPARK-29035) unpersist() ignoring cache/persist()
Jose Silva created SPARK-29035:
----------------------------------
Summary: unpersist() ignoring cache/persist()
Key: SPARK-29035
URL: https://issues.apache.org/jira/browse/SPARK-29035
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.4.3
Environment: Amazon EMR - Spark 2.4.3
Reporter: Jose Silva
Calling unpersist(), even though the DataFrame is not used anymore, removes all the InMemoryTableScan nodes from the DAG.
Here's a simplified version of the code I'm using:
df = spark.read(...).where(...).cache()
df_a = union(df.select(...), df.select(...), df.select(...))
df_b = df.select(...)
df_c = df.select(...)
df_d = df.select(...)
df.unpersist()
join(df_a, df_b, df_c, df_d).write()
I've created an [album|https://imgur.com/a/c1xGq0r] with the two DAGs, with and without the unpersist() call.
I call unpersist() in order to prevent an OOM during the join. From what I understand, even though all the DataFrames derive from df, unpersisting df after defining the selects shouldn't ignore the cache() call, right?
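For what it's worth, the observed behavior is consistent with Spark's lazy evaluation: select() only builds a plan, and nothing is materialized until the write() action fires, so an unpersist() issued before the action can leave no cached data for the downstream plans to scan. Below is a toy model in plain Python (not the Spark API; all class and method names here are illustrative) sketching why dropping the cache before the action forces the source to be scanned once per consumer:

```python
# Toy model of lazy evaluation with a shared cache slot.
# This is NOT Spark code; it only illustrates the timing issue:
# if the cache is dropped before any "action" runs, the cached
# rows are absent at execution time and every consumer rescans
# the source -- matching a DAG without InMemoryTableScan nodes.

class LazyFrame:
    def __init__(self, compute, cache_store=None):
        self._compute = compute      # thunk that produces this frame's rows
        self._cache = cache_store    # shared mutable cache slot (or None)

    def cache(self):
        # Mark for caching: rows are stored on first materialization.
        return LazyFrame(self._compute, cache_store={})

    def select(self):
        parent = self
        def compute():
            # At execution time, use the parent's cache only if it
            # still exists and was populated.
            if parent._cache is not None and "rows" in parent._cache:
                return parent._cache["rows"]
            rows = parent._compute()
            if parent._cache is not None:
                parent._cache["rows"] = rows
            return rows
        return LazyFrame(compute)

    def unpersist(self):
        # Eager: the cache slot is gone immediately, even though
        # no downstream plan has executed yet.
        if self._cache is not None:
            self._cache.clear()
            self._cache = None
        return self

source_reads = []
def scan():
    source_reads.append(1)           # count how often the source is read
    return [1, 2, 3]

df = LazyFrame(scan).cache()
df_b = df.select()
df_c = df.select()
df.unpersist()                       # dropped before any action fires
df_b._compute()                      # "actions" run only now...
df_c._compute()
print(len(source_reads))             # prints 2: the source was scanned twice
```

With the unpersist() call moved after the actions (or omitted), the first consumer populates the cache and the second one reuses it, so the source is scanned only once.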
--
This message was sent by Atlassian Jira
(v8.3.2#803003)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org