Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:02:12 UTC

[jira] [Updated] (SPARK-20683) Make table uncache chaining optional

     [ https://issues.apache.org/jira/browse/SPARK-20683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-20683:
---------------------------------
    Labels: bulk-closed  (was: )

> Make table uncache chaining optional
> ------------------------------------
>
>                 Key: SPARK-20683
>                 URL: https://issues.apache.org/jira/browse/SPARK-20683
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>         Environment: Not particularly environment sensitive.  Encountered/tested on Linux and Windows.
>            Reporter: Shea Parkes
>            Priority: Major
>              Labels: bulk-closed
>
> A recent change in SPARK-19765 causes table uncaching to chain.  That is, if table B is a child of table A and both are cached, uncaching table A now automatically uncaches table B as well.
> At first I did not understand the need for this, but when reading the unit tests, I see that it is likely that many people do not keep named references to the child table (e.g. B).  Perhaps B is just made and cached as some part of data exploration.  In that situation, it makes sense for B to automatically be uncached when you are finished with A.
> However, we commonly use a different design pattern that this automatic uncaching now harms.  We often cache table A in order to build two independent child tables (e.g. B and C).  Once those two child tables are realized and cached, we uncache table A (it is no longer needed and can be quite large).  Since this change, uncaching table A also drops the cached status of both B and C, which is quite frustrating.  All of these tables are often quite large, and we view what we're doing as mindful memory management.  We maintain named references to B and C at all times, so we can always uncache them ourselves when it makes sense.
> Would it be acceptable/feasible to make this table uncache chaining optional?  I would be fine if the default is for the chaining to happen, as long as we can turn it off via parameters.
> If acceptable, I can try to work on the required changes.  I am most comfortable in Python (and would want the optional parameter surfaced in Python), but I have found the places in Scala where this change would need to be made (since I have already reverted the functionality in a private fork).  Any help would be greatly appreciated, however.
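
The request above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's cache manager: the class `ToyCacheRegistry`, the helper `run_pattern`, and the `cascade` parameter are all hypothetical names made up for illustration, and the issue does not say what the real option would be called or where it would live.

```python
# Toy model only: this is NOT Spark's implementation, just a sketch of the
# two behaviours discussed in the issue (chained vs. non-chained uncaching).
from typing import Dict, Optional, Set


class ToyCacheRegistry:
    """Tracks which tables are cached and which table each was derived from."""

    def __init__(self) -> None:
        self._cached: Set[str] = set()
        self._parent: Dict[str, Optional[str]] = {}

    def cache(self, name: str, parent: Optional[str] = None) -> None:
        """Mark `name` as cached, optionally recording the table it was built from."""
        self._cached.add(name)
        self._parent[name] = parent

    def is_cached(self, name: str) -> bool:
        return name in self._cached

    def uncache(self, name: str, cascade: bool = True) -> None:
        """Uncache `name`; if `cascade` is true, also uncache its cached children.

        `cascade` is a hypothetical flag standing in for the optional
        behaviour requested in the issue.
        """
        self._cached.discard(name)
        if not cascade:
            return
        children = [
            table
            for table, parent in self._parent.items()
            if parent == name and table in self._cached
        ]
        for child in children:
            self.uncache(child, cascade=True)


def run_pattern(cascade: bool) -> Dict[str, bool]:
    """Cache A, derive and cache B and C from it, then uncache A."""
    registry = ToyCacheRegistry()
    registry.cache("A")
    registry.cache("B", parent="A")
    registry.cache("C", parent="A")
    registry.uncache("A", cascade=cascade)
    return {name: registry.is_cached(name) for name in ("A", "B", "C")}
```

In this model, `run_pattern(True)` corresponds to the chained behaviour described in the issue (A, B and C all end up uncached), while `run_pattern(False)` corresponds to the requested opt-out (only A is uncached; B and C stay cached until they are released explicitly).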



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
