Posted to issues@spark.apache.org by "Maryann Xue (JIRA)" <ji...@apache.org> on 2018/06/22 01:19:00 UTC

[jira] [Updated] (SPARK-24596) Non-cascading Cache Invalidation

     [ https://issues.apache.org/jira/browse/SPARK-24596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated SPARK-24596:
--------------------------------
    Description: 
When invalidating a cache, we also invalidate other caches that depend on it, to ensure cached data is up to date. For example, when the underlying table has been modified or dropped, all caches that use this table should be invalidated or refreshed.

However, in other cases, such as when a user simply wants to drop a cache to free up memory, we do not need to invalidate dependent caches, since no underlying data has changed. For this reason, we would like to introduce a new cache invalidation mode: non-cascading cache invalidation. We choose between the existing (regular) mode and the new mode depending on the cache invalidation scenario:
 # Drop tables and regular (persistent) views: regular mode
 # Drop temporary views: non-cascading mode
 # Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
 # Call {{DataSet.unpersist()}}: non-cascading mode
 # Call {{Catalog.uncacheTable()}}: follow the same convention as for dropping tables and views, that is, use non-cascading mode for temporary views and regular mode for the rest
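
The rules in the list above can be sketched as a small lookup function. This is only an illustration in plain Python, not Spark code: the action labels ("drop", "modify_table", "dataset_unpersist", "uncache_table"), the Mode enum and the function name are all hypothetical names made up for this sketch.

```python
from enum import Enum


class Mode(Enum):
    REGULAR = "regular"              # also invalidate dependent caches
    NON_CASCADING = "non-cascading"  # invalidate only the named cache


def invalidation_mode(action: str, is_temp_view: bool = False) -> Mode:
    """Pick the invalidation mode for a hypothetical action label.

    The labels are assumptions for this sketch; they mirror the five
    scenarios listed in the description.
    """
    if action == "modify_table":        # INSERT/UPDATE/MERGE/DELETE
        return Mode.REGULAR
    if action == "dataset_unpersist":   # DataSet.unpersist()
        return Mode.NON_CASCADING
    if action in ("drop", "uncache_table"):
        # Temporary views are treated like unnamed DataSets; tables and
        # persistent views use the regular (cascading) mode.
        return Mode.NON_CASCADING if is_temp_view else Mode.REGULAR
    raise ValueError(f"unknown action: {action!r}")
```

Only the "drop" and "uncache_table" cases need to know whether the target is a temporary view; the other two scenarios have a fixed mode.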

Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view (whether cached or not), any query referring to that view should no longer be valid. Hence, if a cached persistent view is dropped, we need to invalidate all dependent caches so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in fact equivalent to an unnamed DataSet, and dropping a temporary view should have no impact on queries referencing that view. Thus we should do non-cascading uncaching for temporary views, which also guarantees consistent uncaching behavior between temporary views and unnamed DataSets.
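
To make the difference between the two modes concrete, here is a minimal toy model in plain Python. It is a sketch under stated assumptions, not Spark's cache manager: the ToyCacheManager class, its methods and the entry names "t", "v" and "q" are hypothetical, and each cache entry simply records the names of the entries it was built from.

```python
class ToyCacheManager:
    """Hypothetical stand-in for a cache registry; not a Spark class."""

    def __init__(self):
        # cache name -> set of cache names it was built from
        self._deps = {}

    def cache(self, name, depends_on=()):
        self._deps[name] = set(depends_on)

    def is_cached(self, name):
        return name in self._deps

    def uncache(self, name, cascade):
        """Remove `name`; with cascade=True also remove its dependents."""
        self._deps.pop(name, None)
        if not cascade:
            return
        # Collect dependents first, then recurse, so the dict is never
        # mutated while it is being iterated.
        dependents = [n for n, parents in self._deps.items() if name in parents]
        for dependent in dependents:
            self.uncache(dependent, cascade=True)


# Non-cascading: drop only "v" (think: a temporary view or unpersist()).
mgr = ToyCacheManager()
mgr.cache("t")
mgr.cache("v", depends_on=["t"])
mgr.cache("q", depends_on=["v"])
mgr.uncache("v", cascade=False)
# "t" and "q" are still cached; only "v" is gone.

# Regular mode: invalidating "t" also removes "v" and, through it, "q".
mgr2 = ToyCacheManager()
mgr2.cache("t")
mgr2.cache("v", depends_on=["t"])
mgr2.cache("q", depends_on=["v"])
mgr2.uncache("t", cascade=True)
```

In this toy model a non-cascading drop leaves "q" pointing at a parent that is no longer cached, which matches the intent above: "q" remains valid because no underlying data has changed.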

  was:
When invalidating a cache, we also invalidate other caches that depend on it, to ensure cached data is up to date. For example, when the underlying table has been modified or dropped, all caches that use this table should be invalidated or refreshed.

However, in other cases, such as when a user simply wants to drop a cache to free up memory, we do not need to invalidate dependent caches, since no underlying data has changed. For this reason, we would like to introduce a new cache invalidation mode: non-cascading cache invalidation. We choose between the existing (regular) mode and the new mode depending on the cache invalidation scenario:
 # Drop tables and regular (persistent) views: regular mode
 # Drop temporary views: non-cascading mode
 # Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
 # Call DataSet.unpersist(): non-cascading mode

Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view (whether cached or not), any query referring to that view should no longer be valid. Hence, if a cached persistent view is dropped, we need to invalidate all dependent caches so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in fact equivalent to an unnamed DataSet, and dropping a temporary view should have no impact on queries referencing that view. Thus we should do non-cascading uncaching for temporary views, which also guarantees consistent uncaching behavior between temporary views and unnamed DataSets.


> Non-cascading Cache Invalidation
> --------------------------------
>
>                 Key: SPARK-24596
>                 URL: https://issues.apache.org/jira/browse/SPARK-24596
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maryann Xue
>            Priority: Major
>             Fix For: 2.4.0
>
>



