Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/09/22 16:01:14 UTC

[GitHub] [iceberg] aokolnychyi opened a new issue #1485: Reconsider caching behavior in Spark 3

aokolnychyi opened a new issue #1485:
URL: https://github.com/apache/iceberg/issues/1485


   I think we need to discuss the following two cases.
   
   **Case 1 (refresh of cached relations)**
   
   ```
   sql("CREATE TABLE %s (id bigint NOT NULL, data string) USING iceberg", tableName);
   sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
   sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
   
   Dataset<Row> query = spark.sql("SELECT * FROM " + tableName + " WHERE id = 1");
   query.createOrReplaceTempView("tmp");
   
   spark.sql("CACHE TABLE tmp");
   
   sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);    
   // can be a procedure call as well
   // sql("CALL iceberg.system.rollback_to_snapshot(...)")
       
    // !!! cache must be refreshed !!!
   assertEquals("View cache must be invalidated",
      ImmutableList.of(row(1L, "a"), row(1L, "a"), row(1L, "a")),
      sql("SELECT * FROM tmp"));
   ```
   
   Currently, Spark does not invalidate cached relations that contain a V2 table we just wrote to. I think that's a bug: Spark should match `InsertIntoDataSourceCommand`, which refreshes the cache for V1 tables.
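   
   Until that happens, a possible workaround (just a sketch, not part of the proposal; it assumes the `tableName` and `tmp` view from the snippet above) is to rebuild the cache manually after the write:
   
   ```
   // hypothetical workaround: drop the stale cache entry and re-cache the view
   // so it picks up the table state after the INSERT / procedure call
   spark.sql("UNCACHE TABLE tmp");
   spark.sql("CACHE TABLE tmp");
   ```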
   
   **Case 2 (table instances stored in view definitions)**
   
   ```
   sql("CREATE TABLE %s (id bigint NOT NULL, data string) USING iceberg", tableName);
   sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
   
   Dataset<Row> query = spark.sql("SELECT * FROM " + tableName + " WHERE id = 1");
   query.createOrReplaceTempView("tmp");
   
   assertEquals("View should have expected rows",
      ImmutableList.of(row(1L, "a")),
       sql("SELECT * FROM tmp"));
   
   sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
   
   assertEquals("View should have expected rows",
      ImmutableList.of(row(1L, "a"), row(1L, "a")),
      sql("SELECT * FROM tmp"));
   ```
   
   Currently, if we disable caching in Iceberg catalogs, the view returns stale results: every load produces a different table instance, the view's plan keeps the instance captured at creation time, and there is currently no way to refresh it. As a user, I was surprised because I would expect my table to always be up to date if caching is disabled.
   
   I propose the following:
   - Spark should refresh all cached datasets that contain a reference to the table it just wrote to, just like it does for V1 tables.
   - Procedures must refresh all cached datasets that contain a reference to the table we modified.
   - We should override `equals` and `hashCode` for `SparkTable` and `Table`, and equality must be independent of caching in Iceberg catalogs. If I have two instances of `Table` that point to the same physical table, I think `equals` should return true. I believe Spark will call `spark.sharedState.cacheManager.recacheByPlan()` with an instance of `DataSourceV2Relation`, so proper equality will matter.
   - If caching in Iceberg catalogs is disabled, we should always refresh the table before reading (see the sketch below).
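   
   A minimal sketch of that last point, assuming a hypothetical wrapper and relying only on Iceberg's public `Table.refresh()`; the real `SparkTable` integration would look different:
   
   ```
   import org.apache.iceberg.Table;
   
   // Sketch: refresh the underlying Iceberg table before each scan when
   // catalog-level caching is disabled, so reads never see stale metadata.
   class RefreshingTableHolder {
     private final Table icebergTable;
     private final boolean catalogCachingEnabled;
   
     RefreshingTableHolder(Table icebergTable, boolean catalogCachingEnabled) {
       this.icebergTable = icebergTable;
       this.catalogCachingEnabled = catalogCachingEnabled;
     }
   
     Table tableForScan() {
       if (!catalogCachingEnabled) {
         icebergTable.refresh();  // pick up the latest snapshot
       }
       return icebergTable;
     }
   }
   ```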






[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698621585


   @omalley, I'd be curious to know how Hive handles case 1.




[GitHub] [iceberg] aokolnychyi edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698421717


   I see merits in both approaches, but the biggest concern for me is that we deviate from the way Spark handles the cache for built-in tables (and all other data source V1 tables). I am afraid this will cause a lot of confusion for users. Moreover, our `SparkSessionCatalog` is not consistent now: it invalidates the cache for non-Iceberg tables only.
   
   > a temporary table that should reflect the state of the Dataset at the time it was created
   
   How do Hive or Presto handle temp views? My understanding was that views provide a way to refer to and reuse a logical plan, but they are stateless. @danielcweeks, what do you think about case 2?




[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698423535


   @dbtsai @dongjoon-hyun @viirya is there a description of how Spark should behave in the cases outlined above? Or is it up to catalogs to interpret that?






[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-701009397


   > What would this be based on? The table identifier or path?
   
   I think `SparkTable` can delegate to `equals` and `hashCode` of the underlying Iceberg `Table`. That, in turn, can use either its name (which contains the catalog name plus the full table name) or the table location. If we use table names (i.e. identifiers), the same table loaded through different catalogs will NOT be considered the same. So the table location seems more promising.
   
   @rdblue, thoughts?
   




[GitHub] [iceberg] aokolnychyi edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-701049992


   Actually, you are right. We probably even *should* cache them separately. Let me submit a PR for this.




[GitHub] [iceberg] kbendick edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-699567199


   > I actually would have assumed it was a bug if a "Cache" command was invalidated by another table operation, in my mind it should snapshot the table state at that moment. I know because the behavior is lazy in Spark your guarantees on "when" are a bit more iffy, but I think the Spark cache shouldn't be automatically invalidated.
   
   @RussellSpitzer  I'd like to point out that as of Spark 3.0, the behavior of `CACHE TABLE` is eager, unless the DDL is `CACHE LAZY TABLE` [as documented here](https://spark.apache.org/docs/3.0.0/sql-ref-syntax-aux-cache-cache-table.html#syntax) for caching tables (at least non-temporary tables).
   
   I understand that the behavior is different depending on whether or not the cached data is a `VIEW` or a `TABLE`, most notably whether or not those cached plans / datasets are considered temporary or not.
   
   So I should say that I originally agreed with @RussellSpitzer's view that a cache should be an immutable reference, but that does not appear to have been the intention even prior to Spark 3.0. Spark 3.0 did, however, make cache creation eager (an action), which partly addresses the point-in-time / consistency issue.
   
   I went searching for the best documentation I could find on this, and the most recent thing I found was the JIRA ticket that covers the update to allow for `Non-cascading cache invalidation`. It appears that non-cascading cache invalidation _should_ only take place when a temporary dataset or table/view is dropped in order to save memory, while the underlying query plans for the other cached collections of data remain valid. In other words, as of Spark 2.4.x, cache invalidation is not cascaded unless the query plans of the downstream cached datasets would themselves have been invalidated, and simply dropping the original cached view / table / dataset they were derived from does not qualify. The non-cascading behavior appears to only cover the cases where data is no longer persisted in memory or on disk in order to reclaim resources; otherwise, `REFRESH`ing temporary tables _should_ invalidate the cache (possibly dependent on changes, I'm not sure) as well as any downstream cached views / query plans that would also be invalidated. cc @aokolnychyi. So what I read in this ticket, which might be the most up-to-date documentation on "intended" cache behavior in various scenarios, seems to line up with @rdblue's latest response.
   
   The JIRA ticket is here: https://issues.apache.org/jira/browse/SPARK-24596
   
   For convenience and reference, I've quoted the pertinent description of the intended high-level behaviors from that JIRA ticket. Note that `regular mode` here refers to cache invalidation that does cascade to derived (on-cluster) caches of datasets / query plans.
   
   ```
   1. Drop tables and regular (persistent) views: regular mode
   2. Drop temporary views: non-cascading mode
   3. Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
   4. Call DataSet.unpersist(): non-cascading mode
   5. Call Catalog.uncacheTable(): follow the same convention as drop tables/view, which is, use non-cascading mode for temporary views and regular mode for the rest
   ```
   
   In the associated PR (which does appear to have been merged and released as of Spark 2.4.x), the description is further clarified with the following TL;DR ([link to the PR](https://github.com/apache/spark/pull/21594)):
   
   ```
   Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view (whether cached or not cached), any query referring to that view should no long be valid. Hence if a cached persistent view is dropped, we need to invalidate the all dependent caches so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in fact equivalent to an unnamed DataSet, and dropping a temporary view should have no impact on queries referencing that view. Thus we should do non-cascading uncaching for temporary views, which also guarantees a consistent uncaching behavior between temporary views and unnamed DataSets.
   ```
   
   I know that this discussion is about the observed change in behaviors that occur because of V2 tables (and potentially from using a catalog that is not part of Spark proper), but since this is the most recent explanation I could find that covered the questions that are being discussed, I thought it was relevant to the conversation.
   
   After going through all of this, I will say that I had originally agreed with @RussellSpitzer's take, especially on point 1, that the cache is a point-in-time reference and immutable (setting aside the notion of consistency with `lazy` caching, which is no longer the default). However, our own experience and the documentation I could find do not line up with that.
   I will say now that I'm mostly in agreement with @rdblue, especially that the (possible) need / desire to call `REFRESH` should be documented and recommended in the near term.
   
   I also agree that we should delegate the choosing of cache invalidation for queries to Spark, but ensure that we have a means to get that information to Spark to allow it to do its thing. Particularly at the Catalog level.
   
   In the long term, I think Spark users will expect the latest behavior of Spark as the standard. Though how long it takes for changes in behavior to propagate into the minds of users as the latest / most correct is not something we could really estimate. I use Spark quite often (though I've yet to do a production upgrade to Spark 3, admittedly) and I will admit I was surprised by some of the things I learned.
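   
   To make the `regular` (cascading) vs `non-cascading` modes quoted above concrete, here is a small sketch using a plain V1 parquet table; the names are illustrative and the expected outcomes are taken from the ticket, not verified here. It assumes a `SparkSession spark` and the usual `Dataset`/`Row` imports:
   
   ```
   spark.sql("CREATE TABLE t (id INT) USING parquet");
   spark.sql("INSERT INTO t VALUES (1)");
   
   Dataset<Row> base = spark.table("t");
   base.createOrReplaceTempView("base_view");
   spark.sql("CACHE TABLE base_view");
   
   Dataset<Row> derived = base.filter("id > 0");
   derived.cache();
   derived.count();  // materialize the derived cache entry
   
   // (2) dropping a temporary view: non-cascading mode, `derived` stays cached
   spark.catalog().dropTempView("base_view");
   
   // (3) modifying table contents: regular mode, caches over `t` are invalidated
   spark.sql("INSERT INTO t VALUES (2)");
   ```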




[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-705369551


   I think we've done all we could on the Iceberg side. Now we need to fix Spark, and @sunchao will try to look into that.




[GitHub] [iceberg] kbendick edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-699568949


   I'd also like to point out that, according to what little documentation I could find, the cached query plan _might_ not be invalidated if it's using a Hadoop / file-based catalog. Specifically, the documentation implies that there's a need to call `REFRESH` to actually invalidate path-based data. The [documentation on `REFRESH`](https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-refresh.html) has this definition:
   
   ```
   REFRESH is used to invalidate and refresh all the cached data (and the associated metadata)
   for all Datasets that contains the given data source path.
   Path matching is by prefix, i.e. “/” would invalidate everything that is cached.
   ```
   The documentation further drives the point home that a `REFRESH` call is likely needed when things like `INSERT` are done against what appear to be Hadoop-based catalogs, by giving this example for path-based resources:
   
   ```
   -- The Path is resolved using the datasource's File Index.
   CREATE TABLE test(ID INT) using parquet;
   INSERT INTO test SELECT 1000;
   CACHE TABLE test;
   INSERT INTO test SELECT 100;
   REFRESH "hdfs://path/to/table";
   ```
   
   I hope that this investigation can shed further light on the intended behavior and our assumptions, as well as what needs to be done to ensure correctness, particularly when dealing with `Tables` or essentially non-temporary `Views`. I believe what I've found strongly supports @rdblue's assertion that caching a `Dataset` is equivalent to caching a `temporary view`.
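   
   For reference, the programmatic equivalents of that path-based `REFRESH` are the `Catalog` methods below (a sketch assuming a `SparkSession spark`; the path and table name are illustrative):
   
   ```
   // invalidate cached data and metadata for every Dataset that reads from this path prefix
   spark.catalog().refreshByPath("hdfs://path/to/table");
   
   // name-based variant for tables registered in a catalog
   spark.catalog().refreshTable("db.test");
   ```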




[GitHub] [iceberg] danielcweeks commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698405831


   I would tend to agree with @RussellSpitzer's original point. I don't feel like the cache should be invalidated. Keep in mind that `createOrReplaceTempView()` is a Dataset operation, and though in these examples we are referring to a single table, the query can be arbitrarily complex. Expiring the caches to reflect the source tables in the more complicated case would make the cache more like a materialized view than simply a cache, and could cause side effects if you expect the cache to be constant (say, using it to update the original table).
   
   I feel like the cache is more like a temporary table that should reflect the state of the Dataset at the time it was created (though I recognize that cache is not an action, so it may not accurately reflect the table state at the time it's called).








[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698425965


   @danielcweeks, I think I misunderstood your comment after reading it once more. You were referring to case 1 only, right?








[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-696913004


   > I actually would have assumed it was a bug if a "Cache" command was invalidated by another table operation, in my mind it should snapshot the table state at that moment. I know because the behavior is lazy in Spark your guarantees on "when" are a bit more iffy, but I think the Spark cache shouldn't be automatically invalidated.
   
   That depends on how we interpret the cache in Spark. I don't see a description of how it is supposed to work, so I rely on the current behavior as the correct way of doing things. I interpret the cache as something specific to Spark, or even to a particular Spark application. Internally, `CacheManager` stores a full logical plan, and if a table that is part of that plan is refreshed, the cache is invalidated (done by `InsertIntoDataSourceCommand`). I believe users are used to this behavior and would be really surprised if V2 tables behaved differently. If we modify a table from the same driver and already know the cache entry is no longer valid, it seems reasonable to update it.




[GitHub] [iceberg] rdblue edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-699242399


   I agree with most of what's been said about case 1. I would expect the behavior to be a point in time, but that doesn't fit with what Spark actually does for existing tables. And I think it is right to be concerned about a behavior change.
   
   But I don't think that case 1 is an area where Iceberg should make a choice. Caching datasets and invalidating those datasets is implemented above the source in Spark, so I think this is a concern that Spark should solve, not Iceberg. If Spark chooses to invalidate cached data after a write, then that's fine. Similarly, if a `REFRESH` also invalidates cached data, that high-level choice is up to Spark (case 3). I don't think it would be a good idea for Iceberg to call into Spark to refresh or invalidate, when Spark is calling to Iceberg to perform an operation in the first place.
   
   Case 2 is the one that I think we should worry about. A temporary view is basically a named logical plan, like a `Dataset`, so I'll just refer to them both as a "dataset".
   
   The first question is where a dataset lies on the spectrum between a view (always up to date) and an RDD (effectively immutable). I think that we probably agree here that **a dataset should have view behavior**, because:
   
   * A temporary view would have view behavior
   * Spark currently has no guarantee that a dataset will give the same result because the planner is used. Each addition to a Dataset results in a different logical plan that gets resolved and optimized in isolation. For example, `df = spark.table("t").filterExpr("date = '2020-09-25'")` could have refinements `odds = df.filterExpr("category = 'odd'")` and `evens = df.filterExpr("category = 'even'")`. There is no guarantee that `odds.union(evens)` has any relationship to `df` because the extra filters require building separate scans at different times.
   * Spark's v1 behavior for refresh is based on invalidating cached metadata about a table's partitions and it can't cache what has not been loaded.
   * Spark supports refreshing tables, which appears to also refresh existing datasets.
   
   If we agree on what the behavior should be, then I think that the current caching catalog is correct. Because it reuses the same `Table` instance, all datasets that have a reference from that catalog are in sync. Changes to that table will cause it to be refreshed, and the `refresh` command from Spark also works.
   
   When not using the caching catalog, then the behavior would be that each dataset is independent and fixed at load time. I don't think that's correct. So we should implement a way for `SparkTable` to check whether it should refresh the Iceberg table. Then we could have the same behavior regardless of caching. And, we should also consider routing IcebergSource queries through a catalog so they are covered by the same mechanism.
   
   I think that agrees with all of @aokolnychyi's points, except for the last one. I think it is important to have a consistent version for operations like self-joins. I'd rather invest in a Spark catalog refresh system, although I think the suggestion to refresh tables for each operation would work in the short term.
   
   Also, I didn't look much into the point about `recacheByPlan`. Implementing `SparkTable.equals` sounds good to me.






[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-696816484


   @rdblue @RussellSpitzer @HeartSaVioR what do you think guys?




[GitHub] [iceberg] rdblue commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-700973213


   > We need to override equals and hashCode in our table implementations.
   
   What would this be based on? The table identifier or path?
   
   > We need to route writes through IcebergSource to catalogs.
   
   Yes, I think this is a good idea. We will need to make sure we have a good plan for how to pass paths through `Identifier` though.
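   
   Purely as an illustration of what "passing paths through `Identifier`" could look like (not a settled design; the reserved namespace below is hypothetical):
   
   ```
   import org.apache.spark.sql.connector.catalog.Identifier;
   
   // hypothetical convention: a reserved namespace element marks path-based identifiers
   String path = "hdfs://warehouse/db/table";
   Identifier pathIdent = Identifier.of(new String[] { "__path__" }, path);
   ```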




[GitHub] [iceberg] rdblue commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-696916207


   @danielcweeks, it would be great to hear your thoughts on what Spark _should_ do in some of these cases.




[GitHub] [iceberg] aokolnychyi edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-700967918


   This is a great discussion, thanks to everyone for taking a look at this!
   
   I completely agree with @rdblue and @bryanck that it is up to Spark to invalidate the cached relations, just as it does for V1 tables. The only case where we would have to match this in Iceberg is in our stored procedures that may roll back table state or produce new snapshots (we will review this case by case).
   
   Then I'll sum up this discussion with the following points (feel free to correct):
   - Spark should refresh its cache for V2 tables to match V1 behavior (covers REFRESH, INSERT, etc).
   - Iceberg should refresh Spark cache when modifying tables in procedures.
   - We should extend `SparkTable` with the ability to refresh the underlying table when the Iceberg caching catalog is not used, so that behavior is consistent regardless of whether caching is enabled in Iceberg catalogs.
   - We need to override `equals` and `hashCode` in our table implementations.
   - We need to route writes through `IcebergSource` to catalogs.




[GitHub] [iceberg] rdblue commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-701028183


   I think equality should be by full table name for metastore tables, or by location for Hadoop tables.
   
   Tables can share the same location, so it wouldn't be safe to assume that two tables are the same based on the backing location. And I also don't think we need to solve the case where the same table is loaded by two different catalogs. That's not going to come up very often and it is entirely reasonable to cache them separately.
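   
   A rough sketch of that equality rule (illustrative only, not the actual Iceberg implementation; `name` stands for the catalog-qualified table name and is null for path-based tables):
   
   ```
   import java.util.Objects;
   
   // identity by catalog-qualified name for metastore tables,
   // falling back to the table location for Hadoop (path-based) tables
   class TableIdentity {
     private final String name;      // e.g. "catalog.db.table", null for Hadoop tables
     private final String location;  // table root location
   
     TableIdentity(String name, String location) {
       this.name = name;
       this.location = location;
     }
   
     @Override
     public boolean equals(Object other) {
       if (this == other) {
         return true;
       }
       if (!(other instanceof TableIdentity)) {
         return false;
       }
       TableIdentity that = (TableIdentity) other;
       if (name != null || that.name != null) {
         return Objects.equals(name, that.name);       // metastore tables: compare names
       }
       return Objects.equals(location, that.location); // Hadoop tables: compare locations
     }
   
     @Override
     public int hashCode() {
       return name != null ? name.hashCode() : Objects.hashCode(location);
     }
   }
   ```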








[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-696922254


   cc @holdenk as well






[GitHub] [iceberg] kbendick edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-699567199


   > I actually would have assumed it was a bug if a "Cache" command was invalidated by another table operation, in my mind it should snapshot the table state at that moment. I know because the behavior is lazy in Spark your guarantees on "when" are a bit more iffy, but I think the Spark cache shouldn't be automatically invalidated.
   
   @RussellSpitzer  I'd like to point out that as of Spark 3.0, the behavior of `CACHE TABLE` is eager, unless the DDL is `CACHE LAZY TABLE` [as documented here](https://spark.apache.org/docs/3.0.0/sql-ref-syntax-aux-cache-cache-table.html#syntax) for caching tables (at least non-temporary tables).
   
   I understand that the behavior is different depending on whether or not the cached data is a `VIEW` or a `TABLE`, most notably whether or not those cached plans / datasets are considered temporary or not.
   
   I went searching for the best documentation I could on this, but the most recent thing I could find was the JIRA ticket that covers the update to allow for `Non-cascading cache invalidation`. It appears that non-cascading cache invalidation _should_ only take place when a temporary dataset or table/view is dropped in order to save memory, but the underlying query plans for the other cached collections of data are still valid (to avoid using `Table` or `View` or `Dataset`). E.g. cache invalidation is not cascaded as of Spark 2.4.x unless the query plans of the downstream cached datasets would have been invalidated, which does not include strictly dropping the original cached view / table / dataset that they might have been derived from. The non-cascading behavior appears to really only cover the cases when data is no longer persisted in memory or on disk etc in order to reclaim resources, but that otherwise `REFRESH`ing temporary tables _should_ invalidate the cache (possibly depen
 dent on changes, I'm not sure) and any downstream cached views / query plans that would also be invalidated.  cc @aokolnychyi. So what I read in this ticket, which might be the most up to date documentation on "intended" cache behavior in various scenarios seems to line up with @rdblue's latest response.
   
   The JIRA ticket is here: https://issues.apache.org/jira/browse/SPARK-24596
   
   For convenience and reference, I've quoted the pertinent description of the intended high level behaviors from that JIRA ticket. Note that `regular mode` here refers to cache invalidation that does cascade to derived on cluster caches of datasets / query plans.
   
   ```
   1. Drop tables and regular (persistent) views: regular mode2
   2. Drop temporary views: non-cascading mode
   3. Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
   4. Call DataSet.unpersist(): non-cascading mode
   5. Call Catalog.uncacheTable(): follow the same convention as drop tables/view, which is, use non-cascading mode for temporary views and regular mode for the rest
   ```
   
   In the associated PR (which does appear to have been merged in and released as of Spark 2.4.x), the description is further clarified with the following TLDR ([link to the PR])(https://github.com/apache/spark/pull/21594)
   
   ```
   Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view 
   (whether cached or not cached), any query referring to that view should no long be valid. 
   Hence if a cached persistent view is dropped, we need to invalidate the all dependent caches 
   so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in 
   fact equivalent to an unnamed DataSet, and dropping a temporary view should have no impact on queries 
   referencing that view.
   
   Thus we should do non-cascading uncaching for temporary views, which also guarantees a consistent 
   uncaching behavior between temporary views and unnamed DataSets.
   ```
   
   I know that this discussion is about the observed change in behaviors that occur because of V2 tables (and potentially from using a catalog that is not part of Spark proper), but since this is the most recent explanation I could find that covered the questions that are being discussed, I thought it was relevant to the conversation.
   
   After going through all of this, I will say that I had originally agreed with @RussellSpitzer's take, especially on point 1, that the cache is a point-in-time reference and immutable (potentially throwing away the notion of consistency with `lazy` caching, which is no longer the default). However, our own experience and the documentation I could find do not line up with that.
   I will say now that I'm mostly in agreement with @rdblue, especially that the (possible) need / desire to call `REFRESH` should be documented and recommended in the near term.
   
   I also agree that we should delegate the choosing of cache invalidation for queries to Spark, but ensure that we have a means to get that information to Spark to allow it to do its thing. Particularly at the Catalog level.
   
   In the long term, I think Spark users will expect the latest behavior of Spark as the standard. Though how long it takes for changes in behavior to propagate into the minds of users as the latest / most correct is not something we could really estimate. I use Spark quite often (though I've yet to do a production upgrade to Spark 3, admittedly) and I will admit I was surprised by some of the things I learned.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-699242399


   I agree with most of what's been said about case 1. I would expect the behavior to be a point in time, but that doesn't fit with what Spark actually does for existing tables. And I think it is right to be concerned about a behavior change.
   
   But I don't think that case 1 is an area where Iceberg should make a choice. Caching datasets and invalidating those datasets is implemented above the source in Spark, so I think this is a concern that Spark should solve, not Iceberg. If Spark chooses to invalidate cached data after a write, then that's fine. Similarly, if a `REFRESH` also invalidates cached data, that high-level choice is up to Spark (case 3). I don't think it would be a good idea for Iceberg to call into Spark to refresh or invalidate, when Spark is calling to Iceberg to perform an operation in the first place.
   
   Case 2 is the one that I think we should worry about. A temporary view is basically a named logical plan, like a `Dataset`, so I'll just refer to them both as a "dataset".
   
   The first question is where a dataset lies on the spectrum between a view (always up to date) and an RDD (effectively immutable). I think that we probably agree here that **a dataset should have view behavior**, because:
   
   * A temporary view would have view behavior
   * Spark currently has no guarantee that a dataset will give the same result because the planner is used. Each addition to a Dataset results in a different logical plan that gets resolved and optimized in isolation. For example (spelled out in the sketch after this list), `df = spark.table("t").filterExpr("date = '2020-09-25'")` could have refinements `odds = df.filterExpr("category = 'odd'")` and `evens = df.filterExpr("category = 'even'")`. There is no guarantee that `odds.union(evens)` has any relationship to `df` because the extra filters require building separate scans at different times.
   * Spark's v1 behavior for refresh is based on invalidating cached data about a table and it can't cache what has not been loaded.
   * Spark supports refreshing tables, which appears to also refresh existing datasets.
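   
   Spelling that example out as a hedged Java sketch (using `Dataset.filter`; the table and column names are only illustrative):
   
   ```
   Dataset<Row> df = spark.table("t").filter("date = '2020-09-25'");
   Dataset<Row> odds = df.filter("category = 'odd'");
   Dataset<Row> evens = df.filter("category = 'even'");
   
   // odds and evens are resolved, optimized and scanned independently of df, so
   // odds.union(evens) has no point-in-time relationship to df.
   Dataset<Row> combined = odds.union(evens);
   ```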
   
   If we agree on what the behavior should be, then I think that the current caching catalog is correct. Because it reuses the same `Table` instance, all datasets that have a reference from that catalog are in sync. Changes to that table will cause it to be refreshed, and the `refresh` command from Spark also works.
   
   When not using the caching catalog, then the behavior would be that each dataset is independent and fixed at load time. I don't think that's correct. So we should implement a way for `SparkTable` to check whether it should refresh the Iceberg table. Then we could have the same behavior regardless of caching. And, we should also consider routing IcebergSource queries through a catalog so they are covered by the same mechanism.
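   
   A purely hypothetical sketch of that idea (the class and method names are illustrative, not the actual `SparkTable` code), assuming `org.apache.iceberg.Table#refresh()` is what picks up new commits:
   
   ```
   import org.apache.iceberg.Table;
   
   class RefreshingTableHolder {
     private final Table table;             // org.apache.iceberg.Table
     private final boolean refreshEagerly;  // e.g. true when catalog caching is disabled
   
     RefreshingTableHolder(Table table, boolean refreshEagerly) {
       this.table = table;
       this.refreshEagerly = refreshEagerly;
     }
   
     Table tableForScan() {
       if (refreshEagerly) {
         table.refresh();                   // re-read metadata so the scan sees the latest snapshot
       }
       return table;
     }
   }
   ```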
   
   I think that agrees with all of @aokolnychyi's points, except for the last one. I think it is important to have a consistent version for operations like self-joins. I'd rather invest in a Spark catalog refresh system, although I think the suggestion to refresh tables for each operation would work in the short term.
   
   Also, I didn't look much into the point about `recacheByPlan`. Implementing `SparkTable.equals` sounds good to me.
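   
   A hedged sketch of what that equality could look like (the `icebergTable` field name is illustrative and the comparison key is an assumption; the real implementation may differ):
   
   ```
   // inside SparkTable, where `icebergTable` is the wrapped org.apache.iceberg.Table
   @Override
   public boolean equals(Object other) {
     if (this == other) {
       return true;
     }
     if (other == null || getClass() != other.getClass()) {
       return false;
     }
     SparkTable that = (SparkTable) other;
     // equal when both instances point at the same physical table, regardless of
     // which Table object the catalog handed out
     return icebergTable.name().equals(that.icebergTable.name());
   }
   
   @Override
   public int hashCode() {
     return icebergTable.name().hashCode();
   }
   ```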


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-696958355


   I guess my thoughts were related to the programmatic API. I guess in the table/catalog world we can have many more interpretations of what "cache" should do.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698421717


   I see merits in both approaches but the biggest concern for me is that we deviate from the way Spark handles cache for built-in tables. I am afraid this will cause a lot of confusion for the users. Moreover, our `SparkSessionCatalog` is not consistent now. It will invalidate cache for non-Iceberg tables only.
   
   > a temporary table that should reflect the state of the Dataset at the time it was created
   
   How do Hive or Presto handle temp views? My understanding was that views provide a way to refer to and reuse a logical plan, but they are stateless. @danielcweeks, what do you think about case 2? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-699567199


   > I actually would have assumed it was a bug if a "Cache" command was invalidated by another table operation, in my mind it should snapshot the table state at that moment. I know because the behavior is lazy in Spark your guarantees on "when" are a bit more iffy, but I think the Spark cache shouldn't be automatically invalidated.
   
   @RussellSpitzer  I'd like to point out that as of Spark 3.0, the behavior of `CACHE TABLE` is eager, unless the DDL is `CACHE LAZY TABLE` as documented here: https://spark.apache.org/docs/3.0.0/sql-ref-syntax-aux-cache-cache-table.html#syntax.
   
   I understand that the behavior is different depending on whether or not the cached data is a `VIEW` or a `TABLE`.
   
   So I should say that I originally agreed with @RussellSpitzer's view, that a cache should be an immutable reference, but that appears not to have been the intention even prior to Spark 3.0. Though Spark 3.0 did make cache creation eager (an action), so the point-in-time / consistency issue is somewhat better addressed.
   
   I went searching for the best documentation I could on this, but the most recent thing I could find was the JIRA ticket that covers the update to allow for `Non-cascading cache invalidation`. It appears that non-cascading cache invalidation _should_ only take place when a temporary dataset or table/view is dropped in order to save memory, but the underlying query plans for the other cached collections of data are still valid (to avoid using `Table` or `View` or `Dataset`). That is, cache invalidation is not cascaded as of Spark 2.4.x unless the query plans of the downstream cached datasets would have been invalidated, which does not include strictly dropping the original cached view / table / dataset that they might have been derived from. The non-cascading behavior appears to really only cover the cases where data is no longer persisted in memory or on disk in order to reclaim resources; otherwise, `REFRESH`ing temporary tables _should_ invalidate the cache (possibly dependent on changes, I'm not sure) as well as any downstream cached views / query plans that would also be invalidated. cc @aokolnychyi. So what I read in this ticket, which might be the most up-to-date documentation on "intended" cache behavior in various scenarios, seems to line up with @rdblue's latest response.
   
   The JIRA ticket is here: https://issues.apache.org/jira/browse/SPARK-24596
   
   For convenience and reference, I've quoted the pertinent description of the intended high-level behaviors from that JIRA ticket. Note that `regular mode` here refers to cache invalidation that does cascade to on-cluster caches of derived datasets / query plans.
   
   ```
   1. Drop tables and regular (persistent) views: regular mode
   2. Drop temporary views: non-cascading mode
   3. Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
   4. Call DataSet.unpersist(): non-cascading mode
   5. Call Catalog.uncacheTable(): follow the same convention as drop tables/view, which is, use non-cascading mode for temporary views and regular mode for the rest
   ```
   
   In the associated PR (which does appear to have been merged in and released as of Spark 2.4.x), the description is further clarified with the following TLDR ([link to the PR](https://github.com/apache/spark/pull/21594)):
   
   ```
   Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view (whether cached or not cached), any query referring to that view should no longer be valid. Hence if a cached persistent view is dropped, we need to invalidate all the dependent caches so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in fact equivalent to an unnamed DataSet, and dropping a temporary view should have no impact on queries referencing that view. Thus we should do non-cascading uncaching for temporary views, which also guarantees a consistent uncaching behavior between temporary views and unnamed DataSets.
   ```
   
   I know that this discussion is about the observed change in behaviors that occur because of V2 tables (and potentially from using a catalog that is not part of Spark proper), but since this is the most recent explanation I could find that covered the questions that are being discussed, I thought it was relevant to the conversation.
   
   After going through all of this, I will say that I had originally agreed with @RussellSpitzer's take, especially on point 1, that the cache is a point-in-time reference and immutable (potentially throwing away the notion of consistency with `lazy` caching, which is no longer the default). However, our own experience and the documentation I could find do not line up with that.
   I will say now that I'm mostly in agreement with @rdblue, especially that the (possible) need / desire to call `REFRESH` should be documented and recommended in the near term.
   
   I also agree that we should delegate the choosing of cache invalidation to Spark, but ensure that we have a means to get that information to Spark to allow it to do its thing.
   
   In the long term, I think Spark users will expect the latest behavior of Spark as the standard. Though how long it takes for changes in behavior to propagate into the minds of users as the latest / most correct is not something we could really estimate. I use Spark quite often (though I've yet to do a production upgrade to Spark 3, admittedly) and I will admit I was surprised by some of the things I learned.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698457408


   > It feels like the biggest issue here is that behavior of cache is not consistent or well-defined
   
   That's what I think as well. Previously, `InsertIntoDataSourceCommand` always invalidated the cache and data sources did not have an option to change that behavior. Right now, Spark does not do anything automatically for custom catalogs (but does match the old behavior in its catalog) so I am not sure whether the interpretation is now up to the catalog or not.
   
   Also, caching of V2 tables is not supported at all now.
   
   ```
   // throws CACHE TABLE is only supported with temp views or v1 tables.
   spark.sql("CACHE TABLE " + tableName)
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698421717






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-699567199


   > I actually would have assumed it was a bug if a "Cache" command was invalidated by another table operation, in my mind it should snapshot the table state at that moment. I know because the behavior is lazy in Spark your guarantees on "when" are a bit more iffy, but I think the Spark cache shouldn't be automatically invalidated.
   
   @RussellSpitzer  I'd like to point out that as of Spark 3.0, the behavior of `CACHE TABLE` is eager, unless the DDL is `CACHE LAZY TABLE` [as documented here](https://spark.apache.org/docs/3.0.0/sql-ref-syntax-aux-cache-cache-table.html#syntax) for caching tables (at least non-temporary tables).
   
   I understand that the behavior is different depending on whether or not the cached data is a `VIEW` or a `TABLE`, most notably whether or not those cached plans / datasets are considered temporary or not.
   
   I went searching for the best documentation I could on this, but the most recent thing I could find was the JIRA ticket that covers the update to allow for `Non-cascading cache invalidation`. It appears that non-cascading cache invalidation _should_ only take place when a temporary dataset or table/view is dropped in order to save memory, but the underlying query plans for the other cached collections of data are still valid (to avoid using `Table` or `View` or `Dataset`). That is, cache invalidation is not cascaded as of Spark 2.4.x unless the query plans of the downstream cached datasets would have been invalidated, which does not include strictly dropping the original cached view / table / dataset that they might have been derived from. The non-cascading behavior appears to really only cover the cases where data is no longer persisted in memory or on disk in order to reclaim resources; otherwise, `REFRESH`ing temporary tables _should_ invalidate the cache (possibly dependent on changes, I'm not sure) as well as any downstream cached views / query plans that would also be invalidated. cc @aokolnychyi. So what I read in this ticket, which might be the most up-to-date documentation on "intended" cache behavior in various scenarios, seems to line up with @rdblue's latest response.
   
   The JIRA ticket is here: https://issues.apache.org/jira/browse/SPARK-24596
   
   For convenience and reference, I've quoted the pertinent description of the intended high-level behaviors from that JIRA ticket. Note that `regular mode` here refers to cache invalidation that does cascade to on-cluster caches of derived datasets / query plans.
   
   ```
   1. Drop tables and regular (persistent) views: regular mode
   2. Drop temporary views: non-cascading mode
   3. Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
   4. Call DataSet.unpersist(): non-cascading mode
   5. Call Catalog.uncacheTable(): follow the same convention as drop tables/view, which is, use non-cascading mode for temporary views and regular mode for the rest
   ```
   
   In the associated PR (which does appear to have been merged in and released as of Spark 2.4.x), the description is further clarified with the following TLDR ([link to the PR](https://github.com/apache/spark/pull/21594)):
   
   ```
   Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view (whether cached or not cached), any query referring to that view should no longer be valid. Hence if a cached persistent view is dropped, we need to invalidate all the dependent caches so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in fact equivalent to an unnamed DataSet, and dropping a temporary view should have no impact on queries referencing that view. Thus we should do non-cascading uncaching for temporary views, which also guarantees a consistent uncaching behavior between temporary views and unnamed DataSets.
   ```
   
   I know that this discussion is about the observed change in behaviors that occur because of V2 tables (and potentially from using a catalog that is not part of Spark proper), but since this is the most recent explanation I could find that covered the questions that are being discussed, I thought it was relevant to the conversation.
   
   After going through all of this, I will say that I had originally agreed with @RussellSpitzer's take, especially on point 1, that the cache is a point-in-time reference and immutable (potentially throwing away the notion of consistency with `lazy` caching, which is no longer the default). However, our own experience and the documentation I could find do not line up with that.
   I will say now that I'm mostly in agreement with @rdblue, especially that the (possible) need / desire to call `REFRESH` should be documented and recommended in the near term.
   
   I also agree that we should delegate the choosing of cache invalidation for queries to Spark, but ensure that we have a means to get that information to Spark to allow it to do its thing. Particularly at the Catalog level.
   
   In the long term, I think Spark users will expect the latest behavior of Spark as the standard. Though how long it takes for changes in behavior to propagate into the minds of users as the latest / most correct is not something we could really estimate. I use Spark quite often (though I've yet to do a production upgrade to Spark 3, admittedly) and I will admit I was surprised by some of the things I learned.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698421717


   I see merits in both approaches but the biggest concern for me is that we deviate from the way Spark handles cache for built-in tables (and other v1 tables). I am afraid this will cause a lot of confusion for the users. Moreover, our `SparkSessionCatalog` is not consistent now. It will invalidate cache for non-Iceberg tables only.
   
   > a temporary table that should reflect the state of the Dataset at the time it was created
   
   How do Hive or Presto handle temp views? My understanding was that views provide a way to refer to and reuse a logical plan, but they are stateless. @danielcweeks, what do you think about case 2? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-696923006


   Here is what I see in `CacheManager`:
   
   ```
     /**
      * Tries to re-cache all the cache entries that contain `resourcePath` in one or more
      * `HadoopFsRelation` node(s) as part of its logical plan.
      */
     def recacheByPath(spark: SparkSession, resourcePath: String): Unit = {
       val (fs, qualifiedPath) = {
         val path = new Path(resourcePath)
         val fs = path.getFileSystem(spark.sessionState.newHadoopConf())
         (fs, fs.makeQualified(path))
       }
   
       recacheByCondition(spark, _.plan.find(lookupAndRefresh(_, fs, qualifiedPath)).isDefined)
     }
   
     /**
      * Traverses a given `plan` and searches for the occurrences of `qualifiedPath` in the
      * [[org.apache.spark.sql.execution.datasources.FileIndex]] of any [[HadoopFsRelation]] nodes
      * in the plan. If found, we refresh the metadata and return true. Otherwise, this method returns
      * false.
      */
     private def lookupAndRefresh(plan: LogicalPlan, fs: FileSystem, qualifiedPath: Path): Boolean = {
       plan match {
         case lr: LogicalRelation => lr.relation match {
           case hr: HadoopFsRelation =>
             refreshFileIndexIfNecessary(hr.location, fs, qualifiedPath)
           case _ => false
         }
   
         case DataSourceV2Relation(fileTable: FileTable, _, _, _, _) =>
           refreshFileIndexIfNecessary(fileTable.fileIndex, fs, qualifiedPath)
   
         case _ => false
       }
     }
   ```
   
   Here, matching on `DataSourceV2Relation` for `FileTable` was already added.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698421717






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-696817291


   Here is the code I used to refresh the cache in procedures:
   
   ```
   // Builds a DataSourceV2Relation for the table that was just modified and asks
   // Spark's CacheManager to re-cache every cached plan that references it.
   protected void refreshCache(Identifier ident, Table table) {
     SparkSession spark = SparkSession.active();
     CacheManager cacheManager = spark.sharedState().cacheManager();
     // `catalog` refers to the procedure's table catalog (a CatalogPlugin field)
     DataSourceV2Relation relation = DataSourceV2Relation.create(table, Option.apply(catalog), Option.apply(ident));
     cacheManager.recacheByPlan(spark, relation);
   }
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698421717


   I see merits in both approaches but the biggest concern for me is that we deviate from the way Spark handles cache for built-in tables. I am afraid this will cause a lot of confusion for the users. Moreover, our `SparkSessionCatalog` is not consistent now. It will invalidate cache for non-Iceberg tables only.
   
   > a temporary table that should reflect the state of the Dataset at the time it was created
   
   How do Hive or Presto handle temp views? My understanding was that views provide a way to refer to and reuse a logical plan, but they are stateless. @danielcweeks, what do you think about case 2? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-698436564


   It feels like the biggest issue here is that the behavior of caching is not consistent or well-defined (as far as I can tell).  I believe the operation originated from RDDs, which were intended to be immutable.  So that would imply that the cache would not change regardless of what happens externally.  The intent was that a cache is fault tolerant and will be reconstituted from the original transforms.  That would imply that Spark's behavior for built-in tables is incorrect.
   
   (You're correct, I was only talking about the first case).  
   
   Presto and Hive would treat a view as a view and always resolve back to the original table (the only case where this would not be true is in context of a transaction, but that's not implemented).
   
   The second case feels more clear to me.  A view should always reflect the current state of the underlying sources, so I would expect what you laid out in the example to be the expected behavior (though it currently doesn't behave that way).
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick edited a comment on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-699568949


   I'd also like to point out that, according to what little documentation I could find, the cached query plan _might_ not be invalidated if it's using a Hadoop / file-based catalog. Specifically, the documentation implies that there's a need to call `REFRESH` to actually invalidate path-based data. The [documentation on `REFRESH`](https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-refresh.html) has this definition:
   
   ```
   REFRESH is used to invalidate and refresh all the cached data (and the associated metadata)
   for all Datasets that contains the given data source path.
   Path matching is by prefix, i.e. “/” would invalidate everything that is cached.
   ```
   and further drives the point home that a `REFRESH` call is likely needed when things like `INSERT` are done against what I can only read as Hadoop-based (or filesystem-based) catalogs, by giving this example for path-based resources:
   
   ```
   -- The Path is resolved using the datasource's File Index.
   CREATE TABLE test(ID INT) using parquet;
   INSERT INTO test SELECT 1000;
   CACHE TABLE test;
   INSERT INTO test SELECT 100;
   REFRESH "hdfs://path/to/table";
   ```
   
   I hope that this investigation can shed further light on the intended behavior and our assumptions, as well as what needs to be done to ensure correctness, particularly when dealing with `Tables` or essentially non-temporary `Views`. I believe what I've found largely agrees with @rdblue's assertion that caching a `Dataset` is equivalent to caching a `temporary view`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-696931697


   I added one more case to consider.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1485: Reconsider caching behavior in Spark 3

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1485:
URL: https://github.com/apache/iceberg/issues/1485#issuecomment-696837537


   I actually would have assumed it was a bug if a "Cache" command was invalidated by another table operation, in my mind it should snapshot the table state at that moment. I know because the behavior is lazy in Spark your guarantees on "when" are a bit more iffy, but I think the Spark cache shouldn't be automatically invalidated.
   
   One of my main motivators here is that you could modify this table in a non-Spark framework and you wouldn't even know that happened inside Spark. For example, say I have both Presto and Spark users: why should a Spark user's actions invalidate the cache when a Presto user's would not? Now I have a belief that my actions will always invalidate the cache, but there is a set of changes that would not. I would think it's better to assume "Cache" gives you a snapshot which will not change unless specifically asked for.
   
   The second case seems more clear to me, we definitely should be refreshing there.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org