Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/11 08:58:36 UTC

[GitHub] [iceberg] pvary commented on issue #2319: Caching Tables in SparkCatalog via CachingCatalog by default leads to stale data

pvary commented on issue #2319:
URL: https://github.com/apache/iceberg/issues/2319#issuecomment-796578997


   I found the same issue and tried to start a [discussion](https://mail-archives.apache.org/mod_mbox/iceberg-dev/202102.mbox/%3c3113F592-08A2-4EEA-9F76-F3DB32870DEE@cloudera.com%3e) about it on the dev list.
   The main points are:
   - Stale data
   - The Table object is not thread-safe
   
   I also chatted a bit about it with @rdblue, and he mentioned that in Spark the CachingCatalog is also used to make sure that the same version of the table is retrieved every time during the same session. So getting back stale data is a feature, not a bug.
   
   Based on this discussion my feeling is that the best solution would be to create a metadata cache around `TableMetadataParser.read(FileIO io, InputFile file)` where the cache key is the `file.location()`.
   
   The snapshots are immutable, and my guess (no hard numbers on it yet) is that the most resource-intensive part of table creation is fetching the metadata file from S3 and parsing it, so this would help us more and allow a less complicated solution.
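   The cache described above could be sketched roughly as below. This is a hypothetical illustration, not Iceberg code: `MetadataCache` and the `loader` function are stand-ins for a wrapper around `TableMetadataParser.read(FileIO io, InputFile file)`, with `file.location()` as the key. Since metadata files are immutable once written, entries never go stale; a real implementation would still bound the cache size.

   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.function.Function;

   public class MetadataCacheSketch {
       // Hypothetical sketch: memoize parsed metadata by file location.
       // Metadata files are immutable, so an entry never needs
       // invalidation; a production cache would add a size bound.
       static class MetadataCache<M> {
           private final Map<String, M> cache = new ConcurrentHashMap<>();

           // `location` stands in for file.location(); `loader` stands in
           // for a call like TableMetadataParser.read(io, file).
           M get(String location, Function<String, M> loader) {
               return cache.computeIfAbsent(location, loader);
           }
       }

       public static void main(String[] args) {
           MetadataCache<String> cache = new MetadataCache<>();
           int[] parses = {0};
           Function<String, String> parse = loc -> {
               parses[0]++; // count how often the expensive parse runs
               return "metadata@" + loc;
           };

           cache.get("s3://bucket/table/metadata/v1.json", parse);
           cache.get("s3://bucket/table/metadata/v1.json", parse);

           // The second lookup is served from the cache.
           System.out.println("parses=" + parses[0]); // prints "parses=1"
       }
   }
   ```

   Keying on the location string also sidesteps the thread-safety concern for this layer: `computeIfAbsent` guarantees the loader runs at most once per key even under concurrent access.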


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org