You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/03 00:05:16 UTC

[GitHub] [iceberg] rdblue commented on issue #6420: Iceberg Materialized View Spec

rdblue commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546

   Thanks for writing this up, @JanKaul! It's a good idea to specify how to maintain metadata for materialized views.
   
   I think that the approach, to associate a view with some table that stores the materialized version, is a good design choice. And the metadata you have is a great start as well, although I think we can simplify or improve a couple of things.
   
   First, I think we want to avoid keeping much state information in complex table properties. Those aren't designed for the purpose and make the table a bit difficult to use. What I recommend instead reusing the existing snapshot metadata structure to store what you need as snapshot properties. This approach has some really nice features in addition to being a bit simpler.
   
   Each materialized view version is going to be stored in a snapshot, so I think it makes sense to take your idea of a "refresh" and simply store that metadata in snapshot properties. Then we don't need a "current" refresh ID, we can just reuse the current snapshot. Similarly, we wouldn't need a new ID, we could just use the snapshot ID, and the sequence number is automatically associated.
   
   The metadata in snapshot properties would be very similar, but much smaller:
   | v1 | Snapshot property | Description
   --|--|--
   _required_ | `view_version_id` | version ID of the view that was materialized
   _required_ | `table.<identifier>` | table UUID and snapshot ID for the table identified by <identifier> that was read
   
   In the table, I've also cut out a few of the base table properties...
   * Rather than `type`, just rely on everything being an Iceberg table upstream. _We may not want to do this, but it makes everything simple_
   * Rather than having a type for Hadoop vs Metastore tables, this makes no distinction. We should not design much for Hadoop tables because they are not recommended.
   * Removed properties. We can include a catalog name in the table identifier, and adding the table UUID ensures that we always use the same upstream table (or have to recompute a `full` refresh).
   
   The nice thing about keeping upstream table UUIDs and snapshot IDs in the snapshot metadata is that it allows us to roll back the state of the view along with the upstream tables. For example, if we have an hourly job that produces bad data and an agg MV based on it, it is possible to roll back both the table and the MV to the matching state. We can also do incremental refresh based on the closest materialized snapshot, not just the latest.
   
   I think we would still want some MV metadata in table properties:
   
   | v1 | Table property | Description
   |--|--|--|
   _required_ | `materialized_view_format_version` | The MV spec version used
   _required_ | `view_identifier` | Identifier for the view that is materialized
   _optional_ | `refresh_strategy` | `full` or `incremental`, default: `full`
   
   We may want additional metadata as well, like a UUID to ensure we have the right view. I don't think we have a UUID in the view spec yet, but we could add one.
   
   I also moved the refresh strategy from the view to the MV table. I think we want to keep as much config on the table as possible, if it may differ between views. I could imagine a case where you might keep both incremental and full materialized versions or might want to have different partitioning specs for materialization, in which case you'd want that set on the table. I think the only thing I'd add to the view itself is the identifier for a materialized table.
   
   The last thing I think we may want to change is to add a section of the proposal for view invalidation. Your property to allow stale data is on the right track, but we can actually detect when a table has not been updated in a way that affects the view query in a lot of cases. For example, if you plan the final query and get input splits for the tables in the view, you can check whether the input is based on a snapshot newer than the MV's base snapshot. If it isn't, then it is safe to use the materialized version. This is a little tricky since you have to account for whether files matching the final filter were deleted, but it should be entirely a metadata operation. I think it would be great to document this as part of a spec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org