Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/13 14:32:15 UTC

[GitHub] [iceberg] JanKaul opened a new issue, #6420: Iceberg Materialized View Spec

JanKaul opened a new issue, #6420:
URL: https://github.com/apache/iceberg/issues/6420

   ### Feature Request / Improvement
   
   # Iceberg Materialized View Spec
   
   ## Background and Motivation
   
   A materialized view precomputes the results of a query and exposes them as a
   logical table. When queried, the materialized view serves the precomputed
   results, reducing query latency. The cost of query execution is pushed to the
   precomputation step and amortized over subsequent queries.
   
   The big open-source query engines [Trino](https://trino.io/) and
   [Spark](https://spark.apache.org/) have either recently added
   ([link](https://trino.io/docs/current/connector/iceberg.html#materialized-views))
   or are in the process of adding materialized views. Currently the materialized
   views are implemented as an
   [iceberg view](https://iceberg.apache.org/view-spec/) with an underlying storage
   table. The metadata required for view maintenance is stored as a property of the
   underlying storage table.
   
   The iceberg table format is becoming an important building block in modern data
   lakes and lakehouses. In addition to open-source query engines, support from
   commercial cloud data warehouses like Snowflake, BigQuery and Dremio is
   available or underway. Iceberg therefore plays a crucial role in enabling data
   federation between different data lakes and warehouses.
   
   ## Current limitations
   
   1. No formal specification
   
   Materialized views currently lack an open, accessible definition of the
   format. This makes it difficult to implement iceberg materialized views for new
   query-engines and consequently hinders adoption.
   
   2. No process for evolution
   
   Without a formal specification it is difficult to manage the evolution of the
   format across different query-engines. There is no central place where requests
   can be brought forward. A specification can help with maintaining backward
   compatibility.
   
   3. Catalog entries for view and storage table
   
   When using a common view and a storage table to implement materialized views,
   you can either show or hide the storage table in the catalog. If the storage
   table is visible in the catalog there is a view and a table entry where
   logically there would be only one entry for the materialized view. If the
   storage table is not made visible in the catalog, it is difficult to assure
   atomic commits to the storage table.
   
   4. Limited configuration
   
   Generally one can imagine different configurations for the materialized views.
   These include "freshness" guarantees for serving data, update strategies, the
   storage table format and partitioning. It would be beneficial to make the
   configuration explicit by including it in the format specification instead of
   including it as part of view or table properties.
   
   ## Goal
   
   A common metadata format for materialized views enabling materialized views to
   be created, read and updated by different query engines.
   
   ## Overview
   
   MV (Materialized view) metadata storage mirrors how Iceberg table and view
   metadata is stored and retrieved. MV metadata is maintained in metadata files.
   All changes to the MV state create a new MV metadata file and completely replace
   the old metadata using an atomic swap. Like Iceberg tables and views, this
   atomic swap is delegated to the metastore that tracks tables, views and/or
   materialized views by name. The MV metadata file tracks the schema, partitioning
   config, snapshots, custom properties, current and past versions, as well as
   other metadata. A snapshot represents the precomputed state of a MV at some time
   and is used to access the complete set of data files in the MV. Similar to
   tables, the data files associated with a MV snapshot are tracked by manifest
   files.
   
   ### Metadata Location
   
   An atomic swap of one MV (Materialized view) metadata file for another provides
   the basis for making atomic changes. Readers use the version of the MV that was
   current when they loaded the MV metadata and are not affected by changes until
   they refresh and pick up a new metadata location.
   
   Writers create MV metadata files optimistically, assuming that the current
   metadata location will not be changed before the writer’s commit. Once a writer
   has created an update, it commits by swapping the MV's metadata file pointer
   from the base location to the new location.
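
   The optimistic commit described above can be sketched as a compare-and-swap
   on the metadata pointer. This is only an illustration: `InMemoryCatalog`,
   `CommitFailedError` and `commit_with_retry` are hypothetical names, and a
   real metastore would provide the atomic swap itself.

```python
import threading


class CommitFailedError(Exception):
    """Raised when the metadata pointer moved since the writer loaded it."""


class InMemoryCatalog:
    """Hypothetical stand-in for a metastore mapping MV names to metadata locations."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current_location(self, name):
        return self._pointers.get(name)

    def swap(self, name, base_location, new_location):
        # Compare-and-swap: succeed only if nobody committed since `base_location` was read.
        with self._lock:
            if self._pointers.get(name) != base_location:
                raise CommitFailedError(f"{name}: pointer is no longer {base_location!r}")
            self._pointers[name] = new_location


def commit_with_retry(catalog, name, write_metadata, max_retries=3):
    """Write a new metadata file optimistically, then swap the pointer; retry on conflict."""
    for attempt in range(max_retries + 1):
        base = catalog.current_location(name)
        # The caller writes a complete new metadata file derived from `base`.
        new_location = write_metadata(base, attempt)
        try:
            catalog.swap(name, base, new_location)
            return new_location
        except CommitFailedError:
            continue  # re-read the pointer and rebuild the update on top of it
    raise CommitFailedError(f"{name}: gave up after {max_retries} retries")
```

   Readers that loaded the old location keep using it until they refresh, which
   is why the swap is the only step that has to be atomic.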
   
   ## Specification (DRAFT!)
   
   ### Terms
   
   - **Schema** -- Names and types of fields in a materialized view.
   - **Version** -- The state of a materialized view at some point in time.
   - **Partition spec** -- A definition of how partition values are derived from
     data fields.
   - **Snapshot** -- The state of a materialized view at some point in time,
     including the set of all data files.
   - **Manifest list** -- A file that lists manifest files; one per snapshot.
   - **Manifest** -- A file that lists data or delete files; a subset of a
     snapshot.
   - **Data file** -- A file that contains rows of a materialized view.
   - **Delete file** -- A file that encodes rows of a table that are deleted by
     position or data values.
   
   ### Materialized View Metadata
   
   The materialized view metadata fields are a superset of the required fields of
   the v2 table metadata and the v1 view metadata:
   
   | v1         | Field Name                  | Description                                                                                                                                                                                                                                                                                                                                                                                                  |
   | ---------- | --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
   | _required_ | **`format-version`**        | An integer version number for the materialized view format. Currently, this must be 1. Implementations must throw an exception if the materialized view's version is higher than the supported version.                                                                                                                                                                                                      |
   | _required_ | **`location`**              | The materialized view's base location. This is used by writers to determine where to store data files, manifest files, and materialized view metadata files.                                                                                                                                                                                                                                                 |
   | _required_ | **`uuid`**                  | A UUID that identifies the materialized view, generated when the materialized view is created. Implementations must throw an exception if a materialized view's UUID does not match the expected UUID after refreshing metadata.                                                                                                                                                                             |
   | _required_ | **`versions`**              | An array of structs describing the known versions of the materialized view. The number of versions to retain is controlled by the table property: “version.history.num-entries”. See section [Versions](#versions).                                                                                                                                                                                          |
   | _required_ | **`current_version_id`**    | Current version of the materialized view. Set to ‘1’ when the view is first created.                                                                                                                                                                                                                                                                                                                         |
   | _required_ | **`version_log`**           | A list of timestamp and version ID pairs that encodes changes to the current version for the materialized view. Each time the current-version-id is changed, a new entry should be added with the last-updated-ms and the new current-version-id.                                                                                                                                                            |
   | _required_ | **`last-sequence-number`**  | The materialized view's highest assigned sequence number, a monotonically increasing long that tracks the order of snapshots in a materialized view.                                                                                                                                                                                                                                                         |
   | _required_ | **`last-updated-ms`**       | Timestamp in milliseconds from the unix epoch when the materialized view was last updated. Each materialized view metadata file should update this field just before writing.                                                                                                                                                                                                                                |
   | _required_ | **`last-column-id`**        | An integer; the highest assigned column ID for the materialized view. This is used to ensure columns are always assigned an unused ID when evolving schemas.                                                                                                                                                                                                                                                 |
   | _required_ | **`schemas`**               | A list of schemas, stored as objects with `schema-id`.                                                                                                                                                                                                                                                                                                                                                       |
   | _required_ | **`current-schema-id`**     | ID of the materialized view's current schema.                                                                                                                                                                                                                                                                                                                                                                |
   | _required_ | **`partition-specs`**       | A list of partition specs, stored as full partition spec objects.                                                                                                                                                                                                                                                                                                                                            |
   | _required_ | **`default-spec-id`**       | ID of the "current" spec that writers should use by default.                                                                                                                                                                                                                                                                                                                                                 |
   | _required_ | **`last-partition-id`**     | An integer; the highest assigned partition field ID across all partition specs for the materialized view. This is used to ensure partition fields are always assigned an unused ID when evolving specs.                                                                                                                                                                                                      |
   | _required_ | **`sort-orders`**           | A list of sort orders, stored as full sort order objects.                                                                                                                                                                                                                                                                                                                                                    |
   | _required_ | **`default-sort-order-id`** | Default sort order id of the materialized view. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files.                                                                                                                                                                                                                                  |
   | _required_ | **`refreshes`**             | A list of refresh operations.                                                                                                                                                                                                                                                                                                                                                                                |
   | _required_ | **`current-refresh-id`**    | Id of the last refresh operation that defines the current state of the data files.                                                                                                                                                                                                                                                                                                                           |
   | _optional_ | **`snapshots`**             | A list of valid snapshots. Valid snapshots are snapshots for which all data files exist in the file system. A data file must not be deleted from the file system until the last snapshot in which it was listed is garbage collected.                                                                                                                                                                        |
   | _optional_ | **`current-snapshot-id`**   | `long` ID of the current materialized view snapshot; must be the same as the current ID of the `main` branch in `refs`.                                                                                                                                                                                                                                                                                      |
   | _optional_ | **`snapshot-log`**          | A list (optional) of timestamp and snapshot ID pairs that encodes changes to the current snapshot for the materialized view. Each time the current-snapshot-id is changed, a new entry should be added with the last-updated-ms and the new current-snapshot-id. When snapshots are expired from the list of valid snapshots, all entries before a snapshot that has expired should be removed.              |
   | _optional_ | **`properties`**            | A string to string map of materialized view properties. This is used to control settings that affect reading and writing and is not intended to be used for arbitrary metadata. For example, `commit.retry.num-retries` is used to control the number of commit retries.                                                                                                                                     |
   | _optional_ | **`metadata-log`**          | A list (optional) of timestamp and metadata file location pairs that encodes changes to the previous metadata files for the materialized view. Each time a new metadata file is created, a new entry of the previous metadata file location should be added to the list. Tables can be configured to remove oldest metadata log entries and keep a fixed-size log of the most recent entries after a commit. |
   | _optional_ | **`refs`**                  | A map of snapshot references. The map keys are the unique snapshot reference names in the materialized view, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null.                                                                                                                              |
   | _optional_ | **`statistics`**            | A list (optional) of [materialized view statistics](https://iceberg.apache.org/spec/#table-statistics).                                                                                                                                                                                                                                                                                                      |
   |            |                             |                                                                                                                                                                                                                                                                                                                                                                                                              |
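
   As a rough illustration of how a reader might enforce the _required_ rows of
   the table above, the sketch below checks a parsed metadata document. The
   field names are copied from the draft table as written; the function name
   `validate_mv_metadata` and the use of `ValueError` are assumptions, not part
   of the proposal.

```python
SUPPORTED_FORMAT_VERSION = 1

# Required fields, copied verbatim from the draft table (including its mixed
# use of hyphens and underscores).
REQUIRED_FIELDS = [
    "format-version", "location", "uuid", "versions", "current_version_id",
    "version_log", "last-sequence-number", "last-updated-ms", "last-column-id",
    "schemas", "current-schema-id", "partition-specs", "default-spec-id",
    "last-partition-id", "sort-orders", "default-sort-order-id", "refreshes",
    "current-refresh-id",
]


def validate_mv_metadata(metadata, expected_uuid=None):
    """Check a parsed MV metadata dict against the draft's required fields."""
    missing = [name for name in REQUIRED_FIELDS if name not in metadata]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    # The draft requires an exception when the version is higher than supported.
    version = metadata["format-version"]
    if version > SUPPORTED_FORMAT_VERSION:
        raise ValueError(f"unsupported format-version: {version}")

    # The draft requires an exception when the UUID changes across a refresh.
    if expected_uuid is not None and metadata["uuid"] != expected_uuid:
        raise ValueError("materialized view UUID does not match the expected UUID")

    return metadata
```

   Optional fields such as `snapshots` or `refs` are deliberately left
   unchecked here.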
   
   ### Refreshes
   
   Refresh information is stored as a list of `refresh operation` records. Each
   `refresh operation` has the following structure:
   
   | v1         | Field Name            | Description                                                                   |
   | ---------- | --------------------- | ----------------------------------------------------------------------------- |
   | _required_ | **`refresh-id`**      | ID of the refresh operation when the materialized view is refreshed.          |
   | _required_ | **`version-id`**      | Version id of the materialized view when the refresh operation was performed. |
   | _required_ | **`base-tables`**     | A map of strings (table identifiers) to `base-table` records.                 |
   | _optional_ | **`sequence-number`** | Sequence number of the snapshot that contains the refreshed data files.       |
   
   Refreshes could be handled in different ways. For a normal execution the refresh
   list could consist of only one entry, which gets overwritten on every refresh
   operation. If "time travel" is enabled for the materialized view, a new
   `refresh operation` record can be inserted on every refresh. Together with the
   `sequence-number` field, this could be used to track the evolution of data files
   over the refresh history.
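
   The two ways of handling the refresh list could look roughly like this. It
   is a minimal sketch over plain dicts; the function name `record_refresh` and
   the `time_travel` flag are hypothetical and only mirror the two behaviours
   described above.

```python
def record_refresh(metadata, refresh, time_travel=False):
    """Return a copy of `metadata` with `refresh` recorded as the current refresh."""
    updated = dict(metadata)
    if time_travel:
        # Keep the history: append a new `refresh operation` record.
        updated["refreshes"] = list(metadata.get("refreshes", [])) + [refresh]
    else:
        # Normal execution: the single entry is overwritten on every refresh.
        updated["refreshes"] = [refresh]
    updated["current-refresh-id"] = refresh["refresh-id"]
    return updated
```

   Returning a new dict rather than mutating the input matches the rule that
   every change produces a new metadata file.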
   
   ### Base table
   
   A `base table` record takes different forms, distinguished by the common field
   `type`. The remaining fields may differ between forms.
   
   #### Iceberg-Metastore
   
   | v1         | Field Name               | Description                                                                                      |
   | ---------- | ------------------------ | ------------------------------------------------------------------------------------------------ |
   | _required_ | **`type`**               | type="iceberg-metastore"                                                                         |
   | _required_ | **`identifier`**         | Identifier of the base table in the metastore.                                                   |
   | _required_ | **`snapshot-reference`** | Snapshot id of the base table when the refresh operation was performed.                          |
   | _optional_ | **`properties`**         | A string to string map of base table properties. Could be used to specify a different metastore. |
   
   #### Iceberg-FileSystem
   
   | v1         | Field Name               | Description                                                                                    |
   | ---------- | ------------------------ | ---------------------------------------------------------------------------------------------- |
   | _required_ | **`type`**               | type="iceberg-filesystem"                                                                      |
   | _required_ | **`identifier`**         | Path to the directory of the base table.                                                       |
   | _required_ | **`snapshot-reference`** | Version of the base table when the refresh operation was performed.                            |
   | _optional_ | **`properties`**         | A string to string map of base table properties. Could be used for a different storage system. |
   
   #### DeltaLake-FileSystem (optional)
   
   | v1         | Field Name               | Description                                                                                    |
   | ---------- | ------------------------ | ---------------------------------------------------------------------------------------------- |
   | _required_ | **`type`**               | type="deltalake-filesystem"                                                                    |
   | _required_ | **`identifier`**         | Path to the directory of the base table.                                                       |
   | _required_ | **`snapshot-reference`** | Delta table version of the base table when the refresh operation was performed.                |
   | _optional_ | **`properties`**         | A string to string map of base table properties. Could be used for a different storage system. |
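
   To illustrate how the `type` field could drive parsing, here is a small
   sketch that reads `base-table` records from a refresh operation and reports
   which base tables have moved on since the refresh. Treating a differing
   `snapshot-reference` as "stale" is an assumption made for the example; the
   draft does not define a freshness rule. `BaseTable`, `parse_base_table` and
   `stale_base_tables` are hypothetical names.

```python
from dataclasses import dataclass, field

KNOWN_TYPES = {"iceberg-metastore", "iceberg-filesystem", "deltalake-filesystem"}


@dataclass(frozen=True)
class BaseTable:
    type: str
    identifier: str
    # Snapshot id or table version, depending on `type`; int is assumed here.
    snapshot_reference: int
    properties: dict = field(default_factory=dict)


def parse_base_table(record):
    """Build a BaseTable from one `base-table` record, dispatching on `type`."""
    table_type = record["type"]
    if table_type not in KNOWN_TYPES:
        raise ValueError(f"unknown base table type: {table_type!r}")
    return BaseTable(
        type=table_type,
        identifier=record["identifier"],
        snapshot_reference=record["snapshot-reference"],
        properties=dict(record.get("properties", {})),
    )


def stale_base_tables(refresh, current_references):
    """Return the keys of base tables whose current reference differs from the recorded one."""
    stale = []
    for key, record in refresh["base-tables"].items():
        table = parse_base_table(record)
        if current_references.get(key) != table.snapshot_reference:
            stale.append(key)
    return sorted(stale)
```

   An engine could use an empty result as a signal that the precomputed data
   still reflects all base tables.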
   
   ### [Snapshots](https://iceberg.apache.org/spec/#snapshots)
   
   ### [Versions](https://iceberg.apache.org/view-spec/#versions)
   
   ### [Version Log](https://iceberg.apache.org/view-spec/#version-log)
   
   ### [Schemas](https://iceberg.apache.org/spec/#schemas-and-data-types)
   
   ### [Partition Spec](https://iceberg.apache.org/spec/#partitioning)
   
   ### [Sort Order](https://iceberg.apache.org/spec/#sorting)
   
   
   ### Query engine
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1397711644

   > Not sure if there is a strong use case for multiple tables for the same view version.
   
   I am thinking about the case where based on the predicate operating on the view, we can choose intelligently what storage table to use.




Re: [I] Iceberg Materialized View Spec [iceberg]

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1786737489

   You are right, I was confused because the REST catalog stores the metadata internally and I thought this would include the storage table. But it would be possible to store the metadata for the view in the REST catalog and still store the metadata for the storage table in a `metadata.json` file. This way the metadata-location could still be used as a pointer to the storage table. Thanks for thinking this through again.
   
   This leaves us again with the question of what to use for the storage table pointer.




[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1377660718

   Thanks, added some comments there to kick start discussion




[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1426863641

   It would be great if we could make it a catalog-specific decision. But for that the metadata has to be designed to enable both strategies.
   
   I think the question is: what do we use as a unique identifier for the storage table in the representation of the common view?
   
   One approach is to use **table_name, namespace and catalog** as the unique identifier. But this only works if the storage table is registered in the catalog. Furthermore, without being registered in the catalog, an atomic swap of the storage table metadata file cannot be guaranteed.
   
   Another approach would be to use the **storage table metadata file location** as the unique identifier. This makes using the catalog a bit more awkward because the tables have to be filtered for the metadata file location. But this approach wouldn't require the storage table to be registered in the catalog. And it would enable atomic transactions on the storage table by making atomic changes to the storage table metadata file location. This is what I meant by commit procedure.




Re: [I] Iceberg Materialized View Spec [iceberg]

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1801212017

   Thanks for your input. The discussion has moved to the Google doc (https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing). It would be great if you could add your comment there. Then it will have a higher visibility.




[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1397757916

   I am referring to the **view spec**, using the example here: https://iceberg.apache.org/view-spec/#appendix-a-an-example
   
   So in design 1 where we say we want to have a pointer from view to storage table, the pointer is a new type of representation `materialized`.
   
   We already support multiple representations in the view spec, but today a view can only have multiple SQL representations.
   
   By adding this new type, a view can be backed by multiple storage tables, simply by adding more `materialized` representations.
   
   The use case that you mentioned, a table having multiple storage layouts, could also be modeled as a view with multiple `materialized` representations and no SQL representation.
   
   It's just a thought, not fully fleshed out.




[GitHub] [iceberg] wmoustafa commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1397595262

   Agreed. A reverse pointer to the view will be hard to maintain, so I am inclined not to have it.
   
   I would say each view version could optionally map to a new storage table (say the view evolved in a backward incompatible way). Not sure if there is a strong use case for multiple tables for the same view version.




[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1398572156

   I would +1 on storing it in the snapshot summary, because:
   1. A snapshot corresponds very well to an MV refresh; there is a 1:1 relationship between them.
   2. Table properties are not versioned the way snapshots are; you cannot easily access previous table properties of a storage table.
   
   I also agree with Ryan's suggestion about the information to store, although I think storing the referenced source tables might be a bit hard, since it requires statement analysis of the view.
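   
   A minimal sketch of the two points above, not a definitive implementation: each refresh commits one storage-table snapshot, and the source snapshot ids are written into that snapshot's summary, so earlier refresh states remain readable. The `Snapshot` class is a simplified stand-in rather than a real library class, and the `refresh.source.` property prefix is a hypothetical name chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Simplified stand-in for a storage-table snapshot (not a real library class)."""
    snapshot_id: int
    summary: dict[str, str] = field(default_factory=dict)


# Hypothetical property prefix, used only in this illustration.
SOURCE_PREFIX = "refresh.source."


def commit_refresh(history: list[Snapshot], snapshot_id: int,
                   source_snapshots: dict[str, int]) -> Snapshot:
    """Append one snapshot per refresh and record the source snapshot ids in its summary."""
    summary = {"operation": "overwrite"}
    for table_uuid, source_id in source_snapshots.items():
        # Summary values are strings, so the snapshot id is stored as text.
        summary[SOURCE_PREFIX + table_uuid] = str(source_id)
    snapshot = Snapshot(snapshot_id, summary)
    history.append(snapshot)
    return snapshot


def source_snapshots_of(snapshot: Snapshot) -> dict[str, int]:
    """Read the recorded source snapshot ids back from a snapshot summary."""
    return {
        key[len(SOURCE_PREFIX):]: int(value)
        for key, value in snapshot.summary.items()
        if key.startswith(SOURCE_PREFIX)
    }


# Two refreshes produce two snapshots, each with its own lineage.
history: list[Snapshot] = []
commit_refresh(history, 1, {"uuid-a": 100})
commit_refresh(history, 2, {"uuid-a": 101})
```

   Because the lineage travels with each snapshot, `source_snapshots_of(history[0])` still returns the state of the first refresh after the second one has been committed. A single mutable table property would have been overwritten.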




Re: [I] Iceberg Materialized View Spec [iceberg]

Posted by "liurenjie1024 (via GitHub)" <gi...@apache.org>.
liurenjie1024 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1773609261

   Any update on this? 




Re: [I] [Proposal] Iceberg Materialized View Spec [iceberg]

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1913116304

   Hi @szehon-ho, thanks for trying to move the process of reaching consensus along. To be honest, I don't know how the community normally reaches consensus on these kinds of topics. But I still have the feeling that we are lacking feedback from some key stakeholders. Until now I was holding off on the PRs to get more feedback. But it seems like creating the PRs is the right way to move the proposal along.
   
   In any case it might be good to bring this up at a community sync.
   
   I would argue for 2 PRs, one for the view metadata and one for the table metadata.
   
   #### Regarding the open questions
   
   1. Question: I agree that we mostly have a consensus there
   2. Question: I get the impression from the google doc that people would prefer Option 1, also you voted for 1
   3. Question: Because of its versatility I would really argue for Option 3
   
   #### Regarding your 2cents
   
   1. I totally agree, just using the storage table pointer is a cleaner solution.
   2. The important information here is the snapshot-ids of the source tables (base tables) corresponding to the last refresh operation of the materialized view. The materialized view requires this information to determine whether the precomputed data is still fresh. At the end of a refresh operation, the materialized view stores the snapshot-ids of its source tables as its lineage information. Later on, it can check whether those snapshot-ids are still equal to the current snapshot-ids of the source tables. If they're not equal, it knows the data changed and its precomputed data needs to be updated.
   3. As stated in the last point of https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.m5kli4l5q7ui, we agreed to leave the refresh strategy to the query engine. If available, an incremental refresh is always preferable to a full refresh.
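   
   The freshness check from point 2 can be sketched as follows. This is a minimal illustration, assuming the lineage is available as a plain mapping from source table identifier to snapshot id; the function names and the example tables are hypothetical.

```python
def is_fresh(last_refresh: dict[str, int], current: dict[str, int]) -> bool:
    """Return True if every source table is still at the snapshot seen by the last refresh."""
    return all(
        current.get(table) == snapshot_id
        for table, snapshot_id in last_refresh.items()
    )


def stale_tables(last_refresh: dict[str, int], current: dict[str, int]) -> list[str]:
    """List the source tables whose current snapshot differs from the recorded one."""
    return sorted(
        table
        for table, snapshot_id in last_refresh.items()
        if current.get(table) != snapshot_id
    )


# Lineage stored at the end of the last refresh (hypothetical tables and ids).
lineage = {"orders": 10, "customers": 7}

# Current snapshot ids of the source tables; "orders" has changed since the refresh.
now = {"orders": 11, "customers": 7}
```

   Here `is_fresh(lineage, now)` is false and `stale_tables(lineage, now)` names `orders`. A source table that has been dropped has no current snapshot id and is therefore also reported as stale. Whether the engine then runs an incremental or a full refresh is left to the engine, as noted in point 3.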




[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1397751094

   I don't know whether this would work or is too crazy, but I just want to throw out an idea I came up with:
   
   We could potentially make the MV a representation in the view spec, in parallel to the SQL representation. A materialized view could then have 2 representations:
   
   ```
   [
     {
       "type" : "sql",
       "sql" : "SELECT \"count\"(*) my_cnt\nFROM\n  base_tab\n",
       "dialect" : "spark",
       "schema-id" : 2,
       "default-catalog" : "iceberg",
       "default-namespace" : [ "anorwood" ]
     },
     {
       "type" : "materialized",
       "namespace" : "some_namespace",
       "table" : "some_storage_table"
     }
   ]
   ```
   
   By doing so, you could support a case where 1 view does not really have a SQL representation, but just a few different table layouts:
   
   ```
   [
     {
       "type" : "materialized",
       "namespace" : "some_namespace",
       "table" : "some_storage_table_partitioned_by_col1"
     },
     {
       "type" : "materialized",
       "namespace" : "some_namespace",
       "table" : "some_storage_table_partitioned_by_col2"
     }
   ]
   ```
   
   which satisfies the use case described by Walaa.
   
   
   
   
   




[GitHub] [iceberg] wmoustafa commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1397753231

   To clarify, I was saying that multiple representations are outside the scope of MVs and could be part of the standard table spec. I am not sure whether the proposal above is along the same lines (I am a bit confused, since I see `materialized` references).




[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1397302003

   > while underlying table snapshot information are stored in storage table snapshot properties 
   
   +1, I think we are on the same page on this.
   
   In my view it's still design 1. To me, the difference between designs 1 and 2 is what you read as the entry point: is it the view, or is it the storage table? To me it's always the view. Maybe that's the point of confusion that I totally missed.
   
   It definitely makes sense to store the information about each refresh in the storage table, attached to the corresponding snapshot.




Re: [I] Iceberg Materialized View Spec [iceberg]

Posted by "wmoustafa (via GitHub)" <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1801155675

   > This leaves us again with the question of what to use for the storage table pointer.
   
   I think table UUID or location is fine. Same should apply when referencing the base tables and their respective snapshot IDs. Maybe for base tables, UUID makes more sense.




Re: [I] [Proposal] Iceberg Materialized View Spec [iceberg]

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1949286590

   I do realize we are still ironing out the details, but as a data point: after a quick sanity check with @marton-bod of Trino, it looks like the metadata this proposal would add is in line with what has been added to Trino's native implementation of Iceberg materialized views.




[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1427613589

   I'm not entirely sure if I understand you correctly. I would expect a query engine catalog to proceed according to the following procedure:
   
   1. Find view metadata
   
   Find view metadata according to namespace and view name
   
   2. Read View metadata
   
   If the query engine doesn't support materialized views, execute the view query.
   
   If the query engine supports materialized views, use the storage table pointer (table_name+namespace, metadata location, UUID) to access the storage table metadata.
   
   3. Read storage table metadata
   
   Use the storage table metadata to execute the TableScan operator.
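   
   The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not an actual catalog API: the catalog is modeled as two plain dictionaries, and `ViewMetadata`, its `storage_table` field, and `plan_read` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ViewMetadata:
    """Simplified view metadata; `storage_table` is a hypothetical pointer field."""
    sql: str
    storage_table: Optional[str] = None


def plan_read(views: dict[str, ViewMetadata],
              tables: dict[str, str],
              identifier: str,
              engine_supports_mv: bool) -> tuple[str, str]:
    """Decide how to serve a read of `identifier`, following steps 1 to 3."""
    # Step 1: find the view metadata by namespace and view name.
    view = views[identifier]

    # Step 2: read the view metadata and decide whether to use the storage table.
    if not engine_supports_mv or view.storage_table is None:
        return ("execute_view_query", view.sql)

    # Step 3: follow the pointer to the storage table metadata and scan it.
    table_metadata = tables[view.storage_table]
    return ("table_scan", table_metadata)


# Hypothetical catalog contents for illustration.
views = {"db.daily_counts": ViewMetadata("SELECT count(*) FROM base_tab",
                                         storage_table="db.daily_counts_storage")}
tables = {"db.daily_counts_storage": "metadata-for-storage-table"}
```

   In this sketch the view is always the entry point. An engine without materialized view support never touches the storage table and simply falls back to the view query.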




[GitHub] [iceberg] wmoustafa commented on issue #6420: Iceberg Materialized View Spec

Posted by "wmoustafa (via GitHub)" <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1427091999

   I think using the user-facing table namespace as an identifier is catalog-specific, so I do not think it can make it into the spec. Maybe we should answer:
   * Once an engine decides it should use the storage table version of the view, how does it resolve the view identifier to perform a storage table scan instead of expanding the view text?
   
   I think this boils down to engines being aware that views are connected to storage tables, and navigating to the table scan APIs from a view entry point. I think using the table storage location (or table UUID) in the view still serves this requirement.
   
   > because the tables have to be filtered for the metadata file location.
   
   Could you clarify this a bit more?




[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1397987523

   @JanKaul, if you agree with the summarized consensus we have mostly reached there, could you update the Google doc with the design, to keep the discussion moving forward? Please describe the spec changes based on the discussions and suggestions here, combined with your original idea, so that we can start reviewing the spec contents 💪




Re: [I] Iceberg Materialized View Spec [iceberg]

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1773687907

   I think we almost have a consensus about the Materialized View Spec. I brought up one issue with the **pointer to the storage table** that is holding up the progress at the moment. The issue is whether to use the table identifier (catalog, namespace, name) or the table metadata location and therefore whether to register the storage table in the catalog.
   
   I've come to realize that using the **table identifier** is the only solution that works for all iceberg catalogs because using the table metadata location doesn't work with the REST catalog.
   
   So in my opinion we can settle this issue and use the table identifier as the pointer to the storage table. We would therefore need to register the storage table in the catalog.
   
   With this we should have a consensus on the Spec. I will update the documents and then everyone can have another look.




Re: [I] [Proposal] Iceberg Materialized View Spec [iceberg]

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1919922742

   Thanks, it's a lot clearer now. I guess we still have some open questions that came up, which we can discuss there. I will try to get some other folks to look at it there as well.




[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1397742071

   > Generically speaking, a table (MV or not), identified by a UUID, could have multiple storage layouts, and execution engines can choose the best storage layout.
   
   That's correct, but technically it could be argued that we can model everything as a view, backed by multiple storage tables, and we don't need to add more complexity into the table spec to support multiple storage layouts.




[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1426082854

   @wmoustafa @JanKaul 
   
   I had some brief discussion about this topic in the community sync this Wednesday, and looking at this thread, I think there is already a consensus about the MV spec. Can we proceed to the actual spec change? Do you have the bandwidth to do that? If not I am also interested in making that change.




[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1376812640

   Here is the link to the google doc: [https://docs.google.com/document/d/e/2PACX-1vQHLNmp_hg4lpPT8Z_J0IpbmpRsbio_gh5yAeFVIgtYNd47Mc-4JJtCx06ULVc1VRhigXWpWsL5myvI/pub](https://docs.google.com/document/d/e/2PACX-1vQHLNmp_hg4lpPT8Z_J0IpbmpRsbio_gh5yAeFVIgtYNd47Mc-4JJtCx06ULVc1VRhigXWpWsL5myvI/pub)




Re: [I] Iceberg Materialized View Spec [iceberg]

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1776642611

   @wmoustafa I hope you are okay with using the table identifier as the storage table pointer instead of using the metadata location. But I don't see a way to use the metadata location with the REST catalog. And ultimately we require a robust solution that works across all catalogs.




Re: [I] [Proposal] Iceberg Materialized View Spec [iceberg]

Posted by "wmoustafa (via GitHub)" <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1993743018

   Linking the implementation PR: https://github.com/apache/iceberg/pull/9830 for one of the options. It uses the option of view + separate Iceberg table linked from the view metadata. Storage table is represented by its catalog identifier and managed independently by the catalog.




Re: [I] [Proposal] Iceberg Materialized View Spec [iceberg]

Posted by "manuzhang (via GitHub)" <gi...@apache.org>.
manuzhang commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1996303300

   Thanks @wmoustafa. It does show the simplicity of this option. Would you mind rebasing on the latest main branch so that people can try it out?




[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1369499741

   Thank you @rdblue for your detailed comments. I really like your idea of storing the refresh metadata in the snapshot. It really simplifies the design. I have a couple of questions to better understand your proposal.
   
   1. To make sure I understood you correctly: you prefer "Design 2: Table + attached common view"?
   
   2. How exactly would you store the metadata in the snapshot? Would you store the metadata as properties of the `summary` field?
   
   3. How are multiple upstream table references stored in the metadata? Is the second entry (`table.<identifier>`) a list of table UUIDs and snapshot IDs? I ask because the materialized view definition can generally reference multiple upstream tables.
   
   Regarding the `type` field of the `base_table` record: I thought it might make sense to allow future extensions. But you are right that supporting only "iceberg-metastore" tables would allow us to simplify the design further.
   




[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1381377085

   As @jackye1995 said, the first thing we need to do is to decide on a general design. I therefore simplified the section on the design comparison in the google docs. Please feel free to add any points to the pros and cons.




[GitHub] [iceberg] wmoustafa commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1397732221

   > I am thinking about the case where based on the predicate operating on the view, we can choose intelligently what storage table to use.
   
   I think this is potentially a generic use case for all tables, not only MV storage tables. Generically speaking, a table (MV or not), identified by a UUID, could have multiple storage layouts, and execution engines can choose the best storage layout. MV still points to a single UUID.




[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1457810547

   I have an idea for how we could define a general specification for the "storage table pointer" that would allow us to decide on a case-by-case basis whether to register the storage table in the catalog.
   
   We could use a storage table pointer of the form:
   
   `protocol` `:` `identifier`
   
   For example:
   
   - `catalog:public.table1` for a storage table that is registered in a catalog
   - `uuid:df838b92-0b32-465d-a44e-d39936e538b7` for referencing the storage table by uuid
   - `file:/home/iceberg/warehouse/nyc/taxis/metadata/00000-8a62c37d-4573-4021-952a-c0baef7d21d0.metadata.json` for referencing the storage table by metadata location
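   
   A minimal sketch of how such a pointer could be parsed, assuming the three protocols from the examples above are the only ones allowed; `StorageTablePointer` and `parse_pointer` are hypothetical names, not part of any existing API.

```python
from dataclasses import dataclass

# Protocols taken from the examples above; treating this set as closed is an assumption.
KNOWN_PROTOCOLS = {"catalog", "uuid", "file"}


@dataclass(frozen=True)
class StorageTablePointer:
    protocol: str
    identifier: str


def parse_pointer(pointer: str) -> StorageTablePointer:
    """Split a `protocol:identifier` string at the first colon and validate the protocol."""
    protocol, separator, identifier = pointer.partition(":")
    if not separator or not identifier:
        raise ValueError(f"expected 'protocol:identifier', got {pointer!r}")
    if protocol not in KNOWN_PROTOCOLS:
        raise ValueError(f"unknown storage table pointer protocol: {protocol!r}")
    return StorageTablePointer(protocol, identifier)
```

   Splitting only at the first colon keeps identifiers that themselves contain colons intact, for example a metadata location with its own URI scheme.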




[GitHub] [iceberg] wmoustafa commented on issue #6420: Iceberg Materialized View Spec

Posted by "wmoustafa (via GitHub)" <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1427430612

   I see. I think this goes back to my previous comment that the engine catalog APIs will have to be able to resolve views as tables. They will simply navigate to the storage table location from the view as an entry point. Engine catalog APIs may have to evolve to do that, along with view resolution implementation. @jackye1995 do you agree this is the workflow? @JanKaul, does this make filtering the catalog by table location not required?




[GitHub] [iceberg] wmoustafa commented on issue #6420: Iceberg Materialized View Spec

Posted by "wmoustafa (via GitHub)" <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1435719206

   That is correct @JanKaul. 




Re: [I] Iceberg Materialized View Spec [iceberg]

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1809654712

   @jackye1995 it would be great if you could have a look at the google doc.




[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1397589244

   Just to push the progress forward, I think we have some kind of loose consensus that:
   1. view + storage table is likely the general approach to take
   2. the view stores a pointer to the storage table, or potentially to multiple storage tables
   3. the storage table should store the information about each refresh as table snapshot properties; a reverse pointer to the view is also good to have
   
   Is that the right understanding? @rdblue @JanKaul @wmoustafa 




[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1397956873

   > I don't know if it would work or too crazy, just to throw the idea out that I just came up with:
   > 
   > We could potentially make MV a representation in view spec, in parallel to the SQL representation. So for a materialized view, it could have 2 representations:
   > 
   > ```
   > [
   >   {
   >     "type" : "sql",
   >     "sql" : "SELECT \"count\"(*) my_cnt\nFROM\n  base_tab\n",
   >     "dialect" : "spark",
   >     "schema-id" : 2,
   >     "default-catalog" : "iceberg",
   >     "default-namespace" : [ "anorwood" ]
   >   },
   >   {
   >     "type" : "materialized",
   >     "namespace" : "some_namespace",
   >     "table" : "some_storage_table"
   >   }
   > ]
   > ```
   > 
   > By doing so, you could support a case where 1 view does not really have a SQL representation, but just a few different table layouts:
   > 
   > ```
   > [
   >   {
   >     "type" : "materialized",
   >     "namespace" : "some_namespace",
   >     "table" : "some_storage_table_partitioned_by_col1"
   >   },
   >   {
   >     "type" : "materialized",
   >     "namespace" : "some_namespace",
   >     "table" : "some_storage_table_partitioned_by_col2"
   >   }
   > ]
   > ```
   > 
   > which satisfies the use case described by Walaa.
   
   Great idea! I think this is a very clean way to store the metadata. However, I would rename the type to `sql_materialized` in case there are other representations in the future.
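
A minimal Python sketch of how an engine might read such a representation list. It assumes the `sql_materialized` type name suggested above and the `namespace`/`table` fields from the quoted example; the helper names `storage_tables` and `is_materialized` are hypothetical and not part of any Iceberg API.

```python
import json

# Hypothetical representation list, shaped like the quoted proposal.
REPRESENTATIONS = json.loads("""
[
  {"type": "sql", "sql": "SELECT count(*) my_cnt FROM base_tab", "dialect": "spark"},
  {"type": "sql_materialized", "namespace": "some_namespace", "table": "some_storage_table"}
]
""")


def storage_tables(representations):
    """Return (namespace, table) pairs for every materialized representation."""
    return [
        (rep["namespace"], rep["table"])
        for rep in representations
        if rep.get("type") == "sql_materialized"
    ]


def is_materialized(representations):
    """A view counts as materialized if it has at least one storage table."""
    return bool(storage_tables(representations))
```

Because the result is a list, the same code would also cover the case of several storage tables with different layouts for one view.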




[GitHub] [iceberg] wmoustafa commented on issue #6420: Iceberg Materialized View Spec

Posted by "wmoustafa (via GitHub)" <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1426821097

   Agree it makes more sense to store as one logical entity. Also I think to register with two names, we would have to assign the view and the table different names. The consequences of that are not clear.




[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1426852565

   I think it has to be a catalog-specific decision because, as you said, Snowflake hides the storage table, whereas a Trino view on Hive/Glue, for example, exposes it. So I am not sure whether it needs to be part of the spec.
   
   I think these are different types of catalogs; typically we distinguish them as **business data catalog** and **technical data catalog**. The former is oriented toward end users, whereas the latter is oriented toward technical users who would like to see all the hidden details.
   
   From an implementation perspective, why does approach 2 need a "definition of commit procedure"? I think it's just a matter of hiding the storage table or not; there will always be 2 objects behind the scenes.
   
   If you are talking about committing changes to both objects at the same time, @nastra is already working on a proposal for multi-table transactions, so I think that problem will be solved there and we don't need to worry about it in the MV spec.




Re: [I] [Proposal] Iceberg Materialized View Spec [iceberg]

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1912878826

   Hi @JanKaul. Thanks for putting this together. I went through the detailed discussion, and it seems the general consensus on the "Open Questions" in the design doc is:
   
   1.  The pointer to the storage table should be stored as an optional field in the view metadata (option 1)
   2. Lineage information should be stored as additional fields in the summary of the storage table (option 2)
   3. Only the view (and not storage table) should be registered in catalog (option 2)
   
   Is that correct?  Then given that, I have summarized the additions we are making to the current metadata spec below.
   
   # View Metadata
   | v1 | v2 | Field Name | Description |
   |---|---|---|---|
   | | optional | materialization | An optional `materialization` struct. If the value is null the entity is a common view, otherwise it is a materialized view. |
   
   Materialization Struct
   | v1 | Field Name | Description |
   | -- | -- | -- |
   | required | format-version | An integer version number for the materialized view format. Currently, this must be 1. Implementations must throw an exception if the materialized view's version is higher than the supported version. |
   | required | storage-table | Table metadata location |
   
   # Snapshot
   | v1 | Field Name | Description |
   | -- | -- | -- |
   | optional | refresh-version-id | Version id of the materialized view when the refresh operation was performed. |
   | optional | source-tables | A list of `source-table` records. |
   
   Source Table Struct
   | v1 | Field Name | Description |
   | -- | -- | -- |
   | required | identifier | Identifier of the table as defined in the SQL expression. |
   | required | snapshot-id | Snapshot id of the source table when the last refresh operation was performed. |
   
   
   Let me know if that looks right.
   
   My 2c on this:
   1. The materialization struct having its own format version seems like overkill to me. Maybe we can just flatten it and make the materialization directly just the storage-table pointer itself?
   2. Similar to @jackye1995's comment above (https://github.com/apache/iceberg/issues/6420#issuecomment-1398572156), I feel having the list of source-tables is a bit difficult. Can we proceed without it in the first cut? I feel the engines, if they wanted, could parse the source tables, look them up, and get the snapshot-ids directly. They must be able to parse the view SQL anyway, so they should be able to extract the source tables from it.
   3. How about the 'refresh-strategy'? (I didn't see it in the Google doc.) I feel it can go in the current 'properties' field of the view metadata. IIUC, @rdblue also suggested putting this in the table properties of the storage table along with other fields like materialized_view_format_version and view_identifier, which sounds fine too.
   
   If there is general consensus on the direction, it'd be great to move on to the actual spec PR and discuss specifics there, as it seems this proposal has been sitting for a while. I can also help with that, if needed. Thanks.
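
To make the summarized fields concrete, here is a small Python sketch of the view-side additions. It only mirrors the tables above; the class and function names (`Materialization`, `ViewMetadata`, `entity_kind`) are illustrative assumptions, not an existing Iceberg API.

```python
from dataclasses import dataclass
from typing import Optional

# Highest materialized view format version this sketch pretends to support.
SUPPORTED_MV_FORMAT_VERSION = 1


@dataclass
class Materialization:
    format_version: int  # "format-version"
    storage_table: str   # "storage-table": table metadata location


@dataclass
class ViewMetadata:
    # None means the entity is a common view.
    materialization: Optional[Materialization] = None


def entity_kind(view: ViewMetadata) -> str:
    """Classify the entity and reject unsupported materialization versions."""
    if view.materialization is None:
        return "view"
    if view.materialization.format_version > SUPPORTED_MV_FORMAT_VERSION:
        raise ValueError("unsupported materialized view format version")
    return "materialized view"
```

If the struct were flattened as suggested in point 1, `materialization` would simply become an optional string holding the storage table pointer.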
    




Re: [I] [Proposal] Iceberg Materialized View Spec [iceberg]

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1920421234

   Yea, I think we will continue on the Google doc until the open questions are addressed. @JanKaul made the specification draft section, so it should be clearer now.




[GitHub] [iceberg] wmoustafa commented on issue #6420: Iceberg Materialized View Spec

Posted by "wmoustafa (via GitHub)" <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1426239116

   +1. We can proceed.




[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1426775281

   Another point I would like to discuss is whether the storage table should be registered in the catalog. I brought this up before, but I think people haven't really voted for a solution. Generally, the storage table can either be registered in the catalog or not. Most proposals in this thread suggested registering both the view and the storage table in the catalog. I would like to propose registering only the view in the catalog. My reasoning is that a materialized view is one logical entity and should therefore also appear in the catalog as one entity. This is also the behavior of most RDBMSs and of cloud data warehouses like Snowflake.
   
   |             | Strategy 1: register storage table in catalog                                                                                                      | Strategy 2: don't register storage table in catalog                                                                                                                                                                                     |
   | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
   | Description | Metadata locations for the view and storage table are stored in the catalog. Since both locations are tracked, each transaction is automatically atomic. | Only the location of the main view is stored in the catalog. A reference to the storage table metadata location is stored in the main view. An atomic update of this reference is the basis for making atomic changes to the storage table metadata file. |
   | Pros        | <ul><li>Storage table can be addressed in the catalog</li><li>Simple implementation</li></ul>                                                            | <ul><li>only one entry for materialized view</li></ul>                                                                                                                                                                                                    |
   | Cons        | <ul><li>2 entities appear in the catalog</li></ul>                                                                                                       | <ul><li>definition of commit procedure</li></ul>                                                                                                                                                                                                          |
   
   React with the following if you prefer one of the strategies:
   :tada:  for strategy 1
   :rocket: for strategy 2
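
A rough sketch of what the "commit procedure" for strategy 2 could look like. It assumes the catalog offers an atomic compare-and-swap on the view's metadata location; `InMemoryCatalog` and `commit_refresh` are toy, hypothetical names, and a real catalog would have to provide the atomicity itself.

```python
class InMemoryCatalog:
    """Toy catalog mapping a view name to its current view metadata location."""

    def __init__(self):
        self._pointers = {}

    def current(self, name):
        return self._pointers.get(name)

    def swap(self, name, expected, new):
        """Replace the pointer only if it still equals `expected` (compare-and-swap)."""
        if self._pointers.get(name) != expected:
            return False
        self._pointers[name] = new
        return True


def commit_refresh(catalog, view_name, base_location, new_view_location):
    """Commit a refresh under strategy 2.

    `new_view_location` is assumed to point at view metadata that already
    references the new storage table metadata file, so swapping this single
    pointer publishes both changes at once.
    """
    return catalog.swap(view_name, base_location, new_view_location)
```

A writer whose `base_location` is outdated gets `False` back and would have to retry on top of the current metadata.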
   




[GitHub] [iceberg] rdblue commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546

   Thanks for writing this up, @JanKaul! It's a good idea to specify how to maintain metadata for materialized views.
   
   I think that the approach, to associate a view with some table that stores the materialized version, is a good design choice. And the metadata you have is a great start as well, although I think we can simplify or improve a couple of things.
   
   First, I think we want to avoid keeping much state information in complex table properties. Those aren't designed for the purpose and make the table a bit difficult to use. What I recommend instead is reusing the existing snapshot metadata structure to store what you need as snapshot properties. This approach has some really nice features in addition to being a bit simpler.
   
   Each materialized view version is going to be stored in a snapshot, so I think it makes sense to take your idea of a "refresh" and simply store that metadata in snapshot properties. Then we don't need a "current" refresh ID, we can just reuse the current snapshot. Similarly, we wouldn't need a new ID, we could just use the snapshot ID, and the sequence number is automatically associated.
   
   The metadata in snapshot properties would be very similar, but much smaller:
   | v1 | Snapshot property | Description |
   |--|--|--|
   | _required_ | `view_version_id` | version ID of the view that was materialized |
   | _required_ | `table.<identifier>` | table UUID and snapshot ID for the table identified by `<identifier>` that was read |
   
   In the table, I've also cut out a few of the base table properties...
   * Rather than `type`, just rely on everything being an Iceberg table upstream. _We may not want to do this, but it makes everything simple_
   * Rather than having a type for Hadoop vs Metastore tables, this makes no distinction. We should not design much for Hadoop tables because they are not recommended.
   * Removed properties. We can include a catalog name in the table identifier, and adding the table UUID ensures that we always use the same upstream table (or have to recompute a `full` refresh).
   
   The nice thing about keeping upstream table UUIDs and snapshot IDs in the snapshot metadata is that it allows us to roll back the state of the view along with the upstream tables. For example, if we have an hourly job that produces bad data and an agg MV based on it, it is possible to roll back both the table and the MV to the matching state. We can also do incremental refresh based on the closest materialized snapshot, not just the latest.
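
As an illustration of the rollback case, the sketch below searches the storage table's snapshots for the newest one that read a given source snapshot. The in-memory shape (`snapshot_id` plus a `sources` mapping) and the name `matching_mv_snapshot` are assumptions made for the example.

```python
# Hypothetical storage table history, ordered oldest to newest. Each entry
# records which snapshot of each source table was read during the refresh.
MV_SNAPSHOTS = [
    {"snapshot_id": 10, "sources": {"db.events": 100}},
    {"snapshot_id": 11, "sources": {"db.events": 101}},
    {"snapshot_id": 12, "sources": {"db.events": 102}},
]


def matching_mv_snapshot(mv_snapshots, identifier, source_snapshot_id):
    """Return the newest MV snapshot that read `identifier` at `source_snapshot_id`."""
    for snap in reversed(mv_snapshots):
        if snap["sources"].get(identifier) == source_snapshot_id:
            return snap
    return None


# If db.events is rolled back to snapshot 101, the MV can be rolled back to 11.
target = matching_mv_snapshot(MV_SNAPSHOTS, "db.events", 101)
```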
   
   I think we would still want some MV metadata in table properties:
   
   | v1 | Table property | Description |
   |--|--|--|
   | _required_ | `materialized_view_format_version` | The MV spec version used |
   | _required_ | `view_identifier` | Identifier for the view that is materialized |
   | _optional_ | `refresh_strategy` | `full` or `incremental`, default: `full` |
   
   We may want additional metadata as well, like a UUID to ensure we have the right view. I don't think we have a UUID in the view spec yet, but we could add one.
   
   I also moved the refresh strategy from the view to the MV table. I think we want to keep as much config on the table as possible, if it may differ between views. I could imagine a case where you might keep both incremental and full materialized versions or might want to have different partitioning specs for materialization, in which case you'd want that set on the table. I think the only thing I'd add to the view itself is the identifier for a materialized table.
   
   The last thing I think we may want to change is to add a section of the proposal for view invalidation. Your property to allow stale data is on the right track, but we can actually detect when a table has not been updated in a way that affects the view query in a lot of cases. For example, if you plan the final query and get input splits for the tables in the view, you can check whether the input is based on a snapshot newer than the MV's base snapshot. If it isn't, then it is safe to use the materialized version. This is a little tricky since you have to account for whether files matching the final filter were deleted, but it should be entirely a metadata operation. I think it would be great to document this as part of a spec.
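
A conservative sketch of a freshness check built on the snapshot properties above. It is deliberately stricter than the split-based check described in the last paragraph: any new source snapshot counts as stale. The `"<uuid>:<snapshot-id>"` value encoding and the function names are assumptions; the thread does not fix an encoding.

```python
def parse_source_state(snapshot_props):
    """Extract {identifier: (table_uuid, snapshot_id)} from `table.<identifier>` keys."""
    prefix = "table."
    state = {}
    for key, value in snapshot_props.items():
        if key.startswith(prefix):
            # Assumed encoding: "<uuid>:<snapshot-id>".
            table_uuid, snapshot_id = value.rsplit(":", 1)
            state[key[len(prefix):]] = (table_uuid, int(snapshot_id))
    return state


def is_fresh(snapshot_props, current_view_version, current_tables):
    """True if the view version and every recorded source table are unchanged.

    `current_tables` maps identifier -> (table_uuid, current_snapshot_id).
    """
    if int(snapshot_props["view_version_id"]) != current_view_version:
        return False
    recorded = parse_source_state(snapshot_props)
    return all(
        current_tables.get(identifier) == state
        for identifier, state in recorded.items()
    )
```

A changed table UUID also fails the check, which corresponds to the case where a `full` refresh has to be recomputed.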




[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1441824195

   > Another point I would like to discuss is whether the storage table should be registered in the catalog. I brought this up before but I think people haven't really voted for a solution. Generally the storage table can either be registered in the catalog or not. Most proposals in this thread suggested to register both the view and the storage table in the catalog. I would like to propose to only register the view in the catalog. My reasoning is that a materialized view is one logical entity and should therefore also appear in the catalog as one entity. This is also the behavior of most RDBMS and also cloud data warehouses like snowflake.
   > 
   > | | Strategy 1: register storage table in catalog | Strategy 2: don't register storage table in catalog |
   > | --- | --- | --- |
   > | Description | Metadata locations for the view and storage table are stored in the catalog. Since both locations are tracked, each transaction is automatically atomic. | Only the location of the main view is stored in the catalog. A reference to the storage table metadata location is stored in the main view. An atomic update of this reference is the basis for making atomic changes to the storage table metadata file. |
   > | Pros | <ul><li>Storage table can be addressed in the catalog</li><li>Simple implementation</li></ul> | <ul><li>only one entry for materialized view</li></ul> |
   > | Cons | <ul><li>2 entities appear in the catalog</li></ul> | <ul><li>definition of commit procedure</li></ul> |
   > 
   > React with the following if you prefer one of the strategies:
   > :tada: for strategy 1
   > :rocket: for strategy 2
   
   @rdblue do you have any preference for which unique identifier to use for referencing the storage table?




[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1398113386

   Yes, I agree with the proposed design 1. I'm not entirely sure what @rdblue prefers.
   
   I will update the Google doc accordingly.
   
   The next question for me is where and how to store the refresh information. There are currently two proposed solutions:
   
   1. Store refresh information in snapshot properties of the storage table
   2. Store refresh information in storage table properties
   
   I will create a section in the Google doc that summarizes these options.




[GitHub] [iceberg] wmoustafa commented on issue #6420: Iceberg Materialized View Spec

Posted by "wmoustafa (via GitHub)" <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1493286778

   I do not think the Iceberg metadata should "point back" to the catalog IDs (first option). What is the disadvantage of just going with the location of the storage table (last option, but without the protocol part)?




Re: [I] Iceberg Materialized View Spec [iceberg]

Posted by "wmoustafa (via GitHub)" <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1786292375

   Could you clarify how it does not work with the catalog API? At the end of the day, the catalog API takes a table identifier and returns an object. That can be a table object, a virtual view object, or a hybrid object (one that has characteristics of both table and virtual view objects). In the case of a materialized view, the catalog should be able to take a view name, i.e., an identifier, and return that hybrid object. So both the virtual view and the storage table are addressed by the view name. That said, I am still not clear why the Iceberg spec needs to be aware of the catalog-specific table identifier. Table identifiers are not part of the Iceberg table spec, so I do not think they can make it into the materialized view spec. (A table identifier as in `catalog.db.table` should not be confused with the table UUID; the latter is part of the V2 spec, which should be fine to use.)




Re: [I] [Proposal] Iceberg Materialized View Spec [iceberg]

Posted by "wmoustafa (via GitHub)" <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1920163073

   The specification draft as in the PR or the Google doc? If there are open questions, let us continue them on the Google doc instead of creating a third place for discussion? It is already somewhat hard to track.




[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1348693373

   The draft should be seen as an initial starting point. Obviously, the design is open for discussion.




[GitHub] [iceberg] nastra commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
nastra commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1351656857

   @JanKaul I think it would be great to get this out to the DEV mailing list to get more attention and input from people




[GitHub] [iceberg] jackye1995 commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1376290240

   Thanks for the detailed proposal! Trying to catch up with the conversation here. (btw it would probably be more organized to move this to a google doc or a PR that updates the spec so we can have different threads of discussions, instead of nesting conversations here)
   
   > We may want additional metadata as well, like a UUID to ensure we have the right view. I don't think we have a UUID in the view spec yet, but we could add one.
   
   +1, should be added right away so it could be referenced by the MV table
   
   > I think we want to keep as much config on the table as possible, if it may differ between views. I could imagine a case where you might keep both incremental and full materialized versions
   
   Why do we really need to have a specific field for refresh method? To me full refresh is always a limitation when incremental cannot be performed. Is there any specific use case where a full refresh is preferred when incremental refresh could be done?




[GitHub] [iceberg] wmoustafa commented on issue #6420: Iceberg Materialized View Spec

Posted by GitBox <gi...@apache.org>.
wmoustafa commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1396606776

   I think refresh strategy is up to the engine to decide, and it could very well depend on some runtime factors/stats. So the same view could leverage different refresh methods over time.
   
   I think there could be a hybrid design between Design 1 and Design 2, where some of the MV props (e.g., `allow-stale-data`) and the pointer to the storage table are kept in the common view (as is the case now in Design 1), while the underlying table snapshot information is stored in storage table snapshot properties as @rdblue suggested (as is the case now in Design 2). The reasoning for the pointer direction (view to table or table to view) is discussed in [this thread](https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/edit?disco=AAAAnaz8eEs) in the design doc.
   
   `allow-stale-data` could be enhanced to `allow-stale-data-within-ms`.
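
One possible reading of `allow-stale-data-within-ms`, sketched in Python. It assumes staleness is measured from the timestamp of the last refresh and that all values are epoch milliseconds; the thread does not define these semantics, and `can_serve_materialized` is a hypothetical name.

```python
def can_serve_materialized(last_refresh_ms, now_ms, allow_stale_within_ms):
    """True if the last refresh is recent enough to serve precomputed results."""
    return now_ms - last_refresh_ms <= allow_stale_within_ms


# A refresh 30 seconds ago is acceptable with a 60-second tolerance.
ok = can_serve_materialized(last_refresh_ms=1_000_000, now_ms=1_030_000,
                            allow_stale_within_ms=60_000)
```

Under this reading, a tolerance of `0` falls back to always requiring an up-to-date materialization.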


[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1427132713

   If you want to use the metadata file location to retrieve the storage table from a catalog, you would need to search all entries for the one whose metadata location field matches the provided location. Not all catalogs may support such an operation.
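   
   A minimal sketch of the difference, assuming a catalog that is modelled as a plain mapping from identifier to metadata file location. The mapping, the example paths, and the function names are hypothetical and only illustrate why the reverse lookup needs a full scan:

```python
from typing import Dict, Optional


def load_location(entries: Dict[str, str], identifier: str) -> Optional[str]:
    """Forward lookup by identifier: the operation catalogs are built around."""
    return entries.get(identifier)


def find_identifier_by_location(
    entries: Dict[str, str], metadata_location: str
) -> Optional[str]:
    """Reverse lookup by metadata location: has to visit every entry."""
    for identifier, location in entries.items():
        if location == metadata_location:
            return identifier
    return None


# Hypothetical catalog contents.
catalog_entries = {
    "db.orders": "s3://bucket/db/orders/metadata/v3.metadata.json",
    "db.mv_storage": "s3://bucket/db/mv_storage/metadata/v5.metadata.json",
}
```

   The reverse lookup also stops matching as soon as the storage table commits a new metadata file, unless the stored location is updated together with it.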


[GitHub] [iceberg] JanKaul commented on issue #6420: Iceberg Materialized View Spec

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1426768524

   Thank you @jackye1995 for your initiative. I have created a new [google doc](https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing) to discuss the actual specification for materialized views.


Re: [I] Iceberg Materialized View Spec [iceberg]

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1776639925

   I've updated the issue description and the google doc (https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing).
   
   I would love to get your feedback. @liurenjie1024 @jackye1995 @wmoustafa @rdblue 


Re: [I] [Proposal] Iceberg Materialized View Spec [iceberg]

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1917629464

   Great, this all makes sense to me except for the following points:
   
   > 3. Question: Most people seem to be for Option 1
   
   As we clarified in the google doc, it seems at least locally we are moving to Option 2:  https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1&disco=AAABFByy0mU
   
   > 3. As can be seen in the last point of https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.m5kli4l5q7ui, we agreed to leave the refresh strategy to the query engine. If available, an incremental refresh is always preferable to a full refresh.
   
   Did we consider making this property configurable? We should at least store which refresh strategy was used for the materialized view in the snapshot, for debugging. We can discuss this in the Google doc.
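   
   To make the debugging idea concrete, here is a small sketch in which the engine picks the strategy and records it in the summary of the snapshot it commits. The `refresh-strategy` key and both helper functions are assumptions for illustration only; nothing like them is defined in the spec draft.

```python
from typing import Dict

FULL = "full"
INCREMENTAL = "incremental"


def choose_refresh_strategy(incremental_possible: bool) -> str:
    """Engine-side choice: prefer incremental, fall back to full."""
    return INCREMENTAL if incremental_possible else FULL


def refresh_summary(incremental_possible: bool) -> Dict[str, str]:
    """Summary properties to attach to the snapshot written by the refresh."""
    # "refresh-strategy" is a hypothetical key, recorded purely for debugging.
    return {"refresh-strategy": choose_refresh_strategy(incremental_possible)}
```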
   
   So I'm open to either: we can make a PR, or fill out the Specification Draft part of the doc. The main barrier right now to more people reviewing is that the doc is a list of options and long threads, so it takes a long time to see where the consensus is; it took me quite some time to follow. Having a central section that lists the concrete additions we are proposing will greatly streamline the review. Feel free to take the summarized spec changes in my previous comment as a starting point.


Re: [I] [Proposal] Iceberg Materialized View Spec [iceberg]

Posted by "JanKaul (via GitHub)" <gi...@apache.org>.
JanKaul commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1918780632

   I've updated the Specification Draft. Please let me know if you think certain changes are in order.

