Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/16 02:01:12 UTC

[GitHub] [iceberg] yyanyy opened a new pull request #2096: Core: add schema id and schemas to table metadata

yyanyy opened a new pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096

This PR adds `current-schema-id` and `schemas` to table metadata. It also introduces a wrapper around the schema that associates a table schema with an ID.
The ID is not added directly to `Schema` because schema creation is currently used widely as a convenience in many operations that do not involve a "real" table schema.
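
To illustrate the wrapper idea, here is a minimal sketch. The class names `SchemaWrapperSketch` and `VersionedSchema` and the simplified `Schema` stand-in are hypothetical and are not the classes in this PR; the sketch only shows how an ID can be attached to a table schema while ad hoc schemas stay ID-free.

```java
import java.util.Arrays;
import java.util.List;

public class SchemaWrapperSketch {
  // Simplified stand-in for org.apache.iceberg.Schema (assumption); it has no ID field.
  static final class Schema {
    final List<String> columns;

    Schema(String... columns) {
      this.columns = Arrays.asList(columns);
    }
  }

  // Hypothetical wrapper that pairs a table schema with its schema ID.
  static final class VersionedSchema {
    final int schemaId;
    final Schema schema;

    VersionedSchema(int schemaId, Schema schema) {
      this.schemaId = schemaId;
      this.schema = schema;
    }
  }

  public static void main(String[] args) {
    // An ad hoc schema (for example a projection) needs no ID.
    Schema projection = new Schema("id");

    // Only the schema tracked in table metadata is wrapped with an ID.
    VersionedSchema tableSchema = new VersionedSchema(0, new Schema("id", "data"));

    System.out.println("schema-id=" + tableSchema.schemaId
        + " columns=" + tableSchema.schema.columns);
    System.out.println("ad hoc projection columns=" + projection.columns);
  }
}
```

With this shape, code that builds throwaway schemas keeps calling the plain constructor, and only the metadata layer deals with IDs.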

Next steps:
- add `schema-id` to snapshot log/history entries and populate it
- use history entries and `schemas` to look up the right schema in time travel queries; this may mean adding `schemas()` to the `Table` API
- update the spec
- add schema ID to the `historyTable` metadata table (discussed in the open questions below)

Open questions:
1. The current approach writes the newly introduced fields to JSON by default, even in v1, which raises a forward/backward compatibility concern. If a new writer writes a schema (with ID 0) and then updates it (with ID 1), the metadata stores both schema 0 and schema 1, and the default (current) schema ID is 1. An old writer then reads the metadata and rewrites it for some unrelated change, which drops schema 0 from the metadata. When a new writer picks up the metadata again, the original schema 0 is gone, and the schema that had ID 1 is now assigned ID 0. This could lead to inconsistent schema IDs across writers.
   - Since the ID is introduced in this PR, no metadata table exposes these inconsistent schema IDs yet, so this is not a problem for now. However, once we add schema ID to the `historyTable` metadata table, ID 0 could mean different things at different times in that table. We could potentially work around this by exposing the `schemaID` field in `historyTable` only for v2 tables, or by mentioning this caveat in the spec.
   - Alternatively, we could expose these two fields only in v2 tables and have time travel queries in v1 always rely on reading old table metadata files, as implemented in #1508. This could mean that future changes that depend on schema ID cannot be introduced in v1.
2. Do we want to add a `last-assigned-schema-id` to table metadata? My answer would be yes, for a reason similar to the one mentioned in [this comment](https://github.com/apache/iceberg/pull/2089#issuecomment-761184851).
3. I think that currently, when a table is replaced, earlier history entries (`snapshotLog`) are reset to empty (see the second-to-last argument [here](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L712)). Is this expected? Do we want to fix this as a separate issue?
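
The ID drift described in open question 1 can be reproduced with a toy model. This is only a sketch under assumptions: `Metadata`, `newWriterRead`, `newWriterUpdateSchema`, and `oldWriterRewrite` are hypothetical names, schemas are represented as plain strings, and the old writer is assumed to keep only the single current schema. It is not Iceberg's actual `TableMetadata` code.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

public class SchemaIdDriftSketch {
  // Toy model of table metadata (assumption): one legacy schema field plus the new ID fields.
  static final class Metadata {
    final String schema;                 // legacy single-schema field, understood by all writers
    final Integer currentSchemaId;       // null when the file was written by an old writer
    final Map<Integer, String> schemas;  // empty when the file was written by an old writer

    Metadata(String schema, Integer currentSchemaId, Map<Integer, String> schemas) {
      this.schema = schema;
      this.currentSchemaId = currentSchemaId;
      this.schemas = Collections.unmodifiableMap(new TreeMap<>(schemas));
    }
  }

  // New writer reading metadata: if no IDs are present, the only known schema gets ID 0.
  static Metadata newWriterRead(Metadata m) {
    if (!m.schemas.isEmpty()) {
      return m;
    }
    Map<Integer, String> schemas = new TreeMap<>();
    schemas.put(0, m.schema);
    return new Metadata(m.schema, 0, schemas);
  }

  // New writer updating the schema: keeps earlier schemas and assigns the next ID.
  static Metadata newWriterUpdateSchema(Metadata m, String newSchema) {
    Metadata base = newWriterRead(m);
    TreeMap<Integer, String> schemas = new TreeMap<>(base.schemas);
    int newId = schemas.lastKey() + 1;
    schemas.put(newId, newSchema);
    return new Metadata(newSchema, newId, schemas);
  }

  // Old writer rewriting metadata: it does not know the new fields, so they are dropped.
  static Metadata oldWriterRewrite(Metadata m) {
    return new Metadata(m.schema, null, Collections.emptyMap());
  }

  public static void main(String[] args) {
    Metadata created = newWriterRead(new Metadata("schema-A", null, Collections.emptyMap()));
    Metadata updated = newWriterUpdateSchema(created, "schema-B");  // IDs: 0 -> A, 1 -> B
    Metadata rewritten = oldWriterRewrite(updated);                 // ID fields are lost
    Metadata reread = newWriterRead(rewritten);                     // IDs: 0 -> B

    System.out.println("before old writer: ID 0 -> " + updated.schemas.get(0));
    System.out.println("after old writer:  ID 0 -> " + reread.schemas.get(0));
  }
}
```

In this model, ID 0 refers to `schema-A` before the old writer touches the metadata and to `schema-B` afterwards, which is the inconsistency a `historyTable` column would expose.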