You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/03 23:09:52 UTC

[GitHub] [iceberg] wypoon commented on pull request #2275: Core: add schema id to snapshot and history entry

wypoon commented on pull request #2275:
URL: https://github.com/apache/iceberg/pull/2275#issuecomment-812938555

@yyanyy thank you for responding to my question. But because I am slow, I needed to do some testing to understand for myself the effects of https://github.com/apache/iceberg/pull/2096 and this PR.
I started with Iceberg 0.11.0 and created some tables, inserting data into them, and altering the schema. Then I added the commit for https://github.com/apache/iceberg/pull/2096 and the changes here on top (actually I added the commit for https://github.com/apache/iceberg/pull/2089 first, so that the backports are clean). I made further updates to my Iceberg tables, adding data and altering the schema.
From what I see, when the table is updated and a new metadata.json file for the table is written, the metadata.json file still has a `schema` field, but the schema now has a `schema-id` field; it also has a `schemas` field containing an array of schemas, and a `current-schema-id` field. The only schema known at the time of the switchover is the current schema of the table, so if the table change is just change to data, the existing schema is given a `schema-id` of 0, and the schemas array contains a single schema; if the table change is a schema change, then the old schema is given a `schema-id` of 0, the new schema is given a `schema-id` of 1, and the `schemas` array contains two schemas (and the `current-schema-id` field is 1). In the `snapshots` field, any snapshot created before the switchover does not have a `schema-id`; snapshots created after the switchover are given a `schema-id`, corresponding to the schema current at the time the snapshot is written.
When `Snapshot#schemaId()` is called, if the snapshot is written before the switchover, null is returned, and if written after the switchover, a non-null Integer is returned.
It is as you wrote.
With this change, I think it is straightforward to update my PR, https://github.com/apache/iceberg/pull/1508. I just need to update the following method I'm adding to `BaseTable`
```
public Schema schemaForSnapshot(long snapshotId) {
TableMetadata current = ops.current();
// First, check if the snapshot has a schema id associated with it
Snapshot snapshot = current.snapshot(snapshotId);
Integer schemaId = snapshot.schemaId();
if (schemaId != null) {
return current.schemasById().get(schemaId);
}
// Otherwise, read each of the previous metadata files until we find one whose current
// snapshot id is the snapshot id
...
}
```
I rebuilt Iceberg with my change on top of the previous changes, and I was able to see that the correct schema is used when viewing any snapshot (time travel).
When this change is merged, I will update my PR. I hope that it can be considered then.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org