Posted to issues@iceberg.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/02/16 13:33:01 UTC

[GitHub] [iceberg] Fokko commented on issue #6858: Schema of the Underlying data files

Fokko commented on issue #6858:
URL: https://github.com/apache/iceberg/issues/6858#issuecomment-1433095111

Hey @fivetran-tusharkumar, thanks for reaching out.

Iceberg applies schema changes lazily. If you add a column, it is added to the table schema but not to the existing data files. When you read those files, the new column (which is missing from the Parquet file) is filled in with nulls. Once a file is rewritten, the replacement file contains the new column. The Iceberg schema is optionally stored in the [Parquet metadata](https://parquet.apache.org/docs/file-format/metadata/); unfortunately, some writers do not write it. In that case, you need to reconstruct the Iceberg schema from the Parquet schema, which contains the field IDs that match the Iceberg schema.
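
To make the read-time behaviour concrete, here is a minimal sketch in plain Python. It is not the Iceberg or Parquet API: `TABLE_SCHEMA`, `FILE_SCHEMA`, and `project_row` are hypothetical names, and schemas are modelled as simple field-ID-to-name dictionaries. Columns are matched by field ID, and a column that the older file does not contain comes back as null (`None`).

```python
# Illustrative model only; not the Iceberg or Parquet API.
# Current table schema: field ID -> column name. "email" (ID 3) was added
# after the data file below was written.
TABLE_SCHEMA = {1: "id", 2: "name", 3: "email"}

# Schema recorded in the older data file: field ID -> column name.
FILE_SCHEMA = {1: "id", 2: "name"}


def project_row(file_row, file_schema, table_schema):
    """Project one row from a data file onto the current table schema."""
    projected = {}
    for field_id, column in table_schema.items():
        file_column = file_schema.get(field_id)
        # A field ID missing from the file means the column was added later,
        # so the reader fills it with null.
        projected[column] = file_row[file_column] if file_column is not None else None
    return projected


old_row = {"id": 1, "name": "alice"}
new_row = project_row(old_row, FILE_SCHEMA, TABLE_SCHEMA)
print(new_row)
```

The data file itself is never touched here; only the row handed back to the reader gains the extra column.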

For example, if you rename a column in Iceberg, the old files that are already part of the table are not rewritten right away, although they may be rewritten later. Columns are looked up by field ID: the old files are read with their original column name, and the column is then renamed to the new name.
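
The rename case can be sketched the same way. Again, this is plain Python under stated assumptions rather than Iceberg's implementation: `rename_map` and `read_row` are hypothetical helpers, and each schema is a dictionary from field ID to column name. Because the field ID stays the same across a rename, it is enough to link the name written in the file to the current table name.

```python
# Illustrative model only; not the Iceberg or Parquet API.
def rename_map(file_schema, table_schema):
    """Map column names as written in the file to their current table names."""
    return {
        file_name: table_schema[field_id]
        for field_id, file_name in file_schema.items()
        if field_id in table_schema  # skip columns no longer in the table
    }


def read_row(file_row, file_schema, table_schema):
    """Return a row from an old file, keyed by the current column names."""
    names = rename_map(file_schema, table_schema)
    return {names[col]: value for col, value in file_row.items() if col in names}


FILE_SCHEMA = {1: "id", 2: "name"}        # written before the rename
TABLE_SCHEMA = {1: "id", 2: "full_name"}  # field 2 renamed, ID unchanged

row = read_row({"id": 7, "name": "bob"}, FILE_SCHEMA, TABLE_SCHEMA)
print(row)
```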

TL;DR: You need to read the footer of each Parquet file to determine that file's schema.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
