
Posted to issues@iceberg.apache.org by "RussellSpitzer (via GitHub)" <gi...@apache.org> on 2023/05/07 19:41:44 UTC

[GitHub] [iceberg] RussellSpitzer commented on issue #6122: IcebergGenerics.read(table) doesn't work as expected

RussellSpitzer commented on issue #6122:
URL: https://github.com/apache/iceberg/issues/6122#issuecomment-1537525136

> Do you think this should be considered as bug and it should be fixed in IcebergGenerics?
The fact that the error message changed to a different column suggests that the mapping is in fact working but that the field IDs are wrong. I didn't know field mapping worked with generics, but the fact that there is a different mismatch suggests that it is being used. What you need to do is determine the correct field IDs for your schema and make sure those match. If the field IDs are correct, then you could file a bug with a reproduction.

> I mean shouldn't any tool which is being used to read iceberg data be using fallback mechanism too if data files don't contain field-ids? I don't think the data files need to contain the field-ids. It's not because we didn't implement this. It's mainly because this information is anyway present in iceberg metadata json files. So same information doesn't need to be there in 2 places. Please let me know about your thoughts.

The information is not in two places. If you think the information is duplicated, you may have a misconception about how Iceberg handles schema evolution and schemas in general.

Imagine you have a table with a column X, then drop column X and add a new column X.

In a Hive table, doing this would resurrect the data from the old column X because Hive uses *name mapping only*. Or imagine that you simply wanted to drop column X and fill in a new column X with a different type. Again, this is going to cause issues in Hive because there is no way of differentiating the old "X" from the new "X".

To address issues like this, Iceberg instead has a mapping between the columns in files and the logical columns in the table.

In Iceberg you end up with two different "column X"s, each with a different field ID. Files either have explicit field IDs written with their columns, or we need to provide a backup mapping from those columns to field IDs. Remember, we now have two different X's in the table's history, and in almost all situations we do not expect this data to be valid for both of these points in time. This is why the field ID is required. The metadata.json only contains information about which field ID maps to which column; the actual column name in the schema isn't relevant. If we used the current schema names as a mapping, it would lead to resurrection and schema evolution issues (as noted above) when the schema changed.
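
To make the drop-and-re-add scenario concrete, here is a minimal toy sketch in Python. It does not use any Iceberg API: plain dicts stand in for the table schema and the data files, and the names `read_by_name`, `read_by_field_id`, and `name_mapping` are hypothetical, chosen only for illustration. It assumes column X was first created with field ID 1, dropped, and then re-added with field ID 2.

```python
# Toy model only: plain dicts stand in for table metadata and data files.
# None of these names are Iceberg APIs.

# Current table schema after "drop X, add new X": field ID -> column name.
current_schema = {2: "x"}

# Written before the drop: its column "x" carries field ID 1.
old_file = {"columns": [{"name": "x", "field_id": 1}], "rows": [{"x": "stale"}]}
# Written after the re-add: its column "x" carries field ID 2.
new_file = {"columns": [{"name": "x", "field_id": 2}], "rows": [{"x": "fresh"}]}
# Written by a tool that did not store field IDs at all.
legacy_file = {"columns": [{"name": "x"}], "rows": [{"x": "legacy"}]}


def read_by_name(data_file, schema):
    """Hive-style resolution: match file columns to the schema by name only."""
    return [{name: row.get(name) for name in schema.values()}
            for row in data_file["rows"]]


def read_by_field_id(data_file, schema, name_mapping=None):
    """ID-based resolution with an optional fallback mapping (name -> field ID)
    for files that carry no field IDs."""
    id_to_file_col = {}
    for col in data_file["columns"]:
        field_id = col.get("field_id")
        if field_id is None:
            if name_mapping is None:
                raise ValueError(f"no field ID for column {col['name']!r}")
            field_id = name_mapping.get(col["name"])
        if field_id is not None:
            id_to_file_col[field_id] = col["name"]

    result = []
    for row in data_file["rows"]:
        out = {}
        for field_id, name in schema.items():
            file_col = id_to_file_col.get(field_id)
            # A field ID absent from the file projects as null.
            out[name] = row[file_col] if file_col is not None else None
        result.append(out)
    return result


print(read_by_name(old_file, current_schema))      # old data resurrected
print(read_by_field_id(old_file, current_schema))  # old X stays dropped
print(read_by_field_id(new_file, current_schema))  # new X is read
print(read_by_field_id(legacy_file, current_schema, {"x": 1}))
```

In this sketch, name-based resolution returns the stale value from the old file, while ID-based resolution returns null for it and reads only the new file's X. For the file without IDs, the result depends entirely on which field ID the fallback mapping assigns to "x", which is why that mapping has to be supplied deliberately rather than derived from the current schema names.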

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
