You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/04 10:27:37 UTC

[GitHub] [iceberg] kamaljit-1991 opened a new issue, #6122: IcebergGenerics.read(table) doesn't work as expected

kamaljit-1991 opened a new issue, #6122:
URL: https://github.com/apache/iceberg/issues/6122

   ### Apache Iceberg version
   
   0.13.1
   
   ### Query engine
   
   _No response_
   
   ### Please describe the bug 🐞
   
   It is little bit of related https://github.com/apache/iceberg/issues/4523 but not quite.
   
   When the order of the columns don't match between the underlying parquet files and the table schema then the IcebergGenerics doesn't work properly while querying the table. But Athena works perfectly fine and ideally it should work too. 
   
   **Steps to reproduce:**
   Create a glue table with the following order of columns : 
   <img width="386" alt="image" src="https://user-images.githubusercontent.com/45392269/199949629-79a2a3f3-c2b8-4384-91c2-92e9afae1cb9.png">
   
   And then put the data as the following order in 2 parquet files : 
   ```
    required int32 id;
     optional binary name (STRING);
     optional double double_col;
     optional int64 timestamp_col (TIMESTAMP(MICROS,true));
     optional int64 _fivetran_synced (TIMESTAMP(MICROS,true));
   ```
   
   Another file could be : 
   ```
    optional int64 _fivetran_synced (TIMESTAMP(MICROS,false));
     optional binary name (STRING);
     optional int64 timestamp_col (TIMESTAMP(MICROS,false));
     optional int32 id (INTEGER(32,true));
     optional double double_col;
   ```
   
   Then if you try to get the records as following : 
   Iterable<Record> records = IcebergGenerics.read(table).build();
   for(Record record : records) {
       System.out.println(((GenericRecord) record).getField("name"));
   }
   
   It will fail with below exception : 
   ```
   class org.apache.iceberg.types.Types$IntegerType cannot be cast to class org.apache.iceberg.types.Types$TimestampType (org.apache.iceberg.types.Types$IntegerType and org.apache.iceberg.types.Types$TimestampType are in unnamed module of loader 'app')
   ```
   Any record from the first parquet file will fail because that doesn't match the order of columns with the table and it is trying to convert the name to timestamp. Somehow the `namesToCol` is not working properly. 
   [data.zip](https://github.com/apache/iceberg/files/9937291/data.zip)
   
   I uploaded the zip containing 3 parquet files which you could use in an iceberg table to test it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kamaljit-1991 closed issue #6122: IcebergGenerics.read(table) doesn't work as expected

Posted by "kamaljit-1991 (via GitHub)" <gi...@apache.org>.

kamaljit-1991 closed issue #6122: IcebergGenerics.read(table) doesn't work as expected
URL: https://github.com/apache/iceberg/issues/6122


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kamaljit-1991 commented on issue #6122: IcebergGenerics.read(table) doesn't work as expected

Posted by GitBox <gi...@apache.org>.

kamaljit-1991 commented on issue #6122:
URL: https://github.com/apache/iceberg/issues/6122#issuecomment-1347696537

   Hey @RussellSpitzer checking again. We are creating the table like this : 
   ```
   catalog.createTable(
                       TableIdentifier.of(database, table),
                       schema,
                       PartitionSpec.unpartitioned(),
                       tableLocation,
                       Map.of("table_type", "ICEBERG", TableProperties.FORMAT_VERSION, "2")
   ```
   Where catalog is the object of GlueCatalog and tableLocation is s3 location. I am not sure about default name mapping. Could yo give some pointers about that ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kamaljit-1991 commented on issue #6122: IcebergGenerics.read(table) doesn't work as expected

Posted by "kamaljit-1991 (via GitHub)" <gi...@apache.org>.

kamaljit-1991 commented on issue #6122:
URL: https://github.com/apache/iceberg/issues/6122#issuecomment-1537165802

   Correct @RussellSpitzer . I verified even after updating `TableProperties.DEFAULT_NAME_MAPPING` to the proper json value containing the field-id, names and fields exactly mentioned in [here](https://iceberg.apache.org/spec/#name-mapping-serialization) such as in one of my table's metadata contains : 
   ```
   "properties" : {
       "schema.name-mapping.default" : "[{\"names\":[\"_fivetran_synced\"],\"field-id\":1},{\"names\":[\"id\"],\"field-id\":2},{\"names\":[\"attr\"],\"field-id\":3}]",
       "table_type" : "ICEBERG"
     },
   ```
   
   Still IcebergGenerics fails to read the table and fails with the same kind of exception : 
   ```
   java.lang.ClassCastException: class org.apache.iceberg.types.Types$StringType cannot be cast to class org.apache.iceberg.types.Types$TimestampType (org.apache.iceberg.types.Types$StringType and org.apache.iceberg.types.Types$TimestampType are in unnamed module of loader 'app')
   ```
   
   Do you think this should be considered as bug and it should be fixed in IcebergGenerics?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kamaljit-1991 commented on issue #6122: IcebergGenerics.read(table) doesn't work as expected

Posted by "kamaljit-1991 (via GitHub)" <gi...@apache.org>.

kamaljit-1991 commented on issue #6122:
URL: https://github.com/apache/iceberg/issues/6122#issuecomment-1521740451

   Actually 1 thing I recently faced is the parquet files don't contain the field_ids. Is that causing any issue here ? I feel iceberg shouldn't be dependent on the field_ids. Because parquet contains schema and all the column names so field_id shouldn't matter. That's how delta table works. Please let me know about your thoughts.
   
   
   cc @RussellSpitzer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kamaljit-1991 commented on issue #6122: IcebergGenerics.read(table) doesn't work as expected

Posted by "kamaljit-1991 (via GitHub)" <gi...@apache.org>.

kamaljit-1991 commented on issue #6122:
URL: https://github.com/apache/iceberg/issues/6122#issuecomment-1619980699

   Thank you so much @RussellSpitzer . I get it now. Thanks for clarifying. I am closing this ticket.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kamaljit-1991 commented on issue #6122: IcebergGenerics.read(table) doesn't work as expected

Posted by GitBox <gi...@apache.org>.

kamaljit-1991 commented on issue #6122:
URL: https://github.com/apache/iceberg/issues/6122#issuecomment-1320826410

   @RussellSpitzer Sorry for late reply. Could you elaborate your question please, What do you exactly mean by default name mapping ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on issue #6122: IcebergGenerics.read(table) doesn't work as expected

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on issue #6122:
URL: https://github.com/apache/iceberg/issues/6122#issuecomment-1522044247

   @kamaljit-1991 Iceberg does depend on field_ids, that's how the schema evolution works. We could not support schema evolution without it. Iceberg generics just lacks support for our fallback mechinism, default name mapping. This is what we use when working with files not made by iceberg. See the Iceberg format spec for more information.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on issue #6122: IcebergGenerics.read(table) doesn't work as expected

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #6122:
URL: https://github.com/apache/iceberg/issues/6122#issuecomment-1303235332

   Are you adding a default name mapping for your table?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on issue #6122: IcebergGenerics.read(table) doesn't work as expected

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on issue #6122:
URL: https://github.com/apache/iceberg/issues/6122#issuecomment-1537525136

> Do you think this should be considered as bug and it should be fixed in IcebergGenerics?
The fact that the error message changed to a different column suggests that the mapping is in fact working but that the field_ids are wrong. I didn't know field mapping worked with generics, but the fact that there is a different missmatch suggests that it is being used. What you need to do is actually determine the correct field Id's for your schema and make sure those match. If the field Id's are correct, then you could file a bug with a reproduction.

> I mean shouldn't any tool which is being used to read iceberg data be using fallback mechanism too if data files don't contain field-ids? I don't think the data files need to contain the field-ids. It's not because we didn't implement this. It's mainly because this information is anyway present in iceberg metadata json files. So same information doesn't need to be there in 2 places. Please let me know about your thoughts.

The information is not in too places, if you think the information is duplicated you may have a misconception about how Iceberg handles schema evolution and schema in general.

Imagine you have a table with a column X, then drop column X and add a new column X.

In a hive table, doing this would resurrect the data from column x because it uses *name mapping only*. Or imagine that you simply wanted to drop column X and fill in a new column X with a different type. Again this is going to cause issues in hive because we have no way of differentiating old "x" from new "x"

To address issues like this Iceberg instead has a mapping between columns in files and the logical column in the table.

In Iceberg you end up with two different "column x"s each with a different field ID. Files written either have explicit field-id's written with their columns OR we need to give a backup mapping of those columns to a field ID. Remember, we now have 2 different X's in the tables history and in almost all situations we do not expect this data to be valid for both of these points in time. This is why field-id is required. The metadata.json only contains information about what field id maps to what column, the actual column name in the schema isn't relevant. If we used the current schema names as a mapping it would lead to resurrection/schema evolution issues (as noted above) when the schema changed.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org