Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/16 06:02:29 UTC

[GitHub] [iceberg] szehon-ho opened a new issue, #5543: Imported parquet tables may have wrong metrics

szehon-ho opened a new issue, #5543:
URL: https://github.com/apache/iceberg/issues/5543

   ### Apache Iceberg version
   
   main (development)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I found this problem while working on https://github.com/apache/iceberg/pull/5376#discussion_r934960703, which now attempts to convert metrics to readable values and hit an exception while doing so. I am just reporting the problem here.
   
   See the test:  TestIcebergSourceTablesBase::testFilesTableWithSnapshotIdInheritance https://github.com/apache/iceberg/blob/5f5c9235c10ed4a711a64de880491b3ae4f348ec/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java#L466
   
   Setup:  
   The parquet table is partitioned, so when we insert data into it, each data file contains only one column (data). The "id" column is the partition column, so it does not exist in the file.
   
   Code Flow:  
   The import code seems to do the following steps (via TableMigrationUtil::listPartition -> TableMigrationUtil::getParquetMetrics -> ParquetUtil::footerMetrics()):
   1. Assign field ids
   2. Calculate metrics
   
   In the first step, it sees that the parquet file schema does not have ids (expected) and assigns them using ParquetSchemaUtil::addFallbackIds, which starts at 1, so the data column now has field id 1.
   
   The second step calculates metrics for 'data' column and puts them in the map with id=1.
   
   However, in the Iceberg destination table schema, id=1 and data=2. So the metrics stored under field id 1 actually belong to 'data' but are attributed to 'id', and when we try to read the metrics, they are not correct.
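   
   To make the mismatch concrete, here is a minimal standalone sketch. It does not call Iceberg: `assignFallbackIds` and `columnForId` are hypothetical helpers that only mimic the sequential id assignment described above, using plain maps in place of the real schemas.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   public class FallbackIdMismatch {
     // Hypothetical stand-in for fallback id assignment:
     // file columns get ids 1, 2, ... in file order.
     static Map<String, Integer> assignFallbackIds(List<String> fileColumns) {
       Map<String, Integer> ids = new LinkedHashMap<>();
       int nextId = 1;
       for (String column : fileColumns) {
         ids.put(column, nextId++);
       }
       return ids;
     }

     // Returns the table column that owns the given field id, or null if none does.
     static String columnForId(Map<String, Integer> tableIds, int fieldId) {
       for (Map.Entry<String, Integer> entry : tableIds.entrySet()) {
         if (entry.getValue() == fieldId) {
           return entry.getKey();
         }
       }
       return null;
     }

     public static void main(String[] args) {
       // Destination table schema from this issue: id=1, data=2.
       Map<String, Integer> tableIds = new LinkedHashMap<>();
       tableIds.put("id", 1);
       tableIds.put("data", 2);

       // The data file holds only the non-partition column.
       Map<String, Integer> fileIds = assignFallbackIds(List.of("data"));

       // Metrics for 'data' are keyed by its fallback id, which the table maps to 'id'.
       int metricsKey = fileIds.get("data");
       System.out.println("metrics for 'data' stored under field id " + metricsKey
           + ", which the table schema resolves to '" + columnForId(tableIds, metricsKey) + "'");
     }
   }
   ```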
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on issue #5543: Imported parquet tables may have wrong metrics

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on issue #5543:
URL: https://github.com/apache/iceberg/issues/5543#issuecomment-1298123982

   @szehon-ho I hit this as well when trying to resolve the lower/upper bounds issue. Setting a name mapping (`NameMapping`) on the table can resolve this.
   
   ```
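       // Build a name mapping from the table schema so that files without field ids
       // are resolved by column name (`table` is the destination Iceberg table).
       NameMapping mapping = MappingUtil.create(table.schema());
       String mappingJson = NameMappingParser.toJson(mapping);
   
       table.updateProperties().set(TableProperties.DEFAULT_NAME_MAPPING, mappingJson).commit();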
   ```




[GitHub] [iceberg] szehon-ho closed issue #5543: Imported parquet tables may have wrong metrics

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho closed issue #5543: Imported parquet tables may have wrong metrics
URL: https://github.com/apache/iceberg/issues/5543




[GitHub] [iceberg] karuppayya commented on issue #5543: Imported parquet tables may have wrong metrics

Posted by GitBox <gi...@apache.org>.
karuppayya commented on issue #5543:
URL: https://github.com/apache/iceberg/issues/5543#issuecomment-1218284924

   >This starts with id= 1, so now data column has id 1.
   
   I think we should be looking up the destination table to assign ids.
   Also, if we are assigning new ids, I think we should be starting from the highest id of the destination table.
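   
   A rough sketch of that idea, assuming columns are matched by name. This is not Iceberg code: `assignIds` is a hypothetical helper, and plain maps stand in for the file and table schemas.
   
   ```java
   import java.util.Collections;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   public class DestinationIdAssignment {
     // Reuses the destination table's id when a file column matches by name;
     // columns unknown to the table get fresh ids above the table's highest id.
     static Map<String, Integer> assignIds(List<String> fileColumns, Map<String, Integer> tableIds) {
       int nextId = tableIds.isEmpty() ? 1 : Collections.max(tableIds.values()) + 1;
       Map<String, Integer> assigned = new LinkedHashMap<>();
       for (String column : fileColumns) {
         Integer existing = tableIds.get(column);
         if (existing != null) {
           assigned.put(column, existing);
         } else {
           assigned.put(column, nextId++);
         }
       }
       return assigned;
     }

     public static void main(String[] args) {
       // Destination table schema from this issue: id=1, data=2.
       Map<String, Integer> tableIds = Map.of("id", 1, "data", 2);

       // 'data' keeps the table's id 2; 'extra' is a made-up column the table does not know.
       System.out.println(assignIds(List.of("data", "extra"), tableIds)); // prints {data=2, extra=3}
     }
   }
   ```
   
   With ids assigned this way, metrics for 'data' would be keyed by 2 and would line up with the destination schema.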
   




[GitHub] [iceberg] szehon-ho commented on issue #5543: Imported parquet tables may have wrong metrics

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #5543:
URL: https://github.com/apache/iceberg/issues/5543#issuecomment-1218412926

   Yeah, I'm not too familiar with the code, but it sounds reasonable to me.




[GitHub] [iceberg] szehon-ho commented on issue #5543: Imported parquet tables may have wrong metrics

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #5543:
URL: https://github.com/apache/iceberg/issues/5543#issuecomment-1299159462

   Thanks for looking at it!

