Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/08/27 19:52:43 UTC

[GitHub] [iceberg] rdblue opened a new issue #1396: Cannot read Parquet maps with NameMapping

rdblue opened a new issue #1396:
URL: https://github.com/apache/iceberg/issues/1396


   Reading a Parquet file that contains a map using `NameMapping` fails because the generated schema has a different name for the key/value pair struct:
   
   ```
   org.apache.iceberg.bdp.shaded.org.apache.parquet.io.InvalidRecordException: key_value not found in optional group properties (MAP) = 1 {
     repeated group map {
       required binary key (UTF8) = 8;
       optional binary value (UTF8) = 9;
     }
   }
   	at org.apache.iceberg.bdp.shaded.org.apache.parquet.schema.GroupType.getFieldIndex(GroupType.java:146)
   	at org.apache.iceberg.bdp.shaded.org.apache.parquet.schema.GroupType.getType(GroupType.java:178)
   	at org.apache.iceberg.bdp.shaded.org.apache.parquet.schema.GroupType.getType(GroupType.java:282)
   	at org.apache.iceberg.bdp.shaded.org.apache.parquet.schema.GroupType.getType(GroupType.java:282)
   	at org.apache.iceberg.bdp.shaded.org.apache.parquet.schema.MessageType.getType(MessageType.java:90)
   	at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eval(ParquetMetricsRowGroupFilter.java:91)
   	at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.access$100(ParquetMetricsRowGroupFilter.java:77)
   	at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter.shouldRead(ParquetMetricsRowGroupFilter.java:71)
   	at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:99)
   	at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:66)
   	at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:77)
   	at org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:103)
   	at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:73)
   ```
   
   The names generated when applying the name mapping do not match the original schema. The name mapping code should be updated to add field IDs and not change anything else in the file schema.
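
The requirement above (attach field IDs, leave every name in the file schema alone) can be illustrated with a small sketch. It is only a toy model under stated assumptions: `Node`, `fileMap`, `rebuildMap`, `preserveMap` and `containsPath` are hypothetical names invented for this example and are not part of the Parquet or Iceberg APIs, and the repeated-group names `key_value` (in the file) and `map` (chosen by the rebuild) are assumed only to mirror the stack trace.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PreserveNamesSketch {

  /** Hypothetical, minimal stand-in for a schema node; this is not the Parquet API. */
  static final class Node {
    final String name;
    final Integer id; // null when no field ID has been assigned
    final List<Node> children;

    Node(String name, Integer id, List<Node> children) {
      this.name = name;
      this.id = id;
      this.children = children;
    }

    /** Copies this node with an ID attached; name and children are untouched. */
    Node withId(int newId) {
      return new Node(name, newId, children);
    }
  }

  static Node leaf(String name) {
    return new Node(name, null, Collections.<Node>emptyList());
  }

  /** A map as stored in the data file; the repeated group is assumed to be named "key_value". */
  static Node fileMap() {
    Node repeated = new Node("key_value", null, Arrays.asList(leaf("key"), leaf("value")));
    return new Node("properties", null, Collections.singletonList(repeated));
  }

  /** Rebuilds the map from its parts, using a fixed name ("map", assumed) for the repeated group. */
  static Node rebuildMap(Node map, int id) {
    Node repeated = map.children.get(0);
    Node renamed = new Node("map", null, repeated.children);
    return new Node(map.name, id, Collections.singletonList(renamed));
  }

  /** Attaches the ID and changes nothing else in the file's type. */
  static Node preserveMap(Node map, int id) {
    return map.withId(id);
  }

  /** Returns true if every name in the path resolves, starting from the children of root. */
  static boolean containsPath(Node root, String... path) {
    Node current = root;
    for (String part : path) {
      Node next = null;
      for (Node child : current.children) {
        if (child.name.equals(part)) {
          next = child;
        }
      }
      if (next == null) {
        return false;
      }
      current = next;
    }
    return true;
  }

  public static void main(String[] args) {
    Node file = fileMap();
    // The file's column path no longer resolves against the rebuilt type.
    System.out.println(containsPath(rebuildMap(file, 1), "key_value", "key"));  // false
    // It still resolves when only the ID is added.
    System.out.println(containsPath(preserveMap(file, 1), "key_value", "key")); // true
  }
}
```

In this model the rebuilt type carries the right ID but loses the file's own name for the repeated group, so a lookup by the file's column path fails in the same way as the `key_value not found` error above.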


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #1396: Cannot read Parquet maps with NameMapping

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1396:
URL: https://github.com/apache/iceberg/issues/1396#issuecomment-682838451


   @chenjunjiedada, that code constructs a new map type using the Parquet builder. There is no guarantee that the field names match the data file's field names.




[GitHub] [iceberg] chenjunjiedada commented on issue #1396: Cannot read Parquet maps with NameMapping

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on issue #1396:
URL: https://github.com/apache/iceberg/issues/1396#issuecomment-684873935


   @hk-lrzy, `ApplyNameMapping` is used to assign IDs to the data file's schema, so here we return the data file's type with the IDs taken from the name mapping.
   




[GitHub] [iceberg] hk-lrzy commented on issue #1396: Cannot read Parquet maps with NameMapping

Posted by GitBox <gi...@apache.org>.
hk-lrzy commented on issue #1396:
URL: https://github.com/apache/iceberg/issues/1396#issuecomment-685331404


   @chenjunjiedada, I understand. So you actually mean that the data file schema is not equal to the row group schema?




[GitHub] [iceberg] chenjunjiedada commented on issue #1396: Cannot read Parquet maps with NameMapping

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on issue #1396:
URL: https://github.com/apache/iceberg/issues/1396#issuecomment-682388898


   I don't see anything in the current name mapping logic that updates field names, in either `map` or `primitive`. Could you help construct a unit test to reproduce it?
   
   ```java
     @Override
     public Type map(GroupType map, Type keyType, Type valueType) {
       Preconditions.checkArgument(keyType != null && valueType != null,
           "Map type must have both key field and value field");
   
       MappedField field = nameMapping.find(currentPath());
       Type mapType = org.apache.parquet.schema.Types.map(map.getRepetition())
           .key(keyType)
           .value(valueType)
           .named(map.getName());
   
       return field == null ? mapType : mapType.withId(field.id());
     }
   
     @Override
     public Type primitive(PrimitiveType primitive) {
       MappedField field = nameMapping.find(currentPath());
       return field == null ? primitive : primitive.withId(field.id());
     }
   ```




[GitHub] [iceberg] hk-lrzy edited a comment on issue #1396: Cannot read Parquet maps with NameMapping

Posted by GitBox <gi...@apache.org>.
hk-lrzy edited a comment on issue #1396:
URL: https://github.com/apache/iceberg/issues/1396#issuecomment-685331404


   @chenjunjiedada, I understand. So you actually mean that the data file schema is not equal to the row group schema?




[GitHub] [iceberg] chenjunjiedada commented on issue #1396: Cannot read Parquet maps with NameMapping

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on issue #1396:
URL: https://github.com/apache/iceberg/issues/1396#issuecomment-683745049


   So the name mapping is created from a custom schema that was converted from a custom Parquet schema, and some of its field names do not match the data file's field names. Correct?
   
   IIUC, we could detect whether data files' field names exist in `typeWithIds`. Does that sound reasonable to you?
   
   ```java
         for (ColumnChunkMetaData col : rowGroup.getColumns()) {
           if (fileSchema.containsPath(col.getPath().toArray())) {
             PrimitiveType colType = fileSchema.getType(col.getPath().toArray()).asPrimitiveType();
             if (colType.getId() != null) {
               int id = colType.getId().intValue();
               stats.put(id, col.getStatistics());
               valueCounts.put(id, col.getValueCount());
               conversions.put(id, ParquetConversions.converterFromParquet(colType));
             }
           }
         }
   ```
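
The guard proposed in the snippet above can be shown in isolation. This is a minimal sketch under stated assumptions, not Iceberg code: `GuardedLookupSketch` and `countsById` are hypothetical names, column paths are flattened to dotted strings for brevity, and a plain `Map` stands in for the ID-annotated schema (`typeWithIds`).

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class GuardedLookupSketch {

  /**
   * Collects per-column value counts keyed by field ID.
   * A column is skipped when its path is unknown to the ID-annotated schema
   * or when the path is known but carries no ID.
   */
  static Map<Integer, Long> countsById(
      Map<String, Integer> idsByPath, Map<String, Long> countsByColumnPath) {
    Map<Integer, Long> result = new LinkedHashMap<>();
    for (Map.Entry<String, Long> col : countsByColumnPath.entrySet()) {
      String path = col.getKey();
      if (idsByPath.containsKey(path)) {   // plays the role of containsPath(...)
        Integer id = idsByPath.get(path);
        if (id != null) {                  // plays the role of getId() != null
          result.put(id, col.getValue());
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical schema with IDs: one path has an ID, one has none.
    Map<String, Integer> idsByPath = new HashMap<>();
    idsByPath.put("properties.map.key", 8);
    idsByPath.put("properties.map.value", null);

    // Hypothetical row-group columns; the last path is not in the schema at all.
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("properties.map.key", 10L);
    counts.put("properties.map.value", 10L);
    counts.put("properties.key_value.key", 10L);

    System.out.println(countsById(idsByPath, counts)); // {8=10}
  }
}
```

In this sketch a skipped column simply has no entry in the result. The lookup no longer throws, but the column also contributes no statistics, so the guard works around the naming mismatch rather than removing it.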
   
   




[GitHub] [iceberg] hk-lrzy commented on issue #1396: Cannot read Parquet maps with NameMapping

Posted by GitBox <gi...@apache.org>.
hk-lrzy commented on issue #1396:
URL: https://github.com/apache/iceberg/issues/1396#issuecomment-684852718


   I see that the current master branch falls back to the data file's schema for the field name if the path is not found in `nameMapping`?
   ```java
     @Override
     public Type primitive(PrimitiveType primitive) {
       MappedField field = nameMapping.find(currentPath());
       return field == null ? primitive : primitive.withId(field.id());
     }
   ```
   So I am still not clear on this. Could you explain in more detail?




[GitHub] [iceberg] chenjunjiedada commented on issue #1396: Cannot read Parquet maps with NameMapping

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on issue #1396:
URL: https://github.com/apache/iceberg/issues/1396#issuecomment-685401310


   @hk-lrzy, yes. The original problem happens because the file schema that was used to generate the name mapping is not equal to the file schema in the actual Parquet file (some names do not match).

