Posted to dev@parquet.apache.org by "Joy Bestourous (Jira)" <ji...@apache.org> on 2022/08/09 14:01:00 UTC

[jira] [Commented] (PARQUET-2168) Potential bug in ParquetWriteProtocol

    [ https://issues.apache.org/jira/browse/PARQUET-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577431#comment-17577431 ] 

Joy Bestourous commented on PARQUET-2168:
-----------------------------------------

Hello, is there any update on this ticket? Thanks!

> Potential bug in ParquetWriteProtocol
> -------------------------------------
>
>                 Key: PARQUET-2168
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2168
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Joy Bestourous
>            Priority: Minor
>
> We found what we think is a bug in ParquetWriteProtocol: instantiating StructWriteProtocol fails if the StructType contains an empty child struct.
> Specifically, if the thriftStruct contains an empty struct, logic in ThriftSchemaConvertVisitor drops that element, yielding a MessageType with one fewer field than the original schema. Subsequent logic in ParquetWriteProtocol.StructWriteProtocol tries to populate `children` by iterating over the Thrift struct's children and fetching the element at the same index from the ColumnIO object, which has fewer children than the Thrift struct:
> {code:java}
> Given: ThriftStruct with 20 fields
> MessageType schema = ThriftSchemaConverter.convertWithoutProjection(thriftStruct)
> -> ThriftSchemaConvertVisitor.convert(StructType struct...)
> -> -> visitor = new ThriftSchemaConvertVisitor(filter, true, keepOneOfEachUnion)
> -> -> ConvertedField = struct.accept(visitor, state)
> -> -> -> ThriftSchemaConvertVisitor.visit(struct, state)
> -> -> -> -> ConvertedField converted = child.getType().accept(this, childState)
> -> -> -> -> ThriftSchemaConvertVisitor.visit(struct, state) // here we’re at the child struct{code}
> Here, both hasSentinelUnionColumns and hasNonSentinelUnionColumns default to false, and one of them is set to true only when a child element is found. Because the empty struct has no children, both stay false, so at the following step we fall into the Drop case.
>  
> {code:java}
> if (hasNonSentinelUnionColumns) {
>   // user requested some of the fields of this struct, so we keep the struct
>   return new Keep(state.path, new GroupType(state.repetition, state.name, convertedChildren));
> } else {
>   // user requested none of the fields of this struct, so we drop it
>   return new Drop(state.path);
> }{code}
>  
> Because this field is dropped, the resulting MessageType has only 19 fields.
>  
> {code:java}
> ColumnIO = new ColumnIOFactory().getColumnIO(MessageType) // again yields a ColumnIO with only 19 fields
> TProtocol = new ParquetWriteProtocol(recordConsumer, columnIo, thriftStruct)
> -> MessageWriteProtocol = new MessageWriteProtocol(ColumnIO schema, StructType thriftType)
> -> -> new StructWriteProtocol(ColumnIO schema, StructType thriftType...)
> for (i = 0 to thriftStruct.children.size) // which is 20
>  schema.getChild(i) // Out of bounds error on index 19{code}
> We currently have a workaround for this but would like to get a fix if possible.
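
The mismatch described above can be reproduced without any Parquet classes. The following is only a minimal sketch under stated assumptions: `Field`, `convert`, `bindByIndex`, `bindByName`, and `sampleSchema` are hypothetical stand-ins, not the parquet-thrift API, and the name-based lookup is just one possible way to avoid the failure, not a proposed patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EmptyStructDropDemo {

    /** Hypothetical stand-in for a Thrift field. children == null means a primitive field. */
    static final class Field {
        final String name;
        final List<Field> children;

        Field(String name, List<Field> children) {
            this.name = name;
            this.children = children;
        }

        boolean isEmptyStruct() {
            return children != null && children.isEmpty();
        }
    }

    /** Hypothetical schema: 20 fields, one of which (f7) is an empty struct. */
    static List<Field> sampleSchema() {
        List<Field> fields = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            fields.add(new Field("f" + i, i == 7 ? new ArrayList<Field>() : null));
        }
        return fields;
    }

    /** Models the conversion step: empty structs are dropped, all other fields are kept. */
    static List<String> convert(List<Field> thriftFields) {
        List<String> converted = new ArrayList<>();
        for (Field f : thriftFields) {
            if (!f.isEmptyStruct()) {
                converted.add(f.name);
            }
        }
        return converted;
    }

    /** Models the failing pattern: index the converted schema by Thrift child position. */
    static List<String> bindByIndex(List<Field> thriftFields, List<String> converted) {
        List<String> bound = new ArrayList<>();
        for (int i = 0; i < thriftFields.size(); i++) {
            bound.add(converted.get(i)); // throws once i reaches converted.size()
        }
        return bound;
    }

    /** One possible alternative: look each field up by name; -1 marks a dropped field. */
    static Map<String, Integer> bindByName(List<Field> thriftFields, List<String> converted) {
        Map<String, Integer> bound = new LinkedHashMap<>();
        for (Field f : thriftFields) {
            bound.put(f.name, converted.indexOf(f.name));
        }
        return bound;
    }

    public static void main(String[] args) {
        List<Field> thrift = sampleSchema();
        List<String> converted = convert(thrift);
        System.out.println("thrift fields: " + thrift.size());
        System.out.println("converted fields: " + converted.size());
        try {
            bindByIndex(thrift, converted);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("index-based binding failed");
        }
        System.out.println("name-based binding: " + bindByName(thrift, converted));
    }
}
```

In this model the index-based loop also pairs every field after the dropped one with its neighbour's column before it finally runs out of bounds, which is why the sketch binds by name instead.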



--
This message was sent by Atlassian Jira
(v8.20.10#820010)