You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "Timothy Miller (Jira)" <ji...@apache.org> on 2022/04/25 15:51:00 UTC

[jira] [Comment Edited] (PARQUET-2140) parquet-cli unable to read UUID values

    [ https://issues.apache.org/jira/browse/PARQUET-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527559#comment-17527559 ] 

Timothy Miller edited comment on PARQUET-2140 at 4/25/22 3:50 PM:
------------------------------------------------------------------

I just realized that I managed to duplicate this in PARQUET-2143, so PARQUET-2143 should probably be resolved as a duplicate of this one. Now, if I can just figure out how to run this is a debugger...

EDIT: This is interesting. The guid.parquet file isn't the only one I have that parquet-cli can't parse. I also can't parse either of the files attached to PARQUET-2069. Both of those throw this error:
{noformat}
Unknown error
java.lang.RuntimeException: Failed on record 0
    at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
    at org.apache.parquet.cli.Main.run(Main.java:157)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: optional binary element (STRING) is not a group
...{noformat}


was (Author: JIRAUSER287471):
I just realized that I managed to duplicate this in PARQUET-2143, so PARQUET-2143 should probably be resolved as a duplicate of this one. Now, if I can just figure out how to run this is a debugger...

> parquet-cli unable to read UUID values
> --------------------------------------
>
>                 Key: PARQUET-2140
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2140
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cli
>            Reporter: Balaji K
>            Priority: Minor
>         Attachments: guid.parquet
>
>
> I am finding that parquet-cli throws when trying to read UUID values. 
> Attached to this bug report is a parquet file with 2 columns, message encoded as byte-array and number encoded as fixed length byte array (UUID). This file was written by my .net implementation of parquet specification. The file has one row worth of data and is readable by parquet-cpp.
> +Schema as read by parquet-cli:+
> message root {
>   required binary Message (STRING);
>   required fixed_len_byte_array(16) Number (UUID);
> }
> +Values as read by parquet-cpp:+
> --- Values ---
> Message                       |Number                        |
> First record                  |215 48 212 219 218 57 169 67 166 116 7 79 44 227 50 17 |
>  
> +Here is the exception stack from parquet-cli when trying to read uuid values:+
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required binary Number (STRING) != required fixed_len_byte_array(16) Number (UUID)
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:93)
>         at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:602)
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83)
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
>         at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
>         at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
>         at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
>         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
>  
>  I debugged parquet-cli code and found that parquet-cli is trying to project the UUID as a string and later on that throws as these types are not compatible? 
>  
> +Source code references:+
> At AvroReadSupport.java, line 97
> ~~~~~~~~~~~~
>     String requestedProjectionString = configuration.get(AVRO_REQUESTED_PROJECTION);
>     if (requestedProjectionString != null) {
>       Schema avroRequestedProjection = new Schema.Parser().parse(requestedProjectionString);
>       projection = new AvroSchemaConverter(configuration).convert(avroRequestedProjection);
>     }
> ~~~~~~~~~~~~
>  
> +Debugger values for+ 
> requestedProjectionString=
> {"type":"record","name":"root","fields":[\{"name":"Message","type":"string"},\{"name":"Number","type":{"type":"string","logicalType":"uuid"}}]}
> [Note that `Number` now has a type of `string` and a logicalType of `uuid`]
>  
> At ColumnIOFactory.java line 93
> ~~~~~~~~~~~~
> incompatibleSchema(primitiveType, currentRequestedType);
> ~~~~~~~~~~~~
> +Debugger values for+ 
> primitiveType = required fixed_len_byte_array(16) Number (UUID)
> currentRequestedType = required binary Number (STRING)
>  
> and this will throw.
>  
> If I skip over the projection code in AvroReadSupport, parquet-cli is able to read my file.
> I am not sure if the bug is in parquet-cli or parquet-mr or in the library I used to encode this file. The fact that parquet-cpp is able to read it gives me some confidence to say that the problem is either in parquet-cli or parquet-mr.
> Please point me in the right direction if I could verify this UUID roundtripping purely from parquet-mr itself in form of an unit-test. Happy to contribute tests or fix if needed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)