Posted to dev@parquet.apache.org by "Andreas Hailu (Jira)" <ji...@apache.org> on 2021/06/14 19:50:00 UTC

[jira] [Comment Edited] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

    [ https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362991#comment-17362991 ] 

Andreas Hailu edited comment on PARQUET-1681 at 6/14/21, 7:49 PM:
------------------------------------------------------------------

Hi folks, we're running into this issue as well. Bob Smith provided a unit test in 2018 that reproduces this issue in PARQUET-1254.

I personally like [~shangx@uber.com]'s idea of having some sort of 'metadata' field to store information such as the list type, rather than mangling the actual schema. That way the Avro <-> Parquet schema conversion always stays compatible and does not create unforeseen pitfalls.


was (Author: ahailu):
Hi folks, we're running into this issue as well. Bob Smith provided a unit test that reproduces this issue in PARQUET-1254.

I personally like Xinli's idea of having some sort of 'metadata' field to store information such as the list type, rather than mangling the actual schema. That way the Avro <-> Parquet schema conversion always stays compatible and does not create unforeseen pitfalls.

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-1681
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1681
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-avro
>    Affects Versions: 1.10.0, 1.9.1, 1.11.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Critical
>
> When using the Avro schema below to write a parquet(1.8.1) file and then reading it back with parquet 1.10.1 without passing any schema, the read throws the exception "XXX is not a group". Reading with parquet 1.8.1 works fine.
>             {
>               "name": "phones",
>               "type": [
>                 "null",
>                 {
>                   "type": "array",
>                   "items": {
>                     "type": "record",
>                     "name": "phones_items",
>                     "fields": [
>                       {
>                         "name": "phone_number",
>                         "type": [
>                           "null",
>                           "string"
>                         ],
>                         "default": null
>                       }
>                     ]
>                   }
>                 }
>               ],
>               "default": null
>             }
> The code used to read the file is as follows:
>     val reader = AvroParquetReader.builder[SomeRecordType](parquetPath).withConf(new Configuration).build()
>     reader.read()
> PARQUET-651 changed isElementType() to rely on Avro's checkReaderWriterCompatibility() for the compatibility check. However, checkReaderWriterCompatibility() considers the Parquet schema and the Avro schema (converted from the file schema) incompatible, because the record name is ‘phones_items’ in the Avro schema but ‘array’ in the Parquet schema. isElementType() therefore returns false, which causes the “phone_number” field in the schema above to be treated as a group type, which it is not. The exception is then thrown at .asGroupType().
> I have not tested whether writing with parquet 1.10.1 reproduces the same problem. It may, because the translation from Avro schema to Parquet schema has not changed (not yet verified).
>  I hesitate to revert PARQUET-651 because it solved several problems. I would like to hear the community's thoughts on it. 
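
The name mismatch described in the issue can be looked at in isolation. The sketch below is only an illustration and is not the parquet-avro code path: it assumes Avro is on the classpath, and the object name NameMismatchDemo and the helpers recordSchema and check are hypothetical names introduced here. It builds two record schemas that differ only in their record name (‘phones_items’ versus ‘array’) and passes them to Avro's SchemaCompatibility.checkReaderWriterCompatibility().

```scala
import org.apache.avro.{Schema, SchemaCompatibility}
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType

// Hypothetical demo object; not part of parquet-avro.
object NameMismatchDemo {

  // Builds a record schema with the single nullable "phone_number" field
  // from the issue. Only the record name varies between calls.
  // A fresh parser is used each time so no named type is defined twice.
  def recordSchema(recordName: String): Schema =
    new Schema.Parser().parse(
      s"""{
         |  "type": "record",
         |  "name": "$recordName",
         |  "fields": [
         |    {"name": "phone_number", "type": ["null", "string"], "default": null}
         |  ]
         |}""".stripMargin)

  // Thin wrapper around Avro's reader/writer compatibility check.
  def check(reader: Schema, writer: Schema): SchemaCompatibilityType =
    SchemaCompatibility.checkReaderWriterCompatibility(reader, writer).getType

  def main(args: Array[String]): Unit = {
    val avroSide    = recordSchema("phones_items") // name from the Avro schema
    val parquetSide = recordSchema("array")        // name seen on the Parquet side

    println(s"same name:      ${check(avroSide, avroSide)}")
    println(s"different name: ${check(avroSide, parquetSide)}")
  }
}
```

If Avro behaves as the issue describes, the second check should report the pair as incompatible even though the fields are identical, which is the condition that makes isElementType() return false.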


