Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2016/07/17 22:00:22 UTC

[jira] [Resolved] (PARQUET-651) Parquet-avro fails to decode array of record with a single field name "element" correctly

     [ https://issues.apache.org/jira/browse/PARQUET-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue resolved PARQUET-651.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.9.0

Merged #352. Thanks for reviewing, [~lian cheng]!

> Parquet-avro fails to decode array of record with a single field name "element" correctly
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-651
>                 URL: https://issues.apache.org/jira/browse/PARQUET-651
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.7.0, 1.8.0, 1.8.1, 1.9.0
>            Reporter: Cheng Lian
>            Assignee: Ryan Blue
>             Fix For: 1.9.0
>
>
> Found this issue while investigating SPARK-16344.
> For the following Parquet schema
> {noformat}
> message root {
>   optional group f (LIST) {
>     repeated group list {
>       optional group element {
>         optional int64 element;
>       }
>     }
>   }
> }
> {noformat}
> parquet-avro decodes it as something like this:
> {noformat}
> record SingleElement {
>   long element;
> }
> record NestedSingleElement {
>   SingleElement element;
> }
> record Spark16344Wrong {
>   array<NestedSingleElement> f;
> }
> {noformat}
> while the correct interpretation should be:
> {noformat}
> record SingleElement {
>   long element;
> }
> record Spark16344 {
>   array<SingleElement> f;
> }
> {noformat}
> The reason is that the {{element}} syntactic group for LIST in
> {noformat}
> <list-repetition> group <name> (LIST) {
>   repeated group list {
>     <element-repetition> <element-type> element;
>   }
> }
> {noformat}
> is mistaken for a record field named {{element}}, so the repeated group {{list}} is treated as the element type itself, as it would be in the legacy 2-level layout. The problematic code lies in [{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858]. We should probably check for the standard 3-level layout first and only then fall back to the legacy 2-level layout.
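
The ordering suggested in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the change merged in #352 and not the parquet-mr API: the Node class and the primitive, group, and repeatedTypeIsElement helpers are hypothetical stand-ins, and the legacy names checked in the fallback (a repeated group named "array" or "<list-name>_tuple") are taken from the backward-compatibility rules in the Parquet format's LogicalTypes document.

```java
import java.util.Arrays;
import java.util.List;

public class ListLayoutSketch {

    /** Hypothetical, minimal stand-in for a Parquet type; NOT the parquet-mr API. */
    static final class Node {
        final String name;
        final boolean isGroup;
        final List<Node> fields;

        Node(String name, boolean isGroup, Node... fields) {
            this.name = name;
            this.isGroup = isGroup;
            this.fields = Arrays.asList(fields);
        }
    }

    static Node primitive(String name) {
        return new Node(name, false);
    }

    static Node group(String name, Node... fields) {
        return new Node(name, true, fields);
    }

    /**
     * Returns true if the repeated field inside a LIST-annotated group is itself
     * the element type (legacy 2-level layout), and false if it is only the
     * synthetic wrapper around the element (standard 3-level layout).
     */
    static boolean repeatedTypeIsElement(Node repeated, String listName) {
        // A repeated primitive, or a repeated group with several fields,
        // can only be the element itself.
        if (!repeated.isGroup || repeated.fields.size() != 1) {
            return true;
        }
        // Check the standard 3-level shape first:
        //   repeated group list { <repetition> <type> element; }
        if ("list".equals(repeated.name)
                && "element".equals(repeated.fields.get(0).name)) {
            return false;
        }
        // Only then fall back to the legacy 2-level naming conventions.
        return "array".equals(repeated.name)
                || (listName + "_tuple").equals(repeated.name);
    }

    public static void main(String[] args) {
        // The schema from this issue: f (LIST) > list > element > element (int64)
        Node repeated = group("list", group("element", primitive("element")));
        System.out.println(repeatedTypeIsElement(repeated, "f")); // prints "false"
    }
}
```

With this ordering, the schema from the report is read as a list of SingleElement records rather than a list of NestedSingleElement wrappers. The real converter also has to take the requested Avro schema into account, which this sketch leaves out.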


