Posted to dev@parquet.apache.org by "Sabarishan (Jira)" <ji...@apache.org> on 2022/11/03 03:07:00 UTC

[jira] [Commented] (PARQUET-2069) Parquet file containing arrays, written by Parquet-MR, cannot be read again by Parquet-MR

    [ https://issues.apache.org/jira/browse/PARQUET-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628086#comment-17628086 ] 

Sabarishan commented on PARQUET-2069:
-------------------------------------

Is there a plan to fix this issue?

I have encountered the same issue when reading a Parquet file written through Spark.

Spark schema:
  optional group predicateData (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }

Avro schema:
optional group predicateData (LIST) {
  repeated group array {
    optional binary element (STRING);
  }
}

The issue is that [AvroSchemaConverter|https://github.com/apache/parquet-mr/blob/62b774cd0f0c60cfbe540bbfa60bee15929af5d4/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L193] names the repeated group "array" rather than "list", which causes a schema mismatch when reading arrays.
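
A possible workaround, not confirmed in this thread: parquet-avro has a documented property, parquet.avro.write-old-list-structure, which defaults to true. Setting it to false should make the converter emit the three-level "list"/"element" layout instead of the "array" naming. The sketch below only builds such a Configuration. The class and method names are hypothetical, and whether this resolves the read failure described here is an assumption.

```java
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper class, for illustration only.
public class ListStructureConfig {

    // Returns a Hadoop Configuration that asks parquet-avro to use the
    // three-level list layout (repeated group "list" with field "element").
    public static Configuration threeLevelLists() {
        Configuration conf = new Configuration();
        // Defaults to true in parquet-avro, which yields the "array" naming.
        conf.setBoolean("parquet.avro.write-old-list-structure", false);
        return conf;
    }
}
```

The resulting Configuration would then be handed to whatever builds the Avro reader or writer (or to AvroSchemaConverter directly), so that schema conversion sees the setting.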

> Parquet file containing arrays, written by Parquet-MR, cannot be read again by Parquet-MR
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2069
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2069
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.12.0
>         Environment: Windows 10
>            Reporter: Devon Kozenieski
>            Priority: Blocker
>         Attachments: modified.parquet, original.parquet, parquet-diff.png
>
>
> The attachments include the original file and a modified file, produced by reading the original with Parquet-MR and writing it back with a few values changed. The schema should not change, since the schema of the input file is used as the schema to write the output file. However, the output file has a slightly different schema that Parquet-MR cannot read back the same way, failing with the exception message: java.lang.ClassCastException: optional binary element (STRING) is not a group
> My guess is that the issue lies in the Avro schema conversion.
> The Parquet files attached have some arrays and some nested fields.


