Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2020/07/06 07:36:00 UTC
[jira] [Assigned] (PARQUET-1879) Apache Arrow cannot read a Parquet File written with Parquet-Avro 1.11.0 with a Map field

     [ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky reassigned PARQUET-1879:
-----------------------------------------

    Assignee: Matthew McMahon

> Apache Arrow cannot read a Parquet File written with Parquet-Avro 1.11.0 with a Map field
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1879
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1879
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro, parquet-format
>    Affects Versions: 1.11.0
>            Reporter: Matthew McMahon
>            Assignee: Matthew McMahon
>            Priority: Critical
>
> From my [StackOverflow question|https://stackoverflow.com/questions/62504757/issue-with-loading-parquet-data-into-snowflake-cloud-database-when-written-with] about an issue I'm having getting Snowflake (Cloud DB) to load Parquet files written with version 1.11.0:
> ----
> The problem only appears when using a map schema field in the Avro schema. For example:
> {code:java}
>     {
>       "name": "FeatureAmounts",
>       "type": {
>         "type": "map",
>         "values": "records.MoneyDecimal"
>       }
>     }
> {code}
> When Parquet-Avro writes the file, it produces a bad Parquet schema, for example:
> {code:java}
> message record.ResponseRecord {
>   required binary GroupId (STRING);
>   required int64 EntryTime (TIMESTAMP(MILLIS,true));
>   required int64 HandlingDuration;
>   required binary Id (STRING);
>   optional binary ResponseId (STRING);
>   required binary RequestId (STRING);
>   optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
>   required group FeatureAmounts (MAP) {
>     repeated group map (MAP_KEY_VALUE) {
>       required binary key (STRING);
>       required fixed_len_byte_array(12) value (DECIMAL(28,15));
>     }
>   }
> }
> {code}
> From the great answer to my StackOverflow question, it seems the issue is that Parquet-Avro 1.11.0 still uses the legacy MAP_KEY_VALUE converted type, which has no logical type equivalent. From the comment on [LogicalTypeAnnotation|https://github.com/apache/parquet-mr/blob/84c954d8a4feef2d9bdad7a236a7268ef71a1c25/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java#L904]:
> {code:java}
> // This logical type annotation is implemented to support backward compatibility with ConvertedType.
>   // The new logical type representation in parquet-format doesn't have any key-value type,
>   // thus this annotation is mapped to UNKNOWN. This type shouldn't be used.
> {code}
> However, it seems this annotation is still being written by the latest release, 1.11.0, which then causes Apache Arrow to fail with:
> {code:java}
> Logical type Null can not be applied to group node
> {code}
> It appears that [Arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L629-L632] only looks for the new logical types Map or List, so the legacy annotation causes this error.
> According to the [LogicalTypes|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] document in parquet-format, a map should look something like:
> {code:java}
> // Map<String, Integer>
> required group my_map (MAP) {
>   repeated group key_value {
>     required binary key (UTF8);
>     optional int32 value;
>   }
> }
> {code}
> Is this on the correct path?
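
To make the difference between the two layouts above concrete, here is a minimal sketch, assuming the schema is available as text in the form printed in the report. It flags repeated groups that still carry the legacy MAP_KEY_VALUE annotation. The class and method names (`LegacyMapCheck`, `findLegacyMapGroups`) are hypothetical, and the check is a plain regular expression over the schema text; it does not use the parquet-mr API and is not how Parquet-Avro or Arrow validate schemas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of parquet-mr: scans the textual form of a
// Parquet schema for repeated groups annotated with the legacy MAP_KEY_VALUE type.
final class LegacyMapCheck {

    // Matches e.g. "repeated group map (MAP_KEY_VALUE)" and captures the group name.
    private static final Pattern LEGACY_GROUP =
        Pattern.compile("repeated\\s+group\\s+(\\w+)\\s*\\(MAP_KEY_VALUE\\)");

    // Returns the names of all repeated groups that use the legacy annotation.
    static List<String> findLegacyMapGroups(String schemaText) {
        List<String> names = new ArrayList<>();
        Matcher m = LEGACY_GROUP.matcher(schemaText);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Map field as written by Parquet-Avro 1.11.0 in the report.
        String written = String.join("\n",
            "required group FeatureAmounts (MAP) {",
            "  repeated group map (MAP_KEY_VALUE) {",
            "    required binary key (STRING);",
            "    required fixed_len_byte_array(12) value (DECIMAL(28,15));",
            "  }",
            "}");
        // Map layout as shown in the LogicalTypes example.
        String documented = String.join("\n",
            "required group my_map (MAP) {",
            "  repeated group key_value {",
            "    required binary key (UTF8);",
            "    optional int32 value;",
            "  }",
            "}");
        System.out.println("written:    " + findLegacyMapGroups(written));
        System.out.println("documented: " + findLegacyMapGroups(documented));
    }
}
```

Running `main` lists the offending group name for the schema written by 1.11.0 and nothing for the documented layout. A text scan like this is only a quick diagnostic; a real fix would have to change what the writer emits.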


