Posted to jira@arrow.apache.org by "Zosimova Zhanna (Jira)" <ji...@apache.org> on 2020/11/07 18:48:00 UTC
[jira] [Updated] (ARROW-10514) [C++][Parquet] Data inconsistency in parquet-reader output modes

     [ https://issues.apache.org/jira/browse/ARROW-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zosimova Zhanna updated ARROW-10514:
------------------------------------
    Description: 
I tried reading the description of a Parquet [file|https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/data_parquet/nested_maps.snappy.parquet] with nested maps using the [parquet-reader tool|https://github.com/apache/arrow/blob/master/cpp/tools/parquet/parquet_reader.cc].

This file has the following structure:
{code:java}
required group field_id=0 spark_schema {
  optional group field_id=1 a (Map) {
    repeated group field_id=2 key_value {
      required binary field_id=3 key (String);
      optional group field_id=4 value (Map) {
        repeated group field_id=5 key_value {
          required int32 field_id=6 key;
          required boolean field_id=7 value;
        }
      }
    }
  }
  required int32 field_id=8 b;
  required double field_id=9 c;
} {code}
When I print it using DebugPrint, I see:
{code:java}
$ ./parquet-reader nested_maps.snappy.parquet --only-metadata
<some text is omitted for the sake of readability>
Column 0: a.key_value.key (BYTE_ARRAY/UTF8)
Column 1: a.key_value.value.key_value.key (INT32)
Column 2: a.key_value.value.key_value.value (BOOLEAN)
Column 3: b (INT32)
Column 4: c (DOUBLE)
</some text is omitted for the sake of readability>{code}
When I print it using JSONPrint, I see:
{code:java}
$ ./parquet-reader nested_maps.snappy.parquet --json
<some text is omitted for the sake of readability>
"Columns": [
  { "Id": "0", "Name": "key", "PhysicalType": "BYTE_ARRAY", "ConvertedType": "UTF8", "LogicalType": {"Type": "String"} },
  { "Id": "1", "Name": "key", "PhysicalType": "INT32", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
  { "Id": "2", "Name": "value", "PhysicalType": "BOOLEAN", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
  { "Id": "3", "Name": "b", "PhysicalType": "INT32", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
  { "Id": "4", "Name": "c", "PhysicalType": "DOUBLE", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} }
]
</some text is omitted for the sake of readability>{code}
Column 0 and Column 1 have the same Name in the JSON output, which is very confusing. It would be more correct to output the full path of the column (key -> a.key_value.key).

 

This can be corrected by changing a single line: [https://github.com/apache/arrow/blob/master/cpp/src/parquet/printer.cc#L218]
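
To make the difference between the two outputs concrete, here is a minimal, self-contained sketch of printing only the leaf name versus printing the dot-joined path. `LeafColumn`, `LeafName` and `DottedPath` are hypothetical stand-ins written for this illustration. They are not the Parquet C++ API and not the contents of the attached patch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a Parquet leaf column: the path from the schema
// root down to the leaf, one element per nesting level.
struct LeafColumn {
  std::vector<std::string> path;
};

// Leaf name only. For the file above this yields "key" for both column 0
// and column 1, which is the ambiguity seen in the JSON output.
std::string LeafName(const LeafColumn& col) {
  return col.path.empty() ? std::string() : col.path.back();
}

// Dot-joined full path, matching the form shown by --only-metadata,
// e.g. "a.key_value.key".
std::string DottedPath(const LeafColumn& col) {
  std::string out;
  for (std::size_t i = 0; i < col.path.size(); ++i) {
    if (i > 0) out += '.';
    out += col.path[i];
  }
  return out;
}
```

With paths {"a", "key_value", "key"} and {"a", "key_value", "value", "key_value", "key"}, `LeafName` returns the same string for both columns, while `DottedPath` returns two distinct strings.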

 

The proposed patch is in the attachment.


> [C++][Parquet] Data inconsistency in parquet-reader output modes
> ----------------------------------------------------------------
>
>                 Key: ARROW-10514
>                 URL: https://issues.apache.org/jira/browse/ARROW-10514
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Zosimova Zhanna
>            Priority: Minor
>         Attachments: 0001-Make-the-column-name-the-same-for-both-output-format.patch
>


