Posted to jira@arrow.apache.org by "Micah Kornfield (Jira)" <ji...@apache.org> on 2021/02/04 19:16:00 UTC
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

    [ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279080#comment-17279080 ] 

Micah Kornfield commented on ARROW-11497:
-----------------------------------------

Is this causing a problem in practice?  There is a C++ option [https://github.com/apache/arrow/blob/1c18706ac9e49e3e9b4998354f213a304e82d367/cpp/src/parquet/properties.h#L689] that makes the writer name the field "element" [https://github.com/apache/arrow/blob/9b195493409ad434cbc42b0e666603c6471a9bae/cpp/src/parquet/arrow/schema.cc#L82].

 

We could expose this in Python.

 

I think the main reason it isn't enabled by default is that it breaks round trips for Arrow data.  This could potentially be fixed on the reader side as well.  I can't find a reference, but I think this might also have some impact on pandas<->Parquet round-tripping.

> [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11497
>                 URL: https://issues.apache.org/jira/browse/ARROW-11497
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>            Reporter: Truc Lam Nguyen
>            Priority: Major
>         Attachments: parquet-tools-meta.log
>
>
> Sorry if this behavior is deliberate, but it looks like the Parquet writer for the list data type does not conform to the Apache Parquet list logical type specification.
> According to this page: [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists], the list type contains 3 levels, where the middle level, named {{list}}, must be a repeated group with a single field named _{{element}}_.
> However, in the Parquet file from the pyarrow writer, that single field is named _item_ instead.
> Please find below the example Python code that produces a Parquet file (I use pandas version 1.2.1 and pyarrow version 3.0.0):
> {code:python}
> import pandas as pd
>  
> df = pd.DataFrame(data=[ {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'}, {'name': 'star craft', 'version': '2'}]}, {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]}, ])
> df.to_parquet('/tmp/test.parquet', engine='pyarrow')
> {code}
> Then I use parquet-tools from [https://formulae.brew.sh/formula/parquet-tools] to check the metadata of the Parquet file via this command:
> parquet-tools meta /tmp/test.parquet
> The full metadata is in the attachment; here is only the excerpt for the list-type column:
> games: OPTIONAL F:1 
>  .list: REPEATED F:1 
>  ..item: OPTIONAL F:2 
>  ...name: OPTIONAL BINARY L:STRING R:1 D:4
>  ...version: OPTIONAL BINARY L:STRING R:1 D:4
> As can be seen, under {{list}} there is a single field named _item_.
> I think this should be named _element_ to conform with the Apache Parquet specification.
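
The naming mismatch described in the quoted issue can be checked mechanically. The following is a minimal sketch, not part of the original report, that uses only the Python standard library and the metadata excerpt quoted above. The helper names (parse_meta, nonconforming_list_fields) are hypothetical, and the parser assumes the dot-indented layout shown in that excerpt rather than the full parquet-tools output format.

```python
# Excerpt of the `parquet-tools meta` output quoted in the issue.
META_EXCERPT = """\
games: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL F:2
...name: OPTIONAL BINARY L:STRING R:1 D:4
...version: OPTIONAL BINARY L:STRING R:1 D:4
"""


def parse_meta(text):
    """Turn each dump line into a (depth, name, repetition) tuple.

    Depth is the number of leading dots (an assumption based on the
    excerpt above, not on the parquet-tools documentation).
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stripped = line.lstrip(".")
        depth = len(line) - len(stripped)
        name, _, rest = stripped.partition(":")
        parts = rest.split()
        repetition = parts[0] if parts else ""
        rows.append((depth, name, repetition))
    return rows


def nonconforming_list_fields(rows):
    """Find repeated `list` groups whose only child is not named `element`.

    Returns a list of (parent_name, child_names) pairs.
    """
    problems = []
    for i, (depth, name, repetition) in enumerate(rows):
        if name != "list" or repetition != "REPEATED":
            continue
        # Collect direct children: rows one level deeper, up to the
        # next row that is at the same depth or shallower.
        children = []
        for d, n, _ in rows[i + 1:]:
            if d <= depth:
                break
            if d == depth + 1:
                children.append(n)
        # The parent is the nearest preceding row one level up.
        parent = next(
            (n for d, n, _ in reversed(rows[:i]) if d == depth - 1), None
        )
        if children != ["element"]:
            problems.append((parent, children))
    return problems


if __name__ == "__main__":
    for parent, children in nonconforming_list_fields(parse_meta(META_EXCERPT)):
        print(f"{parent}: list child fields are {children}, expected ['element']")
```

Run against the excerpt, the check flags the games column because the field under list is named item; the same excerpt with that field renamed to element would pass.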


