Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/01/22 15:37:00 UTC
[jira] [Commented] (ARROW-11344) [Python] Data of struct fields are out-of-order in parquet files created by the write_table() method

    [ https://issues.apache.org/jira/browse/ARROW-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270222#comment-17270222 ] 

Weston Pace commented on ARROW-11344:
-------------------------------------

Thank you for creating such a detailed test case. I ran your test against pyarrow 2.0.0 and can confirm that I get the same results you do. Fortunately, when I ran it against the latest code I did not see this error, and I confirmed that the full_name.name column aligned with the fruit_name column. We have recently fixed several struct-related issues, such as ARROW-10493, and my assumption is that you encountered one of those.

We are on the verge of releasing 3.0.0. There is an RC available at https://bintray.com/apache/arrow/python-rc/3.0.0-rc2#files/python-rc/3.0.0-rc2 if you would like to test this behavior out yourself sooner.
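
For anyone who wants to repeat the alignment check, here is a minimal sketch. It assumes, based only on the sample output in the report, that full_name.name should equal fruit_name plus ".csv". The helper names find_misaligned_rows and check_parquet_file are hypothetical and not part of pyarrow.

```python
def find_misaligned_rows(full_names, fruit_names, suffix=".csv"):
    """Return the row indices where full_name.name != fruit_name + suffix.

    Assumption (taken from the sample data in the report, not from any
    pyarrow contract): each struct's "name" field is the fruit name
    followed by ".csv".
    """
    mismatches = []
    for i, (full_name, fruit) in enumerate(zip(full_names, fruit_names)):
        # A struct value reads back as a dict, or None if the struct is null.
        name = full_name.get("name") if full_name is not None else None
        expected = None if fruit is None else fruit + suffix
        if name != expected:
            mismatches.append(i)
    return mismatches


def check_parquet_file(path):
    """Read a Parquet file with pyarrow and report misaligned row indices."""
    # Imported lazily so the pure-Python check above works without pyarrow.
    import pyarrow.parquet as pq

    columns = pq.read_table(path).to_pydict()
    return find_misaligned_rows(columns["full_name"], columns["fruit_name"])
```

Calling check_parquet_file('./test_struct_200.parquet') on a file written by the script in the report should return an empty list when the two columns line up, and the indices of the offending rows otherwise.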

 

> [Python] Data of struct fields are out-of-order in parquet files created by the write_table() method
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11344
>                 URL: https://issues.apache.org/jira/browse/ARROW-11344
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Chen Ming
>            Priority: Major
>         Attachments: test_struct.csv, test_struct_200.parquet, test_struct_200.py, test_struct_200_flat.parquet, test_struct_200_flat.py
>
>
> Hi,
> We recently found an out-of-order issue with the 'struct' data type and would like to know if you can help find the root cause.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.read_csv('./test_struct.csv')
> print(df.dtypes)
> df['full_name'] = df.apply(lambda x: {"package": x['file_package'], "name": x["file_name"]}, axis=1)
> my_df = df.drop(['file_package', 'file_name'], axis=1)
> file_fields = [('package', pa.string()), ('name', pa.string()),]
> my_schema = pa.schema([pa.field('full_name', pa.struct(file_fields)),
>                        pa.field('fruit_name', pa.string())])
> my_table = pa.Table.from_pandas(my_df, schema = my_schema)
> print('Table schema:')
> print(my_table.schema)
> pq.write_table(my_table, './test_struct_200.parquet')
> {code}
> The above code (attached as test_struct_200.py) was run with the following Python packages:
> {code:java}
> Pandas Version = 1.1.3
> PyArrow Version = 2.0.0
> {code}
> I then used parquet-tools (1.11.1) to read the file, but got the following output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
> ...
> full_name:
> .package = fruit.zip
> .name = apple.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = apple.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = apple.csv
> fruit_name = strawberry
> {code}
> (BTW, you can also view the Parquet file with http://parquet-viewer-online.com/)
> The output should be (see test_struct.csv):
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
> ...
> full_name:
> .package = fruit.zip
> .name = strawberry.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = strawberry.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = strawberry.csv
> fruit_name = strawberry
> {code}
> For comparison, the following code (attached as test_struct_200_flat.py) generates a Parquet file whose data matches test_struct.csv:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.read_csv('./test_struct.csv')
> print(df.dtypes)
> my_schema = pa.schema([pa.field('file_package', pa.string()),
>                        pa.field('file_name', pa.string()),
>                        pa.field('fruit_name', pa.string())])
> my_table = pa.Table.from_pandas(df, schema = my_schema)
> print('Table schema:')
> print(my_table.schema)
> pq.write_table(my_table, './test_struct_200_flat.parquet')
> {code}
> I have also attached the two Parquet files for your reference.


