Posted to jira@arrow.apache.org by "Chen Ming (Jira)" <ji...@apache.org> on 2021/01/22 06:59:00 UTC

[jira] [Created] (ARROW-11344) [Python] Data of struct fields are out-of-order in parquet files created by the write_table() method

Chen Ming created ARROW-11344:
---------------------------------

             Summary: [Python] Data of struct fields are out-of-order in parquet files created by the write_table() method
                 Key: ARROW-11344
                 URL: https://issues.apache.org/jira/browse/ARROW-11344
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
            Reporter: Chen Ming
         Attachments: test_struct.csv, test_struct_200.parquet, test_struct_200.py, test_struct_200_flat.parquet, test_struct_200_flat.py

Hi,

We recently found an out-of-order issue with the 'struct' data type and would like to know if you can help find the root cause.
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv('./test_struct.csv')
print(df.dtypes)
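# Pack file_package and file_name into one dict per row (the future struct column).
df['full_name'] = df.apply(lambda x: {"package": x['file_package'], "name": x["file_name"]}, axis=1)
# Drop the flat columns now that full_name carries their values.
my_df = df.drop(['file_package', 'file_name'], axis=1)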

file_fields = [('package', pa.string()), ('name', pa.string())]
my_schema = pa.schema([pa.field('full_name', pa.struct(file_fields)),
                       pa.field('fruit_name', pa.string())])
my_table = pa.Table.from_pandas(my_df, schema=my_schema)
print('Table schema:')
print(my_table.schema)

pq.write_table(my_table, './test_struct_200.parquet')
{code}
The above code (attached as test_struct_200.py) was run with the following Python package versions:
{code:java}
Pandas Version = 1.1.3
PyArrow Version = 2.0.0
{code}
Then I used parquet-tools (1.11.1) to read the file, but got the following output:
{code:java}
$ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
...
full_name:
.package = fruit.zip
.name = apple.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = apple.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = apple.csv
fruit_name = strawberry
{code}
(BTW, you can also view the parquet file with http://parquet-viewer-online.com/)

The expected output (see test_struct.csv) is:
{code:java}
$ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
...
full_name:
.package = fruit.zip
.name = strawberry.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = strawberry.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = strawberry.csv
fruit_name = strawberry
{code}
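To scan all 2181 records instead of reading them by eye, the parquet-tools output can be checked with a short script. The following is only a sketch: the helper names parse_records and inconsistent_records are hypothetical, and the rule that .name should equal fruit_name plus ".csv" is an assumption inferred from the sample records above, not something test_struct.csv is known to guarantee for every row.

```python
# Sketch: flag records in parquet-tools `head` output whose struct field
# `.name` does not agree with `fruit_name`.

def parse_records(text):
    """Split parquet-tools `head` output into one dict per record."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        if " = " in line:
            key, value = line.split(" = ", 1)
            # Struct members are printed with a leading dot (".name").
            current[key.lstrip(".")] = value
        # Lines without " = " (e.g. "full_name:" or "...") carry no value.
    if current:
        records.append(current)
    return records


def inconsistent_records(records):
    """Return records where name != fruit_name + '.csv' (assumed rule)."""
    return [r for r in records
            if r.get("name") != r.get("fruit_name", "") + ".csv"]


# Two records copied from the outputs above: one observed, one expected.
SAMPLE = """full_name:
.package = fruit.zip
.name = apple.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = strawberry.csv
fruit_name = strawberry
"""

bad = inconsistent_records(parse_records(SAMPLE))
```

In practice the text would come from redirecting the parquet-tools command above into a file and reading that file instead of SAMPLE.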
As a comparison, the following code (attached as test_struct_200_flat.py) generates a parquet file whose data matches test_struct.csv:
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv('./test_struct.csv')
print(df.dtypes)
my_schema = pa.schema([pa.field('file_package', pa.string()),
                       pa.field('file_name', pa.string()),
                       pa.field('fruit_name', pa.string())])
my_table = pa.Table.from_pandas(df, schema=my_schema)
print('Table schema:')
print(my_table.schema)

pq.write_table(my_table, './test_struct_200_flat.parquet')
{code}
I also attached the two parquet files for your reference.
