Posted to jira@arrow.apache.org by "Alenka Frim (Jira)" <ji...@apache.org> on 2022/01/14 13:51:00 UTC

[jira] [Commented] (ARROW-10726) [Python] Reading multiple parquet files with different index column dtype (originating pandas) reads wrong data

    [ https://issues.apache.org/jira/browse/ARROW-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476153#comment-17476153 ] 

Alenka Frim commented on ARROW-10726:
-------------------------------------

One thing I noticed is that _pq.read_table_ and _pq.ParquetDataset + read()_ behave differently: the former fills the index with nulls, while the latter raises an error because the schemas do not match.

Example:
{code:python}
import os
import pandas as pd
os.makedirs('abc', exist_ok=True)  # the target directory must exist before writing
df1 = pd.DataFrame({'a':[1, 2, 3]}, index=['a','b','c'])
df1.to_parquet('abc/one.parquet')
df2 = pd.DataFrame({'a':[4, 5, 6]})
df2.to_parquet('abc/two.parquet')
df3 = pd.DataFrame({'a':[7, 8, 9]})
df3.to_parquet('abc/three.parquet')
pd.read_parquet('abc/')

import pyarrow as pa
import pyarrow.parquet as pq
table = pq.read_table('abc/')
table
{code}
Output:
{code:python}
pyarrow.Table
a: int64
__index_level_0__: string
----
a: [[1,2,3],[7,8,9],[4,5,6]]
__index_level_0__: [["a","b","c"],[null,null,null],[null,null,null]]
{code}
but
{code:python}
dataset = pq.ParquetDataset('abc/')
{code}
raises an error:
{code:python}
ValueError: Schema in abc//three.parquet was different. 
a: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 375

vs

a: int64
__index_level_0__: string
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 448
{code}
I would expect them to behave the same. If I understand correctly, _ParquetDatasetV2_ is called in both cases, but the schemas are checked only in the second one (_ParquetDataset_).

The idea of the issue was to concatenate the datasets the way pandas does:
{code:python}
pd.concat([df1, df2, df3])

   a
a  1
b  2
c  3
0  4
1  5
2  6
0  7
1  8
2  9
{code}
In this case, both _pq.read_table_ and _pq.ParquetDataset_ would need to be changed.

The other option is to add a schema check to _pq.read_table_ so that both cases raise an error.

> [Python] Reading multiple parquet files with different index column dtype (originating pandas) reads wrong data
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10726
>                 URL: https://issues.apache.org/jira/browse/ARROW-10726
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Alenka Frim
>            Priority: Major
>             Fix For: 8.0.0
>
>
> See https://github.com/pandas-dev/pandas/issues/38058


