Posted to jira@arrow.apache.org by "Alenka Frim (Jira)" <ji...@apache.org> on 2022/01/14 13:51:00 UTC
[jira] [Commented] (ARROW-10726) [Python] Reading multiple parquet files with different index column dtype (originating pandas) reads wrong data
[ https://issues.apache.org/jira/browse/ARROW-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476153#comment-17476153 ]
Alenka Frim commented on ARROW-10726:
-------------------------------------
One thing I noticed is that _pq.read_table_ and _pq.ParquetDataset + read()_ behave differently: the former fills the index with nulls, while the latter raises an error because the schemas do not match.
Example:
{code:python}
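import os
import pandas as pd
# The target directory must exist before writing the files
os.makedirs('abc', exist_ok=True)
df1 = pd.DataFrame({'a':[1, 2, 3]}, index=['a','b','c'])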
df1.to_parquet('abc/one.parquet')
df2 = pd.DataFrame({'a':[4, 5, 6]})
df2.to_parquet('abc/two.parquet')
df3 = pd.DataFrame({'a':[7, 8, 9]})
df3.to_parquet('abc/three.parquet')
pd.read_parquet('abc/')
import pyarrow as pa
import pyarrow.parquet as pq
table = pq.read_table('abc/')
table
{code}
output:
{code:python}
pyarrow.Table
a: int64
__index_level_0__: string
----
a: [[1,2,3],[7,8,9],[4,5,6]]
__index_level_0__: [["a","b","c"],[null,null,null],[null,null,null]]
{code}
but
{code:python}
dataset = pq.ParquetDataset('abc/')
{code}
Errors:
{code:python}
ValueError: Schema in abc//three.parquet was different.
a: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 375
vs
a: int64
__index_level_0__: string
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 448
{code}
I would expect them to behave the same. If I understand correctly, _ParquetDatasetV2_ is called in both cases, but the schema is checked only in the second one (_pq.ParquetDataset_).
The idea of the issue was to concatenate the datasets the way pandas does:
{code:python}
pd.concat([df1, df2, df3])
   a
a  1
b  2
c  3
0  4
1  5
2  6
0  7
1  8
2  9
{code}
In this case, both _pq.read_table_ and _pq.ParquetDataset_ would need to be changed.
The other option is to add a schema check to _pq.read_table_ so that both cases raise an error.
> [Python] Reading multiple parquet files with different index column dtype (originating pandas) reads wrong data
> ---------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-10726
> URL: https://issues.apache.org/jira/browse/ARROW-10726
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Joris Van den Bossche
> Assignee: Alenka Frim
> Priority: Major
> Fix For: 8.0.0
>
>
> See https://github.com/pandas-dev/pandas/issues/38058