Posted to issues@arrow.apache.org by "Christian Thiel (JIRA)" <ji...@apache.org> on 2019/02/25 14:30:00 UTC
[jira] [Commented] (ARROW-4538) [PYTHON] write_to_dataset() breaks
with dataframe with valid index name
[ https://issues.apache.org/jira/browse/ARROW-4538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776935#comment-16776935 ]
Christian Thiel commented on ARROW-4538:
----------------------------------------
I looked at the code and believe the problem is in parquet.py, line 1242.
The write_table routine does not accept a schema in which the dataframe index is defined. Hence we should remove the index columns from the schema in `write_to_dataset`. I have a working version and will create a pull request on GitHub.
> [PYTHON] write_to_dataset() breaks with dataframe with valid index name
> -----------------------------------------------------------------------
>
> Key: ARROW-4538
> URL: https://issues.apache.org/jira/browse/ARROW-4538
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.12.0
> Reporter: Christian Thiel
> Priority: Major
>
> When using {{pa.Table.from_pandas()}} with {{preserve_index=True}} and a named index ({{dataframe.index.name != None}}), the prefix {{__index_level_}} is not added to the corresponding schema name. This breaks {{write_to_dataset}} when partition columns are used.
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import os
> import shutil
> import pandas as pd
> import numpy as np
>
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> # Start from an empty output directory.
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
>
> # Ragged data needs an explicit object dtype on newer NumPy versions.
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan],
>                   dtype=object)
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df['arrays'] = pd.Series(arrays)
> df.index.name = 'ID'  # naming the index triggers the failure
>
> table = pa.Table.from_pandas(df, preserve_index=True)
> print(table.schema.names)
> pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
>                     partition_cols=['partition_column'],
>                     preserve_index=True
>                     )
> {code}
> The call succeeds if {{df.index.name='ID'}} is removed. It also succeeds if {{partition_cols}} is omitted from the {{write_to_dataset}} call.