Posted to issues@arrow.apache.org by "Christian Thiel (JIRA)" <ji...@apache.org> on 2019/02/12 09:17:00 UTC
[jira] [Created] (ARROW-4538) pa.Table.from_pandas() with df.index.name != None breaks write_to_dataset()
Christian Thiel created ARROW-4538:
--------------------------------------
Summary: pa.Table.from_pandas() with df.index.name != None breaks write_to_dataset()
Key: ARROW-4538
URL: https://issues.apache.org/jira/browse/ARROW-4538
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.12.0
Reporter: Christian Thiel
When {{pa.Table.from_pandas()}} is called with {{preserve_index=True}} and {{df.index.name}} is not {{None}}, the prefix {{__index_level_}} is not added to the corresponding schema name. This breaks {{write_to_dataset}} when partition columns are used.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import os
import shutil
import pandas as pd
import numpy as np
PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)
arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan], dtype=object)
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df['arrays'] = pd.Series(arrays)
df.index.name='ID'
table = pa.Table.from_pandas(df, preserve_index=True)
print(table.schema.names)
pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'],
                    preserve_index=True
                    )
{code}
The write succeeds if {{df.index.name='ID'}} is removed. It also succeeds if {{partition_cols}} is omitted from the {{write_to_dataset}} call.
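To see which case applies, the list printed by {{print(table.schema.names)}} can be checked for generated index names. The following is only a minimal sketch: the helper name {{generated_index_columns}} is hypothetical, and it assumes that generated index columns follow the {{__index_level_<n>__}} form (for example {{__index_level_0__}} for an unnamed single-level index).

```python
import re

# Assumed naming pattern for generated index columns, e.g. "__index_level_0__".
_GENERATED_INDEX = re.compile(r"^__index_level_\d+__$")


def generated_index_columns(schema_names):
    """Return the schema names that look like generated index columns.

    `schema_names` is a plain list of strings, such as the value of
    `table.schema.names` in the reproduction above.
    """
    return [name for name in schema_names if _GENERATED_INDEX.match(name)]


# Unnamed index: the generated name is present.
unnamed = generated_index_columns(["partition_column", "arrays", "__index_level_0__"])

# Index named 'ID' (the failing case in this report): no generated name is found.
named = generated_index_columns(["partition_column", "arrays", "ID"])
```

In the failing case the result is an empty list, because the index column appears under its own name ({{ID}}) without the prefix.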