Posted to issues@arrow.apache.org by "Tonnam Balankura (Jira)" <ji...@apache.org> on 2020/07/27 19:01:00 UTC
[jira] [Created] (ARROW-9573) [Python] Parquet doesn't load when partitioned column starts with '_'
Tonnam Balankura created ARROW-9573:
---------------------------------------
Summary: [Python] Parquet doesn't load when partitioned column starts with '_'
Key: ARROW-9573
URL: https://issues.apache.org/jira/browse/ARROW-9573
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 1.0.0
Reporter: Tonnam Balankura
When loading a Parquet dataset whose partition column name starts with an underscore ('_'), nothing is loaded and no exception is raised. Loading the same dataset worked for me in pyarrow 0.17.1, but it does not work in pyarrow 1.0.0.
On the other hand, loading a Parquet dataset with a partition column starting with '_' is possible with the `use_legacy_dataset=True` option. Also, when the column that starts with an underscore is not a partition column, loading seems to work as expected.
{code:python}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pandas as pd
>>> df1 = pd.DataFrame(data={'_COL_1': [1, 2], 'COL_2': [3, 4], 'COL_3': [5, 6]})
>>> table1 = pa.Table.from_pandas(df1)
>>> pq.write_to_dataset(table1, partition_cols=['_COL_1', 'COL_2'], root_path='test_parquet1')
>>> df_pq1 = pq.read_table('test_parquet1')
>>> df_pq1
pyarrow.Table
>>> len(df_pq1)
0
>>> df_pq1_legacy = pq.read_table('test_parquet1', use_legacy_dataset=True)
>>> df_pq1_legacy
pyarrow.Table
COL_3: int64
_COL_1: dictionary<values=int64, indices=int32, ordered=0>
COL_2: dictionary<values=int64, indices=int32, ordered=0>
>>> len(df_pq1_legacy)
2
>>> df2 = pd.DataFrame(data={'COL_1': [1, 2], 'COL_2': [3, 4], '_COL_3': [5, 6]})
>>> table2 = pa.Table.from_pandas(df2)
>>> pq.write_to_dataset(table2, partition_cols=['COL_1', 'COL_2'], root_path='test_parquet2')
>>> df_pq2 = pq.read_table('test_parquet2')
>>> df_pq2
pyarrow.Table
_COL_3: int64
COL_1: int32
COL_2: int32
>>> len(df_pq2)
2
{code}
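To make the difference between the two datasets concrete, here is a minimal standard-library sketch of the directory layout that `write_to_dataset` is expected to produce, assuming the usual hive-style `column=value` naming. The helpers `hive_partition_dir` and `underscore_segments` are hypothetical names introduced only for illustration; they are not part of pyarrow. One possible explanation, which this report does not confirm, is that path segments beginning with '_' are treated as hidden during dataset discovery; the sketch only shows which segments such a rule would match.

```python
from pathlib import PurePosixPath


def hive_partition_dir(root, partition_cols, row):
    """Build the assumed hive-style directory for one row: root/col=value/..."""
    path = PurePosixPath(root)
    for col in partition_cols:
        path = path / f"{col}={row[col]}"
    return path


def underscore_segments(path, root):
    """Return the path segments below root whose names start with '_'."""
    rel = PurePosixPath(path).relative_to(root)
    return [part for part in rel.parts if part.startswith("_")]


# Partition values taken from the two DataFrames in the reproduction above.
rows1 = [{"_COL_1": 1, "COL_2": 3}, {"_COL_1": 2, "COL_2": 4}]
rows2 = [{"COL_1": 1, "COL_2": 3}, {"COL_1": 2, "COL_2": 4}]

dirs1 = [hive_partition_dir("test_parquet1", ["_COL_1", "COL_2"], r) for r in rows1]
dirs2 = [hive_partition_dir("test_parquet2", ["COL_1", "COL_2"], r) for r in rows2]

# Segments that a hypothetical "skip names starting with '_'" rule would match.
flagged1 = [underscore_segments(d, "test_parquet1") for d in dirs1]
flagged2 = [underscore_segments(d, "test_parquet2") for d in dirs2]

for d, f in zip(dirs1 + dirs2, flagged1 + flagged2):
    print(d, "->", f)
```

Under these assumptions, every directory of `test_parquet1` sits below a segment starting with '_', while no directory of `test_parquet2` does, because there the underscore column is stored inside the data files rather than in a directory name.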