Posted to issues@arrow.apache.org by "Tonnam Balankura (Jira)" <ji...@apache.org> on 2020/07/27 19:01:00 UTC

[jira] [Created] (ARROW-9573) [Python] Parquet doesn't load when partitioned column starts with '_'

Tonnam Balankura created ARROW-9573:
---------------------------------------

             Summary: [Python] Parquet doesn't load when partitioned column starts with '_'
                 Key: ARROW-9573
                 URL: https://issues.apache.org/jira/browse/ARROW-9573
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.0
            Reporter: Tonnam Balankura


When loading a Parquet dataset that is partitioned on a column whose name starts with an underscore '_', nothing is loaded, and no exception is raised. Loading the same dataset worked for me in pyarrow 0.17.1, but it does not work in pyarrow 1.0.0.

However, loading a dataset with a partition column starting with '_' is possible with the `use_legacy_dataset=True` option. Also, when the column that starts with an underscore is not a partition column, loading works as expected.

{code:python}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pandas as pd
>>> df1 = pd.DataFrame(data={'_COL_1': [1, 2], 'COL_2': [3, 4], 'COL_3': [5, 6]})
>>> table1 = pa.Table.from_pandas(df1)
>>> pq.write_to_dataset(table1, partition_cols=['_COL_1', 'COL_2'], root_path='test_parquet1')
>>> df_pq1 = pq.read_table('test_parquet1')
>>> df_pq1
pyarrow.Table
>>> len(df_pq1)
0
>>> df_pq1_legacy = pq.read_table('test_parquet1', use_legacy_dataset=True)
>>> df_pq1_legacy
pyarrow.Table
COL_3: int64
_COL_1: dictionary<values=int64, indices=int32, ordered=0>
COL_2: dictionary<values=int64, indices=int32, ordered=0>
>>> len(df_pq1_legacy)
2
>>> df2 = pd.DataFrame(data={'COL_1': [1, 2], 'COL_2': [3, 4], '_COL_3': [5, 6]})
>>> table2 = pa.Table.from_pandas(df2)
>>> pq.write_to_dataset(table2, partition_cols=['COL_1', 'COL_2'], root_path='test_parquet2')
>>> df_pq2 = pq.read_table('test_parquet2')
>>> df_pq2
pyarrow.Table
_COL_3: int64
COL_1: int32
COL_2: int32
>>> len(df_pq2)
2
{code}
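
To check whether the data files were written and only the read side skips them, the directory tree can be inspected without pyarrow. The following is a minimal sketch, not a confirmed diagnosis: it assumes the hive-style `key=value` directory layout, and it assumes (unverified in this report) that the new reader ignores paths whose components start with '_' or '.'. The helper name `find_underscore_partitions` and the file name `part-0.parquet` are hypothetical.

```python
import os
import tempfile


def find_underscore_partitions(root_path, prefixes=("_", ".")):
    """Return .parquet files (relative to root_path) that sit below a
    directory whose name starts with one of `prefixes`."""
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(root_path):
        rel_dir = os.path.relpath(dirpath, root_path)
        # Directory components between the dataset root and this directory.
        parts = [] if rel_dir == "." else rel_dir.split(os.sep)
        if any(part.startswith(prefixes) for part in parts):
            for name in filenames:
                if name.endswith(".parquet"):
                    flagged.append(os.path.join(rel_dir, name))
    return sorted(flagged)


# Build an empty stand-in for the layout of 'test_parquet1' (assumed
# hive-style directories; the files here are empty placeholders).
with tempfile.TemporaryDirectory() as root:
    for sub in ("_COL_1=1/COL_2=3", "_COL_1=2/COL_2=4"):
        leaf = os.path.join(root, *sub.split("/"))
        os.makedirs(leaf)
        open(os.path.join(leaf, "part-0.parquet"), "wb").close()
    flagged = find_underscore_partitions(root)

for path in flagged:
    print(path)
```

Running `find_underscore_partitions('test_parquet1')` on the real dataset would list the files that such a prefix filter would hide. If that list covers every data file, it would be consistent with the empty table shown above.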


