Posted to dev@arrow.apache.org by "Vladimir (Jira)" <ji...@apache.org> on 2020/01/20 14:17:00 UTC

[jira] [Created] (ARROW-7617) [Python] Slices of Dataframes with Categorical columns are not respected in write_to_dataset

Vladimir created ARROW-7617:
-------------------------------

             Summary: [Python] Slices of Dataframes with Categorical columns are not respected in write_to_dataset
                 Key: ARROW-7617
                 URL: https://issues.apache.org/jira/browse/ARROW-7617
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
            Reporter: Vladimir


Hello,

It looks like slices of a DataFrame selected on a categorical column are not respected by write_to_dataset.

For the following dummy dataframe:

 
{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year
{code}
The slice by Year is saved to partitioned parquet properly:
{code:java}
table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_a.parquet', partition_cols=['Year'],
                    use_dictionary=True, compression='snappy')
{code}
However, if we convert Year to a pandas Categorical, the whole original dataframe is saved, not only the Year == 1990 slice:
{code:java}
x['Year'] = x['Year'].astype('category')

table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_b.parquet', partition_cols=['Year'],
                    use_dictionary=True, compression='snappy')
{code}
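As a possible lead, here is a minimal pandas-only sketch of one difference between the two cases: after filtering, the categorical Year column still carries every category of the original frame, even though only the 1990 rows remain. Whether this is what triggers the behaviour in write_to_dataset is an assumption and has not been verified here. Dropping the unused categories before conversion is shown only as something to try, not as a confirmed workaround.

```python
import numpy as np
import pandas as pd

# Same dummy frame as in the report.
d = pd.date_range('1990-01-01', freq='D', periods=10000)
x = pd.DataFrame(np.random.randn(len(d), 4), index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year
x['Year'] = x['Year'].astype('category')

# The slice itself contains only the 1990 rows ...
subset = x[x.Year == 1990]
print(len(subset))

# ... but the categorical dtype still lists every year of the original frame.
print(len(subset['Year'].cat.categories))

# Something to try (assumption, not a confirmed fix): drop unused categories
# before handing the frame to pa.Table.from_pandas.
trimmed = subset.copy()
trimmed['Year'] = trimmed['Year'].cat.remove_unused_categories()
print(list(trimmed['Year'].cat.categories))
```

If the trimmed frame is written correctly while the untrimmed one is not, that would point at the handling of unused categories during partitioning.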
 

 


