You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/11/19 23:31:00 UTC

[jira] [Comment Edited] (ARROW-14772) [Python] unexpected content after groupby on a dataframe restored from partitioned parquet with filters

    [ https://issues.apache.org/jira/browse/ARROW-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446720#comment-17446720 ] 

Weston Pace edited comment on ARROW-14772 at 11/19/21, 11:30 PM:
-----------------------------------------------------------------

It seems that pandas [groupby function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) has an "observed" parameter:

{noformat}
observed : bool, default False

This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.
{noformat}

Does this help?

{quote}Another point which is rather minor compared to the considerations outlined above is the conversion of data column to categorical type upon recovery. Is my assumption correct that this would be true for any column that was used to partition dataset?
{quote}

Yes and no (note, panda's "categorical" is the same as arrow's "dictionary" so when I say "dictionary" here I am referring to the same concept).  Pyarrow exposes this as an option in the newer dataset API and it actually defaults to *not* dictionary encoded so you have to go out of your way a little to get this behavior.

However, the legacy dataset API (passing a directory name to pq.read_table) always sets "infer_dictionary" to true.  This is the API that is used by the pandas' pyarrow engine.  So, as a workaround, you could do something like...

{noformat}
import pyarrow.dataset as ds
restored_df = ds.dataset(dataset_dir, partitioning='hive').to_table(filter=ds.field('date') == str(start_date)).to_pandas()
{noformat}


was (Author: westonpace):
It seems that pandas [groupby function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) has an "observed" parameter:

{noformat}
observed : bool, default False

This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.
{noformat}

Does this help?

{quote}Another point which is rather minor compared to the considerations outlined above is the conversion of data column to categorical type upon recovery. Is my assumption correct that this would be true for any column that was used to partition dataset?
{quote}

Yes and no (note, panda's "categorical" is the same as arrow's "dictionary" so when I say "dictionary" here I am referring to the same concept).  Pyarrow exposes this as an option in the newer dataset API and it actually defaults to *not* dictionary encoded so you have to go out of your way a little to get this behavior.

However, the legacy dataset API (passing a directory name to pq.read_table) always sets "infer_dictionary" to true.  This is the API that is used by the pandas / python integration.  So, as a workaround, you could do something like...

{noformat}
import pyarrow.dataset as ds
restored_df = ds.dataset(dataset_dir, partitioning='hive').to_table(filter=ds.field('date') == str(start_date)).to_pandas()
{noformat}

> [Python] unexpected content after groupby on a dataframe restored from partitioned parquet with filters
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14772
>                 URL: https://issues.apache.org/jira/browse/ARROW-14772
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 6.0.1
>            Reporter: Vadim Mironov
>            Priority: Major
>              Labels: scanner
>
> While experimenting with the partitioned dataset persistence in parquet, I stumbled upon an interesting feature (or bug?) where after restoring only a certain partition and applying groupby I suddenly get all the filtered rows in the dataframe. 
> Following code demonstrates the issue:
> {code:java}
> import numpy as np
> import os
> import pandas as pd  # 1.3.4
> import pyarrow as pa  # 6.0.1
> import random
> import shutil
> import string
> import tempfile
> from datetime import datetime, timedelta
> if __name__ == '__main__':
>     # 1. generate random data frame
>     day_count = 5
>     data_length = 10
>     numpy_random_gen = np.random.default_rng()
>     label_choices = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8)) for _ in range(5)]
>     partial_dfs = []
>     start_date = datetime.today().date() - timedelta(days=day_count)
>     for date in (start_date + timedelta(n) for n in range(day_count)):
>         date_array = pd.to_datetime(np.full(data_length, date)).date
>         label_array = np.full(data_length, [random.choice(label_choices) for _ in range(data_length)])
>         value_array = numpy_random_gen.integers(low=1, high=500, size=data_length)
>         partial_dfs.append(pd.DataFrame(data={'date': date_array, 'label': label_array, 'value': value_array}))
>     df = pd.concat(partial_dfs, ignore_index=True)
>     print(f"Unique dates before restore:\n{df.drop_duplicates(subset='date')['date']}")
>     # 2. persist data frame partitioned by date
>     dataset_dir = tempfile.mkdtemp()
>     df.to_parquet(path=dataset_dir, engine='pyarrow', partition_cols=['date', 'label'])
>     # 3. restore from parquet partitioned dataset
>     restored_df = pd.read_parquet(dataset_dir, engine='pyarrow', filters=[
>         ('date', '=', str(start_date))], use_legacy_dataset=False)
>     print(f"Unique dates after restore:\n{restored_df.drop_duplicates(subset='date')['date']}")
>     group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
>     print(group_by_df)
>     shutil.rmtree(dataset_dir) {code}
> It correctly reports five unique dates upon random df generation and correctly reports only one after reading back from parquet:
> {noformat}
> Unique dates after restore:
> 0    2021-11-13
> Name: date, dtype: category
> Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15', '2021-11-16', '2021-11-17']{noformat}
> Albeit it adds that there are 5 categories. When subsequently I perform a groupby, all dates that were filtered out at read miracolously appear:
> {code:java}
>     group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
>     print(group_by_df)
> {code}
> With the following output:
> {noformat}
>           date     label  val_sum
> 0   2021-11-13  04LOXJCH      494
> 1   2021-11-13  4QOZ321D      819
> 2   2021-11-13  GG6YO5FS      394
> 3   2021-11-13  J7ZD3LDS      203
> 4   2021-11-13  TFVIXE6L      164
> 5   2021-11-14  04LOXJCH        0
> 6   2021-11-14  4QOZ321D        0
> 7   2021-11-14  GG6YO5FS        0
> 8   2021-11-14  J7ZD3LDS        0
> 9   2021-11-14  TFVIXE6L        0
> 10  2021-11-15  04LOXJCH        0
> 11  2021-11-15  4QOZ321D        0
> 12  2021-11-15  GG6YO5FS        0
> 13  2021-11-15  J7ZD3LDS        0
> 14  2021-11-15  TFVIXE6L        0
> 15  2021-11-16  04LOXJCH        0
> 16  2021-11-16  4QOZ321D        0
> 17  2021-11-16  GG6YO5FS        0
> 18  2021-11-16  J7ZD3LDS        0
> 19  2021-11-16  TFVIXE6L        0
> 20  2021-11-17  04LOXJCH        0
> 21  2021-11-17  4QOZ321D        0
> 22  2021-11-17  GG6YO5FS        0
> 23  2021-11-17  J7ZD3LDS        0
> 24  2021-11-17  TFVIXE6L        0{noformat}
> Perhaps I am doing something incorrectly within read_parquet call or something, but my expectation would be for filtered data just be gone after the read operation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)