Posted to jira@arrow.apache.org by "Vadim Mironov (Jira)" <ji...@apache.org> on 2021/11/18 21:14:00 UTC
[jira] [Updated] (ARROW-14772) [Python] unexpected content after groupby on a dataframe restored from partitioned parquet with filters
[ https://issues.apache.org/jira/browse/ARROW-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vadim Mironov updated ARROW-14772:
----------------------------------
Description:
While experimenting with the partitioned dataset persistence in parquet, I stumbled upon an interesting feature (or bug?) where after restoring only a certain partition and applying groupby I suddenly get all the filtered rows in the dataframe.
Following code demonstrates the issue:
{code:python}
import numpy as np
import pandas as pd  # 1.3.4
import pyarrow as pa  # 6.0.1
import random
import shutil
import string
import tempfile
from datetime import datetime, timedelta

if __name__ == '__main__':
    # 1. generate a random data frame covering five consecutive dates
    day_count = 5
    data_length = 10
    numpy_random_gen = np.random.default_rng()
    label_choices = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8)) for _ in range(5)]
    partial_dfs = []
    start_date = datetime.today().date() - timedelta(days=day_count)
    for date in (start_date + timedelta(n) for n in range(day_count)):
        date_array = pd.to_datetime(np.full(data_length, date)).date
        label_array = np.full(data_length, [random.choice(label_choices) for _ in range(data_length)])
        value_array = numpy_random_gen.integers(low=1, high=500, size=data_length)
        partial_dfs.append(pd.DataFrame(data={'date': date_array, 'label': label_array, 'value': value_array}))
    df = pd.concat(partial_dfs, ignore_index=True)
    print(f"Unique dates before restore:\n{df.drop_duplicates(subset='date')['date']}")
    # 2. persist the data frame partitioned by date and label
    dataset_dir = tempfile.mkdtemp()
    df.to_parquet(path=dataset_dir, engine='pyarrow', partition_cols=['date', 'label'])
    # 3. restore from the partitioned parquet dataset, keeping only the first date
    restored_df = pd.read_parquet(dataset_dir, engine='pyarrow',
                                  filters=[('date', '=', str(start_date))],
                                  use_legacy_dataset=False)
    print(f"Unique dates after restore:\n{restored_df.drop_duplicates(subset='date')['date']}")
    group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
    print(group_by_df)
    shutil.rmtree(dataset_dir)
{code}
It correctly reports five unique dates when the random dataframe is generated, and only one unique date after reading back from parquet:
{noformat}
Unique dates after restore:
0 2021-11-13
Name: date, dtype: category
Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15', '2021-11-16', '2021-11-17']{noformat}
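If it helps with triage: the restored partition column comes back as a pandas category dtype, and that dtype remembers every partition value present on disk, not just the ones that passed the filter. A minimal sketch of that dtype behavior, independent of parquet (the date strings below are just placeholders):
{code:python}
import pandas as pd

# A categorical column can carry categories that never occur in the data --
# after a filtered read, the partition column's dtype still lists every
# partition value that exists on disk.
s = pd.Series(pd.Categorical(
    ['2021-11-13'] * 3,
    categories=['2021-11-13', '2021-11-14', '2021-11-15',
                '2021-11-16', '2021-11-17']))
print(len(s.cat.categories))  # 5 categories, although only 1 is used
print(s.nunique())            # 1 value actually present

# Dropping the unused categories makes the dtype match the data again.
s = s.cat.remove_unused_categories()
print(len(s.cat.categories))  # 1
{code}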
Note, however, that the dtype still lists 5 categories. When I subsequently perform a groupby, all the dates that were filtered out at read time miraculously reappear:
{code:python}
group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
print(group_by_df)
{code}
{noformat}
date label val_sum
0 2021-11-13 04LOXJCH 494
1 2021-11-13 4QOZ321D 819
2 2021-11-13 GG6YO5FS 394
3 2021-11-13 J7ZD3LDS 203
4 2021-11-13 TFVIXE6L 164
5 2021-11-14 04LOXJCH 0
6 2021-11-14 4QOZ321D 0
7 2021-11-14 GG6YO5FS 0
8 2021-11-14 J7ZD3LDS 0
9 2021-11-14 TFVIXE6L 0
10 2021-11-15 04LOXJCH 0
11 2021-11-15 4QOZ321D 0
12 2021-11-15 GG6YO5FS 0
13 2021-11-15 J7ZD3LDS 0
14 2021-11-15 TFVIXE6L 0
15 2021-11-16 04LOXJCH 0
16 2021-11-16 4QOZ321D 0
17 2021-11-16 GG6YO5FS 0
18 2021-11-16 J7ZD3LDS 0
19 2021-11-16 TFVIXE6L 0
20 2021-11-17 04LOXJCH 0
21 2021-11-17 4QOZ321D 0
22 2021-11-17 GG6YO5FS 0
23 2021-11-17 J7ZD3LDS 0
24 2021-11-17 TFVIXE6L 0
{noformat}
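My guess (unconfirmed) is that this comes from pandas rather than pyarrow: with categorical group keys, groupby defaults to observed=False and emits one row per possible category combination, with sum() returning 0 for the empty groups. A minimal sketch, independent of parquet, with observed=False passed explicitly:
{code:python}
import pandas as pd

# Two categorical key columns whose dtypes know about a category
# ('2021-11-14') that never actually occurs in the data.
df = pd.DataFrame({
    'date': pd.Categorical(['2021-11-13', '2021-11-13'],
                           categories=['2021-11-13', '2021-11-14']),
    'label': pd.Categorical(['A', 'B'], categories=['A', 'B']),
    'value': [1, 2],
})

# observed=False (the historical default) yields the full cartesian
# product of categories, filling the empty groups' sums with 0.
full = df.groupby(['date', 'label'], observed=False)['value'].sum().reset_index(name='val_sum')
print(len(full))  # 4 rows: 2 dates x 2 labels

# observed=True keeps only the combinations actually present.
seen = df.groupby(['date', 'label'], observed=True)['value'].sum().reset_index(name='val_sum')
print(len(seen))  # 2 rows
{code}
So in my reproduction, passing observed=True to the groupby call (or calling remove_unused_categories() on the restored columns first) should leave only the filtered date, which would mean the extra rows are dtype metadata rather than filtered rows surviving the read.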
Perhaps I am doing something incorrect in the read_parquet call, but my expectation would be for the filtered data to simply be gone after the read operation.
> [Python] unexpected content after groupby on a dataframe restored from partitioned parquet with filters
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-14772
> URL: https://issues.apache.org/jira/browse/ARROW-14772
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 6.0.1
> Reporter: Vadim Mironov
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)