Posted to issues@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2022/05/31 20:09:00 UTC
[jira] [Created] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns
Jonathan Keane created ARROW-16700:
--------------------------------------
Summary: [C++] [R] [Datasets] aggregates on partitioning columns
Key: ARROW-16700
URL: https://issues.apache.org/jira/browse/ARROW-16700
Project: Apache Arrow
Issue Type: Bug
Components: C++, R
Reporter: Jonathan Keane
When summarising a whole dataset (without group_by), aggregating over a partitioning column returns incorrect results. In the examples below, min(year) comes back as 2023 or 2016 instead of 2010:
{code:r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
df <- expand.grid(
  some_nulls = c(0L, 1L, 2L),
  year = 2010:2023,
  month = 1:12,
  day = 1:30
)
path <- tempfile()
dir.create(path)
write_dataset(df, path, partitioning = c("year", "month"))
ds <- open_dataset(path)
# with arrow the mins/maxes are off for partitioning columns
ds %>%
  summarise(
    n = n(),
    min_year = min(year), min_month = min(month), min_day = min(day),
    max_year = max(year), max_month = max(month), max_day = max(day)
  ) %>%
  collect()
#> # A tibble: 1 × 7
#>       n min_year min_month min_day max_year max_month max_day
#>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
#> 1 15120     2023         1       1     2023        12      30
# compared to what we get with dplyr
df %>%
  summarise(
    n = n(),
    min_year = min(year), min_month = min(month), min_day = min(day),
    max_year = max(year), max_month = max(month), max_day = max(day)
  ) %>%
  collect()
#>       n min_year min_month min_day max_year max_month max_day
#> 1 15120     2010         1       1     2023        12      30
# even min alone is off:
ds %>%
  summarise(min_year = min(year)) %>%
  collect()
#> # A tibble: 1 × 1
#>   min_year
#>      <int>
#> 1     2016
# but non-partitioning columns are fine:
ds %>%
  summarise(min_day = min(day)) %>%
  collect()
#> # A tibble: 1 × 1
#>   min_day
#>     <int>
#> 1       1
# But with a group_by, this seems ok
ds %>%
  group_by(some_nulls) %>%
  summarise(min_year = min(year)) %>%
  collect()
#> # A tibble: 3 × 2
#>   some_nulls min_year
#>        <int>    <int>
#> 1          0     2010
#> 2          1     2010
#> 3          2     2010
{code}
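To make the discrepancy checkable rather than eyeballed, the comparison can be reduced to a small script. This is only a sketch, assuming the same data and partitioning as in the report; the names `expected`, `actual` and `matches` are introduced here for illustration and are not part of arrow or dplyr. It computes the reference values in memory with dplyr, runs the same ungrouped aggregate on the dataset, and compares the two column by column.

{code:r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Same data and partitioning as in the report.
df <- expand.grid(
  some_nulls = c(0L, 1L, 2L),
  year = 2010:2023,
  month = 1:12,
  day = 1:30
)
path <- tempfile()
dir.create(path)
write_dataset(df, path, partitioning = c("year", "month"))
ds <- open_dataset(path)

# Reference values, computed in memory with dplyr.
expected <- df %>%
  summarise(min_year = min(year), max_year = max(year))

# Same ungrouped aggregate, evaluated by arrow on the partitioned dataset.
actual <- ds %>%
  summarise(min_year = min(year), max_year = max(year)) %>%
  collect()

# Hypothetical helper vector: one logical per aggregate,
# TRUE where arrow agrees with dplyr.
matches <- c(
  min_year = actual$min_year == expected$min_year,
  max_year = actual$max_year == expected$max_year
)
print(matches)
{code}

Going by the output in the report, min_year would be expected to mismatch on an affected build, while both entries should be TRUE once the aggregate on partitioning columns is correct.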