Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2022/06/29 21:33:00 UTC
[jira] [Commented] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns
[ https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560667#comment-17560667 ]
Jonathan Keane commented on ARROW-16700:
----------------------------------------
[~westonpace] I'm not sure whether this is related to ARROW-16904 or ARROW-16807, but it is another wrong-data ticket we should take a look at.
> [C++] [R] [Datasets] aggregates on partitioning columns
> -------------------------------------------------------
>
> Key: ARROW-16700
> URL: https://issues.apache.org/jira/browse/ARROW-16700
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Reporter: Jonathan Keane
> Priority: Blocker
> Fix For: 9.0.0, 8.0.1
>
>
> When summarizing a whole dataset (without group_by) with an aggregate over a partitioning column, arrow returns wrong data:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> df <- expand.grid(
>   some_nulls = c(0L, 1L, 2L),
>   year = 2010:2023,
>   month = 1:12,
>   day = 1:30
> )
> path <- tempfile()
> dir.create(path)
> write_dataset(df, path, partitioning = c("year", "month"))
> ds <- open_dataset(path)
> # with arrow the mins/maxes are off for partitioning columns
> ds %>%
>   summarise(
>     n = n(),
>     min_year = min(year), min_month = min(month), min_day = min(day),
>     max_year = max(year), max_month = max(month), max_day = max(day)
>   ) %>%
>   collect()
> #> # A tibble: 1 × 7
> #>       n min_year min_month min_day max_year max_month max_day
> #>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
> #> 1 15120     2023         1       1     2023        12      30
> # compared to what we get with dplyr
> df %>%
>   summarise(
>     n = n(),
>     min_year = min(year), min_month = min(month), min_day = min(day),
>     max_year = max(year), max_month = max(month), max_day = max(day)
>   ) %>%
>   collect()
> #>       n min_year min_month min_day max_year max_month max_day
> #> 1 15120     2010         1       1     2023        12      30
> # even min alone is off:
> ds %>%
>   summarise(min_year = min(year)) %>%
>   collect()
> #> # A tibble: 1 × 1
> #>   min_year
> #>      <int>
> #> 1     2016
>
> # but non-partitioning columns are fine:
> ds %>%
>   summarise(min_day = min(day)) %>%
>   collect()
> #> # A tibble: 1 × 1
> #>   min_day
> #>     <int>
> #> 1       1
>
>
> # But with a group_by, this seems ok
> ds %>%
>   group_by(some_nulls) %>%
>   summarise(min_year = min(year)) %>%
>   collect()
> #> # A tibble: 3 × 2
> #>   some_nulls min_year
> #>        <int>    <int>
> #> 1          0     2010
> #> 2          1     2010
> #> 3          2     2010
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)