Posted to issues@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2022/05/31 20:09:00 UTC

[jira] [Created] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns

Jonathan Keane created ARROW-16700:
--------------------------------------

             Summary: [C++] [R] [Datasets] aggregates on partitioning columns
                 Key: ARROW-16700
                 URL: https://issues.apache.org/jira/browse/ARROW-16700
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, R
            Reporter: Jonathan Keane


When summarizing a whole dataset (without group_by) and aggregating over a partitioning column, arrow returns wrong results:

{code:r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- expand.grid(
  some_nulls = c(0L, 1L, 2L),
  year = 2010:2023,
  month = 1:12,
  day = 1:30
)

path <- tempfile()
dir.create(path)
write_dataset(df, path, partitioning = c("year", "month"))

ds <- open_dataset(path)

# with arrow the mins/maxes are off for partitioning columns
ds %>%
  summarise(n = n(), min_year = min(year), min_month = min(month), min_day = min(day), max_year = max(year), max_month = max(month), max_day = max(day)) %>% 
  collect()
#> # A tibble: 1 × 7
#>       n min_year min_month min_day max_year max_month max_day
#>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
#> 1 15120     2023         1       1     2023        12      30

# compared to what we get with dplyr
df %>%
  summarise(n = n(), min_year = min(year), min_month = min(month), min_day = min(day), max_year = max(year), max_month = max(month), max_day = max(day)) %>% 
  collect()
#>       n min_year min_month min_day max_year max_month max_day
#> 1 15120     2010         1       1     2023        12      30

# even min alone is off:
ds %>%
  summarise(min_year = min(year)) %>% 
  collect()
#> # A tibble: 1 × 1
#>   min_year
#>      <int>
#> 1     2016
  
# but non-partitioning columns are fine:
ds %>%
  summarise(min_day = min(day)) %>% 
  collect()
#> # A tibble: 1 × 1
#>   min_day
#>     <int>
#> 1       1
  
  
# But with a group_by, this seems ok
ds %>%
  group_by(some_nulls) %>%
  summarise(min_year = min(year)) %>% 
  collect()
#> # A tibble: 3 × 2
#>   some_nulls min_year
#>        <int>    <int>
#> 1          0     2010
#> 2          1     2010
#> 3          2     2010
{code}
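
For reference, the expected values can be derived from the input data frame alone, without arrow or dplyr. The following is a minimal sketch in base R, assuming the same expand.grid input as in the report; the name "expected" is only illustrative and not part of any API:

{code:r}
# Base R only: rebuild the input data frame from the report and compute
# the reference values that the ungrouped dataset query should reproduce.
df <- expand.grid(
  some_nulls = c(0L, 1L, 2L),
  year = 2010:2023,
  month = 1:12,
  day = 1:30
)

# "expected" is a hypothetical name used only for this sketch
expected <- data.frame(
  n = nrow(df),
  min_year = min(df$year),
  min_month = min(df$month),
  min_day = min(df$day),
  max_year = max(df$year),
  max_month = max(df$month),
  max_day = max(df$day)
)

print(expected)
{code}

A collected dataset result could then be checked against these values, for example with all.equal() after converting it with as.data.frame(), although the exact comparison may need adjusting for column types.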



--
This message was sent by Atlassian Jira
(v8.20.7#820007)