Posted to issues@arrow.apache.org by "Sam Albers (Jira)" <ji...@apache.org> on 2022/02/14 18:50:00 UTC
[jira] [Created] (ARROW-15679) count should return an ungrouped dataframe
Sam Albers created ARROW-15679:
----------------------------------
Summary: count should return an ungrouped dataframe
Key: ARROW-15679
URL: https://issues.apache.org/jira/browse/ARROW-15679
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 7.0.0
Reporter: Sam Albers
Unless its input was grouped beforehand, `dplyr::count` returns an ungrouped data.frame. The arrow implementation preserves the grouping variables:
{code:java}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
tf1 <- tempfile()
dir.create(tf1)
starwars |>
  write_dataset(tf1)
# no group ----------------------------------------------------------------
## dplyr behaviour
count_dplyr_no_group <- starwars %>%
  count(gender, homeworld, species)
group_vars(count_dplyr_no_group)
#> character(0)
## arrow behaviour
count_arrow_no_group <- open_dataset(tf1) %>%
  count(gender, homeworld, species) %>%
  collect()
group_vars(count_arrow_no_group)
#> [1] "gender" "homeworld"
{code}
If I am correct that this is an undesired behaviour, I think it can be fixed [here|https://github.com/apache/arrow/blob/5ad5ddcafee8fada9cebb341df638b750c98efb7/r/R/dplyr-count.R#L20-L35] using this patch:
{code:java}
count.arrow_dplyr_query <- function(x, ..., wt = NULL, sort = FALSE, name = NULL) {
  if (!missing(...)) {
    out <- dplyr::group_by(x, ..., .add = TRUE)
  } else {
    out <- x
  }
  out <- dplyr::tally(out, wt = {{ wt }}, sort = sort, name = name)

  gv <- dplyr::group_vars(x)
  if (rlang::is_empty(gv)) {
    # Input was ungrouped, so the result should be ungrouped too
    out <- dplyr::ungroup(out)
  } else {
    # Restore original group vars
    out$group_by_vars <- gv
  }
  out
}
{code}
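For reference, the dplyr semantics the patch is meant to mirror can be checked with a small sketch that uses only dplyr on the in-memory `starwars` data (no arrow involved). The variable names `ungrouped`, `grouped` and `tallied` are illustrative only. The last check shows that `tally()` on grouped data drops the last grouping level, which is presumably where the leftover "gender" and "homeworld" groups in the arrow result come from:

```r
library(dplyr, warn.conflicts = FALSE)

# Ungrouped input: count() returns an ungrouped result.
ungrouped <- starwars %>%
  count(gender, homeworld, species)
stopifnot(identical(group_vars(ungrouped), character(0)))

# Grouped input: count() restores only the original grouping.
grouped <- starwars %>%
  group_by(gender) %>%
  count(homeworld, species)
stopifnot(identical(group_vars(grouped), "gender"))

# tally() on its own drops just the last grouping level,
# so the result is still grouped by the remaining variables.
tallied <- starwars %>%
  group_by(gender, homeworld, species) %>%
  tally()
stopifnot(identical(group_vars(tallied), c("gender", "homeworld")))
```

If these assertions hold, a fixed arrow `count()` would be expected to give the same `group_vars()` as the first two cases after `collect()`.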
I can submit a PR with some tests if that would be helpful.