Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2021/11/25 11:33:00 UTC
[jira] [Commented] (ARROW-14856) [R] group by n() on partitioning variables counts files not rows

    [ https://issues.apache.org/jira/browse/ARROW-14856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449144#comment-17449144 ] 

Nicola Crane commented on ARROW-14856:
--------------------------------------

Thanks for reporting this [~bbertelsen]!  I can reproduce this with 6.0.0.2, but with the latest version available on CRAN (6.0.1) the problem no longer occurs and the correct values are reported.  I'm afraid I'm not sure which particular change fixed it.  Would you mind updating to 6.0.1 and checking whether that fixes things for you?
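
For anyone wanting to check this after updating, a minimal sketch of the comparison might look like the following. The scratch directory name and the smaller 1,000-row data frame are assumptions made for illustration and are not part of the original report; the check only compares the grouped counts against the number of rows written.

```r
library(arrow)
library(dplyr)

# Show the installed arrow version (the report used 6.0.0.2; 6.0.1 looked fine).
packageVersion("arrow")

# Small illustrative dataset, partitioned the same way as in the report.
set.seed(42)
df <- data.frame(a = 1:1000, letters = sample(letters, 1000, replace = TRUE))

# Hypothetical scratch location; any empty directory would do.
path <- file.path(tempdir(), "arrow_14856_check")
write_dataset(df, path = path, partitioning = "letters", hive_style = FALSE)

# Count rows per partition value through the dataset query.
ds <- open_dataset(path, partitioning = "letters")
counts <- ds %>% group_by(letters) %>% summarize(n = n()) %>% collect()

# If rows (not files) are being counted, the group counts sum to nrow(df).
sum(counts$n) == nrow(df)
```

If the last expression is FALSE, or every group shows a count of 1, the installed version is probably still affected.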

> [R] group by n() on partitioning variables counts files not rows
> ----------------------------------------------------------------
>
>                 Key: ARROW-14856
>                 URL: https://issues.apache.org/jira/browse/ARROW-14856
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Brandon Bertelsen
>            Priority: Major
>
> It appears that when grouping by a partitioning variable, summarize(), tally(), and n() now count the number of files in a group rather than the number of rows.
> Using the R package from CRAN, version 6.0.0.2
> {code:java}
> library(arrow)
> library(dplyr)
> set.seed(42)
> df <- data.frame(a = sample(1:1e6))
> df$letters <- sample(letters, 1e6, replace = TRUE)
> write_dataset(df, path = "test", partitioning = "letters", hive_style = FALSE)
> r <- read_parquet("test/a/part-0.parquet")
> nrow(r)
> # 38389 
> ds <- open_dataset("test", partitioning = 'letters')
> ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect()
> # # A tibble: 26 × 2
> #    letters     n
> #    <chr>   <int>
> #  1 c           1
> #  2 p           1
> #  3 a           1
> #  4 b           1
> #  5 e           1
> #  6 q           1
> #  7 d           1
> #  8 g           1
> #  9 r           1
> # 10 h           1
> # # … with 16 more rows
> file.copy("test/a/part-0.parquet", "test/a/part-1.parquet")
> ds <- open_dataset("test", partitioning = 'letters')
> ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect() %>% arrange(-n)
> # # A tibble: 26 × 2
> #    letters     n
> #    <chr>   <int>
> #  1 a           2
> #  2 d           1
> #  3 f           1
> #  4 x           1
> #  5 c           1
> #  6 b           1
> #  7 e           1
> #  8 g           1
> #  9 u           1
> # 10 k           1
> # # … with 16 more rows
> # What about with summarize n = n()?
> ds %>% select(letters) %>% group_by(letters) %>% summarize(n = n()) %>% collect() %>% arrange(-n)
> # # A tibble: 26 × 2
> #    letters     n
> #    <chr>   <int>
> #  1 a           2
> #  2 b           1
> #  3 h           1
> #  4 g           1
> #  5 c           1
> #  6 i           1
> #  7 j           1
> #  8 s           1
> #  9 k           1
> # 10 d           1
> # # … with 16 more rows
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)