Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2021/11/25 11:33:00 UTC
[jira] [Commented] (ARROW-14856) [R] group by n() on partitioning variables counts files not rows
[ https://issues.apache.org/jira/browse/ARROW-14856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449144#comment-17449144 ]
Nicola Crane commented on ARROW-14856:
--------------------------------------
Thanks for reporting this [~bbertelsen]! I can reproduce this with 6.0.0.2, but with the latest version available on CRAN (6.0.1) the problem no longer occurs and the correct values are reported. I'm afraid I'm not sure which particular change fixed it. Would you mind updating to 6.0.1 and checking whether that fixes things for you?
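A quick way to check which side of that version boundary an installation is on, as a minimal sketch: `needs_update()` is a hypothetical helper written for this illustration (it is not part of arrow), and the 6.0.1 threshold is taken only from the observation above, not from a confirmed fix commit.

```r
# Hypothetical helper: TRUE when `installed` is older than the release
# reported in this thread as giving correct counts (6.0.1).
needs_update <- function(installed, fixed = "6.0.1") {
  package_version(installed) < package_version(fixed)
}

# Only inspect the installed version if arrow is actually available.
if (requireNamespace("arrow", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("arrow"))
  if (needs_update(installed)) {
    message("arrow ", installed, " is older than 6.0.1; try install.packages(\"arrow\")")
  } else {
    message("arrow ", installed, " is at least 6.0.1")
  }
}
```

Comparing `package_version` objects rather than strings means 6.0.0.2 sorts before 6.0.1 as intended.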
> [R] group by n() on partitioning variables counts files not rows
> ----------------------------------------------------------------
>
> Key: ARROW-14856
> URL: https://issues.apache.org/jira/browse/ARROW-14856
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Brandon Bertelsen
> Priority: Major
>
> It appears that when grouping by a partitioning variable, the summarize()/tally() and n() methods now count the number of files in a group rather than the number of rows.
> Using R package from CRAN 6.0.0.2
> {code:java}
> library(arrow)
> library(dplyr)
> set.seed(42)
> df <- data.frame(a = sample(1:1e6))
> df$letters <- sample(letters, 1e6, replace = TRUE)
> write_dataset(df, path = "test", partitioning = "letters", hive_style = FALSE)
> r <- read_parquet("test/a/part-0.parquet")
> nrow(r)
> # 38389
> ds <- open_dataset("test", partitioning = 'letters')
> ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect()
> # # A tibble: 26 × 2
> # letters n
> # <chr> <int>
> # 1 c 1
> # 2 p 1
> # 3 a 1
> # 4 b 1
> # 5 e 1
> # 6 q 1
> # 7 d 1
> # 8 g 1
> # 9 r 1
> # 10 h 1
> # # … with 16 more rows
> file.copy("test/a/part-0.parquet", "test/a/part-1.parquet")
> ds <- open_dataset("test", partitioning = 'letters')
> ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect() %>% arrange(-n)
> # # A tibble: 26 × 2
> # letters n
> # <chr> <int>
> # 1 a 2
> # 2 d 1
> # 3 f 1
> # 4 x 1
> # 5 c 1
> # 6 b 1
> # 7 e 1
> # 8 g 1
> # 9 u 1
> # 10 k 1
> # # … with 16 more rows
> # What about with summarize n = n()?
> ds %>% select(letters) %>% group_by(letters) %>% summarize(n = n()) %>% collect() %>% arrange(-n)
> # # A tibble: 26 × 2
> # letters n
> # <chr> <int>
> # 1 a 2
> # 2 b 1
> # 3 h 1
> # 4 g 1
> # 5 c 1
> # 6 i 1
> # 7 j 1
> # 8 s 1
> # 9 k 1
> # 10 d 1
> # # … with 16 more rows {code}
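One way to cross-check the dataset counts against the in-memory data, sketched under a few assumptions: the dataset is written to a temporary directory (the path name is made up for this illustration), and `comparison` and `counts_match` are names introduced here. Based on the report and the comment above, the check would be expected to fail on 6.0.0.2 and pass on 6.0.1, but that has not been verified beyond what is stated in this thread.

```r
library(arrow)
library(dplyr)

# Rebuild the data from the report.
set.seed(42)
df <- data.frame(a = sample(1:1e6))
df$letters <- sample(letters, 1e6, replace = TRUE)

# Write to a scratch location (hypothetical path) instead of "test".
path <- file.path(tempdir(), "arrow-14856-check")
write_dataset(df, path = path, partitioning = "letters", hive_style = FALSE)

# Expected per-letter row counts, computed in memory without Arrow.
expected <- df %>% count(letters)

# Per-letter counts as reported by the partitioned dataset.
actual <- open_dataset(path, partitioning = "letters") %>%
  group_by(letters) %>%
  summarize(n = n()) %>%
  collect()

# Join on the partition key so row order does not matter.
comparison <- merge(expected, actual, by = "letters",
                    suffixes = c("_expected", "_actual"))
counts_match <- nrow(comparison) == 26 &&
  all(comparison$n_expected == comparison$n_actual)
print(counts_match)
```

Joining on `letters` avoids relying on the row order returned by `collect()`, which varies between the outputs shown in the report.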