Posted to issues@arrow.apache.org by "Brandon Bertelsen (Jira)" <ji...@apache.org> on 2021/11/24 19:04:00 UTC
[jira] [Created] (ARROW-14856) [R] group by n() on partitioning variables counts files not rows
Brandon Bertelsen created ARROW-14856:
-----------------------------------------
Summary: [R] group by n() on partitioning variables counts files not rows
Key: ARROW-14856
URL: https://issues.apache.org/jira/browse/ARROW-14856
Project: Apache Arrow
Issue Type: Bug
Reporter: Brandon Bertelsen
It appears that when grouping by a partitioning variable, summarize(), tally(), and n() now count the number of files in each group rather than the number of rows.
Using the arrow R package from CRAN, version 6.0.0.2.
{code:java}
library(arrow)
library(dplyr)
set.seed(42)
df <- data.frame(a = sample(1:1e6))
df$letters <- sample(letters, 1e6, replace = TRUE)
write_dataset(df, path = "test", partitioning = "letters", hive_style = FALSE)
r <- read_parquet("test/a/part-0.parquet")
nrow(r)
# 38389
ds <- open_dataset("test", partitioning = 'letters')
ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect()
# # A tibble: 26 × 2
# letters n
# <chr> <int>
# 1 c 1
# 2 p 1
# 3 a 1
# 4 b 1
# 5 e 1
# 6 q 1
# 7 d 1
# 8 g 1
# 9 r 1
# 10 h 1
# # … with 16 more rows
file.copy("test/a/part-0.parquet", "test/a/part-1.parquet")
ds <- open_dataset("test", partitioning = 'letters')
ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect() %>% arrange(-n)
# # A tibble: 26 × 2
# letters n
# <chr> <int>
# 1 a 2
# 2 d 1
# 3 f 1
# 4 x 1
# 5 c 1
# 6 b 1
# 7 e 1
# 8 g 1
# 9 u 1
# 10 k 1
# # … with 16 more rows
# What about with summarize n = n()?
ds %>% select(letters) %>% group_by(letters) %>% summarize(n = n()) %>% collect() %>% arrange(-n)
# # A tibble: 26 × 2
# letters n
# <chr> <int>
# 1 a 2
# 2 b 1
# 3 h 1
# 4 g 1
# 5 c 1
# 6 i 1
# 7 j 1
# 8 s 1
# 9 k 1
# 10 d 1
# # … with 16 more rows
{code}
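For comparison, the expected per-group row counts can be computed without going through the dataset query at all, by reading each Parquet file directly and summing the rows per partition directory. The following is a minimal sketch, not part of the original report: it assumes the same non-hive layout as above (`<path>/<letter>/part-*.parquet`), and the smaller data frame and the temporary directory name are made up for illustration.

```r
library(arrow)

# Smaller, illustrative dataset (hypothetical size and path, not from the report)
set.seed(42)
df <- data.frame(a = 1:1000, letters = sample(letters, 1000, replace = TRUE))
path <- file.path(tempdir(), "arrow-14856-check")
write_dataset(df, path = path, partitioning = "letters", hive_style = FALSE)

# Read every Parquet file on its own and record its row count
files <- list.files(path, pattern = "\\.parquet$", recursive = TRUE, full.names = TRUE)
rows_per_file <- vapply(files, function(f) nrow(read_parquet(f)), integer(1))

# Sum the row counts per partition directory (the directory name is the letter)
expected <- tapply(rows_per_file, basename(dirname(files)), sum)

# Number of files per partition directory, which is what the buggy n() appears to report
files_per_group <- table(basename(dirname(files)))

head(expected)
head(files_per_group)
```

If the grouped `n()` on the dataset matches `files_per_group` rather than `expected`, that would be consistent with the behaviour described in this report.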