Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2021/11/09 12:08:00 UTC
[jira] [Created] (ARROW-14639) [R] Error collecting complex joins + aggregations
Dewey Dunnington created ARROW-14639:
----------------------------------------
Summary: [R] Error collecting complex joins + aggregations
Key: ARROW-14639
URL: https://issues.apache.org/jira/browse/ARROW-14639
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Dewey Dunnington
I came across this when trying to implement [TPC-H benchmark query 11|https://github.com/duckdb/duckdb/blob/master/extension/tpch/dbgen/queries/q11.sql], which involves a few joins on tables created using a global and a local aggregation. The error that I get when collecting is {{Arrays used to construct an ExecBatch must have equal length}}.
Reprex:
{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
tbl1 <- tibble(key1 = 1:100, key2 = 201:300, value1 = 1:100)
tbl2 <- tibble(key2 = 201:220, key3 = rep(301:310, 2), value2 = 201:220)
tbl3 <- tibble(key3 = 301:305, value3 = 301:305)
# joined_filtered <- tbl1 %>%
#   inner_join(tbl2, by = "key2") %>%
#   inner_join(tbl3, by = "key3") %>%
#   filter(value3 <= 304)
joined_filtered <- arrow_table(tbl1) %>%
  inner_join(arrow_table(tbl2), by = "key2") %>%
  inner_join(arrow_table(tbl3), by = "key3") %>%
  filter(value3 <= 304)
global_agr <- joined_filtered %>%
  mutate(global_agr_key = 1L) %>%
  group_by(global_agr_key) %>%
  summarise(
    global_value = sum(value1 * value2) / 5
  )
local_agr <- joined_filtered %>%
  group_by(key3) %>%
  summarise(value = sum(value1 * value2))
joined_filtered %>%
  mutate(global_agr_key = 1L) %>%
  left_join(global_agr, by = "global_agr_key") %>%
  left_join(local_agr, by = "key3") %>%
  filter(value > global_value) %>%
  arrange(desc(value)) %>%
  collect()
#> Error: Invalid: Arrays used to construct an ExecBatch must have equal length
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/aggregate_node.cc:387 ExecBatch::Make({batch.values[agg_src_field_ids_[i]], id_batch})
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:398 iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:318 ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:329 ReadAll(&batches)
{code}
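As a cross-check on what the query should return, the same pipeline can be sketched on plain tibbles with dplyr alone, using the same inputs as the reprex. The {{expected}} name is only illustrative, and the sketch assumes the dplyr result is the one the Arrow query is meant to reproduce:
{code:R}
library(dplyr, warn.conflicts = FALSE)

# Same inputs as in the reprex
tbl1 <- tibble(key1 = 1:100, key2 = 201:300, value1 = 1:100)
tbl2 <- tibble(key2 = 201:220, key3 = rep(301:310, 2), value2 = 201:220)
tbl3 <- tibble(key3 = 301:305, value3 = 301:305)

# Plain dplyr joins, no arrow_table() involved
joined_filtered <- tbl1 %>%
  inner_join(tbl2, by = "key2") %>%
  inner_join(tbl3, by = "key3") %>%
  filter(value3 <= 304)

# Global aggregation: one row, keyed by a constant so it can be joined back
global_agr <- joined_filtered %>%
  mutate(global_agr_key = 1L) %>%
  group_by(global_agr_key) %>%
  summarise(global_value = sum(value1 * value2) / 5)

# Local aggregation: one row per key3
local_agr <- joined_filtered %>%
  group_by(key3) %>%
  summarise(value = sum(value1 * value2))

# "expected" is a hypothetical name for the reference result
expected <- joined_filtered %>%
  mutate(global_agr_key = 1L) %>%
  left_join(global_agr, by = "global_agr_key") %>%
  left_join(local_agr, by = "key3") %>%
  filter(value > global_value) %>%
  arrange(desc(value))

# Worked by hand: global_value is 2532 and three key3 groups
# (302, 303, 304) exceed it, each contributing two rows
stopifnot(nrow(expected) == 6L)
{code}
Once {{collect()}} succeeds on the Arrow version, comparing its output against {{expected}} (for example after sorting both by {{key1}}) would confirm the fix, since rows with tied {{value}} may come back in a different order.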