Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2021/11/10 23:29:00 UTC
[jira] [Resolved] (ARROW-14639) [R] Error collecting complex joins + aggregations
[ https://issues.apache.org/jira/browse/ARROW-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dewey Dunnington resolved ARROW-14639.
--------------------------------------
Resolution: Fixed
> [R] Error collecting complex joins + aggregations
> -------------------------------------------------
>
> Key: ARROW-14639
> URL: https://issues.apache.org/jira/browse/ARROW-14639
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Dewey Dunnington
> Priority: Major
>
> I came across this when trying to implement [TPC-H benchmark 11|https://github.com/duckdb/duckdb/blob/master/extension/tpch/dbgen/queries/q11.sql], which involves a few joins on tables created using a global and a local aggregation. The error that I get when collecting is {{Arrays used to construct an ExecBatch must have equal length}}.
> Reprex:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> tbl1 <- tibble(key1 = 1:100, key2 = 201:300, value1 = 1:100)
> tbl2 <- tibble(key2 = 201:220, key3 = rep(301:310, 2), value2 = 201:220)
> tbl3 <- tibble(key3 = 301:305, value3 = 301:305)
> # joined_filtered <- tbl1 %>%
> #   inner_join(tbl2, by = "key2") %>%
> #   inner_join(tbl3, by = "key3") %>%
> #   filter(value3 <= 304)
> joined_filtered <- arrow_table(tbl1) %>%
>   inner_join(arrow_table(tbl2), by = "key2") %>%
>   inner_join(arrow_table(tbl3), by = "key3") %>%
>   filter(value3 <= 304)
> global_agr <- joined_filtered %>%
>   mutate(global_agr_key = 1L) %>%
>   group_by(global_agr_key) %>%
>   summarise(
>     global_value = sum(value1 * value2) / 5
>   )
> local_agr <- joined_filtered %>%
>   group_by(key3) %>%
>   summarise(value = sum(value1 * value2))
> joined_filtered %>%
>   mutate(global_agr_key = 1L) %>%
>   left_join(global_agr, by = "global_agr_key") %>%
>   left_join(local_agr, by = "key3") %>%
>   filter(value > global_value) %>%
>   arrange(desc(value)) %>%
>   collect()
> #> Error: Invalid: Arrays used to construct an ExecBatch must have equal length
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/aggregate_node.cc:387 ExecBatch::Make({batch.values[agg_src_field_ids_[i]], id_batch})
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:398 iterator_.Next()
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:318 ReadNext(&batch)
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:329 ReadAll(&batches)
> {code}