Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2021/11/09 12:16:00 UTC

[jira] [Commented] (ARROW-14639) [R] Error collecting complex joins + aggregations

    [ https://issues.apache.org/jira/browse/ARROW-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441122#comment-17441122 ] 

Dewey Dunnington commented on ARROW-14639:
------------------------------------------

Much simpler reprex that triggers the same error message:

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

tbl1 <- tibble(key1 = 1:100, key2 = 201:300, value1 = 1:100)

global_agr <- arrow_table(tbl1) %>%
  mutate(global_agr_key = 1L) %>% 
  group_by(global_agr_key) %>% 
  summarise(
    global_value = sum(value1) / 5
  )

arrow_table(tbl1) %>%
  mutate(global_agr_key = 1L) %>%
  left_join(global_agr, by = "global_agr_key") %>%
  collect()
#> Error: Invalid: Arrays used to construct an ExecBatch must have equal length
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/aggregate_node.cc:387  ExecBatch::Make({batch.values[agg_src_field_ids_[i]], id_batch})
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:398  iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:318  ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:329  ReadAll(&batches)
{code}
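
For comparison, here is a minimal sketch of what the reprex above would presumably be expected to return, evaluated with plain dplyr on the in-memory tibble instead of an Arrow table. This assumes the Arrow query is meant to match dplyr semantics; {{global_agr_df}} and {{expected}} are names introduced only for this illustration and it does not exercise the Arrow code path at all.

{code:R}
library(dplyr, warn.conflicts = FALSE)

tbl1 <- tibble(key1 = 1:100, key2 = 201:300, value1 = 1:100)

# Same global aggregation as in the reprex, but computed by dplyr
# on a tibble (assumption: this is the intended result).
global_agr_df <- tbl1 %>%
  mutate(global_agr_key = 1L) %>%
  group_by(global_agr_key) %>%
  summarise(global_value = sum(value1) / 5)

# Join the single-row aggregate back onto every row of the input.
expected <- tbl1 %>%
  mutate(global_agr_key = 1L) %>%
  left_join(global_agr_df, by = "global_agr_key")

nrow(expected)
#> [1] 100
unique(expected$global_value)
#> [1] 1010
{code}

Every one of the 100 input rows picks up the same {{global_value}} (sum(1:100) / 5 = 1010), so a successful {{collect()}} on the Arrow version should plausibly give a 100-row table with that constant column.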


> [R] Error collecting complex joins + aggregations
> -------------------------------------------------
>
>                 Key: ARROW-14639
>                 URL: https://issues.apache.org/jira/browse/ARROW-14639
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Dewey Dunnington
>            Priority: Major
>
> I came across this when trying to implement [TPC-H benchmark 11|https://github.com/duckdb/duckdb/blob/master/extension/tpch/dbgen/queries/q11.sql], which involves a few joins on tables created using a global and a local aggregation. The error that I get when collecting is {{Arrays used to construct an ExecBatch must have equal length}}.
> Reprex:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> tbl1 <- tibble(key1 = 1:100, key2 = 201:300, value1 = 1:100)
> tbl2 <- tibble(key2 = 201:220, key3 = rep(301:310, 2), value2 = 201:220)
> tbl3 <- tibble(key3 = 301:305, value3 = 301:305)
> # joined_filtered <- tbl1 %>% 
> #   inner_join(tbl2, by = "key2") %>% 
> #   inner_join(tbl3, by = "key3") %>% 
> #   filter(value3 <= 304)
> joined_filtered <- arrow_table(tbl1) %>% 
>   inner_join(arrow_table(tbl2), by = "key2") %>% 
>   inner_join(arrow_table(tbl3), by = "key3") %>% 
>   filter(value3 <= 304)
> global_agr <- joined_filtered %>%
>   mutate(global_agr_key = 1L) %>%
>   group_by(global_agr_key) %>%
>   summarise(
>     global_value = sum(value1 * value2) / 5
>   )
> local_agr <- joined_filtered %>%
>   group_by(key3) %>%
>   summarise(value = sum(value1 * value2))
> joined_filtered %>%
>   mutate(global_agr_key = 1L) %>%
>   left_join(global_agr, by = "global_agr_key") %>%
>   left_join(local_agr, by = "key3") %>%
>   filter(value > global_value) %>%
>   arrange(desc(value)) %>% 
>   collect()
> #> Error: Invalid: Arrays used to construct an ExecBatch must have equal length
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/aggregate_node.cc:387  ExecBatch::Make({batch.values[agg_src_field_ids_[i]], id_batch})
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:398  iterator_.Next()
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:318  ReadNext(&batch)
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:329  ReadAll(&batches)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)