Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2021/11/09 12:08:00 UTC

[jira] [Created] (ARROW-14639) [R] Error collecting complex joins + aggregations

Dewey Dunnington created ARROW-14639:
----------------------------------------

             Summary: [R] Error collecting complex joins + aggregations
                 Key: ARROW-14639
                 URL: https://issues.apache.org/jira/browse/ARROW-14639
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Dewey Dunnington


I came across this when trying to implement [TPC-H benchmark query 11|https://github.com/duckdb/duckdb/blob/master/extension/tpch/dbgen/queries/q11.sql], which involves a few joins on tables created using a global and a local aggregation. The error that I get when collecting is {{Arrays used to construct an ExecBatch must have equal length}}.

Reprex:

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

tbl1 <- tibble(key1 = 1:100, key2 = 201:300, value1 = 1:100)
tbl2 <- tibble(key2 = 201:220, key3 = rep(301:310, 2), value2 = 201:220)
tbl3 <- tibble(key3 = 301:305, value3 = 301:305)

# joined_filtered <- tbl1 %>% 
#   inner_join(tbl2, by = "key2") %>% 
#   inner_join(tbl3, by = "key3") %>% 
#   filter(value3 <= 304)

joined_filtered <- arrow_table(tbl1) %>% 
  inner_join(arrow_table(tbl2), by = "key2") %>% 
  inner_join(arrow_table(tbl3), by = "key3") %>% 
  filter(value3 <= 304)

global_agr <- joined_filtered %>%
  mutate(global_agr_key = 1L) %>%
  group_by(global_agr_key) %>%
  summarise(
    global_value = sum(value1 * value2) / 5
  )

local_agr <- joined_filtered %>%
  group_by(key3) %>%
  summarise(value = sum(value1 * value2))

joined_filtered %>%
  mutate(global_agr_key = 1L) %>%
  left_join(global_agr, by = "global_agr_key") %>%
  left_join(local_agr, by = "key3") %>%
  filter(value > global_value) %>%
  arrange(desc(value)) %>% 
  collect()
#> Error: Invalid: Arrays used to construct an ExecBatch must have equal length
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/aggregate_node.cc:387  ExecBatch::Make({batch.values[agg_src_field_ids_[i]], id_batch})
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:398  iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:318  ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:329  ReadAll(&batches)
{code}
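
For comparison, here is a minimal sketch of the same pipeline run on plain tibbles (the commented-out variant in the reprex), on the assumption that dplyr's in-memory result is what the Arrow query should return. The names {{df_filtered}}, {{df_global}}, {{df_local}} and {{expected}} are introduced here for illustration only and are not part of the original reprex.

```r
library(dplyr, warn.conflicts = FALSE)

# Same input data as the reprex above
tbl1 <- tibble(key1 = 1:100, key2 = 201:300, value1 = 1:100)
tbl2 <- tibble(key2 = 201:220, key3 = rep(301:310, 2), value2 = 201:220)
tbl3 <- tibble(key3 = 301:305, value3 = 301:305)

# Joins and filter evaluated by dplyr on in-memory tibbles (no arrow_table())
df_filtered <- tbl1 %>%
  inner_join(tbl2, by = "key2") %>%
  inner_join(tbl3, by = "key3") %>%
  filter(value3 <= 304)

# Global aggregation: a constant key collapses everything into one group
df_global <- df_filtered %>%
  mutate(global_agr_key = 1L) %>%
  group_by(global_agr_key) %>%
  summarise(global_value = sum(value1 * value2) / 5)

# Local aggregation: one row per key3
df_local <- df_filtered %>%
  group_by(key3) %>%
  summarise(value = sum(value1 * value2))

# Join both aggregations back onto the filtered rows, then filter and sort
expected <- df_filtered %>%
  mutate(global_agr_key = 1L) %>%
  left_join(df_global, by = "global_agr_key") %>%
  left_join(df_local, by = "key3") %>%
  filter(value > global_value) %>%
  arrange(desc(value))

expected
```

If the Arrow pipeline collected successfully, its output could be checked against {{expected}}, for example with {{all.equal()}} after converting both to data frames.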


