Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/10/05 21:59:00 UTC
[jira] [Comment Edited] (ARROW-14197) [C++] Hashjoin + datasets hanging

    [ https://issues.apache.org/jira/browse/ARROW-14197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424721#comment-17424721 ] 

David Li edited comment on ARROW-14197 at 10/5/21, 9:58 PM:
------------------------------------------------------------

Tangential, but the exec plan string repr being hard to read is unfortunate. It would help if R assigned names to the nodes, or perhaps we could have the ExecPlan auto-assign names if the client bindings don't give any. (Also, maybe GraphViz output would be nice after all…)

I filed ARROW-14233.
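
To make the auto-naming idea concrete, here is a minimal sketch. It assumes a hypothetical `Node` struct and a hypothetical `AssignDefaultLabels` helper; neither is Arrow's actual ExecNode/ExecPlan API, and the `<kind>_<index>` naming scheme is only one possible convention.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a plan node; NOT Arrow's ExecNode.
struct Node {
  std::string kind;   // e.g. "source", "filter", "hash_join"
  std::string label;  // empty when the client binding supplied no name
};

// Give every unlabeled node a default label of the form "<kind>_<index>",
// leaving labels supplied by the client untouched.
void AssignDefaultLabels(std::vector<Node>& nodes) {
  for (std::size_t i = 0; i < nodes.size(); ++i) {
    if (nodes[i].label.empty()) {
      nodes[i].label = nodes[i].kind + "_" + std::to_string(i);
    }
  }
}
```

Using the node's position in the plan as the suffix keeps default labels unique without any extra bookkeeping, so two hash joins in the same plan would still be distinguishable in the string repr.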


was (Author: lidavidm):
Tangential, but the exec plan string repr being hard to read is unfortunate. It would help if R assigned names to the nodes, or perhaps we could have the ExecPlan auto-assign names if the client bindings don't give any. (Also, maybe GraphViz output would be nice after all…)

> [C++] Hashjoin + datasets hanging
> ---------------------------------
>
>                 Key: ARROW-14197
>                 URL: https://issues.apache.org/jira/browse/ARROW-14197
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Jonathan Keane
>            Priority: Critical
>              Labels: query-engine
>             Fix For: 6.0.0
>
>         Attachments: gdb.2.log, gdb.log, sample-while-hung.out.txt
>
>
> I’m getting a hang on the TPC-H query 4 pretty reliably (though it’s not _every_ time). The query is:
> {code}
> l <- input_table("lineitem") %>%
>   select(l_orderkey, l_commitdate, l_receiptdate) %>%
>   filter(l_commitdate < l_receiptdate) %>%
>   select(l_orderkey)
> o <- input_table("orders") %>%
>   select(o_orderkey, o_orderdate, o_orderpriority) %>%
>   # kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01" + interval '3' month) %>%
>   filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < as.Date("1993-10-01")) %>%
>   select(o_orderkey, o_orderpriority)
> # distinct after join, tested and indeed faster
> lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
>   distinct() %>%
>   select(o_orderpriority)
> aggr <- lo %>%
>   group_by(o_orderpriority) %>%
>   summarise(order_count = n()) %>%
>   arrange(o_orderpriority) %>%
>   collect()
> {code}
> In short: filter lineitem, filter orders, join the two, then group_by, summarise, and arrange.
> This happens pretty reliably when {{input_table}} returns a dataset backed by Parquet or Feather files (e.g. {{input_table}} returns something like {{arrow::open_dataset("path/to/{filename}.feather", format = "feather")}}).
> One can replicate this by installing an arrowbench branch (https://github.com/ursacomputing/arrowbench/pull/37) with, in R: {{remotes::install_github("ursacomputing/arrowbench@moar-tpch")}} and then running the following:
> {code}
> library(arrowbench)
> results <- run_benchmark(
>   tpc_h,
>   scale_factor = 1,
>   cpu_count = 8,
>   query_id = 4,
>   lib_path = "remote-apache/arrow@HEAD", # remove this line if you have a recent install of the arrow R package that supports hash joins and want to avoid building a separate copy
>   format = "feather",
>   n_iter = 20
> )
> {code}
> Note that this _sometimes_ finishes, but frequently it does not and stays stuck.


