Posted to jira@arrow.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2020/12/20 20:44:00 UTC

[jira] [Resolved] (ARROW-10885) [Rust][DataFusion] Optimize join build vs probe based on statistics on row number

     [ https://issues.apache.org/jira/browse/ARROW-10885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove resolved ARROW-10885.
--------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 8961
[https://github.com/apache/arrow/pull/8961]

> [Rust][DataFusion] Optimize join build vs probe based on statistics on row number
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-10885
>                 URL: https://issues.apache.org/jira/browse/ARROW-10885
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Daniël Heres
>            Assignee: Daniël Heres
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> Based on the number of rows in each data source, we can choose which table should form the build side and which the probe side of a hash join. We should make the (approximately) smallest data source the build side. This can have a large effect on performance when one of the two tables is much bigger than the other, as we avoid building a large lookup table.
> We have recently started adding statistics to data sources in DataFusion, so this seems like something we can add relatively easily. The number of rows can be approximated from the underlying statistics in the data sources; to begin with, it should at least work for simple cases.
> When swapping the order, a left join has to be changed to a right join and vice versa; inner joins remain the same. It is probably easier to start with inner joins and then add left / right joins.
> Maybe we should also rename some internals to make clear that, e.g., the left side is the build side and the right side is the probe side.
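
The swap rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request: the types `JoinType`, `TableInfo` and `HashJoin` and the function names are hypothetical stand-ins rather than DataFusion's API, the left input is assumed to be the build side, and a missing row count is treated as "unknown, do not swap".

```rust
/// Join types covered by this sketch (hypothetical; not DataFusion's own enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinType {
    Inner,
    Left,
    Right,
}

/// A join input together with an optional row-count statistic.
/// `None` means the data source could not provide a row count.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TableInfo {
    pub name: String,
    pub num_rows: Option<usize>,
}

/// A hash join in which, by assumption, `left` is the build side
/// and `right` is the probe side.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HashJoin {
    pub left: TableInfo,
    pub right: TableInfo,
    pub join_type: JoinType,
}

/// When the inputs trade places, a left join becomes a right join and
/// vice versa; an inner join is unchanged.
pub fn swap_join_type(join_type: JoinType) -> JoinType {
    match join_type {
        JoinType::Inner => JoinType::Inner,
        JoinType::Left => JoinType::Right,
        JoinType::Right => JoinType::Left,
    }
}

/// Swap only when both row counts are known and the current build side
/// (left) is strictly larger than the probe side (right).
pub fn should_swap(left: &TableInfo, right: &TableInfo) -> bool {
    match (left.num_rows, right.num_rows) {
        (Some(l), Some(r)) => l > r,
        _ => false,
    }
}

/// Return a join whose build side is the (approximately) smaller input.
pub fn optimize_join_order(join: HashJoin) -> HashJoin {
    if should_swap(&join.left, &join.right) {
        HashJoin {
            left: join.right,
            right: join.left,
            join_type: swap_join_type(join.join_type),
        }
    } else {
        join
    }
}
```

For example, a left join with a 1,000,000-row table on the left and a 100-row table on the right would come back with the inputs exchanged and the join type changed to a right join, while a join where either row count is unknown is returned as-is.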



--
This message was sent by Atlassian Jira
(v8.3.4#803005)