Posted to jira@arrow.apache.org by "Daniël Heres (Jira)" <ji...@apache.org> on 2020/12/22 13:08:00 UTC

[jira] [Commented] (ARROW-10964) [Rust] [DataFusion] Optimize nested joins

    [ https://issues.apache.org/jira/browse/ARROW-10964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253477#comment-17253477 ] 

Daniël Heres commented on ARROW-10964:
--------------------------------------

Found some nice material from Spark on this:
[https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html]

Basically, the idea is to use column-level statistics such as:
* min/max
* number of distinct values
* null count

to estimate, e.g., the selectivity of a filter.
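
As a rough illustration of how such statistics could feed a selectivity estimate, here is a minimal sketch. It is not DataFusion code: the {{ColumnStats}} struct, its fields and both methods are hypothetical names made up for this example, and the formulas assume values are uniformly distributed between min and max (and evenly spread over the distinct values), which is a simplification.

```rust
/// Hypothetical column-level statistics (not a DataFusion type).
struct ColumnStats {
    min: f64,
    max: f64,
    distinct_count: u64,
    null_count: u64,
    row_count: u64,
}

impl ColumnStats {
    /// Fraction of rows whose value is not null.
    fn non_null_fraction(&self) -> f64 {
        if self.row_count == 0 {
            return 0.0;
        }
        let non_null = self.row_count.saturating_sub(self.null_count);
        non_null as f64 / self.row_count as f64
    }

    /// Estimated selectivity of `col = value`.
    /// Assumption: rows are spread evenly over the distinct values.
    fn selectivity_eq(&self, value: f64) -> f64 {
        if value < self.min || value > self.max || self.distinct_count == 0 {
            // Outside the observed range, nothing can match.
            return 0.0;
        }
        self.non_null_fraction() / self.distinct_count as f64
    }

    /// Estimated selectivity of `col < value`.
    /// Assumption: values are uniformly distributed in [min, max].
    fn selectivity_lt(&self, value: f64) -> f64 {
        if value <= self.min {
            return 0.0;
        }
        let in_range = if value > self.max {
            1.0
        } else {
            // Here min < value <= max, so max - min is strictly positive.
            (value - self.min) / (self.max - self.min)
        };
        // Null values never satisfy the comparison.
        in_range * self.non_null_fraction()
    }
}
```

For a column with min 0, max 100 and 20% nulls, {{selectivity_lt(25.0)}} would come out at roughly 0.25 * 0.8 = 0.2 under these assumptions.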

There is also a formula for estimating (inner) join cardinality, where IJ denotes an inner join on key k:

{{num(A IJ B) = num(A)*num(B)/max(distinct(A.k),distinct(B.k))}}
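
The formula above translates fairly directly into code. The following is only a sketch: the function name and parameters are hypothetical and do not correspond to any existing DataFusion API, and the handling of a zero distinct count is my own assumption.

```rust
/// Estimated row count of an inner join of A and B on key k:
/// num(A IJ B) = num(A) * num(B) / max(distinct(A.k), distinct(B.k)).
/// Hypothetical helper, not part of DataFusion.
fn estimate_inner_join_rows(
    rows_a: u64,
    rows_b: u64,
    distinct_a: u64,
    distinct_b: u64,
) -> u64 {
    let max_distinct = distinct_a.max(distinct_b);
    if max_distinct == 0 {
        // Assumption: no distinct key values on either side means no matches.
        return 0;
    }
    // Widen to u128 so the product of the two row counts cannot overflow.
    let product = rows_a as u128 * rows_b as u128;
    let estimate = product / max_distinct as u128;
    // Clamp instead of truncating if the estimate does not fit in u64.
    u64::try_from(estimate).unwrap_or(u64::MAX)
}
```

For example, joining a 1,000,000-row table with 10,000 distinct keys against a 10,000-row table with 10,000 distinct keys gives 1,000,000 * 10,000 / 10,000 = 1,000,000 estimated rows.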

> [Rust] [DataFusion] Optimize nested joins
> -----------------------------------------
>
>                 Key: ARROW-10964
>                 URL: https://issues.apache.org/jira/browse/ARROW-10964
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Andy Grove
>            Priority: Major
>
> Once [https://github.com/apache/arrow/pull/8961] is merged, we have an optimization for a JOIN that operates on two tables.
> The next step is to extend this optimization to work with nested joins, and this is not trivial. See discussion in [https://github.com/apache/arrow/pull/8961] for context.
>  


