Posted to jira@arrow.apache.org by "Daniël Heres (Jira)" <ji...@apache.org> on 2020/12/22 13:08:00 UTC
[jira] [Commented] (ARROW-10964) [Rust] [DataFusion] Optimize nested joins
[ https://issues.apache.org/jira/browse/ARROW-10964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253477#comment-17253477 ]
Daniël Heres commented on ARROW-10964:
--------------------------------------
Found some nice material from Spark on this:
[https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html]
The basic idea is to use column-level statistics such as:
* min/max
* number of distinct values
* null count
to estimate, for example, the selectivity of a filter.
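As a rough illustration of how these statistics could feed a selectivity estimate, here is a minimal sketch. It assumes values are uniformly distributed between min and max and that every distinct value is equally frequent. The `ColumnStats` struct and its method names are hypothetical and are not part of DataFusion's API.

```rust
/// Hypothetical per-column statistics (illustrative only, not DataFusion's API).
#[derive(Debug, Clone, Copy)]
pub struct ColumnStats {
    pub min: f64,
    pub max: f64,
    pub distinct_count: u64,
    pub null_count: u64,
    pub row_count: u64,
}

impl ColumnStats {
    /// Fraction of rows whose value is not null.
    fn non_null_fraction(&self) -> f64 {
        if self.row_count == 0 {
            return 0.0;
        }
        1.0 - self.null_count as f64 / self.row_count as f64
    }

    /// Estimated selectivity of `col = value`.
    /// Assumption: every distinct value occurs equally often.
    pub fn eq_selectivity(&self, value: f64) -> f64 {
        if self.distinct_count == 0 || value < self.min || value > self.max {
            return 0.0;
        }
        self.non_null_fraction() / self.distinct_count as f64
    }

    /// Estimated selectivity of `col < value`.
    /// Assumption: values are spread uniformly over [min, max].
    pub fn lt_selectivity(&self, value: f64) -> f64 {
        if value <= self.min {
            return 0.0;
        }
        if value > self.max {
            return self.non_null_fraction();
        }
        // Here min < value <= max, so the denominator is non-zero.
        (value - self.min) / (self.max - self.min) * self.non_null_fraction()
    }
}
```

Multiplying a selectivity by the table's row count gives an estimated output row count for the filter. Real data is often skewed, so a histogram would likely give better estimates than the uniformity assumption used here.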
There is also a formula for the cardinality of an inner join (IJ) of A and B on key k:
{{num(A IJ B) = num(A)*num(B)/max(distinct(A.k),distinct(B.k))}}
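The formula above can be sketched as a small function, reading num(X) as the row count of X and distinct(X.k) as the number of distinct join-key values in X. The function name and signature are hypothetical, and returning 0 when no distinct-count is available is an assumption of this sketch, not something the formula specifies.

```rust
/// Estimated row count of an inner equi-join of A and B on key k:
/// num(A) * num(B) / max(distinct(A.k), distinct(B.k)).
/// Hypothetical helper for illustration; not part of DataFusion's API.
pub fn inner_join_cardinality(
    rows_a: u64,
    rows_b: u64,
    distinct_a: u64,
    distinct_b: u64,
) -> u64 {
    let max_distinct = distinct_a.max(distinct_b);
    if max_distinct == 0 {
        // Assumption: with no distinct key values there is nothing to match.
        return 0;
    }
    // Widen to u128 so the product of two large row counts cannot overflow.
    let product = rows_a as u128 * rows_b as u128;
    let estimate = product / max_distinct as u128;
    // Saturate instead of wrapping if the estimate exceeds u64::MAX.
    estimate.min(u64::MAX as u128) as u64
}
```

For example, joining 1,000 rows with 100 rows on a key that has 100 distinct values on each side gives an estimate of 1,000 * 100 / 100 = 1,000 rows. For nested joins, estimates like this could be compared across candidate join orders to pick the one with the smallest intermediate results.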
> [Rust] [DataFusion] Optimize nested joins
> -----------------------------------------
>
> Key: ARROW-10964
> URL: https://issues.apache.org/jira/browse/ARROW-10964
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Andy Grove
> Priority: Major
>
> Once [https://github.com/apache/arrow/pull/8961] is merged, we have an optimization for a JOIN that operates on two tables.
> The next step is to extend this optimization to work with nested joins, and this is not trivial. See discussion in [https://github.com/apache/arrow/pull/8961] for context.
>