You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@trafodion.apache.org by "Suresh Subbiah (JIRA)" <ji...@apache.org> on 2015/10/08 00:59:26 UTC
[jira] [Updated] (TRAFODION-8) Do we need an "Orchestrated Hash Join" in Trafodion?

     [ https://issues.apache.org/jira/browse/TRAFODION-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suresh Subbiah updated TRAFODION-8:
-----------------------------------
    Component/s: sql-general

> Do we need an "Orchestrated Hash Join" in Trafodion?
> ----------------------------------------------------
>
>                 Key: TRAFODION-8
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-8
>             Project: Apache Trafodion
>          Issue Type: Brainstorming
>          Components: sql-general
>         Environment: Any
>            Reporter: Hans Zeller
>            Assignee: Hans Zeller
>            Priority: Minor
>
> Here are a few hash-join related things we have been discussing for the past few years under the umbrella term "Orchestrated Hash Join", because they mostly relate to how multiple hash joins in a single query could be orchestrated to get better overall performance.
> One proposal is to give all the memory available to the query to the bottom part of a join tree (bottom-weighted join). Here is such a join tree:
> {noformat}
>               HJ5
>              /   \
>            HJ4   T6
>           /   \
>         HJ3   T5
>        /   \
>      HJ2   T4
>     /   \
>   HJ1   T3
>  /   \
> T1   T2
> {noformat}
> Note that we draw the building table (the table stored in memory) on the right side and the probing table on the left. With enough memory, this left-linear tree is a very efficient execution plan, as all building tables can be read into memory at the same time and the probing can happen in parallel for all the joins.
> Now, if we had only 1/3rd of the needed memory, we could give each join 1/3rd of what it needs, and let the remaining 2/3rds overflow.
> In phase 2 of the join, only 1/3rd of the rows from T1 would then be produced by HJ1, because only 1/3rd of T2 would be in memory. Only 1/9th of the total result rows would be produced by HJ2, and so on, with only 1/243rd of the total rows produced by HJ5. This is a very inefficient way to utilize memory.
> A better strategy would be to give HJ1 and HJ2 all of their required memory (assuming all the joins need equal amounts of memory) and to let HJ3 overflow completely to disk. HJ4 and HJ5 can be delayed until later.
> Now, after building the in-memory hash tables in phase 1, phase 2 of HJ1 and HJ2 will produce 100% of the result, which whill all overflow in HJ3.
> Once that is complete, we can free the memory used by HJ1 and HJ2 and give it to HJ3 and HJ4. Now HJ5 overflows completely to disk. When that is complete, we can finish HJ5.
> By concentrating the memory into a few joins at the bottom, we are utilizing that memory better, since the left child of these joins will produce 100% of their resulting rows, ensuring that we use the in-memory table to do a lot of probes.
> There are other optimizations on how to orchestrate multiple joins:
> A) Make bit vector filters and min/max predicates from the rows read
>    from the building table and use those as predicates on the probing
>    table. This requires reading the building table first and delaying
>    the request to the probing table until after the end of the building
>    phase.
> B) Dynamically switch building and probing tables when it turns out that
>    the probing table is smaller than the building table. This requires
>    reading a portion of the probing table before completing the building
>    table.
> Issues with these techniques:
> Since each of these introduce a delay for parts of the multi-way hash join, they may in some cases slow down the entire process. Heuristics will be needed to find the optimal sequence or "orchestration" of the whole process.
> Since many joins will execute in parallel, possibly with exchange operators between the joins, each parallel instance will need to do the orchestration in the same way, to avoid potential deadlocks. Predicates from multiple parallel probing table instances will need to be merged for the building tables.
> With larger an larger main memories, it is questionable whether such techniques are needed at all or whether we should focus on in-memory algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)