Posted to issues@hive.apache.org by "Jesus Camacho Rodriguez (JIRA)" <ji...@apache.org> on 2018/09/23 02:43:00 UTC

[jira] [Work started] (HIVE-20623) Shared work: Extend sharing of map-join cache entries in LLAP

     [ https://issues.apache.org/jira/browse/HIVE-20623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HIVE-20623 started by Jesus Camacho Rodriguez.
------------------------------------------------------
> Shared work: Extend sharing of map-join cache entries in LLAP
> -------------------------------------------------------------
>
>                 Key: HIVE-20623
>                 URL: https://issues.apache.org/jira/browse/HIVE-20623
>             Project: Hive
>          Issue Type: Improvement
>          Components: llap, Logical Optimizer
>            Reporter: Gopal V
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>         Attachments: hash-shared-work.json.txt, hash-shared-work.svg
>
>
> For a query like this
> {code}
> with all_sales as (
> select ss_customer_sk as customer_sk, ss_ext_list_price-ss_ext_discount_amt as ext_price from store_sales
> UNION ALL
> select ws_bill_customer_sk as customer_sk, ws_ext_list_price-ws_ext_discount_amt as ext_price from web_sales
> UNION ALL
> select cs_bill_customer_sk as customer_sk, cs_ext_sales_price - cs_ext_discount_amt as ext_price from catalog_sales)
> select sum(ext_price) total_price, c_customer_id from all_sales, customer 
> where customer_sk = c_customer_sk
> group by c_customer_id
> order by total_price desc 
> limit 100;
> {code}
> The hashtables used for all 3 joins are identical, but each one is loaded separately (3 times in total) in the same LLAP instance because they are named
> {code}
>     cacheKey = "HASH_MAP_" + this.getOperatorId() + "_container";
> {code}
> in the cache.
> If those are identical in nature (i.e. vectorization, hashtable type, etc.), then the duplication is just wasted CPU, memory and network; using the same cache name for hashtables that will be identical in layout would be extremely useful.
> In cases where the join is pushed through a UNION, those are identical.
> This optimization can only be done without concern for accidental delays when the same upstream task is generating all of these hashtables, which is what is achieved by the shared scan optimizer already.
> If the shared work is not present, this has potential downsides: if two customer broadcasts were sourced from "Map 1" and "Map 2", the Map 1 builder would block the other task from reading from Map 2, even though Map 2 might have started later but finished ahead of Map 1.
> So this specific optimization can always be considered for cases where the shared work unifies the operator tree and the parents of all the RS entries involved are the same (& the RS layout is the same).
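
A minimal sketch of the idea described above, not Hive code: it compares the per-operator cache key quoted in the description with a hypothetical shared key derived from the parent task and the RS layout. The class, the JoinOp type, the operator IDs, the layout string and the sharedKey format are all assumptions made for illustration; only the "HASH_MAP_" + operatorId + "_container" pattern comes from the description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class SharedHashTableCacheSketch {

    // Hypothetical stand-in for a map-join operator; not a Hive class.
    static final class JoinOp {
        final String operatorId; // unique per operator
        final String parentId;   // upstream task that builds the hashtable
        final String rsLayout;   // description of the RS layout

        JoinOp(String operatorId, String parentId, String rsLayout) {
            this.operatorId = operatorId;
            this.parentId = parentId;
            this.rsLayout = rsLayout;
        }
    }

    // Key pattern quoted in the description: one cache entry per operator.
    static String perOperatorKey(JoinOp op) {
        return "HASH_MAP_" + op.operatorId + "_container";
    }

    // Hypothetical shared key: operators with the same parent and the same
    // RS layout map to the same cache entry. The format is an assumption.
    static String sharedKey(JoinOp op) {
        return "HASH_MAP_" + op.parentId + "_" + op.rsLayout + "_container";
    }

    // Three joins against the same broadcast side (illustrative values).
    static List<JoinOp> exampleOps() {
        return Arrays.asList(
            new JoinOp("MAPJOIN_1", "Map 1", "key=c_customer_sk"),
            new JoinOp("MAPJOIN_2", "Map 1", "key=c_customer_sk"),
            new JoinOp("MAPJOIN_3", "Map 1", "key=c_customer_sk"));
    }

    // Counts how many times a hashtable would be loaded for the given ops
    // when cache entries are named by keyFn.
    static int countLoads(List<JoinOp> ops, Function<JoinOp, String> keyFn) {
        Map<String, Object> cache = new ConcurrentHashMap<>();
        AtomicInteger loads = new AtomicInteger();
        for (JoinOp op : ops) {
            // The loader runs only when the key is not cached yet.
            cache.computeIfAbsent(keyFn.apply(op), k -> {
                loads.incrementAndGet();
                return new Object(); // placeholder for the loaded hashtable
            });
        }
        return loads.get();
    }

    public static void main(String[] args) {
        List<JoinOp> ops = exampleOps();
        System.out.println("per-operator keys: "
            + countLoads(ops, SharedHashTableCacheSketch::perOperatorKey) + " loads");
        System.out.println("shared key: "
            + countLoads(ops, SharedHashTableCacheSketch::sharedKey) + " loads");
    }
}
```

With per-operator keys each of the three joins triggers its own load; with the shared key only the first one does. If the joins had different parents (for example "Map 1" and "Map 2"), the shared key in this sketch would differ as well, which matches the restriction stated in the description.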



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)