You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Shant Hovsepian (Jira)" <ji...@apache.org> on 2020/04/17 14:10:00 UTC

[jira] [Updated] (IMPALA-2929) Add a hint to eliminate self-join for BI systems that don't support nested types yet

     [ https://issues.apache.org/jira/browse/IMPALA-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shant Hovsepian updated IMPALA-2929:
------------------------------------
    Labels: complextype nested_types performance query_generator supportability usability  (was: nested_types performance query_generator supportability usability)

> Add a hint to eliminate self-join for BI systems that don't support nested types yet
> ------------------------------------------------------------------------------------
>
>                 Key: IMPALA-2929
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2929
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: impala 2.3
>         Environment: CDH 5.5; Impala 2.3
>            Reporter: Ruslan Dautkhanov
>            Priority: Minor
>              Labels: complextype, nested_types, performance, query_generator, supportability, usability
>
> We tested today a workaround for a BI tool to fetch nested data - created an Impala view for a table with nested collection.
> OBIEE (and many other BI tools) require to define "dimension" tables (like a customer dimensions), and facts (for example customer transactions). With a table that has a nested collection (fact), all non-nested attributes make up a customer dimension. And nested collection is a fact.
> So when we created a view that, e.g.
> {code:sql}
> CREATE VIEW adash_mrt.amf_trans_fact_vw 
> AS
> select fact.individ, trans.*
> from adash_mrt.amf_tsp_fact fact
>    , fact.amft trans;
> {code}
> And presented amf_trans_fact_vw to OBIEE as a fact, and adash_mrt.amf_tsp_fact as a customer dimension. 
> So BI tools like OBIEE then always would try to join that dimension and a fact, for example like 
> {code:sql}
> SELECT count(dim.individ), sum(trans.activity_dollars)
> FROM amf_tsp_fact dim, amf_trans_fact_vw trans
> WHERE dim.individ = trans.individ 
>       and trans.activity_type='M'  
>       and dim.tsp_gender='2'  
>       and dim.final_ace_city='SUPERIOR';
> {code}
> Then Impala would actually self-join that table amf_tsp_fact with a nested collection amft (!), rendering all benefits of nested collection useless. Because main benefit is avoiding expensive joins, but Impala's query optimizer not that smart to avoid that self-join.. 
> This JIRA is a suggestion to introduce a query-level hint or a new "set" session-level parameter that would eliminate such a self-join for BI tools that do not support nested types directly. It might be a hint in that CREATE VIEW query too.
> Here's explain plan for the above query:
> {noformat}
> 10:AGGREGATE [FINALIZE]
> |  output: count:merge(dim.individ), sum:merge(CAST(dim.activity_total_dollars_spent AS BIGINT)), sum:merge(trans.activity_dollars)
> |
> 09:EXCHANGE [UNPARTITIONED]
> |
> 07:AGGREGATE
> |  output: count(dim.individ), sum(trans.item.activity_dollars)
> |
> 06:HASH JOIN [INNER JOIN, BROADCAST]
> |  hash predicates: fact.individ = dim.individ
> |
> |--08:EXCHANGE [BROADCAST]
> |  |
> |  00:SCAN HDFS [adash_mrt.amf_tsp_fact dim]
> |     partitions=1/1 files=6500 size=291.37GB
> |     predicates: dim.tsp_gender = '2', dim.final_ace_city = 'SUPERIOR'
> |
> 02:SUBPLAN
> |
> |--05:NESTED LOOP JOIN [CROSS JOIN]
> |  |
> |  |--03:SINGULAR ROW SRC
> |  |
> |  04:UNNEST [fact.amft trans]
> |
> 01:SCAN HDFS [adash_mrt.amf_tsp_fact fact]
>    partitions=1/1 files=6500 size=291.37GB
>    predicates on trans: trans.item.activity_type = 'M'
> {noformat}
> Notice HASH JOIN. 
> We implemented nested data types at the data lake layer, but stuck with BI tools because they don't support nested collections yet. It seems a low hanging fruit to close the gap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org