You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Shant Hovsepian (Jira)" <ji...@apache.org> on 2020/04/17 14:10:00 UTC
[jira] [Updated] (IMPALA-2929) Add a hint to eliminate self-join
for BI systems that don't support nested types yet
[ https://issues.apache.org/jira/browse/IMPALA-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shant Hovsepian updated IMPALA-2929:
------------------------------------
Labels: complextype nested_types performance query_generator supportability usability (was: nested_types performance query_generator supportability usability)
> Add a hint to eliminate self-join for BI systems that don't support nested types yet
> ------------------------------------------------------------------------------------
>
> Key: IMPALA-2929
> URL: https://issues.apache.org/jira/browse/IMPALA-2929
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: impala 2.3
> Environment: CDH 5.5; Impala 2.3
> Reporter: Ruslan Dautkhanov
> Priority: Minor
> Labels: complextype, nested_types, performance, query_generator, supportability, usability
>
> We tested today a workaround for a BI tool to fetch nested data - created an Impala view for a table with nested collection.
> OBIEE (and many other BI tools) require to define "dimension" tables (like a customer dimensions), and facts (for example customer transactions). With a table that has a nested collection (fact), all non-nested attributes make up a customer dimension. And nested collection is a fact.
> So when we created a view that, e.g.
> {code:sql}
> CREATE VIEW adash_mrt.amf_trans_fact_vw
> AS
> select fact.individ, trans.*
> from adash_mrt.amf_tsp_fact fact
> , fact.amft trans;
> {code}
> And presented amf_trans_fact_vw to OBIEE as a fact, and adash_mrt.amf_tsp_fact as a customer dimension.
> So BI tools like OBIEE then always would try to join that dimension and a fact, for example like
> {code:sql}
> SELECT count(dim.individ), sum(trans.activity_dollars)
> FROM amf_tsp_fact dim, amf_trans_fact_vw trans
> WHERE dim.individ = trans.individ
> and trans.activity_type='M'
> and dim.tsp_gender='2'
> and dim.final_ace_city='SUPERIOR';
> {code}
> Then Impala would actually self-join that table amf_tsp_fact with a nested collection amft (!), rendering all benefits of nested collection useless. Because main benefit is avoiding expensive joins, but Impala's query optimizer not that smart to avoid that self-join..
> This JIRA is a suggestion to introduce a query-level hint or a new "set" session-level parameter that would eliminate such a self-join for BI tools that do not support nested types directly. It might be a hint in that CREATE VIEW query too.
> Here's explain plan for the above query:
> {noformat}
> 10:AGGREGATE [FINALIZE]
> | output: count:merge(dim.individ), sum:merge(CAST(dim.activity_total_dollars_spent AS BIGINT)), sum:merge(trans.activity_dollars)
> |
> 09:EXCHANGE [UNPARTITIONED]
> |
> 07:AGGREGATE
> | output: count(dim.individ), sum(trans.item.activity_dollars)
> |
> 06:HASH JOIN [INNER JOIN, BROADCAST]
> | hash predicates: fact.individ = dim.individ
> |
> |--08:EXCHANGE [BROADCAST]
> | |
> | 00:SCAN HDFS [adash_mrt.amf_tsp_fact dim]
> | partitions=1/1 files=6500 size=291.37GB
> | predicates: dim.tsp_gender = '2', dim.final_ace_city = 'SUPERIOR'
> |
> 02:SUBPLAN
> |
> |--05:NESTED LOOP JOIN [CROSS JOIN]
> | |
> | |--03:SINGULAR ROW SRC
> | |
> | 04:UNNEST [fact.amft trans]
> |
> 01:SCAN HDFS [adash_mrt.amf_tsp_fact fact]
> partitions=1/1 files=6500 size=291.37GB
> predicates on trans: trans.item.activity_type = 'M'
> {noformat}
> Notice HASH JOIN.
> We implemented nested data types at the data lake layer, but stuck with BI tools because they don't support nested collections yet. It seems a low hanging fruit to close the gap.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org