Posted to issues-all@impala.apache.org by "Gabor Kaszab (Jira)" <ji...@apache.org> on 2021/09/15 07:30:00 UTC
[jira] [Commented] (IMPALA-10809) improve the performance of unnest operation
[ https://issues.apache.org/jira/browse/IMPALA-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415354#comment-17415354 ]
Gabor Kaszab commented on IMPALA-10809:
---------------------------------------
Anyway, just for the record, I have started putting together a design doc for this with multiple implementation proposals:
https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
> improve the performance of unnest operation
> -------------------------------------------
>
> Key: IMPALA-10809
> URL: https://issues.apache.org/jira/browse/IMPALA-10809
> Project: IMPALA
> Issue Type: Improvement
> Reporter: pengdou1990
> Assignee: pengdou1990
> Priority: Minor
> Labels: complextype
>
> h2. Current Situation
> Impala's support for complex data types is not particularly user-friendly.
> For example, to expand rows that contain ARRAY fields, you must first unnest each array field and then perform a nested loop join.
> To expand multiple array fields, you must perform multiple unnests and multiple nested loop joins, which puts a lot of computational pressure on the executor.
> DDL:
> {code:java}
> CREATE TABLE rawdata.users2 (
> day INT,
> sampling_group INT,
> user_id BIGINT,
> time TIMESTAMP,
> _offset BIGINT,
> event_id INT,
> month_id INT,
> week_id INT,
> distinct_id STRING,
> event_bucket INT,
> adresses_list_string ARRAY<STRING>,
> count_list_bigint ARRAY<BIGINT>
> )
> WITH SERDEPROPERTIES ('serialization.format'='1')
> STORED AS PARQUET
> LOCATION 'hdfs://localhost:20500/test-warehouse/rawdata.db/users2'{code}
> Query SQL:
> {code:java}
> SELECT
> `day`,
> list1.item,
> list2.item
> FROM
> rawdata.users2,
> rawdata.users2.adresses_list_string list1,
> rawdata.users2.count_list_bigint list2{code}
> Simplified Plan:
>
> {code:java}
> F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |
> 07:EXCHANGE [UNPARTITIONED]
> |
> 01:SUBPLAN
> |
> |--06:NESTED LOOP JOIN [CROSS JOIN]
> | |
> | |--04:UNNEST [users2.count_list_bigint clist]
> | |
> | 05:NESTED LOOP JOIN [CROSS JOIN]
> | |
> | |--02:SINGULAR ROW SRC
> | |
> | 03:UNNEST [users2.adresses_list_string list]
> |
> 00:SCAN HDFS [rawdata.users2, RANDOM]
> {code}
> h2. Proposed Improvement
> In practice, I found that changing how unnest is computed greatly improves performance:
> First, in the FE, construct a new plan node type, named explode node; it and its child node form a pipelined operation.
> Then, in the BE, each row is exploded locally, with the fields laid out as in the child node.
> The query SQL and the plan are greatly simplified.
> Query SQL:
> {code:java}
> SELECT
> `day`,
> explode(adresses_list_string),
> explode(count_list_bigint)
> FROM
> rawdata.users2{code}
> The simplified plan looks like this:
> {code:java}
> F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |
> 02:EXCHANGE [UNPARTITIONED]
> |
> 01:EXPLODE NODE [UNPARTITIONED]
> |
> 00:SCAN HDFS [rawdata.users2, RANDOM] {code}
>
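To make the comparison in the quoted description concrete, here is a minimal Python sketch (not Impala code) of the two evaluation strategies. `subplan_unnest` mirrors the current plan: for every scanned row, a singular row source is cross-joined with each unnested array in turn. `explode_node` is a hypothetical model of the proposed operator, which streams expanded rows straight from its child without building join inputs. The description does not say how two `explode()` calls in one select list combine, so the sketch assumes a per-row cross product to match the original query; the function names and sample data are illustrative only.

```python
from itertools import product
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, Any]
OutRow = Tuple[Any, ...]


def subplan_unnest(rows: Iterable[Row]) -> List[OutRow]:
    """Model of the current plan: per-row subplan with two nested loop cross joins."""
    out: List[OutRow] = []
    for row in rows:  # 01:SUBPLAN is evaluated once per scanned row
        joined: List[OutRow] = [(row["day"],)]  # 02:SINGULAR ROW SRC
        # 05:NESTED LOOP JOIN [CROSS JOIN] with the first unnested array
        joined = [l + (a,) for l in joined for a in row["adresses_list_string"]]
        # 06:NESTED LOOP JOIN [CROSS JOIN] with the second unnested array
        joined = [l + (c,) for l in joined for c in row["count_list_bigint"]]
        out.extend(joined)
    return out


def explode_node(child: Iterable[Row]) -> Iterator[OutRow]:
    """Hypothetical explode operator: expands each child row locally and streams the result.

    Assumption: two exploded arrays combine as a cross product, like the original query.
    """
    for row in child:
        for a, c in product(row["adresses_list_string"], row["count_list_bigint"]):
            yield (row["day"], a, c)


# Illustrative sample data; a row with an empty array produces no output in either model.
sample_rows: List[Row] = [
    {"day": 1, "adresses_list_string": ["a", "b"], "count_list_bigint": [10, 20]},
    {"day": 2, "adresses_list_string": [], "count_list_bigint": [30]},
    {"day": 3, "adresses_list_string": ["c"], "count_list_bigint": [40]},
]

if __name__ == "__main__":
    for out_row in explode_node(sample_rows):
        print(out_row)
```

Under this assumption both functions return the same rows in the same order; the difference is that the first materializes intermediate join results for every input row, while the second is a single pipelined pass over the child's output.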
--
This message was sent by Atlassian Jira
(v8.3.4#803005)