You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Gabor Kaszab (Jira)" <ji...@apache.org> on 2021/09/15 07:30:00 UTC
[jira] [Commented] (IMPALA-10809) improve the performance of unnest operation

    [ https://issues.apache.org/jira/browse/IMPALA-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415354#comment-17415354 ] 

Gabor Kaszab commented on IMPALA-10809:
---------------------------------------

Anyway, just for the record I started putting together a design doc for this with multiple implementation proposals.
https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing

> improve the performance of unnest operation
> -------------------------------------------
>
>                 Key: IMPALA-10809
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10809
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: pengdou1990
>            Assignee: pengdou1990
>            Priority: Minor
>              Labels: complextype
>
> h2. current situation
> Impala's support for complex data types is not particularly friendly.
> For example, if you need to expand rows containing Array type fields, you need to unnest the array fields first, and then do a nested loop join.
> If you need to expand multiple array fields, you need to do multiple unnests, And perform multiple unest and nested loop joins, which puts a lot of computational pressure on the executor.
> DDL:
> {code:java}
> CREATE TABLE rawdata.users2 (                                     
>  day INT,                                                        
>  sampling_group INT,                                             
>  user_id BIGINT,                                                 
>   time TIMESTAMP,                                                 
>  _offset BIGINT,                                                 
>  event_id INT,                                                   
>  month_id INT,                                                   
>  week_id INT,                                                    
>  distinct_id STRING,                                             
>  event_bucket INT,                                               
>  adresses_list_string ARRAY<STRING>,                             
>  count_list_bigint ARRAY<BIGINT>                                 
>  )                                                                 
>  WITH SERDEPROPERTIES ('serialization.format'='1')                 
>  STORED AS PARQUET                                                 
>  LOCATION 'hdfs://localhost:20500/test-warehouse/rawdata.db/users2'{code}
>  Query SQL:
> {code:java}
> SELECT
>     `day`,
>     list`.item,
>    list1.item 
>  FROM
>    rawdata.users2,
>    rawdata.users2.adresses_list_string list1,
>    rawdata.users2.count_list_bigint list2{code}
>  Simplified Plan:
>  
> {code:java}
> F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |
> 07:EXCHANGE [UNPARTITIONED]
> |
> 01:SUBPLAN
> |
> |--06:NESTED LOOP JOIN [CROSS JOIN]
> | |
> | |--04:UNNEST [users2.count_list_bigint clist]
> | |
> | 05:NESTED LOOP JOIN [CROSS JOIN]
> | |
> | |--02:SINGULAR ROW SRC
> | |
> | 03:UNNEST [users2.adresses_list_string list]
> |
> 00:SCAN HDFS [rawdata.users2, RANDOM]
>  {code}
> h2. Improve Solution
> In actual use, I found that if some changes are made to the calculation logic of unnest, the calculation performance will be greatly improved:
> At first, in FE construct and new plan type, named explode node, it and it's child node construct a pipeline operation
> then, in BE, the raw was explode locally, and the fileds layout as childnode
> the query sql and Plan greatly simplified：
> Query SQL:
> {code:java}
> SELECT
>     `day`,
>    explode(adresses_list_string),
>    explode(count_list_bigint) 
>  from
>    rawdata.users2{code}
>  the simplified Plan as this：
> {code:java}
> F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |
> 02:EXCHANGE [UNPARTITIONED]
> |
> 01:EXPLODE NODE [UNPARTITIONED]
> |
> 00:SCAN HDFS [rawdata.users2, RANDOM] {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org