Posted to issues-all@impala.apache.org by "pengdou1990 (Jira)" <ji...@apache.org> on 2021/07/20 08:08:00 UTC

[jira] [Updated] (IMPALA-10809) improve the performance of unnest operation

     [ https://issues.apache.org/jira/browse/IMPALA-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pengdou1990 updated IMPALA-10809:
---------------------------------
    Description: 
h2. current situation

Impala's support for complex data types is not particularly friendly.

For example, to expand rows containing ARRAY-typed fields, you must first unnest the array fields and then perform a nested loop join.

To expand multiple array fields, you must perform multiple unnests and multiple nested loop joins, which puts significant computational pressure on the executor.

DDL:
{code:java}
CREATE TABLE rawdata.users2 (
  day INT,
  sampling_group INT,
  user_id BIGINT,
  time TIMESTAMP,
  _offset BIGINT,
  event_id INT,
  month_id INT,
  week_id INT,
  distinct_id STRING,
  event_bucket INT,
  adresses_list_string ARRAY<STRING>,
  count_list_bigint ARRAY<BIGINT>
)
WITH SERDEPROPERTIES ('serialization.format'='1')
STORED AS PARQUET
LOCATION 'hdfs://localhost:20500/test-warehouse/rawdata.db/users2'{code}

 Query SQL:
{code:java}
SELECT
    `day`,
    list1.item,
    list2.item
 FROM
   rawdata.users2,
   rawdata.users2.adresses_list_string list1,
   rawdata.users2.count_list_bigint list2{code}

Simplified Plan:
{code:java}
F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|
07:EXCHANGE [UNPARTITIONED]
|
01:SUBPLAN
|
|--06:NESTED LOOP JOIN [CROSS JOIN]
|  |
|  |--04:UNNEST [users2.count_list_bigint clist]
|  |
|  05:NESTED LOOP JOIN [CROSS JOIN]
|  |
|  |--02:SINGULAR ROW SRC
|  |
|  03:UNNEST [users2.adresses_list_string list]
|
00:SCAN HDFS [rawdata.users2, RANDOM]
{code}
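For intuition, the row expansion that the two unnests plus cross joins above perform can be sketched in Python (a hypothetical illustration of the semantics only, not Impala code; the row values are made up):

```python
from itertools import product

# One source row with two array-typed columns, mirroring rawdata.users2.
row = {
    "day": 20210720,
    "adresses_list_string": ["addr_a", "addr_b"],
    "count_list_bigint": [1, 2, 3],
}

# Unnesting both arrays and cross-joining them yields one output row
# per (address, count) pair: |list1| * |list2| rows per input row.
expanded = [
    (row["day"], addr, cnt)
    for addr, cnt in product(row["adresses_list_string"],
                             row["count_list_bigint"])
]

print(len(expanded))  # 2 * 3 = 6 rows
```

This multiplicative blow-up, driven through a SUBPLAN with per-row nested loop joins, is where the executor spends its effort.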
h2. Proposed Improvement

In actual use, I found that if the computation logic of unnest is changed as follows, performance improves significantly:

First, in the FE, construct a new plan node type, the explode node; it and its child node form a single pipelined operation.

Then, in the BE, each row is exploded locally, and the output fields keep the child node's layout.

The query SQL and plan are greatly simplified:

Query SQL:
{code:java}
SELECT
    `day`,
    explode(adresses_list_string),
    explode(count_list_bigint)
FROM
    rawdata.users2{code}

The simplified plan:
{code:java}
F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|
02:EXCHANGE [UNPARTITIONED]
|
01:EXPLODE NODE [UNPARTITIONED]
|
00:SCAN HDFS [rawdata.users2, RANDOM] {code}
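The explode node's behavior can be sketched as a single streaming pass over the scan output, with no join nodes (again a hypothetical Python illustration; the exact output semantics for multiple explodes in one SELECT would need to be pinned down in the design, and a cross product of the arrays is assumed here to match the original query):

```python
from itertools import product

def explode_rows(rows, array_cols):
    """Hypothetical explode node: expand all listed array columns of each
    input row in one pipelined pass, emitting the cross product of their
    elements alongside the row's scalar columns."""
    for row in rows:
        scalar = {k: v for k, v in row.items() if k not in array_cols}
        for combo in product(*(row[c] for c in array_cols)):
            out = dict(scalar)
            out.update(zip(array_cols, combo))
            yield out

rows = [{"day": 1, "adresses_list_string": ["a", "b"],
         "count_list_bigint": [10, 20]}]
result = list(explode_rows(rows, ["adresses_list_string",
                                  "count_list_bigint"]))
print(len(result))  # 2 * 2 = 4 rows
```

Because the expansion happens locally inside one node, the SUBPLAN and per-row nested loop joins disappear from the plan entirely.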
 


> improve the performance of unnest operation
> -------------------------------------------
>
>                 Key: IMPALA-10809
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10809
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: pengdou1990
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org