Posted to issues-all@impala.apache.org by "Aman Sinha (Jira)" <ji...@apache.org> on 2020/09/22 16:28:00 UTC

[jira] [Resolved] (IMPALA-10179) After inverting a join's inputs the join's parallelism does not get reset

     [ https://issues.apache.org/jira/browse/IMPALA-10179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha resolved IMPALA-10179.
---------------------------------
    Fix Version/s: Impala 4.0
       Resolution: Fixed

> After inverting a join's inputs the join's parallelism does not get reset
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-10179
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10179
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.4.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>             Fix For: Impala 4.0
>
>
> In the following query, the left semi join gets flipped to a right semi join based on the relative cardinalities of the tables, but the parallelism of the HashJoin fragment (see fragment F01) remains hosts=1, instances=1.  The correct behavior is to inherit the parallelism of the new probe input, store_sales, i.e. hosts=3, instances=3, so the HashJoin is not under-parallelized.
> {noformat}
> [localhost:21000] default> set explain_level=2;
> EXPLAIN_LEVEL set to 2
> [localhost:21000] default> use tpcds_parquet;
> Query: use tpcds_parquet
> [localhost:21000] tpcds_parquet> explain select count(*) from store_returns where sr_customer_sk in (select ss_customer_sk from store_sales);
> Query: explain select count(*) from store_returns where sr_customer_sk in (select ss_customer_sk from store_sales)
> Max Per-Host Resource Reservation: Memory=10.31MB Threads=6
> Per-Host Resource Estimates: Memory=85MB
> Analyzed query: SELECT count(*) FROM tpcds_parquet.store_returns LEFT SEMI JOIN
> (SELECT ss_customer_sk FROM tpcds_parquet.store_sales) `$a$1` (`$c$1`) ON
> sr_customer_sk = `$a$1`.`$c$1`
> ""
> F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |  Per-Host Resources: mem-estimate=10.02MB mem-reservation=0B thread-reservation=1
> PLAN-ROOT SINK
> |  output exprs: count(*)
> |  mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> 09:AGGREGATE [FINALIZE]
> |  output: count:merge(*)
> |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
> |  tuple-ids=3 row-size=8B cardinality=1
> |  in pipelines: 09(GETNEXT), 04(OPEN)
> |
> 08:EXCHANGE [UNPARTITIONED]
> |  mem-estimate=16.00KB mem-reservation=0B thread-reservation=0
> |  tuple-ids=3 row-size=8B cardinality=1
> |  in pipelines: 04(GETNEXT)
> |
> F01:PLAN FRAGMENT [HASH(tpcds_parquet.store_sales.ss_customer_sk)] hosts=1 instances=1
> Per-Host Resources: mem-estimate=23.88MB mem-reservation=5.81MB thread-reservation=1 runtime-filters-memory=1.00MB
> 04:AGGREGATE
> |  output: count(*)
> |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
> |  tuple-ids=3 row-size=8B cardinality=1
> |  in pipelines: 04(GETNEXT), 06(OPEN)
> |
> 03:HASH JOIN [RIGHT SEMI JOIN, PARTITIONED]
> |  hash predicates: tpcds_parquet.store_sales.ss_customer_sk = sr_customer_sk
> |  runtime filters: RF000[bloom] <- sr_customer_sk
> |  mem-estimate=2.88MB mem-reservation=2.88MB spill-buffer=128.00KB thread-reservation=0
> |  tuple-ids=0 row-size=4B cardinality=287.51K
> |  in pipelines: 06(GETNEXT), 00(OPEN)
> |
> |--07:EXCHANGE [HASH(sr_customer_sk)]
> |  |  mem-estimate=1.10MB mem-reservation=0B thread-reservation=0
> |  |  tuple-ids=0 row-size=4B cardinality=287.51K
> |  |  in pipelines: 00(GETNEXT)
> |  |
> |  F02:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> |  Per-Host Resources: mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=2
> |  00:SCAN HDFS [tpcds_parquet.store_returns, RANDOM]
> |     HDFS partitions=1/1 files=1 size=15.43MB
> |     stored statistics:
> |       table: rows=287.51K size=15.43MB
> |       columns: all
> |     extrapolated-rows=disabled max-scan-range-rows=287.51K
> |     mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=1
> |     tuple-ids=0 row-size=4B cardinality=287.51K
> |     in pipelines: 00(GETNEXT)
> |
> 06:AGGREGATE [FINALIZE]
> |  group by: tpcds_parquet.store_sales.ss_customer_sk
> |  mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB thread-reservation=0
> |  tuple-ids=6 row-size=4B cardinality=90.63K
> |  in pipelines: 06(GETNEXT), 01(OPEN)
> |
> 05:EXCHANGE [HASH(tpcds_parquet.store_sales.ss_customer_sk)]
> |  mem-estimate=142.01KB mem-reservation=0B thread-reservation=0
> |  tuple-ids=6 row-size=4B cardinality=90.63K
> |  in pipelines: 01(GETNEXT)
> |
> F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
> Per-Host Resources: mem-estimate=27.00MB mem-reservation=3.50MB thread-reservation=2 runtime-filters-memory=1.00MB
> 02:AGGREGATE [STREAMING]
> |  group by: tpcds_parquet.store_sales.ss_customer_sk
> |  mem-estimate=10.00MB mem-reservation=2.00MB spill-buffer=64.00KB thread-reservation=0
> |  tuple-ids=6 row-size=4B cardinality=90.63K
> |  in pipelines: 01(GETNEXT)
> |
> 01:SCAN HDFS [tpcds_parquet.store_sales, RANDOM]
>    HDFS partitions=1824/1824 files=1824 size=200.95MB
>    runtime filters: RF000[bloom] -> tpcds_parquet.store_sales.ss_customer_sk
>    stored statistics:
>      table: rows=2.88M size=200.95MB
>      partitions: 1824/1824 rows=2.88M
>      columns: all
>    extrapolated-rows=disabled max-scan-range-rows=130.09K
>    mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1
>    tuple-ids=1 row-size=4B cardinality=2.88M
>    in pipelines: 01(GETNEXT)
> {noformat}
> The same behavior is seen for inner joins as well, but to reproduce it I have to comment out the 'ordering' of tables for the joins in order to force an initial join order that is sub-optimal.
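>
> A minimal, hypothetical sketch of the expected behavior (the class and method names below are invented for illustration and are not the actual Impala frontend code): once the join's inputs are inverted, the join fragment should re-derive its parallelism from the new probe side (the old build side) rather than keep the value computed for the old probe side.
> {noformat}
> // Hypothetical sketch only; models the parallelism decision described above.
> class FragmentParallelism {
>     final int hosts;
>     final int instances;
>     FragmentParallelism(int hosts, int instances) {
>         this.hosts = hosts;
>         this.instances = instances;
>     }
>     public String toString() {
>         return "hosts=" + hosts + " instances=" + instances;
>     }
> }
>
> class JoinInversionSketch {
>     // Behavior reported in this issue: after the inputs are swapped, the join
>     // fragment keeps the parallelism derived from the *old* probe side.
>     static FragmentParallelism joinParallelismBeforeFix(FragmentParallelism oldProbe,
>                                                         FragmentParallelism oldBuild) {
>         return oldProbe;                          // stays hosts=1 instances=1
>     }
>
>     // Expected behavior: re-derive the join fragment's parallelism from the
>     // new probe side (the old build side) once the join is inverted.
>     static FragmentParallelism joinParallelismAfterFix(FragmentParallelism oldProbe,
>                                                        FragmentParallelism oldBuild) {
>         FragmentParallelism newProbe = oldBuild;
>         return new FragmentParallelism(newProbe.hosts, newProbe.instances);
>     }
>
>     public static void main(String[] args) {
>         FragmentParallelism storeReturnsScan = new FragmentParallelism(1, 1); // F02 in the plan
>         FragmentParallelism storeSalesScan   = new FragmentParallelism(3, 3); // F00 in the plan
>         System.out.println("F01 before fix: "
>                 + joinParallelismBeforeFix(storeReturnsScan, storeSalesScan));
>         System.out.println("F01 after fix:  "
>                 + joinParallelismAfterFix(storeReturnsScan, storeSalesScan));
>     }
> }
> {noformat}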



