Posted to issues-all@impala.apache.org by "Aman Sinha (Jira)" <ji...@apache.org> on 2020/09/18 23:17:00 UTC

[jira] [Commented] (IMPALA-10179) After inverting a join's inputs the join's parallelism does not get reset

    [ https://issues.apache.org/jira/browse/IMPALA-10179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198580#comment-17198580 ] 

Aman Sinha commented on IMPALA-10179:
-------------------------------------

This is related to IMPALA-5612, although that issue was about fixing the costing to account for parallelism when deciding whether or not to invert.
Based on an initial look, it seems the fix for this issue is simply to re-compute the join's stats after the inversion is done.
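As a rough illustration of the idea (not Impala's actual planner classes; all names and numbers below are hypothetical stand-ins, with cardinalities rounded from the explain output in the description), the sketch shows a join node that derives its hosts/instances from its probe child, and why an explicit re-compute is needed after the inputs are swapped:

{noformat}
// Toy sketch only: a join's parallelism follows its probe (left) input,
// so after inverting the inputs the join must re-derive hosts/instances
// from the *new* probe side instead of keeping the old values.
class PlanNodeSketch {
    long cardinality;
    int numNodes;      // hosts the node runs on
    int numInstances;  // fragment instances

    PlanNodeSketch(long cardinality, int numNodes, int numInstances) {
        this.cardinality = cardinality;
        this.numNodes = numNodes;
        this.numInstances = numInstances;
    }
}

class JoinNodeSketch extends PlanNodeSketch {
    PlanNodeSketch probe;  // left input
    PlanNodeSketch build;  // right input

    JoinNodeSketch(PlanNodeSketch probe, PlanNodeSketch build) {
        super(-1, -1, -1);
        this.probe = probe;
        this.build = build;
        computeStats();
    }

    // Parallelism is inherited from the probe side.
    void computeStats() {
        numNodes = probe.numNodes;
        numInstances = probe.numInstances;
    }

    // The bug: swapping the inputs without re-computing leaves the join with the
    // parallelism of the old probe side. The extra computeStats() call is the fix.
    void invertInputs() {
        PlanNodeSketch tmp = probe;
        probe = build;
        build = tmp;
        computeStats();  // re-derive hosts/instances from the new probe input
    }

    public static void main(String[] args) {
        // Approximate values from the plan below:
        // store_returns ~287.51K rows on 1 host, store_sales ~2.88M rows on 3 hosts.
        PlanNodeSketch storeReturns = new PlanNodeSketch(287_510L, 1, 1);
        PlanNodeSketch storeSales = new PlanNodeSketch(2_880_000L, 3, 3);
        JoinNodeSketch join = new JoinNodeSketch(storeReturns, storeSales);
        join.invertInputs();
        // Prints hosts=3 instances=3; without the recompute it would stay at 1/1.
        System.out.println("hosts=" + join.numNodes + " instances=" + join.numInstances);
    }
}
{noformat}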

> After inverting a join's inputs the join's parallelism does not get reset
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-10179
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10179
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.4.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>
> In the following query, the left semi join gets flipped to a right semi join due to the cardinalities of the tables, but the parallelism of the HashJoin fragment (see fragment F01) remains hosts=1, instances=1. The correct behavior is to inherit the parallelism of the new probe input, store_sales, i.e. hosts=3, instances=3, to avoid under-parallelizing the HashJoin.
> {noformat}
> [localhost:21000] default> set explain_level=2;
> EXPLAIN_LEVEL set to 2
> [localhost:21000] default> use tpcds_parquet;
> Query: use tpcds_parquet
> [localhost:21000] tpcds_parquet> explain select count(*) from store_returns where sr_customer_sk in (select ss_customer_sk from store_sales);
> Query: explain select count(*) from store_returns where sr_customer_sk in (select ss_customer_sk from store_sales)
> Max Per-Host Resource Reservation: Memory=10.31MB Threads=6
> Per-Host Resource Estimates: Memory=85MB
> Analyzed query: SELECT count(*) FROM tpcds_parquet.store_returns LEFT SEMI JOIN
> (SELECT ss_customer_sk FROM tpcds_parquet.store_sales) `$a$1` (`$c$1`) ON
> sr_customer_sk = `$a$1`.`$c$1`
> ""
> F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |  Per-Host Resources: mem-estimate=10.02MB mem-reservation=0B thread-reservation=1
> PLAN-ROOT SINK
> |  output exprs: count(*)
> |  mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> 09:AGGREGATE [FINALIZE]
> |  output: count:merge(*)
> |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
> |  tuple-ids=3 row-size=8B cardinality=1
> |  in pipelines: 09(GETNEXT), 04(OPEN)
> |
> 08:EXCHANGE [UNPARTITIONED]
> |  mem-estimate=16.00KB mem-reservation=0B thread-reservation=0
> |  tuple-ids=3 row-size=8B cardinality=1
> |  in pipelines: 04(GETNEXT)
> |
> F01:PLAN FRAGMENT [HASH(tpcds_parquet.store_sales.ss_customer_sk)] hosts=1 instances=1
> Per-Host Resources: mem-estimate=23.88MB mem-reservation=5.81MB thread-reservation=1 runtime-filters-memory=1.00MB
> 04:AGGREGATE
> |  output: count(*)
> |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
> |  tuple-ids=3 row-size=8B cardinality=1
> |  in pipelines: 04(GETNEXT), 06(OPEN)
> |
> 03:HASH JOIN [RIGHT SEMI JOIN, PARTITIONED]
> |  hash predicates: tpcds_parquet.store_sales.ss_customer_sk = sr_customer_sk
> |  runtime filters: RF000[bloom] <- sr_customer_sk
> |  mem-estimate=2.88MB mem-reservation=2.88MB spill-buffer=128.00KB thread-reservation=0
> |  tuple-ids=0 row-size=4B cardinality=287.51K
> |  in pipelines: 06(GETNEXT), 00(OPEN)
> |
> |--07:EXCHANGE [HASH(sr_customer_sk)]
> |  |  mem-estimate=1.10MB mem-reservation=0B thread-reservation=0
> |  |  tuple-ids=0 row-size=4B cardinality=287.51K
> |  |  in pipelines: 00(GETNEXT)
> |  |
> |  F02:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> |  Per-Host Resources: mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=2
> |  00:SCAN HDFS [tpcds_parquet.store_returns, RANDOM]
> |     HDFS partitions=1/1 files=1 size=15.43MB
> |     stored statistics:
> |       table: rows=287.51K size=15.43MB
> |       columns: all
> |     extrapolated-rows=disabled max-scan-range-rows=287.51K
> |     mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=1
> |     tuple-ids=0 row-size=4B cardinality=287.51K
> |     in pipelines: 00(GETNEXT)
> |
> 06:AGGREGATE [FINALIZE]
> |  group by: tpcds_parquet.store_sales.ss_customer_sk
> |  mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB thread-reservation=0
> |  tuple-ids=6 row-size=4B cardinality=90.63K
> |  in pipelines: 06(GETNEXT), 01(OPEN)
> |
> 05:EXCHANGE [HASH(tpcds_parquet.store_sales.ss_customer_sk)]
> |  mem-estimate=142.01KB mem-reservation=0B thread-reservation=0
> |  tuple-ids=6 row-size=4B cardinality=90.63K
> |  in pipelines: 01(GETNEXT)
> |
> F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
> Per-Host Resources: mem-estimate=27.00MB mem-reservation=3.50MB thread-reservation=2 runtime-filters-memory=1.00MB
> 02:AGGREGATE [STREAMING]
> |  group by: tpcds_parquet.store_sales.ss_customer_sk
> |  mem-estimate=10.00MB mem-reservation=2.00MB spill-buffer=64.00KB thread-reservation=0
> |  tuple-ids=6 row-size=4B cardinality=90.63K
> |  in pipelines: 01(GETNEXT)
> |
> 01:SCAN HDFS [tpcds_parquet.store_sales, RANDOM]
>    HDFS partitions=1824/1824 files=1824 size=200.95MB
>    runtime filters: RF000[bloom] -> tpcds_parquet.store_sales.ss_customer_sk
>    stored statistics:
>      table: rows=2.88M size=200.95MB
>      partitions: 1824/1824 rows=2.88M
>      columns: all
>    extrapolated-rows=disabled max-scan-range-rows=130.09K
>    mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1
>    tuple-ids=1 row-size=4B cardinality=2.88M
>    in pipelines: 01(GETNEXT)
> {noformat}
> The same behavior is seen for inner joins as well, but to reproduce it I have to comment out the 'ordering' of tables for the joins in order to force an initial join order that is sub-optimal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
