You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Aman Sinha (Jira)" <ji...@apache.org> on 2020/09/22 16:28:00 UTC
[jira] [Resolved] (IMPALA-10179) After inverting a join's inputs
the join's parallelism does not get reset
[ https://issues.apache.org/jira/browse/IMPALA-10179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aman Sinha resolved IMPALA-10179.
---------------------------------
Fix Version/s: Impala 4.0
Resolution: Fixed
> After inverting a join's inputs the join's parallelism does not get reset
> -------------------------------------------------------------------------
>
> Key: IMPALA-10179
> URL: https://issues.apache.org/jira/browse/IMPALA-10179
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.4.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Priority: Major
> Fix For: Impala 4.0
>
>
> In the following query, the left semi join gets flipped to a right semi join due to the cardinality of the tables but the parallelism of the HashJoin fragment (see Fragment F01) remains as hosts=1, instances=1. The right behavior should be to inherit the parallelism of the new probe input table store_sales, so it should be hosts=3, instances=3 to avoid under-parallelizing the HashJoin.
> {noformat}
> [localhost:21000] default> set explain_level=2;
> EXPLAIN_LEVEL set to 2
> [localhost:21000] default> use tpcds_parquet;
> Query: use tpcds_parquet
> [localhost:21000] tpcds_parquet> explain select count(*) from store_returns where sr_customer_sk in (select ss_customer_sk from store_sales);
> Query: explain select count(*) from store_returns where sr_customer_sk in (select ss_customer_sk from store_sales)
> Max Per-Host Resource Reservation: Memory=10.31MB Threads=6
> Per-Host Resource Estimates: Memory=85MB
> Analyzed query: SELECT count(*) FROM tpcds_parquet.store_returns LEFT SEMI JOIN
> (SELECT ss_customer_sk FROM tpcds_parquet.store_sales) `$a$1` (`$c$1`) ON
> sr_customer_sk = `$a$1`.`$c$1`
> ""
> F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> | Per-Host Resources: mem-estimate=10.02MB mem-reservation=0B thread-reservation=1
> PLAN-ROOT SINK
> | output exprs: count(*)
> | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> 09:AGGREGATE [FINALIZE]
> | output: count:merge(*)
> | mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
> | tuple-ids=3 row-size=8B cardinality=1
> | in pipelines: 09(GETNEXT), 04(OPEN)
> |
> 08:EXCHANGE [UNPARTITIONED]
> | mem-estimate=16.00KB mem-reservation=0B thread-reservation=0
> | tuple-ids=3 row-size=8B cardinality=1
> | in pipelines: 04(GETNEXT)
> |
> F01:PLAN FRAGMENT [HASH(tpcds_parquet.store_sales.ss_customer_sk)] hosts=1 instances=1
> Per-Host Resources: mem-estimate=23.88MB mem-reservation=5.81MB thread-reservation=1 runtime-filters-memory=1.00MB
> 04:AGGREGATE
> | output: count(*)
> | mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
> | tuple-ids=3 row-size=8B cardinality=1
> | in pipelines: 04(GETNEXT), 06(OPEN)
> |
> 03:HASH JOIN [RIGHT SEMI JOIN, PARTITIONED]
> | hash predicates: tpcds_parquet.store_sales.ss_customer_sk = sr_customer_sk
> | runtime filters: RF000[bloom] <- sr_customer_sk
> | mem-estimate=2.88MB mem-reservation=2.88MB spill-buffer=128.00KB thread-reservation=0
> | tuple-ids=0 row-size=4B cardinality=287.51K
> | in pipelines: 06(GETNEXT), 00(OPEN)
> |
> |--07:EXCHANGE [HASH(sr_customer_sk)]
> | | mem-estimate=1.10MB mem-reservation=0B thread-reservation=0
> | | tuple-ids=0 row-size=4B cardinality=287.51K
> | | in pipelines: 00(GETNEXT)
> | |
> | F02:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> | Per-Host Resources: mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=2
> | 00:SCAN HDFS [tpcds_parquet.store_returns, RANDOM]
> | HDFS partitions=1/1 files=1 size=15.43MB
> | stored statistics:
> | table: rows=287.51K size=15.43MB
> | columns: all
> | extrapolated-rows=disabled max-scan-range-rows=287.51K
> | mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=1
> | tuple-ids=0 row-size=4B cardinality=287.51K
> | in pipelines: 00(GETNEXT)
> |
> 06:AGGREGATE [FINALIZE]
> | group by: tpcds_parquet.store_sales.ss_customer_sk
> | mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB thread-reservation=0
> | tuple-ids=6 row-size=4B cardinality=90.63K
> | in pipelines: 06(GETNEXT), 01(OPEN)
> |
> 05:EXCHANGE [HASH(tpcds_parquet.store_sales.ss_customer_sk)]
> | mem-estimate=142.01KB mem-reservation=0B thread-reservation=0
> | tuple-ids=6 row-size=4B cardinality=90.63K
> | in pipelines: 01(GETNEXT)
> |
> F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
> Per-Host Resources: mem-estimate=27.00MB mem-reservation=3.50MB thread-reservation=2 runtime-filters-memory=1.00MB
> 02:AGGREGATE [STREAMING]
> | group by: tpcds_parquet.store_sales.ss_customer_sk
> | mem-estimate=10.00MB mem-reservation=2.00MB spill-buffer=64.00KB thread-reservation=0
> | tuple-ids=6 row-size=4B cardinality=90.63K
> | in pipelines: 01(GETNEXT)
> |
> 01:SCAN HDFS [tpcds_parquet.store_sales, RANDOM]
> HDFS partitions=1824/1824 files=1824 size=200.95MB
> runtime filters: RF000[bloom] -> tpcds_parquet.store_sales.ss_customer_sk
> stored statistics:
> table: rows=2.88M size=200.95MB
> partitions: 1824/1824 rows=2.88M
> columns: all
> extrapolated-rows=disabled max-scan-range-rows=130.09K
> mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1
> tuple-ids=1 row-size=4B cardinality=2.88M
> in pipelines: 01(GETNEXT)
> {noformat}
> The same behavior is seen for inner joins as well but to reproduce that I have to comment out the 'ordering' of tables for the joins to force creating an initial join order that is sub-optimal.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org