Posted to issues@impala.apache.org by "Riza Suminto (Jira)" <ji...@apache.org> on 2023/09/19 18:47:00 UTC
[jira] [Created] (IMPALA-12454) CompoundPredicate with AND operator can result in very low selectivity.
Riza Suminto created IMPALA-12454:
-------------------------------------
Summary: CompoundPredicate with AND operator can result in very low selectivity.
Key: IMPALA-12454
URL: https://issues.apache.org/jira/browse/IMPALA-12454
Project: IMPALA
Issue Type: Improvement
Components: Frontend
Affects Versions: Impala 4.2.0
Reporter: Riza Suminto
A CompoundPredicate with the AND operator estimates its selectivity by simply multiplying the selectivities of its child expressions.
[https://github.com/apache/impala/blob/3614a6a776819a1e918ce7fe833cd9e916d6002a/fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java#L174-L176]
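To illustrate why plain multiplication shrinks so quickly, here is a minimal standalone sketch. It is not the Impala code linked above; the class and method names are hypothetical, and it only assumes each child reports a selectivity in [0, 1].

```java
// Hypothetical standalone sketch; NOT the actual CompoundPredicate code.
final class NaiveAndSelectivity {
  // Multiplies the child selectivities, which implicitly assumes the
  // children are statistically independent of each other.
  static double combine(double... childSelectivities) {
    double result = 1.0;
    for (double s : childSelectivities) {
      result *= s;
    }
    return result;
  }

  public static void main(String[] args) {
    // Three children that each keep 10% of the rows combine to roughly 0.1%.
    System.out.println(combine(0.1, 0.1, 0.1));
  }
}
```

When the children are correlated, as the i_category, i_class and i_brand columns plausibly are, the product underestimates how many rows survive.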
This can produce a very low estimate, as happens in TPC-DS Q53.
{code:java}
| F01:PLAN FRAGMENT [RANDOM] hosts=4 instances=4
| Per-Instance Resources: mem-estimate=24.30MB mem-reservation=1.00MB thread-reservation=1
| 00:SCAN S3 [tpcds_3000_string_parquet_managed.item, RANDOM]
| S3 partitions=1/1 files=4 size=33.54MB
| predicates: ((i_category IN ('Books', 'Children', 'Electronics') AND i_class IN ('personal', 'portable', 'reference', 'self-help') AND i_brand IN ('scholaramalgamalg #14', 'scholaramalgamalg #7', 'exportiunivamalg #9', 'scholaramalgamalg #9')) OR (i_category IN ('Women', 'Music', 'Men') AND i_class IN ('accessories', 'classical', 'fragrances', 'pants') AND i_brand IN ('amalgimporto #1', 'edu packscholar #1', 'exportiimporto #1', 'importoamalg #1')))
| stored statistics:
| table: rows=360.00K size=33.54MB
| columns: all
| extrapolated-rows=disabled max-scan-range-rows=117.77K
| mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=0
| tuple-ids=0 row-size=74B cardinality=51
| in pipelines: 00(GETNEXT) {code}
The CompoundPredicate in 00:SCAN is estimated to be far too selective: the planner expects it to reduce 360K rows to just 51, while in reality the scan returns 18.53K rows.
{code:java}
| 00:SCAN S3 4 4 18.000ms 24.000ms 18.53K 51 2.31 MB 24.00 MB tpcds_3000_string_parquet_managed.item {code}
Selectivity estimation for this CompoundPredicate case should use an exponential backoff algorithm similar to the one in PlanNode.computeCombinedSelectivity().
[https://github.com/apache/impala/blob/3614a6a776819a1e918ce7fe833cd9e916d6002a/fe/src/main/java/org/apache/impala/planner/PlanNode.java#L730-L733]
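A rough sketch of the exponential backoff idea follows. It is written from the general technique rather than copied from PlanNode, so treat the class name, method name and exact exponents as assumptions: selectivities are sorted ascending, the most selective one is applied in full, and each later one is dampened by a root of increasing order.

```java
import java.util.Arrays;

// Hypothetical standalone sketch of exponential backoff; NOT the Impala code.
final class BackoffAndSelectivity {
  // Computes s1^(1/1) * s2^(1/2) * ... * sn^(1/n) over the selectivities
  // sorted in ascending order, so the result does not depend on input order.
  static double combine(double... childSelectivities) {
    double[] sorted = childSelectivities.clone();
    Arrays.sort(sorted); // ascending: most selective child first
    double result = 1.0;
    for (int i = 0; i < sorted.length; i++) {
      // Each additional child contributes with a progressively weaker exponent.
      result *= Math.pow(sorted[i], 1.0 / (i + 1));
    }
    // Clamp to the valid selectivity range [0, 1].
    return Math.max(0.0, Math.min(1.0, result));
  }
}
```

For two children with selectivity 0.25 each, this sketch yields 0.25 * sqrt(0.25) = 0.125, whereas plain multiplication yields 0.0625. The gap widens as more children are combined, which is the effect wanted for correlated predicates such as those in Q53.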