Posted to issues@impala.apache.org by "Riza Suminto (Jira)" <ji...@apache.org> on 2023/09/19 18:47:00 UTC

[jira] [Created] (IMPALA-12454) CompoundPredicate with AND operator can result in very low selectivity.

Riza Suminto created IMPALA-12454:
-------------------------------------

             Summary: CompoundPredicate with AND operator can result in very low selectivity.
                 Key: IMPALA-12454
                 URL: https://issues.apache.org/jira/browse/IMPALA-12454
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 4.2.0
            Reporter: Riza Suminto


A CompoundPredicate with the AND operator estimates its selectivity by simply multiplying the selectivities of its child expressions:
[https://github.com/apache/impala/blob/3614a6a776819a1e918ce7fe833cd9e916d6002a/fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java#L174-L176] 
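
To make the multiplication concrete, here is a minimal sketch of that kind of estimate. It is not the Impala code at the link above: the class name, the method name and the sample selectivities are hypothetical, chosen only for illustration.

```java
import java.util.List;

class ProductSelectivitySketch {
  // Multiplies the child selectivities together, which implicitly assumes
  // that the AND-ed children are statistically independent.
  static double productSelectivity(List<Double> childSelectivities) {
    double result = 1.0;
    for (double s : childSelectivities) {
      result *= s;
    }
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical selectivities for three AND-ed children (not taken from Q53).
    List<Double> children = List.of(0.3, 0.1, 0.05);
    System.out.println(productSelectivity(children));
  }
}
```

Each additional child shrinks the estimate further, so correlated predicates (such as category, class and brand of the same item) can drive it far below the true value.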

 

This can produce a very low selectivity, and therefore a very low cardinality estimate, as happens in TPC-DS Q53:
{code:java}
|  F01:PLAN FRAGMENT [RANDOM] hosts=4 instances=4
|  Per-Instance Resources: mem-estimate=24.30MB mem-reservation=1.00MB thread-reservation=1
|  00:SCAN S3 [tpcds_3000_string_parquet_managed.item, RANDOM]
|     S3 partitions=1/1 files=4 size=33.54MB
|     predicates: ((i_category IN ('Books', 'Children', 'Electronics') AND i_class IN ('personal', 'portable', 'reference', 'self-help') AND i_brand IN ('scholaramalgamalg #14', 'scholaramalgamalg #7', 'exportiunivamalg #9', 'scholaramalgamalg #9')) OR (i_category IN ('Women', 'Music', 'Men') AND i_class IN ('accessories', 'classical', 'fragrances', 'pants') AND i_brand IN ('amalgimporto #1', 'edu packscholar #1', 'exportiimporto #1', 'importoamalg #1')))
|     stored statistics:
|       table: rows=360.00K size=33.54MB
|       columns: all
|     extrapolated-rows=disabled max-scan-range-rows=117.77K
|     mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=0
|     tuple-ids=0 row-size=74B cardinality=51
|     in pipelines: 00(GETNEXT) {code}
The CompoundPredicate in 00:SCAN is estimated to be highly selective (a very low selectivity value), reducing 360K rows to just 51, whereas in reality the scan returns 18.53K rows:
{code:java}
|  00:SCAN S3                 4      4   18.000ms   24.000ms   18.53K          51    2.31 MB       24.00 MB  tpcds_3000_string_parquet_managed.item {code}
Selectivity estimation for this CompoundPredicate case should use an exponential backoff algorithm similar to the one in PlanNode.computeCombinedSelectivity():

[https://github.com/apache/impala/blob/3614a6a776819a1e918ce7fe833cd9e916d6002a/fe/src/main/java/org/apache/impala/planner/PlanNode.java#L730-L733] 
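
As a rough illustration of exponential backoff, the sketch below assumes the common form in which selectivities are sorted ascending and the i-th one (counting from 1) is raised to the power 1/i before being multiplied in. This is an assumption about the shape of the algorithm, not a copy of the linked PlanNode code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BackoffSelectivitySketch {
  // Combines selectivities with exponential backoff:
  //   result = s1^(1/1) * s2^(1/2) * ... * sn^(1/n), with s1 <= s2 <= ... <= sn.
  // The most selective child is applied in full; each later child is damped.
  static double backoffSelectivity(List<Double> childSelectivities) {
    List<Double> sorted = new ArrayList<>(childSelectivities);
    // Sort ascending so the result does not depend on the input order.
    Collections.sort(sorted);
    double result = 1.0;
    for (int i = 0; i < sorted.size(); ++i) {
      result *= Math.pow(sorted.get(i), 1.0 / (i + 1));
    }
    // Keep the result within [0, 1].
    return Math.max(0.0, Math.min(1.0, result));
  }

  public static void main(String[] args) {
    // Hypothetical selectivities for two AND-ed children.
    System.out.println(backoffSelectivity(List.of(0.25, 0.25)));
  }
}
```

For two children with selectivity 0.25 each, plain multiplication gives 0.0625, while this backoff gives 0.25 * sqrt(0.25) = 0.125, so the estimate degrades more slowly as children are added.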


