Posted to issues-all@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2019/08/27 23:45:00 UTC

[jira] [Updated] (IMPALA-2945) Aggregation cardinality estimates do not take into account data distribution

     [ https://issues.apache.org/jira/browse/IMPALA-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong updated IMPALA-2945:
----------------------------------
    Summary: Aggregation cardinality estimates do not take into account data distribution  (was: Low memory estimation in local aggregation)

> Aggregation cardinality estimates do not take into account data distribution
> ----------------------------------------------------------------------------
>
>                 Key: IMPALA-2945
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2945
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.0, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0, Impala 2.12.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>              Labels: planner, resource-management
>
> When computing the per-host memory estimate for local aggregations, the planner does not account for the fact that data is randomly distributed across nodes, which leads to significant underestimation in some cases. The suggested fix is to use min(agg input cardinality, NDV * #hosts) as the per-node cardinality when computing the per-node memory estimate.
> Impact: In the query below, the planner significantly underestimates the per-node memory of agg node 03 at 3.28 GB, while the actual peak is 24.77 GB.
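> For illustration, here is a minimal, hypothetical sketch (not actual planner code; the class, the variable names, and the choice of Java are only for the example) that plugs the numbers from the plan and summary below into the proposed formula:
> {code}
> // Hypothetical sketch, not Impala planner code: apply the proposed cap
> // min(agg input cardinality, NDV * #hosts) to preaggregation node 03 below.
> public class AggCardinalitySketch {
>   public static void main(String[] args) {
>     long inputCardinality = 5_999_989_709L; // lineitem rows feeding node 03
>     long groupingNdv      = 200_052_064L;   // distinct l_partkey values, per the plan
>     int  numHosts         = 20;
>
>     // Because rows are randomly distributed, every host can see (almost) every
>     // key, so the local aggregation can emit up to NDV rows per host.
>     long proposedCap = Math.min(inputCardinality, groupingNdv * numHosts);
>
>     System.out.printf("estimate in today's plan:  %,d rows%n", groupingNdv);  // ~200.05M
>     System.out.printf("min(input, NDV * #hosts):  %,d rows%n", proposedCap);  // ~4.0B
>   }
> }
> {code}
> The capped value (~4.0B rows) is much closer to the 3.11B rows node 03 actually produced (see the summary below) than today's 200.05M estimate.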
> Query
> {code}
> select sum(l_extendedprice) / 7.0 as avg_yearly
> from
>   lineitem,
>   part
> where
>   p_partkey = l_partkey
>   and p_brand = 'Brand#23'
>   and p_container = 'MED BOX'
>   and l_quantity < (
>     select
>       0.2 * avg(l_quantity)
>     from
>       lineitem
>     where
>       l_partkey = p_partkey
>   )
> {code}
> Plan
> {code}
> 12:AGGREGATE [FINALIZE]
> |  output: sum:merge(l_extendedprice)
> |  hosts=20 per-host-mem=unavailable
> |  tuple-ids=6 row-size=16B cardinality=1
> |
> 11:EXCHANGE [UNPARTITIONED]
> |  hosts=20 per-host-mem=unavailable
> |  tuple-ids=6 row-size=16B cardinality=1
> |
> 06:AGGREGATE
> |  output: sum(l_extendedprice)
> |  hosts=20 per-host-mem=10.00MB
> |  tuple-ids=6 row-size=16B cardinality=1
> |
> 05:HASH JOIN [RIGHT SEMI JOIN, PARTITIONED]
> |  hash predicates: l_partkey = p_partkey
> |  other join predicates: l_quantity < 0.2 * avg(l_quantity)
> |  hosts=20 per-host-mem=125.18MB
> |  tuple-ids=0,1 row-size=80B cardinality=29992141
> |
> |--10:EXCHANGE [HASH(p_partkey)]
> |  |  hosts=20 per-host-mem=0B
> |  |  tuple-ids=0,1 row-size=80B cardinality=29992141
> |  |
> |  04:HASH JOIN [INNER JOIN, BROADCAST]
> |  |  hash predicates: l_partkey = p_partkey
> |  |  hosts=20 per-host-mem=58.30MB
> |  |  tuple-ids=0,1 row-size=80B cardinality=29992141
> |  |
> |  |--09:EXCHANGE [BROADCAST]
> |  |  |  hosts=20 per-host-mem=0B
> |  |  |  tuple-ids=1 row-size=56B cardinality=1000000
> |  |  |
> |  |  01:SCAN HDFS [tpch_1000_decimal_parquet.part, RANDOM]
> |  |     partitions=1/1 files=40 size=6.38GB
> |  |     predicates: p_brand = 'Brand#23', p_container = 'MED BOX'
> |  |     table stats: 200000000 rows total
> |  |     column stats: all
> |  |     hosts=20 per-host-mem=264.00MB
> |  |     tuple-ids=1 row-size=56B cardinality=1000000
> |  |
> |  00:SCAN HDFS [tpch_1000_decimal_parquet.lineitem, RANDOM]
> |     partitions=1/1 files=880 size=216.61GB
> |     table stats: 5999989709 rows total
> |     column stats: all
> |     hosts=20 per-host-mem=264.00MB
> |     tuple-ids=0 row-size=24B cardinality=5999989709
> |
> 08:AGGREGATE [FINALIZE]
> |  output: avg:merge(l_quantity)
> |  group by: l_partkey
> |  hosts=20 per-host-mem=167.89MB
> |  tuple-ids=4 row-size=16B cardinality=200052064
> |
> 07:EXCHANGE [HASH(l_partkey)]
> |  hosts=20 per-host-mem=0B
> |  tuple-ids=3 row-size=16B cardinality=200052064
> |
> 03:AGGREGATE
> |  output: avg(l_quantity)
> |  group by: l_partkey
> |  hosts=20 per-host-mem=3.28GB
> |  tuple-ids=3 row-size=16B cardinality=200052064
> |
> 02:SCAN HDFS [tpch_1000_decimal_parquet.lineitem, RANDOM]
>    partitions=1/1 files=880 size=216.61GB
>    table stats: 5999989709 rows total
>    column stats: all
>    hosts=20 per-host-mem=176.00MB
>    tuple-ids=2 row-size=16B cardinality=5999989709
> {code}
> Summary
> {code}
> Operator               #Hosts   Avg Time   Max Time     #Rows  Est. #Rows    Peak Mem  Est. Peak Mem  Detail
> -----------------------------------------------------------------------------------------------------------------------------
> 12:AGGREGATE                1  256.620ms  256.620ms         1           1    92.00 KB        -1.00 B  FINALIZE
> 11:EXCHANGE                 1  184.430us  184.430us        20           1           0        -1.00 B  UNPARTITIONED
> 06:AGGREGATE               20  364.045ms    1s508ms        20           1     9.37 MB       10.00 MB
> 05:HASH JOIN               20  279.175ms  304.600ms   523.09K      29.99M   155.04 MB      125.18 MB  RIGHT SEMI JOIN, PARTITIONED
> |--10:EXCHANGE             20   22.448ms   32.954ms     5.98M      29.99M           0              0  HASH(p_partkey)
> |  04:HASH JOIN            20   25s417ms   35s579ms     5.98M      29.99M   146.02 MB       58.30 MB  INNER JOIN, BROADCAST
> |  |--09:EXCHANGE          20   16.270ms   35.329ms   199.30K       1.00M           0              0  BROADCAST
> |  |  01:SCAN HDFS         20  218.505ms  331.299ms   199.30K       1.00M   173.43 MB      264.00 MB  tpch_1000_decimal_parquet.part
> |  00:SCAN HDFS            20    1s365ms    1s822ms     6.00B       6.00B     1.92 GB      264.00 MB  tpch_1000_decimal_parquet.l...
> 08:AGGREGATE               20   29s706ms   35s917ms   200.00M     200.05M     1.64 GB      167.89 MB  FINALIZE
> 07:EXCHANGE                20    5s081ms    8s410ms     3.11B     200.05M           0              0  HASH(l_partkey)
> 03:AGGREGATE               20      4m10s      5m12s     3.11B     200.05M    24.77 GB        3.28 GB
> 02:SCAN HDFS               20    1s544ms    2s517ms     6.00B       6.00B   838.85 MB      176.00 MB  tpch_1000_decimal_parquet.l...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
