Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2018/10/29 20:53:00 UTC

[jira] [Commented] (IMPALA-7749) Merge aggregation node memory estimate is incorrectly influenced by limit

    [ https://issues.apache.org/jira/browse/IMPALA-7749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667753#comment-16667753 ] 

ASF subversion and git services commented on IMPALA-7749:
---------------------------------------------------------

Commit 44e69e8182954db90b22809b5440bba59ed8d0ae in impala's branch refs/heads/master from poojanilangekar
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=44e69e8 ]

IMPALA-7749: Compute AggregationNode's memory estimate using input cardinality

Prior to this change, the AggregationNode's perInstanceCardinality
was influenced by the node's selectivity and limit. This was
incorrect because the hash table is constructed over the entire
input stream before any row batches are produced. This change
ensures that the input cardinality is used to determine the
perInstanceCardinality.

Testing:
Added a planner test which ensures that an AggregationNode with a
limit estimates memory based on the input cardinality.
Ran front-end and end-to-end tests affected by this change.

Change-Id: Ifd95d2ad5b677fca459c9c32b98f6176842161fc
Reviewed-on: http://gerrit.cloudera.org:8080/11806
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
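
The gist of the fix, as a minimal self-contained sketch (class, method, and constant names here are illustrative, not Impala's actual planner code): because the merge hash table must absorb every distinct group from the input stream before the first row batch is returned, the memory estimate has to be driven by the input cardinality, not by the limit-capped output cardinality.

{noformat}
// Illustrative only; the real estimate also accounts for NDV statistics,
// hash-table bucket overhead, and minimum buffer reservations.
public class AggregationMemEstimate {
    static final long ROW_SIZE_BYTES = 28;      // row-size from the plan below
    static final long PER_GROUP_OVERHEAD = 16;  // assumed hash-table cost per group

    // Buggy variant: derives the estimate from the node's output
    // cardinality, which a LIMIT clause caps (here: 5 rows).
    static long estimateFromOutputCardinality(long outputCardinality) {
        return outputCardinality * (ROW_SIZE_BYTES + PER_GROUP_OVERHEAD);
    }

    // Fixed variant: derives the estimate from the input cardinality, an
    // upper bound on the distinct groups the hash table must hold.
    static long estimateFromInputCardinality(long inputCardinality) {
        return inputCardinality * (ROW_SIZE_BYTES + PER_GROUP_OVERHEAD);
    }

    public static void main(String[] args) {
        long inputRows = 6_001_215;  // rows arriving at the merge aggregation
        long limit = 5;              // output cardinality after "limit 5"
        System.out.println("buggy: " + estimateFromOutputCardinality(limit) + " bytes");    // 220 bytes
        System.out.println("fixed: " + estimateFromInputCardinality(inputRows) + " bytes"); // 264,053,460 bytes
    }
}
{noformat}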


> Merge aggregation node memory estimate is incorrectly influenced by limit
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-7749
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7749
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Frontend
>    Affects Versions: Impala 2.11.0, Impala 3.0, Impala 2.12.0, Impala 3.1.0
>            Reporter: Tim Armstrong
>            Assignee: Pooja Nilangekar
>            Priority: Critical
>
> In the query below, the memory estimate for node 03 (the merge AGGREGATE) is too low. If the limit is removed, the estimate is correct.
> {noformat}
> [localhost:21000] default> set explain_level=2; explain select l_orderkey, l_partkey, l_linenumber, count(*) from tpch.lineitem group by 1, 2, 3 limit 5;
> EXPLAIN_LEVEL set to 2
> Query: explain select l_orderkey, l_partkey, l_linenumber, count(*) from tpch.lineitem group by 1, 2, 3 limit 5
> +-------------------------------------------------------------------------------------------+
> | Explain String                                                                            |
> +-------------------------------------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=43.94MB Threads=4                               |
> | Per-Host Resource Estimates: Memory=450MB                                                 |
> |                                                                                           |
> | F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                                     |
> | |  Per-Host Resources: mem-estimate=0B mem-reservation=0B thread-reservation=1            |
> | PLAN-ROOT SINK                                                                            |
> | |  mem-estimate=0B mem-reservation=0B thread-reservation=0                                |
> | |                                                                                         |
> | 04:EXCHANGE [UNPARTITIONED]                                                               |
> | |  limit: 5                                                                               |
> | |  mem-estimate=0B mem-reservation=0B thread-reservation=0                                |
> | |  tuple-ids=1 row-size=28B cardinality=5                                                 |
> | |  in pipelines: 03(GETNEXT)                                                              |
> | |                                                                                         |
> | F01:PLAN FRAGMENT [HASH(l_orderkey,l_partkey,l_linenumber)] hosts=3 instances=3           |
> | Per-Host Resources: mem-estimate=10.00MB mem-reservation=1.94MB thread-reservation=1      |
> | 03:AGGREGATE [FINALIZE]                                                                   |
> | |  output: count:merge(*)                                                                 |
> | |  group by: l_orderkey, l_partkey, l_linenumber                                          |
> | |  limit: 5                                                                               |
> | |  mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB thread-reservation=0  |
> | |  tuple-ids=1 row-size=28B cardinality=5                                                 |
> | |  in pipelines: 03(GETNEXT), 00(OPEN)                                                    |
> | |                                                                                         |
> | 02:EXCHANGE [HASH(l_orderkey,l_partkey,l_linenumber)]                                     |
> | |  mem-estimate=0B mem-reservation=0B thread-reservation=0                                |
> | |  tuple-ids=1 row-size=28B cardinality=6001215                                           |
> | |  in pipelines: 00(GETNEXT)                                                              |
> | |                                                                                         |
> | F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3                                            |
> | Per-Host Resources: mem-estimate=440.27MB mem-reservation=42.00MB thread-reservation=2    |
> | 01:AGGREGATE [STREAMING]                                                                  |
> | |  output: count(*)                                                                       |
> | |  group by: l_orderkey, l_partkey, l_linenumber                                          |
> | |  mem-estimate=176.27MB mem-reservation=34.00MB spill-buffer=2.00MB thread-reservation=0 |
> | |  tuple-ids=1 row-size=28B cardinality=6001215                                           |
> | |  in pipelines: 00(GETNEXT)                                                              |
> | |                                                                                         |
> | 00:SCAN HDFS [tpch.lineitem, RANDOM]                                                      |
> |    partitions=1/1 files=1 size=718.94MB                                                   |
> |    stored statistics:                                                                     |
> |      table: rows=6001215 size=718.94MB                                                    |
> |      columns: all                                                                         |
> |    extrapolated-rows=disabled max-scan-range-rows=1068457                                 |
> |    mem-estimate=264.00MB mem-reservation=8.00MB thread-reservation=1                      |
> |    tuple-ids=0 row-size=20B cardinality=6001215                                           |
> |    in pipelines: 00(GETNEXT)                                                              |
> +-------------------------------------------------------------------------------------------+
> {noformat}
> The bug is that we use cardinality_ to cap the number of distinct values in the hash table, but cardinality_ is itself capped at the output limit. With "limit 5" the estimate is therefore sized for 5 groups and collapses to 10.00MB, instead of reflecting the ~6M distinct groups the merge must actually hold (compare node 01's 176.27MB estimate, which the limit does not affect).
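
To make the capping order concrete, here is a hypothetical sketch (variable names are illustrative; only cardinality_ corresponds to a real planner field):

{noformat}
// Hypothetical sketch of the order-of-operations bug described above.
public class CardinalityCapBug {
    public static void main(String[] args) {
        long numDistinctGroups = 6_001_215; // distinct (l_orderkey, l_partkey, l_linenumber) groups
        long limit = 5;                     // from "limit 5"

        // The node's output cardinality is correctly capped by the limit...
        long cardinality = Math.min(numDistinctGroups, limit); // 5

        // ...but the memory estimate then reads the already-capped value (the bug):
        long buggyGroupCount = cardinality;                    // 5 groups -> tiny estimate

        // The fix sizes the hash table from the pre-limit group count:
        long fixedGroupCount = numDistinctGroups;              // ~6M groups -> realistic estimate

        System.out.println("sizing groups, buggy: " + buggyGroupCount);
        System.out.println("sizing groups, fixed: " + fixedGroupCount);
    }
}
{noformat}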


