You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Sahil Takiar (Code Review)" <ge...@cloudera.org> on 2018/10/17 20:40:41 UTC

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Sahil Takiar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11698


Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................

IMPALA-5004: Switch to sorting node for large TopN queries

Adds a new query option 'topn_bytes_limit' that places a limit on the
number of estimated bytes that a TopN operator can process. If the
Impala planner estimates that a TopN operator will process more bytes
than this limit, it will replace the TopN operator with a sort operator.

Since the TopN operator cannot spill to disk, it has to buffer everything
in memory. This can cause frequent OOM issues when running with a large
limit + offset. Switching to a sort operator allows Impala to spill to
disk. We prefer to use the TopN operator when possible as it has better
performance than the sort operator for 'order by limit [offset]' queries.

Testing:

Added new queries to functional-query/top-n.test to validate query
results when setting 'topn_bytes_limit' to a low value.

TODO: Validate the default value of 32 mb
TODO: Write a planner test that validates the switch from TopN to sort

Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/workloads/functional-query/queries/QueryTest/top-n.test
7 files changed, 58 insertions(+), 4 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/11698/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 9: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3413/


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Nov 2018 23:39:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3429/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 06 Nov 2018 16:04:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Paul Rogers, Tim Armstrong, Impala Public Jenkins, Vuk Ercegovac, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11698

to look at the new patch set (#9).

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................

IMPALA-5004: Switch to sorting node for large TopN queries

Adds a new query option 'topn_bytes_limit' that places a limit on the
number of estimated bytes that a TopN operator can process. If the
Impala planner estimates that a TopN operator will process more bytes
than this limit, it will replace the TopN operator with a sort operator.

Since the TopN operator cannot spill to disk, it has to buffer everything
in memory. This can cause frequent OOM issues when running with a large
limit + offset. Switching to a sort operator allows Impala to spill to
disk. We prefer to use the TopN operator when possible as it has better
performance than the sort operator for 'order by limit [offset]' queries.

The default limit is set to 512MB and is based on micro-benchmarking the
topn vs. sort operator for various limits (see the JIRA for full details).
The default is set to an intentionally high value in order to avoid
performance regressions.

Testing:

* Added a new planner test to fuctional-planner/ to validate that
'topn_bytes_limit' properly switches between topn and sort operators.

Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/analysis/SortInfo.java
M fe/src/main/java/org/apache/impala/planner/AggregationNode.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/SelectNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M fe/src/main/java/org/apache/impala/planner/SubplanNode.java
M fe/src/main/java/org/apache/impala/planner/UnionNode.java
M fe/src/main/java/org/apache/impala/planner/UnnestNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/topn-bytes-limit-small.test
A testdata/workloads/functional-planner/queries/PlannerTest/topn-bytes-limit.test
M testdata/workloads/functional-query/queries/QueryTest/spilling-no-debug-action.test
25 files changed, 205 insertions(+), 38 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/11698/9
-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3373/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Oct 2018 16:06:59 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 10: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 07 Nov 2018 01:17:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1080/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Oct 2018 21:23:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 5:

(3 comments)

Looks good overall

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/analysis/SortInfo.java@244
PS5, Line 244:         getSortTupleDescriptor().getAvgSerializedSize() * (cardinality + offset));
TODO: can cardinality + offset overflow?


http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@324
PS5, Line 324:         // SortNode#computeStats
Can we factor out the formula into a static helper method in SortNode?


http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@337
PS5, Line 337: limit + offset
Here and elsewhere: I think limit + offset could overflow. We don't have anything that limits the max values of these, e.g.  the following is planned as normal.

select * from tpch.lineitem order by l_orderkey limit 9223372036854775807 offset 9223372036854775807

We have a checkedAdd() helper for cases like this, but we probably should add a planner test to make sure that extreme cases like this are tested.



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Oct 2018 16:48:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Paul Rogers, Tim Armstrong, Impala Public Jenkins, Vuk Ercegovac, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11698

to look at the new patch set (#6).

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................

IMPALA-5004: Switch to sorting node for large TopN queries

Adds a new query option 'topn_bytes_limit' that places a limit on the
number of estimated bytes that a TopN operator can process. If the
Impala planner estimates that a TopN operator will process more bytes
than this limit, it will replace the TopN operator with a sort operator.

Since the TopN operator cannot spill to disk, it has to buffer everything
in memory. This can cause frequent OOM issues when running with a large
limit + offset. Switching to a sort operator allows Impala to spill to
disk. We prefer to use the TopN operator when possible as it has better
performance than the sort operator for 'order by limit [offset]' queries.

The default limit is set to 512MB and is based on micro-benchmarking the
topn vs. sort operator for various limits (see the JIRA for full details).
The default is set to an intentionally high value in order to avoid
performance regressions.

If the planner thinks the TopN operator will end up processing the
entire input dataset, it falls back to a Sort operator. Since TopN will
just end up sorting the entire dataset, there is no point in using it
vs. the Sort operator (which has the advantage that it can spill to
disk).

Testing:

* Added a new planner test to fuctional-planner/ to validate that
'topn_bytes_limit' properly switches between topn and sort operators.
* Added a new planner test to functional-planner/ to validate that
running an 'order by limit x' where x is the size of the input, triggers
a sort instead of a TopN (even if the 'topn_bytes_limit' threshold has
not been exceeded)

Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/analysis/SortInfo.java
M fe/src/main/java/org/apache/impala/planner/AggregationNode.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/SelectNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M fe/src/main/java/org/apache/impala/planner/SubplanNode.java
M fe/src/main/java/org/apache/impala/planner/UnionNode.java
M fe/src/main/java/org/apache/impala/planner/UnnestNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/topn-bytes-limit-small.test
A testdata/workloads/functional-planner/queries/PlannerTest/topn-bytes-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/topn.test
25 files changed, 285 insertions(+), 38 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/11698/6
-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1088/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Oct 2018 15:19:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1172/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Fri, 26 Oct 2018 14:59:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1148/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Oct 2018 02:45:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 3:

(4 comments)

Made the following changes:
* Changed default value for topn_bytes_limit from 96MB to 512MB; this makes the conversion much more conservative, which should help avoid any performance regressions
* Added a new trigger for TopN --> Sort; if the input to TopN is >= (offset + limit) switch the Sort; the idea is that if TopN is going to sort the entire input dataset, then we should just use Sort instead
* Re-factored some of the code in SingleNodePlanner so that creation of the SortNode is done in a dedicated method
* Addressed Gerrit comments

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@299
PS3, Line 299: // Add a Sort Node instead of a TopN Node if the estimated size of the materialized
             :       // rows for TopN exceeds the value of TOPN_BYTES_LIMIT
             :       long estimatedTopNMaterializedSize = 0;
             :       f
> I'd prefer to put this block into SortInfo, called say, estimateMaterialize
Done


http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@306
PS3, Line 306:           estimatedTopNMaterializedSize += sortSlotDesc.getByteSize();
> how's this compare with the row size estimate in SortNode.computeNodeResour
They look similar, so I decided to combine them. In fact, the code in SortNode.computeNodeResourceProfile is probably more accurate because it is based on column stats. So I moved the logic into SortInfo so it can be shared between the SingleNodePlanner and SortNode.


http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@310
PS3, Line 310: *=
> I doubt we'll run into it in practice, but prefer to use PlanNode.checkedMu
Now that we are using the code in SortNode.computeNoteResourceProfile, there is no long * long multiplication, instead its a float * long multiplication. So there shouldn't an overflow issue.


http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@313
PS3, Line 313:           estimatedTopNMaterializedSize < ctx_.getQueryOptions().topn_bytes_limit;
> @Paul, thanks for taking a look! I agree with a lot of your concerns. I've 
After discussing this a bit more with Tim, it seems that we generally avoid taking into account memory budget in the planner. So instead we decided to increase the default value to a much more conservative value to avoid any query regressions.

We discussed the option of making the parameter a row limit rather than a memory limit. A row limit may be easier to tune for a user, but would make the memory usage of TopN more variable depending on the query. I'm open to either option, but think a memory limit would end up working better.



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Oct 2018 17:57:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11698

to look at the new patch set (#3).

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................

IMPALA-5004: Switch to sorting node for large TopN queries

Adds a new query option 'topn_bytes_limit' that places a limit on the
number of estimated bytes that a TopN operator can process. If the
Impala planner estimates that a TopN operator will process more bytes
than this limit, it will replace the TopN operator with a sort operator.

Since the TopN operator cannot spill to disk, it has to buffer everything
in memory. This can cause frequent OOM issues when running with a large
limit + offset. Switching to a sort operator allows Impala to spill to
disk. We prefer to use the TopN operator when possible as it has better
performance than the sort operator for 'order by limit [offset]' queries.

The default limit is set to 96 mb and is based on micro-benchmarking the
topn vs. sort operator for various limits (see the JIRA for full details).

Testing:

Added a new planner test to fuctional-planner/ to validate that
'topn_bytes_limit' properly switches between topn and sort operators.

Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/topn-limit-bytes.test
8 files changed, 121 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/11698/3
-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11698/2/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/11698/2/be/src/service/query-options.cc@720
PS2, Line 720:                 StringParser::StringToInt<int64_t>(value.c_str(), value.length(), &result);
line too long (91 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Oct 2018 23:33:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1132/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Oct 2018 18:04:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................

IMPALA-5004: Switch to sorting node for large TopN queries

Adds a new query option 'topn_bytes_limit' that places a limit on the
number of estimated bytes that a TopN operator can process. If the
Impala planner estimates that a TopN operator will process more bytes
than this limit, it will replace the TopN operator with a sort operator.

Since the TopN operator cannot spill to disk, it has to buffer everything
in memory. This can cause frequent OOM issues when running with a large
limit + offset. Switching to a sort operator allows Impala to spill to
disk. We prefer to use the TopN operator when possible as it has better
performance than the sort operator for 'order by limit [offset]' queries.

The default limit is set to 96 mb and is based on micro-benchmarking the
topn vs. sort operator for various limits (see the JIRA for full details).

Testing:

Added a new planner test to fuctional-planner/ to validate that
'topn_bytes_limit' properly switches between topn and sort operators.

Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/topn-limit-bytes.test
8 files changed, 121 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/11698/2
-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 6: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@337
PS5, Line 337: tNode(ctx_.get
> Do we have an equivalent to checkedAdd() for C++? I added the calls to chec
Sounds like we just should be failing queries where offset + limit is > THRESHOLD since they're not executable - I don't think we can correctly execute the queries since we need to store the limit + offset in an int64_t somewhere.

Maybe file a follow-on JIRA to fix this? Since it's an existing bug unrelated to the change.

You could also file a JIRA and fix it in this commit if you prefer.



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Fri, 26 Oct 2018 19:08:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/analysis/SortInfo.java@244
PS5, Line 244:         getSortTupleDescriptor().getAvgSerializedSize() * (cardinality + offset));
> TODO: can cardinality + offset overflow?
Done


http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@324
PS5, Line 324:         // SortNode#computeStats
> Can we factor out the formula into a static helper method in SortNode?
Done


http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@337
PS5, Line 337: limit + offset
> Here and elsewhere: I think limit + offset could overflow. We don't have an
Do we have an equivalent to checkedAdd() for C++? I added the calls to checkedAdd() but the queries still fail in the be/. Specifically, exchange-node.h:226 (the call to CopyRows fails on the DCHECK_LE check).

I've added the planner tests for this, and they work (since they don't actually run the query, just generate the explain plan).


http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@214
PS5, Line 214:   public void testTopNBytesLimitDisabled() {
> perhaps merge this with the test on L195? settings are the same so unclear 
deleted this test and merged it with testTopN



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Fri, 26 Oct 2018 14:26:39 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 10:

Tests are passing now. So this patch should be good to merge.

The first set of failures in https://jenkins.impala.io/job/gerrit-verify-dryrun/3373/ was due to a change that I have now reverted. The change attempted to dynamically switch from TopN to Sort if the input cardinality of the TopN operator exceeded the value of the limit, after thinking through this change some more and doing some more profiling I realized it does not make sense to implement.

The second set of failures in https://jenkins.impala.io/job/gerrit-verify-dryrun/3413/ seems to be due to flaky test failures.


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 06 Nov 2018 20:23:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@299
PS3, Line 299: // Add a Sort Node instead of a TopN Node if the estimated size of the materialized
             :       // rows for TopN exceeds the value of TOPN_BYTES_LIMIT
             :       long estimatedTopNMaterializedSize = 0;
             :       f
I'd prefer to put this block into SortInfo, called say, estimateMaterializedRecordSize. I see that SortInfo is up for debate, but at least we'll keep such sort-related code in one place.


http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@306
PS3, Line 306:           estimatedTopNMaterializedSize += sortSlotDesc.getByteSize();
how's this compare with the row size estimate in SortNode.computeNodeResourceProfile?


http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@310
PS3, Line 310: *=
I doubt we'll run into it in practice, but prefer to use PlanNode.checkedMultiply(..) here to deal with estimate overflows in a uniform way.



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Oct 2018 01:24:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1265/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Nov 2018 20:06:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Paul Rogers, Impala Public Jenkins, Vuk Ercegovac, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11698

to look at the new patch set (#5).

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................

IMPALA-5004: Switch to sorting node for large TopN queries

Adds a new query option 'topn_bytes_limit' that places a limit on the
number of estimated bytes that a TopN operator can process. If the
Impala planner estimates that a TopN operator will process more bytes
than this limit, it will replace the TopN operator with a sort operator.

Since the TopN operator cannot spill to disk, it has to buffer everything
in memory. This can cause frequent OOM issues when running with a large
limit + offset. Switching to a sort operator allows Impala to spill to
disk. We prefer to use the TopN operator when possible as it has better
performance than the sort operator for 'order by limit [offset]' queries.

The default limit is set to 512MB and is based on micro-benchmarking the
topn vs. sort operator for various limits (see the JIRA for full details).
The default is set to an intentionally high value in order to avoid
performance regressions.

If the planner thinks the TopN operator will end up processing the
entire input dataset, it falls back to a Sort operator. Since TopN will
just end up sorting the entire dataset, there is no point in using it
vs. the Sort operator (which has the advantage that it can spill to
disk).

Testing:

* Added a new planner test to fuctional-planner/ to validate that
'topn_bytes_limit' properly switches between topn and sort operators.
* Added a new planner test to functional-planner/ to validate that
running an 'order by limit x' where x is the size of the input, triggers
a sort instead of a TopN (even if the 'topn_bytes_limit' threshold has
not been exceeded)

Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/analysis/SortInfo.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/topn-bytes-limit-disabled.test
A testdata/workloads/functional-planner/queries/PlannerTest/topn-bytes-limit-small.test
A testdata/workloads/functional-planner/queries/PlannerTest/topn-bytes-limit.test
12 files changed, 250 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/11698/5
-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@337
PS5, Line 337: limit + offset
> Sounds like we just should be failing queries where offset + limit is > THR
Filed IMPALA-7777 as a follow up and removed the test changes for handling overflows.



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Oct 2018 14:58:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 8: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3373/


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Oct 2018 20:01:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 7: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@313
PS3, Line 313:       long limit, long offset, boolean hasLimit, boolean disableTopN)
> After discussing this a bit more with Tim, it seems that we generally avoid
Sahil covered this, but just wanted to confirm that there was a deliberate decision not to pursue, in general, making planner decisions to fit a query into a memory limit. We do give up some opportunities to make more optimal decisions but it's definitely simpler for now.

I don't recall seeing a case in practice where Top-N memory consumption caused a query failure but it seems like something that will happen eventually, and this change helps avoid it or at least gives an escape hatch that doesn't require rewriting the query.



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Oct 2018 16:02:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Paul Rogers, Impala Public Jenkins, Vuk Ercegovac, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11698

to look at the new patch set (#4).

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................

IMPALA-5004: Switch to sorting node for large TopN queries

Adds a new query option 'topn_bytes_limit' that places a limit on the
number of estimated bytes that a TopN operator can process. If the
Impala planner estimates that a TopN operator will process more bytes
than this limit, it will replace the TopN operator with a sort operator.

Since the TopN operator cannot spill to disk, it has to buffer everything
in memory. This can cause frequent OOM issues when running with a large
limit + offset. Switching to a sort operator allows Impala to spill to
disk. We prefer to use the TopN operator when possible as it has better
performance than the sort operator for 'order by limit [offset]' queries.

The default limit is set to 512MB and is based on micro-benchmarking the
topn vs. sort operator for various limits (see the JIRA for full details).
The default is set to an intentionally high value in order to avoid
performance regressions.

If the planner thinks the TopN operator will end up processing the
entire input dataset, it falls back to a Sort operator. Since TopN will
just end up sorting the entire dataset, there is no point in using it
vs. the Sort operator (which has the advantage that it can spill to
disk).

Testing:

* Added a new planner test to fuctional-planner/ to validate that
'topn_bytes_limit' properly switches between topn and sort operators.
* Added a new planner test to functional-planner/ to validate that
running an 'order by limit x' where x is the size of the input, triggers
a sort instead of a TopN (even if the 'topn_bytes_limit' threshold has
not been exceeded)

Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/analysis/SortInfo.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/topn-default-limit-bytes.test
A testdata/workloads/functional-planner/queries/PlannerTest/topn-limit-bytes.test
11 files changed, 219 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/11698/4
-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 5: Code-Review+1

(2 comments)

one question about tests, but otherwise, lgtm. will let others take another pass.

http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@334
PS4, Line 334:  then it is likely we are sor
> Yeah, but that doesn't seem ideal. So I the most recent patch allows settin
makes sense, that seems easier.


http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/11698/5/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@214
PS5, Line 214:   public void testTopNBytesLimitDisabled() {
perhaps merge this with the test on L195? settings are the same so unclear why these are separated.



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Oct 2018 04:20:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................

IMPALA-5004: Switch to sorting node for large TopN queries

Adds a new query option 'topn_bytes_limit' that places a limit on the
number of estimated bytes that a TopN operator can process. If the
Impala planner estimates that a TopN operator will process more bytes
than this limit, it will replace the TopN operator with a sort operator.

Since the TopN operator cannot spill to disk, it has to buffer everything
in memory. This can cause frequent OOM issues when running with a large
limit + offset. Switching to a sort operator allows Impala to spill to
disk. We prefer to use the TopN operator when possible as it has better
performance than the sort operator for 'order by limit [offset]' queries.

The default limit is set to 512MB and is based on micro-benchmarking the
topn vs. sort operator for various limits (see the JIRA for full details).
The default is set to an intentionally high value in order to avoid
performance regressions.

Testing:

* Added a new planner test to fuctional-planner/ to validate that
'topn_bytes_limit' properly switches between topn and sort operators.

Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Reviewed-on: http://gerrit.cloudera.org:8080/11698
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/analysis/SortInfo.java
M fe/src/main/java/org/apache/impala/planner/AggregationNode.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/SelectNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M fe/src/main/java/org/apache/impala/planner/SubplanNode.java
M fe/src/main/java/org/apache/impala/planner/UnionNode.java
M fe/src/main/java/org/apache/impala/planner/UnnestNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/topn-bytes-limit-small.test
A testdata/workloads/functional-planner/queries/PlannerTest/topn-bytes-limit.test
M testdata/workloads/functional-query/queries/QueryTest/spilling-no-debug-action.test
25 files changed, 205 insertions(+), 38 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Tim Armstrong: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 11
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11698/1/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/11698/1/be/src/service/query-options.cc@720
PS1, Line 720:                 StringParser::StringToInt<int64_t>(value.c_str(), value.length(), &result);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/11698/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@313
PS1, Line 313:       if (useTopN && estimatedTopNMaterializedSize < ctx_.getQueryOptions().topn_bytes_limit) {
line too long (95 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Oct 2018 22:35:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 5:

(8 comments)

Addressed comments and made a few more changes:
* Changed query-options.cc to use ParseMemValue to parse the value of topn_bytes_limit since it is expressed in bytes
* Re-organized the tests slightly into three distinct tests: (1) test with the default value of topn_bytes_limit, (2) test with a small value of topn_bytes_limit, and (3) test with topn_bytes_limit = 0 which disables the changes
* A value of 0 or -1 for topn_bytes_limit now disables the changes

http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java@236
PS4, Line 236: th
> the
Done


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java@238
PS4, Line 238:    * average tuple serialized size</code>. 'cardinality' is the cardinality of the TopN
> offset is not mentioned so either include it or remove it since the arg des
The "estimated # of rows in memory" refers to the sum of the cardinality and offset. I don't explicitly mention the offset, because one can infer from the method body that "estimated # of rows in memory" = "cardinality + offset".

If you still think the formula is redundant, I can remove it.


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java@240
PS4, Line 240: 
> Even though this style of commenting is widely used, Impala tries not to us
Done


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@325
PS4, Line 325: rdinality =
> as pointed out in the header for SortInfo, the abstractions are not quite r
Moved the tuple handling into estimateTopNMaterializedSize. I couldn't make it static since it needs to reference the tuple descriptors of the current SortInfo.

estimateTopNMaterializedSize is called by the SingleNodePlanner and the SortNode now.


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@326
PS4, Line 326: th.min(root.cardinalit
> this looks correct and consistent with the avgRowSize_ computation for Plan
Done


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@334
PS4, Line 334:  then it is likely we are sor
> if we get this wrong, perhaps due to a mistaken many-to-many join cardinali
Yeah, but that doesn't seem ideal. So I the most recent patch allows settings topn_bytes_limit to 0 or -1 which disables the decision entirely.


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SortNode.java@260
PS4, Line 260:       nodeResourceProfile_ = ResourceProfile.noReservation(
> where'd the return go? pls look into why tests did not get bothered by this
I didn't run the full suite of tests, only the topn functional-query tests, which don't validate memory estimates by default.


http://gerrit.cloudera.org:8080/#/c/11698/4/testdata/workloads/functional-planner/queries/PlannerTest/topn-default-limit-bytes.test
File testdata/workloads/functional-planner/queries/PlannerTest/topn-default-limit-bytes.test:

http://gerrit.cloudera.org:8080/#/c/11698/4/testdata/workloads/functional-planner/queries/PlannerTest/topn-default-limit-bytes.test@1
PS4, Line 1: 
> seems to apply to the second test, not the first, so got confused by this.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Oct 2018 02:13:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 8: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Oct 2018 16:06:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Paul Rogers, Tim Armstrong, Impala Public Jenkins, Vuk Ercegovac, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11698

to look at the new patch set (#7).

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................

IMPALA-5004: Switch to sorting node for large TopN queries

Adds a new query option 'topn_bytes_limit' that places a limit on the
number of estimated bytes that a TopN operator can process. If the
Impala planner estimates that a TopN operator will process more bytes
than this limit, it will replace the TopN operator with a sort operator.

Since the TopN operator cannot spill to disk, it has to buffer everything
in memory. This can cause frequent OOM issues when running with a large
limit + offset. Switching to a sort operator allows Impala to spill to
disk. We prefer to use the TopN operator when possible as it has better
performance than the sort operator for 'order by limit [offset]' queries.

The default limit is set to 512MB and is based on micro-benchmarking the
topn vs. sort operator for various limits (see the JIRA for full details).
The default is set to an intentionally high value in order to avoid
performance regressions.

If the planner thinks the TopN operator will end up processing the
entire input dataset, it falls back to a Sort operator. Since TopN will
just end up sorting the entire dataset, there is no point in using it
vs. the Sort operator (which has the advantage that it can spill to
disk).

Testing:

* Added a new planner test to fuctional-planner/ to validate that
'topn_bytes_limit' properly switches between topn and sort operators.
* Added a new planner test to functional-planner/ to validate that
running an 'order by limit x' where x is the size of the input, triggers
a sort instead of a TopN (even if the 'topn_bytes_limit' threshold has
not been exceeded)

Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/analysis/SortInfo.java
M fe/src/main/java/org/apache/impala/planner/AggregationNode.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/SelectNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M fe/src/main/java/org/apache/impala/planner/SubplanNode.java
M fe/src/main/java/org/apache/impala/planner/UnionNode.java
M fe/src/main/java/org/apache/impala/planner/UnnestNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/topn-bytes-limit-small.test
A testdata/workloads/functional-planner/queries/PlannerTest/topn-bytes-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/topn.test
25 files changed, 260 insertions(+), 38 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/11698/7
-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 10: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 06 Nov 2018 20:02:03 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@313
PS3, Line 313:           estimatedTopNMaterializedSize < ctx_.getQueryOptions().topn_bytes_limit;
> This code makes the sort/top-n decision based on memory. The number of byte
@Paul, thanks for taking a look! I agree with a lot of your concerns. I've raised a few of my own on the JIRA: IMPALA-5004

I agree with your comparison of TopN vs. Sort. In general, the tradeoffs between the two operators don't seem trivial. TopN processes fewer bytes, but cannot spill. While Sort processes significantly more data, but can spill. Furthermore, depending on the size of the limit, sometimes TopN is faster than Sort and sometimes Sort is faster than TopN.

I agree that if we do decide to add this parameter, it would be best if we could dynamically set it based on memory requirements rather than depending on users to do it.



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Oct 2018 22:18:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1079/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Oct 2018 21:14:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1194/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Oct 2018 15:31:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3413/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Nov 2018 19:26:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Paul Rogers (Code Review)" <ge...@cloudera.org>.
Paul Rogers has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 3:

(1 comment)

This will be a great change. See comment about using a memory-based decision rather than, say, a row-based decision or a decision informed by the memory available to the query.

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@313
PS3, Line 313:           estimatedTopNMaterializedSize < ctx_.getQueryOptions().topn_bytes_limit;
This code makes the sort/top-n decision based on memory. The number of bytes is a parameter. This seems awkward: the memory used must be part of the overall query budget. A query could, in odd cases, have multiple TopN operators, but the code here treats them one by one.

Further, a TopN will always use fewer bytes than a Sort: a TopN needs to keep only n rows in general, but sort must buffer all rows. (Though, of course, sort can spill to relieve memory pressure.)

I wonder, does it make sense to impose the limit as a row limit rather than a memory limit?

Or, does it make sense to set the limit as part of a query memory planning exercise rather than as another free parameter that the user must juggle when thinking about memory budgets?



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Oct 2018 21:15:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5004: Switch to sorting node for large TopN queries

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 4:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java@236
PS4, Line 236: is
the


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java@238
PS4, Line 238:    * average tuple serialized size</code>.
offset is not mentioned so either include it or remove it since the arg descriptions and the one-liner definition make things pretty clear already.


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java@240
PS4, Line 240: @param
Even though this style of commenting is widely used, Impala tries not to use it (yes, there are exceptions in the code base, but its on the rare side). If a file already uses them, then pls stick with the style of that file, but in this case, lets stick with the default style so remove the '@param'.


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@325
PS4, Line 325: estimateTopNMaterializedSize
as pointed out in the header for SortInfo, the abstractions are not quite right. however, this call-site looks fairly complicated. compueMemLayout does not redo if its already been called, so perhaps simplify this to let SortInfo deal with its descriptors and calling computeMemLayout? You can have a static method do what estimateTopNMaterializedSize does now so that it can be used in SortInfo as well as from the SortNode.


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@326
PS4, Line 326: getAvgSerializedSize()
this looks correct and consistent with the avgRowSize_ computation for PlanNodes.


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@334
PS4, Line 334: estimatedTopNMaterializedSize
if we get this wrong, perhaps due to a mistaken many-to-many join cardinality estimate in some descendant child, what is the easiest way to turn this decision off? set topn_bytes_limit to max long?


http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/11698/4/fe/src/main/java/org/apache/impala/planner/SortNode.java@260
PS4, Line 260:       nodeResourceProfile_ = ResourceProfile.noReservation(
where'd the return go? pls look into why tests did not get bothered by this... the profile could have been overwritten down below, in which case we'd get possibly different estimates.


http://gerrit.cloudera.org:8080/#/c/11698/4/testdata/workloads/functional-planner/queries/PlannerTest/topn-default-limit-bytes.test
File testdata/workloads/functional-planner/queries/PlannerTest/topn-default-limit-bytes.test:

http://gerrit.cloudera.org:8080/#/c/11698/4/testdata/workloads/functional-planner/queries/PlannerTest/topn-default-limit-bytes.test@1
PS4, Line 1: triggers sort
seems to apply to the second test, not the first, so got confused by this.



-- 
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 24 Oct 2018 17:19:41 +0000
Gerrit-HasComments: Yes