You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org> on 2017/06/22 21:45:19 UTC

[Impala-ASF-CR] DRAFT - IMPALA-5498: Support for partial sorts

Thomas Tauber-Marshall has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7267

Change subject: DRAFT - IMPALA-5498: Support for partial sorts
......................................................................

DRAFT - IMPALA-5498: Support for partial sorts

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to its memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

As a first use case, partial sort is used where a total sort was
used previously for inserts into Kudu.

This patch is a work in progress, and needs to be polished and tested.

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
9 files changed, 328 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
......................................................................


IMPALA-5498: Support for partial sorts in Kudu INSERTs

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to the query memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode. (IMPALA-5669)

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

This also adds a new counter 'RunSize' to the runtime profile which
tracks the min, max, and avg size of the generated runs, in tuples.

As a first use case, partial sort is used where a total sort was
used previously for inserts/upserts into Kudu tables only. Future
work can extend this to other table sinks. (IMPALA-5649)

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Reviewed-on: http://gerrit.cloudera.org:8080/7267
Reviewed-by: Thomas Tauber-Marshall <tm...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M be/src/util/runtime-profile-counters.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
16 files changed, 459 insertions(+), 70 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Thomas Tauber-Marshall: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] DRAFT - IMPALA-5498: Support for partial sorts

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has uploaded a new patch set (#2).

Change subject: DRAFT - IMPALA-5498: Support for partial sorts
......................................................................

DRAFT - IMPALA-5498: Support for partial sorts

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to its memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

As a first use case, partial sort is used where a total sort was
used previously for inserts into Kudu.

This patch is a work in progress, and needs to be polished and tested.

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
13 files changed, 388 insertions(+), 41 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7267/1/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 104:       sorter_->Reset();
> To be clear - what you're saying is that Open() will want to allocate memor
I wouldn't expect incorrect results, but it could result in queries failing with OOM when they shouldn't.


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

Line 1384:   int min_buffers_required = MIN_BUFFERS_PER_MERGE;
Need to update this calculation for partial sorts.


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

Line 96:   Sorter(const TupleRowComparator& compare_less_than,
Wasn't the plan to apply the partition sort memory limit to the sorter itself?


http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

PS3, Line 349:  highest/lowest
Should this be "first" since the order is specified elsewhere.

Maybe:
"Return the first N elements in sorted order"?


http://gerrit.cloudera.org:8080/#/c/7267/3/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

Line 58:   private final long PARTIAL_SORT_MEM_LIMIT = BackendConfig.INSTANCE.getReadSize() * 100;
It might be clearer to specify this in bytes, then just make sure that its >= 2 * read_size.

Although we really don't expect people to change read_size.


Line 286:       resourceProfile_ = new ResourceProfile(
This looks good to me.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 4:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 32:     sorter_(NULL),
> This doesn't need to handle offset?
Done


Line 127: 
> Since this only requires a partial sort, you can exit early when the cumula
As noted elsewhere, removing limit-related code.


Line 140:     *eos = input_eos_;
> I guess it is not clear to me why allowing a limit without an offset is use
Right, so 'offset' never really makes sense with partial-sort-node, since as you say there's not a total ordering and you're not guaranteed that one instance of partial-sort-node with an offset can pick up where another instance left off with a limit.

A limit is potentially useful just for cutting off the number of rows returned in a single partial-sort-node, but with the initial use case for this there's no way to end up with a partial-sort-node with a limit, so in the interest of keeping things simple and not having untested code paths, I think I'll just remove the limit code.


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.h
File be/src/exec/partial-sort-node.h:

Line 58:   virtual Status QueryMaintenance(RuntimeState* state);
> Just for general niceness, can you also update be/src/exec/sort-node.h's pr
Done


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

Line 1384:       expr_mem_pool, &sort_tuple_expr_evals_));
> Need to update this calculation for partial sorts.
Done


Line 1434:   if (sorted_runs_.size() == 1) {
> Unnecessary - this is unconditionally dereferenced
Done


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

Line 96:   Sorter(const TupleRowComparator& compare_less_than,
> Wasn't the plan to apply the partition sort memory limit to the sorter itse
This will be included in a follow up patch.


http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

PS3, Line 349: e {
> Should this be "first" since the order is specified elsewhere.
My problem with that wording is that it could be interpreted as "take the first N elements of input and then sort them"

Maybe: "Return the first N sorted elements"?


Line 360: struct TSortNode {
> Please ignore my earlier comments about offset handling (although a comment
Added to partial-sort-node.h


http://gerrit.cloudera.org:8080/#/c/7267/3/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

Line 58:   private final long PARTIAL_SORT_MEM_LIMIT = 128 * 1024 * 1024;
> It might be clearer to specify this in bytes, then just make sure that its 
Done


Line 286:       // The memory limit cannot be less than the size of the required blocks.
> This looks good to me.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Mostafa Mokhtar (Code Review)" <ge...@cloudera.org>.
Mostafa Mokhtar has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 4:

Can you extend the Sort metrics in the query profile to include Avg, Min and Max run size?
This will be needed to reflect on the quality of the sorted data?

-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
......................................................................


Patch Set 8: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7267

to look at the new patch set (#7).

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
......................................................................

IMPALA-5498: Support for partial sorts in Kudu INSERTs

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to the query memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode. (IMPALA-5669)

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

This also adds a new counter 'RunSize' to the runtime profile which
tracks the min, max, and avg size of the generated runs, in tuples.

As a first use case, partial sort is used where a total sort was
used previously for inserts/upserts into Kudu tables only. Future
work can extend this to other table sinks. (IMPALA-5649)

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M be/src/util/runtime-profile-counters.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
16 files changed, 459 insertions(+), 70 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] DRAFT - IMPALA-5498: Support for partial sorts

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: DRAFT - IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 1:

(2 comments)

The overall approach makes sense to me.

Happy to talk about the computeResourceProfile() stuff in the planner if that would be helpful.

http://gerrit.cloudera.org:8080/#/c/7267/1/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 104:       RETURN_IF_ERROR(sorter_->Open());
I think you'll want to wait until the next call into GetNext() to re-open the sorter. At this point row_batch may still be holding onto memory that Open() will want to use.


Line 115:   do {
I think it would make sense to open the sorter here (if it's not already open).


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] DRAFT - IMPALA-5498: Support for partial sorts

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: DRAFT - IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7267/2/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 348:   PARTIAL
these could use comments, especially since "partial sort" often means what top-n does (i.e. sort the top or bottom-most elements), but here we mean something different.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
......................................................................


Patch Set 7: Code-Review+2

You probably need to rebase

-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] DRAFT - IMPALA-5498: Support for partial sorts

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: DRAFT - IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 2:

(2 comments)

did a quick pass again, I think it looks pretty good. Anything major needed before considering this a non-draft review? I suppose just testing?

http://gerrit.cloudera.org:8080/#/c/7267/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

PS2, Line 296:         root = SortNode.createTopNSortNode(ctx_.getNextNodeId(), root, stmt.getSortInfo(), stmt.getOffset());
long line


http://gerrit.cloudera.org:8080/#/c/7267/2/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

PS2, Line 93:   
Maybe make this one a static factory for consistency. I like them- very readable


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has uploaded a new patch set (#4).

Change subject: IMPALA-5498: Support for partial sorts
......................................................................

IMPALA-5498: Support for partial sorts

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to its memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode.

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

As a first use case, partial sort is used where a total sort was
used previously for inserts into Kudu.

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.cc
M be/src/exec/sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
16 files changed, 458 insertions(+), 72 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 6: Code-Review+1

(1 comment)

LGTM aside from the variable name

http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS6, Line 103: no_spill
A positive argument would be marginally easier to follow. E.g. 'enable_spilling'


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has uploaded a new patch set (#5).

Change subject: IMPALA-5498: Support for partial sorts
......................................................................

IMPALA-5498: Support for partial sorts

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to its memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode.

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

As a first use case, partial sort is used where a total sort was
used previously for inserts into Kudu.

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.cc
M be/src/exec/sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
16 files changed, 470 insertions(+), 76 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has uploaded a new patch set (#3).

Change subject: IMPALA-5498: Support for partial sorts
......................................................................

IMPALA-5498: Support for partial sorts

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to its memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode.

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

As a first use case, partial sort is used where a total sort was
used previously for inserts into Kudu.

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
14 files changed, 431 insertions(+), 50 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] DRAFT - IMPALA-5498: Support for partial sorts

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: DRAFT - IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 1:

(4 comments)

I think the interface changes make sense. I'm not digging into the code details much yet.

http://gerrit.cloudera.org:8080/#/c/7267/1/be/src/exec/partial-sort-node.h
File be/src/exec/partial-sort-node.h:

PS1, Line 30: a single tuple
can you clarify, this is a bit unclear


PS1, Line 34: If a merge phase was performed in the sort, sorted rows are deep copied into
            : /// the output batch.
I thought there wouldn't need to be a merge


http://gerrit.cloudera.org:8080/#/c/7267/1/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

PS1, Line 353:   2: required bool use_top_n;
this will move to TSortType, right?


PS1, Line 355: 3: optional i64 offset
I don't think we'll need to implement the behavior to support this for partial sorts, and if we don't we should be clear about what this applies to: top N only I believe


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 111:     *eos = true;
Do we need to update num_rows_returned_ here?


Line 148: 
We can't have a partial sort in a subplan, right? It's probably best to just put a DCHECK(false) here since there's no way to test the code.


PS4, Line 149: :Res
Let's use nullptr consistently in the new code.


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

Line 96:   Sorter(const TupleRowComparator& compare_less_than,
> This will be included in a follow up patch.
WFM. If you want to wait for https://gerrit.cloudera.org/#/c/5801/ to land there will be a way to directly set a maximum on bytes of buffers in the planner, which would accomplish the same thing.


http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS4, Line 99: RuntimeProfile* profile, Run
I feel like this is breaking some of the encapsulation of the sort implementation. This detail leaks out into the planner anyway so I guess it's not really encapsulated.

How about we just document more clearly what value is expected. The class comment talks about the number of blocks per run, so it could be extended to talk about the minimum requirement for spilling and non-spilling sorts.


http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

PS3, Line 349:  highest/lowest
> My problem with that wording is that it could be interpreted as "take the f
WFM


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 6:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/7267/6//COMMIT_MSG
Commit Message:

Line 7: IMPALA-5498: Support for partial sorts
lets specify in Kudu for now


PS6, Line 28: A
nit: This also adds...


Line 32: used previously for inserts into Kudu.
"for inserts into Kudu"->"for DML statements into Kudu tables only. Future work can extend this to other table sinks IMPALA-????"


Line 42:   Now: no spills, sort takes 6m19s, for ~18% speedup
comparison vs no partition/sort at all?


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/exec/partial-sort-node.h
File be/src/exec/partial-sort-node.h:

PS6, Line 62: sort_exec_exprs_
update


PS6, Line 62: derived based
awkward/confusing


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

PS6, Line 1383: RunSize
can you indicate the units, e.g. #rows -> NumRowsPerRun?


PS6, Line 1387:   if (no_spill_) {
              :     // 'min_buffers_required' is 1 since the sorter will only have 1 run at a time, so
              :     // there won't be any merges.
              :     min_buffers_required = 1;
              :   } else {
              :     min_buffers_required = MIN_BUFFERS_PER_MERGE;
              :   }
nit: ternary operator


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS6, Line 45: Specifying 'no_spill' as true is
For this use case, 'no_spill' should be set to true so the sorter can reduce...


PS6, Line 103: bool no_spill = false
comment on this even though it's mentioned above


Line 117:   /// up, it is sorted and a new unsorted run is created.
comment that this may not be called if no_spill is true


PS6, Line 127: InputDone
comment that this may not be called if no_spill is true
(are there any other cases I missed?)


http://gerrit.cloudera.org:8080/#/c/7267/6/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

PS6, Line 56: TODO: run experiments to
            :   // determine the value for this, consider making it configurable, enforce it in the BE.
JIRA?


PS6, Line 124: {
             :     return type_ != TSortType.PARTIAL;
             :   }
nit: oneline?


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has uploaded a new patch set (#6).

Change subject: IMPALA-5498: Support for partial sorts
......................................................................

IMPALA-5498: Support for partial sorts

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to its memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode.

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

Adds a new counter 'RunSize' to the runtime profile which tracks
the min, max, and avg size of the generated runs, in tuples.

As a first use case, partial sort is used where a total sort was
used previously for inserts into Kudu.

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M be/src/util/runtime-profile-counters.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
16 files changed, 460 insertions(+), 67 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 6:

(1 comment)

> Can you extend the Sort metrics in the query profile to include
 > Avg, Min and Max run size?
 > This will be needed to reflect on the quality of the sorted data?

Done.

http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS4, Line 99: to the merger and retrieve r
> I don't feel strongly - the change you made looks good to me. Maybe 'is_par
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 111:     num_rows_returned_ += row_batch->num_rows();
> Do we need to update num_rows_returned_ here?
Done


Line 148: }
> We can't have a partial sort in a subplan, right? It's probably best to jus
Done


PS4, Line 149: 
> Let's use nullptr consistently in the new code.
Done


http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS4, Line 99: ples in place.
> I feel like this is breaking some of the encapsulation of the sort implemen
If you prefer, I think its also reasonable to pass a 'is_partial" flag here.

Both solutions are kind of weird, but with the flag we could DCHECK in various places to ensure that callers are correctly following the intention here of not mixing calls to AddBatch() and AddBatchNoSpill() (though I think it would work as expected if they did).


http://gerrit.cloudera.org:8080/#/c/7267/4/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

Line 527:       if (!insertStmt.hasNoClusteredHint() && !ctx_.isSingleNodeExec()) {
> Why is this limited to Kudu?
We felt it was better to just do Kudu for now to keep this patch simpler and allow us to exercise it with the stress test before using it more broadly.

I filed: IMPALA-5649 to track applying this to HDFS tables.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] DRAFT - IMPALA-5498: Support for partial sorts

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: DRAFT - IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7267/1/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 104:       RETURN_IF_ERROR(sorter_->Open());
> I think you'll want to wait until the next call into GetNext() to re-open t
To be clear - what you're saying is that Open() will want to allocate memory for the new run, but it may not be able to get it because its being used by row_batch.

So this is a perf issue, not a correctness issue, and it still makes sense to call Reset() here.

I'll make this change, I just want to be sure that the memory ownership transfers are happening correctly and there aren't any potential data races.


Line 115:   do {
> I think it would make sense to open the sorter here (if it's not already op
Done


http://gerrit.cloudera.org:8080/#/c/7267/1/be/src/exec/partial-sort-node.h
File be/src/exec/partial-sort-node.h:

PS1, Line 30: a single tuple
> can you clarify, this is a bit unclear
Done


PS1, Line 34: If a merge phase was performed in the sort, sorted rows are deep copied into
            : /// the output batch.
> I thought there wouldn't need to be a merge
Whoops, forgot to update this. This whole comment is just copied from SortNode.


http://gerrit.cloudera.org:8080/#/c/7267/1/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

PS1, Line 353:   2: required bool use_top_n;
> this will move to TSortType, right?
Done


PS1, Line 355: 3: optional i64 offset
> I don't think we'll need to implement the behavior to support this for part
Its also used with total sorts - we make a cost based decision about whether to execute a sort with a limit as a top-n (implemented as a heap) or with a total sort that then only outputs n rows.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
......................................................................


Patch Set 6:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/7267/6//COMMIT_MSG
Commit Message:

Line 7: IMPALA-5498: Support for partial sorts
> lets specify in Kudu for now
Done


PS6, Line 28: A
> nit: This also adds...
Done


Line 32: used previously for inserts into Kudu.
> "for inserts into Kudu"->"for DML statements into Kudu tables only. Future 
This only applies to inserts/upserts. We still haven't done the partitioning/sorting for update/delete.


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/exec/partial-sort-node.h
File be/src/exec/partial-sort-node.h:

PS6, Line 62: derived based
> awkward/confusing
Done


PS6, Line 62: sort_exec_exprs_
> update
Done


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

PS6, Line 1383: RunSize
> can you indicate the units, e.g. #rows -> NumRowsPerRun?
Done


PS6, Line 1387:   if (no_spill_) {
              :     // 'min_buffers_required' is 1 since the sorter will only have 1 run at a time, so
              :     // there won't be any merges.
              :     min_buffers_required = 1;
              :   } else {
              :     min_buffers_required = MIN_BUFFERS_PER_MERGE;
              :   }
> nit: ternary operator
Done


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS6, Line 45: Specifying 'no_spill' as true is
> For this use case, 'no_spill' should be set to true so the sorter can reduc
Done


PS6, Line 103: no_spill
> A positive argument would be marginally easier to follow. E.g. 'enable_spil
Of course, I called it no_spill to mirror the function names. Do you think we should also change AddBatch/AddBatchNoSpill -> AddBatchWithSpill/AddBatch?


PS6, Line 103: bool no_spill = false
> comment on this even though it's mentioned above
Done


Line 117:   /// up, it is sorted and a new unsorted run is created.
> comment that this may not be called if no_spill is true
Done


PS6, Line 127: InputDone
> comment that this may not be called if no_spill is true
So this does have a DCHECK in it, but only if there's more than one run (which isn't possible unless AddBatch() was called), so its fine to call with either value of enable_spilling.

The other cases would be the various private *Merge* functions. I didn't add DCHECKs/documentation to those cause it seems cluttered and the DCHECK here in InputDone() ensures that none of them are called, but I can add it in if you want.


http://gerrit.cloudera.org:8080/#/c/7267/6/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

PS6, Line 56: TODO: run experiments to
            :   // determine the value for this, consider making it configurable, enforce it in the BE.
> JIRA?
Done


PS6, Line 124: {
             :     return type_ != TSortType.PARTIAL;
             :   }
> nit: oneline?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] DRAFT - IMPALA-5498: Support for partial sorts

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: DRAFT - IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 1:

(2 comments)

I haven't looked much at the Planner changes needed, eg. in SortNode.computeResourceProfile, but this patch gives a good idea of the interface changes needed.

http://gerrit.cloudera.org:8080/#/c/7267/1/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 41: Status PartialSortNode::Init(const TPlanNode& tnode, RuntimeState* state) {
There's a lot of boiler plate here and in Prepare, Open, and Codegen that could be shared with SortNode.


http://gerrit.cloudera.org:8080/#/c/7267/1/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

Line 65:   private final boolean useTopN_;
This will be replaced with TSortType.TOPN


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
......................................................................


Patch Set 8: Code-Review+2

Rebased

-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS4, Line 99: int64_t min_buffers_required
> If you prefer, I think its also reasonable to pass a 'is_partial" flag here
I don't feel strongly - the change you made looks good to me. Maybe 'is_partial' would be slightly better so that it's clearer what the expected usage pattern is, but I really don't feel strongly at all.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
......................................................................


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/908/

-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 3:

(6 comments)

Looks promising!

http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 32:     sorter_(NULL),
This doesn't need to handle offset?


Line 127:     input_batch_index_ += num_processed;
Since this only requires a partial sort, you can exit early when the cumulative number of rows processed exceeds the limit.  This might end up sorting significantly fewer rows (at the cost of less well sorted data when a limit is present).


Line 140:   if (ReachedLimit()) {
I guess it is not clear to me why allowing a limit without an offset is useful.  If there is a limit, the total ordering of rows is undefined and there is no way to continue fetching them without also allowing an offset.  Maybe document this limitation?


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.h
File be/src/exec/partial-sort-node.h:

Line 58: 
Just for general niceness, can you also update be/src/exec/sort-node.h's private SortInput method to have WARN_UNUSED_RESULT?


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

Line 1434:   DCHECK(unsorted_run_ != NULL);
Unnecessary - this is unconditionally dereferenced


http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 360:   // Not used with TSortType::PARTIAL.
Please ignore my earlier comments about offset handling (although a comment that offsets are not supported in exec/partion-sode-node.cc is still welcome.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Mostafa Mokhtar (Code Review)" <ge...@cloudera.org>.
Mostafa Mokhtar has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7267/4/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

Line 527:       if (!insertStmt.hasNoClusteredHint() && !ctx_.isSingleNodeExec()) {
Why is this limited to Kudu?

Inserting into Partitioned HDFS tables is a very common use case.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7267/4/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

Line 527:       if (!insertStmt.hasNoClusteredHint() && !ctx_.isSingleNodeExec()) {
> We felt it was better to just do Kudu for now to keep this patch simpler an
Also it's not just a matter of enabling it here. Some work is required in the BE to ensure coordination between the sorter (doing a partial sort) and the writing of files to make sure that a given file isn't written out of order, i.e. that new files get written at the partial sort boundaries.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7267/2/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 348: 
> these could use comments, especially since "partial sort" often means what 
Done


http://gerrit.cloudera.org:8080/#/c/7267/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

PS2, Line 296:       boolean useTopN = stmt.hasLimit() && !disableTopN;
> long line
Done


http://gerrit.cloudera.org:8080/#/c/7267/2/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

PS2, Line 93: 
> Maybe make this one a static factory for consistency. I like them- very rea
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes