You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Tianyi Wang (Code Review)" <ge...@cloudera.org> on 2017/10/06 17:23:40 UTC

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Tianyi Wang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8226


Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................

IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

This patch replaces BufferedTupleStream::GetRows with GetRowBatches,
which returns multiple batches instead of one.
BufferedTupleStrem::GetRows pins a stream and read all the rows into a
single batch, ignoring batch_size configuration. It is not a good API
and may cause problems in memory management and row batch processing.

Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
---
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
5 files changed, 43 insertions(+), 68 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/8226/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10
PS1, Line 10: which returns multiple batches instead of one.
> that will cause us to still have to pin the entire stream in memory.  Inste
The callers iterates through the build rows for every row in null_probe_rows/null_aware_probe_partition_. If we don't pin the stream we need to read the stream multi times, for every prob batch. Sure about that?



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Comment-Date: Fri, 06 Oct 2017 19:03:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1400/


-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 5
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Oct 2017 02:44:03 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 4:

Tim, can you do the +2 review for this one?


-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 4
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Oct 2017 17:49:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@332
PS1, Line 332:   Status GetNext(
> My intent was to remove the GetRows() API altogether and switch callers to 
Do you mean to keep only one rowbatch and assemble the row batches multiple times but yet let the stream stay pinned?



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 18 Oct 2017 23:36:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 3: Code-Review+1

(2 comments)

Change looks good to me - glad to see the code removed. Can the other reviewers take another look (or let me know if they don't intend to)?

http://gerrit.cloudera.org:8080/#/c/8226/3/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

http://gerrit.cloudera.org:8080/#/c/8226/3/be/src/exec/partitioned-hash-join-node.cc@948
PS3, Line 948:     DCHECK(got_reservation);
Maybe add a DCHECK message like '<< "Should have been pinned"' to explain the DCHECK.


http://gerrit.cloudera.org:8080/#/c/8226/3/be/src/exec/partitioned-hash-join-node.cc@1073
PS3, Line 1073:       DCHECK(got_reservation);
Same here



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 3
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Oct 2017 17:36:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................


Patch Set 1:

> (1 comment)

Correct, it will unpin/pin multiple times. But if the memory is available the buffers will stay in memory even though it was unpinned (you'll want to take a look through the buffer pool code starting with the header comments). If the memory is not available, then we will read from disk but compare that to today where we'd run out of memory and fail. It does mean, however, that rows are "unflattened" multiple times.

It would be good to sync up with Tim about this JIRA to ask his intention when he returns on Monday.  I believe the JIRA was filed a while back before we did a big rework of these layers (I think the JIRA still makes sense, but good to sync up).


-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Comment-Date: Fri, 06 Oct 2017 21:54:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Tim Armstrong, Dan Hecht, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8226

to look at the new patch set (#6).

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................

IMPALA-2758: Remove BufferedTupleStream::GetRows

This patch removes BufferedTupleStream::GetRows. This function pins a
stream and reads all the rows into a single batch. It is not a good API
since it creates an arbitrarily large row batch. In this patch the call
sites pin the stream and then directly use GetNext to retrieve a single
batch at a time.

Testing: It passes existing tests. A test case for GetRows is removed.

Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
---
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
5 files changed, 63 insertions(+), 99 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/8226/6
-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 6
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@332
PS1, Line 332:   Status GetNext(
> Do you mean to keep only one rowbatch and assemble the row batches multiple
Yes that's what I meant - have one RowBatch and iterate over the stream by calling GetNext() repeatedly.



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Oct 2017 20:47:38 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/exec/partitioned-hash-join-node.h
File be/src/exec/partitioned-hash-join-node.h:

http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/exec/partitioned-hash-join-node.h@482
PS2, Line 482: output_null_aware_probe_rows_running
> Needs underscore on end.
Done


http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/exec/partitioned-hash-join-node.cc@948
PS2, Line 948:     DCHECK(got_reservation);
> The stream is pinned so this should be impossible - let's make it a DCHECK.
Done


http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/exec/partitioned-hash-join-node.cc@1073
PS2, Line 1073:       DCHECK(got_reservation);
> Same here - got_reservation should always be true for a pinned stream.
Done


http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/runtime/buffered-tuple-stream.h@127
PS2, Line 127: 
> There are some references to GetRows() in the comments that can be cleaned 
Done


http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/runtime/buffered-tuple-stream.h@371
PS2, Line 371:   /// Wrapper around BufferPool::PageHandle that tracks additional info about the page.
> This test was removed too.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 3
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 25 Oct 2017 23:22:06 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10
PS1, Line 10: which returns multiple batches instead of one.
> I'm ok with the stream staying pinned, the main goal was to remove the GetR
WFM, agree the important step here is to eliminate GetRows().



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 11 Oct 2017 16:55:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 2:

Reimplemented according to the above discussion.


-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 2
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 25 Oct 2017 21:24:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 4: Code-Review+2

(4 comments)

Looks good, just did a final pass and caught a couple more things. If you fix those and rebase then I can start the merge.

http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.h
File be/src/exec/partitioned-hash-join-node.h:

http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.h@482
PS4, Line 482:   bool output_null_aware_probe_rows_running_;
Longer-term we should think about ways to reduce the sheer number of stateful member variables in this class and make the state machine in GetNext() easier to understand.


http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.cc@949
PS4, Line 949:     RowBatch null_build_batch(child(1)->row_desc(), state->batch_size(),
This can fit on one line.


http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.cc@962
PS4, Line 962:       null_build_batch.Reset();
We should add a RETURN_IF_CANCELLED() check here, similar to the one below (I think this was an oversight in an earlier patch).


http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.cc@1074
PS4, Line 1074:       RowBatch build_batch(child(1)->row_desc(), state->batch_size(),
This can fit on one line.



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 4
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Oct 2017 22:22:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has abandoned this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................


Abandoned

Need to sync up with Tim before moving forward.
-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8226

to look at the new patch set (#2).

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................

IMPALA-2758: Remove BufferedTupleStream::GetRows

This patch removes BufferedTupleStream::GetRows. This function pins a
stream and reads all the rows into a single batch. It is not a good API
since it creates an arbitrarily large row batch. In this patch the call
sites pin the stream and then directly use GetNext to retrieve a single
batch at a time.

Testing: It passes existing tests. A test case for GetRows is removed.

Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
---
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
5 files changed, 57 insertions(+), 90 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/8226/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 2
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 6: Code-Review+2

Glad we have automated checking for dropped status now, it's super-easy to miss.


-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 6
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Oct 2017 18:47:03 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................

IMPALA-2758: Remove BufferedTupleStream::GetRows

This patch removes BufferedTupleStream::GetRows. This function pins a
stream and reads all the rows into a single batch. It is not a good API
since it creates an arbitrarily large row batch. In this patch the call
sites pin the stream and then directly use GetNext to retrieve a single
batch at a time.

Testing: It passes existing tests. A test case for GetRows is removed.

Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Reviewed-on: http://gerrit.cloudera.org:8080/8226
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
5 files changed, 63 insertions(+), 99 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 7
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@14
PS1, Line 14: 
Any testing?


http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/exec/partitioned-hash-join-node.cc@950
PS1, Line 950: break;
I think this changes the behavior here - this will only break out of the FOREACH_ROW loop, but what we want is to stop iterating over the entire set of build rows (that was previously one batch, but is now the same set of rows split into multiple batches).


http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@339
PS1, Line 339: vector<std::unique_ptr<RowBatch>>* batch
Can you move this to be after 'batch_size'? We generally keep out parameters as the trailing parameters for clarity.



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 06 Oct 2017 18:24:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1400/


-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 5
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Oct 2017 23:15:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 5: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 5
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Oct 2017 23:15:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10
PS1, Line 10: which returns multiple batches instead of one.
that will cause us to still have to pin the entire stream in memory.  Instead, I think the ask of the JIRA is the change the callers to use the GetNext() interface, so that they only need to keep a single batch in memory at a time, since I believe all callers are really just iterating through the stream anyway.



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 06 Oct 2017 18:47:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8226

to look at the new patch set (#5).

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................

IMPALA-2758: Remove BufferedTupleStream::GetRows

This patch removes BufferedTupleStream::GetRows. This function pins a
stream and reads all the rows into a single batch. It is not a good API
since it creates an arbitrarily large row batch. In this patch the call
sites pin the stream and then directly use GetNext to retrieve a single
batch at a time.

Testing: It passes existing tests. A test case for GetRows is removed.

Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
---
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
5 files changed, 63 insertions(+), 99 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/8226/5
-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 5
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8226/3/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

http://gerrit.cloudera.org:8080/#/c/8226/3/be/src/exec/partitioned-hash-join-node.cc@948
PS3, Line 948:     DCHECK(got_reservation) << "Should have been pinned";
> Maybe add a DCHECK message like '<< "Should have been pinned"' to explain t
Done


http://gerrit.cloudera.org:8080/#/c/8226/3/be/src/exec/partitioned-hash-join-node.cc@1073
PS3, Line 1073:       DCHECK(got_reservation) << "Should have been pinned";
> Same here
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 4
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Oct 2017 17:44:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8226

to look at the new patch set (#4).

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................

IMPALA-2758: Remove BufferedTupleStream::GetRows

This patch removes BufferedTupleStream::GetRows. This function pins a
stream and reads all the rows into a single batch. It is not a good API
since it creates an arbitrarily large row batch. In this patch the call
sites pin the stream and then directly use GetNext to retrieve a single
batch at a time.

Testing: It passes existing tests. A test case for GetRows is removed.

Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
---
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
5 files changed, 64 insertions(+), 99 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/8226/4
-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 4
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has restored this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................


Restored
-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: restore
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.h
File be/src/exec/partitioned-hash-join-node.h:

http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.h@482
PS4, Line 482:   bool output_null_aware_probe_rows_running_;
> Longer-term we should think about ways to reduce the sheer number of statef
Yeah many of the members are just used by one state. I think we should summarize the states into enums or put the stateful members into std::variant/boost::variant altogether.


http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.cc@949
PS4, Line 949:     RowBatch null_build_batch(child(1)->row_desc(), state->batch_size(), mem_tracker());
> This can fit on one line.
Done


http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.cc@962
PS4, Line 962:       RETURN_IF_CANCELLED(state);
> We should add a RETURN_IF_CANCELLED() check here, similar to the one below 
Done


http://gerrit.cloudera.org:8080/#/c/8226/4/be/src/exec/partitioned-hash-join-node.cc@1074
PS4, Line 1074:       RowBatch build_batch(child(1)->row_desc(), state->batch_size(), mem_tracker());
> This can fit on one line.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 5
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Oct 2017 22:55:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8226

to look at the new patch set (#3).

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................

IMPALA-2758: Remove BufferedTupleStream::GetRows

This patch removes BufferedTupleStream::GetRows. This function pins a
stream and reads all the rows into a single batch. It is not a good API
since it creates an arbitrarily large row batch. In this patch the call
sites pin the stream and then directly use GetNext to retrieve a single
batch at a time.

Testing: It passes existing tests. A test case for GetRows is removed.

Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
---
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
5 files changed, 64 insertions(+), 99 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/8226/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 3
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1408/


-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 6
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Oct 2017 18:46:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................


Patch Set 1:

> Patch Set 1:
> 
> > (1 comment)
> 
> Correct, it will unpin/pin multiple times. But if the memory is available the buffers will stay in memory even though it was unpinned (you'll want to take a look through the buffer pool code starting with the header comments). If the memory is not available, then we will read from disk but compare that to today where we'd run out of memory and fail. It does mean, however, that rows are "unflattened" multiple times.
> 
> It would be good to sync up with Tim about this JIRA to ask his intention when he returns on Monday.  I believe the JIRA was filed a while back before we did a big rework of these layers (I think the JIRA still makes sense, but good to sync up).

Thanks. I will abandon this patch for now.


-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Comment-Date: Fri, 06 Oct 2017 21:56:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 2:

(5 comments)

The code change looks good, just needs a bit more cleanup.

http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/exec/partitioned-hash-join-node.h
File be/src/exec/partitioned-hash-join-node.h:

http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/exec/partitioned-hash-join-node.h@482
PS2, Line 482: output_null_aware_probe_rows_running
Needs underscore on end.


http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/exec/partitioned-hash-join-node.cc@948
PS2, Line 948:     if (!got_reservation) return NullAwareAntiJoinError(null_build_stream);
The stream is pinned so this should be impossible - let's make it a DCHECK.


http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/exec/partitioned-hash-join-node.cc@1073
PS2, Line 1073:       if (!got_reservation) return NullAwareAntiJoinError(build);
Same here - got_reservation should always be true for a pinned stream.


http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/runtime/buffered-tuple-stream.h@127
PS2, Line 127: or GetRows()
There are some references to GetRows() in the comments that can be cleaned up too.


http://gerrit.cloudera.org:8080/#/c/8226/2/be/src/runtime/buffered-tuple-stream.h@371
PS2, Line 371:   friend class SimpleTupleStreamTest_TestGetRowsOverflow_Test;
This test was removed too.



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 2
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 25 Oct 2017 22:35:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 6: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 6
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Oct 2017 22:15:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10
PS1, Line 10: which returns multiple batches instead of one.
> The callers iterates through the build rows for every row in null_probe_row
I'm ok with the stream staying pinned, the main goal was to remove the GetRows() API, which uses RowBatch in a weird way.

I think keeping the stream unpinned to handle many nulls on the build side is just a special case of
https://issues.apache.org/jira/browse/IMPALA-4857. I think that is harder to implement well - we'd want to consider doing some kind of blocked algorithm to reduce I/O.


http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@332
PS1, Line 332:   Status GetNext(
My intent was to remove the GetRows() API altogether and switch callers to calling GetNext() directly.

Creating many row batches is somewhat better, but there's still a lot of unnecessary memory overhead with all of the tuple_ptrs_.



-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 10 Oct 2017 20:34:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2758: Remove BufferedTupleStream::GetRows

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Remove BufferedTupleStream::GetRows
......................................................................


Patch Set 6:

Sorry I forgot to handle some error.


-- 
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 6
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Oct 2017 18:35:47 +0000
Gerrit-HasComments: No