You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2019/06/18 07:00:47 UTC

[kudu-CR] KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

Todd Lipcon has uploaded this change for review. ( http://gerrit.cloudera.org:8080/13666


Change subject: KUDU-2866. Optimize CFileSet::Iterator::FinishBatch
......................................................................

KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

I was running SELECT * FROM wide_table WHERE col = <non-matching>
and found that 10% of the CPU was used in this function. Looking at the
disassembly, it seemed that most of the cycles were going to iteration
over the vector<bool> (bitset) of prepared columns. In this case, only
one column was prepared (the one with the predicate) so the iteration
and bitset-testing was a big waste of CPU.

In fact, any given column will only be prepared once, so we can just
keep a simple list of the prepared columns and iterate over those
explicitly, making the loop O(num prepared columns) instead of
O(columns).

This dropped CFileSet::Iterator::FinishBatch() from taking 12.4% of the
scanner CPU down to about 0.2%.

Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
---
M src/kudu/tablet/cfile_set.cc
M src/kudu/tablet/cfile_set.h
2 files changed, 11 insertions(+), 18 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/66/13666/1
-- 
To view, visit http://gerrit.cloudera.org:8080/13666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
Gerrit-Change-Number: 13666
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/13666 )

Change subject: KUDU-2866. Optimize CFileSet::Iterator::FinishBatch
......................................................................


Patch Set 2: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13666/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13666/2//COMMIT_MSG@16
PS2, Line 16: will only be prepared once
> Currently I think we are limited to a single predicate per column, since Sc
Thank you for the clarification.  I looked into MaterializingIterator::Init() more closely and I don't see how we can get into the situation I was worried about.


http://gerrit.cloudera.org:8080/#/c/13666/2/src/kudu/tablet/cfile_set.cc
File src/kudu/tablet/cfile_set.cc:

http://gerrit.cloudera.org:8080/#/c/13666/2/src/kudu/tablet/cfile_set.cc@470
PS2, Line 470:   prepared_iters_.clear();
> I don't think we'd gain much from that, since 'clear' doesn't actually shri
Yep, doing it once in Init() is the best place, I agree.



-- 
To view, visit http://gerrit.cloudera.org:8080/13666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
Gerrit-Change-Number: 13666
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 19 Jun 2019 00:14:55 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13666 )

Change subject: KUDU-2866. Optimize CFileSet::Iterator::FinishBatch
......................................................................

KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

I was running SELECT * FROM wide_table WHERE col = <non-matching>
and found that 10% of the CPU was used in this function. Looking at the
disassembly, it seemed that most of the cycles were going to iteration
over the vector<bool> (bitset) of prepared columns. In this case, only
one column was prepared (the one with the predicate) so the iteration
and bitset-testing was a big waste of CPU.

In fact, any given column will only be prepared once, so we can just
keep a simple list of the prepared columns and iterate over those
explicitly, making the loop O(num prepared columns) instead of
O(columns).

This dropped CFileSet::Iterator::FinishBatch() from taking 12.4% of the
scanner CPU down to about 0.2%.

Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
Reviewed-on: http://gerrit.cloudera.org:8080/13666
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
M src/kudu/tablet/cfile_set.cc
M src/kudu/tablet/cfile_set.h
2 files changed, 16 insertions(+), 22 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Alexey Serbin: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/13666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
Gerrit-Change-Number: 13666
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13666

to look at the new patch set (#2).

Change subject: KUDU-2866. Optimize CFileSet::Iterator::FinishBatch
......................................................................

KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

I was running SELECT * FROM wide_table WHERE col = <non-matching>
and found that 10% of the CPU was used in this function. Looking at the
disassembly, it seemed that most of the cycles were going to iteration
over the vector<bool> (bitset) of prepared columns. In this case, only
one column was prepared (the one with the predicate) so the iteration
and bitset-testing was a big waste of CPU.

In fact, any given column will only be prepared once, so we can just
keep a simple list of the prepared columns and iterate over those
explicitly, making the loop O(num prepared columns) instead of
O(columns).

This dropped CFileSet::Iterator::FinishBatch() from taking 12.4% of the
scanner CPU down to about 0.2%.

Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
---
M src/kudu/tablet/cfile_set.cc
M src/kudu/tablet/cfile_set.h
2 files changed, 15 insertions(+), 22 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/66/13666/2
-- 
To view, visit http://gerrit.cloudera.org:8080/13666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
Gerrit-Change-Number: 13666
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/13666 )

Change subject: KUDU-2866. Optimize CFileSet::Iterator::FinishBatch
......................................................................


Patch Set 3: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/13666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
Gerrit-Change-Number: 13666
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 28 Jun 2019 18:30:42 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/13666 )

Change subject: KUDU-2866. Optimize CFileSet::Iterator::FinishBatch
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13666/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13666/2//COMMIT_MSG@16
PS2, Line 16: will only be prepared once
Most likely I'm missing something, but from breefly looking into the implementation, I don't see how this is guaranteed.  What if running some silly query like SELECT * FROM x WHERE col = a AND col = b?  Will column 'col' be prepared only once?  E.g., what if that code is called from MaterializingIterator::MaterializeBlock() twice?


http://gerrit.cloudera.org:8080/#/c/13666/2/src/kudu/tablet/cfile_set.cc
File src/kudu/tablet/cfile_set.cc:

http://gerrit.cloudera.org:8080/#/c/13666/2/src/kudu/tablet/cfile_set.cc@470
PS2, Line 470:   prepared_iters_.clear();
nit: does it make sense to add

  prepared_iters_.reserve(col_iters_.size());

?



-- 
To view, visit http://gerrit.cloudera.org:8080/13666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
Gerrit-Change-Number: 13666
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 18 Jun 2019 21:23:33 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13666 )

Change subject: KUDU-2866. Optimize CFileSet::Iterator::FinishBatch
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13666/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13666/2//COMMIT_MSG@16
PS2, Line 16: will only be prepared once
> Most likely I'm missing something, but from breefly looking into the implem
Currently I think we are limited to a single predicate per column, since ScanSpec::predicates_ is a map from column name to single ColumnPredicate instance. In the case where we have multiple predicates on the same column, we currently always merge them.

Then in MaterializeBlock(), we will only materialize each column once, because we separate the columns into those with predicates and those without.

Test coverage so far seems to agree. But I might believe that we have a lack of coverage and I missed something. Any idea of how we might produce the situation you're worried about?


http://gerrit.cloudera.org:8080/#/c/13666/2/src/kudu/tablet/cfile_set.cc
File src/kudu/tablet/cfile_set.cc:

http://gerrit.cloudera.org:8080/#/c/13666/2/src/kudu/tablet/cfile_set.cc@470
PS2, Line 470:   prepared_iters_.clear();
> nit: does it make sense to add
I don't think we'd gain much from that, since 'clear' doesn't actually shrink the backing array. So, if on the first batch, we do end up using the full list of iters here, it'll keep that same "shape" across iterations.

I could see doing that in CFileSet::Iterator::Init, though. I'll add it there.



-- 
To view, visit http://gerrit.cloudera.org:8080/13666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
Gerrit-Change-Number: 13666
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 18 Jun 2019 22:53:52 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13666

to look at the new patch set (#3).

Change subject: KUDU-2866. Optimize CFileSet::Iterator::FinishBatch
......................................................................

KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

I was running SELECT * FROM wide_table WHERE col = <non-matching>
and found that 10% of the CPU was used in this function. Looking at the
disassembly, it seemed that most of the cycles were going to iteration
over the vector<bool> (bitset) of prepared columns. In this case, only
one column was prepared (the one with the predicate) so the iteration
and bitset-testing was a big waste of CPU.

In fact, any given column will only be prepared once, so we can just
keep a simple list of the prepared columns and iterate over those
explicitly, making the loop O(num prepared columns) instead of
O(columns).

This dropped CFileSet::Iterator::FinishBatch() from taking 12.4% of the
scanner CPU down to about 0.2%.

Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
---
M src/kudu/tablet/cfile_set.cc
M src/kudu/tablet/cfile_set.h
2 files changed, 16 insertions(+), 22 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/66/13666/3
-- 
To view, visit http://gerrit.cloudera.org:8080/13666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
Gerrit-Change-Number: 13666
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2866. Optimize CFileSet::Iterator::FinishBatch

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/13666 )

Change subject: KUDU-2866. Optimize CFileSet::Iterator::FinishBatch
......................................................................


Patch Set 2: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I997fe832fdfa8d92fbbcb5d7c5bd4141e485b4f8
Gerrit-Change-Number: 13666
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 18 Jun 2019 12:41:20 +0000
Gerrit-HasComments: No