You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Mike Percy (Code Review)" <ge...@cloudera.org> on 2019/01/09 22:30:50 UTC

[kudu-CR] WIP: KUDU-2645. tablet: Add ghost row de-duplication support to MergeIterator

Mike Percy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12205


Change subject: WIP: KUDU-2645. tablet: Add ghost row de-duplication support to MergeIterator
......................................................................

WIP: KUDU-2645. tablet: Add ghost row de-duplication support to MergeIterator

Needs some polish and additional coverage.

Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
---
M src/kudu/common/generic_iterators.cc
M src/kudu/common/generic_iterators.h
M src/kudu/tablet/diff_scan-test.cc
3 files changed, 179 insertions(+), 38 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/12205/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 1
Gerrit-Owner: Mike Percy <mp...@apache.org>

[kudu-CR] WIP: KUDU-2645. tablet: Add ghost row de-duplication support to MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: WIP: KUDU-2645. tablet: Add ghost row de-duplication support to MergeIterator
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc@308
PS1, Line 308:   vector<size_t> smallest_indexes(iters_.size());
not used, will remove



-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 1
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 09 Jan 2019 22:37:49 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has removed a vote on this change.

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 9
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@439
PS3, Line 439:       int cmp;
> Good observation. I think smallest.resize(1, iter.get()) would be more effi
Oops, that resize() doesn't work that way when shrinking the vector, but smallest.clear(); smallest.emplace_back() looks to be fine.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@441
PS3, Line 441:         // If we have a candidate for smallest row, compare it against the
> My take is that if I use emplace_back(), I don't have to think about whethe
Well, I ran a little test and indeed emplace_back() is faster.

  #include <iostream>
  #include <string>
  #include <vector>

  using std::cerr;
  using std::endl;
  using std::string;
  using std::vector;

  void test_emplace_back(vector<int>& vec, int count) {
    for (int i = 0; i < count; i++) {
      vec.emplace_back(i);
    }
  }

  void test_push_back(vector<int>& vec, int count) {
    for (int i = 0; i < count; i++) {
      vec.push_back(i);
    }
  }

  int main(int argc, const char** argv) {
    if (argc == 1) {
      cerr << "Usage: " << argv[0] << " emplace_back|push_back" << endl;
      return 1;
    }
    string arg = argv[1];

    constexpr int kCount = 100 * 1000 * 1000;

    vector<int> vec;
    vec.reserve(kCount);
    if (arg == "emplace_back") {
      test_emplace_back(vec, kCount);
      return 0;
    }
    if (arg == "push_back") {
      test_push_back(vec, kCount);
      return 0;
    }

    return 1;
  }

Compiled by g++ with -O3 and -std=c++11, I get the following runtime numbers (which are roughly representative):

  $ time ./emplace-test emplace_back

  real    0m0.192s
  user    0m0.093s
  sys     0m0.100s

  $ time ./emplace-test push_back

  real    0m0.280s
  user    0m0.176s
  sys     0m0.104s


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@479
PS3, Line 479:           // We found the single live instance of the row.
> I think you can deduplicate the two calls to CopyRow (one here and the othe
Gotcha, done.


http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc@474
PS4, Line 474:       int live_rows_found = 0;
> Could you make sure this doesn't trigger a not-used warning in RELEASE buil
Done


http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc@503
PS4, Line 503:     unordered_set<MergeIterState*> exhausted;
> I think this can be a simple vector; aren't we guaranteed that there are no
The point of using the set data structure is to get O(1) lookup within the if-clause in std::remove_if()

I'd be surprised if the allocation overhead were significant here because tcmalloc should be using thread-local storage for these relatively small STL data structures.

So I just ran TestMerge 20 times on the current patch with --num-lists 100, and on a version of it after reverting the bulk of the changes in generic_iterators.cc back to the existing master branch code, and the performance was essentially identical for the "real" time measurement, in seconds: existing code = min 0.118, median 0.126, max 0.147; new code = min 0.118, median 0.125, max 0.143

It appears that the additional tcmalloc (non-central freelist) allocation overhead for the vector and unordered_set elements are in the noise compared to the merge process itself, which is fairly CPU intensive.



-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 4
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Fri, 01 Mar 2019 23:54:03 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12205

to look at the new patch set (#5).

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................

KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

This patch makes it possible to do an incremental diff scan of an entire
tablet. A follow-up patch will expose this capability to scanners at the
RPC level.

Included is new support in the MergeIterator for deduplicating deleted
rows resulting from when a row with a particular primary key is deleted
from one rowset and reinserted into a different rowset. These duplicates
may be returned when the 'include_deleted_rows' option is enabled in the
row iterator options.

Because of this new behavior, while the MergeIterator will deduplicate
deleted rows if they are included in the result set, the UnionIterator
will not deduplicate them and instead return all instances found,
because it's not possible to efficiently support such deduplication
without a merge-like process.

One tangentially-related change in this patch is that TestMerge in
generic_iterators-test.cc was modified to no longer generate duplicate
non-deleted keys for merge testing. Duplicate non-deleted row keys are
no longer supported in the MergeIterator since there is currently no
practical use for that, and it's more efficient not to support them
since at the time of writing it isn't possible for them to appear in a
real tablet.

This patch adds tests at a couple of different levels, including a
VectorIterator-based test that is useful for benchmarking MergeIterator
performance on a simple schema, as well as a higher-level diff scan
test that operates at the Tablet level.

Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
---
M src/kudu/common/generic_iterators-test.cc
M src/kudu/common/generic_iterators.cc
M src/kudu/common/generic_iterators.h
M src/kudu/tablet/diff_scan-test.cc
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet.cc
6 files changed, 305 insertions(+), 77 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/12205/5
-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 5
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................


Patch Set 8:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/12205/8/src/kudu/common/generic_iterators-test.cc
File src/kudu/common/generic_iterators-test.cc:

http://gerrit.cloudera.org:8080/#/c/12205/8/src/kudu/common/generic_iterators-test.cc@287
PS8, Line 287:           ASSERT_EQ(1, expected_ints_deleted.erase(potential));
Nit: you can upgrade this to CHECK_EQ; the ContainsKey() check on L286 guarantees that it's true, so it's not really something being tested (i.e. something that warrants an assert).


http://gerrit.cloudera.org:8080/#/c/12205/8/src/kudu/common/generic_iterators-test.cc@369
PS8, Line 369:             int is_deleted_idx = schema.find_first_is_deleted_virtual_column();
This can be pulled out of the loops.


http://gerrit.cloudera.org:8080/#/c/12205/8/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/8/src/kudu/common/generic_iterators.cc@39
PS8, Line 39: #include "kudu/common/common.pb.h"
Any idea what this inclusion is for?



-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 8
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 06 Mar 2019 00:57:03 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................


Patch Set 9: Verified+1

Overriding flaky RemoteKsckTest.TestClusterWithLocation (which I tried to file a JIRA for, but ASF JIRA is acting up right now)


-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 9
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 06 Mar 2019 07:14:21 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12205

to look at the new patch set (#7).

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................

KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

This patch makes it possible to do an incremental diff scan of an entire
tablet. A follow-up patch will expose this capability to scanners at the
RPC level.

Included is new support in the MergeIterator for deduplicating deleted
rows resulting from when a row with a particular primary key is deleted
from one rowset and reinserted into a different rowset. These duplicates
may be returned when the 'include_deleted_rows' option is enabled in the
row iterator options.

Because of this new behavior, while the MergeIterator will deduplicate
deleted rows if they are included in the result set, the UnionIterator
will not deduplicate them and instead return all instances found,
because it's not possible to efficiently support such deduplication
without a merge-like process.

One tangentially-related change in this patch is that TestMerge in
generic_iterators-test.cc was modified to no longer generate duplicate
non-deleted keys for merge testing. Duplicate non-deleted row keys are
no longer supported in the MergeIterator since there is currently no
practical use for that, and it's more efficient not to support them
since at the time of writing it isn't possible for them to appear in a
real tablet.

This patch adds tests at a couple of different levels, including a
VectorIterator-based test that is useful for benchmarking MergeIterator
performance on a simple schema, as well as a higher-level diff scan
test that operates at the Tablet level.

Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
---
M src/kudu/common/generic_iterators-test.cc
M src/kudu/common/generic_iterators.cc
M src/kudu/common/generic_iterators.h
M src/kudu/common/schema.cc
M src/kudu/common/schema.h
M src/kudu/tablet/diff_scan-test.cc
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet.cc
8 files changed, 308 insertions(+), 78 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/12205/7
-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 7
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12205

to look at the new patch set (#2).

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................

KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

This patch makes it possible to do an incremental diff scan of an entire
tablet. A follow-up patch will expose this capability to scanners at the
RPC level.

Included is new support in the MergeIterator for deduplicating deleted
rows resulting from when a row with a particular primary key is deleted
from one rowset and reinserted into a different rowset. These duplicates
may be returned when the 'include_deleted_rows' option is enabled in the
row iterator options.

Because of this new behavior, while the MergeIterator will deduplicate
deleted rows if they are included in the result set, the UnionIterator
will not deduplicate them and instead return all instances found.

One tangentially-related change in this patch is that TestMerge in
generic_iterators-test.cc was modified to no longer generate duplicate
non-deleted keys for merge testing. Duplicate non-deleted row keys are
no longer supported in the MergeIterator since there is currently no
practical use for that, and it's more efficient not to support them
since at the time of writing it isn't possible for them to appear in a
real tablet.

Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
---
M src/kudu/common/generic_iterators-test.cc
M src/kudu/common/generic_iterators.cc
M src/kudu/common/generic_iterators.h
M src/kudu/tablet/diff_scan-test.cc
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet.cc
6 files changed, 176 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/12205/2
-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 2
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................


Patch Set 9: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 9
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 06 Mar 2019 05:11:13 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................


Patch Set 8:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/12205/8/src/kudu/common/generic_iterators-test.cc
File src/kudu/common/generic_iterators-test.cc:

http://gerrit.cloudera.org:8080/#/c/12205/8/src/kudu/common/generic_iterators-test.cc@287
PS8, Line 287:           ASSERT_EQ(1, expected_ints_deleted.erase(potential));
> Nit: you can upgrade this to CHECK_EQ; the ContainsKey() check on L286 guar
Done


http://gerrit.cloudera.org:8080/#/c/12205/8/src/kudu/common/generic_iterators-test.cc@369
PS8, Line 369:             int is_deleted_idx = schema.find_first_is_deleted_virtual_column();
> This can be pulled out of the loops.
Done


http://gerrit.cloudera.org:8080/#/c/12205/8/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/8/src/kudu/common/generic_iterators.cc@39
PS8, Line 39: #include "kudu/common/common.pb.h"
> Any idea what this inclusion is for?
Yeah, it's for IS_DELETED



-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 8
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 06 Mar 2019 03:51:45 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................


Patch Set 3:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG@19
PS3, Line 19: Because of this new behavior, while the MergeIterator will deduplicate
            : deleted rows if they are included in the result set, the UnionIterator
            : will not deduplicate them and instead return all instances found.
I might reword this to state that it's not possible to efficiently support this deduplication _without_ a merge-like process, so there's no reason to extend it to the UnionIterator.


http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG@22
PS3, Line 22: 
            : One tangentially-related change in this patch is that TestMerge in
            : generic_iterators-test.cc was modified to no longer generate duplicate
            : non-deleted keys for merge testing
This makes sense given what you're trying to do, but it's a bit of a bummer in that it makes the MergeIterator less "generic".


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators-test.cc
File src/kudu/common/generic_iterators-test.cc:

PS3: 
Could you add a test case for include_deleted_rows=true that uses these simple VectorIterators? That'll help assess the impact of deduplication.

Separately, could you assess the impact of the MergeIterator change when !include_deleted_rows? See my dominance patch for some ideas on how to benchmark.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators-test.cc@232
PS3, Line 232:       seen.insert(potential);
Nit: InsertOrDie. Or EmplaceOrDie (if it works).


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.h
File src/kudu/common/generic_iterators.h:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.h@38
PS3, Line 38: schema
Nit: schemas (since this is a requirement for every sub-iterator).


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@300
PS3, Line 300:     : opts_(opts),
std::move(opts)


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@419
PS3, Line 419:   vector<MergeIterState*> smallest(states_.size());
Can we micro-optimize the !include_deleted_rows case such that the operations (and allocation) of the vector are avoided?


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@432
PS3, Line 432:     for (const auto& iter : states_) {
There's a non-obvious state machine in here; could you document it?


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@439
PS3, Line 439:         smallest = { iter.get() };
Is this is a copy assignment? For the common case where there are no duplicates, is this as efficient as overwriting the first element?


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@441
PS3, Line 441:         smallest.push_back(iter.get());
Nit: emplace_back for new code.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@465
PS3, Line 465:           // We found the single live instance of the row.
In DEBUG builds, should we verify that any remaining instances are ghosts?


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@479
PS3, Line 479:       RETURN_NOT_OK(CopyRow(row_to_return->next_row(), &dst_row, dst->arena()));
I think this can be factored out of both sides of the if statement if you declare row_to_return on L448 and set it appropriately in the !include_deleted_rows case.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@487
PS3, Line 487:     // Remove exhausted sub-iterators from the list we merge from.
Any reason not to combine this loop with the one that advances the sub-iterators just above?


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@504
PS3, Line 504: 
             :   // Maintain a record of the number of rows actually copied to the destination
             :   // RowBlock.
I think this is mostly self-explanatory from the code. What would be interesting to document, though, is that the possibility of deduplication is what makes the previous dst->Resize() call in PrepareBatch insufficient.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@536
PS3, Line 536:   return unique_ptr<RowwiseIterator>(new MergeIterator(opts, std::move(iters)));
std::move(opts)


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc
File src/kudu/tablet/diff_scan-test.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc@61
PS3, Line 61:   ASSERT_TRUE(order_mode == UNORDERED || order_mode == ORDERED)
            :       << "unknown order mode: " << OrderMode_Name(order_mode);
Doesn't seem necessary to me; the list of values is just a few lines above.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc@97
PS3, Line 97:   opts.include_deleted_rows = true;
Would be cool to parameterize on this too.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/rowset.cc
File src/kudu/tablet/rowset.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/rowset.cc@105
PS3, Line 105:     case ORDERED: {
Why do you need the new scope here? Same for tablet.cc.



-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 3
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 13 Feb 2019 05:22:37 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12205

to look at the new patch set (#9).

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................

KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

This patch makes it possible to do an incremental diff scan of an entire
tablet. A follow-up patch will expose this capability to scanners at the
RPC level.

Included is new support in the MergeIterator for deduplicating deleted
rows resulting from when a row with a particular primary key is deleted
from one rowset and reinserted into a different rowset. These duplicates
may be returned when the 'include_deleted_rows' option is enabled in the
row iterator options.

Because of this new behavior, while the MergeIterator will deduplicate
deleted rows if they are included in the result set, the UnionIterator
will not deduplicate them and instead return all instances found,
because it's not possible to efficiently support such deduplication
without a merge-like process.

One tangentially-related change in this patch is that TestMerge in
generic_iterators-test.cc was modified to no longer generate duplicate
non-deleted keys for merge testing. Duplicate non-deleted row keys are
no longer supported in the MergeIterator since there is currently no
practical use for that, and it's more efficient not to support them
since at the time of writing it isn't possible for them to appear in a
real tablet.

This patch adds tests at a couple of different levels, including a
VectorIterator-based test that is useful for benchmarking MergeIterator
performance on a simple schema, as well as a higher-level diff scan
test that operates at the Tablet level.

Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
---
M src/kudu/common/generic_iterators-test.cc
M src/kudu/common/generic_iterators.cc
M src/kudu/common/generic_iterators.h
M src/kudu/common/schema.cc
M src/kudu/common/schema.h
M src/kudu/tablet/diff_scan-test.cc
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet.cc
8 files changed, 308 insertions(+), 79 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/12205/9
-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 9
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@441
PS3, Line 441:         // If we have a candidate for smallest row, compare it against the
> Is there a reason to prefer emplace_back() over push_back() in general? I t
My take is that if I use emplace_back(), I don't have to think about whether I can move or copy; it allows both, and without any overhead for copying. So if we consistently use emplace_back(), that's just one less thing to think about. That's why I started pointing it out in my code reviews (for the past year or two).

IIUC, push_back(std::move(foo)) always makes a copy of 'foo' but misleads readers into thinking that 'foo' is being moved. Hence the thought to avoid it in new code.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@479
PS3, Line 479:           // We found the single live instance of the row.
> I don't understand this comment.
I think you can deduplicate the two calls to CopyRow (one here and the other on L453) into one call just after if (!opts_.include_deleted_rows) { ... } else { ... }


http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc@474
PS4, Line 474:       int live_rows_found = 0;
Could you make sure this doesn't trigger a not-used warning in RELEASE builds?


http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc@503
PS4, Line 503:     unordered_set<MergeIterState*> exhausted;
I think this can be a simple vector; aren't we guaranteed that there are no duplicates _within_ a given sub-iterator?

Regardless, I still think it'd be better to avoid this intermediate step and combine the two loops. I get that you want to acquire states_lock_ and do states_.erase() just once, but consider that if there were duplicates for the returned row, the number of duplicates is probably very low, so the overhead of multiple lock acquisition/vector erases may be outweighed by the allocation of the set/vector here.



-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 4
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Thu, 28 Feb 2019 06:11:56 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................


Patch Set 3:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG@19
PS3, Line 19: Because of this new behavior, while the MergeIterator will deduplicate
            : deleted rows if they are included in the result set, the UnionIterator
            : will not deduplicate them and instead return all instances found.
> I might reword this to state that it's not possible to efficiently support 
Done


http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG@22
PS3, Line 22: 
            : One tangentially-related change in this patch is that TestMerge in
            : generic_iterators-test.cc was modified to no longer generate duplicate
            : non-deleted keys for merge testing
> This makes sense given what you're trying to do, but it's a bit of a bummer
I agree, and if there was a lot of value associated with making it continue to work the same way then I would have done it, but there's no customer for it.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators-test.cc
File src/kudu/common/generic_iterators-test.cc:

PS3: 
> Could you add a test case for include_deleted_rows=true that uses these sim
Hmm, I'll look into writing a test case using the VectorIterators. I originally looked into it and it didn't seem straightforward... but I don't recall exactly why right now.

And sure, I'll work on measuring the perf impact of this patch.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators-test.cc@232
PS3, Line 232:       seen.insert(potential);
> Nit: InsertOrDie. Or EmplaceOrDie (if it works).
Done


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.h
File src/kudu/common/generic_iterators.h:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.h@38
PS3, Line 38: schema
> Nit: schemas (since this is a requirement for every sub-iterator).
Done


http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc@350
PS1, Line 350: 
> This can be hoisted out of the outer loop, and maybe used to control whethe
Yes I agree, maybe we can do this in Init()


http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc@355
PS1, Line 355:       remove_if(states_.begin(), states_.end(), [] (const unique_ptr<MergeIterState>& iter) {
> This can overflow the RowBlock. We need to avoid doing that, but also make 
Good point, thanks


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@300
PS3, Line 300:     : opts_(opts),
> std::move(opts)
clang-tidy complains when std::move()ing 'opts' because it's a trivial struct that reduces to a single bool, so that could be a pessimizing move. Once we add another field to the struct, clang-tidy will probably start to complain that we should be move()ing it.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@419
PS3, Line 419:   vector<MergeIterState*> smallest(states_.size());
> Can we micro-optimize the !include_deleted_rows case such that the operatio
I'll work on measuring the impact of these changes and go from there, since it looks to me like there are mostly only ugly options if that is an optimization we need to do.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@432
PS3, Line 432:     for (const auto& iter : states_) {
> There's a non-obvious state machine in here; could you document it?
Done


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@439
PS3, Line 439:         smallest = { iter.get() };
> Is this is a copy assignment? For the common case where there are no duplic
Good observation. I think smallest.resize(1, iter.get()) would be more efficient than whatever this might be doing.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@441
PS3, Line 441:         smallest.push_back(iter.get());
> Nit: emplace_back for new code.
Is there a reason to prefer emplace_back() over push_back() in general? I think it used to be in the GSG but it's since been removed. I find it confusing to conflate push_back() and emplace_back(), since the beauty of emplace is that it acts as an in-place constructor taking constructor arguments, whereas the vec.push_back(ptr) form seems more semantically appropriate in this case, where we are calling the trivial copy constructor of a primitive type. Even if it wasn't a primitive type, it would seem more descriptive and I believe just as efficient to simply call vec.push_back(std::move(foo)) if trying to use the move constructor and avoid temporary objects from being constructed.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@465
PS3, Line 465:           // We found the single live instance of the row.
> In DEBUG builds, should we verify that any remaining instances are ghosts?
Done


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@479
PS3, Line 479:       RETURN_NOT_OK(CopyRow(row_to_return->next_row(), &dst_row, dst->arena()));
> I think this can be factored out of both sides of the if statement if you d
I don't understand this comment.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@487
PS3, Line 487:     // Remove exhausted sub-iterators from the list we merge from.
> Any reason not to combine this loop with the one that advances the sub-iter
Yeah... we can do better here. I'll update this.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@504
PS3, Line 504: 
             :   // Maintain a record of the number of rows actually copied to the destination
             :   // RowBlock.
> I think this is mostly self-explanatory from the code. What would be intere
Good call, done.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@536
PS3, Line 536:   return unique_ptr<RowwiseIterator>(new MergeIterator(opts, std::move(iters)));
> std::move(opts)
see above


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc
File src/kudu/tablet/diff_scan-test.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc@61
PS3, Line 61:   ASSERT_TRUE(order_mode == UNORDERED || order_mode == ORDERED)
            :       << "unknown order mode: " << OrderMode_Name(order_mode);
> Doesn't seem necessary to me; the list of values is just a few lines above.
Done


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/rowset.cc
File src/kudu/tablet/rowset.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/rowset.cc@105
PS3, Line 105:     case ORDERED: {
> Why do you need the new scope here? Same for tablet.cc.
This must have been a spurious addition once needed before a refactor. I'll remove it in both places.



-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 3
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Thu, 28 Feb 2019 02:58:15 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12205

to look at the new patch set (#3).

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................

KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

This patch makes it possible to do an incremental diff scan of an entire
tablet. A follow-up patch will expose this capability to scanners at the
RPC level.

Included is new support in the MergeIterator for deduplicating deleted
rows resulting from when a row with a particular primary key is deleted
from one rowset and reinserted into a different rowset. These duplicates
may be returned when the 'include_deleted_rows' option is enabled in the
row iterator options.

Because of this new behavior, while the MergeIterator will deduplicate
deleted rows if they are included in the result set, the UnionIterator
will not deduplicate them and instead return all instances found.

One tangentially-related change in this patch is that TestMerge in
generic_iterators-test.cc was modified to no longer generate duplicate
non-deleted keys for merge testing. Duplicate non-deleted row keys are
no longer supported in the MergeIterator since there is currently no
practical use for that, and it's more efficient not to support them
since at the time of writing it isn't possible for them to appear in a
real tablet.

Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
---
M src/kudu/common/generic_iterators-test.cc
M src/kudu/common/generic_iterators.cc
M src/kudu/common/generic_iterators.h
M src/kudu/tablet/diff_scan-test.cc
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet.cc
6 files changed, 176 insertions(+), 52 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/12205/3
-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 3
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] WIP: KUDU-2645. tablet: Add ghost row de-duplication support to MergeIterator

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: WIP: KUDU-2645. tablet: Add ghost row de-duplication support to MergeIterator
......................................................................


Patch Set 1:

(2 comments)

Just skimmed this.

http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc@350
PS1, Line 350:     const int kIsDeletedColIndex = schema_.find_first_is_deleted_virtual_column();
This can be hoisted out of the outer loop, and maybe used to control whether the ghost sifting even happens in the first place?


http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc@355
PS1, Line 355:         RowBlockRow dst_row = dst->row(dst_row_idx++);
This can overflow the RowBlock. We need to avoid doing that, but also make sure we don't "consume" a row as part of our "find all rows sharing the same smallest row key" and not return it to the client.



-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 1
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 09 Jan 2019 22:48:57 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12205

to look at the new patch set (#8).

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................

KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

This patch makes it possible to do an incremental diff scan of an entire
tablet. A follow-up patch will expose this capability to scanners at the
RPC level.

Included is new support in the MergeIterator for deduplicating deleted
rows resulting from when a row with a particular primary key is deleted
from one rowset and reinserted into a different rowset. These duplicates
may be returned when the 'include_deleted_rows' option is enabled in the
row iterator options.

Because of this new behavior, while the MergeIterator will deduplicate
deleted rows if they are included in the result set, the UnionIterator
will not deduplicate them and instead return all instances found,
because it's not possible to efficiently support such deduplication
without a merge-like process.

One tangentially-related change in this patch is that TestMerge in
generic_iterators-test.cc was modified to no longer generate duplicate
non-deleted keys for merge testing. Duplicate non-deleted row keys are
no longer supported in the MergeIterator since there is currently no
practical use for that, and it's more efficient not to support them
since at the time of writing it isn't possible for them to appear in a
real tablet.

This patch adds tests at a couple of different levels, including a
VectorIterator-based test that is useful for benchmarking MergeIterator
performance on a simple schema, as well as a higher-level diff scan
test that operates at the Tablet level.

Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
---
M src/kudu/common/generic_iterators-test.cc
M src/kudu/common/generic_iterators.cc
M src/kudu/common/generic_iterators.h
M src/kudu/common/schema.cc
M src/kudu/common/schema.h
M src/kudu/tablet/diff_scan-test.cc
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet.cc
8 files changed, 308 insertions(+), 79 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/12205/8
-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 8
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................


Patch Set 10:

JIRA came back and that flaky test is filed as KUDU-2734


-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 10
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 06 Mar 2019 07:18:21 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12205

to look at the new patch set (#4).

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................

KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

This patch makes it possible to do an incremental diff scan of an entire
tablet. A follow-up patch will expose this capability to scanners at the
RPC level.

Included is new support in the MergeIterator for deduplicating deleted
rows resulting from when a row with a particular primary key is deleted
from one rowset and reinserted into a different rowset. These duplicates
may be returned when the 'include_deleted_rows' option is enabled in the
row iterator options.

Because of this new behavior, while the MergeIterator will deduplicate
deleted rows if they are included in the result set, the UnionIterator
will not deduplicate them and instead return all instances found,
because it's not possible to efficiently support such deduplication
without a merge-like process.

One tangentially-related change in this patch is that TestMerge in
generic_iterators-test.cc was modified to no longer generate duplicate
non-deleted keys for merge testing. Duplicate non-deleted row keys are
no longer supported in the MergeIterator since there is currently no
practical use for that, and it's more efficient not to support them
since at the time of writing it isn't possible for them to appear in a
real tablet.

Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
---
M src/kudu/common/generic_iterators-test.cc
M src/kudu/common/generic_iterators.cc
M src/kudu/common/generic_iterators.h
M src/kudu/tablet/diff_scan-test.cc
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet.cc
6 files changed, 193 insertions(+), 50 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/12205/4
-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 4
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................

KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator

This patch makes it possible to do an incremental diff scan of an entire
tablet. A follow-up patch will expose this capability to scanners at the
RPC level.

Included is new support in the MergeIterator for deduplicating deleted
rows resulting from when a row with a particular primary key is deleted
from one rowset and reinserted into a different rowset. These duplicates
may be returned when the 'include_deleted_rows' option is enabled in the
row iterator options.

Because of this new behavior, while the MergeIterator will deduplicate
deleted rows if they are included in the result set, the UnionIterator
will not deduplicate them and instead return all instances found,
because it's not possible to efficiently support such deduplication
without a merge-like process.

One tangentially-related change in this patch is that TestMerge in
generic_iterators-test.cc was modified to no longer generate duplicate
non-deleted keys for merge testing. Duplicate non-deleted row keys are
no longer supported in the MergeIterator since there is currently no
practical use for that, and it's more efficient not to support them
since at the time of writing it isn't possible for them to appear in a
real tablet.

This patch adds tests at a couple of different levels, including a
VectorIterator-based test that is useful for benchmarking MergeIterator
performance on a simple schema, as well as a higher-level diff scan
test that operates at the Tablet level.

Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Reviewed-on: http://gerrit.cloudera.org:8080/12205
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Mike Percy <mp...@apache.org>
---
M src/kudu/common/generic_iterators-test.cc
M src/kudu/common/generic_iterators.cc
M src/kudu/common/generic_iterators.h
M src/kudu/common/schema.cc
M src/kudu/common/schema.h
M src/kudu/tablet/diff_scan-test.cc
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet.cc
8 files changed, 308 insertions(+), 79 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Mike Percy: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 10
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)