You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Mike Percy (Code Review)" <ge...@cloudera.org> on 2019/03/01 23:54:03 UTC
[kudu-CR] KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )
Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator
......................................................................
Patch Set 4:
(5 comments)
http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:
http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@439
PS3, Line 439: int cmp;
> Good observation. I think smallest.resize(1, iter.get()) would be more effi
Oops, that resize() doesn't work that way when shrinking the vector, but smallest.clear(); smallest.emplace_back() looks to be fine.
http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@441
PS3, Line 441: // If we have a candidate for smallest row, compare it against the
> My take is that if I use emplace_back(), I don't have to think about whethe
Well, I ran a little test and indeed emplace_back() is faster.
#include <iostream>
#include <string>
#include <vector>
using std::cerr;
using std::endl;
using std::string;
using std::vector;
void test_emplace_back(vector<int>& vec, int count) {
for (int i = 0; i < count; i++) {
vec.emplace_back(i);
}
}
void test_push_back(vector<int>& vec, int count) {
for (int i = 0; i < count; i++) {
vec.push_back(i);
}
}
int main(int argc, const char** argv) {
if (argc == 1) {
cerr << "Usage: " << argv[0] << " emplace_back|push_back" << endl;
return 1;
}
string arg = argv[1];
constexpr int kCount = 100 * 1000 * 1000;
vector<int> vec;
vec.reserve(kCount);
if (arg == "emplace_back") {
test_emplace_back(vec, kCount);
return 0;
}
if (arg == "push_back") {
test_push_back(vec, kCount);
return 0;
}
return 1;
}
Compiled by g++ with -O3 and -std=c++11, I get the following runtime numbers (which are roughly representative):
$ time ./emplace-test emplace_back
real 0m0.192s
user 0m0.093s
sys 0m0.100s
$ time ./emplace-test push_back
real 0m0.280s
user 0m0.176s
sys 0m0.104s
http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@479
PS3, Line 479: // We found the single live instance of the row.
> I think you can deduplicate the two calls to CopyRow (one here and the othe
Gotcha, done.
http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:
http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc@474
PS4, Line 474: int live_rows_found = 0;
> Could you make sure this doesn't trigger a not-used warning in RELEASE buil
Done
http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc@503
PS4, Line 503: unordered_set<MergeIterState*> exhausted;
> I think this can be a simple vector; aren't we guaranteed that there are no
The point of using the set data structure is to get O(1) lookup within the if-clause in std::remove_if()
I'd be surprised if the allocation overhead were significant here because tcmalloc should be using thread-local storage for these relatively small STL data structures.
So I just ran TestMerge 20 times on the current patch with --num-lists 100, and on a version of it after reverting the bulk of the changes in generic_iterators.cc back to the existing master branch code, and the performance was essentially identical for the "real" time measurement, in seconds: existing code = min 0.118, median 0.126, max 0.147; new code = min 0.118, median 0.125, max 0.143
It appears that the additional tcmalloc (non-central freelist) allocation overhead for the vector and unordered_set elements are in the noise compared to the merge process itself, which is fairly CPU intensive.
--
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 4
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Fri, 01 Mar 2019 23:54:03 +0000
Gerrit-HasComments: Yes