You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Mike Percy (Code Review)" <ge...@cloudera.org> on 2016/11/10 23:43:54 UTC

[kudu-CR] tablet copy: Make the StartTabletCopy() RPC async

Hello Todd Lipcon,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/5045

to review the following change.

Change subject: tablet copy: Make the StartTabletCopy() RPC async
......................................................................

tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on the open_tablet thread
pool. This clears up threads on the Consensus service pool for other
tasks.

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/tserver/CMakeLists.txt
A src/kudu/tserver/tablet_copy_client_session.cc
A src/kudu/tserver/tablet_copy_client_session.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
5 files changed, 189 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 6: Code-Review+1

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5045/6/src/kudu/integration-tests/tablet_copy-itest.cc
File src/kudu/integration-tests/tablet_copy-itest.cc:

PS6, Line 949: StartTabletCopy
I wonder if multithreading this and running in parallel will make the probability of making num_service_unavailable > 0 high.


http://gerrit.cloudera.org:8080/#/c/5045/6/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 412: #define CALLBACK_AND_RETURN(status) \
Do we want to include error_code as a macro argument ? Otherwise, IMHO it's bit confusing to restrict the caller to have the variable named 'error_code' from each callsite and could be bug-prone too.


PS6, Line 529: cb(Status::OK(), TabletServerErrorPB::UNKNOWN_ERROR)
Could we introduce a TabletServerErrorPB::OK to go along with such calls ? This being the success case, does this expect the caller to ignore the error_code ?


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5045/4/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS4, Line 540: copy_source_uuid
this is causing a crash because it's a reference to a field in 'req', which is freed by the invocation of 'cb' above.


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5045

to look at the new patch set (#5).

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................

KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on its own thread pool.
This clears up threads on the Consensus service pool for other tasks.

A new ThreadPool called tablet_copy_pool_ was added to TSTabletManager
to run TabletCopy operations. Its max_threads parameter is tunable with
a new gflag and it has its max_queue_size hard-coded to 0 in order to
provide backpressure when it doesn't have capacity to immediately copy
new tablets.

A test was added in tablet_copy-itest that checks that the backpressure
mechanism is working such that  submitting too many StartTabletCopy()
requests at one time results in a ServiceUnavailable error.

After making these changes, I looped tablet_copy-itest 300 times on TSAN
and RELEASE modes and there were no failures:

* http://dist-test.cloudera.org/job?job_id=mpercy.1480539848.25457
* http://dist-test.cloudera.org/job?job_id=mpercy.1480540482.11408

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tserver/tablet_peer_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 254 insertions(+), 76 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/5
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 4:

(3 comments)

Just passing through and took an interest.

If I'm understanding this correctly, the semantics of the StartTabletCopy RPC remain unchanged: remote callers will only receive a response when the copy is finished. Is that right?

http://gerrit.cloudera.org:8080/#/c/5045/4/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS4, Line 383: nullptr
Shouldn't need this; the constructor is guaranteed to set it.


PS4, Line 532:   cb = [](const Status&, TabletServerErrorPB::Code) {
             :     LOG(FATAL) << "StartTabletCopy() callback invoked twice";
             :   };
Isn't 'cb' a local copy? How does this have an effect?


PS4, Line 548: We run this asynchronously.
No longer true.


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/5045/2/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 320: 
> spurious \n ?
Done


PS2, Line 436: success_callback
> I am curious why we needed a different callback for success ? One callback 
I reworked this


Line 438:     this->LogAndTombstone(
> the old comment on line 450-452 seems to be not respected here -- if we fai
Reworked this and fixed this issue


PS2, Line 444: open_tablet_pool_
> Interesting, just thinking out loud here about possible outcomes of using t
Moved to its own thread pool in order to have a zero-size queue on the pool.


PS2, Line 446: TOMBSTONE_NOT_OK
> There is nothing to tombstone at this point right ? We could as well leave 
It's a moot point now but StartAsync() could have failed if the thread pool was stopped or full, so we should tombstone in that case.


http://gerrit.cloudera.org:8080/#/c/5045/2/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

PS2, Line 328: tablet_copy_sessions_
> Good idea, we could eventually connect this to metrics or tools to show the
That is a good idea. I ended up peeling this out because it made this patch more complicated. I plan to put this back in in a more lightweight way, just to allow for cancellation and status notification, as well as metrics. However, I don't think those things are as important as this basic functionality or some other things I want to work on.


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 12:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/5045/12/src/kudu/integration-tests/tablet_copy_client_session-itest.cc
File src/kudu/integration-tests/tablet_copy_client_session-itest.cc:

PS12, Line 145: kBlank,
              :     kTombstoned,
              :     k
a description of these two scenarios would be useful. In paricular I'm not really sure what the 'blank' scenario means.


PS12, Line 150: // Initialize state.
              :     TabletDataState delete_type = TabletDataState::TABLET_DATA_UNKNOWN;
              :     switch (scenario) {
              :       case kBlank:
              :         break;
              :       case kTombstoned:
              :         delete_type = TabletDataState::TABLET_DATA_TOMBSTONED;
              :         break;
              :       default:
              :         LOG(FATAL) << "Unsupported Scenario";
              :     }
              :     if (delete_type != TabletDataState::TABLET_DATA_UNKNOWN) {
              :       NO_FATALS(DeleteTabletWithRetries(ts1, tablet_id, delete_type, boost::none, kTimeout));
              :     }
instead of all of this, couldn't you just do:

if (scenario == kTombstoned) {
  NO_FATALS(DeleteTabletWithRetries(... TABLET_DATA_TOMBSTONED);
}


http://gerrit.cloudera.org:8080/#/c/5045/12/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 323:     Status _s = (expr); /* NOLINT */ \
why not do 'const Status& _s' here? then lint wouldn't complain, right?


Line 373:   virtual void Cancel() {
hrm, is this overriding a Runnable method? surprised to see it virtual but not override. Maybe naming this something like DontCallCallback would be more clear that it's not something inherent to the Runnable interface


Line 820:   // TODO(mpercy): Cancel all outstanding tablet copy tasks.
hrm, is this an important todo? sounds important, so maybe there should be a JIRA?


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Dinesh Bhat, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5045

to look at the new patch set (#8).

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................

KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on its own thread pool.
This clears up threads on the Consensus service pool for other tasks.

A new ThreadPool called tablet_copy_pool_ was added to TSTabletManager
to run TabletCopy operations. Its max_threads parameter is tunable with
a new gflag and it has its max_queue_size hard-coded to 0 in order to
provide backpressure when it doesn't have capacity to immediately copy
new tablets.

This patch changes the semantics of StartTabletCopy() to return as soon
as the tablet copy has started -- it no longer waits until the process
is completed to return. Clients can follow the progress of the tablet
copy process using the ListTablets() RPC call and waiting for the tablet
to be in a RUNNING state.

A test was added in tablet_copy-itest that checks that the backpressure
mechanism is working such that submitting too many StartTabletCopy()
requests at one time results in a ServiceUnavailable error.

Some additional tests were added in tablet_copy_client_session-itest to
improve test coverage of the StartTabletCopy() code path.

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/integration-tests/tablet_copy_client_session-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tserver/tablet_copy_service.cc
M src/kudu/tserver/tablet_peer_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
10 files changed, 348 insertions(+), 87 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Dinesh Bhat, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5045

to look at the new patch set (#9).

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................

KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on its own thread pool.
This clears up threads on the Consensus service pool for other tasks.

A new ThreadPool called tablet_copy_pool_ was added to TSTabletManager
to run TabletCopy operations. Its max_threads parameter is tunable with
a new gflag and it has its max_queue_size hard-coded to 0 in order to
provide backpressure when it doesn't have capacity to immediately copy
new tablets.

This patch changes the semantics of StartTabletCopy() to return as soon
as the tablet copy has started -- it no longer waits until the process
is completed to return. Clients can follow the progress of the tablet
copy process using the ListTablets() RPC call and waiting for the tablet
to be in a RUNNING state.

A test was added in tablet_copy-itest that checks that the backpressure
mechanism is working such that submitting too many StartTabletCopy()
requests at one time results in a ServiceUnavailable error.

Some additional tests were added in tablet_copy_client_session-itest to
improve test coverage of the StartTabletCopy() code path.

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/integration-tests/tablet_copy_client_session-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tserver/tablet_copy_service.cc
M src/kudu/tserver/tablet_peer_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
10 files changed, 349 insertions(+), 88 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/9
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 7:

> 1) There's some way to wait for the tablet copy to finish, at least
> for tests, even if it's super hacky.

Yeah, you can use the ListTablets API, not that bad.

> 2) It's documented in your commit message that the semantics of
> StartTabletCopy are changing.

Will do.

-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has uploaded a new patch set (#2).

Change subject: tablet copy: Make the StartTabletCopy() RPC async
......................................................................

tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on the open_tablet thread
pool. This clears up threads on the Consensus service pool for other
tasks.

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/tserver/CMakeLists.txt
A src/kudu/tserver/tablet_copy_client_session.cc
A src/kudu/tserver/tablet_copy_client_session.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
5 files changed, 195 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] tablet copy: Make the StartTabletCopy() RPC async

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 2:

(7 comments)

> I think we need to run the whole process on the thread pool because if there is a long delay between opening the tablet bootstrap session and finishing it, the session could time out on the remote and the process would fail.

If we used an entirely separate pool, we could set its queue length to 0 such that submitting the task would fail if there are no free slots, and thus avoid the issue. Or, use the same pool but check the queue length before submission (racy but good enough?)

That said, I suppose that having the queue might be nice so that we can see that the requests are pending. Relevant to that, can you set the status listener message to indicate that it's waiting on the queue during this time, so that it's easier to know what's going on with the tablet?

http://gerrit.cloudera.org:8080/#/c/5045/2/src/kudu/tserver/tablet_copy_client_session.h
File src/kudu/tserver/tablet_copy_client_session.h:

Line 37: class TabletCopyClientSession {
doc.


Line 40:     ThreadPool* pool_,
nit: no '_' here


Line 43:     std::function<void()> success_callback,
for the callbacks, please specify which threadpool they'll be run on.


PS2, Line 43:     std::function<void()> success_callback,
            :     std::function<void(const std::string& msg, const Status& s)> failure_callback);
agree with Dinesh that a single status callback seems easier to manage. What's the purpose of the separate 'msg' aside from the status's message?


Line 48:   Status StartAsync();
doc when this would return a bad status


PS2, Line 48:   Status StartAsync();
            :   void DoCopy();
docs. Should this be private?


http://gerrit.cloudera.org:8080/#/c/5045/2/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 438:     this->LogAndTombstone(
the old comment on line 450-452 seems to be not respected here -- if we failed to open but successfully copied, we don't need to tombstone it, right?


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Dinesh Bhat, Todd Lipcon,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5045

to look at the new patch set (#13).

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................

KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on its own thread pool.
This clears up threads on the Consensus service pool for other tasks.

A new ThreadPool called tablet_copy_pool_ was added to TSTabletManager
to run TabletCopy operations. Its max_threads parameter is tunable with
a new gflag and it has its max_queue_size hard-coded to 0 in order to
provide backpressure when it doesn't have capacity to immediately copy
new tablets.

This patch changes the semantics of StartTabletCopy() to return as soon
as the tablet copy has started -- it no longer waits until the process
is completed to return. Clients can follow the progress of the tablet
copy process using the ListTablets() RPC call and waiting for the tablet
to be in a RUNNING state.

A test was added in tablet_copy-itest that checks that the backpressure
mechanism is working such that submitting too many StartTabletCopy()
requests at one time results in a ServiceUnavailable error.

Some additional tests were added in tablet_copy_client_session-itest to
improve test coverage of the StartTabletCopy() code path.

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/integration-tests/tablet_copy_client_session-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tserver/tablet_copy_service.cc
M src/kudu/tserver/tablet_peer_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
10 files changed, 341 insertions(+), 89 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/13
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Dinesh Bhat, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5045

to look at the new patch set (#11).

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................

KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on its own thread pool.
This clears up threads on the Consensus service pool for other tasks.

A new ThreadPool called tablet_copy_pool_ was added to TSTabletManager
to run TabletCopy operations. Its max_threads parameter is tunable with
a new gflag and it has its max_queue_size hard-coded to 0 in order to
provide backpressure when it doesn't have capacity to immediately copy
new tablets.

This patch changes the semantics of StartTabletCopy() to return as soon
as the tablet copy has started -- it no longer waits until the process
is completed to return. Clients can follow the progress of the tablet
copy process using the ListTablets() RPC call and waiting for the tablet
to be in a RUNNING state.

A test was added in tablet_copy-itest that checks that the backpressure
mechanism is working such that submitting too many StartTabletCopy()
requests at one time results in a ServiceUnavailable error.

Some additional tests were added in tablet_copy_client_session-itest to
improve test coverage of the StartTabletCopy() code path.

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/integration-tests/tablet_copy_client_session-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tserver/tablet_copy_service.cc
M src/kudu/tserver/tablet_peer_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
10 files changed, 356 insertions(+), 91 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/11
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 6:

> On second look I see what you mean; the moment we invoke the
 > callback, we respond to the RPC.
 > 
 > Why do this, though? Why not hold off responding until the very end
 > of the tablet copy? Either way the copy is being performed on the
 > new thread pool, so it shouldn't be a matter of resource usage,
 > right?

I'm going to take your advice and do this. I think it will also solve the test failure issue.

-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 4:

(4 comments)

> If I'm understanding this correctly, the semantics of the
> StartTabletCopy RPC remain unchanged: remote callers will only
> receive a response when the copy is finished. Is that right?

No: callers will receive a response once the process is started, but will not be notified on completion. It would be good to add APIs to get for status / completion, as well as enable cancellation, but I think this is more important than those other features.

http://gerrit.cloudera.org:8080/#/c/5045/4/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS4, Line 383: nullptr
> Shouldn't need this; the constructor is guaranteed to set it.
Done


PS4, Line 532:   cb = [](const Status&, TabletServerErrorPB::Code) {
             :     LOG(FATAL) << "StartTabletCopy() callback invoked twice";
             :   };
> Isn't 'cb' a local copy? How does this have an effect?
It would have an effect if someone accidentally used one of the macros that invokes it, like CALLBACK_RETURN_NOT_OK().


PS4, Line 540: copy_source_uuid
> this is causing a crash because it's a reference to a field in 'req', which
Ah, thanks.


PS4, Line 548: We run this asynchronously.
> No longer true.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 6:

> > If I'm understanding this correctly, the semantics of the
 > > StartTabletCopy RPC remain unchanged: remote callers will only
 > > receive a response when the copy is finished. Is that right?
 > 
 > No: callers will receive a response once the process is started,
 > but will not be notified on completion. It would be good to add
 > APIs to get for status / completion, as well as enable
 > cancellation, but I think this is more important than those other
 > features.

On second look I see what you mean; the moment we invoke the callback, we respond to the RPC.

Why do this, though? Why not hold off responding until the very end of the tablet copy? Either way the copy is being performed on the new thread pool, so it shouldn't be a matter of resource usage, right?

-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 2:

This needs a couple of changes, I think:

1. Increase the minimum number of threads on the open_tablet_pool_ or possibly use a different pool. I was thinking 4 threads minimum, maybe 4 + # disks

2. I think we need to run the whole process on the thread pool because if there is a long delay between opening the tablet bootstrap session and finishing it, the session could time out on the remote and the process would fail.

-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5045/6/src/kudu/integration-tests/tablet_copy-itest.cc
File src/kudu/integration-tests/tablet_copy-itest.cc:

PS6, Line 949: StartTabletCopy
> I wonder if multithreading this and running in parallel will make the proba
I thought about it, but I looped this 300 times and it never failed so this seems to be sufficient.


http://gerrit.cloudera.org:8080/#/c/5045/6/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 412: #define CALLBACK_AND_RETURN(status) \
> Do we want to include error_code as a macro argument ? Otherwise, IMHO it's
If we do that, we also need to include cb too. I just thought it looked messy. But if you feel strongly about it, I'll do it.


PS6, Line 529: cb(Status::OK(), TabletServerErrorPB::UNKNOWN_ERROR)
> Could we introduce a TabletServerErrorPB::OK to go along with such calls ? 
In the case of Status::OK(), the TabletServerErrorPB::Code is ignored in TabletServiceImpl, and ends up not being set in the response RPC.


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5045/6/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS6, Line 62: 10
> Was this number picked considering a balanced (not overloaded) node may not
Todd made the case, which I thought made sense, that there was no obvious answer so an arbitrary choice was better than a heuristic. So I just picked 10. I think you probably want at least 5 or 6, even if you don't have that many CPUs. We'll probably end up changing the default again later if it seems like this wasn't a great choice.


Line 412: #define CALLBACK_AND_RETURN(status) \
> Yeah, my concern was a) it obligates the caller to follow inflexible naming
Yeah, it's only used right here as syntactic sugar.


PS6, Line 824: Cancel all outstanding tablet copy tasks
> Am I correct in understanding that current code will wait till all in-progr
Yeah, canceling would basically tombstone in-progress tablet copies and allow the thread pool to shut down immediately.


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 7:

> > I'm going to take your advice and do this. I think it will also
 > > solve the test failure issue.
 > 
 > OK I am going back and forth on this but I think it's better to
 > return immediately because we get RPC timeouts otherwise, which
 > clutter up the logs.

I don't have a strong opinion as long as:
1) There's some way to wait for the tablet copy to finish, at least for tests, even if it's super hacky.
2) It's documented in your commit message that the semantics of StartTabletCopy are changing.

-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 6:

> > On second look I see what you mean; the moment we invoke the
 > > callback, we respond to the RPC.
 > >
 > > Why do this, though? Why not hold off responding until the very
 > end
 > > of the tablet copy? Either way the copy is being performed on the
 > > new thread pool, so it shouldn't be a matter of resource usage,
 > > right?
 > 
 > I'm going to take your advice and do this. I think it will also
 > solve the test failure issue.

OK I am going back and forth on this but I think it's better to return immediately because we get RPC timeouts otherwise, which clutter up the logs.

-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5045

to look at the new patch set (#4).

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................

KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on its own thread pool.
This clears up threads on the Consensus service pool for other tasks.

A new ThreadPool called tablet_copy_pool_ was added to TSTabletManager
to run TabletCopy operations. Its max_threads parameter is tunable with
a new gflag and it has its max_queue_size hard-coded to 0 in order to
provide backpressure when it doesn't have capacity to immediately copy
new tablets.

A test was added in tablet_copy-itest that checks that the backpressure
mechanism is working such that  submitting too many StartTabletCopy()
requests at one time results in a ServiceUnavailable error.

After making these changes, I looped tablet_copy-itest 300 times on TSAN
and RELEASE modes and there were no failures:

* http://dist-test.cloudera.org/job?job_id=mpercy.1480539848.25457
* http://dist-test.cloudera.org/job?job_id=mpercy.1480540482.11408

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tserver/tablet_peer_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 246 insertions(+), 67 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Dinesh Bhat, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5045

to look at the new patch set (#10).

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................

KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on its own thread pool.
This clears up threads on the Consensus service pool for other tasks.

A new ThreadPool called tablet_copy_pool_ was added to TSTabletManager
to run TabletCopy operations. Its max_threads parameter is tunable with
a new gflag and it has its max_queue_size hard-coded to 0 in order to
provide backpressure when it doesn't have capacity to immediately copy
new tablets.

This patch changes the semantics of StartTabletCopy() to return as soon
as the tablet copy has started -- it no longer waits until the process
is completed to return. Clients can follow the progress of the tablet
copy process using the ListTablets() RPC call and waiting for the tablet
to be in a RUNNING state.

A test was added in tablet_copy-itest that checks that the backpressure
mechanism is working such that submitting too many StartTabletCopy()
requests at one time results in a ServiceUnavailable error.

Some additional tests were added in tablet_copy_client_session-itest to
improve test coverage of the StartTabletCopy() code path.

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/integration-tests/tablet_copy_client_session-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tserver/tablet_copy_service.cc
M src/kudu/tserver/tablet_peer_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
10 files changed, 356 insertions(+), 91 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/10
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 7: Code-Review-1

I posted this for a test run / sanity check but I need to work on better coverage for this codepath.

-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 7:

> I don't have a strong opinion as long as:
 > 1) There's some way to wait for the tablet copy to finish, at least
 > for tests, even if it's super hacky.
 > 2) It's documented in your commit message that the semantics of
 > StartTabletCopy are changing.

Done

-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Dinesh Bhat, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5045

to look at the new patch set (#3).

Change subject: tablet copy: Make the StartTabletCopy() RPC async
......................................................................

tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on its own thread pool.
This clears up threads on the Consensus service pool for other tasks.

A new ThreadPool called tablet_copy_pool_ was added to TSTabletManager
to run TabletCopy operations. Its max_threads parameter is tunable with
a new gflag and it has its max_queue_size hard-coded to 0 in order to
provide backpressure when it doesn't have capacity to immediately copy
new tablets.

A test was added in tablet_copy-itest that checks that the backpressure
mechanism is working such that  submitting too many StartTabletCopy()
requests at one time results in a ServiceUnavailable error.

After making these changes, I looped tablet_copy-itest 300 times on TSAN
and RELEASE modes and there were no failures:

* http://dist-test.cloudera.org/job?job_id=mpercy.1480539848.25457
* http://dist-test.cloudera.org/job?job_id=mpercy.1480540482.11408

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tserver/tablet_peer_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 242 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 4: Verified-1

i'm trying this out on a cluster that's experiencing the issue, and it crashed a bunch of tservers in StartTabletCopy. investigating now with an ASAN build, but looks like it needs some more work

-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 12:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/5045/12/src/kudu/integration-tests/tablet_copy_client_session-itest.cc
File src/kudu/integration-tests/tablet_copy_client_session-itest.cc:

PS12, Line 145: kBlank,
              :     kTombstoned,
              :     k
> a description of these two scenarios would be useful. In paricular I'm not 
Done


PS12, Line 150: // Initialize state.
              :     TabletDataState delete_type = TabletDataState::TABLET_DATA_UNKNOWN;
              :     switch (scenario) {
              :       case kBlank:
              :         break;
              :       case kTombstoned:
              :         delete_type = TabletDataState::TABLET_DATA_TOMBSTONED;
              :         break;
              :       default:
              :         LOG(FATAL) << "Unsupported Scenario";
              :     }
              :     if (delete_type != TabletDataState::TABLET_DATA_UNKNOWN) {
              :       NO_FATALS(DeleteTabletWithRetries(ts1, tablet_id, delete_type, boost::none, kTimeout));
              :     }
> instead of all of this, couldn't you just do:
Yeah, I had originally intended to put more scenarios in there. I'll simplify this.


http://gerrit.cloudera.org:8080/#/c/5045/12/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 323:     Status _s = (expr); /* NOLINT */ \
> why not do 'const Status& _s' here? then lint wouldn't complain, right?
oh... good idea :)


Line 373:   virtual void Cancel() {
> hrm, is this overriding a Runnable method? surprised to see it virtual but 
Removed virtual and renamed to DisableCallback()


Line 820:   // TODO(mpercy): Cancel all outstanding tablet copy tasks.
> hrm, is this an important todo? sounds important, so maybe there should be 
Filed KUDU-1795.


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on its own thread pool.
This clears up threads on the Consensus service pool for other tasks.

A new ThreadPool called tablet_copy_pool_ was added to TSTabletManager
to run TabletCopy operations. Its max_threads parameter is tunable with
a new gflag and it has its max_queue_size hard-coded to 0 in order to
provide backpressure when it doesn't have capacity to immediately copy
new tablets.

This patch changes the semantics of StartTabletCopy() to return as soon
as the tablet copy has started -- it no longer waits until the process
is completed to return. Clients can follow the progress of the tablet
copy process using the ListTablets() RPC call and waiting for the tablet
to be in a RUNNING state.

A test was added in tablet_copy-itest that checks that the backpressure
mechanism is working such that submitting too many StartTabletCopy()
requests at one time results in a ServiceUnavailable error.

Some additional tests were added in tablet_copy_client_session-itest to
improve test coverage of the StartTabletCopy() code path.

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Reviewed-on: http://gerrit.cloudera.org:8080/5045
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>
---
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/integration-tests/tablet_copy_client_session-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tserver/tablet_copy_service.cc
M src/kudu/tserver/tablet_peer_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
10 files changed, 341 insertions(+), 89 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 12: Verified+1

Apparently log-rolling-itest is flaky.

-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Dinesh Bhat, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5045

to look at the new patch set (#7).

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................

KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on its own thread pool.
This clears up threads on the Consensus service pool for other tasks.

A new ThreadPool called tablet_copy_pool_ was added to TSTabletManager
to run TabletCopy operations. Its max_threads parameter is tunable with
a new gflag and it has its max_queue_size hard-coded to 0 in order to
provide backpressure when it doesn't have capacity to immediately copy
new tablets.

A test was added in tablet_copy-itest that checks that the backpressure
mechanism is working such that submitting too many StartTabletCopy()
requests at one time results in a ServiceUnavailable error.

After making these changes, I looped tablet_copy-itest 300 times on TSAN
and RELEASE modes and there were no failures:

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tserver/tablet_peer_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 258 insertions(+), 80 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5045

to look at the new patch set (#6).

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................

KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

This patch changes tablet copy to execute on its own thread pool.
This clears up threads on the Consensus service pool for other tasks.

A new ThreadPool called tablet_copy_pool_ was added to TSTabletManager
to run TabletCopy operations. Its max_threads parameter is tunable with
a new gflag and it has its max_queue_size hard-coded to 0 in order to
provide backpressure when it doesn't have capacity to immediately copy
new tablets.

A test was added in tablet_copy-itest that checks that the backpressure
mechanism is working such that  submitting too many StartTabletCopy()
requests at one time results in a ServiceUnavailable error.

After making these changes, I looped tablet_copy-itest 300 times on TSAN
and RELEASE modes and there were no failures:

* http://dist-test.cloudera.org/job?job_id=mpercy.1480539848.25457
* http://dist-test.cloudera.org/job?job_id=mpercy.1480540482.11408

Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
---
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tserver/tablet_peer_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 254 insertions(+), 76 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/5045/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] tablet copy: Make the StartTabletCopy() RPC async

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/5045/2/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 320: 
spurious \n ?


PS2, Line 436: success_callback
I am curious why we needed a different callback for success ? One callback can have 
if (status == success ) 
  OpenTablet()
else 
  Tombstone()

?


PS2, Line 444: open_tablet_pool_
Interesting, just thinking out loud here about possible outcomes of using the same threadpool as that of OpenTablet(): Given that open_tablet_pool_ contains as many max threads as there are tablets to open during server bring up, tablet copy could contend with OpenTablet() threads if consensus for these tablets starts up in the context of OpenTablet(). i.e, leader could get to know about follower before OpenTablet thread finishes and starts a tablet copy session (because it is tombstoned or whatever reason) and now there is a possibility that there are no threads available to start async tablet copy ? Or am I just hallucinating ?


PS2, Line 446: TOMBSTONE_NOT_OK
There is nothing to tombstone at this point right ? We could as well leave the tablet as it is since it hasn't changed its state yet.


http://gerrit.cloudera.org:8080/#/c/5045/2/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

PS2, Line 328: tablet_copy_sessions_
Good idea, we could eventually connect this to metrics or tools to show the number of tablets being tablet copied, right ?


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 13: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-921. tablet copy: Make the StartTabletCopy() RPC async

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: KUDU-921. tablet copy: Make the StartTabletCopy() RPC async
......................................................................


Patch Set 6:

(3 comments)

LGTM, couple of nits.

http://gerrit.cloudera.org:8080/#/c/5045/6/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS6, Line 62: 10
Was this number picked considering a balanced (not overloaded) node may not have more than 20-30 tablets ? Why not use NumCPUS() instead ?


Line 412: #define CALLBACK_AND_RETURN(status) \
> If we do that, we also need to include cb too. I just thought it looked mes
Yeah, my concern was a) it obligates the caller to follow inflexible naming conventions and also its non-intuitive b) this leaves the programmer making mistakes like not setting the error_code before this macro (well, that room exists even when we include it as macro arg). If this MACRO isn't used outside of this context, I don't feel strongly about changing it.


PS6, Line 824: Cancel all outstanding tablet copy tasks
Am I correct in understanding that current code will wait till all in-progress copy threads complete vs the cancelling will pre-empt those threads and help with eagerly cleanups ?


-- 
To view, visit http://gerrit.cloudera.org:8080/5045
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I95c63f2bfd67624844447862efbdba9cb3676112
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes