You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Dan Burkert (Code Review)" <ge...@cloudera.org> on 2017/05/19 02:06:10 UTC

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Hello Mike Percy, Adar Dembo, Todd Lipcon,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/6925

to review the following change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................

KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/tserver/ts_tablet_manager.cc
1 file changed, 18 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#8).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................

KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover the
tablet copy threadpool full path. Without the changes outlined before it
fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
8 files changed, 139 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Hello Hao Hao, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#10).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................

KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover the
tablet copy threadpool full path. Without the changes outlined before it
fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
8 files changed, 152 insertions(+), 62 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/10
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

Line 1067:       // Skip calling SetupErrorAndRespond since this path doesn't need the
> Yah, this shouldn't have an effect on the actual cluster dynamics.  It's us
Check out the 'Advanced per-instance throttling' section of util/logging.h -- it should do what you need


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Hao Hao (Code Review)" <ge...@cloudera.org>.
Hao Hao has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 8: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 1:

Still need to test this on a live cluster, will update with results once that's done.

-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 3: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/integration-tests/cluster_itest_util.cc
File src/kudu/integration-tests/cluster_itest_util.cc:

PS3, Line 941:     CHECK(resp.error().has_code()) << "Tablet copy error response has no code";
             :     CHECK(tserver::TabletServerErrorPB::Code_IsValid(resp.error().code()))
             :         << "Tablet copy error response code is not valid";
could we hit these CHECKs in a rolling upgrade scenario? we don't officially recommend rolling upgrade, but I think it would be bad to hard crash


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 10: Code-Review+1

LGTM if David's good with it.

-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

Line 1067:       // Skip calling SetupErrorAndRespond since this path doesn't need the
> Check out the 'Advanced per-instance throttling' section of util/logging.h 
OK so after trying this out on a cluster, I think we should allow it to log every time.  To balance this out, I think we should downgrade the 'tablet x needs tablet copy' message.  The net result is that we're logging the begin tablet copy result instead of the fact that we'll be requesting it.  For reference, here's a cross section of these logs for a particular tablet:

I0522 17:09:38.703362 15818 consensus_queue.cc:395] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: Peer cc32936bc8594948a04fd4240da36aed needs tablet copy
W0522 17:09:38.703636  4776 consensus_peers.cc:352] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to begin Tablet Copy on peer: error { code: THROTTLED status { code: SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 0/0 tasks queued)" } }
I0522 17:09:40.211633 15820 consensus_queue.cc:395] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: Peer cc32936bc8594948a04fd4240da36aed needs tablet copy
W0522 17:09:40.211971  4776 consensus_peers.cc:352] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to begin Tablet Copy on peer: error { code: THROTTLED status { code: SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 0/0 tasks queued)" } }
I0522 17:09:41.703528 11794 consensus_queue.cc:395] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: Peer cc32936bc8594948a04fd4240da36aed needs tablet copy
W0522 17:09:41.703760  4776 consensus_peers.cc:352] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to begin Tablet Copy on peer: error { code: THROTTLED status { code: SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 0/0 tasks queued)" } }


On clusters approaching normalcy, I wouldn't expect to see these logs much at all.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS8, Line 405:     if (t) {
             :       transition = *t;
             :     }
             :   }
             :   if (transition) {
             :     cb(Status::IllegalState(
             :    
> To the header?  The lock is an internal implementation detail of TSTabletMa
no, just above this if. took me a couple secs to understand why this wasn't done in the block above


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6925/3//COMMIT_MSG
Commit Message:

PS3, Line 22: to to
Double to


http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 336:   if (!controller_.status().ok() || !tc_response_.has_error()) {
This is, uh, confusing to say the least. Can you pull up the ALREADY_INPROGRESS comment and write something more comprehensive explaining what's going on here and why?


http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

Line 1067:       // Skip calling SetupErrorAndRespond since this path doesn't need the
This doesn't actually affect tablet copy operations, right? If SetupErrorAndRespond() is called, it'd convert ServiceUnavailable+THROTTLED into Rpc::ERROR_SERVER_TOO_BUSY but as far as I can tell that has no real effect on Peer::ProcessTabletCopyResponse apart from preventing a noisy LOG(WARNING).

Do we actually want that noisy log?


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 8:

I think this is ready to be reviewed, sorry for the churn.

-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6925/5/src/kudu/integration-tests/cluster_itest_util.h
File src/kudu/integration-tests/cluster_itest_util.h:

Line 337:                     const tablet::TabletDataState delete_type,
> warning: parameter 'delete_type' is const-qualified in the function declara
Done


http://gerrit.cloudera.org:8080/#/c/6925/5/src/kudu/integration-tests/tablet_copy-itest.cc
File src/kudu/integration-tests/tablet_copy-itest.cc:

Line 66: using std::pair;
> warning: using decl 'pair' is unused [misc-unused-using-decls]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/1/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 392:   string* transition;
This isn't thread safe.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#7).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................

KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover the
tablet copy threadpool full path. Without the changes outlined before it
fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
8 files changed, 138 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#6).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................

KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover this
specific regression. Without the changes outlined before it fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
6 files changed, 124 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/6
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/integration-tests/cluster_itest_util.cc
File src/kudu/integration-tests/cluster_itest_util.cc:

PS3, Line 941:     CHECK(resp.error().has_code()) << "Tablet copy error response has no code";
             :     CHECK(tserver::TabletServerErrorPB::Code_IsValid(resp.error().code()))
             :         << "Tablet copy error response code is not valid";
> My understanding is that this is test-only code.
oh, duh, sorry... missed the context here.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 2:

do we have any way to write an integration test for this, even if it's relatively long-running (eg a minute or two?) Given dist-test, adding another test that takes 1-2 minutes shouldn't extend our precommit wait times any, and we're clearly lacking coverage for these types of scenarios

-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/1/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 392:   string* transition;
> This isn't thread safe.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Hello Hao Hao, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#9).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................

KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover the
tablet copy threadpool full path. Without the changes outlined before it
fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
8 files changed, 151 insertions(+), 62 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/9
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 10: Code-Review+2

lgtm

-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS8, Line 345: controller_.status().ok() &&
             :       (!tc_response_.has_error() ||
             :        tc_response_.error().code() ==
             :         TabletServerErrorPB::TabletServerErrorPB::ALREADY_INPROGRESS)
so here we were not notifying the queue if everything went well? was this a bug or did we have other means to notify the queue (like with heartbeats)

also could pull this to another method to make it more tracktable?


http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS8, Line 395: VLOG(3) << LogPrefixUnlocked() << "Peer " << uuid << " needs tablet copy" << THROTTLE_MSG;
how about we increase the number of seconds to something like 10?


http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS8, Line 228: // Update the last successful communication timestamp for the given peer
             :   // to the current time. This should be called when a non-network related
             :   // error is received from the peer, indicating that it is alive, even if it
             :   // may not be fully up and running or able to accept updates.
update docs


http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS8, Line 385: shared_ptr<TabletCopyRunnable> runnable(new TabletCopyRunnable(this, req, cb));
             :   Status s = tablet_copy_pool_->Submit(runnable);
             :   if (PREDICT_TRUE(s.ok())) {
             :     return;
             :   }
wanna add some docs regarding what you mentioned in person? i.e that we don't need check if the tablet is already in transition because that is already handled somewhere?


PS8, Line 405:   if (transition) {
             :     cb(Status::IllegalState(
             :           strings::Substitute("State transition of tablet $0 already in progress: $1",
             :                               tablet_id, *transition)),
             :           TabletServerErrorPB::ALREADY_INPROGRESS);
             :     return;
             :   }
add docs stating that we do this to call the callback outside the lock?


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has submitted this change and it was merged.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover the
tablet copy threadpool full path. Without the changes outlined before it
fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Reviewed-on: http://gerrit.cloudera.org:8080/6925
Reviewed-by: Todd Lipcon <to...@apache.org>
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
8 files changed, 152 insertions(+), 62 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Todd Lipcon: Looks good to me, but someone else must approve
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 2:

(2 comments)

Agreed that a test would be helpful

http://gerrit.cloudera.org:8080/#/c/6925/2/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS2, Line 395:   // The thread pool is at capacity. Check if the tablet is already in
             :   // transition (i.e. being copied).
             :   boost::optional<string> transition;
             :   {
             :     std::lock_guard<rw_spinlock> lock(lock_);
             :     auto* t = FindOrNull(transition_in_progress_, tablet_id);
             :     if (t) {
             :       transition = *t;
             :     }
             :   }
> why do we not check for the 'transition" state before trying to submit to t
+1


PS2, Line 406:     cb(Status::IllegalState(
             :           strings::Substitute("State transition of tablet $0 already in progress: $1",
             :                               tablet_id, *transition)),
             :           TabletServerErrorPB::ALREADY_INPROGRESS);
> should we logs these errors or would it be too spammy?
It should get logged by the leader making the remote call.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#3).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................

KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover this
specific regression. Without the changes outlined before it fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
6 files changed, 122 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS8, Line 405:   if (transition) {
             :     cb(Status::IllegalState(
             :           strings::Substitute("State transition of tablet $0 already in progress: $1",
             :                               tablet_id, *transition)),
             :           TabletServerErrorPB::ALREADY_INPROGRESS);
             :     return;
             :   }
> add docs stating that we do this to call the callback outside the lock?
Done


http://gerrit.cloudera.org:8080/#/c/6925/9/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 403:     }
> nit: this could be a shared lock, right?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6925/2/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS2, Line 395:   // The thread pool is at capacity. Check if the tablet is already in
             :   // transition (i.e. being copied).
             :   boost::optional<string> transition;
             :   {
             :     std::lock_guard<rw_spinlock> lock(lock_);
             :     auto* t = FindOrNull(transition_in_progress_, tablet_id);
             :     if (t) {
             :       transition = *t;
             :     }
             :   }
why do we not check for the 'transition" state before trying to submit to the pool?


PS2, Line 406:     cb(Status::IllegalState(
             :           strings::Substitute("State transition of tablet $0 already in progress: $1",
             :                               tablet_id, *transition)),
             :           TabletServerErrorPB::ALREADY_INPROGRESS);
should we logs these errors or would it be too spammy?


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/integration-tests/cluster_itest_util.cc
File src/kudu/integration-tests/cluster_itest_util.cc:

PS3, Line 941:     CHECK(resp.error().has_code()) << "Tablet copy error response has no code";
             :     CHECK(tserver::TabletServerErrorPB::Code_IsValid(resp.error().code()))
             :         << "Tablet copy error response code is not valid";
> could we hit these CHECKs in a rolling upgrade scenario? we don't officiall
My understanding is that this is test-only code.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 336:   if (!controller_.status().ok() || !tc_response_.has_error()) {
> Yah, and in fact it may be wrong.  I think I meant:
Looking more at this, we don't seem to be updating 'last_successful_communication_time ' in the successful case.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS8, Line 345: controller_.status().ok() &&
             :       (!tc_response_.has_error() ||
             :        tc_response_.error().code() ==
             :         TabletServerErrorPB::TabletServerErrorPB::ALREADY_INPROGRESS)
> so here we were not notifying the queue if everything went well? was this a
Yah, we weren't notifying the queue on success previously.  I think this tended to not matter because we retry every 1.5 seconds, and the next retry should get ALREADY_INPROGRESS, at which point the notification would happen.

I'll pull this into a named local, hopefully that will clarify.  I don't think it's a good idea to have it's own method, because this is the only place where tc_response_ should be touched.  We'd basically have to document the method as 'only call from ProcessTabletCopyResponse'.


http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS8, Line 395: VLOG(3) << LogPrefixUnlocked() << "Peer " << uuid << " needs tablet copy" << THROTTLE_MSG;
> how about we increase the number of seconds to something like 10?
I demoted this log because we're now logging the result of every begin tablet copy request, so these aren't useful.  That change was made in tablet_peers.cc.


http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS8, Line 228: // Update the last successful communication timestamp for the given peer
             :   // to the current time. This should be called when a non-network related
             :   // error is received from the peer, indicating that it is alive, even if it
             :   // may not be fully up and running or able to accept updates.
> update docs
Done


http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS8, Line 385: shared_ptr<TabletCopyRunnable> runnable(new TabletCopyRunnable(this, req, cb));
             :   Status s = tablet_copy_pool_->Submit(runnable);
             :   if (PREDICT_TRUE(s.ok())) {
             :     return;
             :   }
> wanna add some docs regarding what you mentioned in person? i.e that we don
Done


PS8, Line 405:   if (transition) {
             :     cb(Status::IllegalState(
             :           strings::Substitute("State transition of tablet $0 already in progress: $1",
             :                               tablet_id, *transition)),
             :           TabletServerErrorPB::ALREADY_INPROGRESS);
             :     return;
             :   }
> add docs stating that we do this to call the callback outside the lock?
To the header?  The lock is an internal implementation detail of TSTabletManager.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

Line 1067:       // Skip calling SetupErrorAndRespond since this path doesn't need the
> This doesn't actually affect tablet copy operations, right? If SetupErrorAn
Yah, this shouldn't have an effect on the actual cluster dynamics.  It's useful for testing this specific case, though.  I've changed the logging on the leader (consensus_peers.cc) so that all tablet copy failures are logged.  I'm going to do another cluster test to verify that this isn't too noisy.  Ideally it would be limited to a WARN log every 60 seconds or so, but I'm not sure how to do that on a per-tablet basis.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#5).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................

KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover this
specific regression. Without the changes outlined before it fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
6 files changed, 124 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 9: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/9/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 403:     std::lock_guard<rw_spinlock> lock(lock_);
nit: this could be a shared lock, right?


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3//COMMIT_MSG
Commit Message:

PS3, Line 22: to to
> Double to
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#2).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................

KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/tserver/ts_tablet_manager.cc
1 file changed, 21 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 336:   if (!controller_.status().ok() || !tc_response_.has_error()) {
> This is, uh, confusing to say the least. Can you pull up the ALREADY_INPROG
Yah, and in fact it may be wrong.  I think I meant:

    if (!controller_.status().ok() || tc_response_.has_error())

I'll simplify it.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6925/2/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS2, Line 395:   // The thread pool is at capacity. Check if the tablet is already in
             :   // transition (i.e. being copied).
             :   boost::optional<string> transition;
             :   {
             :     std::lock_guard<rw_spinlock> lock(lock_);
             :     auto* t = FindOrNull(transition_in_progress_, tablet_id);
             :     if (t) {
             :       transition = *t;
             :     }
             :   }
> +1
The 'happy path' in this case is that the thread pool is not oversubscribed.  In that case the tablet copy immediately gets a thread, and as part of initializing, it already checks that there isn't a copy in progress.

So, if we put the check up front, it would actually happen twice for the fast path.


PS2, Line 406:     cb(Status::IllegalState(
             :           strings::Substitute("State transition of tablet $0 already in progress: $1",
             :                               tablet_id, *transition)),
             :           TabletServerErrorPB::ALREADY_INPROGRESS);
> It should get logged by the leader making the remote call.
Yes, I've beefed up the logging of these errors on the leader.  Going to do another cluster test to make sure it's not overwhelming.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes