Posted to reviews@kudu.apache.org by "Will Berkeley (Code Review)" <ge...@cloudera.org> on 2019/03/15 22:14:11 UTC

[kudu-CR] KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

Will Berkeley has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12770 )


Change subject: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
......................................................................

KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

The initialization of the master works as follows:

1. Register RPC services.
2. Init catalog manager asynchronously.

As a result, if a master in a multimaster cluster with a healthy leader
starts, there is a brief period of time when a call to UpdateConsensus
from the leader master will hit a CatalogManager and SysTable that are
not initialized. The initializing master will respond TABLET_NOT_FOUND
to the leader, which will cause the leader master to initiate the tablet
copy process. This is a dead end because masters don't support tablet
copy. Things are stuck until there is a leadership change or the
"orphaned" master is restarted again.

Tablets on tablet servers are not vulnerable to this because their
startup order is

1. Init the ts tablet manager synchronously.
2. Register RPC services.

So it is not possible for an UpdateConsensus call to query a ts tablet
manager that hasn't loaded all of the initial tablets.
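
To make the difference between the two startup orders concrete, here is a
minimal sketch. It is a toy model, not Kudu code: Step, HasUninitializedWindow,
kMasterOrder and kTserverOrder are hypothetical names, and the asynchronous
catalog manager init is modeled as a step that completes some time after RPC
registration.

```cpp
#include <vector>

// Hypothetical startup steps; these are not Kudu's real types.
enum class Step { kRegisterRpc, kInitManager };

// Returns true if, at some point during startup, RPCs can already arrive
// while the manager (catalog manager or ts tablet manager) is not ready.
bool HasUninitializedWindow(const std::vector<Step>& order) {
  bool rpc_registered = false;
  bool manager_ready = false;
  for (Step s : order) {
    if (s == Step::kRegisterRpc) rpc_registered = true;
    if (s == Step::kInitManager) manager_ready = true;
    // An incoming UpdateConsensus call at this point would hit an
    // uninitialized manager.
    if (rpc_registered && !manager_ready) return true;
  }
  return false;
}

// Master: services are registered first, init finishes later (async).
const std::vector<Step> kMasterOrder = {Step::kRegisterRpc, Step::kInitManager};
// Tablet server: manager is initialized synchronously before registration.
const std::vector<Step> kTserverOrder = {Step::kInitManager, Step::kRegisterRpc};
```

In this model HasUninitializedWindow(kMasterOrder) is true and
HasUninitializedWindow(kTserverOrder) is false, which matches the description
above.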

The fix is pretty simple: recognize and return the StatusUnavailable
returned by the tablet lookup for the master tablet, instead of
TABLET_NOT_FOUND. This will cause the leader master to retry until the
initializing master has finished initializing.
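
The effect of the fix can be sketched roughly as follows. This is an
illustration under assumptions, not the actual change to tablet_service.cc:
LookupResult, FollowerReply, LeaderAction and the three functions are
hypothetical names that only model which error the follower reports and how
the leader reacts to it.

```cpp
// Hypothetical, simplified stand-ins; none of these are Kudu's real types.
enum class LookupResult { kFound, kNotFound, kServiceUnavailable };
enum class FollowerReply { kOk, kTabletNotFound, kUnavailable };
enum class LeaderAction { kProceed, kRetryLater, kStartTabletCopy };

// Before the fix: every failed lookup is reported as "tablet not found".
FollowerReply ReplyBeforeFix(LookupResult r) {
  return r == LookupResult::kFound ? FollowerReply::kOk
                                   : FollowerReply::kTabletNotFound;
}

// After the fix: "not initialized yet" is passed through as unavailable.
FollowerReply ReplyAfterFix(LookupResult r) {
  switch (r) {
    case LookupResult::kFound:              return FollowerReply::kOk;
    case LookupResult::kServiceUnavailable: return FollowerReply::kUnavailable;
    case LookupResult::kNotFound:           return FollowerReply::kTabletNotFound;
  }
  return FollowerReply::kTabletNotFound;  // Not reached; silences warnings.
}

// Leader side of the model: a missing tablet triggers tablet copy, while an
// unavailable follower is simply retried.
LeaderAction LeaderReaction(FollowerReply reply) {
  switch (reply) {
    case FollowerReply::kOk:             return LeaderAction::kProceed;
    case FollowerReply::kUnavailable:    return LeaderAction::kRetryLater;
    case FollowerReply::kTabletNotFound: return LeaderAction::kStartTabletCopy;
  }
  return LeaderAction::kRetryLater;  // Not reached; silences warnings.
}
```

With an initializing follower (kServiceUnavailable), the old mapping leads the
leader to kStartTabletCopy, the dead end described above, while the new
mapping leads to kRetryLater.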

This was the cause of flakiness in KUDU-2734. Without the fix, about 8%
of runs fail on TSAN with 8 stress threads. With the fix, about 0.3% do
(and in 2000 runs with 6 failures I verified that none of the 6 were due
to this issue).

Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
---
M src/kudu/tserver/tablet_service.cc
1 file changed, 9 insertions(+), 2 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/12770/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12770
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Gerrit-Change-Number: 12770
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>

[kudu-CR] KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12770 )

Change subject: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
......................................................................

Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Reviewed-on: http://gerrit.cloudera.org:8080/12770
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Reviewed-by: Grant Henke <gr...@apache.org>
Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
M src/kudu/tserver/tablet_service.cc
1 file changed, 9 insertions(+), 2 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Adar Dembo: Looks good to me, but someone else must approve
  Grant Henke: Looks good to me, approved
  Alexey Serbin: Looks good to me, but someone else must approve


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Gerrit-Change-Number: 12770
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/12770 )

Change subject: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12770/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12770/1//COMMIT_MSG@19
PS1, Line 19: This is a dead end because masters don't support tablet
            : copy
> maybe, address this as well? i.e., return some specific error code (NotSupp
In other words, maybe return to some other state of the 'consensus state machine' in case of receiving NotSupported and do business as usual?  If that was the case, this sort of issue would heal itself automatically, right?
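
A rough sketch of what such a transition could look like, purely as an
illustration of the suggestion and not something this patch implements:
PeerState, CopyResult and OnTabletCopyResult are hypothetical names, not part
of Kudu's consensus code.

```cpp
// Hypothetical leader-side view of a follower; not Kudu's real state machine.
enum class PeerState { kReplicating, kNeedsTabletCopy, kCopyInProgress };
enum class CopyResult { kStarted, kNotSupported };

// If the follower answers NotSupported to a tablet copy attempt, fall back to
// ordinary replication instead of staying stuck waiting for a copy.
PeerState OnTabletCopyResult(PeerState current, CopyResult result) {
  if (current != PeerState::kNeedsTabletCopy) return current;
  return result == CopyResult::kNotSupported ? PeerState::kReplicating
                                             : PeerState::kCopyInProgress;
}
```

Under this model a master follower that rejects tablet copy would go back to
kReplicating and the leader would keep sending UpdateConsensus calls.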




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Gerrit-Change-Number: 12770
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 15 Mar 2019 22:35:23 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12770 )

Change subject: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
......................................................................


Patch Set 1: Code-Review+1

> > I really wouldn't recommend trying to band-aid any fixes here.
> 
> OK. So should I can this patch in favor of a test-code-only change that will reduce flakiness, and at some unknown time in the future somebody might address master startup and initialization?

Sorry, I wasn't very clear; I think _this_ patch is fine, but I'd be very wary of a patch that did some simple startup reordering without tackling it from first principles.



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Gerrit-Change-Number: 12770
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 18 Mar 2019 18:05:53 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12770 )

Change subject: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
......................................................................


Patch Set 1:

> Adar, do you think there is a change to master init order that would fix this and be worth doing to address this issue?

I think we need to take a hatchet to master initialization. It's really, really bad and has resulted in a number of issues:
1. Bad cleanup after partial failure: KUDU-1186
2. All masters need to be started up together because they multicast to one another with a 30s timeout: KUDU-2080
3. Test flakiness and/or TSAN data races when anything is changed in CatalogManager::Shutdown: KUDU-2634 (amongst many, many others)
4. No way to start a master in "listening" mode so as to simplify recovering from a dead master.

I really wouldn't recommend trying to band-aid any fixes here. BTW, I suspect RPC service registration is what it is on account of #2.



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Gerrit-Change-Number: 12770
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 15 Mar 2019 23:08:48 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12770 )

Change subject: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
......................................................................


Patch Set 1: Code-Review+2

I am on board with this patch/plan. Did we open a jira to track the master initialization re-work?



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Gerrit-Change-Number: 12770
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Wed, 20 Mar 2019 15:41:08 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/12770 )

Change subject: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
......................................................................


Patch Set 1:

> > > I really wouldn't recommend trying to band-aid any fixes here.
 > >
 > > OK. So should I can this patch in favor of a test-code-only
 > change that will reduce flakiness, and at some unknown time in the
 > future somebody might address master startup and initialization?
 > 
 > Sorry, I wasn't very clear; I think _this_ patch is fine, but I'd
 > be very wary of a patch that did some simple startup reordering
 > without tackling it from first principles.

:thumbsup:



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Gerrit-Change-Number: 12770
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 18 Mar 2019 18:07:32 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/12770 )

Change subject: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
......................................................................


Patch Set 1: Code-Review+1



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Gerrit-Change-Number: 12770
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Wed, 20 Mar 2019 19:01:51 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/12770 )

Change subject: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12770/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12770/1//COMMIT_MSG@19
PS1, Line 19: This is a dead end because masters don't support tablet
            : copy
Maybe address this as well? I.e., return a specific error code (NotSupported) if a tablet copy is attempted?




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Gerrit-Change-Number: 12770
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 15 Mar 2019 22:25:32 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/12770 )

Change subject: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
......................................................................


Patch Set 1:

Adar, do you think there is a change to master init order that would fix this and be worth doing to address this issue?



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Gerrit-Change-Number: 12770
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 15 Mar 2019 22:15:50 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/12770 )

Change subject: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
......................................................................


Patch Set 1:

> I really wouldn't recommend trying to band-aid any fixes here.

OK. So should I can this patch in favor of a test-code-only change that will reduce flakiness, and at some unknown time in the future somebody might address master startup and initialization?



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Gerrit-Change-Number: 12770
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 18 Mar 2019 16:47:00 +0000
Gerrit-HasComments: No