Posted to reviews@kudu.apache.org by "Adar Dembo (Code Review)" <ge...@cloudera.org> on 2019/10/30 00:28:13 UTC

[kudu-CR] raft_consensus_quorum-test: avoid thread collision warner crash

Hello Alexey Serbin, Volodymyr Verovkin,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/14580

to review the following change.


Change subject: raft_consensus_quorum-test: avoid thread collision warner crash
......................................................................

raft_consensus_quorum-test: avoid thread collision warner crash

One of my precommits failed with ThreadCollisionWarner triggering a crash in
raft_consensus_quorum-test. Digging into it, the cause appears to be the
"synthetic" way in which the test checks certain cmeta properties.

Back in commit 17f97531e, the real cmeta mutex was replaced with a
DFAKE_MUTEX. This fake lock works by using the ThreadCollisionWarner to
enforce that access to the "protected" object is externally synchronized;
that is, another lock is always held while calling the object's methods. In
cmeta's case, that external lock is the main RaftConsensus lock.
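
To make the mechanism concrete, here is a minimal sketch of the idea. It is
not the actual ThreadCollisionWarner code: FakeMutexModel and its methods are
hypothetical names, and thread ids are passed in explicitly (an assumption
made only to keep the example deterministic).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of a "fake mutex". It never blocks; it only
// detects that two different threads are inside the guarded object at once.
// Thread ids are passed in explicitly to keep the sketch deterministic.
class FakeMutexModel {
 public:
  // Returns true if 'tid' may enter, false on a collision with another thread.
  bool Enter(int64_t tid) {
    if (depth_ > 0 && owner_ != tid) {
      return false;  // The real warner aborts the process at this point.
    }
    owner_ = tid;
    ++depth_;  // Re-entry by the owning thread is allowed in this model.
    return true;
  }

  // Must be paired with a successful Enter() by the owning thread.
  void Leave() {
    assert(depth_ > 0);
    --depth_;
  }

 private:
  int64_t owner_ = 0;  // Only meaningful while depth_ > 0.
  int depth_ = 0;      // Number of active Enter() calls by the owner.
};
```

With the thread ids from the log in this message, Enter(222) succeeds and a
following Enter(586) reports a collision until thread 222 calls Leave(). An
external lock prevents the collision because only one thread at a time can
reach Enter().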

The problem is that raft_consensus_quorum-test accesses cmeta directly,
without going through RaftConsensus. This opens the door to crashes like
the one below, since those accesses have no external synchronization.

I opted to fix this by changing the test's ReadConsensusMetadataFromDisk
method to actually go to disk for its cmeta. I could have poked a few test
holes into RaftConsensus, but it seemed odd for a method that purported to
go "to disk" to actually get cmeta from the ConsensusMetadataManager, which
caches it in a map and always returns the same instance.
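
The difference between the two lookup paths can be sketched as follows. This
is an illustrative model only: MetaModel, CachingManagerModel and
LoadFreshCopy are hypothetical names, not the Kudu classes, and the "disk" is
modeled as a plain struct.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for consensus metadata; the field is illustrative.
struct MetaModel {
  int64_t current_term = 0;
};

// Hypothetical model of a caching manager: one shared instance per tablet.
class CachingManagerModel {
 public:
  // Always hands back the same shared instance for a given tablet id.
  std::shared_ptr<MetaModel> Load(const std::string& tablet_id) {
    std::shared_ptr<MetaModel>& slot = cache_[tablet_id];
    if (!slot) {
      slot = std::make_shared<MetaModel>();
    }
    return slot;
  }

 private:
  std::map<std::string, std::shared_ptr<MetaModel>> cache_;
};

// Models "going to disk": every call builds a private copy of the persisted
// state, so the caller never shares an object with other threads.
std::unique_ptr<MetaModel> LoadFreshCopy(const MetaModel& persisted) {
  return std::make_unique<MetaModel>(persisted);
}
```

Two Load() calls for the same tablet return the same object, so any fake
mutex inside it is shared with whoever else holds that object. Each
LoadFreshCopy() call returns a distinct object owned by the caller, so a test
reading it cannot collide with threads running inside RaftConsensus.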

I also included some test cleanup.

  I1029 18:08:21.149920   222 raft_consensus.cc:2316] T TestTablet P 1894ac4aabfc4698b93d6c621f3c9ceb [term 1 FOLLOWER]: Leader election vote request: Denying vote to candidate c0fad19d10b74845b028ecce5cde3fbe for term 2 because replica is either leader or believes a valid leader to be alive.
  I1029 18:08:21.150341   222 raft_consensus.cc:2901] T TestTablet P 1894ac4aabfc4698b93d6c621f3c9ceb [term 1 FOLLOWER]: Advancing to term 2
  W1029 18:08:21.701403   222 consensus_meta.cc:220] T TestTablet P 1894ac4aabfc4698b93d6c621f3c9ceb: Time spent flushing consensus metadata: real 0.551s	user 0.001s	sys 0.000s
  I1029 18:08:21.701627   222 raft_consensus.cc:2355] T TestTablet P 1894ac4aabfc4698b93d6c621f3c9ceb [term 2 FOLLOWER]: Leader election vote request: Granting yes vote for candidate c0fad19d10b74845b028ecce5cde3fbe in term 2.
  F1029 18:08:21.701889   586 thread_collision_warner.cc:23] Thread Collision! Previous thread id: 222, current thread id: 586
  *** Check failure stack trace: ***
  *** Aborted at 1572372501 (unix time) try "date -d @1572372501" if you are using GNU date ***
  PC: @     0x7ff8b8713c37 gsignal
  *** SIGABRT (@0x3e8000000de) received by PID 222 (TID 0x7ff8a2b0e700) from PID 222; stack trace: ***
    @     0x7ff8b92d6330 (unknown) at ??:0
    @     0x7ff8b8713c37 gsignal at ??:0
    @     0x7ff8b8717028 abort at ??:0
    @     0x7ff8bb112e09 google::logging_fail() at ??:0
    @     0x7ff8bb11462d google::LogMessage::Fail() at ??:0
    @     0x7ff8bb11664c google::LogMessage::SendToLog() at ??:0
    @     0x7ff8bb114189 google::LogMessage::Flush() at ??:0
    @     0x7ff8bb116fdf google::LogMessageFatal::~LogMessageFatal() at ??:0
    @     0x7ff8bb4ef2d3 base::DCheckAsserter::warn() at ??:0
    @     0x7ff8bb4ef3e8 base::ThreadCollisionWarner::EnterSelf() at ??:0
    @     0x7ff8c6ecb86b kudu::consensus::ConsensusMetadata::current_term() at ??:0
    @     0x7ff8c6fa1632 kudu::consensus::RaftConsensus::CurrentTermUnlocked() at ??:0
    @     0x7ff8c6fc847e kudu::consensus::RaftConsensus::HandleLeaderRequestTermUnlocked() at ??:0
    @     0x7ff8c6fca887 kudu::consensus::RaftConsensus::CheckLeaderRequestUnlocked() at ??:0
    @     0x7ff8c6fbf7c7 kudu::consensus::RaftConsensus::UpdateReplica() at ??:0
    @     0x7ff8c6fbe5e2 kudu::consensus::RaftConsensus::Update() at ??:0
    @           0x6112b2 kudu::consensus::LocalTestPeerProxy::SendUpdateRequest() at /home/jenkins-slave/workspace/kudu-master/2/src/kudu/consensus/consensus-test-util.h:?

Change-Id: Icacdef44fd545f46eb9f47c577641e3aa66d6e94
---
M src/kudu/consensus/consensus_meta.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/tablet/tablet_replica.cc
5 files changed, 32 insertions(+), 36 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/80/14580/1
-- 
To view, visit http://gerrit.cloudera.org:8080/14580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Icacdef44fd545f46eb9f47c577641e3aa66d6e94
Gerrit-Change-Number: 14580
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Volodymyr Verovkin <ve...@cloudera.com>

[kudu-CR] KUDU-2949: deflake raft_consensus_quorum-test

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has removed Kudu Jenkins from this change.  ( http://gerrit.cloudera.org:8080/14580 )

Change subject: KUDU-2949: deflake raft_consensus_quorum-test
......................................................................


Removed reviewer Kudu Jenkins with the following votes:

* Verified-1 by Kudu Jenkins (120)

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: Icacdef44fd545f46eb9f47c577641e3aa66d6e94
Gerrit-Change-Number: 14580
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Volodymyr Verovkin <ve...@cloudera.com>

[kudu-CR] KUDU-2949: deflake raft_consensus_quorum-test

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Kudu Jenkins, Volodymyr Verovkin, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14580

to look at the new patch set (#2).

Change subject: KUDU-2949: deflake raft_consensus_quorum-test
......................................................................

KUDU-2949: deflake raft_consensus_quorum-test

Change-Id: Icacdef44fd545f46eb9f47c577641e3aa66d6e94
---
M src/kudu/consensus/consensus_meta.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/tablet/tablet_replica.cc
5 files changed, 32 insertions(+), 36 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/80/14580/2

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icacdef44fd545f46eb9f47c577641e3aa66d6e94
Gerrit-Change-Number: 14580
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Volodymyr Verovkin <ve...@cloudera.com>

[kudu-CR] KUDU-2949: deflake raft_consensus_quorum-test

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14580 )

Change subject: KUDU-2949: deflake raft_consensus_quorum-test
......................................................................


Patch Set 2: Verified+1

Overriding Jenkins: the test failure is unrelated (and I had a good +1 from before).



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icacdef44fd545f46eb9f47c577641e3aa66d6e94
Gerrit-Change-Number: 14580
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Volodymyr Verovkin <ve...@cloudera.com>
Gerrit-Comment-Date: Thu, 31 Oct 2019 19:47:03 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2949: deflake raft_consensus_quorum-test

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/14580 )

Change subject: KUDU-2949: deflake raft_consensus_quorum-test
......................................................................


Patch Set 2: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14580/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14580/2//COMMIT_MSG@27
PS2, Line 27: and always returns the same instance
And this was the issue because in the absence of caching, the ThreadCollisionWarner wouldn't be triggered, right?




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icacdef44fd545f46eb9f47c577641e3aa66d6e94
Gerrit-Change-Number: 14580
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Volodymyr Verovkin <ve...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Nov 2019 00:16:30 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2949: deflake raft_consensus_quorum-test

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14580 )

Change subject: KUDU-2949: deflake raft_consensus_quorum-test
......................................................................

KUDU-2949: deflake raft_consensus_quorum-test

Change-Id: Icacdef44fd545f46eb9f47c577641e3aa66d6e94
Reviewed-on: http://gerrit.cloudera.org:8080/14580
Tested-by: Adar Dembo <ad...@cloudera.com>
Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
M src/kudu/consensus/consensus_meta.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/tablet/tablet_replica.cc
5 files changed, 32 insertions(+), 36 deletions(-)

Approvals:
  Adar Dembo: Verified
  Alexey Serbin: Looks good to me, approved


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Icacdef44fd545f46eb9f47c577641e3aa66d6e94
Gerrit-Change-Number: 14580
Gerrit-PatchSet: 3
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Volodymyr Verovkin <ve...@cloudera.com>

[kudu-CR] KUDU-2949: deflake raft_consensus_quorum-test

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14580 )

Change subject: KUDU-2949: deflake raft_consensus_quorum-test
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14580/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14580/2//COMMIT_MSG@27
PS2, Line 27: and always returns the same instance
> And this was the issue because in the absence of caching, the ThreadCollisi
Correct. Another approach I considered was adding "don't cache" behavior into ConsensusMetadataManager, but it seemed weird to build that in just for tests.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icacdef44fd545f46eb9f47c577641e3aa66d6e94
Gerrit-Change-Number: 14580
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Volodymyr Verovkin <ve...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Nov 2019 00:24:47 +0000
Gerrit-HasComments: Yes