You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2017/05/31 18:23:27 UTC

[kudu-CR] [consensus queue] fix race in UpdateLagMetrics()

Alexey Serbin has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7032

Change subject: [consensus_queue] fix race in UpdateLagMetrics()
......................................................................

[consensus_queue] fix race in UpdateLagMetrics()

TSAN reports warnings on races on writing/reading the
QueueState::last_idx_appended_to_leader field:

CatalogManagerTskITest.LeadershipChangeOnTskGeneration: WARNING: ThreadSanitizer: data race (pid=2710)  Write of size 8 at 0x7d500003e480 by thread T33 (mutexes: write M1863, write M1821):
    #0 kudu::consensus::PeerMessageQueue::UpdateLastIndexAppendedToLeader(long) consensus/consensus_queue.cc:607:44
    #1 kudu::consensus::RaftConsensus::UpdateReplica(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResponsePB*) consensus/raft_consensus.cc:1155:13
    #2 kudu::consensus::RaftConsensus::Update(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResponsePB*) consensus/raft_consensus.cc:752:14
    #3 kudu::tserver::ConsensusServiceImpl::UpdateConsensus(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResponsePB*, kudu::rpc::RpcContext*) tserver/tablet_service.cc:861:25
    #4 kudu::consensus::ConsensusServiceIf::ConsensusServiceIf(scoped_refptr<kudu::MetricEntity> const&, scoped_refptr<kudu::rpc::ResultTracker> const&)::$_1::operator()(google::protobuf::Message const*, google::protobuf::Message*, kudu::rpc::RpcContext*) const consensus/consensus.service.cc:100:13
    ... skipped ...

  Previous read of size 8 at 0x7d500003e480 by thread T79 (mutexes: write M1822):
    #0 kudu::consensus::PeerMessageQueue::UpdateLagMetrics() consensus/consensus_queue.cc:602:20
    #1 kudu::consensus::PeerMessageQueue::UpdateMetrics() consensus/consensus_queue.cc:875:3
    #2 kudu::consensus::PeerMessageQueue::ResponseFromPeer(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, kudu::consensus::ConsensusResponsePB const&, bool*) consensus/consensus_queue.cc:828:5
    ... skipped ...

For more details, see
  http://dist-test.cloudera.org:8080/diagnose?key=0784c33a-41b5-11e7-9f6b-0242ac11000f

This patch fixes the above race.  Also, it contains some extras:
  * The PeerMessageQueue::UpdateLagMetrics() method has been renamed
    into PeerMessageQueue::UpdateLagMetricsUnlocked() and made private.
  * The PeerMessageQueue::UpdateMetrics() method has been renamed
    into PeerMessageQueue::UpdateMetricsUnlocked().
  * Added DCHECK(queue_lock_.is_locked()) into all
    PeerMessageQueue::XxxUnlocked() methods.

Change-Id: I25feb676619cc1f3a94fb8e631bffd8ca02ead49
---
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
2 files changed, 24 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/32/7032/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I25feb676619cc1f3a94fb8e631bffd8ca02ead49
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>

[kudu-CR] [consensus queue] fix race in UpdateLagMetrics()

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [consensus_queue] fix race in UpdateLagMetrics()
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7032/1/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 603:   std::lock_guard<simple_spinlock> l(queue_lock_);
> I think this is OK, but just to confirm... this is called from RaftConsensu
I took a look at the places where PeerMessageQueue::UpdateLastIndexAppendedToLeader() is called and I didn't find any suspicious places.  Actually, the only non-test place it's called is in RaftConsensus::UpdateReplica().  I also found that many methods like RaftConsensus::XxxUnlocked() call Queue methods which take the same lock, so I think it should be fine.

Otherwise, I was thinking, would use atomic instead.

I also ran consensus-related tests in many iterations via dist-test, and no issues were reported (at least in DEBUG build).


-- 
To view, visit http://gerrit.cloudera.org:8080/7032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I25feb676619cc1f3a94fb8e631bffd8ca02ead49
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus queue] fix race in UpdateLagMetrics()

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus_queue] fix race in UpdateLagMetrics()
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7032/1/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 603:   std::lock_guard<simple_spinlock> l(queue_lock_);
> I took a look at the places where PeerMessageQueue::UpdateLastIndexAppended
Right, that should be the only call in a non-test.

Hrm, I thought metrics were by default atomic, at least from a brief look in metrics.h. But we would still need locking here to prevent the race.


-- 
To view, visit http://gerrit.cloudera.org:8080/7032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I25feb676619cc1f3a94fb8e631bffd8ca02ead49
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus queue] fix race in UpdateLagMetrics()

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: [consensus_queue] fix race in UpdateLagMetrics()
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7032/1/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 603:   std::lock_guard<simple_spinlock> l(queue_lock_);
I think this is OK, but just to confirm... this is called from RaftConsensus while holding RaftConsensus's lock. Is that the correct ordering? RaftConsensus -> ConsensusQueue? Do we ever invert that, leading to possible deadlock? (I don't think so based on me reading, but curious if you also looked)


-- 
To view, visit http://gerrit.cloudera.org:8080/7032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I25feb676619cc1f3a94fb8e631bffd8ca02ead49
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus queue] fix race in UpdateLagMetrics()

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [consensus_queue] fix race in UpdateLagMetrics()
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7032/1/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 603:   std::lock_guard<simple_spinlock> l(queue_lock_);
> Right, that should be the only call in a non-test.
Yep, the not-so-obvious thing was accessing that last_idx_appended_to_leader from multiple places, and that's exactly where the race happened.


-- 
To view, visit http://gerrit.cloudera.org:8080/7032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I25feb676619cc1f3a94fb8e631bffd8ca02ead49
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus queue] fix race in UpdateLagMetrics()

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has submitted this change and it was merged.

Change subject: [consensus_queue] fix race in UpdateLagMetrics()
......................................................................


[consensus_queue] fix race in UpdateLagMetrics()

TSAN reports warnings on races on writing/reading the
QueueState::last_idx_appended_to_leader field:

CatalogManagerTskITest.LeadershipChangeOnTskGeneration: WARNING: ThreadSanitizer: data race (pid=2710)  Write of size 8 at 0x7d500003e480 by thread T33 (mutexes: write M1863, write M1821):
    #0 kudu::consensus::PeerMessageQueue::UpdateLastIndexAppendedToLeader(long) consensus/consensus_queue.cc:607:44
    #1 kudu::consensus::RaftConsensus::UpdateReplica(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResponsePB*) consensus/raft_consensus.cc:1155:13
    #2 kudu::consensus::RaftConsensus::Update(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResponsePB*) consensus/raft_consensus.cc:752:14
    #3 kudu::tserver::ConsensusServiceImpl::UpdateConsensus(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResponsePB*, kudu::rpc::RpcContext*) tserver/tablet_service.cc:861:25
    #4 kudu::consensus::ConsensusServiceIf::ConsensusServiceIf(scoped_refptr<kudu::MetricEntity> const&, scoped_refptr<kudu::rpc::ResultTracker> const&)::$_1::operator()(google::protobuf::Message const*, google::protobuf::Message*, kudu::rpc::RpcContext*) const consensus/consensus.service.cc:100:13
    ... skipped ...

  Previous read of size 8 at 0x7d500003e480 by thread T79 (mutexes: write M1822):
    #0 kudu::consensus::PeerMessageQueue::UpdateLagMetrics() consensus/consensus_queue.cc:602:20
    #1 kudu::consensus::PeerMessageQueue::UpdateMetrics() consensus/consensus_queue.cc:875:3
    #2 kudu::consensus::PeerMessageQueue::ResponseFromPeer(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, kudu::consensus::ConsensusResponsePB const&, bool*) consensus/consensus_queue.cc:828:5
    ... skipped ...

For more details, see
  http://dist-test.cloudera.org:8080/diagnose?key=0784c33a-41b5-11e7-9f6b-0242ac11000f

This patch fixes the above race.  Also, it contains some extras:
  * The PeerMessageQueue::UpdateLagMetrics() method has been renamed
    into PeerMessageQueue::UpdateLagMetricsUnlocked() and made private.
  * The PeerMessageQueue::UpdateMetrics() method has been renamed
    into PeerMessageQueue::UpdateMetricsUnlocked().
  * Added DCHECK(queue_lock_.is_locked()) into all
    PeerMessageQueue::XxxUnlocked() methods.

Change-Id: I25feb676619cc1f3a94fb8e631bffd8ca02ead49
Reviewed-on: http://gerrit.cloudera.org:8080/7032
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>
---
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
2 files changed, 24 insertions(+), 16 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I25feb676619cc1f3a94fb8e631bffd8ca02ead49
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus queue] fix race in UpdateLagMetrics()

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: [consensus_queue] fix race in UpdateLagMetrics()
......................................................................


Patch Set 1: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I25feb676619cc1f3a94fb8e631bffd8ca02ead49
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No