You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2017/03/07 21:30:19 UTC

[kudu-CR] [consensus] add lock coverage for peer queue

Andrew Wong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6299

Change subject: [consensus] add lock coverage for peer queue
......................................................................

[consensus] add lock coverage for peer queue

Certain operations in PeerMessageQueue (e.g. RequestForPeer) were not
locking when they should have been (reading or writing from peer or
PeersMap). These missing locks resulted in races.

Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
---
M src/kudu/consensus/consensus_queue.cc
1 file changed, 95 insertions(+), 92 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/6299/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6299

to look at the new patch set (#7).

Change subject: [consensus] add lock coverage for peer queue
......................................................................

[consensus] add lock coverage for peer queue

A Peer or PeerQueue would lock without taking into account the
presence of a peer calling back after TabletCopy and updating its
members, which doesn't abide by external locks (like the queue lock).
This would result in races.

A PeerMessageQueue operation would lock with the assumption that its
access to a peer was uninterrupted by other threads. This assumption
is broken in cases where a PeerProxy object has a callback to access a
peer (e.g. proxy calls StartTabletCopy with a callback that processes
the response and updates the peer) from a separate thread (as is the
case with the RpcHeartBeater).

A Peer operation like SendNextRequest would lock under this assumption,
only to potentially be interrupted by the same callback. This callback
was not completely locked and would race with the send.

This patch adds lock coverage to these cases, both seen in
RaftConsensusITest.TestConfigChangeUnderLoad.

TSAN errors and descriptions can be found here:
https://github.com/andrwng/kudu/blob/89119693414e50280e33b170ee26de3047e0911e/docs/race_diagnostics/race_descriptions.txt

Full TSAN logs can be found here:
https://github.com/andrwng/kudu/blob/89119693414e50280e33b170ee26de3047e0911e/docs/race_diagnostics/race_type_1.txt
https://github.com/andrwng/kudu/blob/89119693414e50280e33b170ee26de3047e0911e/docs/race_diagnostics/race_type_2.txt

Dist-test results before and after:
Before: Note that for these builds, status 1 is likely a data race.
http://dist-test.cloudera.org/job?job_id=awong.1489029805.26963
http://dist-test.cloudera.org/job?job_id=awong.1489086157.18460

After:
http://dist-test.cloudera.org/job?job_id=awong.1489096815.24680

Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
3 files changed, 65 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/6299/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6299

to look at the new patch set (#6).

Change subject: [consensus] add lock coverage for peer queue
......................................................................

[consensus] add lock coverage for peer queue

Certain operations in PeerMessageQueue (e.g. RequestForPeer) were not
locking when they should have been (e.g. when reading or writing from
a peer). These missing locks resulted in races.

This patch adds lock coverage to these cases.

Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
---
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
2 files changed, 60 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/6299/6
-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 10: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6299/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS4, Line 342: ain a thread-safe copy of necessary members.
             :   // QueueState dependents:
             :   OpId preceding_id;
             :   int num_voters;
             :   int64_t current_term;
             : 
             :   // Peer dependents:
             :   bool is_new;
             :   bool peer_needs_tablet_copy;
             :   int next_index;
             :   logging::LogThrottler status_log_throttler;
             :   std::string peer_string;
             :   MonoDelta unreachable_time;
> that sounds good. I think that if you remove the 'const' from the uuid fiel
that said, the const is there for a reason. I could go either way (manually code the ctor or remove the const). up to you


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6299/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS4, Line 360: void races when the peer is altered by another thread, make a copy and read
             :     // from it through tho
Will remove.


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6299/7/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS7, Line 131:   // A copy of TrackedPeer with a mutable UUID. This enables the existence of a
             :   // default constructor, which can be used to safely get a peer and keep it
             :   // alive outside of a critical section.
             :   struct TrackedPeerSnapshot : TrackedPeer {
             :   public:
             :     explicit TrackedPeerSnapshot(std::string uuid) :TrackedPeer(uuid) {
             :       this->uuid = uuid;
             :     }
             : 
             :     // Default constructor will not have a UUID, but it can be set later.
             :     TrackedPeerSnapshot() : TrackedPeer("") {}
             : 
             :     // Copy a given TrackedPeer.
             :     TrackedPeerSnapshot& operator=(const TrackedPeer& tracked_peer) {
             :       uuid = tracked_peer.uuid;
             :       is_new = tracked_peer.is_new;
             :       next_index = tracked_peer.next_index;
             :       last_received = tracked_peer.last_received;
             :       last_known_committed_index = tracked_peer.last_known_committed_index;
             :       is_last_exchange_successful = tracked_peer.is_last_exchange_successful;
             :       last_successful_communication_time = tracked_peer.last_successful_communication_time;
             :       needs_tablet_copy = tracked_peer.needs_tablet_copy;
             :       status_log_throttler = tracked_peer.status_log_throttler;
             :       return *this;
             :     }
             : 
             :     // UUID of the snapshot.
             :     std::string uuid;
             :   };
> Sure, my reasoning for going down this route was for clarity that we're sna
right.


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6299

to look at the new patch set (#3).

Change subject: [consensus] add lock coverage for peer queue
......................................................................

[consensus] add lock coverage for peer queue

Certain operations in PeerMessageQueue (e.g. RequestForPeer) were not
locking when they should have been (reading or writing from peer or
PeersMap). These missing locks resulted in races.

Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
---
M src/kudu/consensus/consensus_queue.cc
1 file changed, 74 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/6299/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has uploaded a new patch set (#2).

Change subject: [consensus] add lock coverage for peer queue
......................................................................

[consensus] add lock coverage for peer queue

Certain operations in PeerMessageQueue (e.g. RequestForPeer) were not
locking when they should have been (reading or writing from peer or
PeersMap). These missing locks resulted in races.

Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
---
M src/kudu/consensus/consensus_queue.cc
1 file changed, 75 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/6299/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6299/1/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS1, Line 142:   {
             :     std::lock_guard<simple_spinlock> lock(queue_lock_);
             :     if (current_term != queue_state_.current_term) {
             :       CHECK_GT(current_term, queue_state_.current_term) << "Terms should only increase";
             :       queue_state_.first_index_in_current_term = boost::none;
             :       queue_state_.current_term = current_term;
             :     }
             : 
             :     queue_state_.committed_index = committed_index;
             :     queue_state_.majority_replicated_index = committed_index;
             :     queue_state_.active_config.reset(new RaftConfigPB(active_config));
             :     CHECK(IsRaftConfigVoter(local_peer_pb_.permanent_uuid(), *queue_state_.active_config))
             :         << SecureShortDebugString(local_peer_pb_) << " not a voter in config: "
             :         << SecureShortDebugString(*queue_state_.active_config);
             :     queue_state_.majority_size_ = MajoritySize(CountVoters(*queue_state_.active_config));
             :     queue_state_.mode = LEADER;
             : 
             :     LOG_WITH_PREFIX_UNLOCKED(INFO) << "Queue going to LEADER mode. State: "
             :         << queue_state_.ToString();
             :   }
             :   CheckPeersInActiveConfigIfLeaderUnlocked();
             : 
             :   // Reset last communication time with all peers to reset the clock on the
             :   // failure timeout.
             :   MonoTime now(MonoTime::Now());
             :   for (const PeersMap::value_type& entry : peers_map_) {
             :     std::lock_guard<simple_spinlock> lock(queue_lock_);
             :     entry.second->last_successful_communication_time = now;
             :   }
             :   time_manager_->SetLeaderMode();
weary of this change. are you sure that it doesn't introduce possible races?


PS1, Line 347: {
I think that extending the coverage of this lock is likely going to have a big performance hit (and might even generate deadlocks).  Could we take a copy of "peer" instead of a pointer?


PS1, Line 366:  
extra space


PS1, Line 483: {
can req->Clear() take some meaningful time?


PS1, Line 569: std::lock_guard<simple_spinlock> lock(queue_lock_);
don't thing this one is needed, since this code only runs on VLOG(3)


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 6:

Will try that out, so far, I'd just been using specific test cases with default threading. Found the underlying cause to be a callback from TabletCopy, calling ProcessTabletCopyResponse.

-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6299

to look at the new patch set (#10).

Change subject: [consensus] add lock coverage for peer queue
......................................................................

[consensus] add lock coverage for peer queue

A Peer or PeerQueue would lock without taking into account the
presence of a peer calling back after TabletCopy and updating its
members, which doesn't abide by external locks (like the queue lock).
This would result in races.

A PeerMessageQueue operation would lock with the assumption that its
access to a peer was uninterrupted by other threads. This assumption
is broken in cases where a PeerProxy object has a callback to access a
peer (e.g. proxy calls StartTabletCopy with a callback that processes
the response and updates the peer) from a separate thread (as is the
case with the RpcHeartBeater).

A Peer operation like SendNextRequest would lock under this assumption,
only to potentially be interrupted by the same callback. This callback
was not completely locked and would race with the send.

This patch adds lock coverage to these cases, both seen in
RaftConsensusITest.TestConfigChangeUnderLoad.

TSAN errors and descriptions can be found here:
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_descriptions.txt

Full TSAN logs can be found here:
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_type_1.txt
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_type_2.txt

Dist-test results before and after:
Before: Note that for these builds, status 1 is likely a data race.
http://dist-test.cloudera.org/job?job_id=awong.1489029805.26963
http://dist-test.cloudera.org/job?job_id=awong.1489086157.18460

After:
http://dist-test.cloudera.org/job?job_id=awong.1489096815.24680

Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
3 files changed, 42 insertions(+), 32 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/6299/10
-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6299/7//COMMIT_MSG
Commit Message:

Line 29: https://github.com/andrwng/kudu/blob/89119693414e50280e33b170ee26de3047e0911e/docs/race_diagnostics/race_descriptions.txt
Will update descriptions textwidth to be more readable.


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6299/1/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS1, Line 483: {
> can req->Clear() take some meaningful time?
Hmm, what is considered meaningful time? Regardless, only the peer write needs to be locked. I'm unsure of whether there was specific rationale for putting it after setting the req (given the "Now reset the flag"). If it doesn't make a difference, can move the write up and take the req writes out of the critical section.


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6299/7/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS7, Line 328:   std::unique_lock<simple_spinlock> lock(peer_lock_);
             :   if (closed_) {
             :     return;
             :   }
             :   CHECK(request_pending_);
             :   request_pending_ = false;
> what is the additional state this lock is protecting? covering the queue in
This lock needs to protect controller_, which may be locked by SendNextRequest(). I was worried about the queue lock as well, so I'll take another look to make sure we're okay on that front.

It's worth noting that ProcessResponse() also locks during a call to queue_->NotifyPeerIsResponsiveDespiteError(), which put me more at ease locking here too.


http://gerrit.cloudera.org:8080/#/c/6299/7/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS7, Line 131:   // A copy of TrackedPeer with a mutable UUID. This enables the existence of a
             :   // default constructor, which can be used to safely get a peer and keep it
             :   // alive outside of a critical section.
             :   struct TrackedPeerSnapshot : TrackedPeer {
             :   public:
             :     explicit TrackedPeerSnapshot(std::string uuid) :TrackedPeer(uuid) {
             :       this->uuid = uuid;
             :     }
             : 
             :     // Default constructor will not have a UUID, but it can be set later.
             :     TrackedPeerSnapshot() : TrackedPeer("") {}
             : 
             :     // Copy a given TrackedPeer.
             :     TrackedPeerSnapshot& operator=(const TrackedPeer& tracked_peer) {
             :       uuid = tracked_peer.uuid;
             :       is_new = tracked_peer.is_new;
             :       next_index = tracked_peer.next_index;
             :       last_received = tracked_peer.last_received;
             :       last_known_committed_index = tracked_peer.last_known_committed_index;
             :       is_last_exchange_successful = tracked_peer.is_last_exchange_successful;
             :       last_successful_communication_time = tracked_peer.last_successful_communication_time;
             :       needs_tablet_copy = tracked_peer.needs_tablet_copy;
             :       status_log_throttler = tracked_peer.status_log_throttler;
             :       return *this;
             :     }
             : 
             :     // UUID of the snapshot.
             :     std::string uuid;
             :   };
> since we're adding the assignment op anyway, likely better to just add it o
Sure, my reasoning for going down this route was for clarity that we're snapshotting the peer while keeping the original TrackedPeer untainted (ie leaving uuid const). I'll add assignment to the original; added clarity is probably unnecessary since we're only "snapshotting" at one place so far.


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 6:

as a repro strategy I'd maybe try that test (and/or some others in raft_consensus-itest) with --stress_cpu_threads=2 or something. if you can go higher than 2 without starting to have test timeouts then try that too.

-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 7:

(2 comments)

left an relevant question, the test lgtm and/or is nits

http://gerrit.cloudera.org:8080/#/c/6299/7/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS7, Line 328:   std::unique_lock<simple_spinlock> lock(peer_lock_);
             :   if (closed_) {
             :     return;
             :   }
             :   CHECK(request_pending_);
             :   request_pending_ = false;
what is the additional state this lock is protecting? covering the queue interaction below with the lock is likely to cause hard to debug deadlocks.


http://gerrit.cloudera.org:8080/#/c/6299/7/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS7, Line 131:   // A copy of TrackedPeer with a mutable UUID. This enables the existence of a
             :   // default constructor, which can be used to safely get a peer and keep it
             :   // alive outside of a critical section.
             :   struct TrackedPeerSnapshot : TrackedPeer {
             :   public:
             :     explicit TrackedPeerSnapshot(std::string uuid) :TrackedPeer(uuid) {
             :       this->uuid = uuid;
             :     }
             : 
             :     // Default constructor will not have a UUID, but it can be set later.
             :     TrackedPeerSnapshot() : TrackedPeer("") {}
             : 
             :     // Copy a given TrackedPeer.
             :     TrackedPeerSnapshot& operator=(const TrackedPeer& tracked_peer) {
             :       uuid = tracked_peer.uuid;
             :       is_new = tracked_peer.is_new;
             :       next_index = tracked_peer.next_index;
             :       last_received = tracked_peer.last_received;
             :       last_known_committed_index = tracked_peer.last_known_committed_index;
             :       is_last_exchange_successful = tracked_peer.is_last_exchange_successful;
             :       last_successful_communication_time = tracked_peer.last_successful_communication_time;
             :       needs_tablet_copy = tracked_peer.needs_tablet_copy;
             :       status_log_throttler = tracked_peer.status_log_throttler;
             :       return *this;
             :     }
             : 
             :     // UUID of the snapshot.
             :     std::string uuid;
             :   };
since we're adding the assignment op anyway, likely better to just add it or a regular copy ctor to the original TrackedPeer. That is ,the benefits of the default ctor (less code) are negated by the additional code of the assignment op and the new struct


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6299

to look at the new patch set (#8).

Change subject: [consensus] add lock coverage for peer queue
......................................................................

[consensus] add lock coverage for peer queue

A Peer or PeerQueue would lock without taking into account the
presence of a peer calling back after TabletCopy and updating its
members, which doesn't abide by external locks (like the queue lock).
This would result in races.

A PeerMessageQueue operation would lock with the assumption that its
access to a peer was uninterrupted by other threads. This assumption
is broken in cases where a PeerProxy object has a callback to access a
peer (e.g. proxy calls StartTabletCopy with a callback that processes
the response and updates the peer) from a separate thread (as is the
case with the RpcHeartBeater).

A Peer operation like SendNextRequest would lock under this assumption,
only to potentially be interrupted by the same callback. This callback
was not completely locked and would race with the send.

This patch adds lock coverage to these cases, both seen in
RaftConsensusITest.TestConfigChangeUnderLoad.

TSAN errors and descriptions can be found here:
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_descriptions.txt

Full TSAN logs can be found here:
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_type_1.txt
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_type_2.txt

Dist-test results before and after:
Before: Note that for these builds, status 1 is likely a data race.
http://dist-test.cloudera.org/job?job_id=awong.1489029805.26963
http://dist-test.cloudera.org/job?job_id=awong.1489086157.18460

After:
http://dist-test.cloudera.org/job?job_id=awong.1489096815.24680

Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
3 files changed, 65 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/6299/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6299/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS4, Line 342: ain a thread-safe copy of necessary members.
             :   // QueueState dependents:
             :   OpId preceding_id;
             :   int num_voters;
             :   int64_t current_term;
             : 
             :   // Peer dependents:
             :   bool is_new;
             :   bool peer_needs_tablet_copy;
             :   int next_index;
             :   logging::LogThrottler status_log_throttler;
             :   std::string peer_string;
             :   MonoDelta unreachable_time;
I'm considering adding a snapshot method or constructor to  TrackedPeer (currently it only allows explicit construction). It's not necessary, and may not be used anywhere else, but _might_ be prettier. wdyt?


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6299

to look at the new patch set (#5).

Change subject: [consensus] add lock coverage for peer queue
......................................................................

[consensus] add lock coverage for peer queue

Certain operations in PeerMessageQueue (e.g. RequestForPeer) were not
locking when they should have been (e.g. when reading or writing from
a peer). These missing locks resulted in races.

This patch adds lock coverage to these cases.

Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
---
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
2 files changed, 60 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/6299/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6299/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS4, Line 342: ain a thread-safe copy of necessary members.
             :   OpId preceding_id;
             :   int num_voters;
             :   int64_t current_term;
             :   TrackedPeerSnapshot peer;
             :   MonoDelta unreachable_time;
             :   {
             :     std::lock_guard<simple_spinlock> lock(queue_lock_);
             :     DCHECK_EQ(queue_state_.state, kQueueOpen);
             :     DCHECK_NE(uuid, local_peer_pb_.permanent_uuid());
             : 
             :     TrackedPeer* peer_ptr = FindPtrOrNull(peers_map_, uuid);
             :     if (PREDICT_FALSE(peer_pt
> that sounds good. I think that if you remove the 'const' from the uuid fiel
Opted to create a "snapshot" subclass so it maintains that a peer can't change its uuid and is hopefully clearer that this is a copy.


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6299

to look at the new patch set (#4).

Change subject: [consensus] add lock coverage for peer queue
......................................................................

[consensus] add lock coverage for peer queue

Certain operations in PeerMessageQueue (e.g. RequestForPeer) were not
locking when they should have been (e.g. when reading or writing from
a peer). These missing locks resulted in races.

This patch adds lock coverage to these cases.

Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
---
M src/kudu/consensus/consensus_queue.cc
1 file changed, 43 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/6299/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6299/7/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS7, Line 328:   std::unique_lock<simple_spinlock> lock(peer_lock_);
             :   if (closed_) {
             :     return;
             :   }
             :   CHECK(request_pending_);
             :   request_pending_ = false;
> This lock needs to protect controller_, which may be locked by SendNextRequ
could you just get all the statuses you need from controller_ tc_response and then release the lock before reaching back to the queue? even if this doesn't cause a deadlock right now, it can be a source of problems in the future.


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 6:

please add your findings to the commit message along with the links to the dist tests and the tsan errors themselves (maybe abbreviated)

-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6299/9/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS9, Line 85: // Copy a given TrackedPeer.
            :     TrackedPeer& operator=(const TrackedPeer& tracked_peer) {
            :       uuid = tracked_peer.uuid;
            :       is_new = tracked_peer.is_new;
            :       next_index = tracked_peer.next_index;
            :       last_received = tracked_peer.last_received;
            :       last_known_committed_index = tracked_peer.last_known_committed_index;
            :       is_last_exchange_successful = tracked_peer.is_last_exchange_successful;
            :       last_successful_communication_time = tracked_peer.last_successful_communication_time;
            :       needs_tablet_copy = tracked_peer.needs_tablet_copy;
            :       status_log_throttler = tracked_peer.status_log_throttler;
            :       last_seen_term_ = tracked_peer.last_seen_term_;
            :       return *this;
            :     }
since you removed the constness you can do TrackedPeer& operator=(const TrackedPeer& tracked_peer) = default;


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6299/1/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS1, Line 142:   {
             :     std::lock_guard<simple_spinlock> lock(queue_lock_);
             :     if (current_term != queue_state_.current_term) {
             :       CHECK_GT(current_term, queue_state_.current_term) << "Terms should only increase";
             :       queue_state_.first_index_in_current_term = boost::none;
             :       queue_state_.current_term = current_term;
             :     }
             : 
             :     queue_state_.committed_index = committed_index;
             :     queue_state_.majority_replicated_index = committed_index;
             :     queue_state_.active_config.reset(new RaftConfigPB(active_config));
             :     CHECK(IsRaftConfigVoter(local_peer_pb_.permanent_uuid(), *queue_state_.active_config))
             :         << SecureShortDebugString(local_peer_pb_) << " not a voter in config: "
             :         << SecureShortDebugString(*queue_state_.active_config);
             :     queue_state_.majority_size_ = MajoritySize(CountVoters(*queue_state_.active_config));
             :     queue_state_.mode = LEADER;
             : 
             :     LOG_WITH_PREFIX_UNLOCKED(INFO) << "Queue going to LEADER mode. State: "
             :         << queue_state_.ToString();
             :   }
             :   CheckPeersInActiveConfigIfLeaderUnlocked();
             : 
             :   // Reset last communication time with all peers to reset the clock on the
             :   // failure timeout.
             :   MonoTime now(MonoTime::Now());
             :   for (const PeersMap::value_type& entry : peers_map_) {
             :     std::lock_guard<simple_spinlock> lock(queue_lock_);
             :     entry.second->last_successful_communication_time = now;
             :   }
             :   time_manager_->SetLeaderMode();
> weary of this change. are you sure that it doesn't introduce possible races
Yeah, I realized this, definitely incorrect. Came from an earlier change that I meant to change. Has been changed in newer patches.


PS1, Line 347: {
> I think that extending the coverage of this lock is likely going to have a 
Hmm. So the reasoning there would be that since we're only reading from peer, a copy would be good enough.


PS1, Line 366:  
> extra space
Done


PS1, Line 569: std::lock_guard<simple_spinlock> lock(queue_lock_);
> don't thing this one is needed, since this code only runs on VLOG(3)
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6299/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS4, Line 342: ain a thread-safe copy of necessary members.
             :   // QueueState dependents:
             :   OpId preceding_id;
             :   int num_voters;
             :   int64_t current_term;
             : 
             :   // Peer dependents:
             :   bool is_new;
             :   bool peer_needs_tablet_copy;
             :   int next_index;
             :   logging::LogThrottler status_log_throttler;
             :   std::string peer_string;
             :   MonoDelta unreachable_time;
> I'm considering adding a snapshot method or constructor to  TrackedPeer (cu
that sounds good. I think that if you remove the 'const' from the uuid field (since const strings don't have a copy assignment op) you could get away with just adding "TrackedPeer() = default;" in c++11 land


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 6:

Can you paste the TSAN error and reproduction steps here?

I thought that we were avoiding races here based on only doing requests to peers from specific threads, and based on external locks. Curious to see what we missed.

-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6299

to look at the new patch set (#9).

Change subject: [consensus] add lock coverage for peer queue
......................................................................

[consensus] add lock coverage for peer queue

A Peer or PeerQueue would lock without taking into account the
presence of a peer calling back after TabletCopy and updating its
members, which doesn't abide by external locks (like the queue lock).
This would result in races.

A PeerMessageQueue operation would lock with the assumption that its
access to a peer was uninterrupted by other threads. This assumption
is broken in cases where a PeerProxy object has a callback to access a
peer (e.g. proxy calls StartTabletCopy with a callback that processes
the response and updates the peer) from a separate thread (as is the
case with the RpcHeartBeater).

A Peer operation like SendNextRequest would lock under this assumption,
only to potentially be interrupted by the same callback. This callback
was not completely locked and would race with the send.

This patch adds lock coverage to these cases, both seen in
RaftConsensusITest.TestConfigChangeUnderLoad.

TSAN errors and descriptions can be found here:
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_descriptions.txt

Full TSAN logs can be found here:
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_type_1.txt
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_type_2.txt

Dist-test results before and after:
Before: Note that for these builds, status 1 is likely a data race.
http://dist-test.cloudera.org/job?job_id=awong.1489029805.26963
http://dist-test.cloudera.org/job?job_id=awong.1489086157.18460

After:
http://dist-test.cloudera.org/job?job_id=awong.1489096815.24680

Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
3 files changed, 54 insertions(+), 32 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/6299/9
-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has submitted this change and it was merged.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


[consensus] add lock coverage for peer queue

A Peer or PeerQueue would lock without taking into account the
presence of a peer calling back after TabletCopy and updating its
members, which doesn't abide by external locks (like the queue lock).
This would result in races.

A PeerMessageQueue operation would lock with the assumption that its
access to a peer was uninterrupted by other threads. This assumption
is broken in cases where a PeerProxy object has a callback to access a
peer (e.g. proxy calls StartTabletCopy with a callback that processes
the response and updates the peer) from a separate thread (as is the
case with the RpcHeartBeater).

A Peer operation like SendNextRequest would lock under this assumption,
only to potentially be interrupted by the same callback. This callback
was not completely locked and would race with the send.

This patch adds lock coverage to these cases, both seen in
RaftConsensusITest.TestConfigChangeUnderLoad.

TSAN errors and descriptions can be found here:
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_descriptions.txt

Full TSAN logs can be found here:
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_type_1.txt
https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_type_2.txt

Dist-test results before and after:
Before: Note that for these builds, status 1 is likely a data race.
http://dist-test.cloudera.org/job?job_id=awong.1489029805.26963
http://dist-test.cloudera.org/job?job_id=awong.1489086157.18460

After:
http://dist-test.cloudera.org/job?job_id=awong.1489096815.24680

Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Reviewed-on: http://gerrit.cloudera.org:8080/6299
Reviewed-by: David Ribeiro Alves <dr...@apache.org>
Tested-by: Kudu Jenkins
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
3 files changed, 42 insertions(+), 32 deletions(-)

Approvals:
  David Ribeiro Alves: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 9:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6299/7//COMMIT_MSG
Commit Message:

Line 29: https://github.com/andrwng/kudu/blob/consensus-locks/docs/race_diagnostics/race_descriptions.txt
> Will update descriptions textwidth to be more readable.
Done


http://gerrit.cloudera.org:8080/#/c/6299/7/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS7, Line 328:   std::unique_lock<simple_spinlock> lock(peer_lock_);
             :   if (closed_) {
             :     return;
             :   }
             :   CHECK(request_pending_);
             :   request_pending_ = false;
> could you just get all the statuses you need from controller_ tc_response a
Will go the unlock route, as we talked about.


http://gerrit.cloudera.org:8080/#/c/6299/9/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS9, Line 85: // Copy a given TrackedPeer.
            :     TrackedPeer& operator=(const TrackedPeer& tracked_peer) {
            :       uuid = tracked_peer.uuid;
            :       is_new = tracked_peer.is_new;
            :       next_index = tracked_peer.next_index;
            :       last_received = tracked_peer.last_received;
            :       last_known_committed_index = tracked_peer.last_known_committed_index;
            :       is_last_exchange_successful = tracked_peer.is_last_exchange_successful;
            :       last_successful_communication_time = tracked_peer.last_successful_communication_time;
            :       needs_tablet_copy = tracked_peer.needs_tablet_copy;
            :       status_log_throttler = tracked_peer.status_log_throttler;
            :       last_seen_term_ = tracked_peer.last_seen_term_;
            :       return *this;
            :     }
> since you removed the constness you can do TrackedPeer& operator=(const Tra
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] add lock coverage for peer queue

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] add lock coverage for peer queue
......................................................................


Patch Set 6:

Seeing if I can get it to repro, but running dist-test on a TSAN build of RaftConsensusITest.TestConfigChangeUnderLoad will trigger it ~1% of the time (as well as some others, like TestAutoCreateReplica).

In both of the below cases, it's a race between NotifyPeerIsResponsiveDespiteError, which writes to peer (called from ProcessTabletCopyResponse) and RequestForPeer, which reads from peer.

TSAN error logs here:
TestConfigChangeUnderLoad: http://dist-test.cloudera.org:8080/diagnose?key=7a92ff6a-02d9-11e7-b98c-0242ac11000f
TestAutoCreateReplica: http://dist-test.cloudera.org:8080/diagnose?key=845fb902-0301-11e7-9988-0242ac11000f

-- 
To view, visit http://gerrit.cloudera.org:8080/6299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0e3f9603c471441d9b926cac3377fc6a5a2a1514
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No