Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2017/09/12 04:52:29 UTC

[kudu-CR] KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

Hello David Ribeiro Alves, Mike Percy, Adar Dembo,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/8037

to review the following change.

Change subject: KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.
......................................................................

KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

The Raft leader's behavior on a timeout is to simply retry the request,
potentially aggregating more data into the new attempt if new data is
waiting in the queue.

However, as described in the JIRA, this behavior is counterproductive in
the case that the network pipe or associated reactor thread is
saturated. The original request may be in the middle of transmission
already, and so the retry ends up re-sending bytes which have already
been sent, increasing "throughput" but not increasing "goodput".

The original Raft timeout was set to 1 second mainly due to KUDU-699, an
old bug in which the leader would block waiting on outstanding requests
to followers before it would step down. That was fixed quite a long time
back, though, so there is no longer any good reason to have such a short
timeout on a Raft request.

This patch bumps the default timeout to 30 seconds. I tested this on an
8-node cluster by using iptables to inject 1% packet loss on all nodes
and running an insertion workload as described in the JIRA. Without the
patch, if I did a 'kill -STOP' of a node and waited a couple seconds
before allowing it to continue, I would see that node log "deduplicated
request" messages for 30-60 seconds before it eventually caught up.
During that time, the tablet was effectively using only two replicas,
causing increased latency, etc.

With the higher timeout, I didn't see these messages, and the unpaused
replica caught up much more quickly.

Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
---
M src/kudu/consensus/consensus_peers.cc
1 file changed, 1 insertion(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/37/8037/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8037
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8037/1/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 51: DEFINE_int32(consensus_rpc_timeout_ms, 30000,
This doesn't appear to prevent stacking, does it? I don't see a mechanism for the heartbeater not to trigger Peer::SignalRequest() while a consensus request is outstanding. Is that what we want?


-- 

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8037 )

Change subject: KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.
......................................................................

KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

The Raft leader's behavior on a timeout is to simply retry the request,
potentially aggregating more data into the new attempt if new data is
waiting in the queue.

However, as described in the JIRA, this behavior is counterproductive in
the case that the network pipe or associated reactor thread is
saturated. The original request may be in the middle of transmission
already, and so the retry ends up re-sending bytes which have already
been sent, increasing "throughput" but not increasing "goodput".

The original Raft timeout was set to 1 second mainly due to KUDU-699, an
old bug in which the leader would block waiting on outstanding requests
to followers before it would step down. That was fixed quite a long time
back, though, so there is no longer any good reason to have such a short
timeout on a Raft request.

This patch bumps the default timeout to 30 seconds. I tested this on an
8-node cluster by using iptables to inject 1% packet loss on all nodes
and running an insertion workload as described in the JIRA. Without the
patch, if I did a 'kill -STOP' of a node and waited a couple seconds
before allowing it to continue, I would see that node log "deduplicated
request" messages for 30-60 seconds before it eventually caught up.
During that time, the tablet was effectively using only two replicas,
causing increased latency, etc.

With the higher timeout, I didn't see these messages, and the unpaused
replica caught up much more quickly.

Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Reviewed-on: http://gerrit.cloudera.org:8080/8037
Reviewed-by: David Ribeiro Alves <da...@gmail.com>
Reviewed-by: Mike Percy <mp...@apache.org>
Tested-by: Kudu Jenkins
---
M src/kudu/consensus/consensus_peers.cc
1 file changed, 1 insertion(+), 1 deletion(-)

Approvals:
  David Ribeiro Alves: Looks good to me, approved
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified

-- 

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Gerrit-Change-Number: 8037
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.
......................................................................


Patch Set 1:

Seems this different timing may have exposed a new TSAN warning. Will look into it tomorrow.

-- 

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.
......................................................................


Patch Set 1: Code-Review+2

-- 

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

Posted by "Jean-Daniel Cryans (Code Review)" <ge...@cloudera.org>.
Jean-Daniel Cryans has posted comments on this change.

Change subject: KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8037/1//COMMIT_MSG
Commit Message:

PS1, Line 32: causing increased latency
That part is less intuitive to me, where's the slowdown coming from? The node you paused happened to be the fastest of the two followers?


-- 

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.
......................................................................


Patch Set 1:

Does this have any nasty interaction with failure detection in situations other than the one described in the JIRA? It seems like it might, since the previous timeout was less than 3x the heartbeat interval and the new one isn't.

-- 

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.
......................................................................


Patch Set 1: Code-Review+2

-- 

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8037 )

Change subject: KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.
......................................................................


Patch Set 1:

Are we planning on merging this patch?


-- 

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Gerrit-Change-Number: 8037
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 02 Oct 2017 21:52:39 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.
......................................................................


Patch Set 1:

(2 comments)

> Does this have any nasty interaction with failure detection in situations other than the one described in the JIRA? It seems like it might, since the previous timeout was less than 3x the heartbeat interval and the new one isn't.

I don't think so.

Let's look at the case where the network is slow such that each batch takes 1.1sec to transmit. In the old code, we'd have:

t     leader            follower
------------------------------------
0.0   send 1
1.0   timeout
1.01  send 2 <queued>
1.1   send 2 <on wire>  received 1 
2.01  timeout
2.02  send 3 <queued>
2.21  send 3 <on wire>  received 2
...

i.e. even though the leader was sending a new request once every second due to the timeout, the follower receives the requests only at the rate of one every 1.1 sec because of the transmission bottleneck.

With the new code, the leader will just wait the full 1.1sec before sending the new request. So, the follower will still be receiving requests once every 1.1sec, but now all of the wire bandwidth will be used for "goodput" instead of each request getting successively longer and longer until reaching the max-batch-size.

It's true that if the network is slow enough that the transmission of a single batch takes more than the failure time (3x heartbeat) then the follower will be triggering pre-elections in between each request. But I think that was already the case before, and if anything this should probably reduce the frequency of it occurring, again by reducing wasted network traffic. To avoid this I suppose we could have a dynamic transmission size which gets smaller on timeouts, or some out-of-band "liveness only" heartbeat sent via UDP, or somesuch. But all of those seem like incremental work that we could do on top of this fix.
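The arithmetic above can be put in a toy model (illustrative Python, not Kudu code; the 1.1 sec transmit time and 1 sec timeout are the numbers from the timeline above, and the function names are made up for the sketch):

```python
# Toy model of a saturated link: the wire serializes batches, so the
# follower receives one batch per transmit interval no matter how
# eagerly the leader retries. Retrying before a transmission finishes
# only queues duplicate data that the follower must deduplicate.

TRANSMIT_SEC = 1.1  # time for one batch to cross the saturated wire

def arrival_times(num_batches, transmit=TRANSMIT_SEC):
    """Times at which the follower finishes receiving each batch."""
    return [round(transmit * i, 2) for i in range(1, num_batches + 1)]

def wasted_attempts_per_batch(transmit, timeout):
    """Retries fired while the original transmission is still in flight."""
    if timeout >= transmit:
        return 0  # new behavior: the leader waits out the transmission
    return int(transmit // timeout)
```

With the old 1 sec timeout, wasted_attempts_per_batch(1.1, 1.0) is 1: every delivered batch is preceded by a duplicate attempt, which is what the "deduplicated request" log messages reflect. With a 30 sec timeout it is 0, and arrival_times() is unchanged either way, i.e. the retries buy no goodput.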

http://gerrit.cloudera.org:8080/#/c/8037/1//COMMIT_MSG
Commit Message:

PS1, Line 32: causing increased latency
> That part is less intuitive to me, where's the slowdown coming from? The no
Just that when you only have two replicas, you are always waiting on max(latency) across the two. When you have three replicas, you are taking median(latency) across the three. So naturally the latency with two replicas is almost always going to be worse than the latency with three replicas, if the replicas' individual latencies are all drawn from the same distribution (i.e. it would not work this way if one of the replicas is consistently slower or farther away).
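The max-vs-median effect is easy to check numerically. A hedged sketch (toy Python, with i.i.d. exponential ack latencies standing in for replica response times; not a model of actual Kudu timings):

```python
import random
import statistics

def write_latency(ack_latencies, quorum):
    # A majority-ack write completes when the quorum-th fastest ack arrives.
    return sorted(ack_latencies)[quorum - 1]

random.seed(1)
trials = [[random.expovariate(1.0) for _ in range(3)] for _ in range(20000)]

# Three healthy replicas, quorum of 2: wait for the median ack.
healthy = statistics.mean(write_latency(t, 2) for t in trials)

# One replica paused, so only two count: wait for max(latency) of the two.
degraded = statistics.mean(write_latency(t[:2], 2) for t in trials)
```

For Exp(1) latencies the expected median of three is 5/6 while the expected max of two is 3/2, so the degraded (two-replica) case is consistently slower even though no individual replica got worse.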


http://gerrit.cloudera.org:8080/#/c/8037/1/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 51: DEFINE_int32(consensus_rpc_timeout_ms, 30000,
> This doesn't appear to prevent stacking, does it? I don't see a mechanism f
Peer::SignalRequest has this code:

  // Only allow one request at a time.
  if (request_pending_) {
    return;
  }
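For illustration, the guard that code implements can be sketched as follows (toy Python, not the actual C++; the class and method names are simplified stand-ins for Peer::SignalRequest and its response path):

```python
import threading

class Peer:
    """Sketch of a peer that allows only one in-flight request, so
    heartbeat-triggered signals cannot stack requests behind a slow one."""

    def __init__(self):
        self._lock = threading.Lock()
        self._request_pending = False
        self.sends = 0  # count of requests actually sent

    def signal_request(self):
        with self._lock:
            # Only allow one request at a time (mirrors request_pending_).
            if self._request_pending:
                return False
            self._request_pending = True
        self.sends += 1  # stand-in for actually sending the RPC
        return True

    def on_response(self):
        # A reply (or timeout) clears the flag, letting the next
        # signal_request() send again.
        with self._lock:
            self._request_pending = False
```

So while a consensus request is outstanding, further signals are no-ops rather than stacked RPCs, which is why the longer timeout does not cause request pile-up.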


-- 

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8037 )

Change subject: KUDU-1788. Increase Raft RPC timeout to 30sec to avoid fruitless retries.
......................................................................


Patch Set 1:

yea, sorry, was traveling much of the last couple weeks. Will merge now assuming it still applies :)


-- 

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5f47dc006dc3dfb1659a224172e1905b6bf3d2a4
Gerrit-Change-Number: 8037
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 05 Oct 2017 03:19:33 +0000
Gerrit-HasComments: No