You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ratis.apache.org by "Sammi Chen (Jira)" <ji...@apache.org> on 2021/10/29 12:58:00 UTC

[jira] [Updated] (RATIS-1418) Leader step down for not sending out heartbeat promptly

     [ https://issues.apache.org/jira/browse/RATIS-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammi Chen updated RATIS-1418:
------------------------------
    Description: 
In the current implemeation, leader will step down when it doesn't recevie any rpc from follower in maxTimeoutMs in follower's lastRpcResponseTime.

On the other side, grpcLogAppender decides to send out heartbeat based on the the lastRpcResponseTime and lastRpcSendTime, takes the late one, which makes in some cases heartbeat is not sent out promptly. 


Here are some logs captured.

Leader
#last heartbeat sent before leader step down
2021-10-29 14:39:09,612 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->f2a076b3-3b3d-479e-94f7-a9d8021f31ef-GrpcLogAppender-LogAppenderDaemon] DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->f2a076b3-3b3d-479e-94f7-a9d8021f31ef-GrpcLogAppender send request [heartbeat = true, callId = 263]
2021-10-29 14:39:09,614 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender-LogAppenderDaemon] DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender send request [heartbeat = true, callId = 265]

#last non-heartbeat sent before leader step down
2021-10-29 14:39:13,176 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender-LogAppenderDaemon] DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender send request [heartbeat = false, callId = 1287]
2021-10-29 14:39:13,176 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender-LogAppenderDaemon] DEBUG org.apache.ratis.server.leader.FollowerInfo: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240: nextIndex: updateIncreasingly 1023 -> 1024
2021-10-29 14:39:13,177 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender-LogAppenderDaemon] DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender send request [heartbeat = false, callId = 1288]

#leader step down
2021-10-29 14:39:15,549 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174-LeaderStateImpl] WARN org.apache.ratis.server.RaftServer$Division: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174-LeaderStateImpl: Lost leadership on term: 1. Election timeout: 5200ms. In charge for: 26147ms. Conf: 0: [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7|rpc:9.37.156.222:9856|admin:9.37.156.222:9857|client:9.37.156.222:9858|dataStream:|priority:1, f2a076b3-3b3d-479e-94f7-a9d8021f31ef|rpc:9.186.21.242:9856|admin:9.186.21.242:9857|client:9.186.21.242:9858|dataStream:|priority:0, 7219573a-06d6-4883-8b71-d6d676bf2240|rpc:9.186.21.247:9856|admin:9.186.21.247:9857|client:9.186.21.247:9858|dataStream:|priority:0], old=null
2021-10-29 14:39:15,550 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174-LeaderStateImpl] WARN org.apache.ratis.server.RaftServer$Division: Follower 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->f2a076b3-3b3d-479e-94f7-a9d8021f31ef(c0,m0,n1024, attendVote=true, lastRpcSendTime=2262, lastRpcResponseTime=5935)
2021-10-29 14:39:15,550 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174-LeaderStateImpl] WARN org.apache.ratis.server.RaftServer$Division: Follower 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240(c0,m0,n1024, attendVote=true, lastRpcSendTime=2372, lastRpcResponseTime=5933)
2021-10-29 14:39:15,550 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174-LeaderStateImpl] INFO org.apache.ratis.server.RaftServer$Division: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174: changes role from    LEADER to FOLLOWER at term 1 for StepDownReason:LOST_MAJORITY_HEARTBEATS

Follower
#the last heartbeat reply before leader step down
2021-10-29 14:39:09,621 [ForkJoinPool.commonPool-worker-57] DEBUG org.apache.ratis.server.RaftServer$Division: HEARTBEAT: 7219573a-06d6-4883-8b71-d6d676bf2240@group-83CE93F2D174: reply appendEntries(2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@id: "D-\034\277\350\213J\350\273\355\203\316\223\362\321t"
<-7219573a-06d6-4883-8b71-d6d676bf2240#265:OK-t1,SUCCESS,nextIndex=1,followerCommit=0)
2021-10-29 14:39:09,621 [ForkJoinPool.commonPool-worker-57] DEBUG org.apache.ratis.grpc.server.GrpcServerProtocolService: 7219573a-06d6-4883-8b71-d6d676bf2240: reply 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@id: "D-\034\277\350\213J\350\273\355\203\316\223\362\321t"
<-7219573a-06d6-4883-8b71-d6d676bf2240#265:OK-t1,SUCCESS,nextIndex=1,followerCommit=0

#the appendEntities, there are lots of such messages, I just captured one. 
2021-10-29 14:39:13,178 [grpc-default-executor-0] DEBUG org.apache.ratis.server.RaftServer$Division: 7219573a-06d6-4883-8b71-d6d676bf2240@group-83CE93F2D174: receive appendEntries(2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7, 1, callId = 956, (t:1, i:690), 0, true, commits[2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7:c0, f2a076b3-3b3d-479e-94f7-a9d8021f31ef:c0, 7219573a-06d6-4883-8b71-d6d676bf2240:c0], entries: (t:1, i:691), STATEMACHINELOGENTRY, 4082@client-D1BD5630718C

#the latest apendEntities received by follower
2021-10-29 14:39:14,667 [grpc-default-executor-0] DEBUG org.apache.ratis.server.RaftServer$Division: 7219573a-06d6-4883-8b71-d6d676bf2240@group-83CE93F2D174: receive appendEntries(2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7, 1, callId = 1287, (t:1, i:1021), 0, true, commits[2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7:c0, f2a076b3-3b3d-479e-94f7-a9d8021f31ef:c0, 7219573a-06d6-4883-8b71-d6d676bf2240:c0], entries: (t:1, i:1022), STATEMACHINELOGENTRY, 3068@client-B2121B2CF473






  was:
In the current implemeation, leader will step down when it doesn't recevie any rpc from follower in maxTimeoutMs in follower's lastRpcResponseTime.

On the other side, grpcLogAppender decides to send out heartbeat based on the the lastRpcResponseTime and lastRpcSendTime, takes the late one, which makes in some cases heartbeat is not sent out promptly. 


Here are some logs captured.

Leader


Follower





> Leader step down for not sending out heartbeat promptly
> -------------------------------------------------------
>
>                 Key: RATIS-1418
>                 URL: https://issues.apache.org/jira/browse/RATIS-1418
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Major
>
> In the current implemeation, leader will step down when it doesn't recevie any rpc from follower in maxTimeoutMs in follower's lastRpcResponseTime.
> On the other side, grpcLogAppender decides to send out heartbeat based on the the lastRpcResponseTime and lastRpcSendTime, takes the late one, which makes in some cases heartbeat is not sent out promptly. 
> Here are some logs captured.
> Leader
> #last heartbeat sent before leader step down
> 2021-10-29 14:39:09,612 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->f2a076b3-3b3d-479e-94f7-a9d8021f31ef-GrpcLogAppender-LogAppenderDaemon] DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->f2a076b3-3b3d-479e-94f7-a9d8021f31ef-GrpcLogAppender send request [heartbeat = true, callId = 263]
> 2021-10-29 14:39:09,614 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender-LogAppenderDaemon] DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender send request [heartbeat = true, callId = 265]
> #last non-heartbeat sent before leader step down
> 2021-10-29 14:39:13,176 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender-LogAppenderDaemon] DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender send request [heartbeat = false, callId = 1287]
> 2021-10-29 14:39:13,176 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender-LogAppenderDaemon] DEBUG org.apache.ratis.server.leader.FollowerInfo: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240: nextIndex: updateIncreasingly 1023 -> 1024
> 2021-10-29 14:39:13,177 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender-LogAppenderDaemon] DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240-GrpcLogAppender send request [heartbeat = false, callId = 1288]
> #leader step down
> 2021-10-29 14:39:15,549 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174-LeaderStateImpl] WARN org.apache.ratis.server.RaftServer$Division: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174-LeaderStateImpl: Lost leadership on term: 1. Election timeout: 5200ms. In charge for: 26147ms. Conf: 0: [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7|rpc:9.37.156.222:9856|admin:9.37.156.222:9857|client:9.37.156.222:9858|dataStream:|priority:1, f2a076b3-3b3d-479e-94f7-a9d8021f31ef|rpc:9.186.21.242:9856|admin:9.186.21.242:9857|client:9.186.21.242:9858|dataStream:|priority:0, 7219573a-06d6-4883-8b71-d6d676bf2240|rpc:9.186.21.247:9856|admin:9.186.21.247:9857|client:9.186.21.247:9858|dataStream:|priority:0], old=null
> 2021-10-29 14:39:15,550 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174-LeaderStateImpl] WARN org.apache.ratis.server.RaftServer$Division: Follower 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->f2a076b3-3b3d-479e-94f7-a9d8021f31ef(c0,m0,n1024, attendVote=true, lastRpcSendTime=2262, lastRpcResponseTime=5935)
> 2021-10-29 14:39:15,550 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174-LeaderStateImpl] WARN org.apache.ratis.server.RaftServer$Division: Follower 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174->7219573a-06d6-4883-8b71-d6d676bf2240(c0,m0,n1024, attendVote=true, lastRpcSendTime=2372, lastRpcResponseTime=5933)
> 2021-10-29 14:39:15,550 [2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174-LeaderStateImpl] INFO org.apache.ratis.server.RaftServer$Division: 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@group-83CE93F2D174: changes role from    LEADER to FOLLOWER at term 1 for StepDownReason:LOST_MAJORITY_HEARTBEATS
> Follower
> #the last heartbeat reply before leader step down
> 2021-10-29 14:39:09,621 [ForkJoinPool.commonPool-worker-57] DEBUG org.apache.ratis.server.RaftServer$Division: HEARTBEAT: 7219573a-06d6-4883-8b71-d6d676bf2240@group-83CE93F2D174: reply appendEntries(2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@id: "D-\034\277\350\213J\350\273\355\203\316\223\362\321t"
> <-7219573a-06d6-4883-8b71-d6d676bf2240#265:OK-t1,SUCCESS,nextIndex=1,followerCommit=0)
> 2021-10-29 14:39:09,621 [ForkJoinPool.commonPool-worker-57] DEBUG org.apache.ratis.grpc.server.GrpcServerProtocolService: 7219573a-06d6-4883-8b71-d6d676bf2240: reply 2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7@id: "D-\034\277\350\213J\350\273\355\203\316\223\362\321t"
> <-7219573a-06d6-4883-8b71-d6d676bf2240#265:OK-t1,SUCCESS,nextIndex=1,followerCommit=0
> #the appendEntities, there are lots of such messages, I just captured one. 
> 2021-10-29 14:39:13,178 [grpc-default-executor-0] DEBUG org.apache.ratis.server.RaftServer$Division: 7219573a-06d6-4883-8b71-d6d676bf2240@group-83CE93F2D174: receive appendEntries(2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7, 1, callId = 956, (t:1, i:690), 0, true, commits[2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7:c0, f2a076b3-3b3d-479e-94f7-a9d8021f31ef:c0, 7219573a-06d6-4883-8b71-d6d676bf2240:c0], entries: (t:1, i:691), STATEMACHINELOGENTRY, 4082@client-D1BD5630718C
> #the latest apendEntities received by follower
> 2021-10-29 14:39:14,667 [grpc-default-executor-0] DEBUG org.apache.ratis.server.RaftServer$Division: 7219573a-06d6-4883-8b71-d6d676bf2240@group-83CE93F2D174: receive appendEntries(2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7, 1, callId = 1287, (t:1, i:1021), 0, true, commits[2b5298c6-78ea-470f-a1f0-4db4e3f5a6e7:c0, f2a076b3-3b3d-479e-94f7-a9d8021f31ef:c0, 7219573a-06d6-4883-8b71-d6d676bf2240:c0], entries: (t:1, i:1022), STATEMACHINELOGENTRY, 3068@client-B2121B2CF473



--
This message was sent by Atlassian Jira
(v8.3.4#803005)