Posted to users@kafka.apache.org by Martin Koběrský <mk...@samba.ai> on 2022/03/25 10:38:15 UTC
Kraft - Adding new controllers in production

Hello!

I'm wondering if there's a right way to add new controllers to an already
existing cluster without downtime.
I've tried the following:
I have three controllers joined in a cluster. One by one, I change each
controller's configuration to 4 voters, stop the controller, delete
quorum-state, and start the controller again. Once that is done for all 3
existing controllers, I start the fourth one. In most of my experiments there
was consensus on the leader after the 4th controller came up, but in some
cases no leader could be elected and I had to restart each controller again.
Although a leader was elected after the transition, I'm not sure whether the
cluster temporarily lost the ability to elect a leader during the transition.
I think this problem would potentially go away if there were at least 5
controllers to begin with.
Then I do the same thing but change the configurations to 5 voters. When I
start the fifth controller, it can't seem to figure out who the leader is and
keeps calling new elections indefinitely (as shown in the log below). If I
then restart all of the controllers one by one, the cluster settles on a
leader.
Is this the right way to go about it? I'm concerned about the indefinite
elections, and I'd also like the process to be as simple as possible.
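
For reference, the voter counts mentioned above can be put in numbers with a
small sketch of the usual majority rule for Raft-style quorums. This assumes
a simple majority of the configured voters is needed to elect a leader; the
function names are illustrative only and are not Kafka APIs.

```python
def majority(voters: int) -> int:
    """Smallest number of votes that is more than half of `voters`."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be down while a majority is still reachable."""
    return voters - majority(voters)


if __name__ == "__main__":
    # Voter counts from the steps described above: 3, then 4, then 5.
    for n in (3, 4, 5):
        print(f"voters={n} majority={majority(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Under that assumption, going from 3 to 4 voters raises the required majority
from 2 to 3 without tolerating any additional failure, while 5 voters
tolerate 2.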

Thanks and best regards, Martin.

[2022-03-25 09:16:44,233] INFO [RaftManager nodeId=5] Completed transition
to CandidateState(localId=5, epoch=2109, retries=1, electionTimeoutMs=1090)
(org.apache.kafka.raft.QuorumState)
[2022-03-25 09:16:44,234] INFO [RaftManager nodeId=5] Vote request
VoteRequestData(clusterId='ihf_kq2QSbmIRDQhTLSZYg',
topics=[TopicData(topicName='__cluster_metadata',
partitions=[PartitionData(partitionIndex=0, candidateEpoch=2109,
candidateId=4, lastOffsetEpoch=2004, lastOffset=623)])]) with epoch 2109 is
rejected (org.apache.kafka.raft.KafkaRaftClient)
[2022-03-25 09:16:44,292] INFO [RaftManager nodeId=5] Re-elect as candidate
after election backoff has completed (org.apache.kafka.raft.KafkaRaftClient)
[2022-03-25 09:16:45,200] INFO [RaftManager nodeId=5] Completed transition
to CandidateState(localId=5, epoch=2110, retries=2, electionTimeoutMs=1774)
(org.apache.kafka.raft.QuorumState)
[2022-03-25 09:16:45,201] INFO [RaftManager nodeId=5] Vote request
VoteRequestData(clusterId='ihf_kq2QSbmIRDQhTLSZYg',
topics=[TopicData(topicName='__cluster_metadata',
partitions=[PartitionData(partitionIndex=0, candidateEpoch=2110,
candidateId=4, lastOffsetEpoch=2004, lastOffset=623)])]) with epoch 2110 is
rejected (org.apache.kafka.raft.KafkaRaftClient)
[2022-03-25 09:16:45,305] INFO [RaftManager nodeId=5] Vote request
VoteRequestData(clusterId='ihf_kq2QSbmIRDQhTLSZYg',
topics=[TopicData(topicName='__cluster_metadata',
partitions=[PartitionData(partitionIndex=0, candidateEpoch=2110,
candidateId=2, lastOffsetEpoch=2100, lastOffset=627)])]) with epoch 2110 is
rejected (org.apache.kafka.raft.KafkaRaftClient)
[2022-03-25 09:16:45,307] INFO [RaftManager nodeId=5] Vote request
VoteRequestData(clusterId='ihf_kq2QSbmIRDQhTLSZYg',
topics=[TopicData(topicName='__cluster_metadata',
partitions=[PartitionData(partitionIndex=0, candidateEpoch=2110,
candidateId=3, lastOffsetEpoch=2039, lastOffset=626)])]) with epoch 2110 is
rejected (org.apache.kafka.raft.KafkaRaftClient)
[2022-03-25 09:16:45,308] INFO [RaftManager nodeId=5] Insufficient
remaining votes to become leader (rejected by [2, 3, 4]). We will backoff
before retrying election again (org.apache.kafka.raft.KafkaRaftClient)
[2022-03-25 09:16:45,412] INFO [RaftManager nodeId=5] Vote request
VoteRequestData(clusterId='ihf_kq2QSbmIRDQhTLSZYg',
topics=[TopicData(topicName='__cluster_metadata',
partitions=[PartitionData(partitionIndex=0, candidateEpoch=2110,
candidateId=1, lastOffsetEpoch=2079, lastOffset=627)])]) with epoch 2110 is
rejected (org.apache.kafka.raft.KafkaRaftClient)
[2022-03-25 09:16:45,607] INFO [RaftManager nodeId=5] Re-elect as candidate
after election backoff has completed (org.apache.kafka.raft.KafkaRaftClient)
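
To make the excerpt above easier to compare across candidates, here is a
rough sketch that pulls the fields out of the rejected vote requests. The
regular expression is written only against the log format shown in this
message, the helper names are hypothetical, and the ordering by
(lastOffsetEpoch, lastOffset) is the general Raft-style notion of a more
up-to-date log, not a statement about how Kafka decides a given vote.

```python
import re

# Fields of a rejected vote request, as they appear in the log above.
VOTE_RE = re.compile(
    r"candidateEpoch=(\d+),\s*candidateId=(\d+),\s*"
    r"lastOffsetEpoch=(\d+),\s*lastOffset=(\d+)[)\]]+\s*"
    r"with\s+epoch\s+\d+\s+is\s+rejected"
)

# Two entries copied from the log above, still wrapped as in the email.
SAMPLE = """\
[2022-03-25 09:16:45,305] INFO [RaftManager nodeId=5] Vote request
VoteRequestData(clusterId='ihf_kq2QSbmIRDQhTLSZYg',
topics=[TopicData(topicName='__cluster_metadata',
partitions=[PartitionData(partitionIndex=0, candidateEpoch=2110,
candidateId=2, lastOffsetEpoch=2100, lastOffset=627)])]) with epoch 2110 is
rejected (org.apache.kafka.raft.KafkaRaftClient)
[2022-03-25 09:16:45,307] INFO [RaftManager nodeId=5] Vote request
VoteRequestData(clusterId='ihf_kq2QSbmIRDQhTLSZYg',
topics=[TopicData(topicName='__cluster_metadata',
partitions=[PartitionData(partitionIndex=0, candidateEpoch=2110,
candidateId=3, lastOffsetEpoch=2039, lastOffset=626)])]) with epoch 2110 is
rejected (org.apache.kafka.raft.KafkaRaftClient)
"""


def rejected_vote_requests(log_text):
    """Return one dict per rejected vote request found in `log_text`."""
    # Undo the email line wrapping so each entry can match on one line.
    flat = " ".join(log_text.split())
    return [
        {
            "epoch": int(epoch),
            "candidate": int(candidate),
            "last_offset_epoch": int(last_epoch),
            "last_offset": int(last_offset),
        }
        for epoch, candidate, last_epoch, last_offset in VOTE_RE.findall(flat)
    ]


def most_up_to_date_first(votes):
    """Order candidates by (lastOffsetEpoch, lastOffset), highest first."""
    return sorted(
        votes,
        key=lambda v: (v["last_offset_epoch"], v["last_offset"]),
        reverse=True,
    )


if __name__ == "__main__":
    for vote in most_up_to_date_first(rejected_vote_requests(SAMPLE)):
        print(vote)
```

Running the same helper over the full log would list every candidate whose
request node 5 rejected, together with the log position each one reported.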