Posted to jira@kafka.apache.org by "Nico Meyer (JIRA)" <ji...@apache.org> on 2018/10/09 15:02:00 UTC

[jira] [Comment Edited] (KAFKA-3410) Unclean leader election and "Halting because log truncation is not allowed"

    [ https://issues.apache.org/jira/browse/KAFKA-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643575#comment-16643575 ] 

Nico Meyer edited comment on KAFKA-3410 at 10/9/18 3:01 PM:
------------------------------------------------------------

This problem is not fixed by KAFKA-1211, at least not for a failed disk that needs to be replaced: it still occurs if the broker with the failing disk becomes the only member of the ISR just before failing and unclean leader election is disabled.

I would assume that this is not an uncommon problem. In our case it seems that a faulty disk on the leader was extremely slow just before producing hard I/O errors, which in turn blocked the fetches from the followers for longer than replica.lag.time.max.ms. The leader therefore removed both of the followers from the ISR about half a second before taking the partitions on the faulty disk offline. I think this is not desirable, since the followers were actually up to date. I believe the error is that the time of the follower's last fetch that reached the leader's LEO is compared against the wall clock; it should instead be compared against the time of the last write to the log.
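
To make that concrete, here is a minimal sketch of the two comparisons, in Java with made-up names. The broker's actual ISR shrink logic is Scala and more involved, so treat this purely as an illustration of the argument:

{code:java}
// Illustration only: hypothetical names, not the actual broker implementation.
public final class IsrShrinkCheckSketch {

    /**
     * Roughly the current behaviour: a follower is dropped from the ISR when its
     * last caught-up time is older than replica.lag.time.max.ms measured against
     * the wall clock. If the leader's disk stalls, this expires even though the
     * followers have everything that was ever written.
     */
    static boolean shouldShrinkCurrent(long nowMs,
                                       long followerLastCaughtUpMs,
                                       long replicaLagTimeMaxMs) {
        return nowMs - followerLastCaughtUpMs > replicaLagTimeMaxMs;
    }

    /**
     * The alternative suggested above: measure the follower's lag against the time
     * of the last append to the leader's log, so a fully caught-up follower is not
     * dropped while the leader itself is stuck on a dying disk.
     */
    static boolean shouldShrinkSuggested(long leaderLastAppendMs,
                                         long followerLastCaughtUpMs,
                                         long replicaLagTimeMaxMs) {
        return leaderLastAppendMs - followerLastCaughtUpMs > replicaLagTimeMaxMs;
    }
}
{code}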

But let's say the followers were removed from the ISR for some other reason. At that point the partition is offline, and one option to proceed is to enable unclean leader election. This can lead to lost data, even if min.insync.replicas=2 is used and all producers use acks=all. :(
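
For reference, this is the client-side setup being described; a minimal sketch with a placeholder bootstrap address and topic name, while the min.insync.replicas=2 part lives in the topic (or broker) config:

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Producer configured for the strongest guarantees discussed above.
public final class AcksAllProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // The topic additionally carries min.insync.replicas=2. Even so, if the
        // partition later goes offline and unclean leader election is turned on to
        // recover it, the replica that wins the election is not guaranteed to have
        // every acknowledged record, which is the data loss mentioned above.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("some-topic", "key", "value"));
        }
    }
}
{code}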

Another option is to replace the faulty disk and restart the broker, which leads to the shutdown of the followers described in this issue. Even worse, if unclean leader election is enabled to mitigate that problem, all the logs are truncated to offset 0. I would prefer it if the broker could not become leader for a partition after its log dir has been deleted.
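
For readers hitting this, the follower-side behaviour behind the FATAL message quoted further down can be summarised roughly as follows (illustrative Java with invented names; the real logic is the Scala replica fetcher code):

{code:java}
// Illustration only: simplified view of what a follower does when the leader it
// fetches from reports an end offset smaller than the follower's own.
public final class TruncationDecisionSketch {

    static void handleLeaderBehindFollower(long leaderEndOffset,
                                           long followerEndOffset,
                                           boolean uncleanLeaderElectionEnabled) {
        if (leaderEndOffset >= followerEndOffset) {
            return; // normal case, nothing to give up
        }
        if (!uncleanLeaderElectionEnabled) {
            // The halt described in this issue: rather than silently throwing data
            // away, the broker stops.
            throw new IllegalStateException(
                    "Halting because log truncation is not allowed");
        }
        // With unclean leader election enabled the follower instead truncates down
        // to the leader's end offset, i.e. to offset 0 if the leader came back with
        // a freshly wiped log dir. That is the truncation complained about above.
        truncateLogTo(leaderEndOffset);
    }

    private static void truncateLogTo(long offset) {
        // placeholder for the actual log truncation
    }
}
{code}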

I just confirmed that the same problem still exists in 2.0.0.

Properly recovering from this problem in the middle of the night without messing up is pretty hard:
 * Shut down the failed leader.
 * For each affected partition, check which broker has the highest LEO. Is there a tool for that? kafka-log-dirs.sh is a good start, but it only returns the number of bytes in the log, which is not equivalent.
 * Write a JSON file for kafka-reassign-partitions.sh that lists only the broker with the highest LEO for each partition, or at least lists it as the preferred leader. Execute it and save the original partition assignment.
 * Enable unclean leader election for all affected topics (see the sketch after this list).
 * Wait for the new leaders to be elected.
 * Disable unclean leader election again.
 * Replace the disk on the failed broker.
 * Reinstate the original partition assignment saved in step 3.
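
As a sketch of steps 3 to 6, with placeholder topic names and addresses: the reassignment JSON for kafka-reassign-partitions.sh has the usual form {"version":1,"partitions":[{"topic":"t","partition":0,"replicas":[2]}]}, and the unclean leader election toggle can be done per topic through the AdminClient (2.0-era alterConfigs):

{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

// Sketch of steps 4 and 6: flipping unclean.leader.election.enable for one topic.
// Bootstrap address and topic name are placeholders.
public final class ToggleUncleanElectionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "affected-topic");

            setUncleanElection(admin, topic, true);   // step 4
            // ... wait for the new leader to be elected (step 5) ...
            setUncleanElection(admin, topic, false);  // step 6
        }
    }

    private static void setUncleanElection(AdminClient admin,
                                           ConfigResource topic,
                                           boolean enabled) throws Exception {
        Config config = new Config(Collections.singleton(
                new ConfigEntry("unclean.leader.election.enable",
                        Boolean.toString(enabled))));
        admin.alterConfigs(Collections.singletonMap(topic, config)).all().get();
    }
}
{code}

Note that this pre-2.3 alterConfigs call is documented as non-incremental (it replaces the topic's dynamic config set), so doing the same change through kafka-configs.sh is usually the safer route in practice.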

 

 


> Unclean leader election and "Halting because log truncation is not allowed"
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-3410
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3410
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>            Reporter: James Cheng
>            Priority: Major
>              Labels: reliability
>
> I ran into a scenario where one of my brokers would continually shutdown, with the error message:
> [2016-02-25 00:29:39,236] FATAL [ReplicaFetcherThread-0-1], Halting because log truncation is not allowed for topic test, Current leader 1's latest offset 0 is less than replica 2's latest offset 151 (kafka.server.ReplicaFetcherThread)
> I managed to reproduce it with the following scenario:
> 1. Start broker1, with unclean.leader.election.enable=false
> 2. Start broker2, with unclean.leader.election.enable=false
> 3. Create topic, single partition, with replication-factor 2.
> 4. Write data to the topic.
> 5. At this point, both brokers are in the ISR. Broker1 is the partition leader.
> 6. Ctrl-Z on broker2. (Simulates a GC pause or a slow network) Broker2 gets dropped out of ISR. Broker1 is still the leader. I can still write data to the partition.
> 7. Shutdown Broker1. Hard or controlled, doesn't matter.
> 8. rm -rf the log directory of broker1. (This simulates a disk replacement or full hardware replacement)
> 9. Resume broker2. It attempts to connect to broker1, but doesn't succeed because broker1 is down. At this point, the partition is offline. Can't write to it.
> 10. Resume broker1. Broker1 resumes leadership of the topic. Broker2 attempts to join ISR, and immediately halts with the error message:
> [2016-02-25 00:29:39,236] FATAL [ReplicaFetcherThread-0-1], Halting because log truncation is not allowed for topic test, Current leader 1's latest offset 0 is less than replica 2's latest offset 151 (kafka.server.ReplicaFetcherThread)
> I am able to recover by setting unclean.leader.election.enable=true on my brokers.
> I'm trying to understand a couple things:
> * In step 10, why is broker1 allowed to resume leadership even though it has no data?
> * In step 10, why is it necessary to stop the entire broker due to one partition that is in this state? Wouldn't it be possible for the broker to continue to serve traffic for all the other topics, and just mark this one as unavailable?
> * Would it make sense to allow an operator to manually specify which broker they want to become the new master? This would give me more control over how much data loss I am willing to handle. In this case, I would want broker2 to become the new master. Or, is that possible and I just don't know how to do it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)