Posted to jira@kafka.apache.org by "Jack Foy (Jira)" <ji...@apache.org> on 2021/06/16 19:45:00 UTC

[jira] [Commented] (KAFKA-12241) Partition offline when ISR shrinks to leader and LogDir goes offline

    [ https://issues.apache.org/jira/browse/KAFKA-12241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364493#comment-17364493 ] 

Jack Foy commented on KAFKA-12241:
----------------------------------

In my opinion, this is a cleaner fix than the one proposed in https://issues.apache.org/jira/browse/KAFKA-3861 for the same problem.

> Partition offline when ISR shrinks to leader and LogDir goes offline
> --------------------------------------------------------------------
>
>                 Key: KAFKA-12241
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12241
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.2
>            Reporter: Noa Resare
>            Priority: Major
>
> This is a long-standing issue that we haven't previously tracked in a JIRA. We experience it roughly once per month on average, and we see the following sequence of events:
>  # A broker shrinks the ISR for a partition to just itself, even though the followers are at highWatermark: {{[Partition PARTITION broker=601] Shrinking ISR from 1501,601,1201,1801 to 601. Leader: (highWatermark: 432385279, endOffset: 432385280). Out of sync replicas: (brokerId: 1501, endOffset: 432385279) (brokerId: 1201, endOffset: 432385279) (brokerId: 1801, endOffset: 432385279).}}
>  # Around this time (in the case I have in front of me, 20ms earlier according to the logging subsystem) LogDirFailureChannel captures an Error while appending records to PARTITION due to a read-only filesystem.
>  # ~20 ms after the ISR shrink, LogDirFailureHandler offlines the partition: Logs for partitions LIST_OF_PARTITIONS are offline and logs for future partitions are offline due to failure on log directory /kafka/d6/data 
>  # ~50ms later the controller marks the replicas as offline from 601: message: [Controller id=901] Mark replicas LIST_OF_PARTITIONS on broker 601 as offline 
>  # ~2ms later the controller offlines the partition: [Controller id=901 epoch=4] Changed partition PARTITION state from OnlinePartition to OfflinePartition 
> To resolve this, someone needs to manually enable unclean leader election, which is obviously not ideal. Since the leader knows that all the followers removed from the ISR are at highWatermark, maybe it could convey that to the controller in the LeaderAndIsr response so that the controller could pick a new leader without having to resort to unclean leader election.
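
As a rough illustration of the idea in the quoted description, the check the leader could make before reporting to the controller might look like the sketch below. This is not Kafka code: the class CaughtUpReplicas and the method atOrAboveHighWatermark are hypothetical names, and the sketch assumes the leader has each removed follower's endOffset at hand, as the "Shrinking ISR" log line suggests. The reasoning is that records below highWatermark are the committed ones, so a follower whose endOffset has reached highWatermark should hold every committed record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Kafka.
final class CaughtUpReplicas {
    private CaughtUpReplicas() {}

    /**
     * Returns the broker ids whose log end offset has reached the leader's
     * high watermark. Under the assumption stated above, these replicas hold
     * all committed records and could take over without losing committed data.
     *
     * @param highWatermark the leader's high watermark at ISR shrink time
     * @param endOffsets    broker id -> log end offset of each removed follower
     */
    static List<Integer> atOrAboveHighWatermark(long highWatermark,
                                                Map<Integer, Long> endOffsets) {
        List<Integer> caughtUp = new ArrayList<>();
        for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
            if (entry.getValue() >= highWatermark) {
                caughtUp.add(entry.getKey());
            }
        }
        return caughtUp;
    }
}
```

With the values from the log line in step 1 (highWatermark 432385279, and brokers 1501, 1201 and 1801 each at endOffset 432385279), this check would report all three removed followers as caught up. How that list would actually be carried in the LeaderAndIsr response, and how the controller would use it, is left open here, as it is in the description.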


