Posted to dev@kafka.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/09/26 21:26:21 UTC

[jira] [Commented] (KAFKA-4214) kafka-reassign-partitions fails all the time when brokers are bounced during reassignment

    [ https://issues.apache.org/jira/browse/KAFKA-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524218#comment-15524218 ] 

ASF GitHub Bot commented on KAFKA-4214:
---------------------------------------

GitHub user apurvam opened a pull request:

    https://github.com/apache/kafka/pull/1910

    KAFKA-4214: kafka-reassign-partitions fails all the time when brokers are bounced during reassignment

    There is a corner-case bug: if the controller and a broker receiving a
    new replica are bounced at the same time during partition reassignment,
    the reassignment fails.
    
    The cause is a block of code in the KafkaController which fails the
    reassignment if aliveNewReplicas != newReplicas, i.e. if some of the
    new replicas are offline at the time a controller fails over.
    
    The fix is to have the controller listen for ISR change events even for
    new replicas which are not alive when the controller boots up. Once
    those replicas come online and join the ISR, the new controller will
    detect this and mark the reassignment as successful.
    
    Interestingly, the block of code in question was introduced in
    KAFKA-990, where a concern about this exact scenario was raised :)
    
    This bug was revealed by the system tests in https://github.com/apache/kafka/pull/1904.
    The relevant tests will be enabled in either this PR or a follow-up PR once PR-1904 is merged.
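
The cause and fix described above can be modelled in a few lines. This is
only a minimal sketch under stated assumptions: the function names and the
"failed" / "watch_isr" return values are hypothetical and are not the
KafkaController API, and the completion rule (every new replica is in the
ISR) is taken from the description rather than from the patch itself.

```python
# Illustrative model only. All names here are hypothetical and do not
# correspond to real KafkaController methods.

def decide_before_fix(new_replicas, alive_brokers):
    """Pre-fix behaviour as described: fail if any new replica is offline
    at the moment the new controller takes over."""
    alive_new_replicas = [r for r in new_replicas if r in alive_brokers]
    if alive_new_replicas != new_replicas:
        return "failed"
    return "watch_isr"


def decide_after_fix(new_replicas, alive_brokers):
    """Post-fix behaviour as described: keep listening for ISR changes,
    even if some new replicas are offline at failover."""
    return "watch_isr"


def reassignment_complete(new_replicas, isr):
    """Assumed completion rule: every new replica has joined the ISR."""
    return set(new_replicas).issubset(isr)


new_replicas = [1, 2]
alive_at_failover = {1}  # broker 2 is being bounced during failover

print(decide_before_fix(new_replicas, alive_at_failover))
print(decide_after_fix(new_replicas, alive_at_failover))
print(reassignment_complete(new_replicas, isr=[1]))
print(reassignment_complete(new_replicas, isr=[1, 2]))
```

In this model the pre-fix decision gives up as soon as broker 2 is missing,
while the post-fix decision keeps watching, so the reassignment can still be
marked complete once broker 2 rejoins the ISR.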


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apurvam/kafka KAFKA-4214

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/1910.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1910
    
----
commit 7f9814466592f549e8fcde914815c35dcb5a046d
Author: Apurva Mehta <ap...@gmail.com>
Date:   2016-09-26T21:11:44Z

    There is a corner-case bug: if the controller and a broker receiving a
    new replica are bounced at the same time during partition reassignment,
    the reassignment fails.
    
    The cause is a block of code in the KafkaController which fails the
    reassignment if aliveNewReplicas != newReplicas, i.e. if some of the
    new replicas are offline at the time a controller fails over.
    
    The fix is to have the controller listen for ISR change events even for
    new replicas which are not alive when the controller boots up. Once
    those replicas come online and join the ISR, the new controller will
    detect this and mark the reassignment as successful.
    
    Interestingly, the block of code in question was introduced in
    KAFKA-990, where a concern about this exact scenario was raised :)

----


> kafka-reassign-partitions fails all the time when brokers are bounced during reassignment
> -----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4214
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4214
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>
> Due to KAFKA-4204, we never realized that the existing system test for reassignment would always fail when brokers were bounced mid-process. This happens reliably, even for topics with varying numbers of partitions and varying replication factors.
> In particular, we see errors like this in the logs when the brokers are bounced: 
> {noformat}
> Status of partition reassignment:
> ERROR: Assigned replicas (1,2) don't match the list of replicas for reassignment (1) for partition [test_topic,2]
> Reassignment of partition [test_topic,1] completed successfully
> Reassignment of partition [test_topic,2] failed
> Reassignment of partition [test_topic,3] completed successfully
> Reassignment of partition [test_topic,0] is still in progress
> {noformat}
> Currently, the tests which bounce brokers during reassignment are disabled until this bug is fixed.
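
For checking output like the status report quoted above, a small parser can
pick out the per-partition results. This is a sketch that assumes the line
format is exactly as shown in the quoted log; the function name
parse_reassignment_status is hypothetical and is not part of Kafka or its
system tests.

```python
import re

# Matches per-partition status lines in the format shown in the quoted log.
STATUS_RE = re.compile(
    r"^Reassignment of partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] "
    r"(?P<status>completed successfully|failed|is still in progress)$"
)


def parse_reassignment_status(output):
    """Return {(topic, partition): status} for every status line found.

    Lines that do not match (such as the ERROR line) are ignored.
    """
    statuses = {}
    for line in output.splitlines():
        match = STATUS_RE.match(line.strip())
        if match:
            key = (match.group("topic"), int(match.group("partition")))
            statuses[key] = match.group("status")
    return statuses


SAMPLE = """\
Status of partition reassignment:
ERROR: Assigned replicas (1,2) don't match the list of replicas for reassignment (1) for partition [test_topic,2]
Reassignment of partition [test_topic,1] completed successfully
Reassignment of partition [test_topic,2] failed
Reassignment of partition [test_topic,3] completed successfully
Reassignment of partition [test_topic,0] is still in progress
"""

statuses = parse_reassignment_status(SAMPLE)
failed = sorted(key for key, status in statuses.items() if status == "failed")
print(failed)
```

A test harness could treat any "failed" entry as a hard failure and poll
again while any partition is still in progress.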


