Posted to dev@kafka.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/05/11 03:45:13 UTC

[jira] [Commented] (KAFKA-3694) Transient system test failure ReplicationTest.test_replication_with_broker_failure.security_protocol

    [ https://issues.apache.org/jira/browse/KAFKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279449#comment-15279449 ] 

ASF GitHub Bot commented on KAFKA-3694:
---------------------------------------

GitHub user hachikuji opened a pull request:

    https://github.com/apache/kafka/pull/1365

    KAFKA-3694: Ensure broker Zk deregistration prior to restart in ReplicationTest

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hachikuji/kafka KAFKA-3694

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/1365.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1365
    
----
commit 50630af74bcf78abe2b3d2a5c07b3347a914543e
Author: Jason Gustafson <ja...@confluent.io>
Date:   2016-05-11T00:16:42Z

    KAFKA-3694: Ensure broker Zk deregistration prior to restart in ReplicationTest

----


> Transient system test failure ReplicationTest.test_replication_with_broker_failure.security_protocol
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-3694
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3694
>             Project: Kafka
>          Issue Type: Bug
>          Components: system tests
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>
> We've seen this failure in several recent builds:
> {code}
> ====================================================================================================
> test_id:    2016-05-10--001.kafkatest.tests.core.replication_test.ReplicationTest.test_replication_with_broker_failure.security_protocol=PLAINTEXT.failure_mode=hard_bounce.broker_type=leader
> status:     FAIL
> run time:   2 minutes 1.184 seconds
>     Kafka server didn't finish startup
> Traceback (most recent call last):
>   File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/tests/runner.py", line 106, in run_all_tests
>     data = self.run_single_test()
>   File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/tests/runner.py", line 162, in run_single_test
>     return self.current_test_context.function(self.current_test)
>   File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/mark/_mark.py", line 331, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/replication_test.py", line 157, in test_replication_with_broker_failure
>     self.run_produce_consume_validate(core_test_action=lambda: failures[failure_mode](self, broker_type))
>   File "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py", line 79, in run_produce_consume_validate
>     raise e
> TimeoutError: Kafka server didn't finish startup
> {code}
> After some investigation, the problem seems to be caused by an unexpected partition leader change, which the controller triggers proactively when the preferred leader becomes alive again. The test currently assumes that it is safe to restart the broker as soon as it observes a leadership change, since such a change is typically caused by a Zk session timeout. In this case, however, the session hasn't actually expired when the leadership change occurs. So after starting up, the broker sees its brokerId still registered and immediately shuts down, which causes the test failure above. To fix the problem, we should probably have a stronger check to ensure that the broker has actually been deregistered from Zk prior to restarting.
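
For illustration only, the "stronger check" described above could look roughly like the following. This is a minimal sketch, not the code from the pull request: wait_until_deregistered and the is_registered callable are hypothetical names, and is_registered is assumed to wrap whatever Zk query the test harness offers (for example, checking whether the broker's ephemeral node under /brokers/ids still exists).

```python
import time


def wait_until_deregistered(is_registered, broker_id, timeout_sec=30.0, backoff_sec=0.5):
    """Block until the broker is no longer registered, or raise on timeout.

    is_registered is a hypothetical callable taking a broker id and returning
    True while the broker's registration is still present in Zk. It is an
    assumption of this sketch, not an existing kafkatest API.
    """
    deadline = time.time() + timeout_sec
    while True:
        if not is_registered(broker_id):
            # Registration is gone, so a restart should no longer find a
            # stale brokerId entry.
            return
        if time.time() >= deadline:
            raise RuntimeError(
                "Broker %d still registered in Zk after %.1f seconds"
                % (broker_id, timeout_sec))
        time.sleep(backoff_sec)
```

In a test, a call such as wait_until_deregistered(check_fn, broker_id) would sit between observing the leadership change and restarting the broker, so the restart no longer relies on the leadership change alone as evidence of session expiry.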



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)