You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Ewen Cheslack-Postava (JIRA)" <ji...@apache.org> on 2017/01/24 21:20:26 UTC

[jira] [Resolved] (KAFKA-4673) Python VerifiableConsumer service has thread-safety bug for event_handlers

     [ https://issues.apache.org/jira/browse/KAFKA-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ewen Cheslack-Postava resolved KAFKA-4673.
------------------------------------------
       Resolution: Fixed
    Fix Version/s: 0.10.2.0
                   0.10.1.2
                   0.10.3.0

Issue resolved by pull request 2427
[https://github.com/apache/kafka/pull/2427]

> Python VerifiableConsumer service has thread-safety bug for event_handlers
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-4673
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4673
>             Project: Kafka
>          Issue Type: Bug
>          Components: system tests
>    Affects Versions: 0.10.1.1
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Ewen Cheslack-Postava
>             Fix For: 0.10.3.0, 0.10.1.2, 0.10.2.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> From http://confluent-kafka-0-10-1-system-test-results.s3-us-west-2.amazonaws.com/2017-01-17--001.1484653357--apache--0.10.1--d436fa2/report.html
> {quote}
> ====================================================================================================
> test_id:    2017-01-17--001.kafkatest.tests.client.consumer_test.OffsetValidationTest.test_consumer_bounce.clean_shutdown=True.bounce_mode=rolling
> status:     FAIL
> run time:   1 minute 42.663 seconds
>     dictionary changed size during iteration
> Traceback (most recent call last):
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.3-py2.7.egg/ducktape/tests/runner.py", line 106, in run_all_tests
>     data = self.run_single_test()
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.3-py2.7.egg/ducktape/tests/runner.py", line 162, in run_single_test
>     return self.current_test_context.function(self.current_test)
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.3-py2.7.egg/ducktape/mark/_mark.py", line 331, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/tests/kafkatest/tests/client/consumer_test.py", line 137, in test_consumer_bounce
>     self.await_all_members(consumer)
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/tests/kafkatest/tests/verifiable_consumer_test.py", line 83, in await_all_members
>     self.await_members(consumer, self.num_consumers)
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/tests/kafkatest/tests/verifiable_consumer_test.py", line 80, in await_members
>     err_msg="Consumers failed to join in a reasonable amount of time")
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.3-py2.7.egg/ducktape/utils/util.py", line 31, in wait_until
>     if condition():
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/tests/kafkatest/tests/verifiable_consumer_test.py", line 78, in <lambda>
>     wait_until(lambda: len(consumer.joined_nodes()) == num_consumers,
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/tests/kafkatest/services/verifiable_consumer.py", line 317, in joined_nodes
>     return [handler.node for handler in self.event_handlers.itervalues()
> RuntimeError: dictionary changed size during iteration
> {quote}
> It looks like the background thread is incorrectly inserting elements into self.event_handlers during this iteration. This is the first time I've seen this, so I suspect it's rarely hit. Looking at the code, most access is protected by a lock except for the additions at the beginning of the _worker method. Looking through the stacktrace, this looks like the likely culprit since we start the consumer and immediately start calling `await_all_members`, which ultimately iterates over that list. If that call happens faster than the thread gets started we could hit a bad interleaving.
> Fix is probably easy -- just add the proper locking around use of that object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)