Posted to dev@kafka.apache.org by "Jason Gustafson (JIRA)" <ji...@apache.org> on 2016/04/07 00:40:25 UTC

[jira] [Commented] (KAFKA-3513) Transient failure of OffsetValidationTest

    [ https://issues.apache.org/jira/browse/KAFKA-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229269#comment-15229269 ] 

Jason Gustafson commented on KAFKA-3513:
----------------------------------------

[~ewencp] I took a look at this, but had no luck reproducing it. As you observed, the consumers join the group successfully, and apparently well before the expected timeout. That suggests there may have been a delay before one of the rebalance events was propagated to the test driver. Unfortunately, we don't currently have enough information to confirm whether that happened. So maybe we can make a few improvements to make future debugging easier:

1. Persist the output from stdout which contains the verifiable consumer events (this is safer for the verifiable consumer since we only log the start and end offsets of message batches).
2. Add a timestamp to the verifiable consumer events.
3. Add a log message in the service when the event is received.

That should give us enough information to see whether and where delays are occurring.
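
To illustrate items 2 and 3, here is a rough sketch of the idea, not the actual verifiable consumer or kafkatest service code. It assumes events are emitted as one JSON object per line; the function names (make_event, receive_event) and the "timestamp" field are hypothetical. The computed delay is only meaningful if the clocks on the emitting and receiving hosts are reasonably in sync.

```python
import json
import time


def make_event(name, now_ms=None, **fields):
    """Serialize an event as one JSON line carrying an emit timestamp (ms).

    Hypothetical helper: the field name "timestamp" is an assumption.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    event = {"name": name, "timestamp": now_ms}
    event.update(fields)
    return json.dumps(event)


def receive_event(line, now_ms=None, log=print):
    """Parse one event line, log its receipt, and return (event, delay_ms).

    delay_ms is the gap between emit time and receipt time, which is
    the number we would want when looking for propagation delays.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    event = json.loads(line)
    delay_ms = now_ms - event["timestamp"]
    log("received event %s after %d ms" % (event["name"], delay_ms))
    return event, delay_ms
```

With something like this in place, a late rebalance event would show up as a large delay in the service log next to the event name, and the persisted stdout from item 1 would give the emit side of the same timeline.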

> Transient failure of OffsetValidationTest
> -----------------------------------------
>
>                 Key: KAFKA-3513
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3513
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, system tests
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Jason Gustafson
>
> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-04-05--001.1459840046--apache--trunk--31e263e/report.html
> The version of the test that fails in this case is:
> Module: kafkatest.tests.client.consumer_test
> Class:  OffsetValidationTest
> Method: test_broker_failure
> Arguments:
> {
>   "clean_shutdown": true,
>   "enable_autocommit": false
> }
> and others passed. It's unclear if the parameters actually have any impact on the failure.
> I did some initial triage and it looks like the test code isn't seeing all the group members join the group (receive partition assignments), but it appears from the logs that they all did. This could indicate a simple timing issue, but I haven't been able to verify that yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)