Posted to dev@kafka.apache.org by "Apurva Mehta (JIRA)" <ji...@apache.org> on 2017/01/07 02:01:02 UTC

[jira] [Comment Edited] (KAFKA-4558) throttling_test fails if the producer starts too fast.

    [ https://issues.apache.org/jira/browse/KAFKA-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15806421#comment-15806421 ] 

Apurva Mehta edited comment on KAFKA-4558 at 1/7/17 2:00 AM:
-------------------------------------------------------------

So I had a look at the code. All 13 tests that use `ProduceConsumeValidate` have changed since that commit, so it would be unproductive to revert that change at this point.

Regarding your proposal for two metrics (partitions assigned and per-partition lag): the lag metric may not be what we want. In particular, in the `ProduceConsumeValidate` test, the producer is started after the consumer. So if the topic is originally empty, or if the consumer is configured to read from the end, the lag will always be zero. This is per my understanding of how lag is reported, viz. how far the consumer is from the tail of the log. So the lag metric probably won't be very useful in the majority of cases.
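To make the lag argument concrete, here is a minimal sketch that assumes lag is simply the log end offset minus the consumer's position for a partition. The function name and the offsets are hypothetical and only illustrate the two cases described above.

```python
def consumer_lag(log_end_offset, consumer_position):
    """Per-partition lag: number of records between the consumer's
    position and the tail of the log (assumed definition)."""
    return max(0, log_end_offset - consumer_position)


# Case 1: the topic is empty, so the log end and the consumer
# position are both at offset 0.
empty_topic_lag = consumer_lag(log_end_offset=0, consumer_position=0)

# Case 2: the log already holds 500 records, but the consumer is
# configured to start reading from the end.
read_from_end_lag = consumer_lag(log_end_offset=500, consumer_position=500)

# For contrast: a consumer that is genuinely behind.
behind_lag = consumer_lag(log_end_offset=500, consumer_position=120)
```

In both of the first two cases the lag is zero before the producer has written anything, so a "lag is zero" check cannot distinguish an initialized consumer from one that has not started fetching yet.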

But waiting until partitions assigned is nonzero may be what we want. The tests I have seen just have a single console consumer for the entire topic, so there should be enough partitions to go around. (Of course, this may not be true in the future.) At the very least, it will be better than what we have right now. And if there are not enough partitions to go around, the test will fail early (since the wait_until will time out) and can be diagnosed before check-in.
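A rough sketch of that gating logic follows. It is not the actual test code: `wait_until` here is a small local stand-in loosely modeled on the helper mentioned above, and `FakeConsumer`, `assigned_partitions`, and `start_producer_when_ready` are hypothetical names used only for illustration.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Poll condition() until it returns True or timeout_sec elapses.
    Local stand-in for the test framework's helper (an assumption)."""
    deadline = time.time() + timeout_sec
    while True:
        if condition():
            return
        if time.time() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


class FakeConsumer:
    """Hypothetical consumer that reports an assignment only after
    a given number of polls, mimicking slow initialization."""

    def __init__(self, polls_until_assigned):
        self._polls_left = polls_until_assigned

    def assigned_partitions(self):
        if self._polls_left > 0:
            self._polls_left -= 1
            return 0
        return 3


def start_producer_when_ready(consumer, start_producer, timeout_sec=5):
    """Block the producer until the consumer has at least one partition."""
    wait_until(
        lambda: consumer.assigned_partitions() > 0,
        timeout_sec=timeout_sec,
        backoff_sec=0.01,
        err_msg="Consumer was not assigned any partitions in time",
    )
    start_producer()
```

If the consumer never receives an assignment, the wait times out and the producer is never started, which is the "fail early" behavior described above.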

Regarding the implementation of partitions assigned alone, I thought it might be worth staging it by first reading the metric through JMX. This would give us a shorter turnaround time and validate whether this approach is sufficient to fix the current issues. We can even experiment with different metrics more quickly if necessary.

Finally, would adding an HttpMetricsReporter necessitate a KIP?







> throttling_test fails if the producer starts too fast.
> ------------------------------------------------------
>
>                 Key: KAFKA-4558
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4558
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>
> As described in https://issues.apache.org/jira/browse/KAFKA-4526, the throttling test will fail if the producer in the produce-consume-validate loop starts up before the consumer is fully initialized.
> We need to block the start of the producer until the consumer is ready to go. 
> The current plan is to poll the consumer for a particular metric (like, for instance, partition assignment) which will act as a good proxy for successful initialization. Currently, we just check for the existence of a process with the PID, which is not a strong enough check, causing the test to fail intermittently. 


