Posted to dev@kafka.apache.org by "Jason Gustafson (JIRA)" <ji...@apache.org> on 2015/12/11 01:43:10 UTC

[jira] [Comment Edited] (KAFKA-2978) Topic partition is not sometimes consumed after rebalancing of consumer group

    [ https://issues.apache.org/jira/browse/KAFKA-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051920#comment-15051920 ] 

Jason Gustafson edited comment on KAFKA-2978 at 12/11/15 12:42 AM:
-------------------------------------------------------------------

I think I see what's going on, and the problem affects trunk just as well. There is a window after a rebalance completes in which a pending fetch can cause the consumed position to get out of sync with the fetched position. Here's the sequence:

1. Fetch is sent at offset 0: (fetched: 0, consumed: 0)
2. Rebalance is triggered: (fetched: 0, consumed: 0)
3. Fetch returns and is buffered. We set the next fetch position: (fetched: 25, consumed: 0)
4. Rebalance finishes and all offsets are reset: (fetched: 0, consumed: 0)
5. Buffered fetch is returned to the user and we update the consumed position: (fetched: 0, consumed: 25)

This can only happen if we are assigned the same partition again in the rebalance. But when it does happen, we stop fetching that partition, since fetches are only sent when the fetched position matches the consumed position. This should be straightforward to fix, but I want to check whether there are any other implications of a pending fetch which we haven't considered.
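
To make the bookkeeping concrete, here is a minimal, self-contained sketch of the five steps above. The class and field names (PartitionState, fetched, consumed, buffered*) are invented for illustration and are not the real consumer internals; the sketch only models the position bookkeeping, not actual fetching.

{code:java}
public class FetchRebalanceRace {

    // Per-partition positions tracked by the consumer (illustrative only).
    static class PartitionState {
        long fetched;            // next offset to request from the broker
        long consumed;           // next offset to hand to the application
        long bufferedStart = -1; // start offset of a completed but undelivered fetch
        long bufferedEnd = -1;   // end offset of that fetch
    }

    public static void main(String[] args) {
        PartitionState p = new PartitionState();

        // 1. Fetch is sent at offset 0: (fetched: 0, consumed: 0)
        long fetchStart = p.fetched;

        // 2. Rebalance is triggered; positions are unchanged so far.

        // 3. Fetch returns 25 records and is buffered; the next fetch position advances.
        p.bufferedStart = fetchStart;
        p.bufferedEnd = fetchStart + 25;
        p.fetched = p.bufferedEnd;        // (fetched: 25, consumed: 0)

        // 4. Rebalance finishes, the same partition is re-assigned, offsets are reset.
        p.fetched = 0;
        p.consumed = 0;                   // (fetched: 0, consumed: 0)

        // 5. The stale buffered fetch is still handed to the user and advances consumed.
        p.consumed = p.bufferedEnd;       // (fetched: 0, consumed: 25)

        // Fetches are only sent while the two positions agree, so this
        // partition is never fetched again.
        boolean fetchable = (p.fetched == p.consumed);
        System.out.println("fetched=" + p.fetched + " consumed=" + p.consumed
                + " fetchable=" + fetchable);  // fetched=0 consumed=25 fetchable=false
    }
}
{code}

In this model the partition stays unfetchable once the stale buffered fetch is applied in step 5. One possible guard, again only a sketch of the idea, would be to discard any buffered fetch whose starting offset no longer matches the current position after the rebalance.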



> Topic partition is not sometimes consumed after rebalancing of consumer group
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-2978
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2978
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, core
>    Affects Versions: 0.9.0.0
>            Reporter: Michal Turek
>            Assignee: Jason Gustafson
>            Priority: Critical
>             Fix For: 0.9.0.1
>
>
> Hi there, we are evaluating Kafka 0.9 to find out whether it is stable enough and ready for production. We wrote a tool that verifies that each produced message is also properly consumed. We found the issue described below while stress-testing Kafka with this tool.
> Adding more and more consumers to a consumer group may result in unsuccessful rebalancing. Data from one or more partitions are not consumed and are effectively unavailable to the client application (e.g. for 15 minutes). The situation can be resolved externally by touching the consumer group again (adding or removing a consumer), which forces another rebalancing that may or may not be successful.
> Significantly higher CPU utilization was observed in such cases (from about 3% to 17%). The extra CPU load appears in both the affected consumer and the Kafka broker, according to htop and profiling with jvisualvm.
> Jvisualvm indicates the issue may be related to KAFKA-2936 (see the screenshots in the GitHub repo below), but I'm very unsure. I also don't know whether the issue is in the consumer or the broker, because both are affected and I don't know Kafka's internals.
> The issue is not deterministic, but it can be reproduced easily after a few minutes just by starting more and more consumers. More parallelism with multiple CPUs probably gives the issue more chances to appear.
> The tool itself, together with detailed instructions for reliably reproducing the issue and an initial analysis, is available here:
> - https://github.com/avast/kafka-tests
> - https://github.com/avast/kafka-tests/tree/issue1/issues/1_rebalancing
> - Prefer the fixed tag {{issue1}} to the {{master}} branch, which may change.
> - Note that there are also various jvisualvm screenshots together with full logs from all components of the tool.
> My colleague was able to independently reproduce the issue by following the instructions above. If you have any questions or need any help with the tool, just let us know. This issue is a blocker for us.
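
For illustration only (this is not code from the report or from the avast/kafka-tests tool linked above): a minimal sketch of the produce-then-consume verification the report describes, assuming a local broker at localhost:9092 and invented topic and group names. Starting additional copies of the consumer part with the same {{group.id}} is what triggers the rebalances under which the bug was observed.

{code:java}
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.*;

public class ProduceConsumeCheck {
    public static void main(String[] args) {
        String topic = "verify-topic";                        // assumed topic name

        // Produce 1000 numbered messages and remember them.
        Properties prod = new Properties();
        prod.put("bootstrap.servers", "localhost:9092");
        prod.put("key.serializer", StringSerializer.class.getName());
        prod.put("value.serializer", StringSerializer.class.getName());

        Set<String> outstanding = new HashSet<>();
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prod)) {
            for (int i = 0; i < 1000; i++) {
                String value = Integer.toString(i);
                outstanding.add(value);
                producer.send(new ProducerRecord<>(topic, value, value));
            }
        }

        // Consume as a member of a group; every produced message should be seen.
        // Running more instances of this consumer with the same group.id forces
        // the rebalances described in the report.
        Properties cons = new Properties();
        cons.put("bootstrap.servers", "localhost:9092");
        cons.put("group.id", "verify-group");                 // assumed group id
        cons.put("auto.offset.reset", "earliest");
        cons.put("key.deserializer", StringDeserializer.class.getName());
        cons.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons)) {
            consumer.subscribe(Collections.singletonList(topic));
            long deadline = System.currentTimeMillis() + 60_000;
            while (!outstanding.isEmpty() && System.currentTimeMillis() < deadline) {
                for (ConsumerRecord<String, String> record : consumer.poll(1000)) {
                    outstanding.remove(record.value());
                }
            }
        }

        System.out.println(outstanding.isEmpty()
                ? "All produced messages were consumed"
                : outstanding.size() + " messages were never consumed");
    }
}
{code}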



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)