Posted to dev@kafka.apache.org by "Apurva Mehta (JIRA)" <ji...@apache.org> on 2016/09/24 17:34:20 UTC

[jira] [Commented] (KAFKA-4215) Consumers miss messages during partition reassignment

    [ https://issues.apache.org/jira/browse/KAFKA-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519348#comment-15519348 ] 

Apurva Mehta commented on KAFKA-4215:
-------------------------------------

Here is [~junrao]'s theory for what is going on:

{quote}
For  the 2nd problem that Apurva discovered, i.e., consumer loses messages during reassignment with bounce and replication factor 1, the problem seems likely due to an existing limitation. Let's say replication factor is 1 and we want to move the replica to a different broker. When the reassignment tool starts, the controller will add both the old and the new replica to assigned replicas and let the new replica be the follower to catch up from the old replica (which is the leader). Now if we stop the broker where the old replica is in, the controller will select the new replica as the new leader even though it's not fully caught up yet since by default we allow unclean leader election. So in the end, some of the messages may only be in the old replica but not in the new replica, and some other messages may only be in the new replica, but not the old replica, due to the unclean leader election. Then the consumer will be missing some of the messages since it only reads from the leader. When verifying the log data, we simply accumulate messages from both replicas. So, all messages can still be found. We can verify if this is the case. If so, we can file a jira to track this but don't have to fix it immediately since it only happens when replication factor is 1, unclean leader election is enabled, and there are bounces during partition reassignment. So, it should be rare. As an immediate fix for the test, we can either use replication factor 2 or disable unclean leader election. It would be useful to improve the log verification part as well. Instead of accumulating messages from all replicas, we probably want to verify that all replicas are identical.
{quote}

The facts back up this hypothesis:
# Unclean leader election was enabled in this test.
# The log verification aggregates the data from all the replicas of a partition into a single set, so it is not representative of the data the consumers actually see.
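
To illustrate the second point, here is a minimal Python sketch. It is not the actual system test code: the function names and the toy replica logs are hypothetical, chosen only to show why a union-based check can pass while a consumer that reads only from the leader still misses messages, and how a stricter "all replicas identical" check would flag the divergence.

```python
def missing_from_union(acked, replica_logs):
    """Union-style check: a message counts as present if ANY replica has it."""
    seen = set()
    for log in replica_logs.values():
        seen.update(log)
    return sorted(set(acked) - seen)


def missing_from_leader(acked, replica_logs, leader):
    """What a consumer can actually observe: only the leader's log."""
    return sorted(set(acked) - set(replica_logs[leader]))


def replicas_identical(replica_logs):
    """Stricter check: every replica must hold the same message sequence."""
    logs = list(replica_logs.values())
    return all(log == logs[0] for log in logs[1:])


# Hypothetical state after an unclean leader election during reassignment:
# the old replica holds 3 and 4, while the new leader holds 5 and 6 instead.
acked = [1, 2, 3, 4, 5, 6]
replica_logs = {
    "old": [1, 2, 3, 4],
    "new": [1, 2, 5, 6],
}

print(missing_from_union(acked, replica_logs))          # []
print(missing_from_leader(acked, replica_logs, "new"))  # [3, 4]
print(replicas_identical(replica_logs))                 # False
```

In this toy case the union check reports nothing missing, yet the consumer never sees messages 3 and 4, which matches the symptom in the assertion error.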


> Consumers miss messages during partition reassignment
> -----------------------------------------------------
>
>                 Key: KAFKA-4215
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4215
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>
> In the specific case where the replication factor of a topic is 1, partition reassignment is ongoing, and a broker is bounced, consumers reliably lose some messages in the stream.
> This can be reproduced in system tests, where the following error message is observed:
> {noformat}
> AssertionError: 737 acked message did not make it to the Consumer. They are: 22530, 45059, 22534, 45063, 22538, 45067, 22542, 45071, 22546, 45075, 22550, 45079, 22554, 45083, 22558, 45087, 22562, 45091, 22566, 45095, ...plus 717 more. Total Acked: 51809, Total Consumed: 51073. We validated that the first 737 of these missing messages correctly made it into Kafka's data files. This suggests they were lost on their way to the consumer.
> {noformat}


