Posted to dev@kafka.apache.org by "Robert Christ (JIRA)" <ji...@apache.org> on 2016/04/06 03:29:25 UTC

[jira] [Commented] (KAFKA-3042) updateIsr should stop after failed several times due to zkVersion issue

    [ https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227519#comment-15227519 ] 

Robert Christ commented on KAFKA-3042:
--------------------------------------

I work with James and we have seen this problem repeatedly.  We have been
able to reproduce the problem somewhat reliably and the pattern seems
to be:

1) hard kill the controller (say broker 1)
2) after session timeout, the zookeeper session expires for broker 1
3) another node (say broker 2) takes ownership of the /controller node
4) The zookeeper session for broker 2 expires even though broker 2 continues to function (see below)
5) another node (say broker 3) takes ownership of the /controller node
6) At some point in the future, possibly after broker 3 finishes taking controllership or broker 1 resumes from the hard stop,
broker 2 will spew unending streams of the "Cached zkVersion..." message.
7) Restarting broker 2 will cause the zkVersion problem to go away.

While the zkVersion message is appearing, the ISR lists do not get updated and we have under-replicated
partitions.
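
To illustrate why the message can repeat forever, here is a toy model of a versioned
conditional write. It is only a sketch, not Kafka code: VersionedNode and try_update_isr
are made-up names, and the in-memory node merely imitates the way a ZooKeeper versioned
setData rejects a write whose expected version is stale.

```python
class VersionedNode:
    """Toy stand-in for a znode: data plus a version counter (an assumption, not a real client)."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_set(self, data, expected_version):
        # Write only if the caller's version matches the stored one.
        if expected_version != self.version:
            return False
        self.data = data
        self.version += 1
        return True


def try_update_isr(node, cached_version, new_isr):
    """Attempt one ISR update using the cached version; return (ok, message)."""
    if node.conditional_set(new_isr, cached_version):
        return True, "updated ISR to %s" % new_isr
    return False, (
        "Cached zkVersion %d not equal to that in zookeeper, skip updating ISR"
        % cached_version
    )


node = VersionedNode(data=[1, 2, 3])
broker2_cached = node.version  # broker 2 caches the version it last saw

# Another writer (for example the new controller) rewrites the state,
# which bumps the stored version past what broker 2 has cached.
node.conditional_set([2, 3], node.version)

# Broker 2 keeps retrying with its stale cached version and never succeeds.
results = [try_update_isr(node, broker2_cached, [2]) for _ in range(3)]
for ok, msg in results:
    print(msg)
```

In this model nothing ever refreshes broker2_cached, so every retry fails the same way,
which matches what we see until broker 2 is restarted.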

So step 4 is the mystery.  I believe it happens because we have some form of network/disk/CPU contention
that causes the ping from broker 2 not to reach, or not to be acknowledged by, ZooKeeper within the session timeout.
We are still trying to figure that out, but I believe it is triggering some race condition or bug where
the active controller loses control of the /controller node and another node takes it.

I have logs (oh so many logs) from when this was occurring and can reproduce it fairly easily.


> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7
> centos 6.4
>            Reporter: Jiahongchao
>         Attachments: controller.log, server.log.2016-03-23-01, state-change.log
>
>
> Sometimes a broker may repeatedly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact it is a follower.
> So after several failed tries, it needs to find out who the leader is.
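
The behaviour requested in the description could look roughly like the sketch below.
It is a toy model under stated assumptions, not a patch: VersionedNode, update_isr_bounded
and max_failures are hypothetical names, and the in-memory node only imitates a versioned
conditional write. After a bounded number of failures the broker re-reads the state and
checks whether it is still the leader instead of retrying indefinitely.

```python
class VersionedNode:
    """Toy stand-in for a znode: data plus a version counter (an assumption, not a real client)."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_set(self, data, expected_version):
        # Write only if the caller's version matches the stored one.
        if expected_version != self.version:
            return False
        self.data = data
        self.version += 1
        return True


def update_isr_bounded(node, broker_id, cached_version, new_isr, max_failures=3):
    """Try the update at most max_failures times, then re-read the state.

    Returns (status, current_version), where status is "updated",
    "not_leader", or "refreshed" (still leader, but the cache was stale).
    """
    for _ in range(max_failures):
        proposed = {"leader": broker_id, "isr": new_isr}
        if node.conditional_set(proposed, cached_version):
            return "updated", node.version
    # Repeated failures: stop retrying and find out who the leader is.
    current = node.data
    if current["leader"] != broker_id:
        return "not_leader", node.version
    return "refreshed", node.version


# Broker 2 believes it is the leader and caches version 0.
node = VersionedNode({"leader": 2, "isr": [1, 2, 3]})
stale_version = node.version

# Leadership moves to broker 3, which bumps the stored version.
node.conditional_set({"leader": 3, "isr": [2, 3]}, node.version)

status, version = update_isr_bounded(
    node, broker_id=2, cached_version=stale_version, new_isr=[2]
)
print(status, version)
```

Whether a real fix should refresh the cached version, resign leadership, or do something
else after the failed tries is left open here; the sketch only shows the bounded retry.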



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)