Posted to jira@kafka.apache.org by "Andrey Elenskiy (JIRA)" <ji...@apache.org> on 2017/06/22 21:36:00 UTC

[jira] [Comment Edited] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

    [ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060049#comment-16060049 ] 

Andrey Elenskiy edited comment on KAFKA-2729 at 6/22/17 9:35 PM:
-----------------------------------------------------------------

Seeing the same issue on 0.10.2. 

A node running zookeeper lost networking for a split second and initiated an election, which caused some sessions to expire with:

{code}
2017-06-22 02:07:36,092 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@373] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
{code}

which caused controller resignation:

{code}
[2017-06-22 02:07:36,363] INFO [SessionExpirationListener on 158980], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: Controller resigning, broker id 158980 (kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: De-registering IsrChangeNotificationListener (kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] INFO [Partition state machine on Controller 158980]: Stopped partition state machine (kafka.controller.PartitionStateMachine)
[2017-06-22 02:07:37,028] INFO [Replica state machine on controller 158980]: Stopped replica state machine (kafka.controller.ReplicaStateMachine)
[2017-06-22 02:07:37,028] INFO [Controller 158980]: Broker 158980 resigned as the controller (kafka.controller.KafkaController)
{code}
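
For context, a session expiration is delivered to ZooKeeper clients as a {{KeeperState.Expired}} watcher event, which is what the SessionExpirationListener above is reacting to. A minimal sketch using only the standard org.apache.zookeeper client API; the connect string, timeout and reconnect policy below are illustrative, not Kafka's actual controller code:

{code}
// Sketch only: how a ZooKeeper client observes the session expiration that
// kafka.controller.KafkaController$SessionExpirationListener handles.
import java.io.IOException;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionExpirationWatcher implements Watcher {
    private volatile ZooKeeper zk;

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
            // An expired session cannot be resumed: every ephemeral node
            // (e.g. /controller) and watch tied to it is gone, so the client
            // must give up any leadership it held and open a new session.
            reconnect();
        }
    }

    private void reconnect() {
        try {
            zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 6000, this);
        } catch (IOException e) {
            throw new RuntimeException("Could not re-establish ZooKeeper session", e);
        }
    }
}
{code}

Because the expired session takes its ephemeral nodes (including /controller) with it, the only way forward is a brand-new session, which is why the log above says "shut down all controller components and try to re-elect".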

and after that the broker just kept logging the following in its server logs for the next 8 hours, until it was restarted manually:

{code}
[2017-06-22 17:41:06,928] INFO Partition [A,5] on broker 158980: Shrinking ISR for partition [A,5] from 158980,133641,155394 to 158980 (kafka.cluster.Partition)
[2017-06-22 17:41:06,935] INFO Partition [A,5] on broker 158980: Cached zkVersion [73] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
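
For context, the "Cached zkVersion not equal" message comes from a conditional ZooKeeper write: the broker submits the znode version it has cached, and ZooKeeper rejects the update if the node has been modified since. A minimal sketch of that pattern against the standard org.apache.zookeeper API; the path, payload and method name below are illustrative, not Kafka's actual ISR-update code:

{code}
// Sketch only: the conditional-write pattern behind "Cached zkVersion not
// equal to that in zookeeper". The real logic lives in kafka.cluster.Partition.
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalIsrUpdate {
    static boolean tryShrinkIsr(ZooKeeper zk, String statePath, byte[] newState,
                                int cachedZkVersion)
            throws KeeperException, InterruptedException {
        try {
            // The write only succeeds if the znode's current version still
            // matches the version the broker has cached.
            zk.setData(statePath, newState, cachedZkVersion);
            return true;
        } catch (KeeperException.BadVersionException e) {
            // This is the state the stuck broker keeps logging: its cached
            // version is stale (the znode was rewritten, e.g. by a new
            // controller), so it skips the ISR update rather than overwrite
            // newer data. Re-reading the znode reveals the current version.
            Stat stat = new Stat();
            zk.getData(statePath, false, stat);
            return false;
        }
    }
}
{code}

In this ticket the broker apparently never gets back in sync with the current version, so the same shrink/skip pair of messages repeats for hours until a restart.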



> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where the zookeeper nodes couldn't reach each other, we started seeing a large number of under-replicated partitions. The zookeeper cluster recovered, but we continued to see a large number of under-replicated partitions. Two brokers in the kafka cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only recovered after a restart. Our own investigation yielded nothing; I was hoping you could shed some light on this issue. Possibly it's related to https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 0.8.2.1.


