Posted to users@kafka.apache.org by Xinyao Hu <xi...@dropbox.com> on 2014/04/16 18:50:01 UTC

leaders being kicked out of ISR

Hey all,

This is Xinyao from Dropbox, and I am trying to push Kafka to be widely
used inside Dropbox.

I set up five nodes with 2x replication. The system is handling all the
traffic well. The problem I am seeing now is that one host (probably a
leader) is being kicked out of the ISR. Here is some related information.

[2014-04-16 15:31:09,596] INFO Closing socket connection to /10.12.99.184.
(kafka.network.Processor)
[2014-04-16 15:31:09,921] INFO Closing socket connection to /10.12.99.184.
(kafka.network.Processor)
[2014-04-16 15:31:13,142] INFO Closing socket connection to /10.12.99.153.
(kafka.network.Processor)
[2014-04-16 15:31:24,101] INFO Closing socket connection to /10.12.62.181.
(kafka.network.Processor)
[2014-04-16 15:31:27,157] INFO Rolled new log segment for
'photos-mobile_request_user0_logs-0003-4' in 1 ms. (kafka.log.Log)
[2014-04-16 15:31:30,709] INFO Rolled new log segment for
'api-oauth-unnecessary-token-with-nonempty-secret-3' in 1 ms.
(kafka.log.Log)
[2014-04-16 15:31:32,310] INFO Rolled new log segment for
'api-files-post-unnecessary-params-3' in 1 ms. (kafka.log.Log)
[2014-04-16 15:31:42,150] INFO Closing socket connection to /10.12.99.184.
(kafka.network.Processor)
[2014-04-16 15:31:42,151] INFO Closing socket connection to /10.12.99.184.
(kafka.network.Processor)
[2014-04-16 15:31:47,557] INFO Closing socket connection to /10.16.105.60.
(kafka.network.Processor)
[2014-04-16 15:31:47,571] INFO Closing socket connection to /10.16.103.58.
(kafka.network.Processor)
[2014-04-16 15:31:47,632] INFO Closing socket connection to /10.16.54.38.
(kafka.network.Processor)
[2014-04-16 15:31:47,879] INFO Closing socket connection to /10.16.102.22.
(kafka.network.Processor)
[2014-04-16 15:31:54,954] INFO Rolled new log segment for
'api-not_your_cursor-2' in 1 ms. (kafka.log.Log)
[2014-04-16 15:31:56,353] INFO Closing socket connection to /10.12.98.160.
(kafka.network.Processor)
[2014-04-16 15:31:56,353] INFO Closing socket connection to /10.12.98.160.
(kafka.network.Processor)
[2014-04-16 15:31:57,136] INFO Closing socket connection to /10.12.62.181.
(kafka.network.Processor)
[2014-04-16 15:31:57,137] INFO Closing socket connection to /10.12.62.181.
(kafka.network.Processor)
[2014-04-16 15:32:05,540] INFO Partition
[client-fsinfo-from-formatted-traces-0012,0] on broker 1: Shrinking ISR for
partition [client-fsinfo-from-formatted-traces-0012,0] from 1,2 to 1
(kafka.cluster.Partition)
[2014-04-16 15:32:05,579] ERROR Conditional update of path
/brokers/topics/client-fsinfo-from-formatted-traces-0012/partitions/0/state
with data
{"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]}
and expected
version 0 failed due to
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
BadVersion for
/brokers/topics/client-fsinfo-from-formatted-traces-0012/partitions/0/state
(kafka.utils.ZkUtils$)
[2014-04-16 15:32:05,581] INFO Partition
[client-fsinfo-from-formatted-traces-0012,0] on broker 1: Cached zkVersion
[0] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
[2014-04-16 15:32:05,582] INFO Partition [api-shmodel-non-ssl,4] on broker
1: Shrinking ISR for partition [api-shmodel-non-ssl,4] from 1,2 to 1
(kafka.cluster.Partition)
[2014-04-16 15:32:05,596] ERROR Conditional update of path
/brokers/topics/api-shmodel-non-ssl/partitions/4/state with data
{"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]}
and expected version 0 failed due
to org.apache.zookeeper.KeeperException$BadVersionException:
KeeperErrorCode = BadVersion for
/brokers/topics/api-shmodel-non-ssl/partitions/4/state
(kafka.utils.ZkUtils$)
[2014-04-16 15:32:05,596] INFO Partition [api-shmodel-non-ssl,4] on broker
1: Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
[2014-04-16 15:32:05,596] INFO Partition
[server-network_logs_errors-0012,0] on broker 1: Shrinking ISR for
partition [server-network_logs_errors-0012,0] from 1,2 to 1
(kafka.cluster.Partition)
[2014-04-16 15:32:05,605] ERROR Conditional update of path
/brokers/topics/server-network_logs_errors-0012/partitions/0/state with
data
{"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]}
and expected version 0
 failed due to org.apache.zookeeper.KeeperException$BadVersionException:
KeeperErrorCode = BadVersion for
/brokers/topics/server-network_logs_errors-0012/partitions/0/state
(kafka.utils.ZkUtils$)
[2014-04-16 15:32:05,605] INFO Partition
[server-network_logs_errors-0012,0] on broker 1: Cached zkVersion [0] not
equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

You can see that something went wrong around 15:31. Here is the current
topic replica information:

Topic:analytics-ab_raw_metric PartitionCount:5 ReplicationFactor:2 Configs:
Topic: analytics-ab_raw_metric Partition: 0 Leader: 3 Replicas: 3,5 Isr: 3,5
Topic: analytics-ab_raw_metric Partition: 1 Leader: 4 Replicas: 4,1 Isr: 1,4
Topic: analytics-ab_raw_metric Partition: 2 Leader: 5 Replicas: 5,2 Isr: 5,2
Topic: analytics-ab_raw_metric Partition: 3 Leader: 3 Replicas: 1,3 Isr: 3
Topic: analytics-ab_raw_metric Partition: 4 Leader: 2 Replicas: 2,4 Isr: 2,4
Topic:analytics-api_log2-0000 PartitionCount:5 ReplicationFactor:2 Configs:
Topic: analytics-api_log2-0000 Partition: 0 Leader: 3 Replicas: 3,5 Isr: 3,5
Topic: analytics-api_log2-0000 Partition: 1 Leader: 4 Replicas: 4,1 Isr: 1,4
Topic: analytics-api_log2-0000 Partition: 2 Leader: 5 Replicas: 5,2 Isr: 5,2
Topic: analytics-api_log2-0000 Partition: 3 Leader: 3 Replicas: 1,3 Isr: 3
Topic: analytics-api_log2-0000 Partition: 4 Leader: 2 Replicas: 2,4 Isr: 2,4
Topic:analytics-api_log2-0001 PartitionCount:5 ReplicationFactor:2 Configs:
Topic: analytics-api_log2-0001 Partition: 0 Leader: 3 Replicas: 1,3 Isr: 3
Topic: analytics-api_log2-0001 Partition: 1 Leader: 2 Replicas: 2,4 Isr: 2,4
Topic: analytics-api_log2-0001 Partition: 2 Leader: 3 Replicas: 3,5 Isr: 3,5
Topic: analytics-api_log2-0001 Partition: 3 Leader: 4 Replicas: 4,1 Isr: 1,4
Topic: analytics-api_log2-0001 Partition: 4 Leader: 5 Replicas: 5,2 Isr: 5,2
Topic:analytics-api_log2-0002 PartitionCount:5 ReplicationFactor:2 Configs:
Topic: analytics-api_log2-0002 Partition: 0 Leader: 5 Replicas: 5,1 Isr: 5,1
Topic: analytics-api_log2-0002 Partition: 1 Leader: 2 Replicas: 1,2 Isr: 2
Topic: analytics-api_log2-0002 Partition: 2 Leader: 2 Replicas: 2,3 Isr: 2,3
Topic: analytics-api_log2-0002 Partition: 3 Leader: 3 Replicas: 3,4 Isr: 3,4
Topic: analytics-api_log2-0002 Partition: 4 Leader: 4 Replicas: 4,5 Isr: 5,4
Topic:analytics-api_log2-0003 PartitionCount:5 ReplicationFactor:2 Configs:
Topic: analytics-api_log2-0003 Partition: 0 Leader: 5 Replicas: 5,4 Isr: 5,4
Topic: analytics-api_log2-0003 Partition: 1 Leader: 5 Replicas: 1,5 Isr: 5
Topic: analytics-api_log2-0003 Partition: 2 Leader: 2 Replicas: 2,1 Isr: 2,1
Topic: analytics-api_log2-0003 Partition: 3 Leader: 3 Replicas: 3,2 Isr: 3,2
Topic: analytics-api_log2-0003 Partition: 4 Leader: 4 Replicas: 4,3 Isr: 3,4

....

I suspect that broker 1 lost leadership for all of its topics around that
timestamp, but it is still in sync as a follower.

I followed "How to reduce churn" and increased replica.lag.max.messages.
However, I don't think that helps, since I suspect it is the leader that is
being kicked out of the ISR.

I cannot reproduce this consistently; it has happened twice, once each day
over the past two days. These are all the configs I changed in
server.properties:

default.replication.factor=2

auto.leader.rebalance.enable=true

log.roll.hours=1

replica.lag.max.messages=50000

Let me know if you need more logs to debug.

Re: leaders being kicked out of ISR

Posted by Xinyao Hu <xi...@dropbox.com>.
Adding some context.

The broker that lost leadership keeps trying to write:

[2014-04-16 16:53:10,126] ERROR Conditional update of path
/brokers/topics/photos-mobile_analytics_user_logs-0060/partitions/0/state
with data
{"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]}
and expected version 0 failed due to
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
BadVersion for
/brokers/topics/photos-mobile_analytics_user_logs-0060/partitions/0/state
(kafka.utils.ZkUtils$)

and keeps failing, since according to ZooKeeper it is no longer the leader:

[zk: 127.0.0.1:2181(CONNECTED) 17] get
/kafka-testing/brokers/topics/photos-mobile_analytics_user_logs-0060/partitions/0/state
{"controller_epoch":2,"leader":2,"version":1,"leader_epoch":1,"isr":[2]}

And the system never recovers.
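
For anyone who wants to see the mechanism: the failing call is essentially a
compare-and-set on the partition state znode. The broker passes the
zkVersion it last read, and ZooKeeper rejects the write if the znode has
since been rewritten (here, by the controller when it moved the leader). A
minimal sketch with the plain ZooKeeper Java client (this is not Kafka's
actual code; the topic path and versions are just illustrative):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class PartitionStateCas {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });
        // Hypothetical partition state path, same shape as the ones in the logs.
        String path = "/brokers/topics/some-topic/partitions/0/state";
        byte[] newState = ("{\"controller_epoch\":1,\"leader\":1,\"version\":1,"
                + "\"leader_epoch\":0,\"isr\":[1]}").getBytes("UTF-8");
        int cachedZkVersion = 0; // the version this broker last read
        try {
            // Succeeds only if the znode is still at the expected version.
            // If the controller already rewrote the state (bumping the
            // version), this throws BadVersionException, the error in the logs.
            zk.setData(path, newState, cachedZkVersion);
        } catch (KeeperException.BadVersionException e) {
            System.out.println("znode changed underneath us; re-read before retrying");
        } finally {
            zk.close();
        }
    }
}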


Re: leaders being kicked out of ISR

Posted by Jun Rao <ju...@gmail.com>.
Any errors in the controller/state-change logs? Do you have GC-induced ZK
session expirations in the brokers?
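
One quick way to check the GC angle is to compare the broker's GC pause
times against zookeeper.session.timeout.ms, either from GC logs or over
JMX. A minimal, generic sketch (not Kafka code; the 10s polling interval is
just an example) that prints cumulative pause time per collector:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc :
                    ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: collections=%d, total pause ms=%d%n",
                        gc.getName(), gc.getCollectionCount(),
                        gc.getCollectionTime());
            }
            Thread.sleep(10000);
        }
    }
}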

Thanks,

Jun

