Posted to users@kafka.apache.org by joe smith <wa...@yahoo.com> on 2014/05/30 23:48:26 UTC

Kafka broker stopped sending/receiving messages

 
Hi,

We are doing load testing and ran into some exceptions. We need help clarifying what the problem is and how to bring the broker back online (hopefully without having to shut down the other brokers). Each broker was handling about 100k msgs/sec, with small payloads of less than 500 bytes.

About 2 hours into the run, one of the brokers stopped. It looks like an update to ZooKeeper failed with a version conflict on the /controller data, and it seems to happen when a new leader is chosen. How do we prevent this, and how do we recover?

Sorry for the long post.  Thanks for the help.

Broker version kafka_2.10-0.8.1.1
JVM 1.7
We have 3 brokers running on 3 virtual machines.
1 topic, 6 partitions, 2 replicas
3 producing apps (6 producer threads each)
3 consuming apps (2 consumers each)
The partitions and consumers were properly load-balanced across the 3 brokers.
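For reference, our producing apps use the standard 0.8 producer with settings along these lines (the broker list and serializer shown here are illustrative, not our exact values):

    metadata.broker.list=host01:9092,host02:9092,host03:9092
    serializer.class=kafka.serializer.StringEncoder
    request.required.acks=1
    producer.type=async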

[2014-05-30 09:42:21,122] ERROR Conditional update of path /brokers/topics/topic1/partitions/2/state with data {"controller_epoch":27,"leader":1,"version":1,"leader_epoch":0,"isr":[1]} and expected version 1960 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/topic1/partitions/2/state (kafka.utils.ZkUtils$)
[2014-05-30 09:42:21,123] INFO Closing socket connection to /10.74.9.97. (kafka.network.Processor)
[2014-05-30 09:42:21,147] INFO Partition [topic1,2] on broker 1: Cached zkVersion [1960] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2014-05-30 09:42:21,147] INFO Partition [topic1,5] on broker 1: Shrinking ISR for partition [topic1,5] from 1,3 to 1 (kafka.cluster.Partition)
[2014-05-30 09:42:21,149] ERROR Conditional update of path /brokers/topics/topic1/partitions/5/state with data {"controller_epoch":27,"leader":1,"version":1,"leader_epoch":0,"isr":[1]} and expected version 2394 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/topic1/partitions/5/state (kafka.utils.ZkUtils$)
[2014-05-30 09:42:21,149] INFO Partition [topic1,5] on broker 1: Cached zkVersion [2394] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2014-05-30 09:42:21,182] INFO conflict in /controller data: {"version":1,"brokerid":1,"timestamp":"1401442941179"} stored data: {"version":1,"brokerid":3,"timestamp":"1401442940062"} (kafka.utils.ZkUtils$)
[2014-05-30 09:42:21,259] INFO re-registering broker info in ZK for broker 1 (kafka.server.KafkaHealthcheck)
[2014-05-30 09:42:21,267] INFO Registered broker 1 at path /brokers/ids/1 with address host01:9092. (kafka.utils.ZkUtils$)
[2014-05-30 09:42:21,267] INFO done re-registering broker (kafka.server.KafkaHealthcheck)
[2014-05-30 09:42:21,299] INFO Subscribing to /brokers/topics path to watch for new topics (kafka.server.KafkaHealthcheck)
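The conflict message above suggests broker 1 tried to claim /controller while broker 3 already held it; the "stored data" is what is currently in ZooKeeper. The znode can be inspected directly with the ZooKeeper CLI (zkCli.sh ships with ZooKeeper; the host and port below are ours):

    bin/zkCli.sh -server host01:2181
    get /controller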
 
We then saw a bunch of producer errors indicating that the target partition is not local:
 
[2014-05-30 09:42:21,482] WARN [KafkaApi-1] Produce request with correlation id 277356381 from client pooledKafkaMsgProducer on partition [topic1,2] failed due to Leader not local for partition [topic1,2] on broker 1 (kafka.server.KafkaApis)
[2014-05-30 09:42:21,482] WARN [KafkaApi-1] Produce request with correlation id 272337409 from client pooledKafkaMsgProducer on partition [topic1,2] failed due to Leader not local for partition [topic1,2] on broker 1 (kafka.server.KafkaApis)
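As far as we understand, the 0.8 producer recovers from these errors on its own by refreshing metadata and retrying the send. The producer settings that govern this, shown at what we believe are the 0.8 defaults, are:

    message.send.max.retries=3
    retry.backoff.ms=100
    topic.metadata.refresh.interval.ms=600000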
 
We then saw that a new leader was elected:

[2014-05-30 09:42:21,523] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions [topic1,5],[topic1,2] (kafka.server.ReplicaFetcherManager)
[2014-05-30 09:42:21,523] INFO Truncating log topic1-5 to offset 2304238765. (kafka.log.Log)
[2014-05-30 09:42:21,549] INFO New leader is 3 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2014-05-30 09:42:21,652] INFO Truncating log topic1-2 to offset 2304238451. (kafka.log.Log)
[2014-05-30 09:42:21,659] INFO [ReplicaFetcherManager on broker 1] Added fetcher for partitions ArrayBuffer([[topic1,2], initOffset 2304238451 to broker id:2,host:host02:9092] , [[topic1,5], initOffset 2304238765 to broker id:3,host:host03:9092] ) (kafka.server.ReplicaFetcherManager)
[2014-05-30 09:42:21,688] WARN [KafkaApi-1] Fetch request with correlation id 114863 from client host02-2254-1-ConsumerFetcherThread-g1_host02-1401389340410-4d2e28ab-0-1 on partition [topic1,2] failed due to Leader not local for partition [topic1,2] on broker 1 (kafka.server.KafkaApis)
[2014-05-30 09:42:21,690] WARN [KafkaApi-1] Fetch request with correlation id 114444 from client host03-5415-1-ConsumerFetcherThread-g1_host03-1401389345244-70fc9378-0-1 on partition [topic1,5] failed due to Leader not local for partition [topic1,5] on broker 1 (kafka.server.KafkaApis)
[2014-05-30 09:42:21,789] WARN [KafkaApi-1] Fetch request with correlation id 114863 from client host02-2254-1-ConsumerFetcherThread-g1_host02-1401389340410-4d2e28ab-0-1 on partition [topic1,2] failed due to Leader not local for partition [topic1,2] on broker 1 (kafka.server.KafkaApis)
[2014-05-30 09:42:21,790] WARN [KafkaApi-1] Fetch request with correlation id 114444 from client host03-5415-1-ConsumerFetcherThread-g1_host03-1401389345244-70fc9378-0-1 on partition [topic1,5] failed due to Leader not local for partition [topic1,5] on broker 1 (kafka.server.KafkaApis)
[2014-05-30 09:42:21,793] INFO Closing socket connection to /10.74.41.157. (kafka.network.Processor)
[2014-05-30 09:42:21,794] INFO Closing socket connection to /10.74.37.171. (kafka.network.Processor)
[2014-05-30 09:42:21,794] INFO Closing socket connection to /10.74.41.157. (kafka.network.Processor)
[2014-05-30 09:42:21,794] INFO Closing socket connection to /10.74.41.157. (kafka.network.Processor)
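After things settle, partition leadership and ISR membership can be double-checked with the topics tool, which should show the current leader and ISR for each partition (the ZooKeeper address here is ours):

    bin/kafka-topics.sh --describe --zookeeper host01:2181 --topic topic1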

Re: Kafka broker stopped sending/receiving messages

Posted by Jun Rao <ju...@gmail.com>.
Do you see ZK session expirations in the broker logs? If so, take a look at
https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whypartitionleadersmigratethemselvessometimes
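Session expirations are typically caused by long GC pauses or flaky network. Besides GC tuning, the broker-side knobs are in config/server.properties, e.g.:

    # defaults shown; the connection timeout falls back to the session timeout if unset
    zookeeper.session.timeout.ms=6000
    zookeeper.connection.timeout.ms=6000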

Thanks,

Jun

