Posted to users@kafka.apache.org by 陈小军 <ch...@nhn.com> on 2014/04/24 08:14:30 UTC

question about isr

Hi Team,
   I found a strange phenomenon with the ISR lists in my Kafka cluster.
   
   When I use the tool that Kafka provides to get the topic information, it shows the ISR list as follows, which looks fine:
   
     [irteam@xseed171.kdev bin]$ ./kafka-topics.sh --describe --zookeeper 10.96.250.215:10013,10.96.250.216:10013,10.96.250.217:10013/nelo2-kafka
 
Topic:nelo2-normal-logs PartitionCount:3        ReplicationFactor:2     Configs:
        Topic: nelo2-normal-logs        Partition: 0    Leader: 3       Replicas: 3,0   Isr: 0,3
        Topic: nelo2-normal-logs        Partition: 1    Leader: 0       Replicas: 0,1   Isr: 0,1
        Topic: nelo2-normal-logs        Partition: 2    Leader: 1       Replicas: 1,3   Isr: 1,3
 
  But when I use an SDK to get the metadata from a broker, the ISR is different:
metadata:  { size: 246, 
  correlationId: 0,
  brokerNum: -1,
  nodeId: 1,
  host: 'xseed171.kdev.nhnsystem.com',
  port: 9093,
  topicNum: 0,
  topicError: 0,
  topic: 'nelo2-normal-logs',
  partitionNum: 2,
  errorCode: 0,
  partition: 0,
  leader: 3,
  replicasNum: 2,
  replicas: [ 3, 0 ],
  isrNum: 2,
  isr: [ 0, 3 ] }
metadata:  { size: 246,
  correlationId: 0,
  brokerNum: -1,
  nodeId: 1,
  host: 'xseed171.kdev.nhnsystem.com',
  port: 9093,
  topicNum: 0,
  topicError: 0,
  topic: 'nelo2-normal-logs',
  partitionNum: 1,
  errorCode: 0,
  partition: 1,
  leader: 0,
  replicasNum: 2,
  replicas: [ 0, 1 ],
  isrNum: 2,
  isr: [ 0, 1 ] }
metadata:  { size: 246,
  correlationId: 0,
  brokerNum: -1,
  nodeId: 1,
  host: 'xseed171.kdev.nhnsystem.com',
  port: 9093,
  topicNum: 0,
  topicError: 0,
  topic: 'nelo2-normal-logs',
  partitionNum: 0,
  errorCode: 0,
  partition: 2,
  leader: 1,
  replicasNum: 2,
  replicas: [ 1, 3 ],
  isrNum: 1,
  isr: [ 1 ] }

 I also tried another SDK and got the same result. Checking the Kafka logs, it seems the SDK result is right and the tool gets the wrong result. Why does this happen?
 
[2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:7,ControllerEpoch:9),ReplicationFactor:2),AllReplicas:0,1) for partition [nelo2-normal-logs,1] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
[2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:9,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:1,3) for partition [nelo2-normal-logs,2] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
[2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:3,ISR:0,3,LeaderEpoch:10,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:3,0) for partition [nelo2-normal-logs,0] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
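
For comparison, here is a minimal sketch that asks a broker directly for its cached metadata using the 0.8.x Java client (kafka.javaapi); the host, port and topic come from the output above, while the client id and timeouts are just illustrative:

import java.util.Collections;
import kafka.cluster.Broker;
import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.TopicMetadataResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class BrokerIsrView {
    public static void main(String[] args) {
        // Connect to one broker and ask for its view of the topic metadata.
        SimpleConsumer consumer = new SimpleConsumer(
            "xseed171.kdev.nhnsystem.com", 9093, 100000, 64 * 1024, "isr-check");
        try {
            TopicMetadataRequest request = new TopicMetadataRequest(
                Collections.singletonList("nelo2-normal-logs"));
            TopicMetadataResponse response = consumer.send(request);
            for (TopicMetadata topic : response.topicsMetadata()) {
                for (PartitionMetadata part : topic.partitionsMetadata()) {
                    StringBuilder isr = new StringBuilder();
                    for (Broker b : part.isr()) {
                        isr.append(b.id()).append(' ');
                    }
                    System.out.println("partition " + part.partitionId()
                        + "  leader " + (part.leader() == null ? "none" : part.leader().id())
                        + "  isr [ " + isr + "]");
                }
            }
        } finally {
            consumer.close();
        }
    }
}

Note that kafka-topics.sh reads the ISR straight from ZooKeeper, whereas a metadata request like this returns whatever the broker has cached from the controller's UpdateMetadata requests (the TRACE lines above), so the two views can disagree when either side is stale.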
                                                                                                   
Thanks~!    

Best Regards
Jerry

Re: question about isr

Posted by Ian Friedman <ia...@flurry.com>.
Another question - If I start broker 1 back up now, with his replicas now out of sync, will he automatically catch up with the leaders or do I have to do something to catch him up?

--Ian 

> On Sun, May 11, 2014 at 10:14 PM, Ian Friedman <ia...@flurry.com> wrote:
> Jun - We're using 0.8.1 
> The timestamps in the last few entries in controller.log seem to correspond to when the trouble started in server.log
> controller.log:
> 
> [2014-05-08 19:01:27,693] INFO [SessionExpirationListener on 1], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
> 
> [2014-05-08 19:01:27,694] INFO [delete-topics-thread], Shutting down (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
> 
> server.log:
> 
> [2014-05-08 19:01:27,382] INFO Closing socket connection to /10.10.13.3. (kafka.network.Processor)
> 
> [2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2. (kafka.network.Processor)
> 
> [2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2. (kafka.network.Processor)
> 
> [2014-05-08 19:01:29,886] INFO Partition [callbackServiceTopic-Medium,27] on broker 1: Shrinking ISR for partition [callbackServiceTopic-Medium,27] from 1,2,3 to 1 (kafka.cluster.Partition)
> 
> [2014-05-08 19:01:30,109] ERROR Conditional update of path /brokers/topics/callbackServiceTopic-Medium/partitions/27/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic-Medium/partitions/27/state (kafka.utils.ZkUtils$)
> 
> [2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-Medium,27] on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> 
> [2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-High,3] on broker 1: Shrinking ISR for partition [callbackServiceTopic-High,3] from 1,2,3 to 1 (kafka.cluster.Partition)
> 
> [2014-05-08 19:01:30,111] ERROR Conditional update of path /brokers/topics/callbackServiceTopic-High/partitions/3/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic-High/partitions/3/state (kafka.utils.ZkUtils$)
> 
> I'm also concerned why this happened at all... I do see some slightly high CMS activity in the GC log but I don't see any Full GCs and the whole thing was over within a minute... I'm a bit concerned how something like this could make a broker fail this way!
> 
> One of the after effects we found from leaving the broker running in this state is that it was still allowing producers with cached metadata to produce to it, so we lost a whole bunch of messages before I shut him down. What's up with that??
> 
> 
> 
> --Ian
> 
> 
> 
> On Sun, May 11, 2014 at 11:39 PM, Jun Rao <ju...@gmail.com> wrote:
> It seems that broker 1 had a soft failure (potentially due to GC). However, somehow it didn't receive the latest LeaderAndIsrRequest from the controller. Which version of Kafka are you using? In the controller log, do you see broker 1 being added back as a live broker?
> 
> Thanks,
> 
> Jun
> 
> 
> On Fri, May 9, 2014 at 10:21 AM, Ian Friedman <ia...@flurry.com> wrote:
> This seems similar to behavior we’re seeing. At some point one of our brokers (id 1) just gives up and starts throwing those errors, and kafka-topics no longer lists it in the ISR. However, the logs for that broker say something very odd:
> 
> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic-High,8] on broker 1: Cached zkVersion [10] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic,3] on broker 1: Shrinking ISR for partition [callbackServiceTopic,3] from 1,2,3 to 1 (kafka.cluster.Partition)
> [2014-05-09 10:16:00,251] ERROR Conditional update of path /brokers/topics/callbackServiceTopic/partitions/3/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic/partitions/3/state (kafka.utils.ZkUtils$)
> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic,3] on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic-High,31] on broker 1: Shrinking ISR for partition [callbackServiceTopic-High,31] from 1,2,3 to 1 (kafka.cluster.Partition)
> [2014-05-09 10:16:00,255] ERROR Conditional update of path /brokers/topics/callbackServiceTopic-High/partitions/31/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic-High/partitions/31/state (kafka.utils.ZkUtils$)
> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-High,31] on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-Low,3] on broker 1: Shrinking ISR for partition [callbackServiceTopic-Low,3] from 1,2,3 to 1 (kafka.cluster.Partition)
> [2014-05-09 10:16:00,258] ERROR Conditional update of path /brokers/topics/callbackServiceTopic-Low/partitions/3/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic-Low/partitions/3/state (kafka.utils.ZkUtils$)
> 
> etc. And these errors continue every few seconds.
> 
> kafka-topics.sh --describe output:
> Topic:callbackServiceTopic-High PartitionCount:50       ReplicationFactor:3     Configs:
>         Topic: callbackServiceTopic-High        Partition: 0    Leader: 2       Replicas: 3,1,2 Isr: 2,3
>         Topic: callbackServiceTopic-High        Partition: 1    Leader: 2       Replicas: 1,2,3 Isr: 2,3
>         Topic: callbackServiceTopic-High        Partition: 2    Leader: 2       Replicas: 2,3,1 Isr: 2,3
> 
> I went and looked at one of those ZNodes in the zkCLI and found this:
> 
> [zk: localhost:2181(CONNECTED) 2] get /brokers/topics/callbackServiceTopic-High/partitions/31/state
> {"controller_epoch":5,"leader":2,"version":1,"leader_epoch":5,"isr":[2,3]}
> 
> What does the version number there represent and how does it get out of sync? Should I restart broker 1? Is the fact that broker 1 is behind in leader_epoch significant?
> 
> Still trying to figure out Kafka operations :(
> --Ian
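
The version in those errors is ZooKeeper's data version for the partition state znode (the dataVersion that zkCli's stat command prints). The broker updates the ISR with a conditional setData that carries the version it last read, and ZooKeeper rejects the write with BadVersionException when the znode has been modified in the meantime, typically by the controller during a leader change; the broker then logs the "Cached zkVersion ... not equal" line and skips the update. A minimal sketch of that kind of conditional update, assuming a reachable ZooKeeper; avoid running it against a live cluster, since even rewriting the same bytes bumps the version:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalStateUpdate {
    public static void main(String[] args) throws Exception {
        // Path taken from the log above; the connect string is a placeholder.
        final String path = "/brokers/topics/callbackServiceTopic-High/partitions/31/state";
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();
        try {
            Stat stat = new Stat();
            byte[] data = zk.getData(path, false, stat);
            System.out.println("dataVersion = " + stat.getVersion());
            // Conditional write: succeeds only if nobody has modified the znode
            // since we read it at stat.getVersion().
            zk.setData(path, data, stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            // This is what the broker hits: its cached zkVersion is stale because
            // the controller has already rewritten the partition state.
            System.out.println("BadVersion: cached version is stale");
        } finally {
            zk.close();
        }
    }
}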
> 
> 
> On Apr 24, 2014, at 9:26 PM, 陈小军 <ch...@nhn.com> wrote:
> 
> > I don't do any partition reassignment.
> >
> > When a broker hits the following error, this phenomenon will happen.
> >
> > [hadoop@nelo76 libs]$ [2014-03-14 12:11:44,310] INFO Partition [nelo2-normal-logs,0] on broker 0: Shrinking ISR for partition [nelo2-normal-logs,0] from 0,1 to 0 (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,313] ERROR Conditional update of path /brokers/topics/nelo2-normal-logs/partitions/0/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]} and expected version 7 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-normal-logs/partitions/0/state (kafka.utils.ZkUtils$)
> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-normal-logs,0] on broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-symbolicated-logs,1] on broker 0: Shrinking ISR for partition [nelo2-symbolicated-logs,1] from 0,2 to 0 (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,315] ERROR Conditional update of path /brokers/topics/nelo2-symbolicated-logs/partitions/1/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":6,"isr":[0]} and expected version 8 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-symbolicated-logs/partitions/1/state (kafka.utils.ZkUtils$)
> > [2014-03-14 12:11:44,315] INFO Partition [nelo2-symbolicated-logs,1] on broker 0: Cached zkVersion [8] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,316] INFO Partition [nelo2-crash-logs,1] on broker 0: Shrinking ISR for partition [nelo2-crash-logs,1] from 0,1 to 0 (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,318] ERROR Conditional update of path /brokers/topics/nelo2-crash-logs/partitions/1/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]} and expected version 7 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-crash-logs/partitions/1/state (kafka.utils.ZkUtils$)
> > [2014-03-14 12:11:44,318] INFO Partition [nelo2-crash-logs,1] on broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partit
> > Best Regards
> > Jerry
> > -----Original Message-----
> > From: "Jun Rao"&lt;junrao@gmail.com&gt;
> > To: "users@kafka.apache.org"&lt;users@kafka.apache.org&gt;; "陈小军"&lt;chenxiaojun@nhn.com&gt;;
> > Cc:
> > Sent: 2014-04-25 (星期五) 02:12:02
> > Subject: Re: question about isr
> >
> > Interesting. Which version of Kafka are you using? Were you doing some partition reassignment? Thanks, Jun
> >


Re: question about isr

Posted by Jun Rao <ju...@gmail.com>.
0.8.1.1 fixed an issue where the controller could hang during a soft failure.

Out-of-sync replicas will catch up from the leader automatically, once restarted.
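
One way to confirm that a restarted replica has rejoined the ISR is to re-read the topic metadata and flag any partition whose ISR is still smaller than its replica list; a rough sketch along the lines of the 0.8.x javaapi example earlier in the thread, with the broker address and topic as placeholders:

import java.util.Collections;
import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.TopicMetadataResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class UnderReplicatedCheck {
    public static void main(String[] args) {
        SimpleConsumer consumer = new SimpleConsumer(
            "broker-host", 9092, 100000, 64 * 1024, "ur-check");
        try {
            TopicMetadataResponse response = consumer.send(new TopicMetadataRequest(
                Collections.singletonList("callbackServiceTopic-High")));
            for (TopicMetadata topic : response.topicsMetadata()) {
                for (PartitionMetadata part : topic.partitionsMetadata()) {
                    // A partition is under-replicated while some replica is still catching up.
                    if (part.isr().size() < part.replicas().size()) {
                        System.out.println(topic.topic() + "-" + part.partitionId()
                            + " is under-replicated");
                    }
                }
            }
        } finally {
            consumer.close();
        }
    }
}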

Thanks,

Jun



Re: question about isr

Posted by Ian Friedman <ia...@flurry.com>.
You're probably right that it has to be GC, given the CMS activity I saw in the log, but I didn't see a single concurrent mode failure, which bothers me that we had this happen anyway... Also, the ZK timeout is set to a very large number... I dunno, seems weird.

I will see what I can do about getting 0.8.1.1 deployed... how do you expect it to address this problem?

--Ian



Re: question about isr

Posted by Jun Rao <ju...@gmail.com>.
Could you try 0.8.1.1, which fixed some bugs related to the controller hanging
during shutdown?

For ZK session expiration, the #1 cause is GC. We have also seen transient
issues with the underlying storage (e.g. a RAID controller reset) cause ZK
session expiration.

As for the data loss in the producer, you are probably using ack=1, which
could lead to data loss during leader failover. It's just that in this case,
the failover window is forever due to some bugs.
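
For reference, that ack setting is the 0.8 producer's request.required.acks property; a minimal sketch of a producer configured to wait for the full ISR instead of just the leader (broker list, topic and message are placeholders):

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class AckedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // 1 acknowledges the leader only and can lose messages on leader failover;
        // -1 waits for the full ISR and trades latency for durability.
        props.put("request.required.acks", "-1");

        Producer<String, String> producer =
            new Producer<String, String>(new ProducerConfig(props));
        try {
            producer.send(new KeyedMessage<String, String>("some-topic", "key", "value"));
        } finally {
            producer.close();
        }
    }
}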

Thanks,

Jun


On Sun, May 11, 2014 at 10:14 PM, Ian Friedman <ia...@flurry.com> wrote:

> Jun - We're using 0.8.1
> The timestamps in the last few entries in controller.log seem to
> correspond to when the trouble started in server.log
>
> controller.log:
>
> [2014-05-08 19:01:27,693] INFO [SessionExpirationListener on 1], ZK
> expired; shut down all controller components and try to re-elect
> (kafka.controller.KafkaController$SessionExpirationListener)
>
> [2014-05-08 19:01:27,694] INFO [delete-topics-thread], Shutting down
> (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
>
> server.log:
>
> [2014-05-08 19:01:27,382] INFO Closing socket connection to /10.10.13.3.
> (kafka.network.Processor)
>
> [2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2.
> (kafka.network.Processor)
>
> [2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2.
> (kafka.network.Processor)
>
> [2014-05-08 19:01:29,886] INFO Partition [callbackServiceTopic-Medium,27]
> on broker 1: Shrinking ISR for partition [callbackServiceTopic-Medium,27]
> from 1,2,3 to 1 (kafka.cluster.Partition)
>
> [2014-05-08 19:01:30,109] ERROR Conditional update of path
> /brokers/topics/callbackServiceTopic-Medium/partitions/27/state with data
> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
> and expected version 9 failed due to
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for
> /brokers/topics/callbackServiceTopic-Medium/partitions/27/state
> (kafka.utils.ZkUtils$)
>
> [2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-Medium,27]
> on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
> updating ISR (kafka.cluster.Partition)
>
> [2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-High,3] on
> broker 1: Shrinking ISR for partition [callbackServiceTopic-High,3] from
> 1,2,3 to 1 (kafka.cluster.Partition)
>
> [2014-05-08 19:01:30,111] ERROR Conditional update of path
> /brokers/topics/callbackServiceTopic-High/partitions/3/state with data
> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
> and expected version 9 failed due to
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /brokers/topics/callbackServiceTopic-High/partitions/3/state
> (kafka.utils.ZkUtils$)
>
> I'm also concerned why this happened at all... I do see some slightly high
> CMS activity in the GC log but I don't see any Full GCs and the whole thing
> was over within a minute... I'm a bit concerned how something like this
> could make a broker fail this way!
>
> One of the after effects we found from leaving the broker running in this
> state is that it was still allowing producers with cached metadata to
> produce to it, so we lost a whole bunch of messages before I shut him down.
> What's up with that??
>
>
> --Ian
>
>
> On Sun, May 11, 2014 at 11:39 PM, Jun Rao <ju...@gmail.com> wrote:
>
>> It seems that broker 1 had a soft failure (potentially due to GC).
>> However, somehow it didn't receive the latest LeaderAndIsrRequest from the
>> controller. Which version of Kafka are you using? In the controller log, do
>> you see broker 1 being added back as a live broker?
>>
>> Thanks,
>>
>> Jun
>>
>>
>> On Fri, May 9, 2014 at 10:21 AM, Ian Friedman <ia...@flurry.com> wrote:
>>
>>> This seems similar to behavior we’re seeing. At some point one of our
>>> brokers (id 1) just gives up and starts throwing those errors and
>>> kafka-topics no longer lists it as a ISR. However the logs for that broker
>>> say something very odd:
>>>
>>> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic-High,8]
>>> on broker 1: Cached zkVersion [10] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic,3] on
>>> broker 1: Shrinking ISR for partition [callbackServiceTopic,3] from 1,2,3
>>> to 1 (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,251] ERROR Conditional update of path
>>> /brokers/topics/callbackServiceTopic/partitions/3/state with data
>>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>>> and expected version 9 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for /brokers/topics/callbackServiceTopic/partitions/3/state
>>> (kafka.utils.ZkUtils$)
>>> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic,3] on
>>> broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic-High,31]
>>> on broker 1: Shrinking ISR for partition [callbackServiceTopic-High,31]
>>> from 1,2,3 to 1 (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,255] ERROR Conditional update of path
>>> /brokers/topics/callbackServiceTopic-High/partitions/31/state with data
>>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>>> and expected version 9 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for
>>> /brokers/topics/callbackServiceTopic-High/partitions/31/state
>>> (kafka.utils.ZkUtils$)
>>> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-High,31]
>>> on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-Low,3] on
>>> broker 1: Shrinking ISR for partition [callbackServiceTopic-Low,3] from
>>> 1,2,3 to 1 (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,258] ERROR Conditional update of path
>>> /brokers/topics/callbackServiceTopic-Low/partitions/3/state with data
>>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>>> and expected version 9 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for /brokers/topics/callbackServiceTopic-Low/partitions/3/state
>>> (kafka.utils.ZkUtils$)
>>>
>>> etc. And these errors continue every few seconds.
>>>
>>> kafka-topics.sh —describe output:
>>> Topic:callbackServiceTopic-High PartitionCount:50
>>> ReplicationFactor:3     Configs:
>>>         Topic: callbackServiceTopic-High        Partition: 0    Leader:
>>> 2       Replicas: 3,1,2 Isr: 2,3
>>>         Topic: callbackServiceTopic-High        Partition: 1    Leader:
>>> 2       Replicas: 1,2,3 Isr: 2,3
>>>         Topic: callbackServiceTopic-High        Partition: 2    Leader:
>>> 2       Replicas: 2,3,1 Isr: 2,3
>>>
>>> I went and looked at one of those ZNodes in the zkCLI and found this:
>>>
>>> [zk: localhost:2181(CONNECTED) 2] get
>>> /brokers/topics/callbackServiceTopic-High/partitions/31/state
>>>
>>> {"controller_epoch":5,"leader":2,"version":1,"leader_epoch":5,"isr":[2,3]}
>>>
>>> What does the version number there represent and how does it get out of
>>> sync? Should I restart broker 1? Is the fact that broker 1 is behind in
>>> leader_epoch significant?
>>>
>>> Still trying to figure out Kafka operations :(
>>> —Ian
>>>
>>>
>>> On Apr 24, 2014, at 9:26 PM, 陈小军 <ch...@nhn.com> wrote:
>>>
>>> > I don't do any partition reassignment.
>>> >
>>> > When broker occure following error, this phenomenon  will happen.
>>> >
>>> > [hadoop@nelo76 libs]$ [2014-03-14 12:11:44,310] INFO Partition
>>> [nelo2-normal-logs,0] on broker 0: Shrinking ISR for partition
>>> [nelo2-normal-logs,0] from 0,1 to 0 (kafka.cluster.Partition)
>>> > [2014-03-14 12:11:44,313] ERROR Conditional update of path
>>> /brokers/topics/nelo2-normal-logs/partitions/0/state with data
>>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
>>> and expected version 7 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for /brokers/topics/nelo2-normal-logs/partitions/0/state
>>> (kafka.utils.ZkUtils$)
>>> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-normal-logs,0] on
>>> broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partition)
>>> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-symbolicated-logs,1]
>>> on broker 0: Shrinking ISR for partition [nelo2-symbolicated-logs,1] from
>>> 0,2 to 0 (kafka.cluster.Partition)
>>> > [2014-03-14 12:11:44,315] ERROR Conditional update of path
>>> /brokers/topics/nelo2-symbolicated-logs/partitions/1/state with data
>>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":6,"isr":[0]}
>>> and expected version 8 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for /brokers/topics/nelo2-symbolicated-logs/partitions/1/state
>>> (kafka.utils.ZkUtils$)
>>> > [2014-03-14 12:11:44,315] INFO Partition [nelo2-symbolicated-logs,1]
>>> on broker 0: Cached zkVersion [8] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partition)
>>> > [2014-03-14 12:11:44,316] INFO Partition [nelo2-crash-logs,1] on
>>> broker 0: Shrinking ISR for partition [nelo2-crash-logs,1] from 0,1 to 0
>>> (kafka.cluster.Partition)
>>> > [2014-03-14 12:11:44,318] ERROR Conditional update of path
>>> /brokers/topics/nelo2-crash-logs/partitions/1/state with data
>>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
>>> and expected version 7 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for /brokers/topics/nelo2-crash-logs/partitions/1/state
>>> (kafka.utils.ZkUtils$)
>>> > [2014-03-14 12:11:44,318] INFO Partition [nelo2-crash-logs,1] on
>>> broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partit
>>> > Best Regards
>>> > Jerry
>>> > -----Original Message-----
>>> > From: "Jun Rao"&lt;junrao@gmail.com&gt;
>>> > To: "users@kafka.apache.org"&lt;users@kafka.apache.org&gt;; "陈小军"&
>>> lt;chenxiaojun@nhn.com&gt;;
>>> > Cc:
>>> > Sent: 2014-04-25 (星期五) 02:12:02
>>> > Subject: Re: question about isr
>>> >
>>> > Interesting. Which version of Kafka are you using? Were you doing some
>>> partition reassignment? Thanks, Jun
>>> >
>>> >
>>> > On Wed, Apr 23, 2014 at 11:14 PM, 陈小军 <chenxiaojun@nhn.com>
>>> wrote:
>>> >
>>> > Hi Team,
>>> >
>>> >   I found a strange phenomenon of isr list in my kafka cluster
>>> >
>>> >
>>> >
>>> >   When I use the tool that kafka provide to get the topic information,
>>> and it show isr list as following, seem it is ok
>>> >
>>> >
>>> >
>>> >     [irteam@xseed171.kdev bin]$ ./kafka-topics.sh --describe
>>> --zookeeper 10.96.250.215:10013,10.96.250.216:10013,
>>> 10.96.250.217:10013/nelo2-kafka
>>> >
>>> >
>>> >
>>> >
>>> > Topic:nelo2-normal-logs PartitionCount:3        ReplicationFactor:2
>>>   Configs:
>>> >
>>> >        Topic: nelo2-normal-logs        Partition: 0    Leader: 3
>>> Replicas: 3,0   Isr: 0,3
>>> >
>>> >        Topic: nelo2-normal-logs        Partition: 1    Leader: 0
>>> Replicas: 0,1   Isr: 0,1
>>> >
>>> >        Topic: nelo2-normal-logs        Partition: 2    Leader: 1
>>> Replicas: 1,3   Isr: 1,3
>>> >
>>> >
>>> >
>>> >  but when I use some sdk to get the meta info from broker, the isr is
>>> different.
>>> >
>>> > metadata:  { size: 246,
>>> >
>>> >  correlationId: 0,
>>> >
>>> >  brokerNum: -1,
>>> >
>>> >  nodeId: 1,
>>> >
>>> >  host: 'xseed171.kdev.nhnsystem.com',
>>> >
>>> >  port: 9093,
>>> >
>>> >  topicNum: 0,
>>> >
>>> >  topicError: 0,
>>> >
>>> >  topic: 'nelo2-normal-logs',
>>> >
>>> >  partitionNum: 2,
>>> >
>>> >  errorCode: 0,
>>> >
>>> >  partition: 0,
>>> >
>>> >  leader: 3,
>>> >
>>> >  replicasNum: 2,
>>> >
>>> >  replicas: [ 3, 0 ],
>>> >
>>> >  isrNum: 2,
>>> >
>>> >  isr: [ 0, 3 ] }
>>> >
>>> > metadata:  { size: 246,
>>> >
>>> >  correlationId: 0,
>>> >
>>> >  brokerNum: -1,
>>> >
>>> >  nodeId: 1,
>>> >
>>> >  host: 'xseed171.kdev.nhnsystem.com',
>>> >
>>> >  port: 9093,
>>> >
>>> >  topicNum: 0,
>>> >
>>> >  topicError: 0,
>>> >
>>> >  topic: 'nelo2-normal-logs',
>>> >
>>> >  partitionNum: 1,
>>> >
>>> >  errorCode: 0,
>>> >
>>> >  partition: 1,
>>> >
>>> >  leader: 0,
>>> >
>>> >  replicasNum: 2,
>>> >
>>> >  replicas: [ 0, 1 ],
>>> >
>>> >  isrNum: 2,
>>> >
>>> >  isr: [ 0, 1 ] }
>>> >
>>> > metadata:  { size: 246,
>>> >
>>> >  correlationId: 0,
>>> >
>>> >  brokerNum: -1,
>>> >
>>> >  nodeId: 1,
>>> >
>>> >  host: 'xseed171.kdev.nhnsystem.com',
>>> >
>>> >  port: 9093,
>>> >
>>> >  topicNum: 0,
>>> >
>>> >  topicError: 0,
>>> >
>>> >  topic: 'nelo2-normal-logs',
>>> >
>>> >  partitionNum: 0,
>>> >
>>> >  errorCode: 0,
>>> >
>>> >  partition: 2,
>>> >
>>> >  leader: 1,
>>> >
>>> >  replicasNum: 2,
>>> >
>>> >  replicas: [ 1, 3 ],
>>> >
>>> >  isrNum: 1,
>>> >
>>> >  isr: [ 1 ] }
>>> >
>>> >
>>> >
>>> > I also use other sdk, get the same result. I check the logs from
>>> kafka, it seems the sdk result is right. the tool get the wrong result. why
>>> is it happend?
>>> >
>>> >
>>> >
>>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>>> (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:7,ControllerEpoch:9),ReplicationFactor:2),AllReplicas:0,1)
>>> for partition [nelo2-normal-logs,1] in response to UpdateMetadata request
>>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>>> >
>>> >
>>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>>> (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:9,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:1,3)
>>> for partition [nelo2-normal-logs,2] in response to UpdateMetadata request
>>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>>> >
>>> >
>>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>>> (LeaderAndIsrInfo:(Leader:3,ISR:0,3,LeaderEpoch:10,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:3,0)
>>> for partition [nelo2-normal-logs,0] in response to UpdateMetadata request
>>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>>> >
>>> >
>>> >
>>> >
>>> > Thanks~!
>>> >
>>> >
>>> >
>>> > Best Regards
>>> >
>>> > Jerry
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>
>

Re: question about isr

Posted by Ian Friedman <ia...@flurry.com>.
Jun - We're using 0.8.1.
The timestamps of the last few entries in controller.log seem to correspond
to when the trouble started in server.log.

controller.log:

[2014-05-08 19:01:27,693] INFO [SessionExpirationListener on 1], ZK
expired; shut down all controller components and try to re-elect
(kafka.controller.KafkaController$SessionExpirationListener)

[2014-05-08 19:01:27,694] INFO [delete-topics-thread], Shutting down
(kafka.controller.TopicDeletionManager$DeleteTopicsThread)

server.log:

[2014-05-08 19:01:27,382] INFO Closing socket connection to /10.10.13.3.
(kafka.network.Processor)

[2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2.
(kafka.network.Processor)

[2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2.
(kafka.network.Processor)

[2014-05-08 19:01:29,886] INFO Partition [callbackServiceTopic-Medium,27]
on broker 1: Shrinking ISR for partition [callbackServiceTopic-Medium,27]
from 1,2,3 to 1 (kafka.cluster.Partition)

[2014-05-08 19:01:30,109] ERROR Conditional update of path
/brokers/topics/callbackServiceTopic-Medium/partitions/27/state with data
{"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
and expected version 9 failed due to
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
BadVersion for
/brokers/topics/callbackServiceTopic-Medium/partitions/27/state
(kafka.utils.ZkUtils$)

[2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-Medium,27]
on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
updating ISR (kafka.cluster.Partition)

[2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-High,3] on
broker 1: Shrinking ISR for partition [callbackServiceTopic-High,3] from
1,2,3 to 1 (kafka.cluster.Partition)

[2014-05-08 19:01:30,111] ERROR Conditional update of path
/brokers/topics/callbackServiceTopic-High/partitions/3/state with data
{"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
and expected version 9 failed due to
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
BadVersion for /brokers/topics/callbackServiceTopic-High/partitions/3/state
(kafka.utils.ZkUtils$)
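
For anyone else hitting this: that ERROR is ZooKeeper's conditional write
being rejected. The broker passes the zkVersion it has cached as the expected
version, and ZooKeeper refuses the update because the znode has since been
rewritten, presumably when the controller re-elected leaders after broker 1's
session expired. A rough zkCli equivalent, purely illustrative:

# roughly what the broker attempts; the trailing 9 is its cached zkVersion
set /brokers/topics/callbackServiceTopic-Low/partitions/3/state {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} 9
# the znode's current dataVersion is no longer 9, so ZooKeeper answers
# BadVersion and the broker logs "skip updating ISR"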

I'm also concerned about why this happened at all... I do see slightly
elevated CMS activity in the GC log, but no Full GCs, and the whole episode
was over within a minute... I'm a bit concerned that something like this
could make a broker fail this way!
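
If GC really is the trigger, the window is small: the broker's ZooKeeper
session only has to go quiet for zookeeper.session.timeout.ms, which I
believe defaults to 6 seconds in 0.8, so even a short stop-the-world pause
plus some network jitter can expire it. Illustrative server.properties lines;
the hosts are placeholders and the timeouts are, as far as I know, the
defaults:

# server.properties (hosts are placeholders)
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.session.timeout.ms=6000
zookeeper.connection.timeout.ms=6000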

One of the aftereffects we found from leaving the broker running in this
state is that it kept accepting writes from producers with cached metadata,
so we lost a whole bunch of messages before I shut it down. What's up with
that??
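
Part of that is the 0.8 producer's metadata caching: it only refreshes on an
error or every topic.metadata.refresh.interval.ms (10 minutes by default, I
believe), and a stale leader that still happily accepts writes never returns
an error. A sketch of the producer settings involved; the broker list and
values are illustrative, not a prescription:

# 0.8 producer config; broker list is a placeholder
metadata.broker.list=broker1:9092,broker2:9092,broker3:9092
# refresh cached metadata more often than the default
topic.metadata.refresh.interval.ms=60000
# only count a send as successful once the in-sync replicas have it
request.required.acks=-1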


--Ian


On Sun, May 11, 2014 at 11:39 PM, Jun Rao <ju...@gmail.com> wrote:

> It seems that broker 1 had a soft failure (potentially due to GC).
> However, somehow it didn't receive the latest LeaderAndIsrRequest from the
> controller. Which version of Kafka are you using? In the controller log, do
> you see broker 1 being added back as a live broker?
>
> Thanks,
>
> Jun
>
>
> On Fri, May 9, 2014 at 10:21 AM, Ian Friedman <ia...@flurry.com> wrote:
>
>> This seems similar to behavior we’re seeing. At some point one of our
>> brokers (id 1) just gives up and starts throwing those errors and
>> kafka-topics no longer lists it as a ISR. However the logs for that broker
>> say something very odd:
>>
>> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic-High,8] on
>> broker 1: Cached zkVersion [10] not equal to that in zookeeper, skip
>> updating ISR (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic,3] on
>> broker 1: Shrinking ISR for partition [callbackServiceTopic,3] from 1,2,3
>> to 1 (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,251] ERROR Conditional update of path
>> /brokers/topics/callbackServiceTopic/partitions/3/state with data
>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>> and expected version 9 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /brokers/topics/callbackServiceTopic/partitions/3/state
>> (kafka.utils.ZkUtils$)
>> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic,3] on
>> broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
>> updating ISR (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic-High,31]
>> on broker 1: Shrinking ISR for partition [callbackServiceTopic-High,31]
>> from 1,2,3 to 1 (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,255] ERROR Conditional update of path
>> /brokers/topics/callbackServiceTopic-High/partitions/31/state with data
>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>> and expected version 9 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for
>> /brokers/topics/callbackServiceTopic-High/partitions/31/state
>> (kafka.utils.ZkUtils$)
>> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-High,31]
>> on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
>> updating ISR (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-Low,3] on
>> broker 1: Shrinking ISR for partition [callbackServiceTopic-Low,3] from
>> 1,2,3 to 1 (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,258] ERROR Conditional update of path
>> /brokers/topics/callbackServiceTopic-Low/partitions/3/state with data
>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>> and expected version 9 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /brokers/topics/callbackServiceTopic-Low/partitions/3/state
>> (kafka.utils.ZkUtils$)
>>
>> etc. And these errors continue every few seconds.
>>
>> kafka-topics.sh --describe output:
>> Topic:callbackServiceTopic-High PartitionCount:50
>> ReplicationFactor:3     Configs:
>>         Topic: callbackServiceTopic-High        Partition: 0    Leader: 2
>>       Replicas: 3,1,2 Isr: 2,3
>>         Topic: callbackServiceTopic-High        Partition: 1    Leader: 2
>>       Replicas: 1,2,3 Isr: 2,3
>>         Topic: callbackServiceTopic-High        Partition: 2    Leader: 2
>>       Replicas: 2,3,1 Isr: 2,3
>>
>> I went and looked at one of those ZNodes in the zkCLI and found this:
>>
>> [zk: localhost:2181(CONNECTED) 2] get
>> /brokers/topics/callbackServiceTopic-High/partitions/31/state
>> {"controller_epoch":5,"leader":2,"version":1,"leader_epoch":5,"isr":[2,3]}
>>
>> What does the version number there represent and how does it get out of
>> sync? Should I restart broker 1? Is the fact that broker 1 is behind in
>> leader_epoch significant?
>>
>> Still trying to figure out Kafka operations :(
>> —Ian
>>
>>
>> On Apr 24, 2014, at 9:26 PM, 陈小军 <ch...@nhn.com> wrote:
>>
>> > I don't do any partition reassignment.
>> >
>> > When broker occure following error, this phenomenon  will happen.
>> >
>> > [hadoop@nelo76 libs]$ [2014-03-14 12:11:44,310] INFO Partition
>> [nelo2-normal-logs,0] on broker 0: Shrinking ISR for partition
>> [nelo2-normal-logs,0] from 0,1 to 0 (kafka.cluster.Partition)
>> > [2014-03-14 12:11:44,313] ERROR Conditional update of path
>> /brokers/topics/nelo2-normal-logs/partitions/0/state with data
>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
>> and expected version 7 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /brokers/topics/nelo2-normal-logs/partitions/0/state
>> (kafka.utils.ZkUtils$)
>> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-normal-logs,0] on
>> broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip
>> updating ISR (kafka.cluster.Partition)
>> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-symbolicated-logs,1] on
>> broker 0: Shrinking ISR for partition [nelo2-symbolicated-logs,1] from 0,2
>> to 0 (kafka.cluster.Partition)
>> > [2014-03-14 12:11:44,315] ERROR Conditional update of path
>> /brokers/topics/nelo2-symbolicated-logs/partitions/1/state with data
>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":6,"isr":[0]}
>> and expected version 8 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /brokers/topics/nelo2-symbolicated-logs/partitions/1/state
>> (kafka.utils.ZkUtils$)
>> > [2014-03-14 12:11:44,315] INFO Partition [nelo2-symbolicated-logs,1] on
>> broker 0: Cached zkVersion [8] not equal to that in zookeeper, skip
>> updating ISR (kafka.cluster.Partition)
>> > [2014-03-14 12:11:44,316] INFO Partition [nelo2-crash-logs,1] on broker
>> 0: Shrinking ISR for partition [nelo2-crash-logs,1] from 0,1 to 0
>> (kafka.cluster.Partition)
>> > [2014-03-14 12:11:44,318] ERROR Conditional update of path
>> /brokers/topics/nelo2-crash-logs/partitions/1/state with data
>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
>> and expected version 7 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /brokers/topics/nelo2-crash-logs/partitions/1/state
>> (kafka.utils.ZkUtils$)
>> > [2014-03-14 12:11:44,318] INFO Partition [nelo2-crash-logs,1] on broker
>> 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR
>> (kafka.cluster.Partit
>> > Best Regards
>> > Jerry
>> > -----Original Message-----
>> > From: "Jun Rao"&lt;junrao@gmail.com&gt;
>> > To: "users@kafka.apache.org"&lt;users@kafka.apache.org&gt;; "陈小军"&
>> lt;chenxiaojun@nhn.com&gt;;
>> > Cc:
>> > Sent: 2014-04-25 (星期五) 02:12:02
>> > Subject: Re: question about isr
>> >
>> > Interesting. Which version of Kafka are you using? Were you doing some
>> partition reassignment? Thanks, Jun
>> >
>> >
>> > On Wed, Apr 23, 2014 at 11:14 PM, 陈小军 <chenxiaojun@nhn.com>
>> wrote:
>> >
>> > Hi Team,
>> >
>> >   I found a strange phenomenon of isr list in my kafka cluster
>> >
>> >
>> >
>> >   When I use the tool that kafka provide to get the topic information,
>> and it show isr list as following, seem it is ok
>> >
>> >
>> >
>> >     [irteam@xseed171.kdev bin]$ ./kafka-topics.sh --describe
>> --zookeeper 10.96.250.215:10013,10.96.250.216:10013,
>> 10.96.250.217:10013/nelo2-kafka
>> >
>> >
>> >
>> >
>> > Topic:nelo2-normal-logs PartitionCount:3        ReplicationFactor:2
>> Configs:
>> >
>> >        Topic: nelo2-normal-logs        Partition: 0    Leader: 3
>> Replicas: 3,0   Isr: 0,3
>> >
>> >        Topic: nelo2-normal-logs        Partition: 1    Leader: 0
>> Replicas: 0,1   Isr: 0,1
>> >
>> >        Topic: nelo2-normal-logs        Partition: 2    Leader: 1
>> Replicas: 1,3   Isr: 1,3
>> >
>> >
>> >
>> >  but when I use some sdk to get the meta info from broker, the isr is
>> different.
>> >
>> > metadata:  { size: 246,
>> >
>> >  correlationId: 0,
>> >
>> >  brokerNum: -1,
>> >
>> >  nodeId: 1,
>> >
>> >  host: 'xseed171.kdev.nhnsystem.com',
>> >
>> >  port: 9093,
>> >
>> >  topicNum: 0,
>> >
>> >  topicError: 0,
>> >
>> >  topic: 'nelo2-normal-logs',
>> >
>> >  partitionNum: 2,
>> >
>> >  errorCode: 0,
>> >
>> >  partition: 0,
>> >
>> >  leader: 3,
>> >
>> >  replicasNum: 2,
>> >
>> >  replicas: [ 3, 0 ],
>> >
>> >  isrNum: 2,
>> >
>> >  isr: [ 0, 3 ] }
>> >
>> > metadata:  { size: 246,
>> >
>> >  correlationId: 0,
>> >
>> >  brokerNum: -1,
>> >
>> >  nodeId: 1,
>> >
>> >  host: 'xseed171.kdev.nhnsystem.com',
>> >
>> >  port: 9093,
>> >
>> >  topicNum: 0,
>> >
>> >  topicError: 0,
>> >
>> >  topic: 'nelo2-normal-logs',
>> >
>> >  partitionNum: 1,
>> >
>> >  errorCode: 0,
>> >
>> >  partition: 1,
>> >
>> >  leader: 0,
>> >
>> >  replicasNum: 2,
>> >
>> >  replicas: [ 0, 1 ],
>> >
>> >  isrNum: 2,
>> >
>> >  isr: [ 0, 1 ] }
>> >
>> > metadata:  { size: 246,
>> >
>> >  correlationId: 0,
>> >
>> >  brokerNum: -1,
>> >
>> >  nodeId: 1,
>> >
>> >  host: 'xseed171.kdev.nhnsystem.com',
>> >
>> >  port: 9093,
>> >
>> >  topicNum: 0,
>> >
>> >  topicError: 0,
>> >
>> >  topic: 'nelo2-normal-logs',
>> >
>> >  partitionNum: 0,
>> >
>> >  errorCode: 0,
>> >
>> >  partition: 2,
>> >
>> >  leader: 1,
>> >
>> >  replicasNum: 2,
>> >
>> >  replicas: [ 1, 3 ],
>> >
>> >  isrNum: 1,
>> >
>> >  isr: [ 1 ] }
>> >
>> >
>> >
>> > I also use other sdk, get the same result. I check the logs from kafka,
>> it seems the sdk result is right. the tool get the wrong result. why is it
>> happend?
>> >
>> >
>> >
>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>> (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:7,ControllerEpoch:9),ReplicationFactor:2),AllReplicas:0,1)
>> for partition [nelo2-normal-logs,1] in response to UpdateMetadata request
>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>> >
>> >
>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>> (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:9,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:1,3)
>> for partition [nelo2-normal-logs,2] in response to UpdateMetadata request
>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>> >
>> >
>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>> (LeaderAndIsrInfo:(Leader:3,ISR:0,3,LeaderEpoch:10,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:3,0)
>> for partition [nelo2-normal-logs,0] in response to UpdateMetadata request
>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>> >
>> >
>> >
>> >
>> > Thanks~!
>> >
>> >
>> >
>> > Best Regards
>> >
>> > Jerry
>> >
>> >
>> >
>> >
>>
>>
>

Re: question about isr

Posted by Jun Rao <ju...@gmail.com>.
It seems that broker 1 had a soft failure (potentially due to GC). However,
somehow it didn't receive the latest LeaderAndIsrRequest from the
controller. Which version of Kafka are you using? In the controller log, do
you see broker 1 being added back as a live broker?
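
Roughly, you can check both places like this (the grep string is approximate,
and live brokers register ephemeral znodes under /brokers/ids):

# on the current controller host; exact log wording may differ
grep -i "broker change" <kafka-log-dir>/controller.log | tail

# in zkCli
ls /brokers/ids
get /brokers/ids/1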

Thanks,

Jun


On Fri, May 9, 2014 at 10:21 AM, Ian Friedman <ia...@flurry.com> wrote:

> This seems similar to behavior we’re seeing. At some point one of our
> brokers (id 1) just gives up and starts throwing those errors and
> kafka-topics no longer lists it as a ISR. However the logs for that broker
> say something very odd:
>
> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic-High,8] on
> broker 1: Cached zkVersion [10] not equal to that in zookeeper, skip
> updating ISR (kafka.cluster.Partition)
> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic,3] on
> broker 1: Shrinking ISR for partition [callbackServiceTopic,3] from 1,2,3
> to 1 (kafka.cluster.Partition)
> [2014-05-09 10:16:00,251] ERROR Conditional update of path
> /brokers/topics/callbackServiceTopic/partitions/3/state with data
> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
> and expected version 9 failed due to
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /brokers/topics/callbackServiceTopic/partitions/3/state
> (kafka.utils.ZkUtils$)
> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic,3] on
> broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
> updating ISR (kafka.cluster.Partition)
> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic-High,31] on
> broker 1: Shrinking ISR for partition [callbackServiceTopic-High,31] from
> 1,2,3 to 1 (kafka.cluster.Partition)
> [2014-05-09 10:16:00,255] ERROR Conditional update of path
> /brokers/topics/callbackServiceTopic-High/partitions/31/state with data
> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
> and expected version 9 failed due to
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for
> /brokers/topics/callbackServiceTopic-High/partitions/31/state
> (kafka.utils.ZkUtils$)
> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-High,31] on
> broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
> updating ISR (kafka.cluster.Partition)
> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-Low,3] on
> broker 1: Shrinking ISR for partition [callbackServiceTopic-Low,3] from
> 1,2,3 to 1 (kafka.cluster.Partition)
> [2014-05-09 10:16:00,258] ERROR Conditional update of path
> /brokers/topics/callbackServiceTopic-Low/partitions/3/state with data
> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
> and expected version 9 failed due to
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /brokers/topics/callbackServiceTopic-Low/partitions/3/state
> (kafka.utils.ZkUtils$)
>
> etc. And these errors continue every few seconds.
>
> kafka-topics.sh --describe output:
> Topic:callbackServiceTopic-High PartitionCount:50
> ReplicationFactor:3     Configs:
>         Topic: callbackServiceTopic-High        Partition: 0    Leader: 2
>       Replicas: 3,1,2 Isr: 2,3
>         Topic: callbackServiceTopic-High        Partition: 1    Leader: 2
>       Replicas: 1,2,3 Isr: 2,3
>         Topic: callbackServiceTopic-High        Partition: 2    Leader: 2
>       Replicas: 2,3,1 Isr: 2,3
>
> I went and looked at one of those ZNodes in the zkCLI and found this:
>
> [zk: localhost:2181(CONNECTED) 2] get
> /brokers/topics/callbackServiceTopic-High/partitions/31/state
> {"controller_epoch":5,"leader":2,"version":1,"leader_epoch":5,"isr":[2,3]}
>
> What does the version number there represent and how does it get out of
> sync? Should I restart broker 1? Is the fact that broker 1 is behind in
> leader_epoch significant?
>
> Still trying to figure out Kafka operations :(
> —Ian
>
>
> On Apr 24, 2014, at 9:26 PM, 陈小军 <ch...@nhn.com> wrote:
>
> > I don't do any partition reassignment.
> >
> > When broker occure following error, this phenomenon  will happen.
> >
> > [hadoop@nelo76 libs]$ [2014-03-14 12:11:44,310] INFO Partition
> [nelo2-normal-logs,0] on broker 0: Shrinking ISR for partition
> [nelo2-normal-logs,0] from 0,1 to 0 (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,313] ERROR Conditional update of path
> /brokers/topics/nelo2-normal-logs/partitions/0/state with data
> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
> and expected version 7 failed due to
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /brokers/topics/nelo2-normal-logs/partitions/0/state
> (kafka.utils.ZkUtils$)
> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-normal-logs,0] on broker
> 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-symbolicated-logs,1] on
> broker 0: Shrinking ISR for partition [nelo2-symbolicated-logs,1] from 0,2
> to 0 (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,315] ERROR Conditional update of path
> /brokers/topics/nelo2-symbolicated-logs/partitions/1/state with data
> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":6,"isr":[0]}
> and expected version 8 failed due to
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /brokers/topics/nelo2-symbolicated-logs/partitions/1/state
> (kafka.utils.ZkUtils$)
> > [2014-03-14 12:11:44,315] INFO Partition [nelo2-symbolicated-logs,1] on
> broker 0: Cached zkVersion [8] not equal to that in zookeeper, skip
> updating ISR (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,316] INFO Partition [nelo2-crash-logs,1] on broker
> 0: Shrinking ISR for partition [nelo2-crash-logs,1] from 0,1 to 0
> (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,318] ERROR Conditional update of path
> /brokers/topics/nelo2-crash-logs/partitions/1/state with data
> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
> and expected version 7 failed due to
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /brokers/topics/nelo2-crash-logs/partitions/1/state
> (kafka.utils.ZkUtils$)
> > [2014-03-14 12:11:44,318] INFO Partition [nelo2-crash-logs,1] on broker
> 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partit
> > Best Regards
> > Jerry
> > -----Original Message-----
> > From: "Jun Rao"&lt;junrao@gmail.com&gt;
> > To: "users@kafka.apache.org"&lt;users@kafka.apache.org&gt;; "陈小军"&
> lt;chenxiaojun@nhn.com&gt;;
> > Cc:
> > Sent: 2014-04-25 (星期五) 02:12:02
> > Subject: Re: question about isr
> >
> > Interesting. Which version of Kafka are you using? Were you doing some
> partition reassignment? Thanks, Jun
> >
> >
> > On Wed, Apr 23, 2014 at 11:14 PM, 陈小军 <chenxiaojun@nhn.com> wrote:
> >
> > Hi Team,
> >
> >   I found a strange phenomenon of isr list in my kafka cluster
> >
> >
> >
> >   When I use the tool that kafka provide to get the topic information,
> and it show isr list as following, seem it is ok
> >
> >
> >
> >     [irteam@xseed171.kdev bin]$ ./kafka-topics.sh --describe
> --zookeeper 10.96.250.215:10013,10.96.250.216:10013,
> 10.96.250.217:10013/nelo2-kafka
> >
> >
> >
> >
> > Topic:nelo2-normal-logs PartitionCount:3        ReplicationFactor:2
> Configs:
> >
> >        Topic: nelo2-normal-logs        Partition: 0    Leader: 3
> Replicas: 3,0   Isr: 0,3
> >
> >        Topic: nelo2-normal-logs        Partition: 1    Leader: 0
> Replicas: 0,1   Isr: 0,1
> >
> >        Topic: nelo2-normal-logs        Partition: 2    Leader: 1
> Replicas: 1,3   Isr: 1,3
> >
> >
> >
> >  but when I use some sdk to get the meta info from broker, the isr is
> different.
> >
> > metadata:  { size: 246,
> >
> >  correlationId: 0,
> >
> >  brokerNum: -1,
> >
> >  nodeId: 1,
> >
> >  host: 'xseed171.kdev.nhnsystem.com',
> >
> >  port: 9093,
> >
> >  topicNum: 0,
> >
> >  topicError: 0,
> >
> >  topic: 'nelo2-normal-logs',
> >
> >  partitionNum: 2,
> >
> >  errorCode: 0,
> >
> >  partition: 0,
> >
> >  leader: 3,
> >
> >  replicasNum: 2,
> >
> >  replicas: [ 3, 0 ],
> >
> >  isrNum: 2,
> >
> >  isr: [ 0, 3 ] }
> >
> > metadata:  { size: 246,
> >
> >  correlationId: 0,
> >
> >  brokerNum: -1,
> >
> >  nodeId: 1,
> >
> >  host: 'xseed171.kdev.nhnsystem.com',
> >
> >  port: 9093,
> >
> >  topicNum: 0,
> >
> >  topicError: 0,
> >
> >  topic: 'nelo2-normal-logs',
> >
> >  partitionNum: 1,
> >
> >  errorCode: 0,
> >
> >  partition: 1,
> >
> >  leader: 0,
> >
> >  replicasNum: 2,
> >
> >  replicas: [ 0, 1 ],
> >
> >  isrNum: 2,
> >
> >  isr: [ 0, 1 ] }
> >
> > metadata:  { size: 246,
> >
> >  correlationId: 0,
> >
> >  brokerNum: -1,
> >
> >  nodeId: 1,
> >
> >  host: 'xseed171.kdev.nhnsystem.com',
> >
> >  port: 9093,
> >
> >  topicNum: 0,
> >
> >  topicError: 0,
> >
> >  topic: 'nelo2-normal-logs',
> >
> >  partitionNum: 0,
> >
> >  errorCode: 0,
> >
> >  partition: 2,
> >
> >  leader: 1,
> >
> >  replicasNum: 2,
> >
> >  replicas: [ 1, 3 ],
> >
> >  isrNum: 1,
> >
> >  isr: [ 1 ] }
> >
> >
> >
> > I also use other sdk, get the same result. I check the logs from kafka,
> it seems the sdk result is right. the tool get the wrong result. why is it
> happend?
> >
> >
> >
> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
> (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:7,ControllerEpoch:9),ReplicationFactor:2),AllReplicas:0,1)
> for partition [nelo2-normal-logs,1] in response to UpdateMetadata request
> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
> >
> >
> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
> (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:9,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:1,3)
> for partition [nelo2-normal-logs,2] in response to UpdateMetadata request
> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
> >
> >
> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
> (LeaderAndIsrInfo:(Leader:3,ISR:0,3,LeaderEpoch:10,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:3,0)
> for partition [nelo2-normal-logs,0] in response to UpdateMetadata request
> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
> >
> >
> >
> >
> > Thanks~!
> >
> >
> >
> > Best Regards
> >
> > Jerry
> >
> >
> >
> >
>
>

Re: question about isr

Posted by Ian Friedman <ia...@flurry.com>.
This seems similar to the behavior we're seeing. At some point one of our brokers (id 1) just gives up and starts throwing those errors, and kafka-topics no longer lists it in the ISR. However, the logs for that broker say something very odd:

[2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic-High,8] on broker 1: Cached zkVersion [10] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic,3] on broker 1: Shrinking ISR for partition [callbackServiceTopic,3] from 1,2,3 to 1 (kafka.cluster.Partition)
[2014-05-09 10:16:00,251] ERROR Conditional update of path /brokers/topics/callbackServiceTopic/partitions/3/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic/partitions/3/state (kafka.utils.ZkUtils$)
[2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic,3] on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic-High,31] on broker 1: Shrinking ISR for partition [callbackServiceTopic-High,31] from 1,2,3 to 1 (kafka.cluster.Partition)
[2014-05-09 10:16:00,255] ERROR Conditional update of path /brokers/topics/callbackServiceTopic-High/partitions/31/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic-High/partitions/31/state (kafka.utils.ZkUtils$)
[2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-High,31] on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-Low,3] on broker 1: Shrinking ISR for partition [callbackServiceTopic-Low,3] from 1,2,3 to 1 (kafka.cluster.Partition)
[2014-05-09 10:16:00,258] ERROR Conditional update of path /brokers/topics/callbackServiceTopic-Low/partitions/3/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic-Low/partitions/3/state (kafka.utils.ZkUtils$)

etc. And these errors continue every few seconds.

kafka-topics.sh --describe output:
Topic:callbackServiceTopic-High	PartitionCount:50	ReplicationFactor:3	Configs:
	Topic: callbackServiceTopic-High	Partition: 0	Leader: 2	Replicas: 3,1,2	Isr: 2,3
	Topic: callbackServiceTopic-High	Partition: 1	Leader: 2	Replicas: 1,2,3	Isr: 2,3
	Topic: callbackServiceTopic-High	Partition: 2	Leader: 2	Replicas: 2,3,1	Isr: 2,3

I went and looked at one of those ZNodes in the zkCLI and found this:

[zk: localhost:2181(CONNECTED) 2] get /brokers/topics/callbackServiceTopic-High/partitions/31/state
{"controller_epoch":5,"leader":2,"version":1,"leader_epoch":5,"isr":[2,3]}

What does the version number there represent and how does it get out of sync? Should I restart broker 1? Is the fact that broker 1 is behind in leader_epoch significant?

Still trying to figure out Kafka operations :(
—Ian

 
On Apr 24, 2014, at 9:26 PM, 陈小军 <ch...@nhn.com> wrote:

> I don't do any partition reassignment. 
> 
> When broker occure following error, this phenomenon  will happen.
> 
> [hadoop@nelo76 libs]$ [2014-03-14 12:11:44,310] INFO Partition [nelo2-normal-logs,0] on broker 0: Shrinking ISR for partition [nelo2-normal-logs,0] from 0,1 to 0 (kafka.cluster.Partition)
> [2014-03-14 12:11:44,313] ERROR Conditional update of path /brokers/topics/nelo2-normal-logs/partitions/0/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]} and expected version 7 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-normal-logs/partitions/0/state (kafka.utils.ZkUtils$)
> [2014-03-14 12:11:44,313] INFO Partition [nelo2-normal-logs,0] on broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2014-03-14 12:11:44,313] INFO Partition [nelo2-symbolicated-logs,1] on broker 0: Shrinking ISR for partition [nelo2-symbolicated-logs,1] from 0,2 to 0 (kafka.cluster.Partition)
> [2014-03-14 12:11:44,315] ERROR Conditional update of path /brokers/topics/nelo2-symbolicated-logs/partitions/1/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":6,"isr":[0]} and expected version 8 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-symbolicated-logs/partitions/1/state (kafka.utils.ZkUtils$)
> [2014-03-14 12:11:44,315] INFO Partition [nelo2-symbolicated-logs,1] on broker 0: Cached zkVersion [8] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2014-03-14 12:11:44,316] INFO Partition [nelo2-crash-logs,1] on broker 0: Shrinking ISR for partition [nelo2-crash-logs,1] from 0,1 to 0 (kafka.cluster.Partition)
> [2014-03-14 12:11:44,318] ERROR Conditional update of path /brokers/topics/nelo2-crash-logs/partitions/1/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]} and expected version 7 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-crash-logs/partitions/1/state (kafka.utils.ZkUtils$)
> [2014-03-14 12:11:44,318] INFO Partition [nelo2-crash-logs,1] on broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partit
> Best Regards
> Jerry
> -----Original Message-----
> From: "Jun Rao"&lt;junrao@gmail.com&gt; 
> To: "users@kafka.apache.org"&lt;users@kafka.apache.org&gt;; "陈小军"&lt;chenxiaojun@nhn.com&gt;; 
> Cc: 
> Sent: 2014-04-25 (星期五) 02:12:02
> Subject: Re: question about isr
> 
> Interesting. Which version of Kafka are you using? Were you doing some partition reassignment? Thanks, Jun
> 
> 
> On Wed, Apr 23, 2014 at 11:14 PM, 陈小军 <chenxiaojun@nhn.com> wrote:
> 
> Hi Team,
> 
>   I found a strange phenomenon of isr list in my kafka cluster
> 
> 
> 
>   When I use the tool that kafka provide to get the topic information, and it show isr list as following, seem it is ok
> 
> 
> 
>     [irteam@xseed171.kdev bin]$ ./kafka-topics.sh --describe --zookeeper 10.96.250.215:10013,10.96.250.216:10013,10.96.250.217:10013/nelo2-kafka
> 
> 
> 
> 
> Topic:nelo2-normal-logs PartitionCount:3        ReplicationFactor:2     Configs:
> 
>        Topic: nelo2-normal-logs        Partition: 0    Leader: 3       Replicas: 3,0   Isr: 0,3
> 
>        Topic: nelo2-normal-logs        Partition: 1    Leader: 0       Replicas: 0,1   Isr: 0,1
> 
>        Topic: nelo2-normal-logs        Partition: 2    Leader: 1       Replicas: 1,3   Isr: 1,3
> 
> 
> 
>  but when I use some sdk to get the meta info from broker, the isr is different.
> 
> metadata:  { size: 246,
> 
>  correlationId: 0,
> 
>  brokerNum: -1,
> 
>  nodeId: 1,
> 
>  host: 'xseed171.kdev.nhnsystem.com',
> 
>  port: 9093,
> 
>  topicNum: 0,
> 
>  topicError: 0,
> 
>  topic: 'nelo2-normal-logs',
> 
>  partitionNum: 2,
> 
>  errorCode: 0,
> 
>  partition: 0,
> 
>  leader: 3,
> 
>  replicasNum: 2,
> 
>  replicas: [ 3, 0 ],
> 
>  isrNum: 2,
> 
>  isr: [ 0, 3 ] }
> 
> metadata:  { size: 246,
> 
>  correlationId: 0,
> 
>  brokerNum: -1,
> 
>  nodeId: 1,
> 
>  host: 'xseed171.kdev.nhnsystem.com',
> 
>  port: 9093,
> 
>  topicNum: 0,
> 
>  topicError: 0,
> 
>  topic: 'nelo2-normal-logs',
> 
>  partitionNum: 1,
> 
>  errorCode: 0,
> 
>  partition: 1,
> 
>  leader: 0,
> 
>  replicasNum: 2,
> 
>  replicas: [ 0, 1 ],
> 
>  isrNum: 2,
> 
>  isr: [ 0, 1 ] }
> 
> metadata:  { size: 246,
> 
>  correlationId: 0,
> 
>  brokerNum: -1,
> 
>  nodeId: 1,
> 
>  host: 'xseed171.kdev.nhnsystem.com',
> 
>  port: 9093,
> 
>  topicNum: 0,
> 
>  topicError: 0,
> 
>  topic: 'nelo2-normal-logs',
> 
>  partitionNum: 0,
> 
>  errorCode: 0,
> 
>  partition: 2,
> 
>  leader: 1,
> 
>  replicasNum: 2,
> 
>  replicas: [ 1, 3 ],
> 
>  isrNum: 1,
> 
>  isr: [ 1 ] }
> 
> 
> 
> I also use other sdk, get the same result. I check the logs from kafka, it seems the sdk result is right. the tool get the wrong result. why is it happend?
> 
> 
> 
> [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:7,ControllerEpoch:9),ReplicationFactor:2),AllReplicas:0,1) for partition [nelo2-normal-logs,1] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
> 
> 
> [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:9,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:1,3) for partition [nelo2-normal-logs,2] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
> 
> 
> [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:3,ISR:0,3,LeaderEpoch:10,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:3,0) for partition [nelo2-normal-logs,0] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
> 
> 
> 
> 
> Thanks~!
> 
> 
> 
> Best Regards
> 
> Jerry
> 
> 
> 
> 


Re: question about isr

Posted by 陈小军 <ch...@nhn.com>.
I didn't do any partition reassignment.

This phenomenon happens whenever the broker runs into the following error:
 
[hadoop@nelo76 libs]$ [2014-03-14 12:11:44,310] INFO Partition [nelo2-normal-logs,0] on broker 0: Shrinking ISR for partition [nelo2-normal-logs,0] from 0,1 to 0 (kafka.cluster.Partition)
[2014-03-14 12:11:44,313] ERROR Conditional update of path /brokers/topics/nelo2-normal-logs/partitions/0/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]} and expected version 7 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-normal-logs/partitions/0/state (kafka.utils.ZkUtils$)
[2014-03-14 12:11:44,313] INFO Partition [nelo2-normal-logs,0] on broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2014-03-14 12:11:44,313] INFO Partition [nelo2-symbolicated-logs,1] on broker 0: Shrinking ISR for partition [nelo2-symbolicated-logs,1] from 0,2 to 0 (kafka.cluster.Partition)
[2014-03-14 12:11:44,315] ERROR Conditional update of path /brokers/topics/nelo2-symbolicated-logs/partitions/1/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":6,"isr":[0]} and expected version 8 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-symbolicated-logs/partitions/1/state (kafka.utils.ZkUtils$)
[2014-03-14 12:11:44,315] INFO Partition [nelo2-symbolicated-logs,1] on broker 0: Cached zkVersion [8] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2014-03-14 12:11:44,316] INFO Partition [nelo2-crash-logs,1] on broker 0: Shrinking ISR for partition [nelo2-crash-logs,1] from 0,1 to 0 (kafka.cluster.Partition)
[2014-03-14 12:11:44,318] ERROR Conditional update of path /brokers/topics/nelo2-crash-logs/partitions/1/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]} and expected version 7 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-crash-logs/partitions/1/state (kafka.utils.ZkUtils$)
[2014-03-14 12:11:44,318] INFO Partition [nelo2-crash-logs,1] on broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partit
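
For context, the "Shrinking ISR" part is the leader deciding its followers
have fallen too far behind; in 0.8 that is governed by the broker settings
below (the values shown are, I believe, the defaults):

# server.properties, replica lag thresholds
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
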
Best Regards
Jerry
-----Original Message-----
From: "Jun Rao"&lt;junrao@gmail.com&gt; 
To: "users@kafka.apache.org"&lt;users@kafka.apache.org&gt;; "陈小军"&lt;chenxiaojun@nhn.com&gt;; 
Cc: 
Sent: 2014-04-25 (星期五) 02:12:02
Subject: Re: question about isr
 
Interesting. Which version of Kafka are you using? Were you doing some partition reassignment? Thanks, Jun


On Wed, Apr 23, 2014 at 11:14 PM, 陈小军 <chenxiaojun@nhn.com> wrote:

Hi Team,

   I found a strange phenomenon of isr list in my kafka cluster



   When I use the tool that kafka provide to get the topic information, and it show isr list as following, seem it is ok



     [irteam@xseed171.kdev bin]$ ./kafka-topics.sh --describe --zookeeper 10.96.250.215:10013,10.96.250.216:10013,10.96.250.217:10013/nelo2-kafka




Topic:nelo2-normal-logs PartitionCount:3        ReplicationFactor:2     Configs:

        Topic: nelo2-normal-logs        Partition: 0    Leader: 3       Replicas: 3,0   Isr: 0,3

        Topic: nelo2-normal-logs        Partition: 1    Leader: 0       Replicas: 0,1   Isr: 0,1

        Topic: nelo2-normal-logs        Partition: 2    Leader: 1       Replicas: 1,3   Isr: 1,3



  but when I use some sdk to get the meta info from broker, the isr is different.

metadata:  { size: 246,

  correlationId: 0,

  brokerNum: -1,

  nodeId: 1,

  host: 'xseed171.kdev.nhnsystem.com',

  port: 9093,

  topicNum: 0,

  topicError: 0,

  topic: 'nelo2-normal-logs',

  partitionNum: 2,

  errorCode: 0,

  partition: 0,

  leader: 3,

  replicasNum: 2,

  replicas: [ 3, 0 ],

  isrNum: 2,

  isr: [ 0, 3 ] }

metadata:  { size: 246,

  correlationId: 0,

  brokerNum: -1,

  nodeId: 1,

  host: 'xseed171.kdev.nhnsystem.com',

  port: 9093,

  topicNum: 0,

  topicError: 0,

  topic: 'nelo2-normal-logs',

  partitionNum: 1,

  errorCode: 0,

  partition: 1,

  leader: 0,

  replicasNum: 2,

  replicas: [ 0, 1 ],

  isrNum: 2,

  isr: [ 0, 1 ] }

metadata:  { size: 246,

  correlationId: 0,

  brokerNum: -1,

  nodeId: 1,

  host: 'xseed171.kdev.nhnsystem.com',

  port: 9093,

  topicNum: 0,

  topicError: 0,

  topic: 'nelo2-normal-logs',

  partitionNum: 0,

  errorCode: 0,

  partition: 2,

  leader: 1,

  replicasNum: 2,

  replicas: [ 1, 3 ],

  isrNum: 1,

  isr: [ 1 ] }



 I also use other sdk, get the same result. I check the logs from kafka, it seems the sdk result is right. the tool get the wrong result. why is it happend?



[2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:7,ControllerEpoch:9),ReplicationFactor:2),AllReplicas:0,1) for partition [nelo2-normal-logs,1] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)


[2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:9,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:1,3) for partition [nelo2-normal-logs,2] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)


[2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:3,ISR:0,3,LeaderEpoch:10,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:3,0) for partition [nelo2-normal-logs,0] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)




Thanks~!



Best Regards

Jerry

 



Re: question about isr

Posted by Jun Rao <ju...@gmail.com>.
Interesting. Which version of Kafka are you using? Were you doing some
partition reassignment?

Thanks,

Jun


On Wed, Apr 23, 2014 at 11:14 PM, 陈小军 <ch...@nhn.com> wrote:

> Hi Team,
>    I found a strange phenomenon of isr list in my kafka cluster
>
>    When I use the tool that kafka provide to get the topic information,
> and it show isr list as following, seem it is ok
>
>      [irteam@xseed171.kdev bin]$ ./kafka-topics.sh --describe --zookeeper
> 10.96.250.215:10013,10.96.250.216:10013,10.96.250.217:10013/nelo2-kafka
>
> Topic:nelo2-normal-logs PartitionCount:3        ReplicationFactor:2
> Configs:
>         Topic: nelo2-normal-logs        Partition: 0    Leader: 3
> Replicas: 3,0   Isr: 0,3
>         Topic: nelo2-normal-logs        Partition: 1    Leader: 0
> Replicas: 0,1   Isr: 0,1
>         Topic: nelo2-normal-logs        Partition: 2    Leader: 1
> Replicas: 1,3   Isr: 1,3
>
>   but when I use some sdk to get the meta info from broker, the isr is
> different.
> metadata:  { size: 246,
>   correlationId: 0,
>   brokerNum: -1,
>   nodeId: 1,
>   host: 'xseed171.kdev.nhnsystem.com',
>   port: 9093,
>   topicNum: 0,
>   topicError: 0,
>   topic: 'nelo2-normal-logs',
>   partitionNum: 2,
>   errorCode: 0,
>   partition: 0,
>   leader: 3,
>   replicasNum: 2,
>   replicas: [ 3, 0 ],
>   isrNum: 2,
>   isr: [ 0, 3 ] }
> metadata:  { size: 246,
>   correlationId: 0,
>   brokerNum: -1,
>   nodeId: 1,
>   host: 'xseed171.kdev.nhnsystem.com',
>   port: 9093,
>   topicNum: 0,
>   topicError: 0,
>   topic: 'nelo2-normal-logs',
>   partitionNum: 1,
>   errorCode: 0,
>   partition: 1,
>   leader: 0,
>   replicasNum: 2,
>   replicas: [ 0, 1 ],
>   isrNum: 2,
>   isr: [ 0, 1 ] }
> metadata:  { size: 246,
>   correlationId: 0,
>   brokerNum: -1,
>   nodeId: 1,
>   host: 'xseed171.kdev.nhnsystem.com',
>   port: 9093,
>   topicNum: 0,
>   topicError: 0,
>   topic: 'nelo2-normal-logs',
>   partitionNum: 0,
>   errorCode: 0,
>   partition: 2,
>   leader: 1,
>   replicasNum: 2,
>   replicas: [ 1, 3 ],
>   isrNum: 1,
>   isr: [ 1 ] }
>
>  I also use other sdk, get the same result. I check the logs from kafka,
> it seems the sdk result is right. the tool get the wrong result. why is it
> happend?
>
> [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
> (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:7,ControllerEpoch:9),ReplicationFactor:2),AllReplicas:0,1)
> for partition [nelo2-normal-logs,1] in response to UpdateMetadata request
> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
> [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
> (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:9,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:1,3)
> for partition [nelo2-normal-logs,2] in response to UpdateMetadata request
> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
> [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
> (LeaderAndIsrInfo:(Leader:3,ISR:0,3,LeaderEpoch:10,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:3,0)
> for partition [nelo2-normal-logs,0] in response to UpdateMetadata request
> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>
> Thanks~!
>
> Best Regards
> Jerry
>