Posted to users@kafka.apache.org by tao xiao <xi...@gmail.com> on 2015/02/28 08:15:36 UTC

How replicas catch up the leader

Hi team,

I had a replica node that was shut down improperly because it ran out of disk
space. I managed to clean up the disk and restarted the replica, but the
replica has never caught up with the leader since then, as shown below:

Topic:test PartitionCount:1 ReplicationFactor:3 Configs:

Topic: test Partition: 0 Leader: 5 Replicas: 1,5,6 Isr: 5,6

Broker 1 is the replica that failed. Is there a way I can force the
replica to catch up with the leader?
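
For reference, the describe output above comes from the topics tool that ships
with Kafka, roughly like this (the ZooKeeper address is an assumption):

bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic test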

-- 
Regards,
Tao

Re: How replicas catch up the leader

Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.
Can you check whether the replica fetcher thread is still running on broker 1?
Also, you can check the request log on broker 5 to see if there are
fetch requests coming from broker 1.
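
A rough sketch of those two checks; the broker PID placeholder and the request
log location are assumptions (the request log typically goes to
kafka-request.log once kafka.request.logger is set to TRACE in log4j.properties):

# on broker 1: is a ReplicaFetcherThread still alive?
jstack <broker-1-pid> | grep ReplicaFetcherThread

# on broker 5: are fetch requests arriving from broker 1 (ReplicaId: 1)?
grep "ReplicaId: 1" /path/to/kafka/logs/kafka-request.log | tail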

On 2/28/15, 12:39 AM, "tao xiao" <xi...@gmail.com> wrote:

>Thanks Harsha. In my case the replica doesn't catch up at all. the last
>log
>date is 5 days ago. It seems the failed replica is excluded from
>replication list. I am looking for a command that can add the replica back
>to the ISR list or force it to start sync-up again
>
>On Sat, Feb 28, 2015 at 4:27 PM, Harsha <ka...@harsha.io> wrote:
>
>> you can increase num.replica.fetchers by default its 1 and also try
>> increasing replica.fetch.max.bytes
>> -Harsha
>>
>> On Fri, Feb 27, 2015, at 11:15 PM, tao xiao wrote:
>> > Hi team,
>> >
>> > I had a replica node that was shutdown improperly due to no disk space
>> > left. I managed to clean up the disk and restarted the replica but the
>> > replica since then never caught up the leader shown below
>> >
>> > Topic:test PartitionCount:1 ReplicationFactor:3 Configs:
>> >
>> > Topic: test Partition: 0 Leader: 5 Replicas: 1,5,6 Isr: 5,6
>> >
>> > broker 1 is the replica that failed before. Is there a way that I can
>> > force
>> > the replica to catch up the leader?
>> >
>> > --
>> > Regards,
>> > Tao
>>
>
>
>
>-- 
>Regards,
>Tao


Re: How replicas catch up the leader

Posted by tao xiao <xi...@gmail.com>.
Thanks Harsha. In my case the replica doesn't catch up at all; the last log
date is 5 days ago. It seems the failed replica has been excluded from the
replication list. I am looking for a command that can add the replica back
to the ISR list or force it to start syncing again.

On Sat, Feb 28, 2015 at 4:27 PM, Harsha <ka...@harsha.io> wrote:

> you can increase num.replica.fetchers by default its 1 and also try
> increasing replica.fetch.max.bytes
> -Harsha
>
> On Fri, Feb 27, 2015, at 11:15 PM, tao xiao wrote:
> > Hi team,
> >
> > I had a replica node that was shutdown improperly due to no disk space
> > left. I managed to clean up the disk and restarted the replica but the
> > replica since then never caught up the leader shown below
> >
> > Topic:test PartitionCount:1 ReplicationFactor:3 Configs:
> >
> > Topic: test Partition: 0 Leader: 5 Replicas: 1,5,6 Isr: 5,6
> >
> > broker 1 is the replica that failed before. Is there a way that I can
> > force
> > the replica to catch up the leader?
> >
> > --
> > Regards,
> > Tao
>



-- 
Regards,
Tao

Re: How replicas catch up the leader

Posted by Harsha <ka...@harsha.io>.
You can increase num.replica.fetchers (by default it is 1) and also try
increasing replica.fetch.max.bytes.
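
A sketch of what that could look like in server.properties; the values below
are just examples, and these are static broker configs, so a restart is needed
for them to take effect:

# default is 1
num.replica.fetchers=4
# default is 1048576 (1 MB); must be large enough for your biggest messages
replica.fetch.max.bytes=5242880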
-Harsha

On Fri, Feb 27, 2015, at 11:15 PM, tao xiao wrote:
> Hi team,
> 
> I had a replica node that was shutdown improperly due to no disk space
> left. I managed to clean up the disk and restarted the replica but the
> replica since then never caught up the leader shown below
> 
> Topic:test PartitionCount:1 ReplicationFactor:3 Configs:
> 
> Topic: test Partition: 0 Leader: 5 Replicas: 1,5,6 Isr: 5,6
> 
> broker 1 is the replica that failed before. Is there a way that I can
> force
> the replica to catch up the leader?
> 
> -- 
> Regards,
> Tao

Re: How replicas catch up the leader

Posted by "sy.pan" <sh...@gmail.com>.
Hi, @Jiangjie Qin

This is the related info from controller.log:

[2015-03-11 10:54:11,962] ERROR [Controller 0]: Error completing reassignment of partition [ad_click_sts,3] (kafka.controller.KafkaController)
kafka.common.KafkaException: Partition [ad_click_sts,3] to be reassigned is already assigned to replicas 0,1. Ignoring request for partition reassignment
        at kafka.controller.KafkaController.initiateReassignReplicasForTopicPartition(KafkaController.scala:585)

It seems like Kafka ignored the kafka-reassign-partitions.sh command.

The JSON file used in the command is:

{"version":1,"partitions":[{"topic":"ad_click_sts","partition":3,"replicas":[0,1]}]}


In practice, the partition has lost its in-sync replica:

Topic: ad_click_sts	Partition: 3	Leader: 0	Replicas: 0,1	Isr: 0

Regards
sy.pan

> On Mar 11, 2015, at 12:32, Jiangjie Qin <jq...@linkedin.com.INVALID> wrote:
> 
> It looks that in your case it is because broker 1 somehow missed a
> controller LeaderAndIsrRequest for [ad_click_sts,4]. So the zkVersion
> would be different from the value stored in zookeeper from that on.
> Therefore broker 1 failed to update ISR. In this case you have to bounce
> broker to fix it.
> From what you posted, it looks both broker 0 and broker 1 are having this
> issue. So the question is how could both broker missed a controller
> LeaderAndIsrRequest. Is there anything interesting in controller.log?
> 
> Jiangjie (Becket) Qin
> 
> On 3/10/15, 8:33 PM, "sy.pan" <shengyi.pan@gmail.com> wrote:
> 
>> @tao xiao and  Jiangjie Qin, Thank you very much
>> 
>> I try to run kafka-reassign-partitions.sh, but the issue still exists…
>> 
>> this the log info:
>> 
>> [2015-03-11 11:00:40,086] ERROR Conditional update of path
>> /brokers/topics/ad_click_sts/partitions/4/state with data
>> {"controller_epoch":23,"leader":1,"version":1,"leader_epoch":35,"isr":[1,0
>> ]} and expected version 564 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode
>> = BadVersion for /brokers/topics/ad_click_sts/partitions/4/state
>> (kafka.utils.ZkUtils$)
>> 
>> [2015-03-11 11:00:40,086] INFO Partition [ad_click_sts,4] on broker 1:
>> Cached zkVersion [564] not equal to that in zookeeper, skip updating ISR
>> (kafka.cluster.Partition)
>> 
>> finally, I had to restart the kafka node and the Isr problem is fixed, is
>> there any better ways?
>> 
>> Regards
>> sy.pan
>> 
>> 
>>> On Mar 11, 2015, at 03:34, Jiangjie Qin <jq...@linkedin.com.INVALID> wrote:
>>> 
>>> This looks like a leader broker somehow did not respond to a fetch
>>> request
>>> from the follower. It may be because the broker was too busy. If that is
>>> the case, Xiao's approach could help - reassign partitions or reelect
>>> leaders to balance the traffic among brokers.
>>> 
>>> Jiangjie (Becket) Qin
>>> 
>>> On 3/9/15, 8:31 PM, "sy.pan" <shengyi.pan@gmail.com> wrote:
>>> 
>>>> Hi, tao xiao and Jiangjie Qin
>>>> 
>>>> I encounter with the same issue, my node had recovered from high load
>>>> problem (caused by other application)
>>>> 
>>>> this is the kafka-topic show:
>>>> 
>>>> Topic:ad_click_sts	PartitionCount:6	ReplicationFactor:2	Configs:
>>>> 	Topic: ad_click_sts	Partition: 0	Leader: 1	Replicas: 1,0	Isr: 1
>>>> 	Topic: ad_click_sts	Partition: 1	Leader: 0	Replicas: 0,1	Isr: 0
>>>> 	Topic: ad_click_sts	Partition: 2	Leader: 1	Replicas: 1,0	Isr: 1
>>>> 	Topic: ad_click_sts	Partition: 3	Leader: 0	Replicas: 0,1	Isr: 0
>>>> 	Topic: ad_click_sts	Partition: 4	Leader: 1	Replicas: 1,0	Isr: 1
>>>> 	Topic: ad_click_sts	Partition: 5	Leader: 0	Replicas: 0,1	Isr: 0
>>>> 
>>>> ReplicaFetcherThread info extracted from kafka server.log :
>>>> 
>>>> [2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in
>>>> fetch Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId:
>>>> ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1
>>>> bytes; RequestInfo: [ad_click_sts,5] ->
>>>> PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] ->
>>>> PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] ->
>>>> PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
>>>> java.net.SocketTimeoutException
>>>>      at 
>>>> sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>>>>      at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>>>>      ……..
>>>>      at 
>>>> 
>>>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsum
>>>> er
>>>> .scala:108)
>>>>      at 
>>>> 
>>>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scal
>>>> a:
>>>> 108)
>>>>      at 
>>>> 
>>>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scal
>>>> a:
>>>> 108)
>>>>      at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>>>>      at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
>>>>      at 
>>>> 
>>>> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherTh
>>>> re
>>>> ad.scala:96)
>>>>      at 
>>>> 
>>>> kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88
>>>> )
>>>>      at 
>>>> kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
>>>> 
>>>> [2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null
>>>> (kafka.consumer.SimpleConsumer)
>>>> 
>>>> [2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1:
>>>> Cached zkVersion [556] not equal to that in zookeeper, skip updating
>>>> ISR
>>>> (kafka.cluster.Partition)
>>>> 
>>>> [2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1:
>>>> Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1
>>>> (kafka.cluster.Partition)
>>>> 
>>>> 
>>>> How to fix this Isr problem ? Is there some command can be run ?
>>>> 
>>>> Regards
>>>> sy.pan


Re: How replicas catch up the leader

Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.
It looks like in your case broker 1 somehow missed a controller
LeaderAndIsrRequest for [ad_click_sts,4], so its cached zkVersion differs
from the value stored in ZooKeeper from that point on. Therefore broker 1
failed to update the ISR. In this case you have to bounce the broker to fix
it. From what you posted, it looks like both broker 0 and broker 1 are having
this issue. So the question is how both brokers could have missed a controller
LeaderAndIsrRequest. Is there anything interesting in controller.log?
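
If you want to confirm the mismatch before bouncing, something like the
ZooKeeper shell that ships with Kafka can dump the partition state znode (host
and port are assumptions); the dataVersion in the stat output is the zkVersion
the broker caches:

bin/zookeeper-shell.sh localhost:2181 get /brokers/topics/ad_click_sts/partitions/4/state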

Jiangjie (Becket) Qin

On 3/10/15, 8:33 PM, "sy.pan" <sh...@gmail.com> wrote:

>@tao xiao and  Jiangjie Qin, Thank you very much
>
>I try to run kafka-reassign-partitions.sh, but the issue still exists…
>
>this the log info:
>
>[2015-03-11 11:00:40,086] ERROR Conditional update of path
>/brokers/topics/ad_click_sts/partitions/4/state with data
>{"controller_epoch":23,"leader":1,"version":1,"leader_epoch":35,"isr":[1,0
>]} and expected version 564 failed due to
>org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode
>= BadVersion for /brokers/topics/ad_click_sts/partitions/4/state
>(kafka.utils.ZkUtils$)
>
>[2015-03-11 11:00:40,086] INFO Partition [ad_click_sts,4] on broker 1:
>Cached zkVersion [564] not equal to that in zookeeper, skip updating ISR
>(kafka.cluster.Partition)
>
>finally, I had to restart the kafka node and the Isr problem is fixed, is
>there any better ways?
>
>Regards
>sy.pan
>
>
>> On Mar 11, 2015, at 03:34, Jiangjie Qin <jq...@linkedin.com.INVALID> wrote:
>> 
>> This looks like a leader broker somehow did not respond to a fetch
>>request
>> from the follower. It may be because the broker was too busy. If that is
>> the case, Xiao's approach could help - reassign partitions or reelect
>> leaders to balance the traffic among brokers.
>> 
>> Jiangjie (Becket) Qin
>> 
>> On 3/9/15, 8:31 PM, "sy.pan" <shengyi.pan@gmail.com> wrote:
>> 
>>> Hi, tao xiao and Jiangjie Qin
>>> 
>>> I encounter with the same issue, my node had recovered from high load
>>> problem (caused by other application)
>>> 
>>> this is the kafka-topic show:
>>> 
>>> Topic:ad_click_sts	PartitionCount:6	ReplicationFactor:2	Configs:
>>> 	Topic: ad_click_sts	Partition: 0	Leader: 1	Replicas: 1,0	Isr: 1
>>> 	Topic: ad_click_sts	Partition: 1	Leader: 0	Replicas: 0,1	Isr: 0
>>> 	Topic: ad_click_sts	Partition: 2	Leader: 1	Replicas: 1,0	Isr: 1
>>> 	Topic: ad_click_sts	Partition: 3	Leader: 0	Replicas: 0,1	Isr: 0
>>> 	Topic: ad_click_sts	Partition: 4	Leader: 1	Replicas: 1,0	Isr: 1
>>> 	Topic: ad_click_sts	Partition: 5	Leader: 0	Replicas: 0,1	Isr: 0
>>> 
>>> ReplicaFetcherThread info extracted from kafka server.log :
>>> 
>>> [2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in
>>> fetch Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId:
>>> ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1
>>> bytes; RequestInfo: [ad_click_sts,5] ->
>>> PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] ->
>>> PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] ->
>>> PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
>>> java.net.SocketTimeoutException
>>>       at 
>>> sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>>>       at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>>>       ……..
>>>       at 
>>> 
>>>kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsum
>>>er
>>> .scala:108)
>>>       at 
>>> 
>>>kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scal
>>>a:
>>> 108)
>>>       at 
>>> 
>>>kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scal
>>>a:
>>> 108)
>>>       at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>>>       at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
>>>       at 
>>> 
>>>kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherTh
>>>re
>>> ad.scala:96)
>>>       at 
>>> 
>>>kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88
>>>)
>>>       at 
>>>kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
>>> 
>>> [2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null
>>> (kafka.consumer.SimpleConsumer)
>>> 
>>> [2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1:
>>> Cached zkVersion [556] not equal to that in zookeeper, skip updating
>>>ISR
>>> (kafka.cluster.Partition)
>>> 
>>> [2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1:
>>> Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1
>>> (kafka.cluster.Partition)
>>> 
>>> 
>>> How to fix this Isr problem ? Is there some command can be run ?
>>> 
>>> Regards
>>> sy.pan
>


Re: How replicas catch up the leader

Posted by "sy.pan" <sh...@gmail.com>.
@tao xiao and  Jiangjie Qin, Thank you very much

I tried to run kafka-reassign-partitions.sh, but the issue still exists…

This is the log info:

[2015-03-11 11:00:40,086] ERROR Conditional update of path /brokers/topics/ad_click_sts/partitions/4/state with data {"controller_epoch":23,"leader":1,"version":1,"leader_epoch":35,"isr":[1,0]} and expected version 564 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/ad_click_sts/partitions/4/state (kafka.utils.ZkUtils$)

[2015-03-11 11:00:40,086] INFO Partition [ad_click_sts,4] on broker 1: Cached zkVersion [564] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

Finally, I had to restart the Kafka node and the ISR problem was fixed. Is there a better way?

Regards
sy.pan


> On Mar 11, 2015, at 03:34, Jiangjie Qin <jq...@linkedin.com.INVALID> wrote:
> 
> This looks like a leader broker somehow did not respond to a fetch request
> from the follower. It may be because the broker was too busy. If that is
> the case, Xiao's approach could help - reassign partitions or reelect
> leaders to balance the traffic among brokers.
> 
> Jiangjie (Becket) Qin
> 
> On 3/9/15, 8:31 PM, "sy.pan" <shengyi.pan@gmail.com> wrote:
> 
>> Hi, tao xiao and Jiangjie Qin
>> 
>> I encounter with the same issue, my node had recovered from high load
>> problem (caused by other application)
>> 
>> this is the kafka-topic show:
>> 
>> Topic:ad_click_sts	PartitionCount:6	ReplicationFactor:2	Configs:
>> 	Topic: ad_click_sts	Partition: 0	Leader: 1	Replicas: 1,0	Isr: 1
>> 	Topic: ad_click_sts	Partition: 1	Leader: 0	Replicas: 0,1	Isr: 0
>> 	Topic: ad_click_sts	Partition: 2	Leader: 1	Replicas: 1,0	Isr: 1
>> 	Topic: ad_click_sts	Partition: 3	Leader: 0	Replicas: 0,1	Isr: 0
>> 	Topic: ad_click_sts	Partition: 4	Leader: 1	Replicas: 1,0	Isr: 1
>> 	Topic: ad_click_sts	Partition: 5	Leader: 0	Replicas: 0,1	Isr: 0
>> 
>> ReplicaFetcherThread info extracted from kafka server.log :
>> 
>> [2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in
>> fetch Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId:
>> ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1
>> bytes; RequestInfo: [ad_click_sts,5] ->
>> PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] ->
>> PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] ->
>> PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
>> java.net.SocketTimeoutException
>>       at 
>> sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>>       at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>>       ……..
>>       at 
>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer
>> .scala:108)
>>       at 
>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:
>> 108)
>>       at 
>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:
>> 108)
>>       at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>>       at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
>>       at 
>> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThre
>> ad.scala:96)
>>       at 
>> kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
>>       at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
>> 
>> [2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null
>> (kafka.consumer.SimpleConsumer)
>> 
>> [2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1:
>> Cached zkVersion [556] not equal to that in zookeeper, skip updating ISR
>> (kafka.cluster.Partition)
>> 
>> [2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1:
>> Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1
>> (kafka.cluster.Partition)
>> 
>> 
>> How to fix this Isr problem ? Is there some command can be run ?
>> 
>> Regards
>> sy.pan


Re: How replicas catch up the leader

Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.
This looks like the leader broker somehow did not respond to a fetch request
from the follower. It may be because the broker was too busy. If that is
the case, Xiao's approach could help: reassign partitions or re-elect
leaders to balance the traffic among brokers.
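
For the leader re-election part, a sketch using the preferred-replica election
tool shipped with Kafka (the ZooKeeper address is an assumption; with no JSON
file it runs over all partitions):

bin/kafka-preferred-replica-election.sh --zookeeper localhost:2181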

Jiangjie (Becket) Qin

On 3/9/15, 8:31 PM, "sy.pan" <sh...@gmail.com> wrote:

>Hi, tao xiao and Jiangjie Qin
>
>I encounter with the same issue, my node had recovered from high load
>problem (caused by other application)
>
>this is the kafka-topic show:
>
>Topic:ad_click_sts	PartitionCount:6	ReplicationFactor:2	Configs:
>	Topic: ad_click_sts	Partition: 0	Leader: 1	Replicas: 1,0	Isr: 1
>	Topic: ad_click_sts	Partition: 1	Leader: 0	Replicas: 0,1	Isr: 0
>	Topic: ad_click_sts	Partition: 2	Leader: 1	Replicas: 1,0	Isr: 1
>	Topic: ad_click_sts	Partition: 3	Leader: 0	Replicas: 0,1	Isr: 0
>	Topic: ad_click_sts	Partition: 4	Leader: 1	Replicas: 1,0	Isr: 1
>	Topic: ad_click_sts	Partition: 5	Leader: 0	Replicas: 0,1	Isr: 0
>
>ReplicaFetcherThread info extracted from kafka server.log :
>
>[2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in
>fetch Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId:
>ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1
>bytes; RequestInfo: [ad_click_sts,5] ->
>PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] ->
>PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] ->
>PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
>java.net.SocketTimeoutException
>        at 
>sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>        ……..
>        at 
>kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer
>.scala:108)
>        at 
>kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:
>108)
>        at 
>kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:
>108)
>        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>        at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
>        at 
>kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThre
>ad.scala:96)
>        at 
>kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
>        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
>
>[2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null
>(kafka.consumer.SimpleConsumer)
>
>[2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1:
>Cached zkVersion [556] not equal to that in zookeeper, skip updating ISR
>(kafka.cluster.Partition)
>
>[2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1:
>Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1
>(kafka.cluster.Partition)
>
>
>How to fix this Isr problem ? Is there some command can be run ?
>
>Regards
>sy.pan


Re: How replicas catch up the leader

Posted by tao xiao <xi...@gmail.com>.
I ended up running kafka-reassign-partitions.sh to reassign partitions to
different nodes
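
Roughly, the invocation was something like the following (the file name and
ZooKeeper address are placeholders):

bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassign.json --execute

# then poll with --verify until it reports that the reassignment completed
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassign.json --verify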

On Tue, Mar 10, 2015 at 11:31 AM, sy.pan <sh...@gmail.com> wrote:

> Hi, tao xiao and Jiangjie Qin
>
> I encounter with the same issue, my node had recovered from high load
> problem (caused by other application)
>
> this is the kafka-topic show:
>
> Topic:ad_click_sts      PartitionCount:6        ReplicationFactor:2
>  Configs:
>         Topic: ad_click_sts     Partition: 0    Leader: 1       Replicas:
> 1,0   Isr: 1
>         Topic: ad_click_sts     Partition: 1    Leader: 0       Replicas:
> 0,1   Isr: 0
>         Topic: ad_click_sts     Partition: 2    Leader: 1       Replicas:
> 1,0   Isr: 1
>         Topic: ad_click_sts     Partition: 3    Leader: 0       Replicas:
> 0,1   Isr: 0
>         Topic: ad_click_sts     Partition: 4    Leader: 1       Replicas:
> 1,0   Isr: 1
>         Topic: ad_click_sts     Partition: 5    Leader: 0       Replicas:
> 0,1   Isr: 0
>
> ReplicaFetcherThread info extracted from kafka server.log :
>
> [2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in fetch
> Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId:
> ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes;
> RequestInfo: [ad_click_sts,5] ->
> PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] ->
> PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] ->
> PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
> java.net.SocketTimeoutException
>         at
> sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>         ……..
>         at
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
>         at
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
>         at
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
>         at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>         at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
>         at
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
>         at
> kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
>
> [2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null
> (kafka.consumer.SimpleConsumer)
>
> [2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1:
> Cached zkVersion [556] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
>
> [2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1:
> Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1
> (kafka.cluster.Partition)
>
>
> How to fix this Isr problem ? Is there some command can be run ?
>
> Regards
> sy.pan




-- 
Regards,
Tao

Re: How replicas catch up the leader

Posted by "sy.pan" <sh...@gmail.com>.
Hi, tao xiao and Jiangjie Qin

I encountered the same issue; my node had recovered from a high-load problem (caused by another application).

This is the kafka-topics output:

Topic:ad_click_sts	PartitionCount:6	ReplicationFactor:2	Configs:
	Topic: ad_click_sts	Partition: 0	Leader: 1	Replicas: 1,0	Isr: 1
	Topic: ad_click_sts	Partition: 1	Leader: 0	Replicas: 0,1	Isr: 0
	Topic: ad_click_sts	Partition: 2	Leader: 1	Replicas: 1,0	Isr: 1
	Topic: ad_click_sts	Partition: 3	Leader: 0	Replicas: 0,1	Isr: 0
	Topic: ad_click_sts	Partition: 4	Leader: 1	Replicas: 1,0	Isr: 1
	Topic: ad_click_sts	Partition: 5	Leader: 0	Replicas: 0,1	Isr: 0

ReplicaFetcherThread info extracted from the Kafka server.log:

[2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId: ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [ad_click_sts,5] -> PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] -> PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] -> PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
java.net.SocketTimeoutException
        at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
        ……..
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
        at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)

[2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null (kafka.consumer.SimpleConsumer)

[2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1: Cached zkVersion [556] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

[2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1: Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1 (kafka.cluster.Partition)


How can I fix this ISR problem? Is there a command that can be run?

Regards
sy.pan