Posted to users@kafka.apache.org by tao xiao <xi...@gmail.com> on 2015/02/28 08:15:36 UTC
How replicas catch up the leader
Hi team,
I had a replica node that was shut down improperly because it ran out of
disk space. I cleaned up the disk and restarted the replica, but it has
never caught up with the leader since, as shown below:
Topic:test PartitionCount:1 ReplicationFactor:3 Configs:
Topic: test Partition: 0 Leader: 5 Replicas: 1,5,6 Isr: 5,6
Broker 1 is the replica that failed. Is there a way I can force the replica
to catch up with the leader?
--
Regards,
Tao
Re: How replicas catch up the leader
Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.
Can you check whether the replica fetcher thread is still running on broker
1? Also, you may check the request log on broker 5 to see if there are
fetch requests coming from broker 1.
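For reference, both checks can be sketched from the shell. The PID discovery via jps and the request-log location are assumptions; adjust them for your install:

```shell
# Sketch: is the replica fetcher thread alive on broker 1?
# (jps-based PID discovery is an assumption about your environment.)
BROKER_PID=$(jps -l 2>/dev/null | awk '/kafka\.Kafka/ {print $1; exit}')
FETCHERS=0
if [ -n "$BROKER_PID" ]; then
  # A healthy follower shows one or more ReplicaFetcherThread-N-M entries.
  FETCHERS=$(jstack "$BROKER_PID" | grep -c 'ReplicaFetcherThread' || true)
  echo "replica fetcher threads: $FETCHERS"
else
  echo "kafka.Kafka process not found"
fi
# On broker 5, look for fetch requests coming from broker 1 (ReplicaId: 1);
# the request-log file name and location vary by install:
# grep 'ReplicaId: 1' /var/log/kafka/kafka-request.log | tail
```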
On 2/28/15, 12:39 AM, "tao xiao" <xi...@gmail.com> wrote:
>Thanks Harsha. In my case the replica doesn't catch up at all. the last
>log
>date is 5 days ago. It seems the failed replica is excluded from
>replication list. I am looking for a command that can add the replica back
>to the ISR list or force it to start sync-up again
>
>On Sat, Feb 28, 2015 at 4:27 PM, Harsha <ka...@harsha.io> wrote:
>
>> you can increase num.replica.fetchers by default its 1 and also try
>> increasing replica.fetch.max.bytes
>> -Harsha
>>
>> On Fri, Feb 27, 2015, at 11:15 PM, tao xiao wrote:
>> > Hi team,
>> >
>> > I had a replica node that was shutdown improperly due to no disk space
>> > left. I managed to clean up the disk and restarted the replica but the
>> > replica since then never caught up the leader shown below
>> >
>> > Topic:test PartitionCount:1 ReplicationFactor:3 Configs:
>> >
>> > Topic: test Partition: 0 Leader: 5 Replicas: 1,5,6 Isr: 5,6
>> >
>> > broker 1 is the replica that failed before. Is there a way that I can
>> > force
>> > the replica to catch up the leader?
>> >
>> > --
>> > Regards,
>> > Tao
>>
>
>
>
>--
>Regards,
>Tao
Re: How replicas catch up the leader
Posted by tao xiao <xi...@gmail.com>.
Thanks Harsha. In my case the replica doesn't catch up at all; the last log
date is 5 days ago. It seems the failed replica has been excluded from the
replication list. I am looking for a command that can add the replica back
to the ISR or force it to start syncing again.
On Sat, Feb 28, 2015 at 4:27 PM, Harsha <ka...@harsha.io> wrote:
> you can increase num.replica.fetchers by default its 1 and also try
> increasing replica.fetch.max.bytes
> -Harsha
>
> On Fri, Feb 27, 2015, at 11:15 PM, tao xiao wrote:
> > Hi team,
> >
> > I had a replica node that was shutdown improperly due to no disk space
> > left. I managed to clean up the disk and restarted the replica but the
> > replica since then never caught up the leader shown below
> >
> > Topic:test PartitionCount:1 ReplicationFactor:3 Configs:
> >
> > Topic: test Partition: 0 Leader: 5 Replicas: 1,5,6 Isr: 5,6
> >
> > broker 1 is the replica that failed before. Is there a way that I can
> > force
> > the replica to catch up the leader?
> >
> > --
> > Regards,
> > Tao
>
--
Regards,
Tao
Re: How replicas catch up the leader
Posted by Harsha <ka...@harsha.io>.
You can increase num.replica.fetchers (by default it is 1) and also try
increasing replica.fetch.max.bytes.
-Harsha
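For reference, both settings are broker-side options in server.properties; the values below are illustrative, not recommendations:

```properties
# server.properties (broker side) -- illustrative values only
# Number of fetcher threads a follower uses to replicate from leaders
# (default is 1).
num.replica.fetchers=4
# Maximum bytes fetched per partition per replication request
# (default is 1048576, i.e. 1 MB).
replica.fetch.max.bytes=4194304
```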
On Fri, Feb 27, 2015, at 11:15 PM, tao xiao wrote:
> Hi team,
>
> I had a replica node that was shutdown improperly due to no disk space
> left. I managed to clean up the disk and restarted the replica but the
> replica since then never caught up the leader shown below
>
> Topic:test PartitionCount:1 ReplicationFactor:3 Configs:
>
> Topic: test Partition: 0 Leader: 5 Replicas: 1,5,6 Isr: 5,6
>
> broker 1 is the replica that failed before. Is there a way that I can
> force
> the replica to catch up the leader?
>
> --
> Regards,
> Tao
Re: How replicas catch up the leader
Posted by "sy.pan" <sh...@gmail.com>.
Hi, @Jiangjie Qin
This is the related info from controller.log:
[2015-03-11 10:54:11,962] ERROR [Controller 0]: Error completing reassignment of partition [ad_click_sts,3] (kafka.controller.KafkaController)
kafka.common.KafkaException: Partition [ad_click_sts,3] to be reassigned is already assigned to replicas 0,1. Ignoring request for partition reassignment
at kafka.controller.KafkaController.initiateReassignReplicasForTopicPartition(KafkaController.scala:585)
It seems Kafka ignored the kafka-reassign-partitions.sh command.
The JSON file used in the command is:
{"version":1,"partitions":[{"topic":"ad_click_sts","partition":3,"replicas":[0,1]}]}
In practice the partition has lost an in-sync replica:
Topic: ad_click_sts Partition: 3 Leader: 0 Replicas: 0,1 Isr: 0
Regards
sy.pan
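The controller rejects a reassignment whose replica list matches the current assignment, which is why the request above was ignored. A sketch of a request it may act on, e.g. swapping the replica order so the assignment differs from [0,1] (the script path and ZooKeeper address are assumptions):

```shell
# Hypothetical reassignment: same brokers, different order ([1,0] instead
# of [0,1]), so the controller sees a new assignment.
cat > reassign.json <<'EOF'
{"version":1,"partitions":[{"topic":"ad_click_sts","partition":3,"replicas":[1,0]}]}
EOF
# Paths and the ZooKeeper address below are assumptions for this sketch:
# bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
#     --reassignment-json-file reassign.json --execute
# bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
#     --reassignment-json-file reassign.json --verify
cat reassign.json
```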
> On Mar 11, 2015, at 12:32, Jiangjie Qin <jq...@linkedin.com.INVALID> wrote:
>
> It looks that in your case it is because broker 1 somehow missed a
> controller LeaderAndIsrRequest for [ad_click_sts,4]. So the zkVersion
> would be different from the value stored in zookeeper from that on.
> Therefore broker 1 failed to update ISR. In this case you have to bounce
> broker to fix it.
> From what you posted, it looks both broker 0 and broker 1 are having this
> issue. So the question is how could both broker missed a controller
> LeaderAndIsrRequest. Is there anything interesting in controller.log?
>
> Jiangjie (Becket) Qin
>
> On 3/10/15, 8:33 PM, "sy.pan" <shengyi.pan@gmail.com> wrote:
>
>> @tao xiao and Jiangjie Qin, Thank you very much
>>
>> I try to run kafka-reassign-partitions.sh, but the issue still exists…
>>
>> this the log info:
>>
>> [2015-03-11 11:00:40,086] ERROR Conditional update of path
>> /brokers/topics/ad_click_sts/partitions/4/state with data
>> {"controller_epoch":23,"leader":1,"version":1,"leader_epoch":35,"isr":[1,0
>> ]} and expected version 564 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode
>> = BadVersion for /brokers/topics/ad_click_sts/partitions/4/state
>> (kafka.utils.ZkUtils$)
>>
>> [2015-03-11 11:00:40,086] INFO Partition [ad_click_sts,4] on broker 1:
>> Cached zkVersion [564] not equal to that in zookeeper, skip updating ISR
>> (kafka.cluster.Partition)
>>
>> finally, I had to restart the kafka node and the Isr problem is fixed, is
>> there any better ways?
>>
>> Regards
>> sy.pan
>>
>>
>>> On Mar 11, 2015, at 03:34, Jiangjie Qin <jq...@linkedin.com.INVALID> wrote:
>>>
>>> This looks like a leader broker somehow did not respond to a fetch
>>> request
>>> from the follower. It may be because the broker was too busy. If that is
>>> the case, Xiao's approach could help - reassign partitions or reelect
>>> leaders to balance the traffic among brokers.
>>>
>>> Jiangjie (Becket) Qin
>>>
>>> On 3/9/15, 8:31 PM, "sy.pan" <shengyi.pan@gmail.com> wrote:
>>>
>>>> Hi, tao xiao and Jiangjie Qin
>>>>
>>>> I encounter with the same issue, my node had recovered from high load
>>>> problem (caused by other application)
>>>>
>>>> this is the kafka-topic show:
>>>>
>>>> Topic:ad_click_sts PartitionCount:6 ReplicationFactor:2 Configs:
>>>> Topic: ad_click_sts Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1
>>>> Topic: ad_click_sts Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0
>>>> Topic: ad_click_sts Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1
>>>> Topic: ad_click_sts Partition: 3 Leader: 0 Replicas: 0,1 Isr: 0
>>>> Topic: ad_click_sts Partition: 4 Leader: 1 Replicas: 1,0 Isr: 1
>>>> Topic: ad_click_sts Partition: 5 Leader: 0 Replicas: 0,1 Isr: 0
>>>>
>>>> ReplicaFetcherThread info extracted from kafka server.log :
>>>>
>>>> [2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in
>>>> fetch Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId:
>>>> ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1
>>>> bytes; RequestInfo: [ad_click_sts,5] ->
>>>> PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] ->
>>>> PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] ->
>>>> PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
>>>> java.net.SocketTimeoutException
>>>> at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>>>> at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>>>> ……..
>>>> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
>>>> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
>>>> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
>>>> at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>>>> at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
>>>> at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
>>>> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
>>>> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
>>>>
>>>> [2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null
>>>> (kafka.consumer.SimpleConsumer)
>>>>
>>>> [2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1:
>>>> Cached zkVersion [556] not equal to that in zookeeper, skip updating
>>>> ISR
>>>> (kafka.cluster.Partition)
>>>>
>>>> [2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1:
>>>> Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1
>>>> (kafka.cluster.Partition)
>>>>
>>>>
>>>> How to fix this Isr problem ? Is there some command can be run ?
>>>>
>>>> Regards
>>>> sy.pan
Re: How replicas catch up the leader
Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.
It looks like, in your case, broker 1 somehow missed a controller
LeaderAndIsrRequest for [ad_click_sts,4], so from that point on its cached
zkVersion differed from the value stored in ZooKeeper. Therefore broker 1
failed to update the ISR. In this case you have to bounce the broker to fix
it.
From what you posted, it looks like both broker 0 and broker 1 are having
this issue. So the question is how both brokers could have missed a
controller LeaderAndIsrRequest. Is there anything interesting in
controller.log?
Jiangjie (Becket) Qin
On 3/10/15, 8:33 PM, "sy.pan" <sh...@gmail.com> wrote:
>@tao xiao and Jiangjie Qin, Thank you very much
>
>I try to run kafka-reassign-partitions.sh, but the issue still exists…
>
>this the log info:
>
>[2015-03-11 11:00:40,086] ERROR Conditional update of path
>/brokers/topics/ad_click_sts/partitions/4/state with data
>{"controller_epoch":23,"leader":1,"version":1,"leader_epoch":35,"isr":[1,0
>]} and expected version 564 failed due to
>org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode
>= BadVersion for /brokers/topics/ad_click_sts/partitions/4/state
>(kafka.utils.ZkUtils$)
>
>[2015-03-11 11:00:40,086] INFO Partition [ad_click_sts,4] on broker 1:
>Cached zkVersion [564] not equal to that in zookeeper, skip updating ISR
>(kafka.cluster.Partition)
>
>finally, I had to restart the kafka node and the Isr problem is fixed, is
>there any better ways?
>
>Regards
>sy.pan
>
>
>> On Mar 11, 2015, at 03:34, Jiangjie Qin <jq...@linkedin.com.INVALID> wrote:
>>
>> This looks like a leader broker somehow did not respond to a fetch
>>request
>> from the follower. It may be because the broker was too busy. If that is
>> the case, Xiao's approach could help - reassign partitions or reelect
>> leaders to balance the traffic among brokers.
>>
>> Jiangjie (Becket) Qin
>>
>> On 3/9/15, 8:31 PM, "sy.pan" <shengyi.pan@gmail.com> wrote:
>>
>>> Hi, tao xiao and Jiangjie Qin
>>>
>>> I encounter with the same issue, my node had recovered from high load
>>> problem (caused by other application)
>>>
>>> this is the kafka-topic show:
>>>
>>> Topic:ad_click_sts PartitionCount:6 ReplicationFactor:2 Configs:
>>> Topic: ad_click_sts Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1
>>> Topic: ad_click_sts Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0
>>> Topic: ad_click_sts Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1
>>> Topic: ad_click_sts Partition: 3 Leader: 0 Replicas: 0,1 Isr: 0
>>> Topic: ad_click_sts Partition: 4 Leader: 1 Replicas: 1,0 Isr: 1
>>> Topic: ad_click_sts Partition: 5 Leader: 0 Replicas: 0,1 Isr: 0
>>>
>>> ReplicaFetcherThread info extracted from kafka server.log :
>>>
>>> [2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in
>>> fetch Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId:
>>> ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1
>>> bytes; RequestInfo: [ad_click_sts,5] ->
>>> PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] ->
>>> PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] ->
>>> PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
>>> java.net.SocketTimeoutException
>>> at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>>> at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>>> ……..
>>> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
>>> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
>>> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
>>> at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>>> at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
>>> at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
>>> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
>>> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
>>>
>>> [2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null
>>> (kafka.consumer.SimpleConsumer)
>>>
>>> [2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1:
>>> Cached zkVersion [556] not equal to that in zookeeper, skip updating
>>>ISR
>>> (kafka.cluster.Partition)
>>>
>>> [2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1:
>>> Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1
>>> (kafka.cluster.Partition)
>>>
>>>
>>> How to fix this Isr problem ? Is there some command can be run ?
>>>
>>> Regards
>>> sy.pan
>
Re: How replicas catch up the leader
Posted by "sy.pan" <sh...@gmail.com>.
@tao xiao and Jiangjie Qin, Thank you very much
I tried to run kafka-reassign-partitions.sh, but the issue still exists…
This is the log info:
[2015-03-11 11:00:40,086] ERROR Conditional update of path /brokers/topics/ad_click_sts/partitions/4/state with data {"controller_epoch":23,"leader":1,"version":1,"leader_epoch":35,"isr":[1,0]} and expected version 564 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/ad_click_sts/partitions/4/state (kafka.utils.ZkUtils$)
[2015-03-11 11:00:40,086] INFO Partition [ad_click_sts,4] on broker 1: Cached zkVersion [564] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
Finally, I had to restart the Kafka node and the ISR problem was fixed. Is
there a better way?
Regards
sy.pan
> On Mar 11, 2015, at 03:34, Jiangjie Qin <jq...@linkedin.com.INVALID> wrote:
>
> This looks like a leader broker somehow did not respond to a fetch request
> from the follower. It may be because the broker was too busy. If that is
> the case, Xiao's approach could help - reassign partitions or reelect
> leaders to balance the traffic among brokers.
>
> Jiangjie (Becket) Qin
>
> On 3/9/15, 8:31 PM, "sy.pan" <shengyi.pan@gmail.com> wrote:
>
>> Hi, tao xiao and Jiangjie Qin
>>
>> I encounter with the same issue, my node had recovered from high load
>> problem (caused by other application)
>>
>> this is the kafka-topic show:
>>
>> Topic:ad_click_sts PartitionCount:6 ReplicationFactor:2 Configs:
>> Topic: ad_click_sts Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1
>> Topic: ad_click_sts Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0
>> Topic: ad_click_sts Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1
>> Topic: ad_click_sts Partition: 3 Leader: 0 Replicas: 0,1 Isr: 0
>> Topic: ad_click_sts Partition: 4 Leader: 1 Replicas: 1,0 Isr: 1
>> Topic: ad_click_sts Partition: 5 Leader: 0 Replicas: 0,1 Isr: 0
>>
>> ReplicaFetcherThread info extracted from kafka server.log :
>>
>> [2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in
>> fetch Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId:
>> ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1
>> bytes; RequestInfo: [ad_click_sts,5] ->
>> PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] ->
>> PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] ->
>> PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
>> java.net.SocketTimeoutException
>> at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>> at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>> ……..
>> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
>> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
>> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
>> at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>> at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
>> at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
>> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
>> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
>>
>> [2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null
>> (kafka.consumer.SimpleConsumer)
>>
>> [2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1:
>> Cached zkVersion [556] not equal to that in zookeeper, skip updating ISR
>> (kafka.cluster.Partition)
>>
>> [2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1:
>> Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1
>> (kafka.cluster.Partition)
>>
>>
>> How to fix this Isr problem ? Is there some command can be run ?
>>
>> Regards
>> sy.pan
Re: How replicas catch up the leader
Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.
This looks like the leader broker somehow did not respond to a fetch request
from the follower, perhaps because it was too busy. If that is the case,
Xiao's approach could help: reassign partitions or re-elect leaders to
balance the traffic among brokers.
Jiangjie (Becket) Qin
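If the goal is only to move leadership back to the preferred (first-listed) replica once traffic settles, the preferred-replica election tool is a lighter option than a full reassignment. A sketch, where the partition list, script path, and ZooKeeper address are assumptions:

```shell
# Hypothetical partition list for a preferred-replica (leader) election.
cat > preferred.json <<'EOF'
{"partitions":[
  {"topic":"ad_click_sts","partition":0},
  {"topic":"ad_click_sts","partition":1}
]}
EOF
# Script path and ZooKeeper address are assumptions for this sketch:
# bin/kafka-preferred-replica-election.sh --zookeeper localhost:2181 \
#     --path-to-json-file preferred.json
cat preferred.json
```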
On 3/9/15, 8:31 PM, "sy.pan" <sh...@gmail.com> wrote:
>Hi, tao xiao and Jiangjie Qin
>
>I encounter with the same issue, my node had recovered from high load
>problem (caused by other application)
>
>this is the kafka-topic show:
>
>Topic:ad_click_sts PartitionCount:6 ReplicationFactor:2 Configs:
> Topic: ad_click_sts Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1
> Topic: ad_click_sts Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0
> Topic: ad_click_sts Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1
> Topic: ad_click_sts Partition: 3 Leader: 0 Replicas: 0,1 Isr: 0
> Topic: ad_click_sts Partition: 4 Leader: 1 Replicas: 1,0 Isr: 1
> Topic: ad_click_sts Partition: 5 Leader: 0 Replicas: 0,1 Isr: 0
>
>ReplicaFetcherThread info extracted from kafka server.log :
>
>[2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in
>fetch Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId:
>ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1
>bytes; RequestInfo: [ad_click_sts,5] ->
>PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] ->
>PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] ->
>PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
>java.net.SocketTimeoutException
> at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
> at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
> ……..
> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
> at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
> at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
> at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
>
>[2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null
>(kafka.consumer.SimpleConsumer)
>
>[2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1:
>Cached zkVersion [556] not equal to that in zookeeper, skip updating ISR
>(kafka.cluster.Partition)
>
>[2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1:
>Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1
>(kafka.cluster.Partition)
>
>
>How to fix this Isr problem ? Is there some command can be run ?
>
>Regards
>sy.pan
Re: How replicas catch up the leader
Posted by tao xiao <xi...@gmail.com>.
I ended up running kafka-reassign-partitions.sh to reassign the partitions
to different nodes.
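For anyone following along, a reassignment to different nodes can be sketched in two steps: have the tool generate a plan, then execute it. The broker ids, file names, and ZooKeeper address below are assumptions:

```shell
# Step 1: describe which topics to move; the tool proposes an assignment
# across the given brokers.
cat > topics-to-move.json <<'EOF'
{"version":1,"topics":[{"topic":"test"}]}
EOF
# bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
#     --topics-to-move-json-file topics-to-move.json \
#     --broker-list "2,5,6" --generate
# Step 2: save the proposed assignment it prints to a file and execute it:
# bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
#     --reassignment-json-file proposed.json --execute
cat topics-to-move.json
```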
On Tue, Mar 10, 2015 at 11:31 AM, sy.pan <sh...@gmail.com> wrote:
> Hi, tao xiao and Jiangjie Qin
>
> I encounter with the same issue, my node had recovered from high load
> problem (caused by other application)
>
> this is the kafka-topic show:
>
> Topic:ad_click_sts PartitionCount:6 ReplicationFactor:2
> Configs:
> Topic: ad_click_sts Partition: 0 Leader: 1 Replicas:
> 1,0 Isr: 1
> Topic: ad_click_sts Partition: 1 Leader: 0 Replicas:
> 0,1 Isr: 0
> Topic: ad_click_sts Partition: 2 Leader: 1 Replicas:
> 1,0 Isr: 1
> Topic: ad_click_sts Partition: 3 Leader: 0 Replicas:
> 0,1 Isr: 0
> Topic: ad_click_sts Partition: 4 Leader: 1 Replicas:
> 1,0 Isr: 1
> Topic: ad_click_sts Partition: 5 Leader: 0 Replicas:
> 0,1 Isr: 0
>
> ReplicaFetcherThread info extracted from kafka server.log :
>
> [2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in fetch
> Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId:
> ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes;
> RequestInfo: [ad_click_sts,5] ->
> PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] ->
> PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] ->
> PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
> java.net.SocketTimeoutException
> at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
> at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
> ……..
> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
> at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
> at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
> at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
>
> [2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null
> (kafka.consumer.SimpleConsumer)
>
> [2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1:
> Cached zkVersion [556] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
>
> [2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1:
> Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1
> (kafka.cluster.Partition)
>
>
> How to fix this Isr problem ? Is there some command can be run ?
>
> Regards
> sy.pan
--
Regards,
Tao
Re: How replicas catch up the leader
Posted by "sy.pan" <sh...@gmail.com>.
Hi, tao xiao and Jiangjie Qin
I encountered the same issue; my node had recovered from a high-load problem (caused by another application).
This is the kafka-topics output:
Topic:ad_click_sts PartitionCount:6 ReplicationFactor:2 Configs:
Topic: ad_click_sts Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1
Topic: ad_click_sts Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0
Topic: ad_click_sts Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1
Topic: ad_click_sts Partition: 3 Leader: 0 Replicas: 0,1 Isr: 0
Topic: ad_click_sts Partition: 4 Leader: 1 Replicas: 1,0 Isr: 1
Topic: ad_click_sts Partition: 5 Leader: 0 Replicas: 0,1 Isr: 0
ReplicaFetcherThread info extracted from the Kafka server.log:
[2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId: ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [ad_click_sts,5] -> PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] -> PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] -> PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
……..
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
[2015-03-09 21:06:05,450] WARN Reconnect due to socket error: null (kafka.consumer.SimpleConsumer)
[2015-03-09 21:05:57,116] INFO Partition [ad_click_sts,4] on broker 1: Cached zkVersion [556] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2015-03-09 21:06:05,772] INFO Partition [ad_click_sts,2] on broker 1: Shrinking ISR for partition [ad_click_sts,2] from 1,0 to 1 (kafka.cluster.Partition)
How can I fix this ISR problem? Is there a command that can be run?
Regards
sy.pan