Posted to users@kafka.apache.org by Fredo Lee <bu...@gmail.com> on 2016/06/01 02:27:12 UTC

Re: soft failure for kafka 0.8.2.2

we use 0.8.2.2.

2016-05-31 20:14 GMT+08:00 Tom Crayford <tc...@heroku.com>:

> Is this under 0.8? There are a few known bugs in 0.8 that can lead to this
> situation. I'd recommend upgrading to 0.9 as soon as is viable to prevent
> this and many other kinds of issues that were fixed in 0.9.
>
> Thanks
>
> Tom Crayford
> Heroku Kafka
>
> On Tue, May 31, 2016 at 6:19 AM, Fredo Lee <bu...@gmail.com>
> wrote:
>
> > thanks for your reply.
> >
> > yes, there is more than one controller; the "soft failure" message is
> > reported by the old controller.
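> >
> > In case it helps, this is roughly how we look at which broker ZooKeeper
> > currently records as controller, and at the controller epoch (zk1:2181 is
> > just a placeholder for one of our ZooKeeper hosts):
> >
> >   # broker id of the last controller election winner
> >   bin/zookeeper-shell.sh zk1:2181 get /controller
> >   # controller epoch; the logs show the old controller at 13 and the new one at 14
> >   bin/zookeeper-shell.sh zk1:2181 get /controller_epoch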
> >
> > 2016-05-31 11:42 GMT+08:00 Muqtafi Akhmad <mu...@traveloka.com>:
> >
> > > hello Fredo,
> > >
> > > My guess is that there was a partition leader election, possibly triggered
> > > by detection of an offline partition in the Kafka cluster. Somehow the
> > > broker tried to trigger a leader election after the previous election had
> > > already completed, hence this log:
> > >
> > > > [2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 initiated state change for partition [consup-0000000043,78] from OnlinePartition to OnlinePartition failed (state.change.logger)
> > > > kafka.common.StateChangeFailedException: encountered error while electing leader for partition [consup-0000000043,78] due to: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14..
> > >
> > > The questions are:
> > > - was there any offline partition? (a rough check is sketched below)
> > > - was there more than one active controller?
> > >
> > > CMIIW
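> > >
> > > A rough check for the first question, assuming the 0.8.x TopicCommand
> > > options (zk1:2181 stands in for your ZooKeeper connect string):
> > >
> > >   # partitions that currently have no live leader
> > >   bin/kafka-topics.sh --describe --zookeeper zk1:2181 --unavailable-partitions
> > >   # partitions whose ISR is smaller than the full replica set
> > >   bin/kafka-topics.sh --describe --zookeeper zk1:2181 --under-replicated-partitions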
> > >
> > >
> > > On Mon, May 30, 2016 at 2:41 PM, Fredo Lee <bu...@gmail.com>
> > > wrote:
> > >
> > > > my server.log >>>>>>>>>>>>>>>>
> > > >
> > > > lots of error messages like this:
> > > >
> > > > [2016-05-28 12:12:31,132] WARN [ReplicaFetcherThread-0-1007], Replica 1008 for partition [consup-0000000003,16] reset its fetch offset from 13985537 to current leader 1007's latest offset 13985537 (kafka.server.ReplicaFetcherThread)
> > > > [2016-05-28 12:12:31,132] ERROR [ReplicaFetcherThread-0-1007], Current offset 13987676 for partition [consup-0000000003,16] out of range; reset offset to 13985537 (kafka.server.ReplicaFetcherThread)
> > > >
> > > >
> > > > the other error message:
> > > >
> > > > [2016-05-28 12:12:31,708] ERROR [Replica Manager on Broker 1008]: Error when processing fetch request for partition [consup-0000000003,35] offset 13848954 from consumer with correlation id 0. Possible cause: Request for offset 13848954 but we only have log segments in the range 12356946 to 13847167. (kafka.server.ReplicaManager)
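> > > >
> > > > For what it's worth, a rough way to compare the offset the follower asked
> > > > for with what the leader actually has on disk is GetOffsetShell
> > > > ("leader-broker" is a placeholder; --time -2 returns the earliest offset,
> > > > -1 the latest):
> > > >
> > > >   bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
> > > >     --broker-list leader-broker:9092 --topic consup-0000000003 \
> > > >     --partitions 16 --time -2
> > > >   bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
> > > >     --broker-list leader-broker:9092 --topic consup-0000000003 \
> > > >     --partitions 16 --time -1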
> > > >
> > > >
> > > >
> > > > 2016-05-30 15:37 GMT+08:00 Fredo Lee <bu...@gmail.com>:
> > > >
> > > > > My state-change.log>>>>>>>>>>>
> > > > >
> > > > >
> > > > > [2016-05-28 12:33:31,510] ERROR Controller 1008 epoch 13 aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14. (state.change.logger)
> > > > > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1020,-1) for a request sent to broker id:1009,host:consup-kafka22.com,port:9092 (state.change.logger)
> > > > > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1066,-1) for a request sent to broker id:1001,host:consup-kafka20.com,port:9092 (state.change.logger)
> > > > > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(777,-1) for a request sent to broker id:1006,host:consup-kafka11.com,port:9092 (state.change.logger)
> > > > > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(742,-1) for a request sent to broker id:1018,host:consup-kafka10.com,port:9092 (state.change.logger)
> > > > > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1021,-1) for a request sent to broker id:1009,host:consup-kafka22.com,port:9092 (state.change.logger)
> > > > > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1067,-1) for a request sent to broker id:1001,host:consup-kafka20.com,port:9092 (state.change.logger)
> > > > > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(778,-1) for a request sent to broker id:1006,host:consup-kafka11.com,port:9092 (state.change.logger)
> > > > > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1022,-1) for a request sent to broker id:1009,host:consup-kafka22.com,port:9092 (state.change.logger)
> > > > > [2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 encountered error while electing leader for partition [consup-0000000043,78] due to: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14.. (state.change.logger)
> > > > > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(743,-1) for a request sent to broker id:1018,host:consup-kafka10.com,port:9092 (state.change.logger)
> > > > > [2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 initiated state change for partition [consup-0000000043,78] from OnlinePartition to OnlinePartition failed (state.change.logger)
> > > > > kafka.common.StateChangeFailedException: encountered error while electing leader for partition [consup-0000000043,78] due to: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14..
> > > > >         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:380)
> > > > >         at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:208)
> > > > >         at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:146)
> > > > >         at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:145)
> > > > >         at scala.collection.immutable.Set$Set1.foreach(Set.scala:79)
> > > > >         at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:145)
> > > > >         at kafka.controller.KafkaController.onPreferredReplicaElection(KafkaController.scala:631)
> > > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply$mcV$sp(KafkaController.scala:1158)
> > > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply(KafkaController.scala:1153)
> > > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply(KafkaController.scala:1153)
> > > > >         at kafka.utils.Utils$.inLock(Utils.scala:535)
> > > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17.apply(KafkaController.scala:1150)
> > > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17.apply(KafkaController.scala:1148)
> > > > >         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> > > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4.apply(KafkaController.scala:1148)
> > > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4.apply(KafkaController.scala:1127)
> > > > >         at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
> > > > >         at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
> > > > >         at kafka.controller.KafkaController.kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance(KafkaController.scala:1127)
> > > > >         at kafka.controller.KafkaController$$anonfun$onControllerFailover$1.apply$mcV$sp(KafkaController.scala:326)
> > > > >         at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:99)
> > > > >         at kafka.utils.Utils$$anon$1.run(Utils.scala:54)
> > > > >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> > > > >         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
> > > > >         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
> > > > >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
> > > > >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
> > > > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> > > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> > > > >         at java.lang.Thread.run(Thread.java:679)
> > > > > Caused by: kafka.common.StateChangeFailedException: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14.
> > > > >         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:354)
> > > > >         ... 33 more
> > > > > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(744,-1) for a request sent to broker id:1018,host:consup-kafka10.com,port:9092 (state.change.logger)
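> > > > >
> > > > > If I read the stack trace right, the failed election comes from the
> > > > > automatic preferred replica rebalance (checkAndTriggerPartitionRebalance,
> > > > > scheduled from onControllerFailover) still running on controller 1008 at
> > > > > epoch 13 after the epoch 14 controller had already taken over. As a
> > > > > sketch of a possible mitigation (not verified on 0.8.2.2), the automatic
> > > > > rebalance can be turned off and run by hand when the cluster is healthy:
> > > > >
> > > > >   # server.properties: stop the controller from scheduling the rebalance itself
> > > > >   auto.leader.rebalance.enable=false
> > > > >
> > > > >   # then trigger preferred leader election manually (zk1:2181 is a placeholder)
> > > > >   bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181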
> > > > >
> > > > >
> > > > > 2016-05-28 15:31 GMT+08:00 Muqtafi Akhmad <mu...@traveloka.com>:
> > > > >
> > > > >> hello Fredo,
> > > > >>
> > > > >> Can you elaborate the 'soft' failure?
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Sat, May 28, 2016 at 1:53 PM, Fredo Lee <buaatianwanli@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > we have a kafka cluster with 19 nodes. every week we suffer a soft
> > > > >> > failure in this cluster. how can we resolve this problem?
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Muqtafi Akhmad
> > > > >> Software Engineer
> > > > >> Traveloka
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Muqtafi Akhmad
> > > Software Engineer
> > > Traveloka
> > >
> >
>

Re: soft failure for kafka 0.8.2.2

Posted by Tom Crayford <tc...@heroku.com>.
Ok. I'd recommend upgrading to 0.9 asap to fix the known bugs in 0.8 here.
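
For the upgrade itself, the documented path is a rolling restart with the broker
protocol pinned, then a second rolling restart to switch protocols (treat the
exact values below as a sketch and double-check the 0.9.0.0 upgrade notes):

  # server.properties while rolling the brokers onto 0.9 binaries
  inter.broker.protocol.version=0.8.2.X
  # once every broker runs 0.9, bump the protocol and do one more rolling restart
  inter.broker.protocol.version=0.9.0.0

Independent of the version, that "went through a soft failure" message usually
means the controller's ZooKeeper session expired (long GC pauses are the classic
cause), so GC logs and the session timeout are worth a look as well:

  # server.properties, default 6000 ms; consider raising it if GC pauses get close
  zookeeper.session.timeout.ms=6000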

Thanks

Tom Crayford
Heroku Kafka

On Wed, Jun 1, 2016 at 3:27 AM, Fredo Lee <bu...@gmail.com> wrote:

> we use 0.8.2.2.
>