Posted to users@kafka.apache.org by Abhinav Anand <ab...@gmail.com> on 2014/06/18 10:40:32 UTC

Network failure leads to split brain in brokers

Hi Guys,
 We have a 6-node cluster. Due to a network failure yesterday (the network was
back up within a few minutes), a few of the brokers were unable to talk to
each other and to ZooKeeper. While investigating an issue in our camus job, we
identified that broker-2 was misbehaving. The controller logs reflect that
broker-2 is not alive, yet broker-2 was serving all metadata/consumer requests
and its logs don't show any nasty errors.

As the camus logs below show, any metadata request sent to broker-2 would return
broker-2 as the leader, even though list-topic showed (1,3) as the ISR and 3 as
the leader. This state is pretty disturbing, as it is difficult to detect and
the cluster is never supposed to end up in it.
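
For reference, here is a minimal sketch of the kind of cross-broker check that
exposes the disagreement, written against the 0.8 Java SimpleConsumer metadata
API (the broker addresses, topic and partition below are placeholders, not our
real values):

import java.util.Collections;
import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.TopicMetadataResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class LeaderCheck {
    public static void main(String[] args) {
        String topic = "cse-sauron";  // example topic
        int partition = 0;            // example partition
        // placeholder broker addresses; substitute the real broker list
        String[] brokers = {"broker-1:9092", "broker-2:9092", "broker-3:9092"};
        for (String b : brokers) {
            String host = b.split(":")[0];
            int port = Integer.parseInt(b.split(":")[1]);
            SimpleConsumer consumer =
                new SimpleConsumer(host, port, 100000, 64 * 1024, "leader-check");
            try {
                TopicMetadataRequest req =
                    new TopicMetadataRequest(Collections.singletonList(topic));
                TopicMetadataResponse resp = consumer.send(req);
                for (TopicMetadata tm : resp.topicsMetadata()) {
                    for (PartitionMetadata pm : tm.partitionsMetadata()) {
                        if (pm.partitionId() == partition) {
                            // every broker in a healthy cluster should report the
                            // same leader; in our case broker-2 reported itself
                            System.out.println(b + " says leader="
                                + (pm.leader() != null ? pm.leader().id() : -1)
                                + " isr=" + pm.isr());
                        }
                    }
                }
            } finally {
                consumer.close();
            }
        }
    }
}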

#### *Log analysis* ####
For brevity, I have attached my full analysis in the Kafka_split_brain.log file
and summarized the key points in this mail.

Network failure occurred somewhere around 2014-06-16 20:30 hrs
*###Controller Logs (Time ordered)*

1. Controller fails to send request to broker 2

[2014-06-16 20:23:34,197] WARN [Controller-3-to-broker-2-send-thread],
Controller 3 fails to send a request to broker 2
(kafka.controller.RequestSendThread)
2. NoReplicaOnlineException shows that broker 2 is not in the controller's set of live brokers

[2014-06-16 21:40:14,351] ERROR Controller 3 epoch 63 initiated state
change for partition [same-topic-name,40] from OfflinePartition to
OnlinePartition failed (state.change.logger)
kafka.common.NoReplicaOnlineException: No replica for partition
[same-topic-name,40] is alive. Live brokers are: [Set(5, 1, 6, 3, 4)],
Assigned replicas are: [List(2)]
3. After restarting the broker today, the controller log shows it electing 2 as the leader even though no broker in the ISR is alive.

[2014-06-17 21:26:14,109] WARN [OfflinePartitionLeaderSelector]: No broker
in ISR is alive for [bfd-logstream_9092-PROD,40]. Elect leader 2 from live
brokers 2. There's potential data loss.
(kafka.controller.OfflinePartitionLeaderSelector)
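
My reading of controller messages (2) and (3) above, as a simplified
illustration only (this is not the actual controller code): with no live broker
in the ISR, the controller falls back to any live assigned replica, accepting
potential data loss, and fails the partition only if no assigned replica is
live at all.

import java.util.List;
import java.util.Set;

class OfflineLeaderElectionSketch {
    // Not the controller source; just the behavior the two log messages suggest.
    static int electLeader(List<Integer> assignedReplicas,
                           List<Integer> isr,
                           Set<Integer> liveBrokers) {
        for (int broker : isr) {
            if (liveBrokers.contains(broker)) {
                return broker;                  // clean election from the ISR
            }
        }
        for (int broker : assignedReplicas) {
            if (liveBrokers.contains(broker)) {
                // "No broker in ISR is alive ... Elect leader 2 from live
                //  brokers 2. There's potential data loss."
                return broker;
            }
        }
        // "NoReplicaOnlineException: No replica for partition ... is alive"
        throw new IllegalStateException("No replica for the partition is alive");
    }
}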

*###Broker-2 Logs*

1. When the controller sends the LeaderAndIsr request, broker 2 warns that it is invalid (see the epoch-check sketch after this list)

    - [2014-06-16 20:23:31,482] WARN Broker 2 received invalid LeaderAndIsr
      request with correlation id 0 from controller 3 epoch 63 with an older
      leader epoch 41 for partition [cse-sauron,1], current leader epoch is
      41 (state.change.logger)

2. On restart today

    - [2014-06-17 21:26:14,135] ERROR Broker 2 aborted the become-follower
      state change with correlation id 32 from controller 3 epoch 63 for
      partition [bfd-logstream_9092-PROD,82] new leader -1
(state.change.logger)
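
From warning (1) above, broker 2 appears to apply a LeaderAndIsr request only
when it carries a strictly newer leader epoch than the one it already has, so a
request at epoch 41 against a current epoch of 41 gets rejected as "older" and
the broker keeps its stale leadership view. A rough sketch of that check (my
reading of the log line, not the broker source):

class LeaderEpochCheckSketch {
    static boolean shouldApply(int requestLeaderEpoch, int currentLeaderEpoch) {
        if (requestLeaderEpoch <= currentLeaderEpoch) {
            // corresponds to: "received invalid LeaderAndIsr request ... with an
            // older leader epoch ..., current leader epoch is ..."
            return false;  // broker ignores the request and keeps its current view
        }
        return true;       // strictly newer epoch: apply the become-leader/follower change
    }
}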



*###Where it all Started (Camus logs)*

We have a camus job that runs every half-hour, and during some of yesterday's
runs duplicate data crept in. On investigation, we found that whenever the
metadata request was sent to broker 2, broker 2 would declare itself the leader;
if the request was sent to any other broker, the leader was always broker 3. In
the list-topic output, brokers (1,3) were the ISR and 3 was the leader. Yet, as
the camus log from yesterday shows, one of the runs found broker-2 as the leader.


   - Job Run at 2014-06-17 00:06:34
      - 2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob - Fetching metadata from broker broker-1 with client id cse-sauron for 0 topic(s) []
      - 2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob - cse-sauron uri:tcp://broker-3-url leader:3 partition:0 offset:10693174 latest_offset:10693174
   - Job Run at 2014-06-17 00:24:58
      - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob - Fetching metadata from broker 2 with client id cse-sauron for 0 topic(s) []
      - 2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob - The current offset was found to be more than the latest offset
      - 2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob - Moving to the earliest offset available
      - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob - cse-sauron uri:tcp://broker-2-url leader:2 partition:0 offset:9747106 latest_offset:10674802
   - Job Run at 2014-06-17 01:01:54
      - 2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob - Fetching metadata from broker dare-broker02:9092 with client id cse-sauron for 0 topic(s) []
      - 2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob - cse-sauron uri:tcp://dare-broker02.sv.walmartlabs.com:9092 leader:3 partition:0 offset:10253311 latest_offset:10697656
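
Putting the three runs together, the sequence seems to be: the 00:06 run
committed offset 10693174 against the real leader (3); the 00:24 run happened
to fetch its metadata from broker-2, which still called itself leader but whose
latest offset (10674802) was behind the already-committed offset, so camus
concluded the current offset was "more than the latest offset", fell back to
the earliest available offset (9747106) and re-read the data, which is where
the duplicates came from. Below is a small sketch of a per-broker latest-offset
check (0.8 SimpleConsumer OffsetRequest; host, port, topic and partition are
placeholders) that would have made the stale leader stand out:

import java.util.Collections;
import java.util.Map;
import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetRequest;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class LatestOffsetCheck {
    // Asks one broker for the latest offset of a partition; run against every
    // broker, a stale self-proclaimed leader shows a "latest" that lags the rest.
    static long latestOffset(String host, int port, String topic, int partition) {
        SimpleConsumer consumer =
            new SimpleConsumer(host, port, 100000, 64 * 1024, "offset-check");
        try {
            TopicAndPartition tp = new TopicAndPartition(topic, partition);
            Map<TopicAndPartition, PartitionOffsetRequestInfo> info =
                Collections.singletonMap(tp,
                    new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.LatestTime(), 1));
            OffsetRequest req = new OffsetRequest(
                info, kafka.api.OffsetRequest.CurrentVersion(), "offset-check");
            OffsetResponse resp = consumer.getOffsetsBefore(req);
            return resp.offsets(topic, partition)[0];
        } finally {
            consumer.close();
        }
    }
}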

 Let me know if you need any more details

-- 

Abhinav Anand

Re: Network failure leads to split brain in brokers

Posted by Guozhang Wang <wa...@gmail.com>.
Could you try with the latest release 0.8.1.1? It has a bunch of critical
bug fixes which may solve your problem.


On Thu, Jun 19, 2014 at 11:01 PM, Abhinav Anand <ab...@gmail.com> wrote:

> Hi Guozhang,
> I am using 0.8.0 release
>
> Regards,
> Abhinav
>
>
> On Thu, Jun 19, 2014 at 2:06 AM, Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hello Abhinav,
> >
> > Which Kafka version are you using?
> >
> > Guozhang
> >
> >
> > --
> > -- Guozhang
> >
>
>
>
> --
> Abhinav Anand
>



-- 
-- Guozhang

Re: Network failure leads to split brain in brokers

Posted by Abhinav Anand <ab...@gmail.com>.
Hi Guozhang,
I am using 0.8.0 release

Regards,
Abhinav


On Thu, Jun 19, 2014 at 2:06 AM, Guozhang Wang <wa...@gmail.com> wrote:

> Hello Abhinav,
>
> Which Kafka version are you using?
>
> Guozhang
>
>
> --
> -- Guozhang
>



-- 
Abhinav Anand

Re: Network failure leads to split brain in brokers

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Abhinav,

Which Kafka version are you using?

Guozhang





-- 
-- Guozhang