You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@kafka.apache.org by Krzysztof Ociepa <oc...@gmail.com> on 2014/04/03 14:11:21 UTC

Failing broker with errors for Conditional update

Hi Team,

I test Kafka 0.8.1 and have problems with missing brokers. What I did:

1. Installed Zookeeper 3.3.6 servers on 3 dedicated machines
2. Installed Kafka 0.8.1 brokers on 3 machines (different from Zookeeper
ones)
3. Configured brokers as above:
# Replication configurations
num.replica.fetchers=2
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
controller.socket.timeout.ms=30000
controller.message.queue.size=10
controlled.shutdown.enable=true
auto.leader.rebalance.enable=true
leader.imbalance.check.interval.seconds=1200

# ZK configuration
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=3000

4.  Started Zookeeper and brokers. Created topics. In Zookeeper client I
saw:
ls /brokers/ids
[3, 2, 1]

5. Ran producers and consumers with very low traffic. All worked fine.
6. Increased traffic to ~100 messages per second and the controller
(ID=2 / machine 2) failed with broker on the same machine (I've seen few
times this scenario on my cluster).

My logs as following.
Machine 2:

    Controller ID=2:
    [2014-04-02 10:38:37,152] INFO
    [Controller-2-to-broker-3-send-thread], Starting 
    (kafka.controller.RequestSendThread)
    [2014-04-02 10:42:21,131] INFO [SessionExpirationListener on 2], ZK
    expired; shut down all controller components and try to re-elect
    (kafka.controller.KafkaController$SessionExpirationListener)
    [2014-04-02 10:42:21,176] INFO [delete-topics-thread], Shutting down
    (kafka.controller.TopicDeletionManager$DeleteTopicsThread)

    Server ID=2:
    [2014-04-02 10:38:39,113] INFO Partition [topic1,1] on broker 2:
    Expanding ISR for partition [topic1,1] from 2,1 to 2,1,3
    (kafka.cluster.Partition)
    [2014-04-02 10:38:39,129] INFO Partition [topic2,1] on broker 2:
    Expanding ISR for partition [topic2,1] from 2,1 to 2,1,3
    (kafka.cluster.Partition)
    [2014-04-02 10:38:39,488] INFO Closing socket connection to
    /11.111.1.19. (kafka.network.Processor)
    [2014-04-02 10:38:45,343] INFO Closing socket connection to
    /11.111.1.19. (kafka.network.Processor)
    [2014-04-02 10:39:21,842] INFO Closing socket connection to
    /11.111.1.110. (kafka.network.Processor)
    [2014-04-02 10:41:24,605] INFO Partition [topic3,1] on broker 2:
    Shrinking ISR for partition [topic3,1] from 2,1,3 to 2,1
    (kafka.cluster.Partition)
    [2014-04-02 10:41:26,748] INFO Partition [topic3,1] on broker 2:
    Expanding ISR for partition [topic3,1] from 2,1 to 2,1,3
    (kafka.cluster.Partition)
    [2014-04-02 10:42:22,727] INFO Closing socket connection to
    /11.111.1.21. (kafka.network.Processor)
    [2014-04-02 10:42:23,181] INFO Closing socket connection to
    /11.111.1.21. (kafka.network.Processor)
    [2014-04-02 10:42:24,351] INFO Closing socket connection to
    /11.111.1.25. (kafka.network.Processor)
    [2014-04-02 10:42:24,501] INFO Closing socket connection to
    /11.111.1.25. (kafka.network.Processor)
    [2014-04-02 10:42:27,915] ERROR [ReplicaFetcherThread-0-1], Error
    for partition [topic2,4] to broker 1:class
    kafka.common.NotLeaderForPartitionException
    (kafka.server.ReplicaFetcherThread)
    ...
    [2014-04-02 10:42:34,575] ERROR Conditional update of path
    /brokers/topics/topic1/partitions/1/state with data
    {"controller_epoch":7,"leader":2,"version":1,"leader_epoch":61,"isr":[2]}
    and expected version 119 failed due to
    org.apache.zookeeper.KeeperException$BadVersionException:
    KeeperErrorCode = BadVersion for
    /brokers/topics/topic1/partitions/1/state (kafka.utils.ZkUtils$)
    [2014-04-02 10:42:34,575] INFO Partition [topic1,1] on broker 2:
    Cached zkVersion [119] not equal to that in zookeeper, skip updating
    ISR (kafka.cluster.Partition)

    State-change ID=2:
    [2014-04-02 10:38:37,134] TRACE Controller 2 epoch 7 received
    response correlationId 313 for a request sent to broker
    id:2,host:11.111.1.23,port:9092 (state.change.logger)
    [2014-04-02 10:38:37,696] TRACE Controller 2 epoch 7 received
    response correlationId 313 for a request sent to broker
    id:1,host:11.111.1.21,port:9092 (state.change.logger)
    [2014-04-02 10:38:38,682] TRACE Controller 2 epoch 7 received
    response correlationId 312 for a request sent to broker
    id:3,host:11.111.1.25,port:9092 (state.change.logger)
    [2014-04-02 10:38:38,909] TRACE Controller 2 epoch 7 received
    response correlationId 313 for a request sent to broker
    id:3,host:11.111.1.25,port:9092 (state.change.logger)
    [2014-04-02 10:38:38,918] TRACE Controller 2 epoch 7 received
    response correlationId 313 for a request sent to broker
    id:3,host:11.111.1.25,port:9092 (state.change.logger)

Machine 1:

    Controller ID=1:
    [2014-04-02 10:42:20,342] INFO [Controller 1]: Broker 1 starting
    become controller state transition (kafka.controller.KafkaController)
    [2014-04-02 10:42:20,372] INFO [Controller 1]: Controller 1
    incremented epoch to 8 (kafka.controller.KafkaController)
    [2014-04-02 10:42:21,176] DEBUG [Channel manager on controller 1]:
    Controller 1 trying to connect to broker 1
    (kafka.controller.ControllerChannelManager)
    [2014-04-02 10:42:21,238] INFO
    [Controller-1-to-broker-1-send-thread], Controller 1 connected to
    id:1,host:11.111.1.21,port:9092 for sending state change requests
    (kafka.controller.RequestSendThread)
    [2014-04-02 10:42:21,239] DEBUG [Channel manager on controller 1]:
    Controller 1 trying to connect to broker 3
    (kafka.controller.ControllerChannelManager)
    [2014-04-02 10:42:21,460] INFO
    [Controller-1-to-broker-3-send-thread], Controller 1 connected to
    id:3,host:11.111.1.25,port:9092 for sending state change requests
    (kafka.controller.RequestSendThread)
    [2014-04-02 10:42:21,483] INFO
    [Controller-1-to-broker-3-send-thread], Starting 
    (kafka.controller.RequestSendThread)
    [2014-04-02 10:42:21,483] INFO
    [Controller-1-to-broker-1-send-thread], Starting 
    (kafka.controller.RequestSendThread)
    [2014-04-02 10:42:21,506] INFO [Controller 1]: Partitions undergoing
    preferred replica election:  (kafka.controller.KafkaController)
    [2014-04-02 10:42:21,515] INFO [Controller 1]: Partitions that
    completed preferred replica election: 
    (kafka.controller.KafkaController)
    [2014-04-02 10:42:21,516] INFO [Controller 1]: Resuming preferred
    replica election for partitions:  (kafka.controller.KafkaController)
    [2014-04-02 10:42:21,653] INFO [Controller 1]: Partitions being
    reassigned: Map() (kafka.controller.KafkaController)
    [2014-04-02 10:42:21,655] INFO [Controller 1]: Partitions already
    reassigned: List() (kafka.controller.KafkaController)
    [2014-04-02 10:42:21,656] INFO [Controller 1]: Resuming reassignment
    of partitions: Map() (kafka.controller.KafkaController)
    [2014-04-02 10:42:21,742] INFO [Controller 1]: List of topics to be
    deleted:  (kafka.controller.KafkaController)
    [2014-04-02 10:42:21,743] INFO [Controller 1]: List of topics
    ineligible for deletion: topic1,topic2,topic3
    (kafka.controller.KafkaController)
    [2014-04-02 10:42:21,848] INFO [Controller 1]: Currently active
    brokers in the cluster: Set(1, 3) (kafka.controller.KafkaController)
    [2014-04-02 10:42:21,849] INFO [Controller 1]: Currently shutting
    brokers in the cluster: Set() (kafka.controller.KafkaController)

After failover, Controller is on machine 1 and broker ID=2 disappeared
in Zookeeper:
ls /brokers/ids
[3, 1]

Only restart broker ID=2 helps in order to back. My questionis:
Why broker is not able to back and works?
(I see problem with the bad version in Zookeeper but I would like to
better understand it and prevent in future)

Thanks,
Chris

Re: Failing broker with errors for Conditional update

Posted by Neha Narkhede <ne...@gmail.com>.

Hey Chris,

Guozhang's analysis is worth testing. Would you mind grabbing the latest
0.8.1 branch build to re-run your test against?

Thanks,
Neha


On Thu, Apr 3, 2014 at 8:35 AM, Guozhang Wang <wa...@gmail.com> wrote:

> Hi Chris,
>
> From the logs it seems your controller machine id=2 has some soft failure,
> such as a long GC which casued it to session timeout from ZK. Then when it
> is shutting down it hits a known bug (related to KAFKA-1311) in 0.8.1 which
> will cause it to hang on controller shutdown. We are currently releasing a
> new version 0.8.1.1 with the fix to this bug.
>
> Thanks,
>
> Guozhang
>
>
> On Thu, Apr 3, 2014 at 5:11 AM, Krzysztof Ociepa <
> ociepa.krzysztof@gmail.com
> > wrote:
>
> > Hi Team,
> >
> > I test Kafka 0.8.1 and have problems with missing brokers. What I did:
> >
> > 1. Installed Zookeeper 3.3.6 servers on 3 dedicated machines
> > 2. Installed Kafka 0.8.1 brokers on 3 machines (different from Zookeeper
> > ones)
> > 3. Configured brokers as above:
> > # Replication configurations
> > num.replica.fetchers=2
> > replica.fetch.max.bytes=1048576
> > replica.fetch.wait.max.ms=500
> > replica.high.watermark.checkpoint.interval.ms=5000
> > replica.socket.timeout.ms=30000
> > replica.socket.receive.buffer.bytes=65536
> > replica.lag.time.max.ms=10000
> > replica.lag.max.messages=4000
> > controller.socket.timeout.ms=30000
> > controller.message.queue.size=10
> > controlled.shutdown.enable=true
> > auto.leader.rebalance.enable=true
> > leader.imbalance.check.interval.seconds=1200
> >
> > # ZK configuration
> > zookeeper.connection.timeout.ms=6000
> > zookeeper.sync.time.ms=3000
> >
> > 4.  Started Zookeeper and brokers. Created topics. In Zookeeper client I
> > saw:
> > ls /brokers/ids
> > [3, 2, 1]
> >
> > 5. Ran producers and consumers with very low traffic. All worked fine.
> > 6. Increased traffic to ~100 messages per second and the controller
> > (ID=2 / machine 2) failed with broker on the same machine (I've seen few
> > times this scenario on my cluster).
> >
> > My logs as following.
> > Machine 2:
> >
> >     Controller ID=2:
> >     [2014-04-02 10:38:37,152] INFO
> >     [Controller-2-to-broker-3-send-thread], Starting
> >     (kafka.controller.RequestSendThread)
> >     [2014-04-02 10:42:21,131] INFO [SessionExpirationListener on 2], ZK
> >     expired; shut down all controller components and try to re-elect
> >     (kafka.controller.KafkaController$SessionExpirationListener)
> >     [2014-04-02 10:42:21,176] INFO [delete-topics-thread], Shutting down
> >     (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
> >
> >     Server ID=2:
> >     [2014-04-02 10:38:39,113] INFO Partition [topic1,1] on broker 2:
> >     Expanding ISR for partition [topic1,1] from 2,1 to 2,1,3
> >     (kafka.cluster.Partition)
> >     [2014-04-02 10:38:39,129] INFO Partition [topic2,1] on broker 2:
> >     Expanding ISR for partition [topic2,1] from 2,1 to 2,1,3
> >     (kafka.cluster.Partition)
> >     [2014-04-02 10:38:39,488] INFO Closing socket connection to
> >     /11.111.1.19. (kafka.network.Processor)
> >     [2014-04-02 10:38:45,343] INFO Closing socket connection to
> >     /11.111.1.19. (kafka.network.Processor)
> >     [2014-04-02 10:39:21,842] INFO Closing socket connection to
> >     /11.111.1.110. (kafka.network.Processor)
> >     [2014-04-02 10:41:24,605] INFO Partition [topic3,1] on broker 2:
> >     Shrinking ISR for partition [topic3,1] from 2,1,3 to 2,1
> >     (kafka.cluster.Partition)
> >     [2014-04-02 10:41:26,748] INFO Partition [topic3,1] on broker 2:
> >     Expanding ISR for partition [topic3,1] from 2,1 to 2,1,3
> >     (kafka.cluster.Partition)
> >     [2014-04-02 10:42:22,727] INFO Closing socket connection to
> >     /11.111.1.21. (kafka.network.Processor)
> >     [2014-04-02 10:42:23,181] INFO Closing socket connection to
> >     /11.111.1.21. (kafka.network.Processor)
> >     [2014-04-02 10:42:24,351] INFO Closing socket connection to
> >     /11.111.1.25. (kafka.network.Processor)
> >     [2014-04-02 10:42:24,501] INFO Closing socket connection to
> >     /11.111.1.25. (kafka.network.Processor)
> >     [2014-04-02 10:42:27,915] ERROR [ReplicaFetcherThread-0-1], Error
> >     for partition [topic2,4] to broker 1:class
> >     kafka.common.NotLeaderForPartitionException
> >     (kafka.server.ReplicaFetcherThread)
> >     ...
> >     [2014-04-02 10:42:34,575] ERROR Conditional update of path
> >     /brokers/topics/topic1/partitions/1/state with data
> >
> > {"controller_epoch":7,"leader":2,"version":1,"leader_epoch":61,"isr":[2]}
> >     and expected version 119 failed due to
> >     org.apache.zookeeper.KeeperException$BadVersionException:
> >     KeeperErrorCode = BadVersion for
> >     /brokers/topics/topic1/partitions/1/state (kafka.utils.ZkUtils$)
> >     [2014-04-02 10:42:34,575] INFO Partition [topic1,1] on broker 2:
> >     Cached zkVersion [119] not equal to that in zookeeper, skip updating
> >     ISR (kafka.cluster.Partition)
> >
> >     State-change ID=2:
> >     [2014-04-02 10:38:37,134] TRACE Controller 2 epoch 7 received
> >     response correlationId 313 for a request sent to broker
> >     id:2,host:11.111.1.23,port:9092 (state.change.logger)
> >     [2014-04-02 10:38:37,696] TRACE Controller 2 epoch 7 received
> >     response correlationId 313 for a request sent to broker
> >     id:1,host:11.111.1.21,port:9092 (state.change.logger)
> >     [2014-04-02 10:38:38,682] TRACE Controller 2 epoch 7 received
> >     response correlationId 312 for a request sent to broker
> >     id:3,host:11.111.1.25,port:9092 (state.change.logger)
> >     [2014-04-02 10:38:38,909] TRACE Controller 2 epoch 7 received
> >     response correlationId 313 for a request sent to broker
> >     id:3,host:11.111.1.25,port:9092 (state.change.logger)
> >     [2014-04-02 10:38:38,918] TRACE Controller 2 epoch 7 received
> >     response correlationId 313 for a request sent to broker
> >     id:3,host:11.111.1.25,port:9092 (state.change.logger)
> >
> > Machine 1:
> >
> >     Controller ID=1:
> >     [2014-04-02 10:42:20,342] INFO [Controller 1]: Broker 1 starting
> >     become controller state transition (kafka.controller.KafkaController)
> >     [2014-04-02 10:42:20,372] INFO [Controller 1]: Controller 1
> >     incremented epoch to 8 (kafka.controller.KafkaController)
> >     [2014-04-02 10:42:21,176] DEBUG [Channel manager on controller 1]:
> >     Controller 1 trying to connect to broker 1
> >     (kafka.controller.ControllerChannelManager)
> >     [2014-04-02 10:42:21,238] INFO
> >     [Controller-1-to-broker-1-send-thread], Controller 1 connected to
> >     id:1,host:11.111.1.21,port:9092 for sending state change requests
> >     (kafka.controller.RequestSendThread)
> >     [2014-04-02 10:42:21,239] DEBUG [Channel manager on controller 1]:
> >     Controller 1 trying to connect to broker 3
> >     (kafka.controller.ControllerChannelManager)
> >     [2014-04-02 10:42:21,460] INFO
> >     [Controller-1-to-broker-3-send-thread], Controller 1 connected to
> >     id:3,host:11.111.1.25,port:9092 for sending state change requests
> >     (kafka.controller.RequestSendThread)
> >     [2014-04-02 10:42:21,483] INFO
> >     [Controller-1-to-broker-3-send-thread], Starting
> >     (kafka.controller.RequestSendThread)
> >     [2014-04-02 10:42:21,483] INFO
> >     [Controller-1-to-broker-1-send-thread], Starting
> >     (kafka.controller.RequestSendThread)
> >     [2014-04-02 10:42:21,506] INFO [Controller 1]: Partitions undergoing
> >     preferred replica election:  (kafka.controller.KafkaController)
> >     [2014-04-02 10:42:21,515] INFO [Controller 1]: Partitions that
> >     completed preferred replica election:
> >     (kafka.controller.KafkaController)
> >     [2014-04-02 10:42:21,516] INFO [Controller 1]: Resuming preferred
> >     replica election for partitions:  (kafka.controller.KafkaController)
> >     [2014-04-02 10:42:21,653] INFO [Controller 1]: Partitions being
> >     reassigned: Map() (kafka.controller.KafkaController)
> >     [2014-04-02 10:42:21,655] INFO [Controller 1]: Partitions already
> >     reassigned: List() (kafka.controller.KafkaController)
> >     [2014-04-02 10:42:21,656] INFO [Controller 1]: Resuming reassignment
> >     of partitions: Map() (kafka.controller.KafkaController)
> >     [2014-04-02 10:42:21,742] INFO [Controller 1]: List of topics to be
> >     deleted:  (kafka.controller.KafkaController)
> >     [2014-04-02 10:42:21,743] INFO [Controller 1]: List of topics
> >     ineligible for deletion: topic1,topic2,topic3
> >     (kafka.controller.KafkaController)
> >     [2014-04-02 10:42:21,848] INFO [Controller 1]: Currently active
> >     brokers in the cluster: Set(1, 3) (kafka.controller.KafkaController)
> >     [2014-04-02 10:42:21,849] INFO [Controller 1]: Currently shutting
> >     brokers in the cluster: Set() (kafka.controller.KafkaController)
> >
> > After failover, Controller is on machine 1 and broker ID=2 disappeared
> > in Zookeeper:
> > ls /brokers/ids
> > [3, 1]
> >
> > Only restart broker ID=2 helps in order to back. My questionis:
> > Why broker is not able to back and works?
> > (I see problem with the bad version in Zookeeper but I would like to
> > better understand it and prevent in future)
> >
> > Thanks,
> > Chris
> >
> >
> >
>
>
> --
> -- Guozhang
>

Re: Failing broker with errors for Conditional update

Posted by Guozhang Wang <wa...@gmail.com>.

Hi Chris,

>From the logs it seems your controller machine id=2 has some soft failure,
such as a long GC which casued it to session timeout from ZK. Then when it
is shutting down it hits a known bug (related to KAFKA-1311) in 0.8.1 which
will cause it to hang on controller shutdown. We are currently releasing a
new version 0.8.1.1 with the fix to this bug.

Thanks,

Guozhang


On Thu, Apr 3, 2014 at 5:11 AM, Krzysztof Ociepa <ociepa.krzysztof@gmail.com
> wrote:

> Hi Team,
>
> I test Kafka 0.8.1 and have problems with missing brokers. What I did:
>
> 1. Installed Zookeeper 3.3.6 servers on 3 dedicated machines
> 2. Installed Kafka 0.8.1 brokers on 3 machines (different from Zookeeper
> ones)
> 3. Configured brokers as above:
> # Replication configurations
> num.replica.fetchers=2
> replica.fetch.max.bytes=1048576
> replica.fetch.wait.max.ms=500
> replica.high.watermark.checkpoint.interval.ms=5000
> replica.socket.timeout.ms=30000
> replica.socket.receive.buffer.bytes=65536
> replica.lag.time.max.ms=10000
> replica.lag.max.messages=4000
> controller.socket.timeout.ms=30000
> controller.message.queue.size=10
> controlled.shutdown.enable=true
> auto.leader.rebalance.enable=true
> leader.imbalance.check.interval.seconds=1200
>
> # ZK configuration
> zookeeper.connection.timeout.ms=6000
> zookeeper.sync.time.ms=3000
>
> 4.  Started Zookeeper and brokers. Created topics. In Zookeeper client I
> saw:
> ls /brokers/ids
> [3, 2, 1]
>
> 5. Ran producers and consumers with very low traffic. All worked fine.
> 6. Increased traffic to ~100 messages per second and the controller
> (ID=2 / machine 2) failed with broker on the same machine (I've seen few
> times this scenario on my cluster).
>
> My logs as following.
> Machine 2:
>
>     Controller ID=2:
>     [2014-04-02 10:38:37,152] INFO
>     [Controller-2-to-broker-3-send-thread], Starting
>     (kafka.controller.RequestSendThread)
>     [2014-04-02 10:42:21,131] INFO [SessionExpirationListener on 2], ZK
>     expired; shut down all controller components and try to re-elect
>     (kafka.controller.KafkaController$SessionExpirationListener)
>     [2014-04-02 10:42:21,176] INFO [delete-topics-thread], Shutting down
>     (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
>
>     Server ID=2:
>     [2014-04-02 10:38:39,113] INFO Partition [topic1,1] on broker 2:
>     Expanding ISR for partition [topic1,1] from 2,1 to 2,1,3
>     (kafka.cluster.Partition)
>     [2014-04-02 10:38:39,129] INFO Partition [topic2,1] on broker 2:
>     Expanding ISR for partition [topic2,1] from 2,1 to 2,1,3
>     (kafka.cluster.Partition)
>     [2014-04-02 10:38:39,488] INFO Closing socket connection to
>     /11.111.1.19. (kafka.network.Processor)
>     [2014-04-02 10:38:45,343] INFO Closing socket connection to
>     /11.111.1.19. (kafka.network.Processor)
>     [2014-04-02 10:39:21,842] INFO Closing socket connection to
>     /11.111.1.110. (kafka.network.Processor)
>     [2014-04-02 10:41:24,605] INFO Partition [topic3,1] on broker 2:
>     Shrinking ISR for partition [topic3,1] from 2,1,3 to 2,1
>     (kafka.cluster.Partition)
>     [2014-04-02 10:41:26,748] INFO Partition [topic3,1] on broker 2:
>     Expanding ISR for partition [topic3,1] from 2,1 to 2,1,3
>     (kafka.cluster.Partition)
>     [2014-04-02 10:42:22,727] INFO Closing socket connection to
>     /11.111.1.21. (kafka.network.Processor)
>     [2014-04-02 10:42:23,181] INFO Closing socket connection to
>     /11.111.1.21. (kafka.network.Processor)
>     [2014-04-02 10:42:24,351] INFO Closing socket connection to
>     /11.111.1.25. (kafka.network.Processor)
>     [2014-04-02 10:42:24,501] INFO Closing socket connection to
>     /11.111.1.25. (kafka.network.Processor)
>     [2014-04-02 10:42:27,915] ERROR [ReplicaFetcherThread-0-1], Error
>     for partition [topic2,4] to broker 1:class
>     kafka.common.NotLeaderForPartitionException
>     (kafka.server.ReplicaFetcherThread)
>     ...
>     [2014-04-02 10:42:34,575] ERROR Conditional update of path
>     /brokers/topics/topic1/partitions/1/state with data
>
> {"controller_epoch":7,"leader":2,"version":1,"leader_epoch":61,"isr":[2]}
>     and expected version 119 failed due to
>     org.apache.zookeeper.KeeperException$BadVersionException:
>     KeeperErrorCode = BadVersion for
>     /brokers/topics/topic1/partitions/1/state (kafka.utils.ZkUtils$)
>     [2014-04-02 10:42:34,575] INFO Partition [topic1,1] on broker 2:
>     Cached zkVersion [119] not equal to that in zookeeper, skip updating
>     ISR (kafka.cluster.Partition)
>
>     State-change ID=2:
>     [2014-04-02 10:38:37,134] TRACE Controller 2 epoch 7 received
>     response correlationId 313 for a request sent to broker
>     id:2,host:11.111.1.23,port:9092 (state.change.logger)
>     [2014-04-02 10:38:37,696] TRACE Controller 2 epoch 7 received
>     response correlationId 313 for a request sent to broker
>     id:1,host:11.111.1.21,port:9092 (state.change.logger)
>     [2014-04-02 10:38:38,682] TRACE Controller 2 epoch 7 received
>     response correlationId 312 for a request sent to broker
>     id:3,host:11.111.1.25,port:9092 (state.change.logger)
>     [2014-04-02 10:38:38,909] TRACE Controller 2 epoch 7 received
>     response correlationId 313 for a request sent to broker
>     id:3,host:11.111.1.25,port:9092 (state.change.logger)
>     [2014-04-02 10:38:38,918] TRACE Controller 2 epoch 7 received
>     response correlationId 313 for a request sent to broker
>     id:3,host:11.111.1.25,port:9092 (state.change.logger)
>
> Machine 1:
>
>     Controller ID=1:
>     [2014-04-02 10:42:20,342] INFO [Controller 1]: Broker 1 starting
>     become controller state transition (kafka.controller.KafkaController)
>     [2014-04-02 10:42:20,372] INFO [Controller 1]: Controller 1
>     incremented epoch to 8 (kafka.controller.KafkaController)
>     [2014-04-02 10:42:21,176] DEBUG [Channel manager on controller 1]:
>     Controller 1 trying to connect to broker 1
>     (kafka.controller.ControllerChannelManager)
>     [2014-04-02 10:42:21,238] INFO
>     [Controller-1-to-broker-1-send-thread], Controller 1 connected to
>     id:1,host:11.111.1.21,port:9092 for sending state change requests
>     (kafka.controller.RequestSendThread)
>     [2014-04-02 10:42:21,239] DEBUG [Channel manager on controller 1]:
>     Controller 1 trying to connect to broker 3
>     (kafka.controller.ControllerChannelManager)
>     [2014-04-02 10:42:21,460] INFO
>     [Controller-1-to-broker-3-send-thread], Controller 1 connected to
>     id:3,host:11.111.1.25,port:9092 for sending state change requests
>     (kafka.controller.RequestSendThread)
>     [2014-04-02 10:42:21,483] INFO
>     [Controller-1-to-broker-3-send-thread], Starting
>     (kafka.controller.RequestSendThread)
>     [2014-04-02 10:42:21,483] INFO
>     [Controller-1-to-broker-1-send-thread], Starting
>     (kafka.controller.RequestSendThread)
>     [2014-04-02 10:42:21,506] INFO [Controller 1]: Partitions undergoing
>     preferred replica election:  (kafka.controller.KafkaController)
>     [2014-04-02 10:42:21,515] INFO [Controller 1]: Partitions that
>     completed preferred replica election:
>     (kafka.controller.KafkaController)
>     [2014-04-02 10:42:21,516] INFO [Controller 1]: Resuming preferred
>     replica election for partitions:  (kafka.controller.KafkaController)
>     [2014-04-02 10:42:21,653] INFO [Controller 1]: Partitions being
>     reassigned: Map() (kafka.controller.KafkaController)
>     [2014-04-02 10:42:21,655] INFO [Controller 1]: Partitions already
>     reassigned: List() (kafka.controller.KafkaController)
>     [2014-04-02 10:42:21,656] INFO [Controller 1]: Resuming reassignment
>     of partitions: Map() (kafka.controller.KafkaController)
>     [2014-04-02 10:42:21,742] INFO [Controller 1]: List of topics to be
>     deleted:  (kafka.controller.KafkaController)
>     [2014-04-02 10:42:21,743] INFO [Controller 1]: List of topics
>     ineligible for deletion: topic1,topic2,topic3
>     (kafka.controller.KafkaController)
>     [2014-04-02 10:42:21,848] INFO [Controller 1]: Currently active
>     brokers in the cluster: Set(1, 3) (kafka.controller.KafkaController)
>     [2014-04-02 10:42:21,849] INFO [Controller 1]: Currently shutting
>     brokers in the cluster: Set() (kafka.controller.KafkaController)
>
> After failover, Controller is on machine 1 and broker ID=2 disappeared
> in Zookeeper:
> ls /brokers/ids
> [3, 1]
>
> Only restart broker ID=2 helps in order to back. My questionis:
> Why broker is not able to back and works?
> (I see problem with the bad version in Zookeeper but I would like to
> better understand it and prevent in future)
>
> Thanks,
> Chris
>
>
>


-- 
-- Guozhang