Posted to users@kafka.apache.org by Jun Rao <ju...@confluent.io> on 2015/02/03 06:31:39 UTC

Re: Replication stop working

One thing that you need to be aware of is that if a broker goes down, the
affected partitions will remain under-replicated until the broker is
restarted and catches up again.
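
As a rough way to see which partitions are affected, the sketch below parses
the text printed by kafka-topics.sh --describe and lists the partitions whose
ISR is smaller than their replica list. The line format in the regular
expression and the sample output (topic name "events", broker ids) are
assumptions for illustration, so check them against what your Kafka version
actually prints.

```python
import re

# Assumed per-partition line format of `kafka-topics.sh --describe`;
# verify it against the output of your own Kafka version.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output):
    """Return (topic, partition, missing_broker_ids) for every partition
    whose ISR does not contain all of its assigned replicas."""
    result = []
    for line in describe_output.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # topic summary line or unrelated text
        replicas = [int(b) for b in match.group("replicas").split(",")]
        isr = {int(b) for b in match.group("isr").split(",") if b}
        missing = [b for b in replicas if b not in isr]
        if missing:
            result.append(
                (match.group("topic"), int(match.group("partition")), missing)
            )
    return result


# Hypothetical sample output, not taken from the cluster in this thread.
sample = (
    "Topic:events\tPartitionCount:2\tReplicationFactor:2\tConfigs:\n"
    "\tTopic: events\tPartition: 0\tLeader: 3\tReplicas: 3,5\tIsr: 3\n"
    "\tTopic: events\tPartition: 1\tLeader: 4\tReplicas: 4,6\tIsr: 4,6\n"
)

for topic, partition, missing in under_replicated(sample):
    print(topic, partition, "missing from ISR:", missing)
```

If a broker id stays in the "missing" list long after that broker has been
restarted, the follower is not catching up, which matches the symptom
described below.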

Thanks,

Jun

On Tue, Jan 27, 2015 at 10:59 AM, Dong, John <zu...@ebay.com> wrote:

> Hi,
>
> I am new to this forum and I am not sure this is the correct mailing list
> for sending questions. If not, please let me know and I will stop.
>
> I am looking for help resolving a replication issue. Replication stopped
> working a while back.
>
> Kafka environment: Kafka 0.8.1.1, CentOS 6.5, 7-node cluster, default
> replication-factor 2, 10 partitions per topic.
>
> Initially each partition resided on two different nodes. It had been
> this way for several months and was working. Starting two weeks ago, two
> things happened.
>
>   1.  One node's disk usage reached 100% and crashed the kafka process. So
> we had to delete some *.log and *.index files and restart the kafka process.
>   2.  In another case, another node's disk usage reached 90%. Someone
> deleted some *.log and *.index files without shutting down the kafka
> process. This caused issues and kafka was unable to restart. I had to
> delete all *.log and *.index files on this node to bring kafka back online.
>
> Now replication is all broken. Most partitions have only a leader and a
> single replica in the ISR, even though each is assigned two broker ids.
> Whenever I shut down the kafka process on a node, any leader running on
> this node gets moved to the other node in its replica assignment.
> After I restart kafka on this node, it never becomes a follower and its
> data directory never gets updated.
>
> I tried the following:
>
>
>   1.  I turned on TRACE/DEBUG logging for kafka and zookeeper. I did
> not find anything that helped.
>   2.  I also tried to manipulate the replication configuration in zookeeper
> using zkCli.sh, such as adding a follower to the ISR list. That did not
> start a fetcher process to make the broker become a follower.
>   3.  I also created a new topic, and replication worked initially. But as
> soon as I shut down kafka on one of its two nodes, that partition lost one
> replica from the ISR and it never came back. This confirms that the problem
> is reproducible.
>   4.  I ran the kafka preferred replica election tool to force re-election
> of the leader. That did not do anything. It is as if nothing happened to
> the cluster.
>   5.  I added num.replica.fetchers=10 to server.properties and restarted
> kafka. That did not do anything.
>
> Does anyone have any experience with this? Or any advice on where to look
> and what the next steps are for troubleshooting? There are only two things
> that I may have to do.
>
>
>   1.  Shut down all kafka and zookeeper processes and restart them. I really
> do not want to go this route unless I have to. I would like to identify the
> root cause and not randomly restart the whole cluster.
>   2.  Move all topics to another kafka cluster, and rebuild it. This will
> be very time consuming and require a lot of changes in the application.
>
> Thanks.
>
> John Dong
>