
Posted to users@kafka.apache.org by "Dong, John" <zu...@ebay.com> on 2015/01/27 19:59:22 UTC

Replication stop working

Hi,

I am new to this forum and I am not sure this is the correct mailing list for sending questions. If not, please let me know and I will stop.

I am looking for help resolving a replication issue. Replication stopped working a while back.

Kafka environment: Kafka 0.8.1.1, CentOS 6.5, 7-node cluster, default replication factor 2, 10 partitions per topic.

Initially each partition resided on two different nodes. It had been working this way for several months. Starting two weeks ago, two things happened.

1. One node's disk usage reached 100%, which crashed the Kafka process. We had to delete some *.log and *.index files and restart the Kafka process.
2. In another case, a different node's disk usage reached 90%. Someone deleted some *.log and *.index files without shutting down the Kafka process. This caused problems and Kafka was unable to restart. I had to delete all *.log and *.index files on this node to bring Kafka back online.

Now replication is completely broken. Most partitions have a leader and only that one broker in the ISR, even though each is assigned two replica broker ids. Whenever I shut down the Kafka process on a node, any leaders running on that node move to the other node in the replica assignment. After I restart Kafka on the node, it never becomes a follower again and its data directory never gets updated.
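
To make the symptom above concrete, here is a minimal sketch that lists partitions whose ISR is smaller than their replica assignment. It assumes the text was captured from `kafka-topics.sh --describe` and that partition lines look like `Topic: t  Partition: 0  Leader: 3  Replicas: 3,5  Isr: 3`, as they do in 0.8.x as far as I know. The function name and the sample topic are made up for illustration.

```python
import re

# Partition lines as printed by `kafka-topics.sh --describe`
# (layout assumed from 0.8.x; adjust the pattern if your output differs).
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output):
    """Return (topic, partition, missing_broker_ids) for every partition
    whose ISR does not contain all assigned replicas."""
    result = []
    for line in describe_output.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # topic summary lines and blank lines do not match
        replicas = [int(b) for b in m.group("replicas").split(",")]
        isr = {int(b) for b in m.group("isr").split(",") if b}
        missing = [b for b in replicas if b not in isr]
        if missing:
            result.append((m.group("topic"), int(m.group("partition")), missing))
    return result


# Hypothetical sample: partition 0 has lost broker 5 from its ISR.
sample = (
    "Topic:events\tPartitionCount:2\tReplicationFactor:2\tConfigs:\n"
    "\tTopic: events\tPartition: 0\tLeader: 3\tReplicas: 3,5\tIsr: 3\n"
    "\tTopic: events\tPartition: 1\tLeader: 4\tReplicas: 4,6\tIsr: 4,6\n"
)
print(under_replicated(sample))
```

Running this against output captured before and after restarting a broker would show whether the restarted broker ever rejoins the ISR for any partition.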

I tried the following:

1. I turned on TRACE/DEBUG logging for Kafka and ZooKeeper. I did not find anything helpful.
2. I tried manipulating the replication state in ZooKeeper using zkCli.sh, for example adding a follower to the ISR list. That did not start a fetcher process to make the broker a follower.
3. I created a new topic, and replication worked initially. But as soon as I shut down Kafka on one of its two nodes, the partition lost one replica from the ISR and it never came back. This confirms that the problem is reproducible.
4. I ran the Kafka preferred replica election tool to force a leader re-election. That did nothing; it was as if nothing had happened to the cluster.
5. I added num.replica.fetchers=10 to server.properties and restarted Kafka. That did not help either.
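
For reading (rather than editing) the state kept in ZooKeeper, the sketch below compares a partition's assigned replicas with its ISR. It assumes the 0.8.x layout as I understand it: the topic znode `/brokers/topics/<topic>` holds a JSON map of partition to replica ids, and `/brokers/topics/<topic>/partitions/<n>/state` holds the leader and ISR. The function name and the sample JSON values are illustrative, not taken from the cluster described here.

```python
import json


def missing_from_isr(topic_json, state_json, partition):
    """Return the assigned replica ids that are absent from the ISR.

    topic_json: contents of /brokers/topics/<topic> (assumed layout)
    state_json: contents of /brokers/topics/<topic>/partitions/<n>/state
    """
    assigned = json.loads(topic_json)["partitions"][str(partition)]
    isr = set(json.loads(state_json)["isr"])
    return [broker for broker in assigned if broker not in isr]


# Hypothetical znode contents, e.g. copied from `get <path>` in zkCli.sh.
topic_json = '{"version":1,"partitions":{"0":[3,5],"1":[4,6]}}'
state_json = '{"controller_epoch":7,"leader":3,"version":1,"leader_epoch":12,"isr":[3]}'
print(missing_from_isr(topic_json, state_json, 0))
```

This only reports the mismatch; it does not modify ZooKeeper.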

Has anyone experienced this? Any advice on where to look and what the next troubleshooting steps are? As far as I can see, I have only two options left.

1. Shut down all Kafka and ZooKeeper processes and restart them. I really do not want to go this route unless I have to. I would like to identify the root cause rather than randomly restart the whole cluster.
2. Move all topics to another Kafka cluster and rebuild this one. This would be very time-consuming and would require a lot of changes in the application.

Thanks.

John Dong

Re: Replication stop working

Posted by Jun Rao <ju...@confluent.io>.

One thing you need to be aware of is that if a broker goes down, the
affected partitions will remain under-replicated until the broker is
restarted and catches up again.

Thanks,

Jun
