Posted to users@kafka.apache.org by Philippe Laflamme <pl...@hopper.com> on 2015/02/18 06:21:37 UTC

Broker Shutdown / Leader handoff issue

Hi,

I'm trying to replicate a broker shutdown in unit tests. I've got a simple
cluster running with 2 brokers (and one ZK). I'm successfully able to
create a topic with a single partition and replication factor of 2.

I'd like to test shutting down the current leader for the partition and
make sure my code handles the exceptions thrown such as
NotLeaderForPartitionException.

I can't seem to shut down a broker and have the remaining one report that it
is now the leader for the partition. It looks as though the controller
successfully changes leadership, but the broker itself is unaware of the
change.

Here's a gist of the (convoluted) logs[1].

The sequence is as follows:
1- start 1 ZK and 2 brokers
2- create a topic (test-bogus) with 1 partition and a replication factor of 2
3- wait for leadership
4- ask the controller who is the leader
5- ask all brokers who is the leader
6- shut down the leader
7- wait for leadership
8- ask the controller who is the leader
9- ask the remaining broker who is the leader

Steps 4-6 appear here in the logs[2]
Steps 8-9 appear here[3]

As you can see, the controller is aware of the leadership change, but not
the broker. I've activated controlled shutdown and this is still happening.
Any idea what may be causing this?

I'm using Kafka 0.8.1.1 and ZK 3.4.5-cdh4.6

I query the brokers for the leader with a TopicMetadataRequest, and I read
the controller's view of leadership from
ControllerContext.partitionLeadershipInfo.
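
For reference, the "wait for leadership" in steps 3 and 7 can be sketched
roughly as below. This is only an illustration, not the actual test code:
LeaderWait, awaitLeader and the leaderLookup callback are hypothetical names,
and leaderLookup stands in for whatever the test uses to ask a broker (for
example, a TopicMetadataRequest) and returns the broker id it reports.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

// Hypothetical test helper; not part of Kafka.
final class LeaderWait {
    private LeaderWait() {}

    /**
     * Polls leaderLookup until it reports expectedLeader or the timeout elapses.
     * Returns true if the expected leader was observed in time, false otherwise.
     */
    static boolean awaitLeader(IntSupplier leaderLookup, int expectedLeader,
                               long timeoutMs, long pollIntervalMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            // Ask the broker (or any other source) who it thinks the leader is.
            if (leaderLookup.getAsInt() == expectedLeader) {
                return true;
            }
            // Give up once the deadline has passed.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With a helper like this, step 9 fails (returns false) for as long as the
remaining broker keeps reporting the old leader, which is the behaviour I'm
seeing.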

Thanks
Philippe
[1] https://gist.github.com/plaflamme/60805bfe15ae0106304a
[2] https://gist.github.com/plaflamme/60805bfe15ae0106304a#file-gistfile1-txt-L153-L158
[3] https://gist.github.com/plaflamme/60805bfe15ae0106304a#file-gistfile1-txt-L227-L228

Re: Broker Shutdown / Leader handoff issue

Posted by Jun Rao <ju...@confluent.io>.

Hmm, if you shut down the leader, the follower should get a socket
exception immediately.

Thanks,

Jun

On Wed, Feb 18, 2015 at 7:01 AM, Philippe Laflamme <pl...@hopper.com>
wrote:

> After further investigation, I've figured out that the issue is caused by
> the follower not processing messages from the controller until its
> ReplicaFetcherThread has shut down completely (which only happens when the
> socket times out).
>
> If the test waits for the socket to time out, the logs show that the
> ReplicaFetcherThread shuts down completely, and immediately thereafter, the
> UpdateMetadata requests get processed.
>
> Strangely, this happens even when controlled shutdown is enabled.
>
> Sounds related to this[1] which seems to have been fixed in 0.8.0. Are
> there other edge cases not covered by the fix? Is this a known problem in
> 0.8.1.1?
>
> Thanks,
> Philippe
> [1] https://issues.apache.org/jira/browse/KAFKA-612

Re: Broker Shutdown / Leader handoff issue

Posted by Philippe Laflamme <pl...@hopper.com>.

After further investigation, I've figured out that the issue is caused by
the follower not processing messages from the controller until its
ReplicaFetcherThread has shut down completely (which only happens when the
socket times out).

If the test waits for the socket to time out, the logs show that the
ReplicaFetcherThread shuts down completely, and immediately thereafter, the
UpdateMetadata requests get processed.

Strangely, this happens even when controlled shutdown is enabled.

This sounds related to KAFKA-612 [1], which seems to have been fixed in
0.8.0. Are there other edge cases not covered by the fix? Is this a known
problem in 0.8.1.1?
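
A possible test-only workaround, assuming the delay really is bounded by the
replica fetcher's socket timeout as described above, might be to lower
replica.socket.timeout.ms on the test brokers. The sketch below only builds
the broker properties; TestBrokerProps and forBroker are hypothetical names,
and the 2000 ms value is an arbitrary illustration rather than a
recommendation.

```java
import java.util.Properties;

// Hypothetical helper for building test broker settings; not part of Kafka.
final class TestBrokerProps {
    private TestBrokerProps() {}

    static Properties forBroker(int brokerId, int port, String zkConnect, String logDir) {
        Properties p = new Properties();
        p.setProperty("broker.id", Integer.toString(brokerId));
        p.setProperty("port", Integer.toString(port));
        p.setProperty("zookeeper.connect", zkConnect);
        p.setProperty("log.dirs", logDir);
        // Controlled shutdown is already enabled in the test described above.
        p.setProperty("controlled.shutdown.enable", "true");
        // Assumption: a shorter replica socket timeout lets the follower's
        // fetcher notice the dead leader sooner. The value is illustrative.
        p.setProperty("replica.socket.timeout.ms", "2000");
        return p;
    }
}
```

The resulting Properties would then be handed to the broker configuration
used by the test (for example, when constructing a KafkaConfig). This only
shortens the wait; it does not explain why the controller's requests are
held up in the first place.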

Thanks,
Philippe
[1] https://issues.apache.org/jira/browse/KAFKA-612
