Posted to users@kafka.apache.org by Jason Rosenberg <jb...@squareup.com> on 2015/04/08 19:31:08 UTC

expected behavior if a node undergoes unclean shutdown

Hello,

I'm still trying to get to the bottom of an issue we had previously, with
an unclean shutdown during an upgrade to 0.8.2.1 (from 0.8.1.1).  In that
case, the controlled shutdown was interrupted, and the node was shut down
abruptly.  This resulted in about 5 minutes of unavailability for most
partitions.  (I think that issue is related to the one reported by Thunder
in the thread titled: "Problem with node after restart no partitions?".)
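
(For context: we were using controlled shutdown.  If I'm reading the
0.8.2 defaults right, the relevant broker settings are roughly:

  controlled.shutdown.enable=true
  controlled.shutdown.max.retries=3
  controlled.shutdown.retry.backoff.ms=5000

so once the retries are exhausted, or the process is killed mid-shutdown,
the broker goes down without having handed off leadership of its
partitions.)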

Anyway, while investigating that, I've gotten side-tracked trying to
understand what the expected behavior should be if the controller node
dies abruptly.

To test this, I have a small test cluster (2 nodes, 100 partitions, each
with replication factor 2, using 0.8.2.1).  There are also a few test
producer clients, some of them high-volume.
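
(If anyone wants to reproduce the setup, something like this should work;
the topic name and ZooKeeper address are just placeholders:

  bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --partitions 100 --replication-factor 2 --topic test-topic

and then point a few producers at it.)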

I intentionally killed the controller node hard.  I noticed that the
second node spammed its logs for about 10 seconds, trying to fetch data
for the partitions it was following on the node that was killed.
Finally, after about 10 seconds, the second node elected itself the new
controller, and things slowly recovered.
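
(My working theory on the ~10 seconds: broker liveness is tracked via
ephemeral znodes in ZooKeeper, so a hard-killed broker, and the
/controller znode it held, only disappears once its ZooKeeper session
expires.  If I'm reading the 0.8.2 defaults right, that's:

  zookeeper.session.timeout.ms=6000
  zookeeper.connection.timeout.ms=6000

i.e. roughly 6 seconds just to detect the failure, plus however long the
new controller then takes to elect leaders for the affected partitions.)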

Clients could not successfully produce to the affected partitions until
the new controller was elected (in the meantime, their metadata requests
to discover the new partition leaders failed).
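
For reference, here's a minimal sketch of the kind of test producer
involved, using the new Java producer that ships with 0.8.2 (the broker
list and topic name below are placeholders).  With retries configured,
sends queue up and go through once new leaders are elected, but nothing
succeeds during the election window:

  import java.util.Properties;
  import org.apache.kafka.clients.producer.Callback;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import org.apache.kafka.clients.producer.RecordMetadata;

  public class TestProducer {
      public static void main(String[] args) {
          Properties props = new Properties();
          // Placeholder broker list: point this at the two test brokers.
          props.put("bootstrap.servers", "broker1:9092,broker2:9092");
          props.put("key.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
          // Retry sends across the failover window instead of failing fast.
          props.put("retries", "10");
          props.put("retry.backoff.ms", "500");

          KafkaProducer<String, String> producer =
              new KafkaProducer<String, String>(props);
          for (int i = 0; i < 1000; i++) {
              ProducerRecord<String, String> record =
                  new ProducerRecord<String, String>(
                      "test-topic", Integer.toString(i), "payload-" + i);
              // Log sends that still fail after the retry budget is spent,
              // e.g. when leader election outlasts the retries.
              producer.send(record, new Callback() {
                  public void onCompletion(RecordMetadata md, Exception e) {
                      if (e != null) {
                          System.err.println("send failed: " + e);
                      }
                  }
              });
          }
          producer.close();
      }
  }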

I would have expected the cluster to recover more quickly when a node
fails, given that there are available replicas that can become leader and
start receiving data.  With just 100 partitions, I would have expected
this recovery to happen very quickly.  (In our previous issue, where it
seemed to take 5 minutes, the longer duration was probably related to a
much larger number of partitions.)

Anyway, before I start filing Jiras and attaching log snippets, I'd like
to understand: what should the expected behavior be?

If a controller (or really any node in the cluster) undergoes unclean
shutdown, how should the cluster respond in keeping replicas available
(assuming all replicas were in the ISR before the shutdown)?  How fast
should controller and partition leader election happen in this case?

Thanks,

Jason

Re: expected behavior if a node undergoes unclean shutdown

Posted by Jason Rosenberg <jb...@squareup.com>.
Anyone have any thoughts here?

Re: expected behavior if a node undergoes unclean shutdown

Posted by Jason Rosenberg <jb...@squareup.com>.
I've confirmed that the same thing happens even if it's not the controller
that's killed hard.  Also, in several trials, it took between 10 and 30
seconds to recover.

Jason
