Posted to users@kafka.apache.org by Nicholas Feinberg <ni...@liftoff.io> on 2019/11/21 23:52:33 UTC

Broker shutdown slowdown between 1.1.0 and 2.3.1

I've been looking at upgrading my cluster from 1.1.0 to 2.3.1. While
testing, I've noticed that shutting brokers down seems to take consistently
longer on 2.3.1. Specifically, the process of 'creating snapshots' seems to
take several times longer than it did on 1.1.0. On a small testing setup,
the time needed to create snapshots and shut down goes from ~20s to ~120s;
with production-scale data, it goes from ~2min to ~30min.

To allow myself to roll back, I'm still using the 1.1 versions of the
inter-broker protocol and the message format - is it possible that those
could slow things down in 2.3.1? If not, any ideas what else could be at
fault, or what I could do to narrow down the issue further?

Thanks!
-Nicholas

Re: Broker shutdown slowdown between 1.1.0 and 2.3.1

Posted by Nicholas Feinberg <ni...@liftoff.io>.
Sure thing! Done <https://issues.apache.org/jira/browse/KAFKA-9227>.

On Fri, Nov 22, 2019 at 7:58 AM Ismael Juma <is...@juma.me.uk> wrote:

> Can you please file a JIRA?
>
> Ismael

Re: Broker shutdown slowdown between 1.1.0 and 2.3.1

Posted by Ismael Juma <is...@juma.me.uk>.
Can you please file a JIRA?

Ismael


Re: Broker shutdown slowdown between 1.1.0 and 2.3.1

Posted by Nicholas Feinberg <ni...@liftoff.io>.
On Thu, Nov 21, 2019 at 4:25 PM Peter Bukowinski <pm...@gmail.com> wrote:

> How many partitions are on each of your brokers? That’s a key factor
> affecting shutdown and startup time.
>

The test hosts run about 384 partitions each (7 topics * 128 partitions
each * 3x replication / 7 brokers). The largest prod cluster has about 1344
partitions/broker; the smallest and slowest has 2560.
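As a rough check of the per-broker figure above, here is a minimal sketch of the arithmetic. It assumes replicas are spread evenly across brokers; the function name is just illustrative.

```python
def replicas_per_broker(topics, partitions_per_topic, replication_factor, brokers):
    """Average partition replicas hosted per broker, assuming an even spread."""
    total_replicas = topics * partitions_per_topic * replication_factor
    return total_replicas / brokers


# Test cluster from this message: 7 topics, 128 partitions each, 3x replication, 7 brokers.
print(replicas_per_broker(topics=7, partitions_per_topic=128,
                          replication_factor=3, brokers=7))  # 384.0
```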


> I’m currently doing a rolling restart of a 150-broker cluster running
> Kafka 2.3.1. The cluster is very busy (~500k msg/sec, ~1GB/sec). Each
> broker has about 65 partitions. Each broker restart cycle (stop/start,
> rejoin ISR) takes about 90 seconds.
>

In our largest prod cluster (16 d2.8xlarge brokers, 200k msg/s, 300 MB/s),
each restart cycle takes about 3 minutes on 1.1.0 (counting ISR-rejoin
time) and about 30 minutes on 2.3.1. The only other change we made between
versions was increasing the heap size from 8G to 16G.
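To put those per-broker numbers in terms of a full rolling restart, here is a back-of-the-envelope sketch. It assumes brokers are restarted strictly one at a time with no pause between them, which is a simplification; the function name is illustrative.

```python
def rolling_restart_minutes(brokers, minutes_per_broker):
    """Total wall-clock time for a one-at-a-time rolling restart."""
    return brokers * minutes_per_broker


on_1_1_0 = rolling_restart_minutes(brokers=16, minutes_per_broker=3)
on_2_3_1 = rolling_restart_minutes(brokers=16, minutes_per_broker=30)

# Roughly 48 minutes versus 480 minutes (8 hours) for the 16-broker cluster.
print(on_1_1_0, on_2_3_1, on_2_3_1 / on_1_1_0)  # 48 480 10.0
```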

Thanks for the response!



Re: Broker shutdown slowdown between 1.1.0 and 2.3.1

Posted by Peter Bukowinski <pm...@gmail.com>.
How many partitions are on each of your brokers? That’s a key factor affecting shutdown and startup time. Even with a large partition count, though, I’ve seen a notable reduction in shutdown and startup times as I’ve moved from Kafka 0.11 to 1.x to 2.x.

I’m currently doing a rolling restart of a 150-broker cluster running Kafka 2.3.1. The cluster is very busy (~500k msg/sec, ~1GB/sec). Each broker has about 65 partitions. Each broker restart cycle (stop/start, rejoin ISR) takes about 90 seconds.

