Posted to users@kafka.apache.org by Alexandre Dupriez <al...@gmail.com> on 2020/05/16 11:16:18 UTC

Re: High write operations rate on disk

Hi Soumyajit,

It is possible that, because of the broker restart, you are getting fewer
I/O merges than under steady state. Intuitively, that would come from
a shift from a sequential workload to one more dispersed in nature. It
is likely that your broker generates more disk reads than before the
restart, especially if lots of pages were written back and/or released
during the broker bounce.

What would be interesting to know is the throughput on the
device (read and write, at steady state and during the IOPS burst). I mean
the actual traffic on the disk, not the reads/writes at the file system
level.
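
One way to get at device-level numbers on Linux is to sample
/proc/diskstats twice and diff the counters. The following is only a
minimal sketch under assumptions: the device name passed in is whatever
your EBS volume shows up as (for example "nvme1n1", which is hypothetical
here), and the helper names are made up for illustration. It reports IOPS,
bytes per second, write merges per second, and the average write size,
which together show whether the same throughput is being issued as many
smaller requests.

```python
from typing import Dict, NamedTuple

SECTOR_BYTES = 512  # /proc/diskstats reports sectors in 512-byte units


class DiskCounters(NamedTuple):
    reads: int
    reads_merged: int
    sectors_read: int
    writes: int
    writes_merged: int
    sectors_written: int


def parse_diskstats(text: str) -> Dict[str, DiskCounters]:
    """Map device name -> cumulative counters from /proc/diskstats content."""
    counters = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 11:
            continue  # skip blank or unexpected lines
        # f[0]=major, f[1]=minor, f[2]=name; f[6] and f[10] are time spent (ignored)
        counters[f[2]] = DiskCounters(
            int(f[3]), int(f[4]), int(f[5]), int(f[7]), int(f[8]), int(f[9])
        )
    return counters


def rates(before: DiskCounters, after: DiskCounters, seconds: float) -> Dict[str, float]:
    """Per-second rates between two samples taken `seconds` apart."""
    d = DiskCounters(*(a - b for a, b in zip(after, before)))
    written = d.sectors_written * SECTOR_BYTES
    return {
        "read_iops": d.reads / seconds,
        "write_iops": d.writes / seconds,
        "read_bytes_per_s": d.sectors_read * SECTOR_BYTES / seconds,
        "write_bytes_per_s": written / seconds,
        "write_merges_per_s": d.writes_merged / seconds,
        "avg_write_bytes": written / d.writes if d.writes else 0.0,
    }


def read_counters(device: str, path: str = "/proc/diskstats") -> DiskCounters:
    """Read the current counters for one device (Linux only)."""
    with open(path) as fh:
        return parse_diskstats(fh.read())[device]
```

To use it, call read_counters() for the data volume, wait a known
interval (say 60 seconds), call it again, and pass both samples to
rates(). Comparing a steady-state window with a spike window should show
whether write_merges_per_s and avg_write_bytes drop while
write_bytes_per_s stays flat.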

Thanks,
Alexandre

On Tue, Apr 7, 2020 at 08:42, Seva Feldman <se...@ironsrc.com> wrote:
>
> We mainly use ephemeral instances like i3en, as they fit our access
> pattern better.
>
> On Tue, Apr 7, 2020 at 10:40 AM Soumyajit Sahu <so...@gmail.com>
> wrote:
>
> > @Suman, thanks for confirming. I will dig more then. The instances are
> > dedicated to running Kafka, and so is the mounted volume.
> >
> > @Seva, thanks for the insight. I guess if nothing works, then we will move
> > from st1 to gp2 volumes.
> >
> > On Tue, Apr 7, 2020 at 12:28 AM Suman B N <su...@gmail.com> wrote:
> >
> > > We have used st1 volumes and we never saw any issue.
> > > Yes, we are using m-series. Even t-series worked for us :D
> > >
> > > During those spikes, do you observe any background operations going on?
> > > Check server logs, controller logs.
> > >
> > > On Tue, Apr 7, 2020 at 12:49 PM Seva Feldman <se...@ironsrc.com> wrote:
> > >
> > > > ST1 EBS is fit only for sequential writes and reads. Once you have many
> > > > partitions on EBS, the I/O will be mostly random.
> > > > It would be interesting to monitor random vs sequential...
> > > >
> > > > We tested Kafka on ST1 with 1xx partitions on each EBS volume, and it was
> > > > constantly lagging.
> > > >
> > > > BR
> > > >
> > > > > On Tue, Apr 7, 2020 at 10:06 AM Soumyajit Sahu <soumyajit.sahu@gmail.com>
> > > > > wrote:
> > > >
> > > > > > Our typical IOPS stays at ~10K write ops/min, but it goes to 37K write
> > > > > > ops/min (which is where AWS throttles).
> > > > > > The spike in write ops isn't accompanied by any spike in write throughput
> > > > > > or produce requests (except for the first few minutes of catch up). The
> > > > > > write ops spike stays up (persistently for an hour or two) until we stop
> > > > > > the broker ec2 instance for about 30 mins and then start it back.
> > > > >
> > > > > > @Liam, no, we are not using log compaction except for a few consumer offset
> > > > > > topics and the config topic (for Kafka Connect), and the schema registry store.
> > > > > >
> > > > > > @Suman, are you using m5 or r5 instances? Recently, we migrated from r5 to
> > > > > > m5, and I wonder if that has a hand in this.
> > > > > >
> > > > > > We have about 1000 partitions residing on each disk, but I don't think that
> > > > > > matters as most of the time the brokers run flawlessly (even during peak
> > > > > > traffic hours).
> > > > >
> > > > > Thanks!
> > > > >
> > > > > > On Mon, Apr 6, 2020 at 11:39 PM Suman B N <su...@gmail.com> wrote:
> > > > > >
> > > > > > > We too have a similar setup but we never observed any such spikes.
> > > > > > >
> > > > > > > Are you sure your disk IOPS is good enough? Check if that is throttling.
> > > > > > >
> > > > > > > After a broker restarts, there might be more traffic as well because of
> > > > > > > followers trying to catch up with the leader.
> > > > > >
> > > > > > -Suman
> > > > > >
> > > > > > > On Tue, Apr 7, 2020 at 11:59 AM Soumyajit Sahu <soumyajit.sahu@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > We are running Kafka on AWS EC2 instances (m5.2xlarge) with mounted EBS st1
> > > > > > > > volume (one on each machine).
> > > > > > > > Occasionally, we have noticed that the write ops/second goes through the
> > > > > > > > roof and we get throttled by AWS while the data throughput wouldn't have
> > > > > > > > changed much. As far as our observation goes, it happens usually after a
> > > > > > > > broker restart.
> > > > > > >
> > > > > > > Has anyone else come across this behavior?
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > *Suman*
> > > > > > *OlaCabs*
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Seva Feldman
> > > > VP R&D Mobile Delivery
> > > > [image: ironSource] <http://www.ironsrc.com/>
> > > >
> > > > email seva.f@ironsrc.com
> > > > mobile +972544346089
> > > >
> > > > ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
> > > >
> > >

Re: High write operations rate on disk

Posted by Ashutosh singh <ge...@gmail.com>.

I think this is expected behaviour. You will have to tune your IOPS
accordingly.
When you restart a broker, it tries to read metadata for all available
topics and partitions (including all files), and that is where your read
IOPS will shoot up. And if your cluster is a busy one, it will try to
make the replicas consistent after the broker comes up, and this is where
write IOPS will shoot up.
It depends on how many topics you have, how many partitions are
available, and how busy your cluster is (obviously, producers and consumers
also contribute).
Keep monitoring read and write IOPS and you will be able to figure out
the correct IOPS.
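
As a rough illustration of why the operation count matters even when
throughput is flat, here is a small sketch using the figures reported
earlier in this thread (about 10K write ops/min normally, 37K during the
spike). The throughput value is NOT from the thread; it is a made-up
number used only to show the arithmetic, and the function names are
hypothetical.

```python
def per_second(ops_per_minute: float) -> float:
    """Convert an ops/min figure into ops/s (IOPS)."""
    return ops_per_minute / 60.0


def avg_io_kib(throughput_mib_s: float, ops_per_minute: float) -> float:
    """Average size of one I/O in KiB for a given throughput and op rate."""
    return throughput_mib_s * 1024.0 / per_second(ops_per_minute)


def shrink_factor(steady_ops_per_min: float, spike_ops_per_min: float) -> float:
    """How many times smaller the average I/O gets if throughput is unchanged."""
    return spike_ops_per_min / steady_ops_per_min


# Hypothetical write throughput, assumed identical before and during the spike.
ASSUMED_THROUGHPUT_MIB_S = 20.0

steady_kib = avg_io_kib(ASSUMED_THROUGHPUT_MIB_S, 10_000)
spike_kib = avg_io_kib(ASSUMED_THROUGHPUT_MIB_S, 37_000)
```

Whatever the real throughput is, going from 10K to 37K ops/min at
constant throughput means each write is on average 3.7 times smaller,
which is consistent with a more random, less merged write pattern.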

Analyze your cluster and see if you can move to instance store. If you move
to instance store you wouldn't face this issue, but that has its
own benefits and drawbacks. Broker restarts will be much quicker, but once
your instance is stopped and started for any maintenance you will lose the
data, and once the instance comes up it will start copying all topics, which
will result in CPU and network being throttled. So you would want to
restart your instance during off-business hours.

I recently moved from instance store to EBS and I saw similar behaviour, and
after tuning my IOPS I don't see any issue.






-- 
Thanx & Regard
Ashutosh Singh
08151945559