Posted to users@kafka.apache.org by Jason Rosenberg <jb...@squareup.com> on 2015/06/01 21:56:54 UTC

Re: Cascading failures on running out of disk space

Hi Jananee,

Do you know for sure that you ran out of disk space completely? Did you see
IOExceptions on failed writes?  Normally, when that happens, the broker is
supposed to immediately shut itself down.  Did that one broker shut itself
down?

The NotLeaderForPartitionExceptions are normal when partition leadership
changes and clients don't yet know about it.  They usually discover a
leadership change by getting this failure and then re-checking the
partition metadata.  But this metadata request can also fail under certain
conditions, which results in repeated NotLeaderForPartitionExceptions.
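
(For what it's worth, that metadata re-check is just a TopicMetadataRequest
sent to any live broker.  Here's a rough sketch using the 0.8.x
SimpleConsumer API -- the broker host/port and topic name are placeholders,
and this only illustrates the lookup, it's not exactly what the high-level
clients do internally:)

import java.util.Collections;

import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.TopicMetadataResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class LeaderLookup {
  public static void main(String[] args) {
    // Placeholder broker address and topic -- substitute your own.
    SimpleConsumer consumer =
        new SimpleConsumer("broker1.example.com", 9092, 100000, 64 * 1024, "leaderLookup");
    try {
      TopicMetadataRequest request =
          new TopicMetadataRequest(Collections.singletonList("xxxx"));
      TopicMetadataResponse response = consumer.send(request);
      for (TopicMetadata topic : response.topicsMetadata()) {
        for (PartitionMetadata partition : topic.partitionsMetadata()) {
          // leader() can be null while a new leader is still being elected
          System.out.println("partition " + partition.partitionId() + " leader: "
              + (partition.leader() == null ? "none" : partition.leader().host()));
        }
      }
    } finally {
      consumer.close();
    }
  }
}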

I've seen consumer offsets get reset too, if/when there's an unclean leader
election, e.g. if the leader goes down hard before the followers are fully
caught up (perhaps that happened in this case, if the leader was on the
broker with the full disk).  I'm not sure why the consumer offsets have to
be completely reset, but that's what I've seen too.
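
It's also worth double-checking how your high-level consumers are
configured for that case.  A minimal sketch (0.8.2-style old consumer; the
zookeeper address and group id are placeholders) -- auto.offset.reset is
what decides whether a consumer whose committed offset has gone missing
rewinds to the beginning or jumps to the tail:

import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

public class OffsetConfigSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("zookeeper.connect", "zk1.example.com:2181");  // placeholder
    props.put("group.id", "my-consumer-group");              // placeholder
    props.put("offsets.storage", "kafka");      // commit to the __consumer_offsets topic
    props.put("dual.commit.enabled", "false");
    // Where to resume if the committed offset is missing or out of range:
    // "smallest" re-reads from the beginning, "largest" skips to the tail.
    props.put("auto.offset.reset", "largest");
    props.put("auto.commit.enable", "false");   // commit explicitly, after processing

    ConsumerConnector connector =
        Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
    // ... create streams, process messages, call connector.commitOffsets() ...
    connector.shutdown();
  }
}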

Probably the most important thing to know is that you don't want to let
your disks fill up.  If you add early warning/monitoring so you can take
action before that happens, you'll avoid these unclean leader election
scenarios.
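
Even something as crude as the following, run from cron or wrapped by your
monitoring system, is better than nothing (the mount point and threshold
here are made up -- point it at whatever volume backs log.dirs):

import java.io.File;

public class DiskSpaceCheck {
  public static void main(String[] args) {
    // Placeholder path and threshold: use the volume backing the broker's log.dirs.
    File logDir = new File("/var/kafka-logs");
    double freePct = 100.0 * logDir.getUsableSpace() / logDir.getTotalSpace();
    System.out.printf("%.1f%% free on %s%n", freePct, logDir.getPath());
    // Non-zero exit so a cron/monitoring wrapper can page someone well
    // before the disk actually fills up.
    if (freePct < 15.0) {
      System.exit(1);
    }
  }
}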

Jason

On Wed, May 27, 2015 at 10:54 AM, Jananee S <ja...@gmail.com> wrote:

> Hi,
>
>   We have the following setup -
>
> Number of brokers: 3
> Number of zookeepers: 3
> Default replication factor: 3
> Offsets Storage: kafka
>
> When one of our brokers ran out of disk space, we started seeing a lot of
> errors in the broker logs at an alarming rate. This caused the other
> brokers to run out of disk space as well.
>
> ERROR [ReplicaFetcherThread-0-101813211], Error for partition [xxxx,47] to
> broker 101813211:class kafka.common.UnknownException
> (kafka.server.ReplicaFetcherThread)
>
> WARN [Replica Manager on Broker 101813211]: Fetch request with correlation
> id 161672 from client ReplicaFetcherThread-0-101813211 on partition
> [xxxx,11] failed due to Leader not local for partition [xxxx,11] on broker
> 101813211 (kafka.server.ReplicaManager)
>
> We also noticed NotLeaderForPartitionException in the producer and consumer
> logs (also at an alarming rate).
>
> ERROR [2015-05-27 09:54:48,613] kafka.consumer.ConsumerFetcherThread: [
> ConsumerFetcherThread-xxxx_prod2-1432719772385-bd7608b8-0-101813211], Error
> for partition [yyyy,1] to broker 101813211:class kafka.common.
> NotLeaderForPartitionException
>
> The __consumer_offsets topic somehow got corrupted and consumers started
> consuming already consumed messages on restart.
>
> We deleted the offending topic and tried restarting the brokers and
> zookeepers. Now we are getting lots of corrupt index errors on broker start
> up.
>
> Was all this due to the replication factor being the same as the number of
> brokers? Why would the topic files get corrupted in such a scenario?
> Please let us know how to recover from this. Also, how do we turn down the
> error logging rate?
>
> Thanks,
> Jananee
>

Re: Cascading failures on running out of disk space

Posted by Jananee S <ja...@gmail.com>.
Thanks Jason.

We did run out of disk space and noticed IOExceptions too. No, the broker
did not shut itself down. Is there some configuration that would enable
this for one or all brokers? That would be a better scenario to be in.
Right now, we have set up alerts for when disk usage goes beyond a
threshold. We have also decreased the replication factor to 2. Hopefully
this will be enough to avert disaster.  The only thing that is still
worrying is the consumer offsets getting reset. All our systems use
high-level consumers. In some cases, we have state that can be used to
prevent reprocessing of old messages; in other cases, we don't have
anything that could help us here.
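
(In case it helps anyone else: the state we keep is essentially a
per-partition high-water mark of processed offsets, consulted before
handling each message.  A rough sketch against the 0.8.x high-level
consumer -- the topic, group id and the in-memory map are placeholders for
our real topic names and durable store:)

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class DedupingConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("zookeeper.connect", "zk1.example.com:2181");  // placeholder
    props.put("group.id", "my-consumer-group");              // placeholder
    props.put("auto.commit.enable", "false");

    ConsumerConnector connector =
        Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
    Map<String, List<KafkaStream<byte[], byte[]>>> streams =
        connector.createMessageStreams(Collections.singletonMap("yyyy", 1));

    // Highest offset already processed per partition; in practice this lives
    // in the same durable store as the processing results.
    Map<Integer, Long> processed = new HashMap<Integer, Long>();

    ConsumerIterator<byte[], byte[]> it = streams.get("yyyy").get(0).iterator();
    while (it.hasNext()) {
      MessageAndMetadata<byte[], byte[]> msg = it.next();
      Long last = processed.get(msg.partition());
      if (last != null && msg.offset() <= last) {
        continue;  // handled before the offsets were reset; skip it
      }
      handle(msg.message());                        // application-specific work
      processed.put(msg.partition(), msg.offset());
      connector.commitOffsets();
    }
  }

  private static void handle(byte[] payload) {
    // placeholder for real processing
  }
}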
