Posted to users@kafka.apache.org by Andrew Otto <ot...@wikimedia.org> on 2018/03/12 18:56:44 UTC

Kafka 0.9 MirrorMaker failing with Batch Expired when producing to Kafka 1.0 cluster

Hi all,

I’m troubleshooting a MirrorMaker issue and am not quite sure yet why this
is happening, so I thought I’d ask here in case anyone else has seen this
before.

We’ve been running a Kafka 1.0 cluster for a few months now, replicating
data from a Kafka 0.9.0.1 cluster using 0.9.0.1 MirrorMaker.  This had
mostly been working fine until last week, when we finished migrating a set
of high-volume producers to this new Kafka 1.0 cluster.  Before last week,
the new 1.0 cluster was handling around 50-70K messages per second; as of
last week it is up to around 150K messages per second.

Everything in the new 1.0 cluster is working totally fine.  However, since
we increased traffic to the new 1.0 cluster, our previously operational
MirrorMaker instances started dying with a lot of

ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  -
Error when sending message to topic XXXX with key: null, value: 964 bytes
with error: Batch Expired

and

ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  -
Error when sending message to topic XXXX with key: null, value: 964 bytes
with error: Producer is closed forcefully.

and

INFO  kafka.tools.MirrorMaker$  - Exiting on send failure, skip committing
offsets.


This log is printed for some of the higher volume (not really that high,
these are around 1K messages per second max) topics that MirrorMaker is
replicating.  This happens a few minutes after the MirrorMaker instance
starts.  Until then, the instance is able to produce fine.  Once it
happens, the instance dies and a rebalance is triggered.  I haven’t been
able to consistently reproduce what happens next, but it seems that after a
short bout of instance flapping and rebalancing, the MirrorMaker instances
all get completely stuck.  They continue owning partitions, but don’t
produce anything and don’t log any more errors.  The MirrorMaker consumers
start lagging.
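
As far as I understand, the instance dies like this because MirrorMaker’s
abort.on.send.failure option defaults to true, so any producer send failure
shuts down the whole instance.  A sketch of a typical invocation, with
hypothetical paths and numbers, just to show where the flag lives:

    kafka-mirror-maker.sh \
      --consumer.config /etc/kafka/mirror/consumer.properties \
      --producer.config /etc/kafka/mirror/producer.properties \
      --whitelist '.*' \
      --num.streams 4 \
      --abort.on.send.failure true

I believe setting it to false would keep the instance running after a
failed send, at the cost of dropping the messages that failed.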

A full restart of all instances seems to reset things, but eventually they
get stuck again.

Perhaps there’s some problem I’ve missed with an older MirrorMaker
producing to a newer Kafka cluster?  Could more load on the brokers cause
MirrorMaker produce requests to expire like this?  We were running the same
load on our old 0.9.0.1 cluster before this migration, with the same
MirrorMaker setup, with no problems.
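
For what it’s worth, my understanding is that the 0.9 Java producer expires
a batch that has been sitting in its accumulator for longer than
request.timeout.ms, which is where the "Batch Expired" error comes from, so
these are the knobs I’d expect to matter here.  The values below are purely
illustrative and not what we actually run:

    # Example MirrorMaker producer.properties overrides (illustrative only)
    # How long a batch may sit unsent before failing with "Batch Expired"
    request.timeout.ms=120000
    # Larger batches and a small linger reduce the number of produce requests
    batch.size=65536
    linger.ms=5
    # Total memory available for buffering records not yet sent
    buffer.memory=67108864
    # Retries cover transient send errors; an expired batch is failed
    # outright rather than retried, as far as I know
    retries=10
    acks=all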

Our MirrorMaker and 1.0 broker configuration is here:
https://gist.github.com/ottomata/5324fc3becdd20e9a678d5d37c2db872

Any help is appreciated, thanks!

-Andrew Otto
 Senior Systems Engineer
 Wikimedia Foundation

Re: Kafka 0.9 MirrorMaker failing with Batch Expired when producing to Kafka 1.0 cluster

Posted by Andrew Otto <ot...@wikimedia.org>.
Hm, the hardware should be mostly beefier in this new cluster, but there
are a couple of differences: mainly, we are now using RAID instead of JBOD,
and our high-volume producers (the 150K messages per second I mentioned)
use TLS.

Also, other producers seem to have no problems; it is only MirrorMaker.  Even
stranger, this problem has gone away since yesterday.  Several bounces of
MirrorMaker have happened between then and now (as we tried a few different
MirrorMaker settings).  So far, this is a mystery.

Thanks for the reply, Andras! If it happens again I’ll look into your
suggestion.


On Tue, Mar 13, 2018 at 4:15 PM, Andras Beni <an...@cloudera.com>
wrote:

> Hi Andrew,
>
> It seems the throughput of the new cluster is lower than that of the old
> cluster, and for this reason MirrorMaker cannot send messages fast enough
> (i.e. they expire). I recommend comparing the two clusters’ configurations.
> For the hanging MirrorMaker instances, I think looking at stack dumps would
> help you get closer to the root cause.
>
> Best regards,
> Andras

Re: Kafka 0.9 MirrorMaker failing with Batch Expired when producing to Kafka 1.0 cluster

Posted by Andras Beni <an...@cloudera.com>.
Hi Andrew,

It seems the throughput of the new cluster is lower than that of the old
cluster, and for this reason MirrorMaker cannot send messages fast enough
(i.e. they expire). I recommend comparing the two clusters’ configurations.
For the hanging MirrorMaker instances, I think looking at stack dumps would
help you get closer to the root cause.
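
For example, assuming the JDK tools are installed on the MirrorMaker hosts,
something like the following would capture thread dumps (the pid lookup is
only illustrative):

    # Find the MirrorMaker JVM (any pid lookup works)
    MM_PID=$(pgrep -f kafka.tools.MirrorMaker | head -n 1)
    # Dump all thread stacks; take a few dumps some seconds apart
    jstack "$MM_PID" > /tmp/mirrormaker-stack-$(date +%s).txt

Comparing consecutive dumps should show whether the producer or consumer
threads stay blocked in the same place.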

Best regards,
Andras
