Posted to users@kafka.apache.org by Christiane Lemke <ch...@gmail.com> on 2017/08/07 18:04:36 UTC

Random consumer offset rewinds/resets

Hi all,

we are fighting with offset rewinds of seemingly random size, hitting
seemingly random partitions, whenever we restart any node in our Kafka
cluster. We are running out of ideas - any help or pointers to things to
investigate would be highly appreciated.
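
For anyone hitting similar symptoms, this is roughly the check we run while
polling committed offsets. It is a hypothetical sketch, not a Kafka API: the
function name and the shape of the observations are our own invention, and in
practice the offsets would come from the broker (e.g. via the consumer-groups
tooling).

```python
# Hypothetical sketch: flag committed-offset "rewinds" by comparing each
# newly observed offset against the highest offset seen so far for that
# partition. Observations are plain (partition, committed_offset) tuples
# in polling order; in practice they would be polled from the broker.

def find_rewinds(observations):
    """Return a list of (partition, previous_high, new_offset) for every
    observation where the committed offset moved backwards."""
    high_water = {}  # partition -> highest committed offset seen so far
    rewinds = []
    for partition, offset in observations:
        prev = high_water.get(partition)
        if prev is not None and offset < prev:
            rewinds.append((partition, prev, offset))
        high_water[partition] = offset if prev is None else max(prev, offset)
    return rewinds
```

For example, a commit sequence of 100, 250, 90 on partition 14 would be
flagged as a rewind from 250 back to 90.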

Our Kafka setup is dual data center, with two local broker clusters (3 nodes
each) and two aggregate broker clusters (5 nodes each); the aggregate
clusters run MirrorMaker to consume messages from the local clusters.

The issues seem to have appeared since we upgraded from 0.10.1.0 to 0.11, but
we are not entirely sure the upgrade is related.

Our first theory was that the consumer offsets topic (which uses compaction)
had grown too large and was causing the issues on restart, and indeed we
found that the log cleaner threads had died after the upgrade. But restarting
the brokers and cleaning this topic did not fix the issue.
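
To double-check the cleaner theory we scan the broker's log-cleaner.log for
signs of dead cleaner threads. A minimal sketch of that scan is below; note
the death-message patterns are assumptions (the exact wording differs between
broker versions), so adapt them to your own logs.

```python
# Hypothetical sketch: scan log-cleaner.log lines for hints that a log
# cleaner thread has died. The patterns below are assumptions -- the
# exact wording varies between Kafka broker versions.

DEATH_PATTERNS = ("uncaught exception", "error due to", "stopped")

def dead_cleaner_lines(log_lines):
    """Return the lines that mention a cleaner thread together with one
    of the (assumed) death patterns."""
    hits = []
    for line in log_lines:
        lowered = line.lower()
        if "cleaner" in lowered and any(p in lowered for p in DEATH_PATTERNS):
            hits.append(line)
    return hits
```

Feeding it the broker's log-cleaner.log line by line surfaces candidate
failures without having to eyeball the whole file.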

The logs are pretty silent when it happens. Before we cleaned the consumer
offsets topic, we got a few of these errors every time it happened, but no
longer now:

[2017-08-04 11:19:25,970] ERROR [Group Metadata Manager on Broker 472]: Error loading offsets from __consumer_offsets-14 (kafka.coordinator.group.GroupMetadataManager)
java.lang.IllegalStateException: Unexpected unload of active group tns-ticket-store-b144c9d1-425a-4b90-8310-f6e886741494 while loading partition __consumer_offsets-14
        at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsAndOffsets$6.apply(GroupMetadataManager.scala:600)
        at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsAndOffsets$6.apply(GroupMetadataManager.scala:595)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
        at kafka.coordinator.group.GroupMetadataManager.loadGroupsAndOffsets(GroupMetadataManager.scala:595)
        at kafka.coordinator.group.GroupMetadataManager.kafka$coordinator$group$GroupMetadataManager$$doLoadGroupsAndOffsets$1(GroupMetadataManager.scala:455)
        at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsForPartition$1.apply$mcV$sp(GroupMetadataManager.scala:441)
        at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
        at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

Does this seem familiar to anyone? Any suggestions on what to look into more
closely to investigate this issue? We are happy to give more details about
anything that might be helpful.

Thanks a lot in advance,

Christiane

Re: Random consumer offset rewinds/resets

Posted by Ismael Juma <is...@juma.me.uk>.
Great to hear, thanks for the update.

Ismael


Re: Random consumer offset rewinds/resets

Posted by Christiane Lemke <ch...@gmail.com>.
Hi Ismael,

thanks a lot for your answer, it was indeed exactly the issue we had!

We had seen the ticket for the issue before, but the steps to reproduce
included a consumer group restart, which we didn't do, so we thought our
problem was different.

However, since patching with the fixing commit, everything has been working
fine.

Thanks again! Christiane


Re: Random consumer offset rewinds/resets

Posted by Ismael Juma <is...@juma.me.uk>.
Hi Christiane,

Thanks for the email. That looks like
https://issues.apache.org/jira/browse/KAFKA-5600

Ismael
