
Posted to users@kafka.apache.org by Tom van den Berge <to...@gmail.com> on 2017/11/29 22:15:58 UTC

Lost messages and messed up offsets

I'm using Kafka 0.10.0.

I'm reading messages from a single topic (20 partitions), using 4 consumers
(one group), using a standard java consumer with default configuration,
except for the key and value deserializer, and a group id; no other
settings.

We've been experiencing a serious problem a few times now, after a large
burst of messages (75000) has been posted to the topic. The consumer lag
(as reported by Kafka's kafka-consumer-groups.sh) immediately becomes
huge, which is expected. The consumers start processing the messages, which
is expected to take them at least 30 minutes. In the meantime, more
messages are posted to the topic, but at a "normal" rate, which the
consumers normally handle easily. The problem is that the reported consumer
lag is not decreasing at all. After some 30 minutes, it has even increased
slightly. This would mean that the consumers are not able to process the
backlog at all, which is extremely unlikely.
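
To make the lag figures concrete: as far as I know, kafka-consumer-groups.sh derives the lag per partition as the log end offset minus the group's current (committed) offset. Below is a minimal sketch of that arithmetic in Java. The class name, method names and the offsets in main are hypothetical, chosen only for illustration; they are not taken from the tool.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LagSketch {
    // Lag of one partition: how far the committed offset trails the log end offset.
    public static long lag(long logEndOffset, long currentOffset) {
        return logEndOffset - currentOffset;
    }

    // Total lag of a group: sum over partitions.
    // Each map value is a two-element array {currentOffset, logEndOffset}.
    public static long totalLag(Map<Integer, long[]> offsetsByPartition) {
        long total = 0;
        for (long[] offsets : offsetsByPartition.values()) {
            total += lag(offsets[1], offsets[0]);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical numbers for two partitions of one topic.
        Map<Integer, long[]> offsets = new LinkedHashMap<>();
        offsets.put(0, new long[] {100L, 4000L});
        offsets.put(1, new long[] {250L, 3500L});
        System.out.println("total lag = " + totalLag(offsets));
    }
}
```

Under this definition, a lag that does not shrink while messages are being processed means the committed offset is not advancing, which is a different thing from the consumers not processing anything.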

After a restart of all consumer applications, something really surprising
happens: the lag immediately drops to nearly 0! It is technically
impossible that the consumers really processed all messages in a matter of
seconds. Manual verification showed that many messages were not processed
at all; they seem to have disappeared somehow. So it seems that restarting
the consumers somehow messed up the offset (I think).

On top of that, I noticed that the reported lag shows seemingly impossible
figures. During the time that the lag was not decreasing, before the
restart of the consumers, the "current offset" that was reported for some
partitions decreased. To my knowledge, that is impossible.

Does anyone have an idea on how this could have happened?

Re: Lost messages and messed up offsets

Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.

Can you also check if you have partition leaders flapping or changing rapidly?
Also, look at the following settings on your client configs:

max.partition.fetch.bytes
fetch.max.bytes
receive.buffer.bytes
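
A minimal sketch of how these three settings could be made explicit in the consumer configuration, using plain java.util.Properties. The values, the class name and the choice of StringDeserializer are assumptions for illustration only, not recommendations. Also note that, as far as I know, fetch.max.bytes was only introduced in 0.10.1.0, so a 0.10.0 client may not recognize it.

```java
import java.util.Properties;

public class FetchSizeConfigSketch {
    // Builds consumer properties with the fetch-related settings spelled out.
    // All numeric values below are illustrative, not tuned recommendations.
    public static Properties consumerProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Assumed deserializers; use whatever matches your message format.
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Upper bound on data returned per partition in one fetch.
        props.setProperty("max.partition.fetch.bytes", "2097152");
        // Upper bound on data returned for a whole fetch request (newer clients only).
        props.setProperty("fetch.max.bytes", "52428800");
        // Size of the TCP receive buffer used when reading data.
        props.setProperty("receive.buffer.bytes", "131072");
        return props;
    }

    public static void main(String[] args) {
        Properties props = consumerProps("localhost:9092", "example-group");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting Properties object can be passed to the KafkaConsumer constructor in place of the defaults.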

We had a similar situation in our environment when the brokers were flooded with data.
The symptoms were apparent huge spikes in offset IDs, much larger than the amount of data we were sending.
We traced that to the brokers not being able to keep up with the incoming producer + consumer + replication traffic because of limited NIC bandwidth.
(It is a bit of a lengthy story as to why the offset IDs appeared to be high/spiky because of the flapping.)

And then the consumer would have issues. The problem there was that the producer had a very large buffer and batch size, so the data was coming in large batches.
However, the client was not configured to receive data in such large batches, so it would give errors and would not be able to go past a certain offset.



Re: Lost messages and messed up offsets

Posted by Tom van den Berge <to...@gmail.com>.

This problem was solved by upgrading from 0.10 to 0.11 (broker + client).

Thanks for your feedback.



Re: Lost messages and messed up offsets

Posted by Tom van den Berge <to...@gmail.com>.

The consumers are using default settings, which means that
enable.auto.commit=true and auto.commit.interval.ms=5000. I'm not
committing manually; just consuming messages.
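
For reference, a simplified model of what these two settings imply, written as a plain Java sketch. It assumes (to my understanding of the Java consumer) that an auto commit is only triggered from within poll() once the interval has elapsed, and that it covers the records handed out by earlier polls. The class and its methods are hypothetical and are not Kafka code.

```java
public class AutoCommitModel {
    private final long intervalMs;
    private long nextCommitAtMs;
    private long position;        // next offset the consumer will read
    private long committedOffset; // last offset stored for the group

    public AutoCommitModel(long intervalMs, long startOffset, long nowMs) {
        this.intervalMs = intervalMs;
        this.position = startOffset;
        this.committedOffset = startOffset;
        this.nextCommitAtMs = nowMs + intervalMs;
    }

    // Models one poll() call at time nowMs that returns `records` messages.
    public void poll(long nowMs, int records) {
        if (nowMs >= nextCommitAtMs) {
            // Commit covers only what earlier polls already handed out.
            committedOffset = position;
            nextCommitAtMs = nowMs + intervalMs;
        }
        position += records;
    }

    public long committedOffset() { return committedOffset; }

    public long position() { return position; }

    public static void main(String[] args) {
        AutoCommitModel m = new AutoCommitModel(5000L, 100L, 0L);
        m.poll(0L, 50);
        m.poll(1000L, 50);
        m.poll(6000L, 10);
        System.out.println("committed=" + m.committedOffset() + " position=" + m.position());
    }
}
```

In this model, the committed offset (and therefore the reported lag) only moves when poll() is called, so long gaps between polls show up as a lag that appears stuck even though processing continues.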


Re: Lost messages and messed up offsets

Posted by Frank Lyaruu <fl...@gmail.com>.

Do you commit the received messages? Either by doing it manually or setting
enable.auto.commit and auto.commit.interval.ms?


Re: [EXTERNAL] - Lost messages and messed up offsets

Posted by Tom van den Berge <to...@gmail.com>.

If I understand correctly, the "auto.offset.reset" setting is only used if
there is no offset available in Kafka (i.e. no offset has ever been
committed?), or if the offset does not exist anymore. In my situation, I
don't understand how either situation would be possible. The consumers
continuously commit (auto commit is enabled), and the messages in the topic
are retained for 7 days.

I agree that *if* "auto.offset.reset" would be used, it would explain the
skipped messages, but I can't see why it would be used. Do you have an idea?

Thanks,
Tom


RE: [EXTERNAL] - Lost messages and messed up offsets

Posted by Isabelle Giguère <ig...@opentext.com>.

Hi;

With the default configuration, your consumers are set with auto.offset.reset=latest.
So on restart, the consumers start reading at the latest offset (the messages of 0 minutes ago), not at the offset of 30 minutes ago (or whatever the lag was).

https://kafka.apache.org/documentation/#configuration
auto.offset.reset
What to do when there is no initial offset in Kafka or if the current offset does not exist anymore on the server (e.g. because that data has been deleted):
    earliest: automatically reset the offset to the earliest offset
    latest: automatically reset the offset to the latest offset
    none: throw exception to the consumer if no previous offset is found for the consumer's group
    anything else: throw exception to the consumer.
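
The rule quoted above can be sketched as a small decision function in Java. This is an illustration of the documented behaviour under stated assumptions, not Kafka's actual implementation: the class, the enum and the parameter names are hypothetical, and IllegalStateException stands in for whatever exception the real consumer throws.

```java
import java.util.OptionalLong;

public class OffsetResetSketch {
    public enum ResetPolicy { EARLIEST, LATEST, NONE }

    // Decides where a consumer starts reading a partition.
    // committed: the group's committed offset, if any.
    // logStart / logEnd: first and next-to-be-written offsets still on the broker.
    public static long startingOffset(OptionalLong committed, long logStart,
                                      long logEnd, ResetPolicy policy) {
        if (committed.isPresent()) {
            long c = committed.getAsLong();
            if (c >= logStart && c <= logEnd) {
                return c; // a valid committed offset always wins; the policy is not consulted
            }
        }
        // No committed offset, or it points at data that no longer exists.
        switch (policy) {
            case EARLIEST:
                return logStart;
            case LATEST:
                return logEnd;
            default:
                throw new IllegalStateException(
                        "no valid committed offset and auto.offset.reset=none");
        }
    }

    public static void main(String[] args) {
        System.out.println(startingOffset(OptionalLong.of(500L), 0L, 1000L, ResetPolicy.LATEST));
        System.out.println(startingOffset(OptionalLong.empty(), 0L, 1000L, ResetPolicy.LATEST));
    }
}
```

Note that the policy only matters in the second branch: with a valid committed offset, a restarted consumer resumes from that offset regardless of auto.offset.reset.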

For the "current offset" that seems to decrease, I have no idea.

Isabelle Giguère
Computational Linguist and Java Developer
Linguiste informaticienne et développeur Java

_________
Open Text
The Content Experts
