Posted to users@kafka.apache.org by Divij Vaidya <di...@gmail.com> on 2022/05/02 12:09:53 UTC

Re: Unexpected loss of Offsets

Luke / James

I agree that this bug is critical enough to release a new patch. Plus,
there are 10 more bug fixes
<https://issues.apache.org/jira/browse/KAFKA-13805?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.8.2>
with major/blocker priority waiting to be released in 2.8.2.

I will be happy to assist / perform the release process for 2.8.2 or assist
in any other way I can. Luke, please let me know how we want to proceed
ahead on this.

Regards,
Divij Vaidya



On Fri, Apr 29, 2022 at 5:09 AM James Olsen <ja...@inaseq.com> wrote:

> Luke,
>
> I would argue that https://issues.apache.org/jira/browse/KAFKA-13636 is a
> critical defect as it can have a very serious impact.
>
> We run on AWS MSK which supports these versions:
> https://docs.aws.amazon.com/msk/latest/developerguide/supported-kafka-versions.html.
> We are currently on 2.7.2.
>
> I note that MSK does not support any 3.x (maybe they're not ready for the
> Zookeeper removal).  So I suspect we're going to need a 2.x if MSK is going
> to adopt it any time soon.  I'd be happier with a 2.7.3 incorporating
> KAFKA-13636 in order to minimise the risk of introducing other issues, or
> the 2.8.2 if that's not possible.
>
> What can we do to make this happen ASAP?
>
> Regards, James.
>
> On 29/04/2022, at 14:50, Luke Chen <showuon@gmail.com> wrote:
>
> Hi James,
>
> So far, v2.8.2 is not planned yet. Usually a minor release only gets one
> patch release, i.e. v2.8.0 is followed by just v2.8.1.
> But there are of course exceptions where some releases have had 2 or 3
> patch releases.
>
> For KAFKA-13658, you can check <
> https://issues.apache.org/jira/browse/KAFKA-13658>, which is included in
> v3.0.1, v3.1.1, and v3.2.0.
> So far, the v3.0.1 is released, and v3.1.1 and v3.2.0 will be coming soon.
>
> Thank you.
> Luke
>
> On Fri, Apr 29, 2022 at 8:53 AM James Olsen <james@inaseq.com> wrote:
> Luke,
>
> Do you know if 2.8.2 will be released anytime soon?  It appears to be
> waiting on https://issues.apache.org/jira/browse/KAFKA-13805 for which
> fixes are available.
>
> Regards, James.
>
> On 11/04/2022, at 14:22, Luke Chen <showuon@gmail.com> wrote:
>
> Hi James,
>
> This looks like this known issue KAFKA-13636
> <https://issues.apache.org/jira/browse/KAFKA-13636>, which should be fixed
> in the newer version.
>
> Thank you.
> Luke
>
> On Mon, Apr 11, 2022 at 9:18 AM James Olsen <james@inaseq.com> wrote:
>
> I recently observed the following series of events for a particular
> partition (MyTopic-6):
>
> 2022-03-18 03:18:28,562 INFO
> [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
> 'executor-thread-2' [Consumer clientId=consumer-MyTopicService-group-3,
> groupId=MyTopicService-group] Setting offset for partition MyTopic-6 to the
> committed offset FetchPosition{offset=438, offsetEpoch=Optional.empty,
> currentLeader=LeaderAndEpoch{leader=Optional[b-2.redacted.kafka.us-east-1.amazonaws.com:9094
> (id: 2 rack: use1-az4)], epoch=64}}
>
> -- RESTART (bring up new consumer node)
>
> 2022-04-01 15:17:47,943 INFO
> [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
> 'executor-thread-6' [Consumer clientId=consumer-MyTopicService-group-7,
> groupId=MyTopicService-group] Setting offset for partition MyTopic-6 to the
> committed offset FetchPosition{offset=449, offsetEpoch=Optional.empty,
> currentLeader=LeaderAndEpoch{leader=Optional[b-2.redacted.kafka.us-east-1.amazonaws.com:9094
> (id: 2 rack: use1-az4)], epoch=64}}
>
> -- REBALANCE (drop old consumer node)
>
> 2022-04-01 15:18:24,414 INFO
> [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
> 'executor-thread-2' [Consumer clientId=consumer-MyTopicService-group-3,
> groupId=MyTopicService-group] Found no committed offset for partition
> MyTopic-6
> 2022-04-01 15:18:24,474 INFO
> [org.apache.kafka.clients.consumer.internals.SubscriptionState]
> 'executor-thread-2' [Consumer clientId=consumer-MyTopicService-group-3,
> groupId=MyTopicService-group] Resetting offset for partition MyTopic-6 to
> position FetchPosition{offset=411, offsetEpoch=Optional.empty,
> currentLeader=LeaderAndEpoch{leader=Optional[b-2.redacted.kafka.us-east-1.amazonaws.com:9094
> (id: 2 rack: use1-az4)], epoch=64}}.
>
> Seems odd that no offsets were found at 2022-04-01 15:18:24,414 when they
> were clearly present 36 seconds earlier at 2022-04-01 15:17:47,943.
>
> This resulted in message replay from offset 411-449.  This was in a test
> system only and we have duplicate detection in place but I'd still like to
> avoid similar occurrences in production if we can.
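
The duplicate detection mentioned above isn't shown in the thread; a minimal sketch of one common approach is to track the highest offset already processed per partition and skip anything at or below it (all names here are illustrative, not from the original system):

```java
import java.util.HashMap;
import java.util.Map;

// Skip records that were already processed, e.g. the replayed
// offsets 411-448 after the unexpected reset described above.
public class OffsetDeduplicator {
    private final Map<String, Long> lastProcessed = new HashMap<>();

    /** Returns true if this (partition, offset) has not been seen yet. */
    public boolean shouldProcess(String topicPartition, long offset) {
        Long last = lastProcessed.get(topicPartition);
        if (last != null && offset <= last) {
            return false; // duplicate from a replay
        }
        lastProcessed.put(topicPartition, offset);
        return true;
    }
}
```

Note this only dedupes within one process lifetime; surviving restarts would require persisting the map alongside the processing results.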
>
> There has clearly been a low volume of traffic but there have been active
> consumers all the time.  We have log.retention.ms=1814400000
> (3 weeks) which I believe explains why it resumed from 411 as messages
> prior to that will have been deleted.
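
A quick sanity check on the retention figure quoted above, confirming that log.retention.ms=1814400000 is exactly 3 weeks:

```java
// log.retention.ms is in milliseconds; one day is 86,400,000 ms.
public class RetentionMath {
    public static long retentionDays(long retentionMs) {
        return retentionMs / (24L * 60 * 60 * 1000);
    }

    public static void main(String[] args) {
        System.out.println(retentionDays(1_814_400_000L) + " days"); // prints "21 days"
    }
}
```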
>
> There may not have been any new traffic in the last 7 days (we have the
> default offset retention) so I'm wondering if there is a chance the offsets
> were deleted during the rebalance when I presume there's a brief moment
> when there is no active consumer.  My understanding is that they shouldn't
> be deleted until there has been no consumer for 7 days (
>
> https://kafka.apache.org/27/documentation.html#brokerconfigs_offsets.retention.minutes
> - not using static assignment).  Is it possible the logic is actually
> checking for no consumer now and no offsets for 7 days instead?
>
> Server and Client are 2.7.2.  Sorry I don't have any more detailed
> server-side logs.
>
> Regards, James.

Re: Unexpected loss of Offsets

Posted by Luke Chen <sh...@gmail.com>.
Hi James and Divij,

The community will only consider releasing a 2nd patch for an old version
(i.e. v2.8.2) if there's a very critical issue.

I think you could ask Amazon MSK to support Kafka v3.0 and newer versions to
get the fix (and also many new features).
Or maybe Amazon MSK can build from the Kafka 2.8 branch themselves?

Note: Kafka v3.0 doesn't deprecate/remove ZooKeeper.

Thank you.
Luke


