Posted to dev@samza.apache.org by Malcolm McFarland <mm...@cavulus.com> on 2019/11/01 01:09:38 UTC

Occasional checkpoint mismatch on Samza task reload

Hey folks,

We're running Samza 0.14.1 on Hadoop 2.7.6. Every once in a while, when
restarting an application, the process comes up with some variation on
this error:

INFO Validating offset <offset> for topic and partition
[<topic>,<partition>]
WARN While refreshing brokers for [<topic>,<partition>]:
org.apache.kafka.common.errors.OffsetOutOfRangeException: The requested
offset is not within the range of offsets maintained by the server..
Retrying

This error is really mystifying us. We're not doing anything severe here,
just using YARN's kill command to stop the application and then submitting
it via a normal mechanism. Are there any best practices or gotchas
surrounding restarting Samza applications on YARN that could help here?

Cheers,
Malcolm McFarland
Cavulus


This correspondence is from HealthPlanCRM, LLC, d/b/a Cavulus. Any
unauthorized or improper disclosure, copying, distribution, or use of the
contents of this message is prohibited. The information contained in this
message is intended only for the personal and confidential use of the
recipient(s) named above. If you have received this message in error,
please notify the sender immediately and delete the original message.

Re: Occasional checkpoint mismatch on Samza task reload

Posted by Malcolm McFarland <mm...@cavulus.com>.

Also, is there a way to deliberately produce this error, e.g. by adding
extra messages to the __checkpoint topics?


Malcolm McFarland
Cavulus


On Thu, Nov 7, 2019 at 3:33 PM Malcolm McFarland <mm...@cavulus.com>
wrote:

> Hi Bharath,
>
> Interesting to hear about the TTL. Is this a Kafka topic-level
> configuration?
>
> This really does seem to occur only when the stream processors are
> restarted, and only occasionally at that. We've been looking at increasing
> the yarn.nodemanager.sleep-delay-before-sigkill.ms and
> yarn.nodemanager.process-kill-wait.ms YARN values. Would this give Samza
> more time to shut down, perhaps allowing unpersisted checkpoints to be
> written out?
>
> Cheers,
> Malcolm McFarland
> Cavulus
>
>

Re: Occasional checkpoint mismatch on Samza task reload

Posted by Malcolm McFarland <mm...@cavulus.com>.

Hi Bharath,

Interesting to hear about the TTL. Is this a Kafka topic-level
configuration?

This really does seem to occur only when the stream processors are
restarted, and only occasionally at that. We've been looking at increasing
the yarn.nodemanager.sleep-delay-before-sigkill.ms and
yarn.nodemanager.process-kill-wait.ms YARN values. Would this give Samza
more time to shut down, perhaps allowing unpersisted checkpoints to be
written out?

Cheers,
Malcolm McFarland
Cavulus

On Thu, Oct 31, 2019 at 6:51 PM Bharath Kumara Subramanian <
codin.martial@gmail.com> wrote:

> Hi Malcolm,
>
> The warning is not specifically related to your restarts. It occurs when
> the offset requested by the consumer no longer exists on the broker.
> This typically happens when you consume from a topic with time-based
> retention (TTL) enabled and the broker has already purged the older offsets.
>
> You can find the starting offset from the container logs for that specific
> topic and partition. Look for
>
> "Registering ssp: {} with offset: {}"
>
> Once you have the starting offset, you can verify the above theory by
> looking at the topic's metadata in ZooKeeper to get the beginning and end
> offsets for that partition.
>
> Let me know if you have any other questions.
>
> Thanks,
> Bharath
>

Re: Occasional checkpoint mismatch on Samza task reload

Posted by Bharath Kumara Subramanian <co...@gmail.com>.

Hi Malcolm,

The warning is not specifically related to your restarts. It occurs when
the offset requested by the consumer no longer exists on the broker.
This typically happens when you consume from a topic with time-based
retention (TTL) enabled and the broker has already purged the older offsets.
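
To make the retention scenario concrete, here is a minimal Python sketch of a
simplified model; it is not Kafka's actual implementation. It assumes the
broker drops whole log segments, oldest first, once the newest message in a
segment is older than the retention period, and that the last (active)
segment is never dropped. The Segment type, the function name, and the sample
numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for one log segment of a partition."""
    base_offset: int       # offset of the first message in the segment
    max_timestamp_ms: int  # timestamp of the newest message in the segment


def earliest_offset_after_retention(
    segments: List[Segment], now_ms: int, retention_ms: int
) -> int:
    """Return the earliest offset still available after expired segments go.

    Simplified model: segments are dropped oldest first while their newest
    message is older than retention_ms; the last segment is always kept.
    """
    if not segments:
        raise ValueError("a partition needs at least one segment")
    ordered = sorted(segments, key=lambda s: s.base_offset)
    for segment in ordered[:-1]:
        if now_ms - segment.max_timestamp_ms <= retention_ms:
            return segment.base_offset
    return ordered[-1].base_offset


# Illustrative numbers only: three segments and a checkpoint taken earlier.
segments = [Segment(0, 1_000), Segment(500, 5_000), Segment(900, 9_000)]
earliest = earliest_offset_after_retention(segments, now_ms=10_000, retention_ms=6_000)
checkpointed = 420
# A checkpointed offset below the new earliest offset can no longer be fetched.
print(earliest, checkpointed < earliest)
```

In this model a job that was stopped with a checkpoint at offset 420 would,
on restart, ask for an offset the broker has already purged.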

You can find the starting offset from the container logs for that specific
topic and partition. Look for

"Registering ssp: {} with offset: {}"

Once you have the starting offset, you can verify the above theory by
looking at the topic's metadata in ZooKeeper to get the beginning and end
offsets for that partition.
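
As a rough illustration of that check, the Python sketch below pulls starting
offsets out of container log lines and tests them against a partition's
offset range. What the "Registering ssp: {} with offset: {}" placeholders
expand to in a real log is an assumption here, so the sample log lines, the
regular expression, and the helper names parse_registered_offsets and
is_out_of_range are hypothetical. The earliest and latest offsets are assumed
to have been looked up separately.

```python
import re
from typing import Dict, Iterable

# Mirrors the "Registering ssp: {} with offset: {}" pattern; the exact text
# the placeholders expand to in a real log is an assumption.
REGISTER_RE = re.compile(
    r"Registering ssp: (?P<ssp>.+?) with offset: (?P<offset>\d+)"
)


def parse_registered_offsets(lines: Iterable[str]) -> Dict[str, int]:
    """Map each ssp string found in the log to its numeric starting offset.

    Lines without a numeric offset are skipped.
    """
    offsets = {}
    for line in lines:
        match = REGISTER_RE.search(line)
        if match:
            offsets[match.group("ssp")] = int(match.group("offset"))
    return offsets


def is_out_of_range(offset: int, earliest: int, latest: int) -> bool:
    """True if the offset lies outside [earliest, latest] for the partition."""
    return offset < earliest or offset > latest


# Hypothetical log excerpt and offset range, for illustration only.
sample_log = [
    "INFO Registering ssp: [my-topic, 0] with offset: 1200",
    "INFO Registering ssp: [my-topic, 1] with offset: 98000",
    "INFO some unrelated line",
]
starting = parse_registered_offsets(sample_log)
for ssp, offset in starting.items():
    print(ssp, offset, is_out_of_range(offset, earliest=5000, latest=99000))
```

A starting offset flagged as out of range here would be consistent with the
purged-offset theory for that partition.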

Let me know if you have any other questions.

Thanks,
Bharath

On Thu, Oct 31, 2019 at 6:09 PM Malcolm McFarland <mm...@cavulus.com>
wrote:

> Hey folks,
>
> We're running Samza 0.14.1 on Hadoop 2.7.6. Every once in a while, when
> restarting an application, the process comes up with some variation on
> this error:
>
> INFO Validating offset <offset> for topic and partition
> [<topic>,<partition>]
> WARN While refreshing brokers for [<topic>,<partition>]:
> org.apache.kafka.common.errors.OffsetOutOfRangeException: The requested
> offset is not within the range of offsets maintained by the server..
> Retrying
>
> This error is really mystifying us. We're not doing anything severe here,
> just using YARN's kill command to stop the application and then submitting
> it via a normal mechanism. Are there any best practices or gotchas
> surrounding restarting Samza applications on YARN that could help here?
>
> Cheers,
> Malcolm McFarland
> Cavulus