Posted to dev@samza.apache.org by Will Schneider <ws...@tripadvisor.com> on 2018/08/20 13:16:12 UTC

Samza/Yarn cluster having issue with OffsetOutOfRangeException

Hello all,

We've recently been experiencing some Kafka/Samza issues we're not quite sure how to tackle. We've exhausted all our internal expertise and were hoping that someone on the mailing lists might have seen this before and knows what might cause it:

KafkaSystemConsumer [WARN] While refreshing brokers for [Store_LogParser_RedactedMetadata_RedactedEnvironment,35]: org.apache.kafka.common.errors.OffsetOutOfRangeException: The requested offset is not within the range of offsets maintained by the server.. Retrying.

^ (Above repeats indefinitely until we intervene)
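
For reference, one way to confirm the mismatch (a sketch using the stock Kafka tooling shipped with 1.0.x; the broker address below is a placeholder) is to compare the offset Samza is requesting against the range the broker actually retains for the partition:

    # Earliest retained offset for partition 35 of the Store topic (placeholder broker address)
    bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
        --broker-list broker1:9092 \
        --topic Store_LogParser_RedactedMetadata_RedactedEnvironment \
        --partitions 35 --time -2

    # Latest offset for the same partition
    bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
        --broker-list broker1:9092 \
        --topic Store_LogParser_RedactedMetadata_RedactedEnvironment \
        --partitions 35 --time -1

If the checkpointed offset falls outside that [earliest, latest] range, the broker returns exactly the OffsetOutOfRangeException shown above.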

A bit about our use case:

  *   Versions:
     *   Kafka 1.0.1 (CDH Distribution 3.1.0-1.3.1.0.p0.35)
     *   Samza 0.14.1
     *   Hadoop: 2.6.0-cdh5.12.1
  *   We've seen some manifestation of this error in 4 different environments with minor differences in configuration, but all running the same versions of the software
     *   Distributed Samza on Yarn (~10 node yarn environment, 3-7 node kafka environment)
     *   Non-distributed virtual test environment (Samza on yarn, but with no network in between)
  *   We have not found a reliable way to reproduce this error
  *   The issue typically presents at process startup; it usually doesn't make a difference whether the application was down for 5 minutes or 5 days before that startup
  *   The LogParser application experiencing this issue reads and parses a set of log files, supplementing them with metadata that is stored in the Store topic in question and cached locally in RocksDB
  *   The LogParser application has 40-60 running tasks and partitions depending on configuration
  *   There is no discernible pattern for where the error presents itself:
     *   It is not consistent WRT which yarn node hosts tasks with the issue
     *   It is not consistent WRT which kafka node hosts the partitions relevant to the issue
     *   The set of affected nodes does not persist across consecutive appearances of the error
     *   This leads us to believe the bug is probably endemic to the whole cluster and not the result of a random hardware issue
  *   Offsets for the LogParser application are maintained in a samza topic called something like:
     *   __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1
  *   Upon startup, checkpoints are refreshed from that topic, and we'll see something in the log similar to:
     *   kafka.KafkaCheckpointManager [INFO] Read 6000 from topic: __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1. Current offset: 5999
     *   On more than one occasion, we have attempted to repair the job by killing individual yarn containers and letting samza retry them.
        *   This will occasionally work. More frequently, it gets the partition stuck in a loop trying to read from the __samza_checkpoint topic forever; we suspect the retry loop above is re-writing checkpoint offsets one or more times, causing the topic to fill up considerably.
  *   We are aware of only two workarounds:
     *   1- Fully clearing out the data disks on the Kafka servers and rebuilding the topics always seems to work, at least for a time.
     *   2- We can use a setting like streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.reset.offset=true, which necessarily ignores the checkpoint topic and doesn't validate any offset on the Store (a config sketch follows this list).
        *   This works, but requires us to do a lengthy metadata refresh immediately after startup, which is less than ideal.
  *   We have also seen this on rare occasion on other, smaller Samza tiers
     *   In those cases, the common thread appears to be that the tier was left down for a period of time longer than the Kafka retention timeout, and got stuck in the loop upon restart. Attempts at reproducing it this way have been unsuccessful
     *   Worth adding that in this case, setting the samza.reset.offset parameter in the configuration did not seem to have the intended effect
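
To make workaround 2 concrete, here is a minimal sketch of the configuration we mean (the stream-scoped property names follow the style already used above; pairing the reset with samza.offset.default=oldest is our assumption about how the store gets repopulated, not a verified fix):

    # Workaround 2: ignore the checkpointed offset for the Store topic on startup
    streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.reset.offset=true
    # Assumed companion setting: start from the oldest retained offset so the RocksDB
    # cache can be rebuilt, at the cost of the lengthy metadata refresh noted above
    streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.offset.default=oldest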

On another possibly related note, one of our clusters periodically throws an error like this, but usually recovers without intervention:

KafkaSystemAdmin [WARN] Exception while trying to get offset for SystemStreamPartition [kafka, Store_LogParser_RedactedMetadata_RedactedEnvironment, 32]: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Retrying.


  *   We've seen this error message crop up when we've had network issues in our datacenter, but we're not aware of any such problems at the times when the bigger issue occurs, so we're not sure whether it's related (a quick leadership check is sketched below).
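
When the NotLeaderForPartitionException appears, one quick check (again a sketch; the ZooKeeper address below is a placeholder) is to look at partition leadership directly:

    bin/kafka-topics.sh --describe \
        --zookeeper zk1:2181 \
        --topic Store_LogParser_RedactedMetadata_RedactedEnvironment

If the Leader column shows -1, or the Isr list has shrunk for the affected partitions, that points at a broker-side leadership problem rather than anything in Samza.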

Has anyone seen these errors before? Is there a known workaround or fix for it?

Thanks for your help!

Attached is a copy of the Samza configuration for the job in question, in case it contains additional information I may have missed.

-Will Schneider


Re: Samza/Yarn cluster having issue with OffsetOutOfRangeException

Posted by Yi Pan <ni...@gmail.com>.
Hi, Will,

Can you check the description in SAMZA-1822 to see whether this is exactly
the problem you encountered? We just submitted the fix today.

Thanks!


Re: Samza/Yarn cluster having issue with OffsetOutOfRangeException

Posted by Jagadish Venkatraman <ja...@gmail.com>.
Hi Will,

Is the topic in question your change-log topic or the checkpoint-topic or
one of your inputs? (My understanding from reading this is its your
checkpoint)

Can you please attach some more surrounding logs?

Thanks,
Jagadish



-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University