You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@metron.apache.org by Nick Allen <ni...@nickallen.org> on 2018/10/19 18:20:12 UTC

[DISCUSS] Recurrent Large Indexing Error Messages

I want to discuss solutions for the problem that I have described in
METRON-1832; Recurrent Large Indexing Error Messages. I feel this is a very
easy trap to fall into when using the default settings that currently come
with Metron.


## Problem


https://issues.apache.org/jira/browse/METRON-1832


If any index destination like HDFS, Elasticsearch, or Solr goes down while
the Indexing topology is running, an error message is created and sent back
to the user-defined error topic.  By default, this is defined to also be
the 'indexing' topic.

The Indexing topology then consumes this error message and attempts to
write it again. If the index destination is still down, another error
occurs and another error message is created that encapsulates the original
error message.  That message is then sent to the 'indexing' topic, which is
later consumed, yet again, by the Indexing topology.

These error messages will continue to be recycled and grow larger and
larger as each new error message encapsulates all previous error messages
in the "raw_message" field.

Once the index destination recovers, one giant error message will finally
be written that contains massively duplicated, useless information which
can further negatively impact performance of the index destination.

Also, the escape character '\' continually compounds one another leading to
long strings of '\' characters in the error message.


## Background

There was some discussion on how to handle this on the original PR #453
https://github.com/apache/metron/pull/453.

## Solutions

(1) The first, easiest option is to just do nothing.  There was already a
discussion around this and this is the solution that we landed on in #453.

Pros: Really easy; do nothing.

Cons: Intermittent problems with ES/Solr can easily create very large error
messages that can significantly impact both search and ingest performance.


(2) Change the default indexing error topic to 'indexing_errors' to avoid
recycling error messages. Nothing will consume from the 'indexing_errors'
topic, thus preventing a cycle.

Pros: Simple, easy change that prevents the cycle.

Cons: Recoverable indexing errors are not visible to users as they will
never be indexed in ES/Solr.

(2) Add logic to limit the number times a message can be 'recycled' through
the Indexing topology.  This effectively sets a maximum number of retry
attempts.  If a message fails N times, then write the message to a separate
unrecoverable, error topic.

Pros: Recoverable errors are visible to users in ES/Solr.

Cons: More complex.  Users still need to check the unrecoverable, error
topic for potential problems.

(4) Do not further encapsulate error messages in the 'raw_message' field.
If an error message fails, don't encapsulate it in another error message.
Just push it to the error topic as-is.  Could add a field that indicates
how many times the message has failed.

Pros: Prevents giant error messages from being created from recoverable
errors.

Cons: Extended outages would still cause the Indexing topology to
repeatedly recycle these error messages, which would ultimately exhaust
resources in Storm.



What other ways can we solve this?

Re: [DISCUSS] Recurrent Large Indexing Error Messages

Posted by Otto Fowler <ot...@gmail.com>.

Why not have a second indexing topology configured just for errors?
We can load the same code with two different configurations in two
topologies.



On December 5, 2018 at 03:55:59, Ali Nazemian (alinazemian@gmail.com) wrote:

I think if you look at the indexing error management, it is pretty much
similar to parser and enrichment error use cases. It is even more common to
expect something ended up in error topics. I think a wider independent job
can be used to take care of error management. It can be decided to add a
separate topology later on to manage error logs and create
alert/notifications separately.
It can be even integrated with log feeder and log search.
The scenario of sending solution operational logs to the same solution is a
bit weird and not enterprise friendly. Normally platform operation team
would be a separate team with different objectives and probably they have
got a separate monitoring/notification solution in placed already.

I don't think it is the end of the world if this part is left to be managed
by users. So I prefer option 2 as a short term. Long term solution can be
discussed separately.

Cheers,
Ali

On Sat, 20 Oct. 2018, 05:20 Nick Allen <nick@nickallen.org wrote:

> I want to discuss solutions for the problem that I have described in
> METRON-1832; Recurrent Large Indexing Error Messages. I feel this is a
very
> easy trap to fall into when using the default settings that currently
come
> with Metron.
>
>
> ## Problem
>
>
> https://issues.apache.org/jira/browse/METRON-1832
>
>
> If any index destination like HDFS, Elasticsearch, or Solr goes down
while
> the Indexing topology is running, an error message is created and sent
back
> to the user-defined error topic. By default, this is defined to also be
> the 'indexing' topic.
>
> The Indexing topology then consumes this error message and attempts to
> write it again. If the index destination is still down, another error
> occurs and another error message is created that encapsulates the
original
> error message. That message is then sent to the 'indexing' topic, which
is
> later consumed, yet again, by the Indexing topology.
>
> These error messages will continue to be recycled and grow larger and
> larger as each new error message encapsulates all previous error messages
> in the "raw_message" field.
>
> Once the index destination recovers, one giant error message will finally
> be written that contains massively duplicated, useless information which
> can further negatively impact performance of the index destination.
>
> Also, the escape character '\' continually compounds one another leading
to
> long strings of '\' characters in the error message.
>
>
> ## Background
>
> There was some discussion on how to handle this on the original PR #453
> https://github.com/apache/metron/pull/453.
>
> ## Solutions
>
> (1) The first, easiest option is to just do nothing. There was already a
> discussion around this and this is the solution that we landed on in
#453.
>
> Pros: Really easy; do nothing.
>
> Cons: Intermittent problems with ES/Solr can easily create very large
error
> messages that can significantly impact both search and ingest
performance.
>
>
> (2) Change the default indexing error topic to 'indexing_errors' to avoid
> recycling error messages. Nothing will consume from the 'indexing_errors'
> topic, thus preventing a cycle.
>
> Pros: Simple, easy change that prevents the cycle.
>
> Cons: Recoverable indexing errors are not visible to users as they will
> never be indexed in ES/Solr.
>
> (2) Add logic to limit the number times a message can be 'recycled'
through
> the Indexing topology. This effectively sets a maximum number of retry
> attempts. If a message fails N times, then write the message to a
separate
> unrecoverable, error topic.
>
> Pros: Recoverable errors are visible to users in ES/Solr.
>
> Cons: More complex. Users still need to check the unrecoverable, error
> topic for potential problems.
>
> (4) Do not further encapsulate error messages in the 'raw_message' field.
> If an error message fails, don't encapsulate it in another error message.
> Just push it to the error topic as-is. Could add a field that indicates
> how many times the message has failed.
>
> Pros: Prevents giant error messages from being created from recoverable
> errors.
>
> Cons: Extended outages would still cause the Indexing topology to
> repeatedly recycle these error messages, which would ultimately exhaust
> resources in Storm.
>
>
>
> What other ways can we solve this?
>

Re: [DISCUSS] Recurrent Large Indexing Error Messages

Posted by Ali Nazemian <al...@gmail.com>.

I think if you look at the indexing error management, it is pretty much
similar to parser and enrichment error use cases. It is even more common to
expect something ended up in error topics. I think a wider independent job
can be used to take care of error management. It can be decided to add a
separate topology later on to manage error logs and create
alert/notifications separately.
It can be even integrated with log feeder and log search.
The scenario of sending solution operational logs to the same solution is a
bit weird and not enterprise friendly. Normally platform operation team
would be a separate team with different objectives and probably they have
got a separate monitoring/notification solution in placed already.

I don't think it is the end of the world if this part is left to be managed
by users. So I prefer option 2 as a short term. Long term solution can be
discussed separately.

Cheers,
Ali

On Sat, 20 Oct. 2018, 05:20 Nick Allen <nick@nickallen.org wrote:

> I want to discuss solutions for the problem that I have described in
> METRON-1832; Recurrent Large Indexing Error Messages. I feel this is a very
> easy trap to fall into when using the default settings that currently come
> with Metron.
>
>
> ## Problem
>
>
> https://issues.apache.org/jira/browse/METRON-1832
>
>
> If any index destination like HDFS, Elasticsearch, or Solr goes down while
> the Indexing topology is running, an error message is created and sent back
> to the user-defined error topic.  By default, this is defined to also be
> the 'indexing' topic.
>
> The Indexing topology then consumes this error message and attempts to
> write it again. If the index destination is still down, another error
> occurs and another error message is created that encapsulates the original
> error message.  That message is then sent to the 'indexing' topic, which is
> later consumed, yet again, by the Indexing topology.
>
> These error messages will continue to be recycled and grow larger and
> larger as each new error message encapsulates all previous error messages
> in the "raw_message" field.
>
> Once the index destination recovers, one giant error message will finally
> be written that contains massively duplicated, useless information which
> can further negatively impact performance of the index destination.
>
> Also, the escape character '\' continually compounds one another leading to
> long strings of '\' characters in the error message.
>
>
> ## Background
>
> There was some discussion on how to handle this on the original PR #453
> https://github.com/apache/metron/pull/453.
>
> ## Solutions
>
> (1) The first, easiest option is to just do nothing.  There was already a
> discussion around this and this is the solution that we landed on in #453.
>
> Pros: Really easy; do nothing.
>
> Cons: Intermittent problems with ES/Solr can easily create very large error
> messages that can significantly impact both search and ingest performance.
>
>
> (2) Change the default indexing error topic to 'indexing_errors' to avoid
> recycling error messages. Nothing will consume from the 'indexing_errors'
> topic, thus preventing a cycle.
>
> Pros: Simple, easy change that prevents the cycle.
>
> Cons: Recoverable indexing errors are not visible to users as they will
> never be indexed in ES/Solr.
>
> (2) Add logic to limit the number times a message can be 'recycled' through
> the Indexing topology.  This effectively sets a maximum number of retry
> attempts.  If a message fails N times, then write the message to a separate
> unrecoverable, error topic.
>
> Pros: Recoverable errors are visible to users in ES/Solr.
>
> Cons: More complex.  Users still need to check the unrecoverable, error
> topic for potential problems.
>
> (4) Do not further encapsulate error messages in the 'raw_message' field.
> If an error message fails, don't encapsulate it in another error message.
> Just push it to the error topic as-is.  Could add a field that indicates
> how many times the message has failed.
>
> Pros: Prevents giant error messages from being created from recoverable
> errors.
>
> Cons: Extended outages would still cause the Indexing topology to
> repeatedly recycle these error messages, which would ultimately exhaust
> resources in Storm.
>
>
>
> What other ways can we solve this?
>

Re: [DISCUSS] Recurrent Large Indexing Error Messages

Posted by Michael Miklavcic <mi...@gmail.com>.

Thanks for putting this together Nick.

For point 2, I believe this was the relevant comment from that thread -
https://github.com/apache/metron/pull/453#issuecomment-283349461

I'm partial to a combo of option 3 (the 2nd #2 listed) or 4. I would want
to:

   - Not bog down indexing for good messages bc of errors bottlenecking the
   system (in agreement with @Casey Stella <ce...@gmail.com>, this
   should really not be an issue in a properly tuned system.)
   - Not flood the system with ever-increasing in size messages due to
   wrapped exceptions.
   - Still be able to see errors regardless of what's happening with
   ES/Solr - maybe also add error log output after k retries.
   - Not DOS ES/Solr if they are already bogged down or down altogether. If
   our index destination is down, Ambari is our current indicator of this to
   some degree. We shouldn't logjam it further. Do we get error-specific
   exceptions for this sort of thing that we can use to be more intelligent
   about how to deal with a failure? I'd expect a 404 or service unavailable
   message of some sort.
   - Override the default TTL or backoff settings. Other tools, e.g. Oozie,
   have provided options like exponential backoff. This is not nearly as
   trivial as other options because we would need to create error message
   handling logic that checks metadata on the error message, and further
   decide whether or not to attempt to index it.

Best,
Mike


On Fri, Oct 19, 2018 at 12:20 PM Nick Allen <ni...@nickallen.org> wrote:

> I want to discuss solutions for the problem that I have described in
> METRON-1832; Recurrent Large Indexing Error Messages. I feel this is a very
> easy trap to fall into when using the default settings that currently come
> with Metron.
>
>
> ## Problem
>
>
> https://issues.apache.org/jira/browse/METRON-1832
>
>
> If any index destination like HDFS, Elasticsearch, or Solr goes down while
> the Indexing topology is running, an error message is created and sent back
> to the user-defined error topic.  By default, this is defined to also be
> the 'indexing' topic.
>
> The Indexing topology then consumes this error message and attempts to
> write it again. If the index destination is still down, another error
> occurs and another error message is created that encapsulates the original
> error message.  That message is then sent to the 'indexing' topic, which is
> later consumed, yet again, by the Indexing topology.
>
> These error messages will continue to be recycled and grow larger and
> larger as each new error message encapsulates all previous error messages
> in the "raw_message" field.
>
> Once the index destination recovers, one giant error message will finally
> be written that contains massively duplicated, useless information which
> can further negatively impact performance of the index destination.
>
> Also, the escape character '\' continually compounds one another leading to
> long strings of '\' characters in the error message.
>
>
> ## Background
>
> There was some discussion on how to handle this on the original PR #453
> https://github.com/apache/metron/pull/453.
>
> ## Solutions
>
> (1) The first, easiest option is to just do nothing.  There was already a
> discussion around this and this is the solution that we landed on in #453.
>
> Pros: Really easy; do nothing.
>
> Cons: Intermittent problems with ES/Solr can easily create very large error
> messages that can significantly impact both search and ingest performance.
>
>
> (2) Change the default indexing error topic to 'indexing_errors' to avoid
> recycling error messages. Nothing will consume from the 'indexing_errors'
> topic, thus preventing a cycle.
>
> Pros: Simple, easy change that prevents the cycle.
>
> Cons: Recoverable indexing errors are not visible to users as they will
> never be indexed in ES/Solr.
>
> (2) Add logic to limit the number times a message can be 'recycled' through
> the Indexing topology.  This effectively sets a maximum number of retry
> attempts.  If a message fails N times, then write the message to a separate
> unrecoverable, error topic.
>
> Pros: Recoverable errors are visible to users in ES/Solr.
>
> Cons: More complex.  Users still need to check the unrecoverable, error
> topic for potential problems.
>
> (4) Do not further encapsulate error messages in the 'raw_message' field.
> If an error message fails, don't encapsulate it in another error message.
> Just push it to the error topic as-is.  Could add a field that indicates
> how many times the message has failed.
>
> Pros: Prevents giant error messages from being created from recoverable
> errors.
>
> Cons: Extended outages would still cause the Indexing topology to
> repeatedly recycle these error messages, which would ultimately exhaust
> resources in Storm.
>
>
>
> What other ways can we solve this?
>