Posted to dev@flume.apache.org by "Arvind Prabhakar (Commented) (JIRA)" <ji...@apache.org> on 2012/03/15 05:53:43 UTC

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229888#comment-13229888 ] 

Arvind Prabhakar commented on FLUME-1030:
-----------------------------------------

Thanks, Juhani, for filing this issue. Here are my thoughts:

Any exception, including EventDeliveryException, may be caused by a relatively permanent failure. It is therefore non-trivial for a sink implementation to detect which kind of failure occurred and to throw the exception type that an upstream contract expects. Failing to throw the correct exception would cause the system to enter an inconsistent state.

I therefore suggest we stick to a simple exception handling mechanism, where the processor catches all exceptions and backs off from retries for a predictable amount of time. If the problem is permanent, it will eventually be resolved by human intervention, and in the meantime the backoff mechanism will ensure that it does not tax the system too much.

Having sophisticated exceptions will lead to unpredictable behavior and will still require manual intervention for recovery; the only difference is that the state will be more complex than with the simpler implementation.
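
To make the suggestion concrete, here is a minimal sketch of a catch-all backoff, assuming a fixed backoff window. The names BackoffTracker and SinkCall are hypothetical and are not part of Flume; the current time is passed in explicitly only so that the behavior is easy to check.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a sink's process call; not a Flume interface. */
interface SinkCall {
    void process() throws Exception;
}

/** Hypothetical helper: treats every exception alike and backs off for a fixed time. */
class BackoffTracker {
    private final long backoffMillis;
    // Sink name -> earliest time (in ms) at which the sink may be retried.
    private final Map<String, Long> retryAfter = new HashMap<>();

    BackoffTracker(long backoffMillis) {
        this.backoffMillis = backoffMillis;
    }

    /** True if the sink is not inside its backoff window at the given time. */
    boolean isEligible(String sink, long nowMillis) {
        Long until = retryAfter.get(sink);
        return until == null || nowMillis >= until;
    }

    /**
     * Runs one attempt unless the sink is backing off.
     * Returns true only if the call ran and completed without an exception.
     */
    boolean attempt(String sink, SinkCall call, long nowMillis) {
        if (!isEligible(sink, nowMillis)) {
            return false; // still backing off; do not touch the sink
        }
        try {
            call.process();
            retryAfter.remove(sink); // success clears any earlier failure
            return true;
        } catch (Exception e) {
            // No distinction between exception types: any failure
            // starts the same predictable backoff window.
            retryAfter.put(sink, nowMillis + backoffMillis);
            return false;
        }
    }
}
```

With a 1000 ms window, a sink that fails at time 0 is skipped until time 1000 and then tried again, whatever exception it threw. The caller keeps no state other than the retry time per sink.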
                
> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.1.0
>
>
> One may want to refer to FLUME-984 for some history of this.
> As it stands, a sink can have several outcomes:
> - OK - successfully transferred some data
> - TRY_LATER - no data to transfer
> - throw EventDeliveryException - Give the sink a short breather to recover, then try again
> - throw anything else - get logged and more or less ignored
> I don't think the last choice in particular is a good idea, as it encourages throwing Sink-specific exceptions. Further, there is no distinction between temporary loss of connectivity (e.g. HBase timed out because of a compaction or something) and more permanent problems (e.g. cannot write to a file).
> One solution to this is to add a second type of exception that delivery mechanisms can throw, ConnectivityException/FatalException or something similar. For the purposes of any failover/load balancing mechanism this would signal that a component is out of order for a more significant amount of time and thus constant polling should be stopped (perhaps retry it every 5 minutes instead, or have an exponentially increasing retry time).
> If adding another exception is not deemed acceptable, there is always the possibility of expecting SinkProcessors to figure out if a sink is dead... E.g. counting sequential failures, though I do not think this is ideal. I would prefer to see a clear contract defined by SinkRunner that well behaved sinks could adhere to and get the benefits of graceful temporary/longterm failure from.
> If someone has other suggestions for distinguishing between temporary and longer term failure please let me know. As it stands, components that are unresponsive can and do get called constantly, and some components trigger retries and can actually block a SinkRunner thread for a fair while.
