
Posted to dev@flume.apache.org by "Juhani Connolly (Created) (JIRA)" <ji...@apache.org> on 2012/03/15 01:30:40 UTC

[jira] [Created] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Distinguish between temporary and longterm failure to avoid repeated beating on dead components
-----------------------------------------------------------------------------------------------

Key: FLUME-1030
URL: https://issues.apache.org/jira/browse/FLUME-1030
Project: Flume
Issue Type: Improvement
Components: Sinks+Sources
Reporter: Juhani Connolly
Assignee: Juhani Connolly
Fix For: v1.1.0

One may want to refer to FLUME-984 for some history of this.

As it stands, a sink can have several outcomes:
- OK - successfully transferred some data
- TRY_LATER - no data to transfer
- throw EventDeliveryException - Give the sink a short breather to recover, then try again
- throw anything else - get logged and more or less ignored

I don't think the last choice in particular is a good idea, as it encourages throwing sink-specific exceptions. Further, there is no distinction between a temporary loss of connectivity (e.g. HBase timed out because of a compaction or something) and more permanent problems (e.g. cannot write to a file).

One solution to this is to add a second type of exception that delivery mechanisms can throw, ConnectivityException/FatalException or something similar. For the purposes of any failover/load balancing mechanism, this would signal that a component is out of order for a more significant amount of time, and thus that constant polling should be stopped (perhaps retry it every 5 minutes instead, or use an exponentially increasing retry time).
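A minimal sketch of the two-exception idea, in Java. This is an illustration only: LongTermFailureException and RetryPolicy are hypothetical names that do not exist in Flume, EventDeliveryException below is a local stand-in for the Flume class of the same name, and the 1 second short delay is an assumption (the 5 minute delay comes from the paragraph above).

```java
import java.util.concurrent.TimeUnit;

// Local stand-in for Flume's EventDeliveryException (a temporary failure).
class EventDeliveryException extends Exception {
  EventDeliveryException(String message) {
    super(message);
  }
}

// Hypothetical second exception type: the component is out of order for a while.
class LongTermFailureException extends EventDeliveryException {
  LongTermFailureException(String message) {
    super(message);
  }
}

// Hypothetical policy that maps a failure to the delay before the next attempt.
final class RetryPolicy {
  static final long SHORT_DELAY_MS = TimeUnit.SECONDS.toMillis(1); // assumed value
  static final long LONG_DELAY_MS = TimeUnit.MINUTES.toMillis(5);  // "every 5 minutes"

  private RetryPolicy() {
  }

  static long delayMillisFor(Throwable failure) {
    // Check the more specific type first, since it extends EventDeliveryException.
    if (failure instanceof LongTermFailureException) {
      return LONG_DELAY_MS;
    }
    if (failure instanceof EventDeliveryException) {
      return SHORT_DELAY_MS;
    }
    // Assumption of this sketch: unknown exceptions are treated as long-term.
    return LONG_DELAY_MS;
  }
}
```

With a contract like this, a sink would only need to pick which exception to throw, and the runner or processor would own the retry timing.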

If adding another exception is not deemed acceptable, there is always the possibility of expecting SinkProcessors to figure out whether a sink is dead, e.g. by counting sequential failures, though I do not think this is ideal. I would prefer to see a clear contract defined by SinkRunner that well-behaved sinks could adhere to, and thereby get the benefits of graceful handling of temporary and long-term failure.
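For comparison, the "counting sequential failures" alternative could look roughly like the following Java sketch. FailureCounter and its threshold parameter are hypothetical and only illustrate the bookkeeping a SinkProcessor would need per sink; nothing here is taken from Flume.

```java
// Hypothetical helper: tracks consecutive failures of a single sink.
final class FailureCounter {
  private final int deadThreshold;
  private int consecutiveFailures;

  FailureCounter(int deadThreshold) {
    if (deadThreshold < 1) {
      throw new IllegalArgumentException("deadThreshold must be at least 1");
    }
    this.deadThreshold = deadThreshold;
  }

  // Any success resets the streak, so only uninterrupted failures count.
  void recordSuccess() {
    consecutiveFailures = 0;
  }

  void recordFailure() {
    if (consecutiveFailures < Integer.MAX_VALUE) {
      consecutiveFailures++;
    }
  }

  // The sink is presumed dead once the streak reaches the threshold.
  boolean isConsideredDead() {
    return consecutiveFailures >= deadThreshold;
  }
}
```

The drawback mentioned above is visible here: the processor has to guess a threshold, whereas an explicit exception lets the sink state what kind of failure occurred.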

If someone has other suggestions for distinguishing between temporary and longer-term failure, please let me know. As it stands, components that are unresponsive can and do get called constantly, and some components trigger retries and can actually block a SinkRunner thread for a fair while.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (FLUME-1030) Retry logic for failover sink processor to handle downstream exceptions in a predictable manner.

Posted by "Arvind Prabhakar (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated FLUME-1030:
------------------------------------

    Summary: Retry logic for failover sink processor to handle downstream exceptions in a predictable manner.  (was: Distinguish between temporary and longterm failure to avoid repeated beating on dead components)

Updating the title of the issue to match what has been checked in.
                
> Retry logic for failover sink processor to handle downstream exceptions in a predictable manner.
> ------------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0
>
>         Attachments: FLUME-1030.2.patch, FLUME-1030.3.patch, FLUME-1030.4.patch
>
>

[jira] [Updated] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "Arvind Prabhakar (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated FLUME-1030:
------------------------------------

    Fix Version/s:     (was: v1.1.0)
                   v1.2.0
    

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235729#comment-13235729 ] 

jiraposter@reviews.apache.org commented on FLUME-1030:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4445/#review6231
-----------------------------------------------------------

Thanks for the patch Juhani. I was able to run the tests successfully. I have some minor feedback below for your consideration.

flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java
<https://reviews.apache.org/r/4445/#comment13432>

    It would be good to cap this penalty amount at a predefined/configured ceiling value.
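A minimal Java sketch of such a cap. The growth function used by the patch is not shown in this thread, so the doubling below is an assumption, and CappedPenalty, baseMillis and ceilingMillis are hypothetical names.

```java
// Hypothetical helper: penalty that doubles per sequential failure, up to a ceiling.
final class CappedPenalty {
  private CappedPenalty() {
  }

  static long penaltyMillis(int sequentialFailures, long baseMillis, long ceilingMillis) {
    if (sequentialFailures < 1 || baseMillis <= 0 || ceilingMillis < baseMillis) {
      throw new IllegalArgumentException("invalid penalty parameters");
    }
    // Bound the shift so the doubling below cannot overflow a long.
    int shift = Math.min(sequentialFailures - 1, 62);
    if (baseMillis > (Long.MAX_VALUE >> shift)) {
      return ceilingMillis;
    }
    // First failure costs baseMillis, each further failure doubles it.
    return Math.min(baseMillis << shift, ceilingMillis);
  }
}
```

For example, with a base of 1000 ms and a ceiling of 30000 ms, the first failure yields 1000 ms, the second 2000 ms, and the sixth and later failures stay at 30000 ms.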

flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java
<https://reviews.apache.org/r/4445/#comment13438>

    There is one slight issue here though: if the channel is empty, the sink that is being recovered will likely return BACKOFF, which implies that the sink is normal and has recovered.

    A minor nit: it would be nice if process() were invoked on the failed sink from within the same process() that calls the active sink. That way the logic stays in one place.

- Arvind

On 2012-03-22 08:23:00, Juhani Connolly wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4445/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-03-22 08:23:00)
bq.  
bq.  
bq.  Review request for Flume.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  As discussed in the JIRA item, I modified FailoverSinkProcessor to deal with all exceptions.
bq.  Now a sink that fails will be put onto a failed sinks queue, from which a recovery will be attempted after a timeout. With each sequential failure the timeout will increase. I am open to other methods of increasing the timeout (maybe add a ceiling?).
bq.  
bq.  
bq.  This addresses bug FLUME-1030.
bq.      https://issues.apache.org/jira/browse/FLUME-1030
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    flume-ng-core/src/test/java/org/apache/flume/sink/TestFailoverSinkProcessor.java 195c121 
bq.    flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java 7eada57 
bq.  
bq.  Diff: https://reviews.apache.org/r/4445/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Modified the test for the new functionality, new test passes
bq.  
bq.  No other tests should be affected, but my environment was having some weird problems. I'll look into them tomorrow and confirm that the tests pass; I'm leaving this up in the meantime so people can have a browse.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Juhani
bq.  
bq.


[jira] [Updated] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "Juhani Connolly (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juhani Connolly updated FLUME-1030:
-----------------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Updated] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "Juhani Connolly (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juhani Connolly updated FLUME-1030:
-----------------------------------

    Attachment:     (was: FLUME-1030.3.patch)
    

[jira] [Updated] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "Juhani Connolly (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juhani Connolly updated FLUME-1030:
-----------------------------------

    Attachment: FLUME-1030.2.patch

Adding the updated patch from review
                

[jira] [Updated] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "Juhani Connolly (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juhani Connolly updated FLUME-1030:
-----------------------------------

    Attachment: FLUME-1030.3.patch
    

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236774#comment-13236774 ] 

jiraposter@reviews.apache.org commented on FLUME-1030:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4445/#review6289
-----------------------------------------------------------

Ship it!

+1

- Arvind

On 2012-03-23 05:28:45, Juhani Connolly wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4445/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-03-23 05:28:45)


[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235445#comment-13235445 ] 

jiraposter@reviews.apache.org commented on FLUME-1030:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4445/
-----------------------------------------------------------

Review request for Flume.


Summary
-------

As discussed in the JIRA item, I modified FailoverSinkProcessor to handle all exceptions.
A sink that fails is now put onto a failed-sinks queue, from which a recovery is attempted after a timeout. The timeout increases with each sequential failure. I am open to other methods of increasing the timeout (maybe add a ceiling?).
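
The queue described above could be sketched roughly as follows. This is only an illustration, not the code from the patch under review: the class name FailedSinkQueue, its fields, and the linear growth of the timeout are all assumptions made for the example.

```java
import java.util.PriorityQueue;

// Hypothetical sketch; names and the linear penalty policy are assumptions,
// not taken from the FLUME-1030 patch.
final class FailedSinkQueue {

    // One entry per failed sink: when it may be retried, and how many times
    // in a row it has failed so far.
    static final class Entry implements Comparable<Entry> {
        final String sinkName;
        final int sequentialFailures;
        final long retryAtMillis;

        Entry(String sinkName, int sequentialFailures, long retryAtMillis) {
            this.sinkName = sinkName;
            this.sequentialFailures = sequentialFailures;
            this.retryAtMillis = retryAtMillis;
        }

        @Override
        public int compareTo(Entry other) {
            // Earliest retry time sorts first.
            return Long.compare(retryAtMillis, other.retryAtMillis);
        }
    }

    private final PriorityQueue<Entry> queue = new PriorityQueue<>();
    private final long basePenaltyMillis;

    FailedSinkQueue(long basePenaltyMillis) {
        this.basePenaltyMillis = basePenaltyMillis;
    }

    // Records a failure. The wait grows with each sequential failure.
    Entry recordFailure(String sinkName, int previousFailures, long nowMillis) {
        int failures = previousFailures + 1;
        Entry entry = new Entry(sinkName, failures,
                nowMillis + basePenaltyMillis * failures);
        queue.add(entry);
        return entry;
    }

    // Returns the next sink whose timeout has expired, or null if none is due.
    Entry pollDue(long nowMillis) {
        Entry head = queue.peek();
        if (head == null || head.retryAtMillis > nowMillis) {
            return null;
        }
        return queue.poll();
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the class, keeps the sketch easy to exercise in a test.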


This addresses bug FLUME-1030.
    https://issues.apache.org/jira/browse/FLUME-1030


Diffs
-----

  flume-ng-core/src/test/java/org/apache/flume/sink/TestFailoverSinkProcessor.java 195c121 
  flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java 7eada57 

Diff: https://reviews.apache.org/r/4445/diff


Testing
-------

Modified the test for the new functionality, new test passes

No other tests should be affected, but my environment was having some weird problems. I'll look into them tomorrow; I'm leaving this up so people can have a browse, and I will confirm that the tests pass tomorrow.


Thanks,

Juhani


                
> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "Arvind Prabhakar (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232773#comment-13232773 ] 

Arvind Prabhakar commented on FLUME-1030:
-----------------------------------------

bq. One way of dealing with this with multiple sinks is to just put sinks that had exceptions on a priority list with the time to reactivate them, passing events to other sinks until "recovery". Since balancing/failover processors have other alternatives, they can just get another sink to deal with it, using longer timeouts than would be applied by backoff. Would this be a better way to deal with balancing/failover?

Yes - that makes sense to me and will be a deterministic solution regardless of the underlying problem.

bq. This has made me curious of exactly what the intended use of EventDeliveryException is now. The distinction between it and other Exceptions is pretty blurred now that we just elect to log everything

As the name suggests, it indicates a failure to relay the event to its next-hop destination. Ideally it should wrap the causal exception, which gives more of a clue as to what may have gone wrong. Eventually, if we see a pattern of failures emerging out of this, we should modify the responsible component to deal with it rather than adding that logic to the exception-handling code.
                
> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0

[jira] [Updated] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "Juhani Connolly (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juhani Connolly updated FLUME-1030:
-----------------------------------

    Attachment: FLUME-1030.4.patch
    
> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0
>
>         Attachments: FLUME-1030.2.patch, FLUME-1030.3.patch, FLUME-1030.4.patch

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236273#comment-13236273 ] 

jiraposter@reviews.apache.org commented on FLUME-1030:
------------------------------------------------------

bq.  On 2012-03-22 17:07:01, Arvind Prabhakar wrote:
bq.  > Thanks for the patch Juhani. I was able to run the tests successfully. I have some minor feedback below for your consideration.

Thanks for running the tests. Things are back to normal on my end too.

bq.  On 2012-03-22 17:07:01, Arvind Prabhakar wrote:
bq.  > flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java, line 90
bq.  > <https://reviews.apache.org/r/4445/diff/1/?file=94575#file94575line90>
bq.  >
bq.  >     It will be good to cap this penalty amount to a predefined/configured ceiling value.

Added a config variable

bq.  On 2012-03-22 17:07:01, Arvind Prabhakar wrote:
bq.  > flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java, lines 120-121
bq.  > <https://reviews.apache.org/r/4445/diff/1/?file=94575#file94575line120>
bq.  >
bq.  >     There is one slight issue here though - which is if the channel is empty, the sink being attempted to recover will likely return BACKOFF, implying that the sink is normal and has recovered. 
bq.  >     
bq.  >     A minor nit: it will be nice if the process invocation on the failed sink was from within the process() that calls the active Sink. That way the logic stays in one place.

I got rid of the queue subclass and put the code in process... Though I'm not sure if that is the easiest way for the human brain to parse it...

I also changed things so that a sink returning BACKOFF during a recovery attempt goes back onto the failed list without an additional penalty.
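
The recovery rule just described can be sketched as a small decision function. This is an illustration under assumptions, not the patch itself: RecoveryPolicy, Outcome, and the local Status enum are hypothetical names, and a null status is used here to stand for "the sink threw an exception".

```java
// Hypothetical sketch of the recovery decision; all names are assumptions.
final class RecoveryPolicy {

    // Local stand-in for the status a sink reports after a process call.
    enum Status { READY, BACKOFF }

    // What to do with a previously failed sink after one recovery attempt.
    static final class Outcome {
        final boolean recovered;
        final int sequentialFailures;

        Outcome(boolean recovered, int sequentialFailures) {
            this.recovered = recovered;
            this.sequentialFailures = sequentialFailures;
        }
    }

    // status == null models a recovery attempt that threw an exception.
    static Outcome afterAttempt(Status status, int sequentialFailures) {
        if (status == Status.READY) {
            // Data was delivered, so the sink is considered healthy again.
            return new Outcome(true, 0);
        }
        if (status == Status.BACKOFF) {
            // An empty channel proves nothing about the sink: keep it on the
            // failed list, but do not increase its penalty.
            return new Outcome(false, sequentialFailures);
        }
        // The attempt failed again: one more sequential failure.
        return new Outcome(false, sequentialFailures + 1);
    }
}
```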

- Juhani

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4445/#review6231
-----------------------------------------------------------

On 2012-03-22 08:23:00, Juhani Connolly wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4445/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-03-22 08:23:00)
bq.  
bq.  
bq.  Review request for Flume.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  As discussed in the JIRA item, I modified FailoverSink to deal with all exceptions.
bq.  Now a sink that fails will be put onto a failed links queue, from which a recovery will be attempted after a timeout. Each sequential failure the timeout will increase. I am open to other methods of increasing the timeout(maybe add on a ceiling?)
bq.  
bq.  
bq.  This addresses bug FLUME-1030.
bq.      https://issues.apache.org/jira/browse/FLUME-1030
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    flume-ng-core/src/test/java/org/apache/flume/sink/TestFailoverSinkProcessor.java 195c121 
bq.    flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java 7eada57 
bq.  
bq.  Diff: https://reviews.apache.org/r/4445/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Modified the test for the new functionality, new test passes
bq.  
bq.  No other tests should be affected, but my environment was having some weird problems. I'll look into them tomorrow, just leaving this up so people can have a browse and will confirm tests passing tomorrow
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Juhani
bq.  
bq.

> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "Arvind Prabhakar (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229888#comment-13229888 ] 

Arvind Prabhakar commented on FLUME-1030:
-----------------------------------------

Thanks Juhani for filing this issue. Here are my thoughts on the issue:

Any exception including EventDeliveryException can likely be due to a relatively permanent failure. Therefore it is non-trivial for the sink implementation to detect and throw the appropriate exception type as expected by any upstream contract. Failure to throw the correct exception will cause the system to enter an inconsistent state.

I therefore suggest we stick to a simple exception-handling mechanism, where the processor catches all exceptions and backs off from retries for a predictable amount of time. If the problem is permanent, it will eventually be resolved by human intervention, and the backoff mechanism will ensure that it does not tax the system too much.

Having sophisticated exceptions will lead to unpredictable behavior and will still require manual intervention for recovery, only the state will be more complex than the other implementation.
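
A minimal sketch of the catch-all approach suggested above, assuming a fixed backoff interval. CatchAllBackoff and tryDeliver are hypothetical names chosen for the example, and the delivery attempt is modelled as a plain Callable rather than a real Flume sink.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: every exception is treated the same way, and the
// component is simply left alone for a fixed, predictable interval.
final class CatchAllBackoff {

    private final long backoffMillis;
    private long blockedUntilMillis = 0L;

    CatchAllBackoff(long backoffMillis) {
        this.backoffMillis = backoffMillis;
    }

    // Runs the attempt unless the component is still backing off.
    // Returns true only if the attempt ran and completed without throwing.
    boolean tryDeliver(Callable<Void> attempt, long nowMillis) {
        if (nowMillis < blockedUntilMillis) {
            return false; // still in the backoff window, do not poll
        }
        try {
            attempt.call();
            return true;
        } catch (Exception e) {
            // No attempt is made to classify the failure as temporary or
            // permanent; the same backoff applies to everything.
            blockedUntilMillis = nowMillis + backoffMillis;
            return false;
        }
    }
}
```

The point of the design is that the sink does not need to classify its own failures, so a misclassified exception cannot leave the system in an inconsistent state.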
                
> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.1.0

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236275#comment-13236275 ] 

jiraposter@reviews.apache.org commented on FLUME-1030:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4445/
-----------------------------------------------------------

(Updated 2012-03-23 01:48:59.198037)


Review request for Flume.


Changes
-------

Updated with the suggested changes.

All tests pass


Summary
-------

As discussed in the JIRA item, I modified FailoverSinkProcessor to handle all exceptions.
A sink that fails is now put onto a failed-sinks queue, from which a recovery is attempted after a timeout. The timeout increases with each sequential failure. I am open to other methods of increasing the timeout (maybe add a ceiling?).


This addresses bug FLUME-1030.
    https://issues.apache.org/jira/browse/FLUME-1030


Diffs (updated)
-----

  flume-ng-core/src/test/java/org/apache/flume/sink/TestFailoverSinkProcessor.java 195c121 
  flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java 7eada57 

Diff: https://reviews.apache.org/r/4445/diff


Testing
-------

Modified the test for the new functionality, new test passes

No other tests should be affected, but my environment was having some weird problems. I'll look into them tomorrow; I'm leaving this up so people can have a browse, and I will confirm that the tests pass tomorrow.


Thanks,

Juhani


                
> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236348#comment-13236348 ] 

jiraposter@reviews.apache.org commented on FLUME-1030:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4445/
-----------------------------------------------------------

(Updated 2012-03-23 05:28:45.159536)


Review request for Flume.


Changes
-------

Made the suggested changes and added some Javadoc describing the behavior and the new config setting.


Summary
-------

As discussed in the JIRA item, I modified FailoverSinkProcessor to handle all exceptions.
A sink that fails is now put onto a failed-sinks queue, from which a recovery is attempted after a timeout. The timeout increases with each sequential failure. I am open to other methods of increasing the timeout (maybe add a ceiling?).


This addresses bug FLUME-1030.
    https://issues.apache.org/jira/browse/FLUME-1030


Diffs (updated)
-----

  flume-ng-core/src/test/java/org/apache/flume/sink/TestFailoverSinkProcessor.java 195c121 
  flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java 7eada57 

Diff: https://reviews.apache.org/r/4445/diff


Testing
-------

Modified the test for the new functionality, new test passes

No other tests should be affected, but my environment was having some weird problems. I'll look into them tomorrow; I'm leaving this up so people can have a browse, and I will confirm that the tests pass tomorrow.


Thanks,

Juhani


                
> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0
>
>         Attachments: FLUME-1030.2.patch, FLUME-1030.3.patch

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236342#comment-13236342 ] 

jiraposter@reviews.apache.org commented on FLUME-1030:
------------------------------------------------------

bq.  On 2012-03-23 03:48:14, Arvind Prabhakar wrote:
bq.  > flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java, line 193
bq.  > <https://reviews.apache.org/r/4445/diff/2/?file=94789#file94789line193>
bq.  >
bq.  >     Log the exception.

Done. I also changed EventDeliveryException to Exception.

bq.  On 2012-03-23 03:48:14, Arvind Prabhakar wrote:
bq.  > flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java, line 182
bq.  > <https://reviews.apache.org/r/4445/diff/2/?file=94789#file94789line182>
bq.  >
bq.  >     It will be good to log this exception so we have a trace of the failures that are happening.

I was just using it to guard against null and number exceptions at the same time. I separated the two cases and now check for null explicitly. The number exception is logged because it's probably a typo in the config.

bq.  On 2012-03-23 03:48:14, Arvind Prabhakar wrote:
bq.  > flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java, line 138
bq.  > <https://reviews.apache.org/r/4445/diff/2/?file=94789#file94789line138>
bq.  >
bq.  >     This seems like a typo. Perhaps you want to do something like
bq.  >     
bq.  >     maxPenalty = context.getInteger(CONF_KEY_MAX_PENALTY, DEFAULT_MAX_PENALTY);
bq.  >     
bq.  >

Yes, probably not enough sleep. I checked in the debugger that the values are now being set and read correctly.

bq.  On 2012-03-23 03:48:14, Arvind Prabhakar wrote:
bq.  > flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java, lines 94-95
bq.  > <https://reviews.apache.org/r/4445/diff/2/?file=94789#file94789line94>
bq.  >
bq.  >     The max penalty calculated during configure of the sink processor should be applied here to enforce the ceiling.

Done

On 2012-03-23 03:48:14, Juhani Connolly wrote:
bq.  > Rest of the changes look good to me.

Wow, what a mess. Must've been tired or something.

The name of the max limit also had a trailing period, which I removed.

- Juhani

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4445/#review6265
-----------------------------------------------------------


> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0
>
>         Attachments: FLUME-1030.2.patch, FLUME-1030.3.patch
>
>
> One may want to refer to FLUME-984 for some history of this.
> As it stands, a sink can have several outcomes:
> - OK - successfully transferred some data
> - TRY_LATER - no data to transfer
> - throw EventDeliveryException - Give the sink a short breather to recover, then try again
> - throw anything else - get logged and more or less ignored
> I don't think the last choice in particular is a good idea as it encourages throwing sink-specific exceptions. Further, there is no distinction between temporary loss of connectivity (e.g. HBase timed out because of a compaction or something) and more permanent problems (e.g. cannot write to a file).
> One solution to this is to add a second type of exception that delivery mechanisms can throw, ConnectivityException/FatalException or something similar. For the purposes of any failover/load balancing mechanism this would signal that a component is out of order for a more significant amount of time and thus constant polling should be stopped (perhaps retry it every 5 minutes instead, or have an exponentially increasing retry time).
> If adding another exception is not deemed acceptable, there is always the possibility of expecting SinkProcessors to figure out if a sink is dead, e.g. by counting sequential failures, though I do not think this is ideal. I would prefer to see a clear contract defined by SinkRunner that well-behaved sinks could adhere to and so get the benefits of graceful temporary/long-term failure handling.
> If someone has other suggestions for distinguishing between temporary and longer-term failure, please let me know. As it stands, components that are unresponsive can and do get called constantly, and some components trigger retries and can actually block a SinkRunner thread for a fair while.
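
The contract proposed in the quoted description could be sketched roughly as follows. Everything here is hypothetical: `TemporaryFailure` stands in for the existing EventDeliveryException, `LongTermFailure` for the suggested ConnectivityException/FatalException, and all delays except the five-minute figure mentioned in the description are invented for illustration.

```java
final class SinkOutcomeSketch {
  enum Status { OK, TRY_LATER }

  // Stand-in for the existing "short breather, then retry" exception.
  static class TemporaryFailure extends Exception {
    TemporaryFailure(String message) { super(message); }
  }

  // Hypothetical second exception type signalling a long-term outage.
  static class LongTermFailure extends Exception {
    LongTermFailure(String message) { super(message); }
  }

  interface Sink {
    Status process() throws TemporaryFailure, LongTermFailure;
  }

  // Returns how long a caller should wait before polling this sink again.
  static long nextDelayMs(Sink sink) {
    try {
      // OK: poll again immediately. TRY_LATER: no data, brief pause (assumed).
      return sink.process() == Status.OK ? 0L : 100L;
    } catch (TemporaryFailure e) {
      // Temporary trouble: short breather (assumed value).
      return 5_000L;
    } catch (LongTermFailure e) {
      // Long-term outage: stop constant polling, retry after five minutes.
      return 300_000L;
    }
  }
}
```

A failover or load-balancing processor could use the returned delay to decide when a sink becomes eligible again, instead of guessing from a count of sequential failures.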


[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236288#comment-13236288 ] 

jiraposter@reviews.apache.org commented on FLUME-1030:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4445/
-----------------------------------------------------------

(Updated 2012-03-23 02:41:03.472482)


Review request for Flume.


Changes
-------

The penalty limit wasn't being applied; fixed now.


Summary
-------

As discussed in the JIRA item, I modified FailoverSink to deal with all exceptions.
Now a sink that fails will be put onto a failed links queue, from which a recovery will be attempted after a timeout. With each sequential failure the timeout increases. I am open to other methods of increasing the timeout (maybe add a ceiling?).
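
The exact growth rule is not spelled out in this summary. As a rough sketch, assume a penalty that doubles with each sequential failure, starting from a hypothetical base of one second and capped by a ceiling. `PenaltyCalculator`, `BASE_PENALTY_MS`, and `penaltyMillis` are illustrative names, not identifiers from the patch.

```java
final class PenaltyCalculator {
  // Assumed base penalty of one second; not taken from the patch.
  static final long BASE_PENALTY_MS = 1000L;

  // Returns the wait before the next recovery attempt after
  // sequentialFailures consecutive failures (1 = first failure).
  static long penaltyMillis(int sequentialFailures, long maxPenaltyMs) {
    if (sequentialFailures < 1) {
      return 0L;
    }
    // Limit the shift so the doubled value cannot overflow a long.
    int shift = Math.min(sequentialFailures - 1, 30);
    long penalty = BASE_PENALTY_MS << shift;
    // Enforce the ceiling.
    return Math.min(penalty, maxPenaltyMs);
  }
}
```

With a ceiling of 30 seconds this gives waits of 1 s, 2 s, 4 s, 8 s, 16 s, and then 30 s for every further failure.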


This addresses bug FLUME-1030.
    https://issues.apache.org/jira/browse/FLUME-1030


Diffs (updated)
-----

  flume-ng-core/src/test/java/org/apache/flume/sink/TestFailoverSinkProcessor.java 195c121 
  flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java 7eada57 

Diff: https://reviews.apache.org/r/4445/diff


Testing
-------

Modified the test for the new functionality; the new test passes.

No other tests should be affected, but my environment was having some weird problems. I'll look into them tomorrow. I am leaving this up so people can have a browse, and will confirm that the tests pass tomorrow.


Thanks,

Juhani


                
> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0
>
>         Attachments: FLUME-1030.2.patch, FLUME-1030.3.patch

[jira] [Commented] (FLUME-1030) Retry logic for failover sink processor to handle downstream exceptions in a predictable manner.

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236798#comment-13236798 ] 

Hudson commented on FLUME-1030:
-------------------------------

Integrated in flume-trunk #141 (See [https://builds.apache.org/job/flume-trunk/141/])
    FLUME-1030. Retry mechanism for failover sink processor.

(Juhani Connolly via Arvind Prabhakar) (Revision 1304474)

     Result = SUCCESS
arvind : http://svn.apache.org/viewvc/?view=rev&rev=1304474
Files : 
* /incubator/flume/trunk/flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java
* /incubator/flume/trunk/flume-ng-core/src/test/java/org/apache/flume/sink/TestFailoverSinkProcessor.java

                
> Retry logic for failover sink processor to handle downstream exceptions in a predictable manner.
> ------------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0
>
>         Attachments: FLUME-1030.2.patch, FLUME-1030.3.patch, FLUME-1030.4.patch

[jira] [Updated] (FLUME-1030) Retry logic for failover sink processor to handle downstream exceptions in a predictable manner.

Posted by "Arvind Prabhakar (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated FLUME-1030:
------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Patch committed. Thanks Juhani!
                
> Retry logic for failover sink processor to handle downstream exceptions in a predictable manner.
> ------------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0
>
>         Attachments: FLUME-1030.2.patch, FLUME-1030.3.patch, FLUME-1030.4.patch

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236310#comment-13236310 ] 

jiraposter@reviews.apache.org commented on FLUME-1030:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4445/#review6265
-----------------------------------------------------------


Thanks for making the changes, Juhani. Some feedback follows.


flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java
<https://reviews.apache.org/r/4445/#comment13571>

    The max penalty calculated during configure of the sink processor should be applied here to enforce the ceiling.



flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java
<https://reviews.apache.org/r/4445/#comment13570>

    This seems like a typo. Perhaps you want to do something like
    
    maxPenalty = context.getInteger(CONF_KEY_MAX_PENALTY, DEFAULT_MAX_PENALTY);
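
    For readers without the Flume sources at hand: the suggested line reads an integer from the processor's configuration and falls back to a default when the key is absent. The sketch below imitates that behaviour with a plain `Map` rather than Flume's `Context`; the key string and the default value are placeholders, not the ones used in the patch.

```java
import java.util.Map;

final class MaxPenaltyConfig {
  // Placeholder key and default; the real constants live in the patch.
  static final String CONF_KEY_MAX_PENALTY = "maxpenalty";
  static final int DEFAULT_MAX_PENALTY = 30000;

  // Stand-in for a getInteger(key, default) lookup on a string-valued config.
  static int getInteger(Map<String, String> conf, String key, int defaultValue) {
    String raw = conf.get(key);
    return raw == null ? defaultValue : Integer.parseInt(raw.trim());
  }

  // Reads the ceiling the way the suggested line does: key first, then default.
  static int readMaxPenalty(Map<String, String> conf) {
    return getInteger(conf, CONF_KEY_MAX_PENALTY, DEFAULT_MAX_PENALTY);
  }
}
```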
    
    



flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java
<https://reviews.apache.org/r/4445/#comment13568>

    It will be good to log this exception so we have a trace of the failures that are happening.



flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java
<https://reviews.apache.org/r/4445/#comment13569>

    Log the exception.


Rest of the changes look good to me.

- Arvind




                
> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0
>
>         Attachments: FLUME-1030.2.patch, FLUME-1030.3.patch

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Posted by "Juhani Connolly (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229974#comment-13229974 ] 

Juhani Connolly commented on FLUME-1030:
----------------------------------------

The method you describe is fine for a processor dealing with a single sink, but seems a bit vague for multiple sinks that are being balanced or used for failover.

One way of dealing with this for multiple sinks is to put sinks that threw exceptions on a priority list, along with the time at which to reactivate them, and pass events to other sinks until "recovery". Since balancing/failover processors have other alternatives, they can just get another sink to deal with it, using longer timeouts than would be applied by backoff. Would this be a better way to deal with balancing/failover?

This has made me curious about what exactly the intended use of EventDeliveryException is now. The distinction between it and other exceptions is pretty blurred now that we just elect to log everything.
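
A minimal sketch of such a priority list, assuming failed sinks are ordered by the time at which they may be retried. `FailedSinkQueue`, `markFailed`, and `pollRecoverable` are hypothetical names; timestamps are passed in explicitly to keep the example deterministic.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class FailedSinkQueue<S> {
  // A failed sink together with the earliest time it may be retried.
  private static final class Entry<S> {
    final S sink;
    final long retryAtMs;

    Entry(S sink, long retryAtMs) {
      this.sink = sink;
      this.retryAtMs = retryAtMs;
    }
  }

  // Ordered so that the sink due for retry soonest is at the head.
  private final PriorityQueue<Entry<S>> failed =
      new PriorityQueue<>(Comparator.comparingLong((Entry<S> e) -> e.retryAtMs));

  // Park a sink that threw, to be reactivated penaltyMs after nowMs.
  void markFailed(S sink, long nowMs, long penaltyMs) {
    failed.add(new Entry<>(sink, nowMs + penaltyMs));
  }

  // Returns a sink whose penalty has expired, or null if none is due yet.
  S pollRecoverable(long nowMs) {
    Entry<S> head = failed.peek();
    if (head == null || head.retryAtMs > nowMs) {
      return null;
    }
    return failed.poll().sink;
  }

  int size() {
    return failed.size();
  }
}
```

While a sink sits in the queue, the processor can route events to the remaining live sinks and only call `pollRecoverable` when it wants to attempt a recovery.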
                
> Distinguish between temporary and longterm failure to avoid repeated beating on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.1.0