Posted to dev@slider.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2014/08/05 12:45:12 UTC

[jira] [Resolved] (SLIDER-203) Implement scalable failure threshold based on percentage of instances failing over a time period

     [ https://issues.apache.org/jira/browse/SLIDER-203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved SLIDER-203.
-----------------------------------

    Resolution: Won't Fix

> Implement scalable failure threshold based on percentage of instances failing over a time period
> ------------------------------------------------------------------------------------------------
>
>                 Key: SLIDER-203
>                 URL: https://issues.apache.org/jira/browse/SLIDER-203
>             Project: Slider
>          Issue Type: Sub-task
>          Components: appmaster, test
>    Affects Versions: Slider 0.40
>            Reporter: Steve Loughran
>
> SLIDER-77 proposed weighted moving averages for failures. This approach has some flaws:
> # it's hard to understand and configure
> # different cluster sizes need different default values
> # if you flex a cluster, the threshold may become inappropriate
> I propose something more tangible and related to how to track physical nodes: percentage failing over a time period.
> For example, we could define a functional hbase cluster as:
> 200% of masters failing per day (for two masters == 4 failures)
> 80% of region servers per day (for 20 region servers, that's 16 failures)
> Every day the counter could be reset.
> Flexing complicates the equation: it may be simplest just to reset the counters, at least when scaling down. Otherwise, if a 20-worker cluster had a failure count of 5 and a 40% threshold, all would be well. But scale it down to 10 nodes and the failure count is immediately over the limit.
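
The proposal above (a percentage-of-instances threshold per time window, with the counter reset at window rollover and on flex) can be sketched as follows. This is only an illustrative sketch; all class and method names are hypothetical and are not Slider's actual API.

```java
// Sketch of the proposed percentage-based failure threshold.
// Hypothetical names; not part of the Slider codebase.
public class FailureThreshold {
    private final double maxFailedPercent; // e.g. 80 means 80% per window
    private int desiredInstances;          // current target size of the role
    private int failuresThisWindow;        // failures seen since last reset

    public FailureThreshold(double maxFailedPercent, int desiredInstances) {
        this.maxFailedPercent = maxFailedPercent;
        this.desiredInstances = desiredInstances;
    }

    /** Record one container/instance failure. */
    public void recordFailure() {
        failuresThisWindow++;
    }

    /** Called at the end of each window (e.g. daily), per the proposal. */
    public void resetWindow() {
        failuresThisWindow = 0;
    }

    /**
     * On flex, adopt the new desired size and reset the counter, so a
     * scale-down does not instantly trip the threshold (the 20-to-10
     * node problem described above).
     */
    public void flexTo(int newDesired) {
        desiredInstances = newDesired;
        resetWindow();
    }

    /** True once failures exceed the allowed percentage of instances. */
    public boolean exceeded() {
        double limit = desiredInstances * maxFailedPercent / 100.0;
        return failuresThisWindow > limit;
    }
}
```

With the region-server example from the issue (20 instances, 80% per day), the limit works out to 16 failures; a 17th failure in the same window would trip the threshold, while a flex resets the count.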



--
This message was sent by Atlassian JIRA
(v6.2#6252)