Posted to commits@slider.apache.org by st...@apache.org on 2015/05/11 21:43:05 UTC

svn commit: r1678807 - /incubator/slider/site/trunk/content/design/rolehistory.md

Author: stevel
Date: Mon May 11 19:43:05 2015
New Revision: 1678807

URL: http://svn.apache.org/r1678807
Log:
SLIDER-856 slider needs to treat pre-emption events as not-a-real-failure

Modified:
    incubator/slider/site/trunk/content/design/rolehistory.md

Modified: incubator/slider/site/trunk/content/design/rolehistory.md
URL: http://svn.apache.org/viewvc/incubator/slider/site/trunk/content/design/rolehistory.md?rev=1678807&r1=1678806&r2=1678807&view=diff
==============================================================================
--- incubator/slider/site/trunk/content/design/rolehistory.md (original)
+++ incubator/slider/site/trunk/content/design/rolehistory.md Mon May 11 19:43:05 2015
@@ -33,11 +33,25 @@ A major rework of placement has taken pl
 that have reached their escalation timeout and yet have not been satisfied.
 1. Such requests are cancelled and "relaxed" requests re-issued.
 1. Labels are always respected; even relaxed requests use any labels specified in `resources.json`
-1. If a node is considered unreliable (as per-the slider 0.70 changes), it is not used in the initial
+1. If a node is considered unreliable (as per the slider-0.70-incubating changes), it is not used in the initial
 request. YARN may still allocate relaxed instances on such nodes. That is: there is no explicit
 blacklisting, merely deliberate exclusion of unreliable nodes from explicitly placed requests.
+1. Node and component failure counts are reset on a regular schedule. The "recently failed"
+counters are the ones used to decide if a node is unreliable or a component has failed too 
+many times. Long-lived applications can therefore tolerate a low rate of component failures.
+1. The notion of "failed" differentiates between application failures, node failures and
+pre-emption.
+    * YARN container pre-emption is not considered a failure.
+    * Node failures are: anything reported as such by YARN, and any unexpected application exit
+    (as these may be caused by node-related issues, such as a port conflict with other applications).
+    * Application failures are resource limits being exceeded (RAM, VRAM), and unexpected application
+    exit.
+    * Only "application failures" are added to the "failed recently" count, and so only they are
+      used to decide whether a component has failed too many times for the application
+      to be considered working.
+  
 
-Role History Reloading Enhancements
+##### Role History Reloading Enhancements
 
 How role history is persisted has also been improved: [SLIDER-600](https://issues.apache.org/jira/browse/SLIDER-600)
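
The failure accounting added in this revision could be sketched roughly as follows. This is an illustrative Python model, not Slider's Java implementation: the names `Outcome`, `RecentFailures`, `record` and `reset`, and both thresholds, are hypothetical. It also assumes one reading of the text above, namely that an unexpected application exit counts against both the node and the component, while pre-emption counts against neither.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcome categories for a finished container."""
    PREEMPTED = "preempted"              # YARN pre-emption: not a failure
    NODE_FAILURE = "node_failure"        # reported as a node failure by YARN
    LIMITS_EXCEEDED = "limits_exceeded"  # RAM/VRAM resource limit exceeded
    UNEXPECTED_EXIT = "unexpected_exit"  # unexpected application exit


# Assumed mapping of outcomes onto the two "recently failed" counters.
NODE_OUTCOMES = {Outcome.NODE_FAILURE, Outcome.UNEXPECTED_EXIT}
APPLICATION_OUTCOMES = {Outcome.LIMITS_EXCEEDED, Outcome.UNEXPECTED_EXIT}


class RecentFailures:
    """Tracks "recently failed" counts per node and per component."""

    def __init__(self, node_threshold, component_threshold):
        # Both thresholds are illustrative assumptions.
        self.node_threshold = node_threshold
        self.component_threshold = component_threshold
        self.node_failures = Counter()
        self.component_failures = Counter()

    def record(self, node, component, outcome):
        """Update the counters for one finished container."""
        if outcome in NODE_OUTCOMES:
            self.node_failures[node] += 1
        if outcome in APPLICATION_OUTCOMES:
            self.component_failures[component] += 1
        # Outcome.PREEMPTED falls through: nothing is counted.

    def reset(self):
        """Called on a regular schedule to forget old failures."""
        self.node_failures.clear()
        self.component_failures.clear()

    def node_is_unreliable(self, node):
        return self.node_failures[node] >= self.node_threshold

    def component_has_failed(self, component):
        return self.component_failures[component] >= self.component_threshold
```

Because `reset()` clears both counters, a long-lived application under this model only fails when the threshold is reached within a single reset interval, which matches the stated goal of tolerating a low rate of component failures.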