Posted to dev@slider.apache.org by Steve Loughran <st...@hortonworks.com> on 2014/05/13 11:48:27 UTC

How failures are handled

This is how Slider reacts to failures

*AM failure:*


   1. YARN will restart the AM if the failure count is below a limit. The
   default is 2; Slider wants more (as will any long-lived service). One
   JIRA under YARN-896 covers weighted moving average failure tracking,
   since MR job failure thresholds should differ from those of long-lived
   YARN services.
   2. YARN gives the AM a list of the previous attempt's containers, and
   sends async events for container failures that occurred while the AM
   was down. Slider synchronizes its state rebuild so that it doesn't
   process the failure events until it has built up its core model (did I
   mention Slider has a model-view-controller design?). A sketch of the
   registration call follows this list.
   3. For the Hoya HBase and Accumulo providers, rebuilding state is
   trivial: any container that is running is taken as-is; there is no
   finer-grained state to recover.
   4. We do abuse the YARN container priority to map to role ID:
   everything at priority 1 is a master, priority 2 a worker, and so on.
   This lets us rebuild the map of role -> container; see the second
   sketch after this list.
   5. Desired cluster state is stored in the cluster JSON files in HDFS
   -and reread on restart.
   6. The role history JSON files are a set of files recording where
   containers were and when they were last used -these are reloaded for
   best-effort placement of containers, even on restarted or
   frozen/thawed clusters.
   7. Once running, whether restarted, thawed or simply flexed, the AM
   compares actual cluster state (the number of containers in each role)
   with desired state, releasing containers back to YARN when there are
   too many and requesting more when there are too few. The role history
   is used to decide where we want the new ones placed.
   8. And when the container assignments come in, we use that container
   priority to recognise what role has just been assigned.
   9. For the Agent provider we are going to have to do more work, as it
   needs to identify agent state from what the agents report in, and
   distinguish running from not running (the only states agents provide).
   I'm thinking I'd like a "booting" state too: agent is started but has
   not been given any instructions yet.
   10. The AM will retain all running containers and discard all
   non-running ones, as their state isn't known. If we have a booting
   state, we will be able to push out the start operation to those
   containers, so discard less -and so not lose recently assigned
   containers.
   11. The agents will need to rebind to a moved AM, so they know where
   to send HTTP requests. This implies registry lookups when the agent's
   HTTP operations fail.
   12. ...and then there's security. We don't have a story for secure comms
   from agent to AM, AM trusting agents reporting in, etc.
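
To make point 2 concrete, here's a minimal sketch of an AM picking up the
containers from a previous attempt on registration, using the Hadoop 2.4+
AMRMClient API; amRMClient, hostname, port, trackingUrl and the model call
are assumed to be in scope and are illustrative:

    import java.util.List;
    import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
    import org.apache.hadoop.yarn.api.records.Container;

    // On an AM restart the registration response carries the containers
    // that survived the previous attempt (Hadoop 2.4+, YARN-1490).
    RegisterApplicationMasterResponse response =
        amRMClient.registerApplicationMaster(hostname, port, trackingUrl);
    for (Container c : response.getContainersFromPreviousAttempts()) {
      // rebuild the core model from surviving containers before any
      // queued failure events are allowed through
      int roleId = c.getPriority().getPriority();
      model.addLiveContainer(roleId, c);   // hypothetical model call
    }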
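
And a sketch of the priority-as-role-ID trick from points 4, 6 and 8,
combined with the role history's last-known host as a placement hint. The
role IDs and resource sizes are made up; the AMRMClient.ContainerRequest
calls are the stock YARN client API:

    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    static final int ROLE_MASTER = 1;   // illustrative role IDs
    static final int ROLE_WORKER = 2;

    // Request a worker, preferring the node the role history last used.
    ContainerRequest requestWorker(String lastUsedHost) {
      Resource capability = Resource.newInstance(1024, 1);
      String[] nodes =
          (lastUsedHost == null) ? null : new String[] {lastUsedHost};
      // relaxLocality=true: best-effort placement, fall back to any node
      return new ContainerRequest(capability, nodes, null,
          Priority.newInstance(ROLE_WORKER), true);
    }

    // When an assignment comes in, the priority identifies the role.
    int roleOf(Container assigned) {
      return assigned.getPriority().getPriority();
    }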


Getting the AM restart working was hard because we were trying to run and
test code on Hadoop 2.2 and against 2.3+, using introspection to set the
restart flag and reread the containers, with tests designed to handle
situations like: Hadoop 2.4 client-side, Hadoop 2.2 server-side. I think
that is why a restart problem in Hadoop 2.4 appears to have snuck
through: *SLIDER-34
<https://issues.apache.org/jira/browse/SLIDER-34>* -and it is why we mustn't
make ongoing Slider work backwards compatible with 2.4 once we adopt
Hadoop 2.5 features. It's not just the introspection grief, it's the
testing of those new features. The sketch below shows the shape of that
introspection.
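
A sketch, assuming compilation against Hadoop 2.2 where the 2.4-era method
is absent; setKeepContainersAcrossApplicationAttempts is the real Hadoop
2.4 API, the wrapper method is made up:

    import java.lang.reflect.Method;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

    void setRestartFlag(ApplicationSubmissionContext context) {
      try {
        // Hadoop 2.4+ only: keep containers across AM attempts (YARN-1490)
        Method m = context.getClass().getMethod(
            "setKeepContainersAcrossApplicationAttempts", boolean.class);
        m.invoke(context, true);
      } catch (ReflectiveOperationException e) {
        // pre-2.4 cluster: an AM restart will not retain containers
      }
    }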

*Container failure*

   1. The AM notes the event and adds it to the node history (failure of
   a role, failure on a specific node). "Start failures" -those within a
   short period of launch- are distinguished from running failures, but
   there's little policy above that.
   2. If the failures are above a threshold, the AM commits suicide. Like
   the YARN RM, we need moving averages there. I think I'll write the
   code for both, with some classes in YARN that Slider can reuse; there's
   a sketch of the idea after this list.
   3. Container failures trigger a review/request/release cycle.
   4. There's some bias towards asking for the previous nodes -I think we
   need to be smarter there. Do we want to ask for the same nodes for better
   locality, or not ask for them in the hope of avoiding whatever caused the
   failure?
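
As a sketch of the moving-average idea from point 2 -the class, the names
and the decay policy are all made up, not what YARN or Slider ship:

    /** Hypothetical decayed failure counter: recent failures count for
     *  more than old ones, so a long-lived service isn't killed over
     *  failures that happened days ago. */
    public class DecayingFailureTracker {
      private final double alpha;   // per-minute decay factor, e.g. 0.95
      private double score;         // weighted failure count
      private long lastUpdateMillis;

      public DecayingFailureTracker(double alpha, long nowMillis) {
        this.alpha = alpha;
        this.lastUpdateMillis = nowMillis;
      }

      public synchronized void recordFailure(long nowMillis) {
        decay(nowMillis);
        score += 1.0;
      }

      public synchronized boolean aboveThreshold(long nowMillis,
                                                 double limit) {
        decay(nowMillis);
        return score > limit;
      }

      // Apply exponential decay for each whole minute since last update.
      private void decay(long nowMillis) {
        long minutes = (nowMillis - lastUpdateMillis) / 60_000;
        if (minutes > 0) {
          score *= Math.pow(alpha, minutes);
          lastUpdateMillis += minutes * 60_000;
        }
      }
    }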

For apps where the config changes on container restart, we're going to have
to do more.

I'm thinking of having flags on a component that state the policy there
(a made-up JSON sketch follows the list):

   1. failure of role instance triggers rebuild of client config
   2. failure of role instance triggers rebuild of entire cluster -if
   agents support restart this could retain existing containers, otherwise
   release and recreate
   3. failure of role instance triggers immediate application termination.
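
Something like this in the cluster JSON -every key and value name below
is invented, purely to show the shape:

    "components": {
      "master": {
        "failure.policy": "rebuild-client-config"
      },
      "worker": {
        "failure.policy": "rebuild-cluster"
      },
      "gateway": {
        "failure.policy": "terminate-application"
      }
    }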

Oh, and there's any event history we want to keep and share. It'll all get
lost on restart, but at the least we should have an event "restarted with N
containers".
