Posted to dev@slider.apache.org by Steve Loughran <st...@hortonworks.com> on 2014/05/13 11:48:27 UTC
How failures are handled
This is how Slider reacts to failures
*AM failure:*
1. YARN will restart the AM if the failure count < limit. Default is 2; slider
likes more (as will any long-lived service). One JIRA under YARN-896 covers
weighted moving average failure tracking, as MR job failure thresholds
should be different from YARN service failure thresholds.
2. YARN gives the AM a list of previous containers, and sends async events
for container failures that happened while the AM was down. Slider synchronizes
its state rebuild so that it doesn't process the failure events until it has
built up its core model. Did I mention slider has a model-view-controller design?
3. For the Hoya HBase and Accumulo providers, rebuilding state is
trivial: any container that is running, is running.
4. We do abuse the YARN container priority to map to role ID. Everything
in priority 1 is a master, priority 2 a worker, etc. This lets us rebuild
the map of role -> container.
5. Desired cluster state is stored in the cluster JSON files in HDFS
-and reread.
6. The rolehistory JSON files are a set of files recording where
containers were placed and when they were last used -this is reloaded for
best-effort placement of containers, even on restarted or frozen/thawed clusters.
7. Once running, whether restarted, thawed or simply flexed, the AM
compares actual cluster state (# of containers in each role) with desired
state, releasing containers back to YARN when there are too many and
requesting more when there are too few. The role history is used to list
where we want them.
8. And when the container assignments come in, we use that container
priority to recognise what role has just been assigned.
9. For the Agent provider we are going to have to do more work, as it
needs to identify agent state from the agents reporting in, and distinguish
running from not running (the sole state agents provide). I'm thinking I'd
like a "booting" state too: agent is started but not given any instructions
yet.
10. The AM will retain all running containers and discard all non-running
ones, as their state isn't known. If we have a booting state, we will
be able to push out the start operation to those containers, so discard
fewer -and so not lose recently assigned containers.
11. The agents will need to rebind to a moved AM, so they know where to
send HTTP requests. This implies registry lookups when the agent HTTP
operations fail.
12. ...and then there's security. We don't have a story for secure comms
from agent to AM, AM trusting agents reporting in, etc.
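Points 4, 7 and 8 above can be sketched together. This is purely illustrative
-not Slider's actual classes; the `ContainerInfo` stand-in and all names are
assumptions. It shows rebuilding the role -> container map from the
priority-equals-role convention, then computing the per-role request/release
deltas between desired and actual state:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not Slider's code) of two steps described above:
// rebuilding the role -> container map after an AM restart, using the
// convention that container priority == role ID, and computing the
// request/release deltas between desired and actual cluster state.
public final class AmRestartSketch {

    // Minimal stand-in for a YARN container report.
    public static final class ContainerInfo {
        public final String id;
        public final int priority;  // by convention, the role ID

        public ContainerInfo(String id, int priority) {
            this.id = id;
            this.priority = priority;
        }
    }

    // Group surviving containers by the role encoded in their priority
    // (priority 1 = master, priority 2 = worker, ...).
    public static Map<Integer, List<String>> rebuildRoleMap(List<ContainerInfo> running) {
        Map<Integer, List<String>> byRole = new HashMap<>();
        for (ContainerInfo c : running) {
            byRole.computeIfAbsent(c.priority, k -> new ArrayList<>()).add(c.id);
        }
        return byRole;
    }

    // Compare desired vs actual container counts per role: a positive delta
    // means "request that many more", a negative one means "release".
    public static Map<String, Integer> computeDeltas(Map<String, Integer> desired,
                                                     Map<String, Integer> actual) {
        Map<String, Integer> deltas = new HashMap<>();
        for (Map.Entry<String, Integer> e : desired.entrySet()) {
            int delta = e.getValue() - actual.getOrDefault(e.getKey(), 0);
            if (delta != 0) {
                deltas.put(e.getKey(), delta);
            }
        }
        return deltas;
    }
}
```

The review step then simply walks the delta map, issuing container requests
for positive entries and releases for negative ones.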
Getting the AM restart working was hard because we were trying to run and
test code on hadoop 2.2 and against 2.3+, using introspection to set the
restart flag and reread the containers, with tests designed to handle
situations like: hadoop 2.4 client-side, hadoop 2.2 server-side. I think
that is why a restart problem in hadoop 2.4 appears to have snuck
through: *SLIDER-34
<https://issues.apache.org/jira/browse/SLIDER-34> *and it is why we mustn't
make ongoing slider work backwards compatible with 2.4 once we adopt
Hadoop 2.5 features. It's not just the introspection grief, it's the
testing of those new features.
*Container failure*
1. The AM notes the event and adds it to the node history (failure of role,
failure on a specific node). Failures within a short period of time from
launch are flagged as "start failures" and distinguished from running
failures, but there's little policy above that.
2. If the failures are above a threshold, the AM commits suicide. Like
the YARN RM, we need moving averages there. I think I'll write the code for
both, putting some classes in YARN that slider can reuse.
3. Container failures trigger a review/request/release cycle.
4. There's some bias towards asking for the previous nodes -I think we
need to be smarter there. Do we want to ask for the same nodes for better
locality, or not ask for them in the hope of avoiding whatever caused the
failure?
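A first step towards the moving-average tracking mentioned in point 2 could
be a time-windowed failure count, so old failures age out instead of
accumulating forever until the threshold trips. This is an illustrative
design sketch -not the YARN or Slider implementation; all names here are
assumptions:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of windowed failure tracking: only failures within a recent
// time window count towards the suicide threshold, so a service that
// had trouble hours ago isn't killed by one more failure today.
public final class FailureWindow {
    private final long windowMillis;
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    public FailureWindow(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    public void recordFailure(long nowMillis) {
        failureTimes.addLast(nowMillis);
    }

    // Count failures seen within the window ending at nowMillis,
    // dropping any that have aged out.
    public int recentFailures(long nowMillis) {
        while (!failureTimes.isEmpty()
                && failureTimes.peekFirst() < nowMillis - windowMillis) {
            failureTimes.removeFirst();
        }
        return failureTimes.size();
    }

    public boolean overThreshold(long nowMillis, int threshold) {
        return recentFailures(nowMillis) > threshold;
    }
}
```

A weighted moving average would refine this further by discounting, rather
than simply dropping, older failures.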
For apps where the config changes on container restart, we're going to have
to do more.
I'm thinking of having flags for a component that state the policy there:
1. failure of role instance triggers rebuild of client config
2. failure of role instance triggers rebuild of entire cluster -if
agents support restart this could retain existing containers, otherwise
release and recreate
3. failure of role instance triggers immediate application termination.
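The three policies above could be expressed as a per-component flag. This
enum is purely illustrative -no such type exists in Slider; the names are
assumptions:

```java
// Hypothetical per-component failure-policy flags matching the three
// options listed above; names are illustrative only.
public enum ContainerFailurePolicy {
    REBUILD_CLIENT_CONFIG,   // instance failure triggers rebuild of client config
    REBUILD_CLUSTER,         // rebuild the entire cluster, reusing containers if agents can restart
    TERMINATE_APPLICATION    // fail fast: instance failure stops the whole application
}
```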
Oh, and there's the question of what event history we want to keep and
share. It'll all get lost on restart, but at the least we should have an
event "restarted with N containers".