Posted to dev@slider.apache.org by "kyungwan nam (JIRA)" <ji...@apache.org> on 2017/03/24 08:55:41 UTC

[jira] [Created] (SLIDER-1221) the way to cope against SliderAM split brain

kyungwan nam created SLIDER-1221:
------------------------------------

             Summary: the way to cope against SliderAM split brain
                 Key: SLIDER-1221
                 URL: https://issues.apache.org/jira/browse/SLIDER-1221
             Project: Slider
          Issue Type: Bug
            Reporter: kyungwan nam


I have run into a problem that could be called a “Slider-AM split brain”.
Normally, when the AM fails, the RM launches a new one.
But even if the AM has not failed, this can happen when there is something like a networking issue between the AM and the RM,
because the RM launches a new AM if it receives no heartbeat from the AM for some time (yarn.am.liveness-monitor.expiry-interval-ms).
In that case, the previous AM and the new AM can coexist, and the containers keep their connection to the previous AM.
This can cause a lot of problems.
The new AM does not know about the containers launched by the previous AM.
As a result, those containers could all be killed at the same time after a while.
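
The sequence above can be illustrated with a toy model. This is only a sketch and not YARN code: the class ResourceManagerModel, the function simulate_partition, and the 600000 ms expiry value are assumptions made up for the illustration. It shows how an AM that is still alive but cut off from the RM ends up running next to its replacement.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManagerModel:
    """Hypothetical stand-in for the RM liveness check (not the YARN API)."""
    expiry_interval_ms: int
    current_attempt: int = 1
    last_heartbeat_ms: dict = field(default_factory=dict)  # attempt id -> time

    def on_heartbeat(self, attempt_id, now_ms):
        # Only heartbeats from the attempt the RM considers current count.
        if attempt_id == self.current_attempt:
            self.last_heartbeat_ms[attempt_id] = now_ms

    def check_liveness(self, now_ms):
        # If the current attempt has been silent for too long, start a new one.
        last = self.last_heartbeat_ms.get(self.current_attempt, 0)
        if now_ms - last > self.expiry_interval_ms:
            self.current_attempt += 1
            self.last_heartbeat_ms[self.current_attempt] = now_ms
        return self.current_attempt


def simulate_partition():
    # Assumed expiry value, used here only for the illustration.
    rm = ResourceManagerModel(expiry_interval_ms=600_000)
    running_ams = {1}                 # AM processes that are actually alive
    rm.on_heartbeat(1, now_ms=0)      # last heartbeat before the network issue

    # Network issue: AM 1 keeps running, but its heartbeats never reach the RM.
    new_attempt = rm.check_liveness(now_ms=600_001)
    running_ams.add(new_attempt)      # the RM launches a second AM

    return running_ams, rm.current_attempt
```

In this model both attempts are in running_ams after the partition, while the RM only recognizes the newer one; the agents, which are not modelled here, would still be talking to the older one.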

The slider-agent should register with the new SliderAM as soon as possible.
I think this could be improved as follows.

- SliderAM records the time at which each heartbeat response arrives from the RM.
- SliderAM sends a “stale SliderAM” message to the slider-agent if there has been no AM-RM heartbeat for some time (“stale.slider.am.interval”).
- When the slider-agent receives “stale SliderAM”, it should try to discover the new SliderAM and, if one is found, register with it.
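
A minimal sketch of the three steps above, to make the proposal concrete. It is not existing Slider code: AmHeartbeatTracker, handle_am_reply, the "ok" reply, and the discover_am callback are hypothetical names, and delivering the “stale SliderAM” message as the reply to an agent heartbeat is an assumption about how the message would travel.

```python
import time

STALE_SLIDER_AM = "stale SliderAM"  # message text taken from the proposal


class AmHeartbeatTracker:
    """AM side (hypothetical): remembers when the RM last answered a heartbeat."""

    def __init__(self, stale_interval_ms, clock=None):
        # stale_interval_ms plays the role of "stale.slider.am.interval".
        self.stale_interval_ms = stale_interval_ms
        self._clock = clock or (lambda: int(time.monotonic() * 1000))
        self._last_rm_response_ms = self._clock()

    def on_rm_heartbeat_response(self):
        # Step 1: record the arrival time of each RM heartbeat response.
        self._last_rm_response_ms = self._clock()

    def is_stale(self):
        elapsed = self._clock() - self._last_rm_response_ms
        return elapsed > self.stale_interval_ms

    def agent_heartbeat_reply(self):
        # Step 2: tell the agent when this AM has lost contact with the RM.
        return STALE_SLIDER_AM if self.is_stale() else "ok"


def handle_am_reply(reply, current_am, discover_am):
    """Agent side (hypothetical): return the AM the agent should be registered with.

    discover_am is a caller-supplied function returning the address of the
    newest SliderAM, or None if no AM could be found.
    """
    if reply != STALE_SLIDER_AM:
        return current_am
    # Step 3: look for a new AM and switch only if a different one is found.
    candidate = discover_am()
    if candidate is None or candidate == current_am:
        return current_am
    return candidate
```

The clock is injectable so the staleness check can be exercised without waiting; in the sketch the agent stays with its current AM until discovery actually returns a different address, so a failed lookup does not leave it unregistered.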




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)