Posted to issues@ambari.apache.org by "Weiwei Yang (JIRA)" <ji...@apache.org> on 2016/11/25 04:03:59 UTC

[jira] [Updated] (AMBARI-18929) Yarn service check fails when either resource manager is down in HA enabled cluster

     [ https://issues.apache.org/jira/browse/AMBARI-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiwei Yang updated AMBARI-18929:
---------------------------------
    Attachment: AMBARI-18929_trunk.patch

Hi [~Tim Thorpe], [~dili]

I attached a patch to fix this. With this patch, the YARN service check first queries the REST API {{http://<rm_host>:<port>/ws/v1/cluster/info}} to determine the address of the active RM. This API has been available since Hadoop 2.3, the very first version to support HA, and it is served without redirection by both the active and standby RMs, as well as by the single RM in a non-HA environment. Once the active RM is identified, the rest of the logic remains the same. Otherwise the service check fails, either because the HTTP service cannot be reached on any RM or because both RMs are in standby state.
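
The lookup described above could be sketched roughly as follows. This is not the code from the attached patch, only a minimal Python 3 illustration. The names {{fetch_ha_state}} and {{find_active_rm}} are hypothetical, and the sketch assumes that the cluster info response carries a {{haState}} field under {{clusterInfo}} and that a non-HA RM also reports {{ACTIVE}} there.

```python
import json
from urllib.request import urlopen


def fetch_ha_state(rm_address, timeout=5):
    """Return the haState reported by one RM, or None if it cannot be queried.

    rm_address is "<rm_host>:<port>". The "clusterInfo"/"haState" layout of
    the response is an assumption of this sketch.
    """
    url = "http://%s/ws/v1/cluster/info" % rm_address
    try:
        with urlopen(url, timeout=timeout) as response:
            info = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection or HTTP errors (OSError) and invalid JSON (ValueError)
        # are both treated as "this RM is not usable".
        return None
    return info.get("clusterInfo", {}).get("haState")


def find_active_rm(rm_addresses, fetch=fetch_ha_state):
    """Return the first address whose RM reports ACTIVE.

    Raises RuntimeError when every RM is unreachable or in standby, which
    corresponds to the failing service check cases.
    """
    for address in rm_addresses:
        if fetch(address) == "ACTIVE":
            return address
    raise RuntimeError(
        "no active ResourceManager found among: %s" % ", ".join(rm_addresses)
    )
```

The {{fetch}} parameter exists only so that the selection logic can be exercised without a live cluster, for example {{find_active_rm(["rm1:8088", "rm2:8088"], fetch={"rm1:8088": None, "rm2:8088": "ACTIVE"}.get)}} picks {{rm2:8088}} even though the first RM is down.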

I tested this patch in the following scenarios:

HA environment
# Both active & standby RMs are up : SUCCESS
# Shut down standby RM, active remains up : SUCCESS
# Shut down active RM, the other RM transitions to active : SUCCESS
# Shut down ZooKeeper, both RMs are standby : FAIL
# Both RMs are down : FAIL

Non-HA environment
# RM is up : SUCCESS
# RM is down : FAIL

Please help review the patch.

> Yarn service check fails when either resource manager is down in HA enabled cluster
> -----------------------------------------------------------------------------------
>
>                 Key: AMBARI-18929
>                 URL: https://issues.apache.org/jira/browse/AMBARI-18929
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Weiwei Yang
>         Attachments: AMBARI-18929_trunk.patch
>
>
> When HA is enabled, yarn service_check.py fails if one of the RMs is down, even if the other one is active. This gives the user the wrong impression that the YARN cluster is not healthy. Instead, the service check should pass, or at least pass with a warning that lets the user know one RM is down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)