Posted to yarn-issues@hadoop.apache.org by "Tao Yang (Jira)" <ji...@apache.org> on 2021/09/16 07:51:00 UTC
[jira] [Updated] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM

     [ https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Yang updated YARN-10955:
----------------------------
    Description: 
The RM is the most complex component in YARN. It hosts many basic and core services, including RPC servers, event dispatchers, an HTTP server, the core scheduler and state managers, and some of them depend on other basic components such as ZooKeeper and HDFS.

Currently, when we encounter an unclear issue, we may have to search many related metrics and a huge volume of logs for suspicious traces, hoping to locate the root cause of the problem. For example, some applications may stay in the NEW_SAVING state, which can be caused by lost ZooKeeper connections or by a jam in the event dispatcher, and the useful traces are buried in many metrics and logs. It is not easy to figure out what happened even for experts, let alone ordinary users.

So I propose adding a common health check mechanism to improve troubleshooting for the RM. My general idea is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields such as isHealthy (boolean), processState (enum: NORMAL/IDLE/BUSY), updateTime (long), diagnostics (string) and keyMetrics (Map<String, Object>).

 * make some key services implement the HealthReporter interface and generate a health report by evaluating their internal state.
 * add a HealthCheckerService that manages and monitors all reportable services, supports checking and fetching health reports both periodically and manually (the manual check can be triggered via a REST API), and publishes metrics and logs as well.
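
To make the three steps above more concrete, here is a minimal sketch of how they might fit together. Only HealthReporter, HealthReport, HealthCheckerService and the listed field names come from the description. Everything else is an assumption for illustration: the DispatcherHealthReporter class, its backlog threshold, the register/checkAll/unhealthyServices methods, and the choice to treat an over-threshold backlog as BUSY and unhealthy. It is not the proposed patch and does not use any existing YARN API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntSupplier;

// Process states named in the description.
enum ProcessState { NORMAL, IDLE, BUSY }

// Immutable report carrying the generic fields listed in the description.
final class HealthReport {
  private final boolean healthy;
  private final ProcessState processState;
  private final long updateTime;
  private final String diagnostics;
  private final Map<String, Object> keyMetrics;

  HealthReport(boolean healthy, ProcessState processState, long updateTime,
      String diagnostics, Map<String, Object> keyMetrics) {
    this.healthy = healthy;
    this.processState = processState;
    this.updateTime = updateTime;
    this.diagnostics = diagnostics;
    this.keyMetrics = Collections.unmodifiableMap(new LinkedHashMap<>(keyMetrics));
  }

  boolean isHealthy() { return healthy; }
  ProcessState getProcessState() { return processState; }
  long getUpdateTime() { return updateTime; }
  String getDiagnostics() { return diagnostics; }
  Map<String, Object> getKeyMetrics() { return keyMetrics; }
}

// Declared public in the proposal; package-private here so the sketch fits in one file.
interface HealthReporter {
  HealthReport getHealthReport();
}

// Hypothetical reporter: judges an event dispatcher by its pending-event backlog.
final class DispatcherHealthReporter implements HealthReporter {
  private final IntSupplier pendingEvents;   // assumed hook into the dispatcher queue
  private final int maxPendingEvents;        // assumed threshold, not a real YARN setting

  DispatcherHealthReporter(IntSupplier pendingEvents, int maxPendingEvents) {
    this.pendingEvents = pendingEvents;
    this.maxPendingEvents = maxPendingEvents;
  }

  @Override
  public HealthReport getHealthReport() {
    int pending = pendingEvents.getAsInt();
    boolean healthy = pending <= maxPendingEvents;
    // Assumed mapping: empty queue is IDLE, over-threshold backlog is BUSY.
    ProcessState state = pending == 0 ? ProcessState.IDLE
        : healthy ? ProcessState.NORMAL : ProcessState.BUSY;
    Map<String, Object> metrics = new LinkedHashMap<>();
    metrics.put("pendingEvents", pending);
    String diagnostics = healthy ? ""
        : "event queue backlog " + pending + " exceeds " + maxPendingEvents;
    return new HealthReport(healthy, state, System.currentTimeMillis(), diagnostics, metrics);
  }
}

// Keeps track of reportable services and collects their reports on demand.
final class HealthCheckerService {
  private final Map<String, HealthReporter> reporters = new LinkedHashMap<>();

  void register(String serviceName, HealthReporter reporter) {
    reporters.put(serviceName, reporter);
  }

  // Collects a fresh report from every registered service (the manual-check path).
  Map<String, HealthReport> checkAll() {
    Map<String, HealthReport> reports = new LinkedHashMap<>();
    for (Map.Entry<String, HealthReporter> e : reporters.entrySet()) {
      reports.put(e.getKey(), e.getValue().getHealthReport());
    }
    return reports;
  }

  // Names of services whose latest report is not healthy, in registration order.
  List<String> unhealthyServices() {
    List<String> names = new ArrayList<>();
    for (Map.Entry<String, HealthReport> e : checkAll().entrySet()) {
      if (!e.getValue().isHealthy()) {
        names.add(e.getKey());
      }
    }
    return names;
  }
}
```

In this sketch a periodic check would simply call checkAll() from a scheduled task, and a REST handler could call the same method for a manual check. In the NEW_SAVING example above, a jammed dispatcher would show up directly as an unhealthy report with its backlog in keyMetrics.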

  was:
The RM is the most complex component in YARN. It hosts many basic and core services, including RPC servers, event dispatchers, an HTTP server, the core scheduler and state managers, and some of them depend on other basic components such as ZooKeeper and HDFS.

Currently, when we encounter an unclear issue, we may have to search many related metrics and a huge volume of logs for suspicious traces, hoping to locate the root cause of the problem. For example, some applications may stay in the NEW_SAVING state, which can be caused by lost ZooKeeper connections or by a jam in the event dispatcher, and the useful traces are buried in many metrics and logs. It is not easy to figure out what happened even for experts, let alone ordinary users.

So I propose adding a common health check mechanism to improve troubleshooting for the RM. My general idea is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields such as isHealthy (boolean), updateTime (long), diagnostics (string) and keyMetrics (Map<String, Object>).

 * make some key services implement the HealthReporter interface and generate a health report by evaluating their internal state.
 * add a HealthCheckerService that manages and monitors all reportable services, supports checking and fetching health reports both periodically and manually (the manual check can be triggered via a REST API), and publishes metrics and logs as well.


> Add health check mechanism to improve troubleshooting skills for RM
> -------------------------------------------------------------------
>
>                 Key: YARN-10955
>                 URL: https://issues.apache.org/jira/browse/YARN-10955
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>
> The RM is the most complex component in YARN. It hosts many basic and core services, including RPC servers, event dispatchers, an HTTP server, the core scheduler and state managers, and some of them depend on other basic components such as ZooKeeper and HDFS.
> Currently, when we encounter an unclear issue, we may have to search many related metrics and a huge volume of logs for suspicious traces, hoping to locate the root cause of the problem. For example, some applications may stay in the NEW_SAVING state, which can be caused by lost ZooKeeper connections or by a jam in the event dispatcher, and the useful traces are buried in many metrics and logs. It is not easy to figure out what happened even for experts, let alone ordinary users.
> So I propose adding a common health check mechanism to improve troubleshooting for the RM. My general idea is that we can:
>  * add a HealthReporter interface as follows:
> {code:java}
> public interface HealthReporter {
>   HealthReport getHealthReport();
> }
> {code}
> HealthReport can have some generic fields such as isHealthy (boolean), processState (enum: NORMAL/IDLE/BUSY), updateTime (long), diagnostics (string) and keyMetrics (Map<String, Object>).
>  * make some key services implement the HealthReporter interface and generate a health report by evaluating their internal state.
>  * add a HealthCheckerService that manages and monitors all reportable services, supports checking and fetching health reports both periodically and manually (the manual check can be triggered via a REST API), and publishes metrics and logs as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
