Posted to yarn-issues@hadoop.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2014/11/24 05:27:12 UTC

[jira] [Commented] (YARN-2832) Wrong Check Logic of NodeHealthCheckerService Causes Latent Errors

    [ https://issues.apache.org/jira/browse/YARN-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222643#comment-14222643 ] 

Hadoop QA commented on YARN-2832:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12680358/health.check.service.1.patch
  against trunk revision a4df9ee.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5914//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5914//console

This message is automatically generated.

> Wrong Check Logic of NodeHealthCheckerService Causes Latent Errors
> ------------------------------------------------------------------
>
>                 Key: YARN-2832
>                 URL: https://issues.apache.org/jira/browse/YARN-2832
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.1, 2.5.1
>         Environment: Any environment
>            Reporter: Tianyin Xu
>         Attachments: health.check.service.1.patch
>
>
> NodeManager lets users specify the script that the health-checker service invokes through the configuration parameter "_yarn.nodemanager.health-checker.script.path_".
> During _serviceInit()_ of the health-checker service, NM uses _shouldRun()_ to check whether the parameter is set correctly:
> {code:title=/* NodeHealthCheckerService.java */|borderStyle=solid}
>   protected void serviceInit(Configuration conf) throws Exception {
>     if (NodeHealthScriptRunner.shouldRun(conf)) {
>       nodeHealthScriptRunner = new NodeHealthScriptRunner();
>       addService(nodeHealthScriptRunner);
>     }
>     addService(dirsHandler);
>     super.serviceInit(conf);
>   }
> {code}
> The problem is that if the parameter is misconfigured (e.g., permission problem, wrong path), NM logs no message to inform users, which can cause latent errors or mysterious problems (e.g., "why does my script not work?").
> The checking and logging logic sits in the _serviceStart()_ function of _NodeHealthScriptRunner.java_ (see the following code snippet), but it can never take effect there. For a parameter that fails the _shouldRun()_ check, _serviceStart()_ is never called, because the _NodeHealthScriptRunner_ instance is never created (see the code snippet above).
> {code:title=/* NodeHealthScriptRunner.java */|borderStyle=solid}
>   protected void serviceStart() throws Exception {
>     // if health script path is not configured don't start the thread.
>     if (!shouldRun(conf)) {
>       LOG.info("Not starting node health monitor");
>       return;
>     }
>     ... 
>   }  
> {code}
> Basically, I think the checking and logging logic should be moved to _serviceInit()_ in _NodeHealthCheckerService_ instead of _serviceStart()_ in _NodeHealthScriptRunner_.
> See the attachment for the simple patch.
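
The kind of check the description asks for can be sketched without any Hadoop dependency. This is a minimal illustration only, not the attached patch: the class HealthScriptCheck and the method whyNotRunnable are hypothetical names, and the sketch assumes that a script path is rejected when it is unset, missing, or not executable, as the description implies. The point is that the caller gets a specific reason it can log at init time, instead of silently skipping the script runner.

```java
import java.io.File;
import java.util.Optional;

public class HealthScriptCheck {

    /**
     * Hypothetical helper (not part of Hadoop): returns the reason the
     * configured health script cannot be run, or empty if it looks runnable.
     */
    public static Optional<String> whyNotRunnable(String scriptPath) {
        if (scriptPath == null || scriptPath.trim().isEmpty()) {
            return Optional.of("health-checker script path is not configured");
        }
        File script = new File(scriptPath.trim());
        if (!script.exists()) {
            return Optional.of("health-checker script does not exist: " + script);
        }
        if (!script.canExecute()) {
            return Optional.of("health-checker script is not executable: " + script);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The path would come from yarn.nodemanager.health-checker.script.path;
        // here it is taken from the command line for illustration.
        String path = args.length > 0 ? args[0] : null;
        Optional<String> problem = whyNotRunnable(path);
        if (problem.isPresent()) {
            // In the NodeManager this would be a LOG call made during serviceInit().
            System.out.println("Not starting node health monitor: " + problem.get());
        } else {
            System.out.println("Starting node health monitor");
        }
    }
}
```

Running the class with no argument, or with a path that does not exist, prints the reason the monitor is not started, which is the message the description says users currently never see.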



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)