Posted to common-dev@hadoop.apache.org by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org> on 2008/05/29 17:19:44 UTC

[jira] Created: (HADOOP-3464) [HOD] HOD can improve error messages by reporting failure on compute nodes back to hod client

[HOD] HOD can improve error messages by reporting failure on compute nodes back to hod client
---------------------------------------------------------------------------------------------

                 Key: HADOOP-3464
                 URL: https://issues.apache.org/jira/browse/HADOOP-3464
             Project: Hadoop Core
          Issue Type: Improvement
            Reporter: Vinod Kumar Vavilapalli
            Assignee: Hemanth Yamijala
             Fix For: 0.18.0


This issue addresses error messages w.r.t failures on compute nodes, while HADOOP-3151 addresses error messages in hod client.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3464) [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3464:
--------------------------------------------

    Attachment: HADOOP-3464

Attaching first patch.

 - This solves the problem of reporting errors on the ringmaster side back to the hod client. HodRing problems are still NOT addressed.
 - The changes to hodlib/Common/setup.py are borrowed from the patch for HADOOP-2961. The two need to be merged when committing.
 - Also fixed another issue: earlier, validation errors in the ringmaster were not logged because the log was initialized too late. This has been changed so that these errors can also be reported back to the hod client.
 - Tested with 1) an invalid tar file, e.g. a junk file, 2) a non-existent path value for hodring.java-home, and 3) a non-existent path value for gridservice-hdfs.pkgs, and verified that the errors are properly propagated back to the hod client.
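
The late-log-initialization problem described above can be illustrated with a minimal sketch. It is not taken from the patch: the names ErrorCollector, validate_config and start_ringmaster, as well as the option keys, are hypothetical, and it uses modern Python 3 rather than the Python of the time. The only point it shows is that logging is set up before validation runs, so validation errors are captured and can later be sent back to the client.

```python
import logging


class ErrorCollector(logging.Handler):
    """Hypothetical handler: keeps error-level messages in memory."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def validate_config(config, log):
    """Hypothetical validation: logs one error per missing option."""
    ok = True
    for key in ("java-home", "pkgs"):
        if not config.get(key):
            log.error("option '%s' is not set or does not exist", key)
            ok = False
    return ok


def start_ringmaster(config):
    # Initialize logging first, so that errors raised during
    # validation are captured instead of being silently lost.
    log = logging.getLogger("ringmaster-sketch")
    log.setLevel(logging.DEBUG)
    log.propagate = False
    collector = ErrorCollector()
    log.addHandler(collector)
    try:
        ok = validate_config(config, log)
    finally:
        log.removeHandler(collector)
    # The collected messages are what would be reported to the client.
    return ok, collector.messages
```

Calling start_ringmaster with a config that lacks one of the options returns False together with the list of logged error messages.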

> [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3464
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3464
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3464
>
>
> This issue addresses error messages w.r.t failures on compute nodes, while HADOOP-3151 addresses error messages in hod client.



[jira] Updated: (HADOOP-3464) [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3464:
--------------------------------------------

    Component/s: contrib/hod
       Assignee: Vinod Kumar Vavilapalli  (was: Hemanth Yamijala)
        Summary: [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client  (was: [HOD] HOD can improve error messages by reporting failure on compute nodes back to hod client)

> [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3464
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3464
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>             Fix For: 0.18.0
>
>
> This issue addresses error messages w.r.t failures on compute nodes, while HADOOP-3151 addresses error messages in hod client.



[jira] Updated: (HADOOP-3464) [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3464:
--------------------------------------------

    Attachment: HADOOP-3464.1

Attaching a new patch.
 - Fixes the problem of transferring hodring error messages to the hod client.
 - Fixes a minor problem in the earlier patch: ringmaster error messages are now printed when the return status is either 5 or 6.

This patch needs some cleanup: removing extraneous debug statements, improving the log statements, improving how data (error messages) is transferred from the hodrings to the ringmaster and then to the hod client, and perhaps even cleaning up the API.

Changing the error messages themselves would be extra effort. This patch only addresses bringing them to the hod client. Improving the messages should be covered by new JIRA issues anyway.
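
As a rough illustration of the "status 5 or 6" fix mentioned above, here is a small sketch. The function name and constant are hypothetical, the meaning of the two status codes is not stated in this thread, and joining the messages one per line is just one possible presentation, not necessarily what the patch does.

```python
# Status values taken from the message above; what each one
# means is not stated there, so they are treated as opaque.
RINGMASTER_ERROR_STATUSES = (5, 6)


def format_ringmaster_errors(status, errors):
    """Return the text to show the user, or '' for any other status."""
    if status not in RINGMASTER_ERROR_STATUSES:
        return ""
    # One message per line is easier to read than a printed list.
    return "\n".join(str(err) for err in errors)
```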

> [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3464
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3464
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3464, HADOOP-3464.1
>
>
> This issue addresses error messages w.r.t failures on compute nodes, while HADOOP-3151 addresses error messages in hod client.



[jira] Updated: (HADOOP-3464) [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3464:
-------------------------------------

    Release Note: Implemented a mechanism to transfer HOD errors that occur on compute nodes to the submit node running the HOD client, so users have good feedback on why an allocation failed.
    Hadoop Flags: [Reviewed]
          Status: Patch Available  (was: Open)

> [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3464
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3464
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3464, HADOOP-3464.1, HADOOP-3464.4
>
>
> This issue addresses error messages w.r.t failures on compute nodes, while HADOOP-3151 addresses error messages in hod client.



[jira] Updated: (HADOOP-3464) [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3464:
-------------------------------------

    Attachment: HADOOP-3464.4

The attached patch addresses most of the points I raised in my previous comment. The two that are not handled are:
- Not returning an error if all hodrings fail. This will be addressed in the fix for HADOOP-3184.
- A new XML-RPC client is still created each time, as this is not too much overhead.




[jira] Updated: (HADOOP-3464) [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client

Posted by "Mukund Madhugiri (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mukund Madhugiri updated HADOOP-3464:
-------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this for Hemanth. Thanks, Vinod, for the patch!




[jira] Commented: (HADOOP-3464) [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601595#action_12601595 ] 

Hemanth Yamijala commented on HADOOP-3464:
------------------------------------------

A few comments:

- When the ringmaster fails, we print the errors as an array of strings on a single line. For better readability, they should be printed one per line.
- When the ringmaster fails due to problems with the hadoop pkgs, the error message is not helpful. It says something like int cannot be NoneType or some such. This should be improved.
- We use ringmaster.addMasterParams to report errors from the hodrings. This is confusing. We should define a new API, something like setHodRingError, and report errors back using that RPC.
- The PID of the hodring process is part of the 'host' reporting the error. It appears this is important, as removing the PID caused the functionality to break. However, when we print these messages to the client, the name is printed as hostname_pid, which does not make much sense. So we can try to see whether the pid part can be avoided.
- In a few places we construct an XML-RPC client object. If one has already been constructed, can we reuse it?
- When hodrings fail due to a config error, we don't report this back. This is because error reporting happens only if the getCommand method has been called by a hodring. In case of config errors, getCommand is not called, so these errors are not caught. The requirement is that we should be able to report master command failures, that is, if an internal HDFS daemon or a MapRed daemon fails. If there are n nodes in the ring, at least 2 (in case of internal) or 1 hodring should come up successfully for the masters. If the number of reported failures exceeds this, we can report a failure to the service registry client.
- When a hadoop daemon fails, the message simply says failed to launch hadoop command. Typically the daemon.err file has more useful information. If possible, this should be fetched and displayed to the client.

Will try to submit a patch addressing these points.
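
The dedicated error-reporting RPC and the failure threshold suggested in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from any attached patch: the class HodRingErrorRegistry, its methods other than the suggested name setHodRingError, and the helper make_server are hypothetical, and the threshold check is one reading of the "at least 2 (internal) or 1 hodring must come up" rule. It uses the Python 3 standard-library xmlrpc.server module.

```python
from xmlrpc.server import SimpleXMLRPCServer


class HodRingErrorRegistry:
    """Hypothetical ringmaster-side store for errors reported by hodrings."""

    def __init__(self, total_nodes, masters_needed):
        self.total_nodes = total_nodes        # n nodes in the ring
        self.masters_needed = masters_needed  # e.g. 2 (internal) or 1
        self.errors = {}                      # host -> list of messages

    def setHodRingError(self, host, message):
        """RPC entry point: record one error from one hodring."""
        self.errors.setdefault(host, []).append(message)
        return True  # XML-RPC cannot marshal None by default

    def allocation_failed(self):
        """True when too few hodrings remain to bring up the masters."""
        healthy = self.total_nodes - len(self.errors)
        return healthy < self.masters_needed

    def report_lines(self):
        """Errors formatted one per line, ready to show to the client."""
        return ["%s: %s" % (host, msg)
                for host in sorted(self.errors)
                for msg in self.errors[host]]


def make_server(registry, address=("localhost", 0)):
    """Expose setHodRingError over XML-RPC (port 0 picks a free port)."""
    server = SimpleXMLRPCServer(address, logRequests=False)
    server.register_function(registry.setHodRingError, "setHodRingError")
    return server
```

In such a setup a hodring would report a failure with something like xmlrpc.client.ServerProxy(url).setHodRingError(host, message), where url is the ringmaster's XML-RPC address, and the ringmaster would check allocation_failed() before deciding whether to report a failure to the service registry client.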




[jira] Commented: (HADOOP-3464) [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602370#action_12602370 ] 

Hadoop QA commented on HADOOP-3464:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12383376/HADOOP-3464.4
  against trunk revision 663079.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2573/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2573/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2573/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2573/console

This message is automatically generated.

