Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2010/06/02 12:03:56 UTC

[jira] Commented: (HADOOP-6473) Add hadoop health check/diagnostics to run from command line, JSP pages, other tools

    [ https://issues.apache.org/jira/browse/HADOOP-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874516#action_12874516 ] 

Steve Loughran commented on HADOOP-6473:
----------------------------------------

Some things we could check for, with related bug reports/stack traces, that are independent of node/client role:

Network
# IPv4 is in use: HADOOP-6056. Check the relevant system property, see if a local IP address lookup with the IPv6 address ::1 works, and see what happens on an nslookup of 127.0.0.1.
# DNS status. At the very least, the local hostname must resolve: HADOOP-3426.
# Also check reverse DNS and nslookup some common external addresses (hadoop.apache.org); warn if these are not available, but mention that it's not important if you don't want external network access.
# Proxy server settings: print them and, if non-null, check the (hostname, port) pair and warn if either is missing.
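
A rough sketch of how the network checks above might look, using only the JDK. The class and method names are made up for illustration, and the choice of java.net.preferIPv4Stack and http.proxyHost/http.proxyPort as the properties to print is an assumption, not something settled in this issue. Reverse DNS and external lookups are left out.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

final class NetworkDiagnostics {

    /** Resolves a name or literal address; reports failure as text instead of throwing. */
    static String describeLookup(String host) {
        try {
            InetAddress addr = InetAddress.getByName(host);
            return host + " -> " + addr.getHostAddress()
                    + (addr.isLoopbackAddress() ? " (loopback)" : "");
        } catch (UnknownHostException e) {
            return host + " -> FAILED: " + e.getMessage();
        }
    }

    /** Returns a warning for an incomplete (hostname, port) pair, or null if there is none. */
    static String proxyWarning(String host, String port) {
        boolean noHost = host == null || host.isEmpty();
        boolean noPort = port == null || port.isEmpty();
        if (noHost && noPort) return null;            // no proxy configured
        if (noHost) return "proxy port set but hostname missing";
        if (noPort) return "proxy hostname set but port missing";
        try {
            int p = Integer.parseInt(port.trim());
            if (p < 1 || p > 65535) return "proxy port out of range: " + port;
        } catch (NumberFormatException e) {
            return "proxy port is not a number: " + port;
        }
        return null;
    }

    static List<String> run() {
        List<String> report = new ArrayList<>();
        // Assumption: this is the system property that matters for the IPv4 check.
        report.add("java.net.preferIPv4Stack="
                + System.getProperty("java.net.preferIPv4Stack"));
        report.add(describeLookup("::1"));
        report.add(describeLookup("127.0.0.1"));
        try {
            InetAddress local = InetAddress.getLocalHost();
            report.add("local hostname " + local.getHostName()
                    + " -> " + local.getHostAddress());
        } catch (UnknownHostException e) {
            report.add("WARN: local hostname does not resolve: " + e.getMessage());
        }
        // Assumption: the standard JDK HTTP proxy properties are the ones of interest.
        String host = System.getProperty("http.proxyHost");
        String port = System.getProperty("http.proxyPort");
        report.add("http.proxyHost=" + host + ", http.proxyPort=" + port);
        String warning = proxyWarning(host, port);
        if (warning != null) report.add("WARN: " + warning);
        return report;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Returning report lines rather than throwing keeps one failed lookup from hiding the rest of the diagnostics.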

Classpath
# Print it out. 
# Check for duplicate filenames (with/without version endings?)
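
One possible shape for the classpath check, as a sketch only. The version-stripping pattern is a guess at common JAR naming ("name-1.2.3.jar", "name-1.2.3-SNAPSHOT.jar") and will miss other schemes; ClasspathDiagnostics, baseName and findDuplicates are hypothetical names.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

final class ClasspathDiagnostics {

    // Heuristic, assumed naming convention: "name-1.2.3" or "name-1.2.3-SNAPSHOT".
    private static final Pattern VERSION_ENDING =
            Pattern.compile("-\\d+(\\.\\d+)*(-SNAPSHOT)?$");

    /** File name of a classpath entry without ".jar" and without a trailing version. */
    static String baseName(String entry) {
        String name = new File(entry).getName();
        if (name.endsWith(".jar")) {
            name = name.substring(0, name.length() - ".jar".length());
        }
        return VERSION_ENDING.matcher(name).replaceFirst("");
    }

    /** Groups JAR entries by base name and keeps only names that occur more than once. */
    static Map<String, List<String>> findDuplicates(String classpath, String separator) {
        Map<String, List<String>> byName = new TreeMap<>();
        for (String entry : classpath.split(Pattern.quote(separator))) {
            if (!entry.endsWith(".jar")) continue;   // skip directories and empty entries
            byName.computeIfAbsent(baseName(entry), k -> new ArrayList<>()).add(entry);
        }
        byName.values().removeIf(entries -> entries.size() < 2);
        return byName;
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        System.out.println("classpath: " + classpath);
        findDuplicates(classpath, File.pathSeparator).forEach(
                (name, entries) -> System.out.println("WARN: duplicate " + name + ": " + entries));
    }
}
```

Comparing on the stripped name catches both the same file in two directories and two versions of the same library, at the cost of some false positives.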

Dependencies
# XML engine name and version
# XML parser supports XInclude: HADOOP-5254
# XSL engine name and version
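
A minimal sketch of the dependency checks through JAXP. It reports the factory implementation class as a stand-in for the engine name and takes the version from the JAR manifest, which is often absent, so "unknown" is a normal answer. The XInclude check only asks the parser whether it accepts the setting; it does not parse a document that uses xi:include. XmlDiagnostics and its methods are hypothetical names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerFactory;

final class XmlDiagnostics {

    /** Class name plus implementation version from the JAR manifest, if there is one. */
    static String describe(Object engine) {
        Class<?> cls = engine.getClass();
        Package pkg = cls.getPackage();
        String version = pkg == null ? null : pkg.getImplementationVersion();
        return cls.getName() + " (version " + (version == null ? "unknown" : version) + ")";
    }

    /** True if the JAXP parser accepts the XInclude setting and reports it as active. */
    static boolean xIncludeSupported() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);   // XInclude processing needs namespaces
            factory.setXIncludeAware(true);
            return factory.newDocumentBuilder().isXIncludeAware();
        } catch (UnsupportedOperationException | ParserConfigurationException e) {
            return false;                      // parser without XInclude support
        }
    }

    public static void main(String[] args) {
        System.out.println("XML parser: " + describe(DocumentBuilderFactory.newInstance()));
        System.out.println("XInclude:   " + (xIncludeSupported() ? "supported" : "NOT supported"));
        System.out.println("XSL engine: " + describe(TransformerFactory.newInstance()));
    }
}
```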

Local filesystem
# Space
# Temp dir is writeable. Write something, read it back in, verify checksum and that the file's timestamp is within a few seconds of the system clock.
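
The temp dir check could be sketched like this. CRC32 as the checksum and a 5 second tolerance for "a few seconds" are both assumptions, and LocalFsDiagnostics is a made-up name.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.CRC32;

final class LocalFsDiagnostics {

    static final long MAX_SKEW_MILLIS = 5_000;   // assumed meaning of "a few seconds"

    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    /** Writes a probe file into dir, reads it back, and returns "OK" or a "FAIL: ..." line. */
    static String checkWritable(Path dir) {
        byte[] written = "hadoop diagnostics probe".getBytes(StandardCharsets.UTF_8);
        Path probe = null;
        try {
            probe = Files.createTempFile(dir, "diag", ".tmp");
            Files.write(probe, written);
            byte[] readBack = Files.readAllBytes(probe);
            if (checksum(readBack) != checksum(written)) {
                return "FAIL: checksum mismatch after read-back in " + dir;
            }
            long skew = Math.abs(
                    Files.getLastModifiedTime(probe).toMillis() - System.currentTimeMillis());
            if (skew > MAX_SKEW_MILLIS) {
                return "FAIL: file timestamp is " + skew + " ms away from the system clock";
            }
            return "OK";
        } catch (IOException e) {
            return "FAIL: " + e;
        } finally {
            if (probe != null) {
                try {
                    Files.deleteIfExists(probe);
                } catch (IOException ignored) {
                    // best-effort cleanup of the probe file
                }
            }
        }
    }

    public static void main(String[] args) {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        System.out.println("temp dir:     " + tmp);
        System.out.println("usable space: " + tmp.toFile().getUsableSpace() + " bytes");
        System.out.println("write check:  " + checkWritable(tmp));
    }
}
```

A timestamp mismatch here usually points at a remote or misconfigured filesystem rather than at Hadoop itself, which is why it is reported separately from plain write failures.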




> Add hadoop health check/diagnostics to run from command line, JSP pages, other tools
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6473
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6473
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Steve Loughran
>            Priority: Minor
>
> If the lifecycle ping() is for short-duration "are we still alive" checks, Hadoop still needs something bigger to check the overall system health. This would be for end users, but also for automated cluster deployment: a complete validation of the cluster.
> It could be a command line tool, and something that runs on different nodes, checked via IPC or JSP. The idea would be to do thorough checks with good diagnostics. Oh, and they should be executable through JUnit too.
> For example
>  -if running on Windows, check that Cygwin is on the path, fail with a pointer to a wiki page if not
>  -datanodes should check that they can create locks on the filesystem and create files, and that timestamps are (roughly) aligned with local time.
>  -namenodes should try to create files/locks in the filesystem
>  -task tracker should try to exec() something
>  -run through the classpath and look for problems: duplicate JARs, unsupported Java, Xerces versions, etc.
> * The number of tests should be extensible -rather than one single class with all the tests, there'd be something separate for name, task, data, job tracker nodes
> * They can't be in the nodes themselves, as they should be executable even if the nodes don't come up. 
> * output could be in human readable text or html, and a form that could be processed through hadoop itself in future
> * these tests could have side effects, such as actually trying to submit work to a cluster

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.