Posted to common-commits@hadoop.apache.org by yh...@apache.org on 2008/06/20 18:31:41 UTC

svn commit: r669980 [3/3] - in /hadoop/core/trunk: docs/ src/contrib/hod/ src/docs/src/documentation/content/xdocs/

Modified: hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_admin_guide.xml
URL: http://svn.apache.org/viewvc/hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_admin_guide.xml?rev=669980&r1=669979&r2=669980&view=diff
==============================================================================
--- hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_admin_guide.xml (original)
+++ hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_admin_guide.xml Fri Jun 20 09:31:41 2008
@@ -315,10 +315,13 @@
     </section>
     <section>
       <title>checklimits.sh - Tool to update torque comment field reflecting resource limits</title>
-      <p>checklimits is a HOD tool specific to torque/maui environment. It
+      <p>checklimits is a HOD tool specific to the Torque/Maui environment
+      (<a href="ext:hod/maui">Maui Cluster Scheduler</a> is an open source job
+      scheduler for clusters and supercomputers, from Cluster Resources). The
+      checklimits.sh script
       updates torque comment field when newly submitted job(s) violate/cross
-      over user limits set up in maui scheduler. It uses qstat, does one pass
-      over torque job list to find out queued or unfinished jobs, runs maui
+      over user limits set up in the Maui scheduler. It uses qstat, makes one pass
+      over the torque job list to find queued or unfinished jobs, runs the Maui
       tool checkjob on each job to see if user limits are violated and then
       runs torque's qalter utility to update job attribute 'comment'. Currently
       it updates the comment as <em>User-limits exceeded. Requested:([0-9]*)
@@ -330,7 +333,9 @@
         <p>checklimits.sh is available under hod_install_location/support
         folder. This is a shell script and can be run directly as <em>sh
         checklimits.sh </em>or as <em>./checklimits.sh</em> after enabling
-        execute permissions. In order for this tool to be able to update
+        execute permissions. Torque and Maui binaries should be available
+        on the machine where the tool is run and should be on the PATH
+        of the shell script process. In order for this tool to be able to update
         comment field of jobs from different users, it has to be run with
         torque administrative privileges. This tool has to be run repeatedly
         after specific intervals of time to frequently update jobs violating

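For illustration, the pass that checklimits.sh makes over the job list could look roughly like the shell sketch below. This is not the actual script shipped in hod_install_location/support; the qstat output parsing, the string matched in the checkjob output, and the qalter option used to set the comment attribute are all assumptions made for the example.

  #!/bin/sh
  # Rough sketch only -- not the real checklimits.sh. Assumes the Torque
  # binaries (qstat, qalter) and the Maui binary (checkjob) are on the PATH,
  # and that the script runs with torque administrative privileges.

  # One pass over the torque job list, keeping queued or unfinished jobs
  # (anything not in the C, i.e. Completed, state; column layout assumed).
  for job in $(qstat | awk 'NR > 2 && $5 != "C" { print $1 }'); do
      # Ask the Maui scheduler whether this job crosses any user limits.
      # The text matched here is illustrative, not Maui's exact wording.
      if checkjob "$job" | grep -q "exceeds.*limit"; then
          # Update the job's 'comment' attribute through qalter; the exact
          # option syntax for the comment attribute is an assumption.
          qalter -W comment="User-limits exceeded. Requested:(...) Used:(...) MaxLimit:(...)" "$job"
      fi
  done
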
Modified: hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_config_guide.xml
URL: http://svn.apache.org/viewvc/hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_config_guide.xml?rev=669980&r1=669979&r2=669980&view=diff
==============================================================================
--- hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_config_guide.xml (original)
+++ hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_config_guide.xml Fri Jun 20 09:31:41 2008
@@ -161,6 +161,22 @@
                        as many paths are specified as there are disks available
                        to ensure all disks are being utilized. The restrictions
                        and notes for the temp-dir variable apply here too.</li>
+          <li>max-master-failures: Defines how many times a hadoop master
+                       daemon can fail to launch before HOD fails the
+                       cluster allocation altogether. In HOD clusters, there
+                       can sometimes be one or a few "bad" nodes, due to
+                       issues like missing java or a missing/incorrect
+                       version of Hadoop. When this configuration variable
+                       is set to a positive integer, the RingMaster returns
+                       an error to the client only when the number of times
+                       a hadoop master (JobTracker or NameNode) fails to
+                       start on these bad nodes because of such issues
+                       exceeds the specified value. If the number is not
+                       exceeded, the next HodRing that requests a command to
+                       launch is given the same hadoop master again. This
+                       way, HOD tries its best to make the allocation
+                       succeed even in the presence of a few bad nodes in
+                       the cluster.
+                       </li>
         </ul>
       </section>
       

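As a minimal illustration, the new option would be set in the hodrc file as below, assuming it belongs to the [ringmaster] section as the name ringmaster.max-master-failures used in the user guide suggests; the value 5 is arbitrary.

  [ringmaster]
  # Fail the allocation only after a hadoop master (JobTracker or NameNode)
  # has failed to launch this many times on bad nodes; 5 is just an example.
  max-master-failures = 5
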
Modified: hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_user_guide.xml
URL: http://svn.apache.org/viewvc/hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_user_guide.xml?rev=669980&r1=669979&r2=669980&view=diff
==============================================================================
--- hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_user_guide.xml (original)
+++ hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_user_guide.xml Fri Jun 20 09:31:41 2008
@@ -28,7 +28,7 @@
   <section><title>A typical HOD session</title><anchor id="HOD_Session"></anchor>
   <p>A typical session of HOD will involve at least three steps: allocate, run hadoop jobs, deallocate. In order to do this, perform the following steps.</p>
   <p><strong> Create a Cluster Directory </strong></p><anchor id="Create_a_Cluster_Directory"></anchor>
-  <p>The <em>cluster directory</em> is a directory on the local file system where <code>hod</code> will generate the Hadoop configuration, <em>hadoop-site.xml</em>, corresponding to the cluster it allocates. Create this directory and pass it to the <code>hod</code> operations as stated below. Once a cluster is allocated, a user can utilize it to run Hadoop jobs by specifying the cluster directory as the Hadoop --config option. </p>
+  <p>The <em>cluster directory</em> is a directory on the local file system where <code>hod</code> will generate the Hadoop configuration, <em>hadoop-site.xml</em>, corresponding to the cluster it allocates. Pass this directory to the <code>hod</code> operations as stated below. If the specified cluster directory doesn't already exist, HOD will automatically try to create it and use it. Once a cluster is allocated, a user can utilize it to run Hadoop jobs by specifying the cluster directory as the Hadoop --config option. </p>
   <p><strong> Operation <em>allocate</em></strong></p><anchor id="Operation_allocate"></anchor>
   <p>The <em>allocate</em> operation is used to allocate a set of nodes and install and provision Hadoop on them. It has the following syntax. Note that it requires a cluster_dir ( -d, --hod.clusterdir) and the number of nodes (-n, --hod.nodecount) needed to be allocated:</p>
     <table>
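For illustration, an allocation followed by a job run could look like the following, assuming the hod allocate -d/-n form that the syntax table describes; the cluster directory, node count and job jar are examples only.

  $ hod allocate -d ~/hod-clusters/test -n 5
  $ hadoop --config ~/hod-clusters/test jar my-job.jar MyJobClass input output
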
@@ -92,7 +92,7 @@
   <p>This will be a regular shell script that will typically contain hadoop commands, such as:</p>
   <table><tr><td><code>$ hadoop jar jar_file options</code></td>
   </tr></table>
-  <p>However, the user can add any valid commands as part of the script. HOD will execute this script setting <em>HADOOP_CONF_DIR</em> automatically to point to the allocated cluster. So users do not need to worry about this. The users however need to create a cluster directory just like when using the allocate operation.</p>
+  <p>However, the user can add any valid commands as part of the script. HOD will execute this script setting <em>HADOOP_CONF_DIR</em> automatically to point to the allocated cluster, so users do not need to worry about this. Users, however, need to specify a cluster directory just as when using the allocate operation.</p>
   <p><strong> Running the script </strong></p><anchor id="Running_the_script"></anchor>
   <p>The syntax for the <em>script operation</em> as is as follows. Note that it requires a cluster directory ( -d, --hod.clusterdir), number of nodes (-n, --hod.nodecount) and a script file (-s, --hod.script):</p>
     <table>
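For illustration, running a script against a freshly allocated cluster could look like the following, assuming the hod script form with the -d, -n and -s options named above; the file names and node count are examples only.

  $ cat my-script.sh
  hadoop jar my-job.jar MyJobClass input output
  $ hod script -d ~/hod-clusters/test -n 5 -s my-script.sh
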
@@ -151,6 +151,7 @@
   <ul>
     <li> For better distribution performance it is recommended that the Hadoop tarball contain only the libraries and binaries, and not the source or documentation.</li>
     <li> When you want to run jobs against a cluster allocated using the tarball, you must use a compatible version of hadoop to submit your jobs. The best would be to untar and use the version that is present in the tarball itself.</li>
+    <li> Make sure that no Hadoop configuration files, hadoop-env.sh or hadoop-site.xml, are present in the conf directory of the tarred distribution. The presence of these files with incorrect values could cause the cluster allocation to fail.</li>
   </ul>
   </section>
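One quick, illustrative way to verify a tarball before handing it to HOD is to list its contents and look for the two configuration files; the tarball name below is an example only.

  $ tar tzf hadoop.tar.gz | grep -E 'conf/(hadoop-env\.sh|hadoop-site\.xml)$'

If the command prints anything, repackage the tarball without those files before using it for provisioning.
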
   <section><title> Using an external HDFS </title><anchor id="Using_an_external_HDFS"></anchor>
@@ -332,7 +333,7 @@
   <p><em>-c config_file</em><br />
     Provides the configuration file to use. Can be used with all other options of HOD. Alternatively, the <code>HOD_CONF_DIR</code> environment variable can be defined to specify a directory that contains a file named <code>hodrc</code>, alleviating the need to specify the configuration file in each HOD command.</p>
   <p><em>-d cluster_dir</em><br />
-        This is required for most of the hod operations. As described <a href="#Create_a_Cluster_Directory">here</a>, the <em>cluster directory</em> is a directory on the local file system where <code>hod</code> will generate the Hadoop configuration, <em>hadoop-site.xml</em>, corresponding to the cluster it allocates. Create this directory and pass it to the <code>hod</code> operations as an argument to -d or --hod.clusterdir. Once a cluster is allocated, a user can utilize it to run Hadoop jobs by specifying the clusterdirectory as the Hadoop --config option.</p>
+        This is required for most of the hod operations. As described <a href="#Create_a_Cluster_Directory">here</a>, the <em>cluster directory</em> is a directory on the local file system where <code>hod</code> will generate the Hadoop configuration, <em>hadoop-site.xml</em>, corresponding to the cluster it allocates. Pass it to the <code>hod</code> operations as an argument to -d or --hod.clusterdir. If it doesn't already exist, HOD will automatically try to create it and use it. Once a cluster is allocated, a user can utilize it to run Hadoop jobs by specifying the cluster directory as the Hadoop --config option.</p>
   <p><em>-n number_of_nodes</em><br />
   This is required for the hod 'allocation' operation and for script operation. This denotes the number of nodes to be allocated.</p>
   <p><em>-s script-file</em><br/>
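For illustration, the two ways of pointing HOD at its configuration described above could be used as follows; the paths are examples only.

  # Either name the configuration file explicitly on each command ...
  $ hod allocate -c ~/hod-conf/hodrc -d ~/hod-clusters/test -n 5

  # ... or point HOD_CONF_DIR at a directory containing a file named hodrc.
  $ export HOD_CONF_DIR=~/hod-conf
  $ hod allocate -d ~/hod-clusters/test -n 5
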
@@ -416,19 +417,26 @@
       <tr>
         <td> 6 </td>
         <td> Ringmaster failure </td>
-        <td> 1. Invalid configuration in the <code>ringmaster</code> section,<br />
-          2. invalid <code>pkgs</code> option in <code>gridservice-mapred or gridservice-hdfs</code> section,<br />
-          3. an invalid hadoop tarball,<br />
-          4. mismatched version in Hadoop between the MapReduce and an external HDFS.<br />
-          The Torque <code>qstat</code> command will most likely show a job in the <code>C</code> (Completed) state. Refer to the section <em>Locating Ringmaster Logs</em> below for more information. </td>
+        <td> HOD prints the message "Cluster could not be allocated because of the following errors on the ringmaster host &lt;hostname&gt;". The actual error message may indicate one of the following:<br/>
+          1. Invalid configuration on the node running the ringmaster, specified by the hostname in the error message.<br/>
+          2. Invalid configuration in the <code>ringmaster</code> section.<br />
+          3. Invalid <code>pkgs</code> option in the <code>gridservice-mapred</code> or <code>gridservice-hdfs</code> section.<br />
+          4. An invalid hadoop tarball, or a tarball which has bundled an invalid configuration file in the conf directory.<br />
+          5. A mismatch in Hadoop versions between MapReduce and an external HDFS.<br />
+          The Torque <code>qstat</code> command will most likely show a job in the <code>C</code> (Completed) state. <br/>
+          One can log in to the ringmaster host given in the HOD failure message and debug the problem with the help of the error message. If the error message doesn't give complete information, the ringmaster logs should help in finding the root cause of the problem. Refer to the section <em>Locating Ringmaster Logs</em> below for more information. </td>
       </tr>
       <tr>
         <td> 7 </td>
         <td> DFS failure </td>
-        <td> 1. Problem in starting Hadoop clusters. Review the Hadoop related configuration. Look at the Hadoop logs using information specified in <em>Getting Hadoop Logs</em> section above. <br />
-          2. Invalid configuration in the <code>hodring</code> section of hodrc. <code>ssh</code> to all allocated nodes (determined by <code>qstat -f torque_job_id</code>) and grep for <code>ERROR</code> or <code>CRITICAL</code> in hodring logs. Refer to the section <em>Locating Hodring Logs</em> below for more information. <br />
-          3. Invalid tarball specified which is not packaged correctly. <br />
-          4. Cannot communicate with an externally configured HDFS. </td>
+        <td> When HOD fails to allocate due to DFS failures (or JobTracker failures, error code 8, see below), it prints a failure message "Hodring at &lt;hostname&gt; failed with following errors:" and then gives the actual error message, which may indicate one of the following:<br/>
+          1. Problem in starting Hadoop clusters. Usually the actual cause in the error message indicates the problem on the hostname mentioned. Also, review the Hadoop related configuration in the HOD configuration files. Look at the Hadoop logs using the information specified in the <em>Collecting and Viewing Hadoop Logs</em> section above. <br />
+          2. Invalid configuration on the node running the hodring, specified by the hostname in the error message.<br/>
+          3. Invalid configuration in the <code>hodring</code> section of hodrc. <code>ssh</code> to the hostname specified in the error message and grep for <code>ERROR</code> or <code>CRITICAL</code> in the hodring logs. Refer to the section <em>Locating Hodring Logs</em> below for more information. <br />
+          4. An invalid tarball specified, one which is not packaged correctly. <br />
+          5. Cannot communicate with an externally configured HDFS.<br/>
+          When such a DFS or JobTracker failure occurs, one can log in to the host named in the HOD failure message and debug the problem there. While fixing the problem, one should also review other messages in the ringmaster log to see which other machines might have had trouble bringing up the JobTracker/NameNode, apart from the hostname reported in the failure message. Other machines can be affected because HOD keeps trying to launch the hadoop daemons on one machine after another, depending upon the value of the configuration variable <a href="hod_config_guide.html#3.4+ringmaster+options">ringmaster.max-master-failures</a>. Refer to the section <em>Locating Ringmaster Logs</em> below to find out more about ringmaster logs.
+          </td>
       </tr>
       <tr>
         <td> 8 </td>

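As a concrete, illustrative version of the debugging steps described in the rows above, one could do the following after a hodring or ringmaster failure; the hostname and log locations are placeholders, and the sections Locating Hodring Logs and Locating Ringmaster Logs give the actual paths.

  # Log in to the host named in the HOD failure message ...
  $ ssh badnode.example.com

  # ... and look for ERROR or CRITICAL lines in the hodring logs there
  # (<hodring-log-dir> is a placeholder; see 'Locating Hodring Logs').
  $ grep -E 'ERROR|CRITICAL' <hodring-log-dir>/*

  # Also scan the ringmaster log for other nodes that failed to bring up
  # the JobTracker or NameNode (see 'Locating Ringmaster Logs').
  $ grep -iE 'error|critical' <ringmaster-log-dir>/*
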
Modified: hadoop/core/trunk/src/docs/src/documentation/content/xdocs/site.xml
URL: http://svn.apache.org/viewvc/hadoop/core/trunk/src/docs/src/documentation/content/xdocs/site.xml?rev=669980&r1=669979&r2=669980&view=diff
==============================================================================
--- hadoop/core/trunk/src/docs/src/documentation/content/xdocs/site.xml (original)
+++ hadoop/core/trunk/src/docs/src/documentation/content/xdocs/site.xml Fri Jun 20 09:31:41 2008
@@ -82,6 +82,7 @@
       <torque-mailing-list href="http://www.clusterresources.com/pages/resources/mailing-lists.php" />
       <torque-basic-config href="http://www.clusterresources.com/wiki/doku.php?id=torque:1.2_basic_configuration" />
       <torque-advanced-config href="http://www.clusterresources.com/wiki/doku.php?id=torque:1.3_advanced_configuration" />
+      <maui href="http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php"/>
       <python href="http://www.python.org" />
       <twisted-python href="http://twistedmatrix.com/trac/" />
     </hod>