Posted to common-commits@hadoop.apache.org by cd...@apache.org on 2008/11/30 10:37:46 UTC

svn commit: r721790 [3/3] - in /hadoop/core/branches/branch-0.19: CHANGES.txt docs/mapred_tutorial.html docs/mapred_tutorial.pdf src/docs/src/documentation/content/xdocs/mapred_tutorial.xml

Modified: hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml?rev=721790&r1=721789&r2=721790&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml (original)
+++ hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml Sun Nov 30 01:37:46 2008
@@ -1679,21 +1679,26 @@
         <title>Other Useful Features</title>
  
         <section>
-          <title>Submitting Jobs to a Queue</title>
-          <p>Some job schedulers supported in Hadoop, like the 
-            <a href="capacity_scheduler.html">Capacity
-            Scheduler</a>, support multiple queues. If such a scheduler is
-            being used, users can submit jobs to one of the queues
-            administrators would have defined in the
-            <em>mapred.queue.names</em> property of the Hadoop site
-            configuration. The queue name can be specified through the
-            <em>mapred.job.queue.name</em> property, or through the
-            <a href="ext:api/org/apache/hadoop/mapred/jobconf/setqueuename">setQueueName(String)</a>
-            API. Note that administrators may choose to define ACLs
-            that control which queues a job can be submitted to by a
-            given user. In that case, if the job is not submitted
-            to one of the queues where the user has access,
-            the job would be rejected.</p>
+          <title>Submitting Jobs to Queues</title>
+          <p>Users submit jobs to Queues. Queues, as collections of jobs, 
+          allow the system to provide specific functionality. For example, 
+          queues use ACLs to control which users 
+          can submit jobs to them. Queues are expected to be primarily 
+          used by Hadoop Schedulers. </p> 
+
+          <p>Hadoop comes configured with a single mandatory queue, called 
+          'default'. Queue names are defined in the 
+          <code>mapred.queue.names</code> property of the Hadoop site
+          configuration. Some job schedulers, such as the 
+          <a href="capacity_scheduler.html">Capacity Scheduler</a>, 
+          support multiple queues.</p>
+          
+          <p>A job defines the queue it needs to be submitted to through the
+          <code>mapred.job.queue.name</code> property, or through the
+          <a href="ext:api/org/apache/hadoop/mapred/jobconf/setqueuename">setQueueName(String)</a>
+          API. Setting the queue name is optional. If a job is submitted 
+          without an associated queue name, it is submitted to the 'default' 
+          queue.</p> 
         </section>
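
         For reference, a minimal sketch (not part of this commit) of
         submitting a job to a named queue via the JobConf API described
         above; the class name, the queue name "myqueue" and the omitted
         job setup are illustrative only:

           import org.apache.hadoop.mapred.JobClient;
           import org.apache.hadoop.mapred.JobConf;

           public class QueueSubmitExample {
             public static void main(String[] args) throws Exception {
               JobConf conf = new JobConf(QueueSubmitExample.class);
               conf.setJobName("queue-example");
               // ... mapper/reducer and input/output setup omitted ...

               // Route the job to a named queue; "myqueue" must be listed
               // in mapred.queue.names in the site configuration.
               conf.setQueueName("myqueue");
               // Equivalent property-based form:
               // conf.set("mapred.job.queue.name", "myqueue");
               // With neither set, the job goes to the 'default' queue.

               JobClient.runJob(conf);
             }
           }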
         <section>
           <title>Counters</title>
@@ -1893,40 +1898,41 @@
         
         <section>
           <title>Debugging</title>
-          <p>Map/Reduce framework provides a facility to run user-provided 
-          scripts for debugging. When map/reduce task fails, user can run 
-          script for doing post-processing on task logs i.e task's stdout,
-          stderr, syslog and jobconf. The stdout and stderr of the
-          user-provided debug script are printed on the diagnostics. 
-          These outputs are also displayed on job UI on demand. </p>
+          <p>The Map/Reduce framework provides a facility to run user-provided 
+          scripts for debugging. When a map/reduce task fails, a user can run 
+          a debug script to, for example, process the task logs. The script is 
+          given access to the task's stdout and stderr outputs, syslog and 
+          jobconf. The output from the debug script's stdout and stderr is 
+          displayed on the console diagnostics and also as part of the 
+          job UI. </p>
 
-          <p> In the following sections we discuss how to submit debug script
-          along with the job. For submitting debug script, first it has to
-          distributed. Then the script has to supplied in Configuration. </p>
+          <p> In the following sections we discuss how to submit a debug script
+          with a job. The script file needs to be distributed and submitted to 
+          the framework.</p>
           <section>
-          <title> How to distribute script file: </title>
+          <title> How to distribute the script file: </title>
           <p>
-          The user has to use 
+          The user needs to use  
           <a href="mapred_tutorial.html#DistributedCache">DistributedCache</a>
-          mechanism to <em>distribute</em> and <em>symlink</em> the
-          debug script file.</p>
+          to <em>distribute</em> and <em>symlink</em> the script file.</p>
           </section>
           <section>
-          <title> How to submit script: </title>
-          <p> A quick way to submit debug script is to set values for the 
-          properties "mapred.map.task.debug.script" and 
-          "mapred.reduce.task.debug.script" for debugging map task and reduce
-          task respectively. These properties can also be set by using APIs 
+          <title> How to submit the script: </title>
+          <p> A quick way to submit the debug script is to set values for the 
+          properties <code>mapred.map.task.debug.script</code> and 
+          <code>mapred.reduce.task.debug.script</code>, for debugging map and 
+          reduce tasks respectively. These properties can also be set by using APIs 
           <a href="ext:api/org/apache/hadoop/mapred/jobconf/setmapdebugscript">
           JobConf.setMapDebugScript(String) </a> and
           <a href="ext:api/org/apache/hadoop/mapred/jobconf/setreducedebugscript">
-          JobConf.setReduceDebugScript(String) </a>. For streaming, debug 
-          script can be submitted with command-line options -mapdebug,
-          -reducedebug for debugging mapper and reducer respectively.</p>
+          JobConf.setReduceDebugScript(String) </a>. In streaming mode, a debug 
+          script can be submitted with the command-line options 
+          <code>-mapdebug</code> and <code>-reducedebug</code>, for debugging 
+          map and reduce tasks respectively.</p>
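
         For reference, a minimal sketch (not part of this commit) combining
         the two steps above: distribution via DistributedCache and
         registration through the JobConf debug-script properties. The HDFS
         path and the "#debugscript" symlink name are made-up examples:

           import java.net.URI;
           import org.apache.hadoop.filecache.DistributedCache;
           import org.apache.hadoop.mapred.JobConf;

           public class DebugScriptSetup {
             public static void configure(JobConf conf) throws Exception {
               // Distribute the script and symlink it into the task's
               // working directory as "debugscript".
               DistributedCache.createSymlink(conf);
               DistributedCache.addCacheFile(
                   new URI("hdfs://namenode/scripts/debug.sh#debugscript"), conf);

               // Run the symlinked script when a map/reduce task fails.
               conf.setMapDebugScript("./debugscript");
               conf.setReduceDebugScript("./debugscript");
               // Equivalent property-based form:
               // conf.set("mapred.map.task.debug.script", "./debugscript");
               // conf.set("mapred.reduce.task.debug.script", "./debugscript");
             }
           }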
             
-          <p>The arguments of the script are task's stdout, stderr, 
+          <p>The arguments to the script are the task's stdout, stderr, 
           syslog and jobconf files. The debug command, run on the node where
-          the map/reduce failed, is: <br/>
+          the map/reduce task failed, is: <br/>
           <code> $script $stdout $stderr $syslog $jobconf </code> </p> 
 
           <p> Pipes programs have the c++ program name as a fifth argument
@@ -2003,67 +2009,62 @@
         
         <section>
           <title>Skipping Bad Records</title>
-          <p>Hadoop provides an optional mode of execution in which the bad 
-          records are detected and skipped in further attempts. 
-          Applications can control various settings via 
+          <p>Hadoop provides an option where a certain set of bad input 
+          records can be skipped when processing map inputs. Applications 
+          can control this feature through the  
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords">
-          SkipBadRecords</a>.</p>
+          SkipBadRecords</a> class.</p>
           
-          <p>This feature can be used when map/reduce tasks crashes 
-          deterministically on certain input. This happens due to bugs in the 
-          map/reduce function. The usual course would be to fix these bugs. 
-          But sometimes this is not possible; perhaps the bug is in third party 
-          libraries for which the source code is not available. Due to this, 
-          the task never reaches to completion even with multiple attempts and 
-          complete data for that task is lost.</p>
+          <p>This feature can be used when map tasks crash deterministically 
+          on certain input. This usually happens due to bugs in the 
+          map function. Typically, the user would fix these bugs. 
+          Sometimes, however, this is not possible; the bug may be in 
+          third-party libraries, for example, for which the source code is 
+          not available. In such cases, the task never completes successfully 
+          even after multiple attempts, and the job fails. With this feature, 
+          only a small portion of data surrounding the 
+          bad records is lost, which may be acceptable for some applications 
+          (those performing statistical analysis on very large data, for 
+          example). </p>
 
-          <p>With this feature, only a small portion of data is lost surrounding 
-          the bad record. This may be acceptable for some user applications; 
-          for example applications which are doing statistical analysis on 
-          very large data. By default this feature is disabled. For turning it 
-          on refer <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setmappermaxskiprecords">
+          <p>By default this feature is disabled. To enable it, 
+          refer to <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setmappermaxskiprecords">
           SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setreducermaxskipgroups">
           SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>.
           </p>
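
         For reference, a minimal sketch of enabling skipping mode with the
         SkipBadRecords calls above; the numeric values are illustrative only:

           import org.apache.hadoop.mapred.JobConf;
           import org.apache.hadoop.mapred.SkipBadRecords;

           public class SkipModeSetup {
             public static void configure(JobConf conf) {
               // Tolerate at most 1 skipped record per bad map input and
               // 1 skipped group per bad reduce input.
               SkipBadRecords.setMapperMaxSkipRecords(conf, 1L);
               SkipBadRecords.setReducerMaxSkipGroups(conf, 1L);
               // Enter 'skipping mode' after 2 failed task attempts.
               SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
               // Leave room for the narrowing re-executions described below.
               conf.setMaxMapAttempts(8);
               conf.setMaxReduceAttempts(8);
             }
           }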
  
-          <p>The skipping mode gets kicked off after certain no of failures
+          <p>With this feature enabled, the framework gets into 'skipping 
+          mode' after a certain number of map failures. For more details, 
           see <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setattemptsTostartskipping">
-          SkipBadRecords.setAttemptsToStartSkipping(Configuration, int)</a>.
-          </p>
- 
-          <p>In the skipping mode, the map/reduce task maintains the record 
-          range which is getting processed at all times. For maintaining this 
-          range, the framework relies on the processed record 
-          counter. see <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/counter_map_processed_records">
+          SkipBadRecords.setAttemptsToStartSkipping(Configuration, int)</a>. 
+          In 'skipping mode', map tasks maintain the range of records being 
+          processed. To do this, the framework relies on the processed record 
+          counter. See <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/counter_map_processed_records">
           SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS</a> and 
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/counter_reduce_processed_groups">
           SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS</a>. 
-          Based on this counter, the framework knows that how 
-          many records have been processed successfully by mapper/reducer.
-          Before giving the 
-          input to the map/reduce function, it sends this record range to the 
-          Task tracker. If task crashes, the Task tracker knows which one was 
-          the last reported range. On further attempts that range get skipped.
-          </p>
+          This counter enables the framework to know how many records have 
+          been processed successfully, and hence, what record range caused 
+          a task to crash. On further attempts, this range of records is 
+          skipped.</p>
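
         For reference, a minimal old-API mapper sketch showing the processed
         record counter being incremented after each record, as recommended
         below; the key/value types and the record processing itself are
         placeholders:

           import java.io.IOException;
           import org.apache.hadoop.io.LongWritable;
           import org.apache.hadoop.io.Text;
           import org.apache.hadoop.mapred.MapReduceBase;
           import org.apache.hadoop.mapred.Mapper;
           import org.apache.hadoop.mapred.OutputCollector;
           import org.apache.hadoop.mapred.Reporter;
           import org.apache.hadoop.mapred.SkipBadRecords;

           public class CountingMapper extends MapReduceBase
               implements Mapper<LongWritable, Text, Text, LongWritable> {
             public void map(LongWritable key, Text value,
                 OutputCollector<Text, LongWritable> output, Reporter reporter)
                 throws IOException {
               // ... process the record (may crash on bad input) ...

               // Report the record as successfully processed, so the
               // framework can pin down the failing range precisely.
               reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
                   SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
             }
           }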
      
-          <p>The number of records skipped for a single bad record depends on 
-          how frequent, the processed counters are incremented by the application. 
-          It is recommended to increment the counter after processing every 
-          single record. However in some applications this might be difficult as 
-          they may be batching up their processing. In that case, the framework 
-          might skip more records surrounding the bad record. If users want to 
-          reduce the number of records skipped, then they can specify the 
-          acceptable value using 
+          <p>The number of records skipped depends on how frequently the 
+          processed record counter is incremented by the application. 
+          It is recommended that this counter be incremented after every 
+          record is processed. This may not be possible in some applications 
+          that typically batch their processing. In such cases, the framework 
+          may skip additional records surrounding the bad record. Users can 
+          control the number of skipped records through 
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setmappermaxskiprecords">
           SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setreducermaxskipgroups">
           SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>. 
-          The framework tries to narrow down the skipped range by employing the 
-          binary search kind of algorithm during task re-executions. The skipped
-          range is divided into two halves and only one half get executed. 
-          Based on the subsequent failure, it figures out which half contains 
-          the bad record. This task re-execution will keep happening till 
+          The framework tries to narrow the range of skipped records using a 
+          binary search-like approach. The skipped range is divided into two 
+          halves and only one half gets executed. On subsequent 
+          failures, the framework figures out which half contains 
+          bad records. A task will be re-executed until the
           acceptable skipped value is met or all task attempts are exhausted.
           To increase the number of task attempts, use
           <a href="ext:api/org/apache/hadoop/mapred/jobconf/setmaxmapattempts">
@@ -2072,9 +2073,8 @@
           JobConf.setMaxReduceAttempts(int)</a>.
           </p>
           
-          <p>The skipped records are written to the hdfs in the sequence file 
-          format, which could be used for later analysis. The location of 
-          skipped records output path can be changed by 
+          <p>Skipped records are written to HDFS in the sequence file 
+          format, for later analysis. The location can be changed through 
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setskipoutputpath">
           SkipBadRecords.setSkipOutputPath(JobConf, Path)</a>.
           </p>
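
         For reference, a one-line sketch of redirecting the skipped-record
         output; the path is a made-up example:

           import org.apache.hadoop.fs.Path;
           import org.apache.hadoop.mapred.JobConf;
           import org.apache.hadoop.mapred.SkipBadRecords;

           public class SkipOutputSetup {
             public static void configure(JobConf conf) {
               // Skipped records land here as sequence files for analysis.
               SkipBadRecords.setSkipOutputPath(conf, new Path("/debug/skipped"));
             }
           }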