Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/10/07 00:05:19 UTC

[Hadoop Wiki] Trivial Update of "TestFaqPage" by SomeOtherAccount

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "TestFaqPage" page has been changed by SomeOtherAccount.
http://wiki.apache.org/hadoop/TestFaqPage?action=diff&rev1=2&rev2=3

--------------------------------------------------

   * [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/c++/libhdfs|libhdfs]], a JNI-based C API for talking to hdfs (only).
   * [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html|Hadoop Pipes]], a [[http://www.swig.org/|SWIG]]-compatible  C++ API (non-JNI) to write map-reduce jobs.
  
- <<BR>> <<Anchor(2.2)>> '''2. [[#A2.2|What is the Distributed Cache used for?]]'''
+ == What is the Distributed Cache used for? ==
  
  The distributed cache is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) onto the slave node before any tasks for the job are executed on that node. The files are only copied once per job, and so should not be modified by the application.
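  
  A minimal sketch of using the distributed cache with the old org.apache.hadoop.mapred API (the class name, file name and URI below are illustrative):
  
  {{{
  import java.net.URI;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.mapred.JobConf;
  
  public class CacheExample {
    public static void main(String[] args) {
      JobConf conf = new JobConf(CacheExample.class);
      // At submission time: register a read-only file that every task will need.
      DistributedCache.addCacheFile(URI.create("hdfs:///data/lookup.dat"), conf);
      // ... set input/output paths, mapper, reducer, then JobClient.runJob(conf)
    }
  
    // Inside a task, e.g. in Mapper.configure(JobConf job):
    //   Path[] cached = DistributedCache.getLocalCacheFiles(job);
    //   // cached[0] is the node-local copy of lookup.dat
  }
  }}}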
  
- <<BR>> <<Anchor(2.3)>> '''3. [[#A2.3|Can I write create/write-to hdfs files directly from my map/reduce tasks?]]'''
+ == Can I create/write-to HDFS files directly from my map/reduce tasks? ==
  
  Yes. (Clearly, you want this since you need to create/write-to files other than the output-file written out by [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html|OutputCollector]].)
  
  Caveats:
  
- <glossary>
- 
  ${mapred.output.dir} is the eventual output directory for the job ([[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)|JobConf.setOutputPath]] / [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath()|JobConf.getOutputPath]]).
  
  ${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0); a TIP is the set of all ${taskid}s (attempts) of one task (e.g. task_200709221812_0001_m_000000).
  
- </glossary>
- 
  With ''speculative-execution'' '''on''', one could face issues with two instances of the same TIP (running simultaneously) trying to open and write to the same file (path) on HDFS. Hence the application-writer will have to pick unique names per task-attempt (e.g. using the complete taskid, i.e. task_200709221812_0001_m_000000_0), not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)
  
  To get around this, the framework helps the application-writer out by maintaining a special '''${mapred.output.dir}/_${taskid}''' sub-dir on HDFS for each task-attempt, where the output of the reduce task-attempt goes. On successful completion of the task-attempt, the files in ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directories of unsuccessful task-attempts. This is completely transparent to the application.
@@ -125, +121 @@

  
  The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since the output of the map, in that case, goes directly to HDFS.
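  
  For example, a task that needs to write a side-file can put it under the task-attempt's work directory; a sketch assuming the old org.apache.hadoop.mapred API (the file name and contents are illustrative):
  
  {{{
  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  
  public class SideFileWriter {
    // Write a side-file into the per-attempt scratch directory (the
    // ${mapred.output.dir}/_${taskid} sub-dir described above), so speculative
    // attempts cannot clobber each other; the framework promotes the file to
    // ${mapred.output.dir} only if this attempt succeeds.
    public static void writeSideFile(JobConf job) throws IOException {
      Path workDir = FileOutputFormat.getWorkOutputPath(job);
      Path sideFile = new Path(workDir, "my-side-file");   // illustrative name
      FileSystem fs = sideFile.getFileSystem(job);
      FSDataOutputStream out = fs.create(sideFile);
      out.writeUTF("side data");
      out.close();
    }
  }
  }}}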
  
- <<BR>> <<Anchor(2.4)>> '''4. [[#A2.4|How do I get each of my maps to work on one complete input-file and not allow the framework to split-up my files?]]'''
+ == How do I get each of my maps to work on one complete input-file and not allow the framework to split up my files? ==
  
  Essentially, a job's input is represented by the [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html|InputFormat]] (interface) / [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html|FileInputFormat]] (base class).
  
@@ -137, +133 @@

  
  The other, quick-fix option is to set [[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.min.split.size|mapred.min.split.size]] to a large enough value.
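  
  A sketch of the first option, assuming the old org.apache.hadoop.mapred API (the class name is illustrative): subclass an existing file-based input format and override isSplitable to return false.
  
  {{{
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;
  
  // Each map gets one complete input file, regardless of file size.
  public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }
  }}}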
  
- <<BR>> <<Anchor(2.5)>> '''5. [[#A2.5|Why I do see broken images in jobdetails.jsp page?]]'''
+ == Why do I see broken images in the jobdetails.jsp page? ==
  
  Map/reduce task completion graphs were added in hadoop-0.15. The graphs are produced as SVG (Scalable Vector Graphics) images, which are basically XML files embedded in HTML content. The graphs have been tested successfully in Firefox 2 on Ubuntu and Mac OS. For other browsers, one should install an additional plugin to be able to see the SVG images. Adobe's SVG Viewer can be found at http://www.adobe.com/svg/viewer/install/.
  
- <<BR>> <<Anchor(2.6)>> '''6. [[#A2.6|I see a maximum of 2 maps/reduces spawned concurrently on each TaskTracker, how do I increase that?]]'''
+ == I see a maximum of 2 maps/reduces spawned concurrently on each TaskTracker; how do I increase that? ==
  
  Use the configuration knobs [[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.map.tasks.maximum|mapred.tasktracker.map.tasks.maximum]] and [[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.reduce.tasks.maximum|mapred.tasktracker.reduce.tasks.maximum]] to control the number of maps/reduces spawned simultaneously on a !TaskTracker. By default each is set to ''2'', hence one sees a maximum of 2 maps and 2 reduces at any given time on a !TaskTracker.
  
  You can set these on a per-tasktracker basis to accurately reflect your hardware (i.e. set them to higher values on a beefier tasktracker, etc.).
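  
  For example, in a !TaskTracker's hadoop-site.xml (or mapred-site.xml in newer releases); the values below are illustrative and should match your hardware:
  
  {{{
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  }}}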
  
- <<BR>> <<Anchor(2.7)>> '''7. [[#A2.7|Submitting map/reduce jobs as a different user doesn't work.]]'''
+ == Submitting map/reduce jobs as a different user doesn't work. ==
  
  The problem is that you haven't configured your map/reduce system directory to a fixed value. The default works for single-node systems, but not for "real" clusters. I like to use:
  
@@ -159, +155 @@

     </description>
  </property>
  }}}
- Note that this directory is in your default file system and must be   accessible from both the client and server machines and is typically   in HDFS.
+ Note that this directory is in your default file system and must be accessible from both the client and server machines; it is typically in HDFS.
  
- <<BR>> <<Anchor(2.8)>> '''8. [[#A2.8|How do Map/Reduce InputSplit's handle record boundaries correctly?]]'''
+ == How do Map/Reduce InputSplits handle record boundaries correctly? ==
  
  It is the responsibility of the InputSplit's RecordReader to start and end at a record boundary. For SequenceFiles, a 20-byte '''sync''' mark is written between records roughly every 2k bytes. These sync marks allow the RecordReader, given an InputSplit (which contains a file, an offset and a length), to seek to the first sync mark after the start of the split. The RecordReader then continues processing records until it reaches the first sync mark after the end of the split. The first split of each file naturally starts immediately and not after the first sync mark. In this way, it is guaranteed that each record will be processed by exactly one mapper.
  
  Text files are handled similarly, using newlines instead of sync marks.
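  
  A simplified sketch of the SequenceFile case, using the org.apache.hadoop.io.SequenceFile.Reader API (the real SequenceFileRecordReader adds more bookkeeping):
  
  {{{
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.FileSplit;
  
  public class SyncMarkSketch {
    public static void readSplit(FileSplit split, Configuration conf,
                                 Writable key, Writable value) throws IOException {
      FileSystem fs = split.getPath().getFileSystem(conf);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, split.getPath(), conf);
      long end = split.getStart() + split.getLength();
      if (split.getStart() > 0) {
        reader.sync(split.getStart());  // skip to the first sync mark after the split start
      }
      // Read every record that starts before the end of the split; a record that
      // straddles 'end' is read here and skipped by the reader of the next split.
      while (reader.getPosition() < end && reader.next(key, value)) {
        // process (key, value)
      }
      reader.close();
    }
  }
  }}}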
  
- <<BR>> <<Anchor(2.9)>> '''9. [[#A2.9|How do I change final output file name with the desired name rather than in partitions like part-00000, part-00001 ?]]'''
+ == How do I give the final output files a desired name rather than the default partition names like part-00000, part-00001? ==
  
  You can subclass the [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/OutputFormat.java?view=markup|OutputFormat.java]] class and write your own. You can look at the code of [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/TextOutputFormat.java?view=markup|TextOutputFormat]], [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/MultipleOutputFormat.java?view=markup|MultipleOutputFormat.java]], etc. for reference. Often you only need to make minor changes to one of the existing !OutputFormat classes; in that case you can just subclass it and override the methods you need to change, as in the sketch below.
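  
  For instance, a minimal sketch (old org.apache.hadoop.mapred API; the "data" prefix is purely illustrative) that renames the per-partition files while reusing everything else from TextOutputFormat:
  
  {{{
  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordWriter;
  import org.apache.hadoop.mapred.TextOutputFormat;
  import org.apache.hadoop.util.Progressable;
  
  public class RenamedTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
                                              String name, Progressable progress)
        throws IOException {
      // 'name' arrives as e.g. "part-00000"; keep the partition number, change the prefix.
      return super.getRecordWriter(ignored, job, name.replace("part", "data"), progress);
    }
  }
  }}}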
  
- <<BR>> <<Anchor(2.10)>> ''10. [[#A2.10|When writing a New InputFormat, what is the format for the array of string returned by InputSplit\#getLocations()?]]''
+ == When writing a new InputFormat, what is the format of the array of strings returned by InputSplit\#getLocations()? ==
  
  The strings are host names. It appears that DatanodeID.getHost() is the standard place to retrieve this name, and the machineName variable, populated in DataNode.java\#startDataNode, is where the name is first set. The first method attempted is to get "slave.host.name" from the configuration; if that is not available, DNS.getDefaultHost is used instead.
  
- <<BR>> <<Anchor(2.11)>> '''11. [[#A2.11|How do you gracefully stop a running job?]]'''
+ == How do you gracefully stop a running job? ==
  
+ {{{
  hadoop job -kill JOBID
+ }}}
  
+ = HDFS =
- 
- <<BR>> <<Anchor(3)>> [[#A3|HDFS]]
  
  <<BR>> <<Anchor(3.1)>> '''1. [[#A3.1|If I add new data-nodes to the cluster will HDFS move the blocks to the newly added nodes in order to balance disk space utilization between the nodes?]]'''