Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2011/04/20 11:39:41 UTC

[Hadoop Wiki] Update of "FAQ" by ChristophSchmitz

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "FAQ" page has been changed by ChristophSchmitz.
http://wiki.apache.org/hadoop/FAQ?action=diff&rev1=95&rev2=96

--------------------------------------------------

  $ bin/hadoop-daemon.sh start datanode
  $ bin/hadoop-daemon.sh start tasktracker
  }}}
- 
  If you are using the dfs.include/mapred.include functionality, you will also need to add the node to the dfs.include/mapred.include file, then issue {{{hadoop dfsadmin -refreshNodes}}} and {{{hadoop mradmin -refreshNodes}}} so that the NameNode and JobTracker are aware of the newly added node.
  
  == Is there an easy way to see the status and health of a cluster? ==
@@ -92, +91 @@

   * [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html|Hadoop Pipes]], a [[http://www.swig.org/|SWIG]]-compatible  C++ API (non-JNI) to write map-reduce jobs.
  
  == How do I submit extra content (jars, static files, etc) for my job to use during runtime? ==
- 
  The [[http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html|distributed cache]] feature is used to distribute large, read-only files needed by map/reduce jobs across the cluster. The framework copies the necessary files from a URL (either hdfs: or http:) onto the slave node before any tasks for the job are executed on that node. The files are copied only once per job and so should not be modified by the application.
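  For example, a minimal sketch of registering a read-only HDFS file with the distributed cache, using the old-style {{{org.apache.hadoop.mapred}}} API (the HDFS path and class name below are illustrative, not taken from this FAQ):
  {{{
  import java.net.URI;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.mapred.JobConf;
  
  public class CacheExample {
    public static void main(String[] args) throws Exception {
      JobConf job = new JobConf();
      // Register a read-only file that already lives in HDFS; the framework copies it
      // to each slave node once per job before any of the job's tasks run there.
      DistributedCache.addCacheFile(new URI("/user/me/lookup.dat"), job);
      // ... set input/output paths and mapper/reducer classes, then submit the job.
      // Inside a running task, DistributedCache.getLocalCacheFiles(conf) returns the
      // local paths of the copied files.
    }
  }
  }}}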
  
  For streaming, see the HadoopStreaming wiki for more information.
@@ -101, +99 @@

  
  == How do I get my MapReduce Java Program to read the Cluster's set configuration and not just defaults? ==
  The configuration property files ({core|mapred|hdfs}-site.xml) that are available in the various '''conf/''' directories of your Hadoop installation need to be on the '''CLASSPATH''' of your Java application for them to be found and applied. Another way of ensuring that no set configuration gets overridden by any job is to mark those properties as final; for example:
+ 
  {{{
  <name>mapreduce.task.io.sort.mb</name>
  <value>400</value>
  <final>true</final>
  }}}
- 
  Marking configuration properties as final is a common thing administrators do, as noted in the [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/conf/Configuration.html|Configuration]] API docs.
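  As a rough illustration of the CLASSPATH approach above, a small driver can load the cluster's site files into a Configuration and confirm what value a property actually resolves to (the {{{/etc/hadoop/conf}}} paths below are an assumption; substitute your installation's conf/ directory):
  {{{
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  
  public class ShowClusterConf {
    public static void main(String[] args) {
      // new Configuration() picks up core-default.xml and core-site.xml from the CLASSPATH.
      Configuration conf = new Configuration();
      // Alternatively, point at the site files explicitly (paths are illustrative):
      conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
      conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
      System.out.println("mapreduce.task.io.sort.mb = " + conf.get("mapreduce.task.io.sort.mb"));
    }
  }
  }}}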
  
  A better alternative would be to have a service serve up the cluster's configuration to you upon request, in code. [[https://issues.apache.org/jira/browse/HADOOP-5670|HADOOP-5670]] may be of some interest in this regard.
@@ -122, +120 @@

  
  With ''speculative-execution'' '''on''', one could face issues with two instances of the same TIP (running simultaneously) trying to open or write to the same file (path) on hdfs. Hence the application-writer will have to pick unique names (e.g. using the complete taskid, i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)
  
- To get around this the framework helps the application-writer out by maintaining a special '''${mapred.output.dir}/_${taskid}''' sub-dir for each task-attempt on hdfs where the output of the reduce task-attempt goes. On successful completion of the task-attempt the files in the ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.
+ To get around this, the framework helps the application-writer by maintaining a special '''${mapred.output.dir}/_${taskid}''' sub-dir on hdfs for each reduce task-attempt, where the output of that task-attempt goes. On successful completion of the task-attempt, the files in ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.
  
  The application-writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of the reduce-task, and the framework will move them out similarly; thus there is no need to pick unique paths per task-attempt.
  
- Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_{$taskid}, not the value set by [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)|JobConf.setOutputPath]]. ''So, just create any hdfs files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.''
+ Fine-print: the value of ${mapred.output.dir} during execution of a particular ''reduce'' task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)|JobConf.setOutputPath]]. ''So, just create any hdfs files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.''
+ 
+ For ''map'' task attempts, the automatic substitution of ${mapred.output.dir}/_${taskid} for ${mapred.output.dir} does not take place. You can still access the map task attempt directory, though, by using FileOutputFormat.getWorkOutputPath(TaskInputOutputContext). Files created there will be dealt with as described above.
  
  The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to hdfs.
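  As a rough sketch of the above using the newer {{{org.apache.hadoop.mapreduce}}} API (the class name and side-file name are illustrative), a reduce task can write a side-file into its task-attempt work directory, and the framework promotes it to the job output directory only if that attempt succeeds:
  {{{
  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  
  public class SideFileReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      // Per-attempt work directory; its contents are moved to the job output
      // directory only when this task-attempt completes successfully.
      Path workDir = FileOutputFormat.getWorkOutputPath(context);
      FileSystem fs = workDir.getFileSystem(context.getConfiguration());
      FSDataOutputStream out = fs.create(new Path(workDir, "side-file.txt"));
      out.writeBytes("per-attempt side output\n");
      out.close();
    }
  }
  }}}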
  
@@ -281, +281 @@

  = Platform Specific =
  == Mac OS X ==
  === Building on Mac OS X 10.6 ===
- 
  Be aware that Apache Hadoop 0.22 and earlier require Apache Forrest to build the documentation.  As of Snow Leopard, Apple no longer ships Java 1.5, which Apache Forrest requires.  Java 1.5 can be installed by either copying /System/Library/Frameworks/JavaVM.Framework/Versions/1.5 and 1.5.0 from a 10.5 machine or by using a utility like Pacifist to install from an official Apple package. http://chxor.chxo.com/post/183013153/installing-java-1-5-on-snow-leopard provides some step-by-step directions.
- 
  
  == Solaris ==
  === Why do files and directories show up as DrWho and/or user names are missing/weird? ===