Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2007/09/22 09:25:14 UTC

[Lucene-hadoop Wiki] Update of "FAQ" by Arun C Murthy

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by Arun C Murthy:
http://wiki.apache.org/lucene-hadoop/FAQ

The comment on the change is:
Added a section on how to create/write-to side-files via map/reduce tasks

------------------------------------------------------------------------------
  
  The distributed cache is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a url (either hdfs: or http:) on to the slave node before any tasks for the job are executed on that node. The files are only copied once per job and so should not be modified by the application.
  
+ == 9. Can I create/write-to hdfs files directly from my map/reduce tasks? ==
+ 
+ Yes. (Clearly, you want this since you need to create/write-to files other than the output-file written out by [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/OutputCollector.html OutputCollector].)
+ 
+ Caveats:
+ 
+ <glossary>
+ 
+ ${mapred.output.dir} is the eventual output directory for the job ([http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path) JobConf.setOutputPath] / [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath() JobConf.getOutputPath]).
+ 
+ ${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0), a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).
+ 
+ </glossary>
+ 
+ With ''speculative-execution'' '''on''', one could face issues with two instances of the same TIP (running simultaneously) trying to open/write-to the same file (path) on hdfs. Hence the app-writer will have to pick unique names (e.g. using the complete taskid i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)
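As an illustration of why per-TIP names collide, the relationship between a TIP id and its attempt ids can be sketched in plain Java. The ids are the examples from this entry; the helper method is hypothetical, not part of the map/reduce API:

```java
// Sketch: why side-file names must be unique per task-attempt, not per TIP.
// An attempt id (e.g. ..._m_000000_0) is the TIP id plus a trailing attempt
// number, so two speculative attempts of one TIP share the TIP id but
// differ in the full attempt id.
public class UniqueSideFileNames {
    // Hypothetical helper: TIP id = attempt id with "_<attempt#>" stripped.
    static String tipId(String attemptId) {
        return attemptId.substring(0, attemptId.lastIndexOf('_'));
    }

    public static void main(String[] args) {
        String attempt0 = "task_200709221812_0001_m_000000_0";
        String attempt1 = "task_200709221812_0001_m_000000_1";

        // Naming by TIP id collides across speculative attempts:
        System.out.println(tipId(attempt0).equals(tipId(attempt1)));
        // Naming by the complete attempt id does not:
        System.out.println(attempt0.equals(attempt1));
    }
}
```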
+ 
+ To get around this, the framework helps the application-writer out by maintaining a special '''${mapred.output.dir}/_${taskid}''' sub-dir for each task-attempt on hdfs where the output of the reduce task-attempt goes. On successful completion of the task-attempt, the files in ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.
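The promotion step above can be simulated on the local filesystem with java.nio.file as a stand-in for hdfs (directory and file names below are illustrative, not the framework's actual code):

```java
import java.io.IOException;
import java.nio.file.*;

// Local-filesystem sketch of the promotion the framework performs on hdfs:
// a task-attempt writes into ${mapred.output.dir}/_${taskid}, and on
// successful completion its files are moved up into ${mapred.output.dir}.
public class SideFilePromotion {
    public static void main(String[] args) throws IOException {
        Path outputDir = Files.createTempDirectory("mapred-output-dir");
        String taskid = "task_200709221812_0001_m_000000_0";

        // The task-attempt writes its side-file into its private sub-dir.
        Path attemptDir = Files.createDirectory(outputDir.resolve("_" + taskid));
        Files.writeString(attemptDir.resolve("side-file"), "partial output\n");

        // On success, the framework moves the attempt's files up...
        try (DirectoryStream<Path> files = Files.newDirectoryStream(attemptDir)) {
            for (Path f : files) {
                Files.move(f, outputDir.resolve(f.getFileName()));
            }
        }
        Files.delete(attemptDir); // ...and discards the now-empty sub-dir.

        System.out.println(Files.exists(outputDir.resolve("side-file")));
        System.out.println(Files.exists(attemptDir));
    }
}
```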
+ 
+ The app-writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of his reduce-task, and the framework will move them out similarly - thus you don't have to pick unique paths per task-attempt.
+ 
+ Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path) JobConf.setOutputPath]. ''So, just create any hdfs files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.''
+ 
+ The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to hdfs.
+