Posted to commits@oozie.apache.org by ka...@apache.org on 2012/02/07 21:37:41 UTC

svn commit: r1241607 [2/3] - in /incubator/oozie/site/publish: ./ images/

Added: incubator/oozie/site/publish/images/step4.png
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/images/step4.png?rev=1241607&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/oozie/site/publish/images/step4.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/oozie/site/publish/images/step5.png
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/images/step5.png?rev=1241607&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/oozie/site/publish/images/step5.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/oozie/site/publish/images/step6.png
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/images/step6.png?rev=1241607&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/oozie/site/publish/images/step6.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/oozie/site/publish/images/wc1.png
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/images/wc1.png?rev=1241607&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/oozie/site/publish/images/wc1.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/oozie/site/publish/images/wc2.png
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/images/wc2.png?rev=1241607&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/oozie/site/publish/images/wc2.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/oozie/site/publish/images/wc3.png
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/images/wc3.png?rev=1241607&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/oozie/site/publish/images/wc3.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/oozie/site/publish/images/wc4.png
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/images/wc4.png?rev=1241607&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/oozie/site/publish/images/wc4.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/oozie/site/publish/images/wc5.png
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/images/wc5.png?rev=1241607&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/oozie/site/publish/images/wc5.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/oozie/site/publish/images/wc6.png
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/images/wc6.png?rev=1241607&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/oozie/site/publish/images/wc6.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Modified: incubator/oozie/site/publish/index.html
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/index.html?rev=1241607&r1=1241606&r2=1241607&view=diff
==============================================================================
--- incubator/oozie/site/publish/index.html (original)
+++ incubator/oozie/site/publish/index.html Tue Feb  7 20:37:39 2012
@@ -1,8 +1,8 @@
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
-<!-- Generated by Apache Maven Doxia at Jan 26, 2012 -->
+<!-- Generated by Apache Maven Doxia at Feb 7, 2012 -->
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
-    <title>Apache Oozie Workflow Scheduler for Hadoop</title>
+    <title>Apache Oozie - Apache Oozie Workflow Scheduler for Hadoop</title>
     <style type="text/css" media="all">
       @import url("./css/maven-base.css");
       @import url("./css/maven-theme.css");
@@ -10,8 +10,7 @@
     </style>
     <link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
         <meta name="author" content="$maven.build.timestamp" />
-        <meta name="Date-Revision-yyyymmdd" content="20120126" />
-    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
       </head>
   <body class="composite">
     <div id="banner">
@@ -25,10 +24,10 @@
     <div id="breadcrumbs">
             
                                 <div class="xleft">
-        Last Published: 2012-01-26
+        Last Published: 2012-02-07
                           |                   <a href="index.html">Apache Oozie</a>
         &gt;
-    Apache Oozie Workflow Scheduler for Hadoop
+    Apache Oozie - Apache Oozie Workflow Scheduler for Hadoop
               </div>
             <div class="xright">            <a href="http://www.apache.org/" class="externalLink">ASF</a>
               
@@ -79,6 +78,15 @@
                   <li class="none">
                   <a href="./QuickStart.html">Quick start</a>
             </li>
+                  <li class="none">
+                  <a href="./overview.html">Overview</a>
+            </li>
+                  <li class="none">
+                  <a href="./map-reduce-cookbook.html">MapReduce Cookbook</a>
+            </li>
+                  <li class="none">
+                  <a href="./pig-cookbook.html">Pig Cookbook</a>
+            </li>
           </ul>
                                  <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
           <img alt="Built by Maven" src="./images/logos/maven-feather.png"/>
@@ -88,7 +96,17 @@
     </div>
     <div id="bodyColumn">
       <div id="contentBox">
-        <!-- Licensed under the Apache License, Version 2.0 (the "License"); --><!-- you may not use this file except in compliance with the License. --><!-- You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. See accompanying LICENSE file. --><div class="section"><h2>Apache Oozie(TM) Workflow Scheduler for Hadoop<a name="Apache_OozieTM_Workflow_Scheduler_for_Hadoop"></a></h2><div class="section"><h3>Overview<a name="Overview"></a></h3><p>Oozie is a workflow/coordination system to manage Apache Hadoop(TM) jobs.</p><p>Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of act
 ions.</p><p>Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availabilty.</p><p>Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (Java map-reduce, Streaming map-reduce, Pig, Distcp, etc.)</p><p>Oozie is a scalable, reliable and extensible system.</p><p>Developers interested in getting more involved with Oozie may join the <a href="./MailingLists.html">mailing lists</a>, <a href="./IssueTracking.html">report bugs</a>, retrieve code from the <a href="./VersionControl">version control system</a>, and make <a href="./HowToContribute.html">contributions</a>.</p></div></div>
+        <div class="section"><h2>Apache Oozie(TM) Workflow Scheduler for Hadoop</h2>
+<div class="section"><h3>Overview</h3>
+<p>Oozie is a workflow/coordination system to manage Apache Hadoop(TM) jobs.</p>
+<p>Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.</p>
+<p>Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.</p>
+<p>Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (Java map-reduce, Streaming map-reduce, Pig, Distcp, etc.).</p>
+<p>Oozie is a scalable, reliable and extensible system.</p>
+<p>Developers interested in getting more involved with Oozie may join the <a href="./MailingLists.html">mailing lists</a>, <a href="./IssueTracking.html">report bugs</a>, retrieve code from the <a href="./VersionControl.html">version control system</a>, and make <a href="./HowToContribute.html">contributions</a>.</p>
+</div>
+</div>
+
       </div>
     </div>
     <div class="clear">

Modified: incubator/oozie/site/publish/mailing_list.html
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/mailing_list.html?rev=1241607&r1=1241606&r2=1241607&view=diff
==============================================================================
--- incubator/oozie/site/publish/mailing_list.html (original)
+++ incubator/oozie/site/publish/mailing_list.html Tue Feb  7 20:37:39 2012
@@ -1,16 +1,15 @@
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
-<!-- Generated by Apache Maven Doxia at Jan 26, 2012 -->
+<!-- Generated by Apache Maven Doxia at Feb 7, 2012 -->
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
-    <title></title>
+    <title>Apache Oozie - </title>
     <style type="text/css" media="all">
       @import url("./css/maven-base.css");
       @import url("./css/maven-theme.css");
       @import url("./css/site.css");
     </style>
     <link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
-        <meta name="Date-Revision-yyyymmdd" content="20120126" />
-    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
       </head>
   <body class="composite">
     <div id="banner">
@@ -24,10 +23,10 @@
     <div id="breadcrumbs">
             
                                 <div class="xleft">
-        Last Published: 2012-01-26
+        Last Published: 2012-02-07
                           |                   <a href="index.html">Apache Oozie</a>
         &gt;
-    
+    Apache Oozie - 
               </div>
             <div class="xright">            <a href="http://www.apache.org/" class="externalLink">ASF</a>
               
@@ -78,6 +77,15 @@
                   <li class="none">
                   <a href="./QuickStart.html">Quick start</a>
             </li>
+                  <li class="none">
+                  <a href="./overview.html">Overview</a>
+            </li>
+                  <li class="none">
+                  <a href="./map-reduce-cookbook.html">MapReduce Cookbook</a>
+            </li>
+                  <li class="none">
+                  <a href="./pig-cookbook.html">Pig Cookbook</a>
+            </li>
           </ul>
                                  <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
           <img alt="Built by Maven" src="./images/logos/maven-feather.png"/>

Added: incubator/oozie/site/publish/map-reduce-cookbook.html
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/map-reduce-cookbook.html?rev=1241607&view=auto
==============================================================================
--- incubator/oozie/site/publish/map-reduce-cookbook.html (added)
+++ incubator/oozie/site/publish/map-reduce-cookbook.html Tue Feb  7 20:37:39 2012
@@ -0,0 +1,708 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<!-- Generated by Apache Maven Doxia at Feb 7, 2012 -->
+<html xmlns="http://www.w3.org/1999/xhtml">
+  <head>
+    <title>Apache Oozie - MapReduce Cookbook</title>
+    <style type="text/css" media="all">
+      @import url("./css/maven-base.css");
+      @import url("./css/maven-theme.css");
+      @import url("./css/site.css");
+    </style>
+    <link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
+        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+      </head>
+  <body class="composite">
+    <div id="banner">
+                  <span id="bannerLeft">
+                 
+                </span>
+                    <div class="clear">
+        <hr/>
+      </div>
+    </div>
+    <div id="breadcrumbs">
+            
+                                <div class="xleft">
+        Last Published: 2012-02-07
+                          |                   <a href="index.html">Apache Oozie</a>
+        &gt;
+    Apache Oozie - MapReduce Cookbook
+              </div>
+            <div class="xright">            <a href="http://www.apache.org/" class="externalLink">ASF</a>
+              
+                                 Version: 3.1.0-SNAPSHOT
+      </div>
+      <div class="clear">
+        <hr/>
+      </div>
+    </div>
+    <div id="leftColumn">
+      <div id="navcolumn">
+             
+                                                <h5>Project</h5>
+                  <ul>
+                  <li class="none">
+                  <a href="./index.html">Home</a>
+            </li>
+                  <li class="none">
+                  <a href="./Downloads.html">Downloads</a>
+            </li>
+                  <li class="none">
+                  <a href="./Credits.html">Credits</a>
+            </li>
+                  <li class="none">
+                  <a href="./MailingLists.html">Mailing Lists</a>
+            </li>
+                  <li class="none">
+                  <a href="./IssueTracking.html">Issue Tracking</a>
+            </li>
+                  <li class="none">
+                  <a href="./IRCChannel.html">IRC Channel</a>
+            </li>
+          </ul>
+                       <h5>Developers</h5>
+                  <ul>
+                  <li class="none">
+                  <a href="./VersionControl.html">Version Control</a>
+            </li>
+                  <li class="none">
+                  <a href="./HowToContribute.html">How To Contribute</a>
+            </li>
+                  <li class="none">
+                  <a href="HowToRelease.html">How to Release</a>
+            </li>
+          </ul>
+                       <h5>Documentation</h5>
+                  <ul>
+                  <li class="none">
+                  <a href="./QuickStart.html">Quick start</a>
+            </li>
+                  <li class="none">
+                  <a href="./overview.html">Overview</a>
+            </li>
+                  <li class="none">
+                  <a href="./map-reduce-cookbook.html">MapReduce Cookbook</a>
+            </li>
+                  <li class="none">
+                  <a href="./pig-cookbook.html">Pig Cookbook</a>
+            </li>
+          </ul>
+                                 <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
+          <img alt="Built by Maven" src="./images/logos/maven-feather.png"/>
+        </a>
+                       
+                            </div>
+    </div>
+    <div id="bodyColumn">
+      <div id="contentBox">
+        <div class="section"><h2>MapReduce Cookbook</h2>
+<p>This document describes, end to end, the procedure for running a MapReduce job using Oozie. It is intended for all users who install, use, and operate Oozie.</p>
+<p><b>NOTE</b>: This tutorial assumes GNU/Linux as the development and production platform.</p>
+<div class="section"><h3>Overview</h3>
+<p>Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on the Grid. Although MapReduce applications can be launched independently, there are clear advantages to submitting them via Oozie, such as:</p>
+<ul><li>Managing complex workflow dependencies</li>
+<li>Frequency-based execution</li>
+<li>Operational flexibility</li>
+</ul>
+</div>
+<div class="section"><h3>Running a MapReduce application without Oozie</h3>
+<p>Follow the instructions on <a class="externalLink" href="http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0">Hadoop MapReduce Tutorial</a> to run a simple MapReduce application (wordcount). This involves invoking the hadoop command to submit the MapReduce job, specifying the various options on the command line.</p>
+</div>
+<div class="section"><h3>Running a MapReduce application using Oozie</h3>
+<p>To illustrate the key differences in submitting a job using Oozie, let's compare the following two approaches.</p>
+<p><b>Hadoop Map-Reduce Job Submission</b></p>
+<div class="source"><pre>$ hadoop jar /usr/ninja/wordcount.jar org.myorg.WordCount -Dmapred.job.queue.name=queue_name /usr/ninja/wordcount/input /usr/ninja/wordcount/output</pre>
+</div>
+<p><b>Oozie Map-Reduce Job Submission</b></p>
+<p>Oozie acts as an intermediary between the user and Hadoop. The user provides the details of the job to Oozie, and Oozie executes it on Hadoop via a launcher job and then returns the results. It gives the user a way to set the parameters shown above, such as <i>mapred.job.queue.name</i>, the <i>input directory</i>, and the <i>output directory</i>, in a workflow XML file. A workflow is defined as a set of actions arranged in a DAG (Directed Acyclic Graph), as shown below:</p>
+<img src="images/MR-Dag-WF.png" alt="MapReduce WorkFlow DAG" /><p>Below are the three components required to launch a simple MapReduce workflow:</p>
+<p><b>I: Properties File</b> - <i>job.properties</i></p>
+<p>This file is present locally on the node from which the job is submitted (either a local machine or the gateway node). Its main purpose is to specify the essential parameters needed to run the workflow. One mandatory property is <i>oozie.wf.application.path</i>, which points to the HDFS location where <i>workflow.xml</i> exists. In addition, definitions for all the variables used in workflow.xml (e.g., <i>${jobTracker}</i>, <i>${inputDir}</i>, ..) can be added here. For some specific versions of Hadoop, additional authentication parameters <i>PROVIDE LINK</i> might also need to be defined. For our example, the file looks as follows:</p>
+<div class="source"><pre>nameNode=hdfs://localhost:9000    # or use a remote-server url. eg: hdfs://abc.xyz.yahoo.com:8020
+jobTracker=localhost:9001         # or use a remote-server url. eg: abc.xyz.yahoo.com:50300
+queueName=default
+examplesRoot=map-reduce
+
+oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
+inputDir=input-data
+outputDir=map-reduce</pre>
+</div>
+<p><b>II: Workflow XML</b> - <i>workflow.xml</i></p>
+<p>This file defines the workflow for the particular job as a set of actions. For our example, this file looks like below:</p>
+<div class="source"><pre>&lt;workflow-app name='wordcount-wf' xmlns=&quot;uri:oozie:workflow:0.2&quot;&gt;
+    &lt;start to='wordcount'/&gt;
+    &lt;action name='wordcount'&gt;
+        &lt;map-reduce&gt;
+            &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
+            &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
+            &lt;prepare&gt;
+            &lt;/prepare&gt;
+            &lt;configuration&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.job.queue.name&lt;/name&gt;
+                    &lt;value&gt;${queueName}&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.mapper.class&lt;/name&gt;
+                    &lt;value&gt;org.myorg.WordCount.Map&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.reducer.class&lt;/name&gt;
+                    &lt;value&gt;org.myorg.WordCount.Reduce&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.input.dir&lt;/name&gt;
+                    &lt;value&gt;${inputDir}&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.output.dir&lt;/name&gt;
+                    &lt;value&gt;${outputDir}&lt;/value&gt;
+                &lt;/property&gt;
+            &lt;/configuration&gt;
+        &lt;/map-reduce&gt;
+        &lt;ok to='end'/&gt;
+        &lt;error to='kill'/&gt;
+    &lt;/action&gt;
+    &lt;kill name='kill'&gt;
+        &lt;message&gt;${wf:errorCode(&quot;wordcount&quot;)}&lt;/message&gt;
+    &lt;/kill&gt;
+    &lt;end name='end'/&gt;
+&lt;/workflow-app&gt;</pre>
+</div>
+<p>Details of the basic XML tags used in workflow.xml (an exhaustive description of all elements is available in the Workflow XML Schema Definition <i>provide link</i>):</p>
+<ul><li><b>&lt;job-tracker&gt;</b> element is used to specify the URL of the Hadoop JobTracker.<p><i>Format</i>: jobtracker_hostname:port_number</p>
+<p><i>Example</i>: localhost:9001, abc.xyz.yahoo.com:50300</p>
+</li>
+<li><b>&lt;name-node&gt;</b> element is used to specify the URL of the Hadoop NameNode.<p><i>Format</i>: hdfs://namenode_hostname:port_number</p>
+<p><i>Example</i>: hdfs://localhost:9000, hdfs://abc.xyz.yahoo.com:8020</p>
+<p>The job tracker and namenode values need to be the same as those defined in the Hadoop configuration files; if they differ, they must be updated.</p>
+</li>
+<li><b>&lt;prepare&gt;</b> element is used to specify a list of operations to be performed before the action begins, such as deleting an existing output directory (<b>&lt;delete&gt;</b>) or creating a new one (<b>&lt;mkdir&gt;</b>). This element is, however, optional for the map-reduce action.</li>
+<li><b>&lt;configuration&gt;</b> element is used to specify key/value properties for the map-reduce job. Some common properties include:<ul><li><i>mapred.job.queue.name</i> specifies the queue name that the job will be submitted to. If not specified, the queue named <i>default</i> is assumed.</li>
+<li><i>mapred.mapper.class</i> specifies the Mapper class to be used.</li>
+<li><i>mapred.reducer.class</i> specifies the Reducer class to be used.</li>
+<li><i>mapred.input.dir</i> specifies the input directory on HDFS where the input for the MapReduce job resides.</li>
+<li><i>mapred.output.dir</i> specifies the directory on HDFS where the output of the MapReduce job will be generated.</li>
+</ul>
+</li>
+</ul>
+<p><b>III: Libraries</b> - <i>lib/</i> This is a directory on HDFS that contains the libraries used in the workflow (such as <i>jar files</i> (.jar) or <i>shared object files</i> (.so)). At runtime, the Oozie server picks up the contents of this directory and deploys them on the actual compute nodes using the Hadoop distributed cache. When submitting a job from the user's local system, this lib directory has to be manually copied over to HDFS before the workflow can run. This can be done using the Hadoop filesystem command <i>put</i>, as shown below. Additional ways of linking files and archives are dealt with in a subsequent section, <a href="#CASE-6">How to specify symbolic links for files and archives</a>. In our example, the lib/ directory would contain the <b>wordcount.jar</b> file.</p>
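+<p>For example, assuming the workflow directory layout used in this tutorial, the <i>lib/</i> directory alone could be copied to HDFS as follows (the paths are illustrative):</p>
+<div class="source"><pre>$ hadoop fs -put ~/map-reduce/lib map-reduce/lib</pre>
+</div>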
+<p><b>Now, follow the steps below to try out the wordcount mapreduce application:</b></p>
+<ol type="1"><li>Follow the instructions on <a href="../../target/site/QuickStart.html">Quick Start</a> to setup Oozie with hadoop and ensure that Oozie service is started.</li>
+<li>Create a directory in your home to store all your workflow components (properties, workflow XML, and the libraries). Inside this workflow directory, create a sub-directory called <i>lib/</i>.<div class="source"><pre>$ cd ~
+$ mkdir map-reduce
+$ mkdir map-reduce/lib</pre>
+</div>
+<p>Note that this workflow directory and its contents are created on the local filesystem; they must be copied to HDFS once they are ready.</p>
+</li>
+<li>Create the files <i>workflow.xml</i> and <i>job.properties</i> with contents as shown above.<p><b>Tip:</b> The following Oozie command line option can be used to perform XML schema validation on the workflow XML file and return errors if any:</p>
+<div class="source"><pre>$ oozie validate ~/map_reduce/workflow.xml</pre>
+</div>
+</li>
+<li>Your job.properties file should look like the following:<div class="source"><pre>$ cat ~/map-reduce/job.properties
+
+nameNode=hdfs://localhost:9000
+jobTracker=localhost:9001
+queueName=default
+examplesRoot=map-reduce
+
+oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
+inputDir=input-data
+outputDir=map-reduce</pre>
+</div>
+</li>
+<li>Copy the wordcount.jar file into the workflow's lib/ directory. Now your workflow directory should have the following contents:<div class="source"><pre>job.properties
+workflow.xml
+lib/
+lib/wordcount.jar</pre>
+</div>
+<p>Note, however, that the job.properties file is always read from the local filesystem and need not be copied to HDFS.</p>
+</li>
+<li>Copy the workflow directory to HDFS. Please note that if a directory by this name already exists in HDFS, it might need to be deleted prior to copying (see the note below).<div class="source"><pre>$ hadoop fs -put ~/map-reduce map-reduce</pre>
+</div>
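+<p>If a stale copy of the directory already exists on HDFS, it can be removed first (a sketch using the recursive-remove syntax of this Hadoop generation):</p>
+<div class="source"><pre>$ hadoop fs -rmr map-reduce</pre>
+</div>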
+</li>
+<li>Run the following Oozie command to submit your workflow. Once the workflow is submitted, the Oozie server returns the workflow ID, which can be used for monitoring and debugging purposes.<div class="source"><pre>$ oozie job -oozie http://localhost:4080/oozie/ -config ~/map-reduce/job.properties -run
+
+...
+...
+job: 14-20090525161321-oozie-ninj</pre>
+</div>
+<p><b>-config</b> option specifies the location of the properties file, which in our case is in the user's home directory. (Note: only the workflow and libraries need to be on HDFS, not the properties file.)</p>
+<p><b>-oozie</b> option specifies the location of the Oozie server. It can be omitted if the environment variable <b>OOZIE_URL</b> is set to the server URL.</p>
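+<p>For example, using the server URL from this tutorial:</p>
+<div class="source"><pre>$ export OOZIE_URL=http://localhost:4080/oozie/
+$ oozie job -config ~/map-reduce/job.properties -run</pre>
+</div>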
+</li>
+<li>Check the status of the submitted MapReduce workflow job. The following command displays a detailed breakdown of the workflow job submission.<div class="source"><pre>$ oozie job -info 14-20090525161321-oozie-ninj -oozie http://localhost:4080/oozie/
+
+...
+...
+
+.---------------------------------------------------------------------------------------------------------------------------------------------------------------
+Workflow Name :  wordcount-wf
+App Path      :  hdfs://localhost:4080/user/ninja/map-reduce
+Status        :  SUCCEEDED
+Run           :  0
+User          :  ninja
+Group         :  users
+Created       :  2011-09-21 05:01 +0000
+Started       :  2011-09-21 05:01 +0000
+Ended         :  2011-09-21 05:01 +0000
+Actions
+.----------------------------------------------------------------------------------------------------------------------------------------------------------------
+Action Name             Type        Status     Transition  External Id            External Status  Error Code    Start Time              End Time
+.----------------------------------------------------------------------------------------------------------------------------------------------------------------
+wordcount                 map-reduce  OK         end         job_200904281535_0254  SUCCEEDED        -             2011-09-21 05:01 +0000  2011-09-21 05:01 +0000
+.----------------------------------------------------------------------------------------------------------------------------------------------------------------</pre>
+</div>
+<p><b>Tip:</b> A graphical view of the workflow job status can be obtained via the <a href="#CASE-8">Oozie web console</a>.</p>
+</li>
+</ol>
+</div>
+<div class="section"><h3>Other Use Cases</h3>
+<ul><li><b> <a name="CASE-1">CASE-1</a>: HOW TO PARAMETERIZE OOZIE JOBS</b><p>In Oozie, there are numerous ways to specify the values for parameters being passed using configuration xml files. And, there is an order of precedence in which these parameters are evaluated. Below are the different ways of passing parameters for the MapReduce job to Oozie. The variables values can be defined in:</p>
+<ul><li>job.properties</li>
+<li>config-default.xml</li>
+<li>workflow.xml</li>
+<li>the file referenced by the <i>&lt;job-xml&gt;</i> element</li>
+</ul>
+<p>Variable Substitution</p>
+<ul><li>Example 1: values in job.properties take precedence over config-default.xml</li>
+<li>Example 2: parameters in workflow.xml take precedence over values in job.properties</li>
+<li>Example 3: properties defined in workflow.xml take precedence over the job-xml file's properties</li>
+<li>Example 4: within EL expressions, variables and EL functions are resolved without the &quot;${...}&quot; wrapper.</li>
+</ul>
+<p><i>Example 1:</i></p>
+<p>in job.properties:</p>
+<div class="source"><pre>         VAR=job-properties</pre>
+</div>
+<p>in config-default.xml:</p>
+<div class="source"><pre>         &lt;property&gt;&lt;name&gt;VAR&lt;/name&gt;&lt;value&gt;config-default-xml&lt;/value&gt;&lt;/property&gt;</pre>
+</div>
+<p>in workflow.xml:</p>
+<div class="source"><pre>         &lt;property&gt;&lt;name&gt;VARIABLE&lt;/name&gt;&lt;value&gt;${VAR}&lt;/value&gt;&lt;/property&gt;</pre>
+</div>
+<p>then VARIABLE is resolved to &quot;job-properties&quot;.</p>
+<p><i>Example 2:</i></p>
+<p>in job.properties:</p>
+<div class="source"><pre>         variable4=job-properties</pre>
+</div>
+<p>in config-default.xml:</p>
+<div class="source"><pre>         &lt;property&gt;&lt;name&gt;variable4&lt;/name&gt;&lt;value&gt;config-default-xml&lt;/value&gt;&lt;/property&gt;</pre>
+</div>
+<p>in workflow.xml:</p>
+<div class="source"><pre>         &lt;job-xml&gt;mr3-job.xml&lt;/job-xml&gt;
+         ... ...
+         &lt;property&gt;&lt;name&gt;variable4&lt;/name&gt;&lt;value&gt;grideng&lt;/value&gt;&lt;/property&gt;</pre>
+</div>
+<p>in mr3-job.xml:</p>
+<div class="source"><pre>          &lt;property&gt;&lt;name&gt;mapred.job.queue.name&lt;/name&gt;&lt;value&gt;${variable4}&lt;/value&gt;&lt;/property&gt;</pre>
+</div>
+<p>then mapred.job.queue.name is resolved to &quot;grideng&quot;.</p>
+<p><i>Example 3:</i></p>
+<p>in workflow.xml:</p>
+<div class="source"><pre>         &lt;job-xml&gt;mr3-job.xml&lt;/job-xml&gt;
+         ... ...
+         &lt;property&gt;&lt;name&gt;mapred.job.queue.name&lt;/name&gt;&lt;value&gt;grideng&lt;/value&gt;&lt;/property&gt;</pre>
+</div>
+<p>in mr3-job.xml:</p>
+<div class="source"><pre>          &lt;property&gt;&lt;name&gt;mapred.job.queue.name&lt;/name&gt;&lt;value&gt;bogus&lt;/value&gt;&lt;/property&gt;</pre>
+</div>
+<p>then mapred.job.queue.name is resolved to &quot;grideng&quot;.</p>
+<p><i>Example 4:</i></p>
+<p>in job.properties:</p>
+<div class="source"><pre>         nameNode=hdfs://abc.xyz.yahoo.com:8020
+         outputDir=output-allactions</pre>
+</div>
+<p>in workflow.xml:</p>
+<div class="source"><pre>        &lt;case to=&quot;end&quot;&gt;${fs:exists(concat(concat(concat(concat(concat(nameNode,&quot;/user/&quot;),wf:user()),&quot;/&quot;),wf:conf(&quot;outputDir&quot;)),&quot;/streaming/part-00000&quot;)) and (fs:fileSize(concat(concat(concat(concat(concat(nameNode,&quot;/user/&quot;),wf:user()),&quot;/&quot;),wf:conf(&quot;outputDir&quot;)),&quot;/streaming/part-00000&quot;)) gt 0) == &quot;true&quot;}&lt;/case&gt;</pre>
+</div>
+<p><b>Note:</b> A variable <i>VAR</i> defined in workflow.xml can be referenced as <b>${VAR}</b> only if <i>VAR</i> follows strict Java identifier naming conventions (no dots or spaces in the name). If dots must be used in the name, the variable needs to be referenced as <b>${wf:conf('VAR')}</b>, as in the example below.</p>
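+<p>For instance, assuming a property named <i>my.queue.name</i> is defined in job.properties (the property name here is purely illustrative):</p>
+<div class="source"><pre>         &lt;property&gt;&lt;name&gt;mapred.job.queue.name&lt;/name&gt;&lt;value&gt;${wf:conf('my.queue.name')}&lt;/value&gt;&lt;/property&gt;</pre>
+</div>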
+</li>
+<li><b> <a name="CASE-2">CASE-2</a>: RUNNING MAPREDUCE USING THE NEW HADOOP API</b><p>Since new MR API (a.k.a. Hadoop 20 API) is neither stable nor supported, it is highly recommended not to use new MR API. Instead, Hadoop team recommends using the old API at least until Hadoop 0.23.x is released. The reasons behind this recommendation are as follows:</p>
+<ul><li>You are guaranteed needing to rewrite once the api changes. <i>You would not be saving the cost of rewrite</i>.</li>
+<li>The api is not final and not mature. <i>You would be taking the risk/cost of testing the code and then have it changed on you in the future</i>.</li>
+<li>There is a possibility of backward incompatibility as Hadoop 20 API is not approved. <i>You would take the risk of figuring our backward incompatibility issues</i>.</li>
+<li>There would not be any support efforts if users bump into a problem. <i>You would take the risk of maintaining unsupported code</i>.</li>
+</ul>
+<p>However, if you really need to run MapReduce jobs written using the 0.20 API in Oozie, below are the changes you need to make in workflow.xml:</p>
+<ul><li>change <i>mapred.mapper.class</i> to <b>mapreduce.map.class</b></li>
+<li>change <i>mapred.reducer.class</i> to <b>mapreduce.reduce.class</b></li>
+<li>add <i>mapred.output.key.class</i></li>
+<li>add <i>mapred.output.value.class</i></li>
+<li>and, include the following property into MR action configuration<div class="source"><pre>  &lt;property&gt;
+      &lt;name&gt;mapred.reducer.new-api&lt;/name&gt;
+      &lt;value&gt;true&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;property&gt;
+      &lt;name&gt;mapred.mapper.new-api&lt;/name&gt;
+      &lt;value&gt;true&lt;/value&gt;
+  &lt;/property&gt;</pre>
+</div>
+</li>
+</ul>
+<p>The changes to be made in workflow.xml file are highlighted below:</p>
+<div class="source"><pre>&lt;map-reduce xmlns=&quot;uri:oozie:workflow:0.1&quot;&gt;
+  &lt;job-tracker&gt;abc.xyz.yahoo.com:50300&lt;/job-tracker&gt;
+  &lt;name-node&gt;hdfs://abc.xyz.yahoo.com:8020&lt;/name-node&gt;
+  &lt;prepare&gt;
+    &lt;delete path=&quot;hdfs://abc.xyz.yahoo.com:8020/user/ninja/yoozie_test/output-mr20-fail&quot; /&gt;
+  &lt;/prepare&gt;
+  &lt;configuration&gt;
+
+    &lt;!-- BEGIN: SNIPPET TO ADD IN ORDER TO MAKE USE OF HADOOP 20 API --&gt;
+    &lt;property&gt;
+      &lt;name&gt;mapred.mapper.new-api&lt;/name&gt;
+      &lt;value&gt;true&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;property&gt;
+      &lt;name&gt;mapred.reducer.new-api&lt;/name&gt;
+      &lt;value&gt;true&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;!-- END: SNIPPET --&gt;
+
+    &lt;property&gt;
+       &lt;name&gt;mapreduce.map.class&lt;/name&gt;
+       &lt;value&gt;org.myorg.WordCount$TokenizerMapper&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;property&gt;
+       &lt;name&gt;mapreduce.reduce.class&lt;/name&gt;
+       &lt;value&gt;org.myorg.WordCount$IntSumReducer&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;property&gt;
+       &lt;name&gt;mapred.output.key.class&lt;/name&gt;
+       &lt;value&gt;org.apache.hadoop.io.Text&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;property&gt;
+       &lt;name&gt;mapred.output.value.class&lt;/name&gt;
+       &lt;value&gt;org.apache.hadoop.io.IntWritable&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;property&gt;
+      &lt;name&gt;mapred.map.tasks&lt;/name&gt;
+      &lt;value&gt;1&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;property&gt;
+      &lt;name&gt;mapred.input.dir&lt;/name&gt;
+      &lt;value&gt;/user/ninja/yoozie_test/input-data&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;property&gt;
+      &lt;name&gt;mapred.output.dir&lt;/name&gt;
+      &lt;value&gt;/user/ninja/yoozie_test/output-mr20/mapRed20&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;property&gt;
+      &lt;name&gt;mapred.job.queue.name&lt;/name&gt;
+      &lt;value&gt;grideng&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;property&gt;
+      &lt;name&gt;mapreduce.job.acl-view-job&lt;/name&gt;
+      &lt;value&gt;*&lt;/value&gt;
+    &lt;/property&gt;
+    &lt;property&gt;
+      &lt;name&gt;oozie.launcher.mapreduce.job.acl-view-job&lt;/name&gt;
+      &lt;value&gt;*&lt;/value&gt;
+    &lt;/property&gt;
+  &lt;/configuration&gt;
+&lt;/map-reduce&gt;</pre>
+</div>
+</li>
+<li><b> <a name="CASE-3">CASE-3</a>: ACCESSING HADOOP COUNTERS IN PREVIOUS ACTIONS</b><p>This example generates a user-defined hadoop-counter named <b>['COMMON']['COMMON.ERROR_ACCESS_DH_FILES']</b> in the first action <i>mr1</i> and describes how to access its value in the subsequent action <i>java1</i>.</p>
+<p>The changes made in the workflow.xml are highlighted below:</p>
+<div class="source"><pre>&lt;workflow-app xmlns='uri:oozie:workflow:0.1' name='java-wf'&gt;
+    &lt;start to='mr1' /&gt;
+
+    &lt;action name='mr1'&gt;
+        &lt;map-reduce&gt;
+            &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
+            &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
+            &lt;configuration&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.mapper.class&lt;/name&gt;
+                    &lt;value&gt;org.myorg.SampleMapper&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.reducer.class&lt;/name&gt;
+                    &lt;value&gt;org.myorg.SampleReducer&lt;/value&gt;
+                &lt;/property&gt;
+
+                &lt;property&gt;
+                    &lt;name&gt;mapred.input.dir&lt;/name&gt;
+                    &lt;value&gt;${inputDir}&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.output.dir&lt;/name&gt;
+                    &lt;value&gt;${outputDir}/streaming-output&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                  &lt;name&gt;mapred.job.queue.name&lt;/name&gt;
+                  &lt;value&gt;${queueName}&lt;/value&gt;
+                &lt;/property&gt;
+
+               &lt;property&gt;
+                  &lt;name&gt;mapred.child.java.opts&lt;/name&gt;
+                  &lt;value&gt;-Xmx1024M&lt;/value&gt;
+               &lt;/property&gt;
+
+            &lt;/configuration&gt;
+        &lt;/map-reduce&gt;
+        &lt;ok to=&quot;java1&quot; /&gt;
+        &lt;error to=&quot;fail&quot; /&gt;
+    &lt;/action&gt;
+
+    &lt;action name='java1'&gt;
+        &lt;java&gt;
+            &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
+            &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
+            &lt;configuration&gt;
+               &lt;property&gt;
+                    &lt;name&gt;mapred.job.queue.name&lt;/name&gt;
+                    &lt;value&gt;${queueName}&lt;/value&gt;
+                &lt;/property&gt;
+            &lt;/configuration&gt;
+            &lt;main-class&gt;org.myorg.MyTest&lt;/main-class&gt;
+
+            &lt;!-- BEGIN: SNIPPET TO ADD TO ACCESS HADOOP COUNTERS DEFINED IN PREVIOUS ACTIONS --&gt;
+            &lt;arg&gt;${hadoop:counters(&quot;mr1&quot;)[&quot;COMMON&quot;][&quot;COMMON.ERROR_ACCESS_DH_FILES&quot;]}&lt;/arg&gt;
+            &lt;!-- END: SNIPPET TO ADD --&gt;
+
+            &lt;capture-output/&gt;
+        &lt;/java&gt;
+        &lt;ok to=&quot;end&quot; /&gt;
+        &lt;error to=&quot;fail&quot; /&gt;
+    &lt;/action&gt;
+
+    &lt;kill name=&quot;fail&quot;&gt;
+        &lt;message&gt;${wf:errorCode(wf:lastErrorNode())}&lt;/message&gt;
+    &lt;/kill&gt;
+    &lt;end name='end' /&gt;
+&lt;/workflow-app&gt;</pre>
+</div>
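+<p>For reference, below is a minimal sketch of what <i>org.myorg.MyTest</i> (a hypothetical class, as above) might do with the counter value passed in as its first argument. The property name written out is illustrative; the output file location comes from the <i>oozie.action.output.properties</i> system property that Oozie sets for java actions with <i>&lt;capture-output/&gt;</i>.</p>
+<div class="source"><pre>package org.myorg;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.OutputStream;
+import java.util.Properties;
+
+public class MyTest {
+    public static void main(String[] args) throws Exception {
+        // args[0] holds the value of the hadoop:counters() EL expression.
+        String errorAccessCount = args[0];
+        // Because &lt;capture-output/&gt; is set, properties written to this
+        // file become available to later actions via wf:actionData('java1').
+        File file = new File(System.getProperty(&quot;oozie.action.output.properties&quot;));
+        Properties props = new Properties();
+        props.setProperty(&quot;error.access.count&quot;, errorAccessCount);
+        OutputStream os = new FileOutputStream(file);
+        props.store(os, &quot;&quot;);
+        os.close();
+    }
+}</pre>
+</div>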
+</li>
+<li><b> <a name="CASE-4">CASE-4</a>: INCREASING MEMORY FOR THE HADOOP JOB</b><p>MapReduce tasks are launched with some default memory limits that are provided by the system or by the cluster's administrators. Memory intensive jobs might need to use more than these default values. Hadoop has some configuration options that allow these to be changed. Without such modifications, memory intensive jobs could fail due to <i>OutOfMemory</i> errors in tasks or could get killed when the limits are enforced by the system. This section describes a way in which this could be tuned from within Oozie.</p>
+<p>A property <i>mapred.child.java.opts</i> can be defined in workflow.xml as below:</p>
+<div class="source"><pre>&lt;workflow-app xmlns='uri:oozie:workflow:0.1' name='streaming-wf'&gt;
+    &lt;start to='streaming1' /&gt;
+    &lt;action name='streaming1'&gt;
+        &lt;map-reduce&gt;
+            &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
+            &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
+            &lt;streaming&gt;
+                &lt;mapper&gt;/bin/cat&lt;/mapper&gt;
+                &lt;reducer&gt;/usr/bin/wc&lt;/reducer&gt;
+            &lt;/streaming&gt;
+            &lt;configuration&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.input.dir&lt;/name&gt;
+                    &lt;value&gt;${inputDir}&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.output.dir&lt;/name&gt;
+                    &lt;value&gt;${outputDir}/streaming-output&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                  &lt;name&gt;mapred.job.queue.name&lt;/name&gt;
+                  &lt;value&gt;${queueName}&lt;/value&gt;
+                &lt;/property&gt;
+
+                &lt;!-- BEGIN: SNIPPET TO ADD TO INCREASE MEMORY FOR THE HADOOP JOB--&gt;
+                &lt;property&gt;
+                  &lt;name&gt;mapred.child.java.opts&lt;/name&gt;
+                  &lt;value&gt;-Xmx1024M&lt;/value&gt;
+                &lt;/property&gt;
+                                &lt;!-- END: SNIPPET TO ADD --&gt;
+
+            &lt;/configuration&gt;
+        &lt;/map-reduce&gt;
+        &lt;ok to=&quot;end&quot; /&gt;
+        &lt;error to=&quot;fail&quot; /&gt;
+    &lt;/action&gt;
+    &lt;kill name=&quot;fail&quot;&gt;
+        &lt;message&gt;${wf:errorCode(&quot;streaming1&quot;)}&lt;/message&gt;
+    &lt;/kill&gt;
+    &lt;end name='end' /&gt;
+&lt;/workflow-app&gt;</pre>
+</div>
+</li>
+<li><b> <a name="CASE-5">CASE-5</a>: USING A CUSTOM INPUT FORMAT FOR THE MAPREDUCE JOB</b><p>In order to define and use a <a class="externalLink" href="http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat">custom input format</a> in the map-reduce action, the property <i>mapred.input.format.class</i> needs to be included in the workflow.xml as highlighted below:</p>
+<div class="source"><pre>&lt;workflow-app xmlns='uri:oozie:workflow:0.1' name='streaming-wf'&gt;
+    &lt;start to='streaming1' /&gt;
+    &lt;action name='streaming1'&gt;
+        &lt;map-reduce&gt;
+            &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
+            &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
+            &lt;streaming&gt;
+                &lt;mapper&gt;/bin/cat&lt;/mapper&gt;
+                &lt;reducer&gt;/usr/bin/wc&lt;/reducer&gt;
+            &lt;/streaming&gt;
+            &lt;configuration&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.input.dir&lt;/name&gt;
+                    &lt;value&gt;${inputDir}&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.output.dir&lt;/name&gt;
+                    &lt;value&gt;${outputDir}/streaming-output&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                  &lt;name&gt;mapred.job.queue.name&lt;/name&gt;
+                  &lt;value&gt;${queueName}&lt;/value&gt;
+                &lt;/property&gt;
+
+                &lt;!-- BEGIN: SNIPPET TO ADD TO DEFINE A CUSTOM INPUT FORMAT --&gt;
+                &lt;property&gt;
+                  &lt;name&gt;mapred.input.format.class&lt;/name&gt;
+                  &lt;value&gt;com.yahoo.ymail.antispam.featurelibrary.TextInputFormat&lt;/value&gt;
+                &lt;/property&gt;
+                                &lt;!-- END: SNIPPET TO ADD --&gt;
+
+            &lt;/configuration&gt;
+        &lt;/map-reduce&gt;
+        &lt;ok to=&quot;end&quot; /&gt;
+        &lt;error to=&quot;fail&quot; /&gt;
+    &lt;/action&gt;
+    &lt;kill name=&quot;fail&quot;&gt;
+        &lt;message&gt;${wf:errorCode(&quot;streaming1&quot;)}&lt;/message&gt;
+    &lt;/kill&gt;
+    &lt;end name='end' /&gt;
+&lt;/workflow-app&gt;</pre>
+</div>
+<p><b>NOTE:</b> The <i>jar</i> file containing the custom input format class must be placed in the workflow's <i>lib/</i> directory.</p>
+</li>
+<li><b> <a name="CASE-6">CASE-6</a>: HOW TO SPECIFY SYMBOLIC LINKS FOR FILES AND ARCHIVES</b><p>MapReduce applications can specify symbolic names for files and archives passed through the options <i>\–files</i> and <i>\–archives</i> using # such as below:</p>
+<div class="source"><pre>$ hadoop jar hadoop-examples.jar wordcount -files dir1/dict.txt#dict1,dir2/dict.txt#dict2 -archives mytar.tgz#tgzdir input output</pre>
+</div>
+<p>Here, the files <i>dir1/dict.txt</i> and <i>dir2/dict.txt</i> can be accessed by tasks using the symbolic names <i>dict1</i> and <i>dict2</i>, respectively. The archive <i>mytar.tgz</i> will be placed and unarchived into a directory named <i>tgzdir</i>.</p>
+<p>Oozie supports these by allowing <b>&lt;file&gt;</b> and <b>&lt;archive&gt;</b> tags that can be defined in the workflow.xml as below:</p>
+<div class="source"><pre>&lt;workflow-app name='wordcount-wf' xmlns=&quot;uri:oozie:workflow:0.2&quot;&gt;
+    &lt;start to='wordcount'/&gt;
+    &lt;action name='wordcount'&gt;
+        &lt;map-reduce&gt;
+            &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
+            &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
+            &lt;prepare&gt;
+                &lt;delete path=&quot;hdfs://abc.xyz.yahoo.com:8020/user/ninja/test/output&quot; /&gt;
+            &lt;/prepare&gt;
+            &lt;configuration&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.job.queue.name&lt;/name&gt;
+                    &lt;value&gt;${queueName}&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.mapper.class&lt;/name&gt;
+                    &lt;value&gt;org.myorg.WordCount.Map&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.reducer.class&lt;/name&gt;
+                    &lt;value&gt;org.myorg.WordCount.Reduce&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.input.dir&lt;/name&gt;
+                    &lt;value&gt;${inputDir}&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.output.dir&lt;/name&gt;
+                    &lt;value&gt;${outputDir}&lt;/value&gt;
+                &lt;/property&gt;
+            &lt;/configuration&gt;
+
+            &lt;!-- BEGIN: SNIPPET TO ADD TO DEFINE FILE/ARCHIVE TAGS --&gt;
+            &lt;file&gt;testdir1/dict.txt#dict1&lt;/file&gt;
+            &lt;archive&gt;testtar.tgz#tgzdir&lt;/archive&gt;
+            &lt;!-- END: SNIPPET TO ADD --&gt;
+
+        &lt;/map-reduce&gt;
+        &lt;ok to='end'/&gt;
+        &lt;error to='kill'/&gt;
+    &lt;/action&gt;
+    &lt;kill name='kill'&gt;
+       &lt;message&gt;${wf:errorCode(&quot;wordcount&quot;)}&lt;/message&gt;
+    &lt;/kill&gt;
+    &lt;end name='end'/&gt;
+&lt;/workflow-app&gt;</pre>
+</div>
+<p>Here, the file <i>testdir1/dict.txt</i> can be accessed by tasks using the symbolic name <i>dict1</i>. A symbolic name cannot be a multi-level path (i.e., it should not contain any '/'); this reflects the fact that symbolic links are always created in the current working directory. The archive <i>testtar.tgz</i> will be placed and unarchived into a directory named <i>tgzdir</i>.</p>
+<p>Please note that the <i>-libjars</i> option supported by the Hadoop command-line is not supported by Oozie.</p>
+</li>
+<li><b> <a name="CASE-7">CASE-7</a>: HOW TO LAUNCH A MAPREDUCE STREAMING JOB</b><p><a class="externalLink" href="http://hadoop.apache.org/common/docs/r0.20.1/streaming.html">Hadoop Streaming</a> allows the user to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer (instead of providing the mapper and reducer as conventional java classes). The commandline way of launching such a Hadoop MapReduce streaming job is as follows:</p>
+<div class="source"><pre>$ hadoop jar lib/hadoop-streaming-0.20.1.3006291003.jar -D mapred.job.queue.name=unfunded -input /user/ninja/input-data -output /user/ninja/output-dir -mapper /bin/cat -reducer /usr/bin/wc</pre>
+</div>
+<p>In order to accomplish the same using Oozie, the following <b>&lt;streaming&gt;</b> element needs to be included inside the <b>&lt;map-reduce&gt;</b> action.</p>
+<div class="source"><pre>&lt;streaming&gt;
+        &lt;mapper&gt;/bin/cat&lt;/mapper&gt;
+        &lt;reducer&gt;/usr/bin/wc&lt;/reducer&gt;
+&lt;/streaming&gt;</pre>
+</div>
+<p>The complete workflow.xml file looks like below with the highlighted addition:</p>
+<div class="source"><pre>&lt;workflow-app xmlns='uri:oozie:workflow:0.1' name='streaming-wf'&gt;
+    &lt;start to='streaming1' /&gt;
+    &lt;action name='streaming1'&gt;
+        &lt;map-reduce&gt;
+            &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
+            &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
+
+            &lt;!-- BEGIN: SNIPPET TO ADD FOR HADOOP STREAMING ACTION --&gt;
+            &lt;streaming&gt;
+                &lt;mapper&gt;/bin/cat&lt;/mapper&gt;
+                &lt;reducer&gt;/usr/bin/wc&lt;/reducer&gt;
+            &lt;/streaming&gt;
+            &lt;!-- END: SNIPPET TO ADD --&gt;
+
+            &lt;configuration&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.input.dir&lt;/name&gt;
+                    &lt;value&gt;${inputDir}&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                    &lt;name&gt;mapred.output.dir&lt;/name&gt;
+                    &lt;value&gt;${outputDir}/streaming-output&lt;/value&gt;
+                &lt;/property&gt;
+                &lt;property&gt;
+                  &lt;name&gt;mapred.job.queue.name&lt;/name&gt;
+                  &lt;value&gt;${queueName}&lt;/value&gt;
+                &lt;/property&gt;
+               &lt;property&gt;
+                  &lt;name&gt;mapred.child.java.opts&lt;/name&gt;
+                  &lt;value&gt;-Xmx1024M&lt;/value&gt;
+               &lt;/property&gt;
+
+            &lt;/configuration&gt;
+        &lt;/map-reduce&gt;
+        &lt;ok to=&quot;end&quot; /&gt;
+        &lt;error to=&quot;fail&quot; /&gt;
+    &lt;/action&gt;
+    &lt;kill name=&quot;fail&quot;&gt;
+        &lt;message&gt;${wf:errorCode(&quot;streaming1&quot;)}&lt;/message&gt;
+    &lt;/kill&gt;
+    &lt;end name='end' /&gt;
+&lt;/workflow-app&gt;</pre>
+</div>
+<p><b> <a name="CASE-8">CASE-8</a>: HOW TO USE THE OOZIE WEB-CONSOLE</b></p>
+<p>The Oozie web console provides a way to view all the submitted workflow and coordinator jobs in a browser. Each job can be examined in detail to reveal its job configuration, workflow definition, and all the actions defined for it. It can be accessed by visiting the URL used to submit the job, e.g., <a class="externalLink" href="http://localhost:4080/oozie">http://localhost:4080/oozie</a>.</p>
+<p><b>Note</b>: The web console is a read-only user interface; it cannot be used to submit a job or modify its status.</p>
+<p>Below are some screenshots describing how a job could be drilled down for further details using the web-console.</p>
+<p><i>All the jobs are listed in the grid with filters available above to view the desired job</i>.</p>
+<img src="images/wc1.png" /><p><i>Clicking a job displays the job details and all actions defined under it</i>.</p>
+<img src="images/wc2.png" /><p><i>Each action could be further drilled down by clicking on the browse icon beside the Console URL field</i>.</p>
+<img src="images/wc3.png" /><p><i>Hadoop job logs are available at this point and the task tracker logs can be accessed by clicking around</i>.</p>
+<img src="images/wc4.png" /><img src="images/wc5.png" /><img src="images/wc6.png" /></li>
+</ul>
+</div>
+<div class="section"><h3>FAQs</h3>
+</div>
+</div>
+
+      </div>
+    </div>
+    <div class="clear">
+      <hr/>
+    </div>
+    <div id="footer">
+      <div class="xright">
+        &#169;            2012
+              Apache Software Foundation
+            
+                       - <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
+        Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
+      </div>
+      <div class="clear">
+        <hr/>
+      </div>
+    </div>
+  </body>
+</html>

Added: incubator/oozie/site/publish/overview.html
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/overview.html?rev=1241607&view=auto
==============================================================================
--- incubator/oozie/site/publish/overview.html (added)
+++ incubator/oozie/site/publish/overview.html Tue Feb  7 20:37:39 2012
@@ -0,0 +1,260 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<!-- Generated by Apache Maven Doxia at Feb 7, 2012 -->
+<html xmlns="http://www.w3.org/1999/xhtml">
+  <head>
+    <title>Apache Oozie - Oozie Workflows and Actions</title>
+    <style type="text/css" media="all">
+      @import url("./css/maven-base.css");
+      @import url("./css/maven-theme.css");
+      @import url("./css/site.css");
+    </style>
+    <link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
+        <meta name="author" content="$maven.build.timestamp" />
+        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+      </head>
+  <body class="composite">
+    <div id="banner">
+                  <span id="bannerLeft">
+                 
+                </span>
+                    <div class="clear">
+        <hr/>
+      </div>
+    </div>
+    <div id="breadcrumbs">
+            
+                                <div class="xleft">
+        Last Published: 2012-02-07
+                          |                   <a href="index.html">Apache Oozie</a>
+        &gt;
+    Apache Oozie - Oozie Workflows and Actions
+              </div>
+            <div class="xright">            <a href="http://www.apache.org/" class="externalLink">ASF</a>
+              
+                                 Version: 3.1.0-SNAPSHOT
+      </div>
+      <div class="clear">
+        <hr/>
+      </div>
+    </div>
+    <div id="leftColumn">
+      <div id="navcolumn">
+             
+                                                <h5>Project</h5>
+                  <ul>
+                  <li class="none">
+                  <a href="./index.html">Home</a>
+            </li>
+                  <li class="none">
+                  <a href="./Downloads.html">Downloads</a>
+            </li>
+                  <li class="none">
+                  <a href="./Credits.html">Credits</a>
+            </li>
+                  <li class="none">
+                  <a href="./MailingLists.html">Mailing Lists</a>
+            </li>
+                  <li class="none">
+                  <a href="./IssueTracking.html">Issue Tracking</a>
+            </li>
+                  <li class="none">
+                  <a href="./IRCChannel.html">IRC Channel</a>
+            </li>
+          </ul>
+                       <h5>Developers</h5>
+                  <ul>
+                  <li class="none">
+                  <a href="./VersionControl.html">Version Control</a>
+            </li>
+                  <li class="none">
+                  <a href="./HowToContribute.html">How To Contribute</a>
+            </li>
+                  <li class="none">
+                  <a href="HowToRelease.html">How to Release</a>
+            </li>
+          </ul>
+                       <h5>Documentation</h5>
+                  <ul>
+                  <li class="none">
+                  <a href="./QuickStart.html">Quick start</a>
+            </li>
+                  <li class="none">
+                  <a href="./overview.html">Overview</a>
+            </li>
+                  <li class="none">
+                  <a href="./map-reduce-cookbook.html">MapReduce Cookbook</a>
+            </li>
+                  <li class="none">
+                  <a href="./pig-cookbook.html">Pig Cookbook</a>
+            </li>
+          </ul>
+                                 <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
+          <img alt="Built by Maven" src="./images/logos/maven-feather.png"/>
+        </a>
+                       
+                            </div>
+    </div>
+    <div id="bodyColumn">
+      <div id="contentBox">
+        <div class="section"><h2>Oozie Workflows and Actions</h2>
+<p>At this stage, the basic requirements, such as a Java 1.6+ JDK and working Hadoop and Oozie installations, should be in place. This brief documentation explains how to work with Oozie workflows.</p>
+<div class="section"><h3>The Oozie Application Directory</h3>
+<p>Copy the workflow application directory to HDFS ($HADOOP_HOME/bin should be in your command path):</p>
+<div class="source"><pre>      $ hadoop fs -put &lt;src path on local file system&gt; &lt;destination path&gt;</pre>
+</div>
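+<p>For example, assuming a local application directory named <tt>my-app</tt> (the name is illustrative), the following copies it into your HDFS home directory and verifies the result:</p>
+<div class="source"><pre>      $ hadoop fs -put my-app my-app
+      $ hadoop fs -ls my-app</pre>
+</div>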
+<p>A workflow application directory has the following structure:</p>
+<ul><li>my-app/workflow.xml</li>
+<li>my-app/lib (containing required classes in the form of JARs)</li>
+</ul>
+<p>A coordinator application directory has a 'coordinator.xml' file in addition to the above.</p>
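+<p>A minimal <i>coordinator.xml</i> sketch that runs the workflow once a day (the name, dates and paths below are illustrative):</p>
+<div class="source"><pre>      &lt;coordinator-app name=&quot;my-coord-app&quot; frequency=&quot;${coord:days(1)}&quot;
+                       start=&quot;2012-01-01T00:00Z&quot; end=&quot;2012-12-31T00:00Z&quot;
+                       timezone=&quot;UTC&quot; xmlns=&quot;uri:oozie:coordinator:0.1&quot;&gt;
+          &lt;action&gt;
+              &lt;workflow&gt;
+                  &lt;app-path&gt;${nameNode}/user/[USER]/my-app&lt;/app-path&gt;
+              &lt;/workflow&gt;
+          &lt;/action&gt;
+      &lt;/coordinator-app&gt;</pre>
+</div>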
+</div>
+<div class="section"><h3><a name="Configuring">Configuring</a> Actions in the Workflow</h3>
+<p>An Oozie workflow enables you to execute your task via multiple action types, e.g. the Java, Map-Reduce, Pig and Fs actions.</p>
+<p>Oozie jobs are executed on the Hadoop cluster via a launcher (refer to the section <a href="#Launcher-Mapper">Launcher-Mapper</a> on this page). Hence the workflow has to be configured with the following parameters:</p>
+<ul><li>Jobtracker URL</li>
+<li>Namenode URL</li>
+<li>Kerberos principals for authentication to the Hadoop cluster</li>
+<li>Queue name</li>
+<li>Other properties specified as name-value pairs</li>
+</ul>
+<p>For example, without Oozie, a Hadoop job is submitted on the command line as:</p>
+<div class="source"><pre>      $ hadoop [COMMAND] [GENERIC_OPTIONS]</pre>
+</div>
+<p>The GENERIC_OPTIONS comprise:</p>
+<ul><li>-conf &lt;configuration_file&gt;</li>
+<li>-fs &lt;namenode:port&gt;</li>
+<li>-jt &lt;jobtracker:port&gt;</li>
+<li>-files &lt;comma-separated list of files to be copied to the map-reduce cluster&gt;</li>
+<li>-archives &lt;comma-separated list of archives to be unarchived on compute nodes&gt;</li>
+<li>-D &lt;property=value&gt;</li>
+</ul>
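+<p>For instance, a concrete invocation could look as follows (the host names, paths and the driver class <i>org.example.MyTool</i> are illustrative, and the driver is assumed to parse generic options via Hadoop's ToolRunner):</p>
+<div class="source"><pre>      $ hadoop jar my-app.jar org.example.MyTool \
+            -fs hdfs://foo:9000 -jt bar:9001 \
+            -D mapred.job.queue.name=default \
+            -files dir1/dict.txt \
+            /inputdir /outputdir</pre>
+</div>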
+<p>Now with Oozie, the equivalent properties can be specified in either of two ways:</p>
+<ul><li>as inline xml tags in the &quot;workflow.xml&quot; file<div class="source"><pre>          &lt;job-xml&gt; ... &lt;/job-xml&gt;
+          &lt;name-node&gt; ... &lt;/name-node&gt;
+          &lt;job-tracker&gt; ... &lt;/job-tracker&gt;
+          &lt;files&gt; ... &lt;/files&gt;
+          &lt;archives&gt; ... &lt;/archives&gt;
+
+          &lt;configuration&gt;
+           &lt;property&gt;
+                        &lt;name&gt; ... &lt;/name&gt;
+                        &lt;value&gt; ... &lt;/value&gt;
+           &lt;/property&gt;
+          &lt;/configuration&gt;</pre>
+</div>
+<p>OR</p>
+</li>
+<li>as a list of name-value pairs in a &quot;job.properties&quot; file. This enables frequently repeated property values to be parameterized in the workflow specification as EL expressions</li>
+</ul>
+<p>Note: The job.properties file need not be uploaded to HDFS as part of the workflow application directory. It is only required locally, on the machine from which the Oozie job is submitted. That way you can use different property values to submit the same job to different cluster environments.</p>
+<p>Sample <i>job.properties</i> file</p>
+<div class="source"><pre>      nameNode=foo:9000
+
+      jobTracker=bar:9001
+
+      jobInput=/somedirpath
+
+      queueName=default</pre>
+</div>
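+<p>The properties file is then passed to the Oozie command-line tool when submitting the job, for example (the server URL is illustrative):</p>
+<div class="source"><pre>      $ oozie job -oozie http://localhost:4080/oozie -config job.properties -run</pre>
+</div>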
+</div>
+<div class="section"><h3>Syntax For Composing Workflows</h3>
+<p>Sample syntax for the <i>workflow.xml</i> file with a Java action (illustrating the use of EL expressions from job.properties):</p>
+<div class="source"><pre>      &lt;workflow-app name=&quot;[WF-DEF-NAME]&quot; xmlns=&quot;uri:oozie:workflow:0.2&quot;&gt;
+        ...
+          &lt;action name=&quot;[NODE-NAME]&quot;&gt;
+          &lt;java&gt;
+              &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
+              &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
+              &lt;prepare&gt;
+                  &lt;delete path=&quot;[PATH]&quot;/&gt;
+                  ...
+                  &lt;mkdir path=&quot;[PATH]&quot;/&gt;
+                  ...
+              &lt;/prepare&gt;
+              &lt;configuration&gt;
+                  &lt;property&gt;
+                    &lt;name&gt;mapred.job.queue.name&lt;/name&gt;
+                    &lt;value&gt;${queueName}&lt;/value&gt;
+                                  &lt;/property&gt;
+                                  &lt;property&gt;
+                    &lt;name&gt;mapred.input.dir&lt;/name&gt;
+                    &lt;value&gt;${jobInput}&lt;/value&gt;
+                                  &lt;/property&gt;
+                  ...
+              &lt;/configuration&gt;
+              &lt;main-class&gt;[MAIN-CLASS]&lt;/main-class&gt;
+                          &lt;java-opts&gt;[JAVA-STARTUP-OPTS]&lt;/java-opts&gt;
+                          &lt;arg&gt;ARGUMENT&lt;/arg&gt;
+              ...
+          &lt;/java&gt;
+          &lt;ok to=&quot;[NODE-NAME]&quot;/&gt;
+              &lt;error to=&quot;[NODE-NAME]&quot;/&gt;
+          &lt;/action&gt;
+          ...
+      &lt;/workflow-app&gt;</pre>
+</div>
+<p>The syntax of these tags remains the same for the Java, Map-Reduce, Pig, Fs and Ssh actions in Oozie.</p>
+<p>Configuration values can be parameterized in different ways: by passing them in workflow.xml, job.properties, config-default.xml, or a custom XML file referenced via the <i>job-xml</i> tag in the workflow. For more details refer to the section <a href="map-reduce-cookbook.html#CASE-1">How to Parameterize Oozie Jobs</a>.</p>
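+<p>For instance, a <i>config-default.xml</i> file placed in the application directory supplies default values for workflow parameters, using the standard configuration format. A minimal sketch (the property is illustrative):</p>
+<div class="source"><pre>      &lt;configuration&gt;
+          &lt;property&gt;
+              &lt;name&gt;queueName&lt;/name&gt;
+              &lt;value&gt;default&lt;/value&gt;
+          &lt;/property&gt;
+      &lt;/configuration&gt;</pre>
+</div>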
+</div>
+<div class="section"><h3>Prepare block</h3>
+<p>A workflow action can be configured to clean up HDFS files and directories before the application starts. This capability enables Oozie to retry an application after a transient or non-transient failure (it can be used to clean up any temporary data the application may have created before failing).</p>
+<p>The prepare element, if present, indicates a list of paths to perform file operations on before starting the application. It should be used exclusively for directory cleanup for the application to be executed; only <tt>delete</tt> and <tt>mkdir</tt> operations are supported, and they are performed in order.</p>
+<div class="source"><pre>        &lt;prepare&gt;
+            &lt;delete path=[PATH] /&gt;
+            ..
+            &lt;mkdir path=[PATH] /&gt;
+            ..
+        &lt;/prepare&gt;</pre>
+</div>
+</div>
+<div class="section"><h3><a name="Adding">Adding</a> Files and Archives for your Job</h3>
+<p>It is possible to add files and archives as workflow elements so that they are available to the application. If the specified path is relative, the file or archive is assumed to be within the application directory, at the corresponding sub-path. If the path is absolute, the file or archive is expected at the given absolute path. These files are copied to the map-reduce cluster compute nodes, and the archives specified are unarchived there.</p>
+<p>Files specified with the file element will be available as symbolic links in the current working directory of the task, i.e. the task's home directory. If a file is a native library (an '.so' or '.so.#' file), it will be symlinked as an '.so' file in the task running directory, and thus available to the task JVM. To force a particular symlink name for a file in the task running directory, append '#' followed by the symlink name (illustrated below).</p>
+<p>Oozie supports these by allowing <tt>file</tt> and <tt>archive</tt> tags that can be defined in the application workflow as below:</p>
+<div class="source"><pre>        &lt;file&gt; dir1/dict.txt#dict1 &lt;/file&gt;
+        &lt;file&gt; dir2/dict.txt#dict2 &lt;/file&gt;
+        &lt;archive&gt; mytar.tgz#tgzdir &lt;/archive&gt;</pre>
+</div>
+<p>Here, the files dir1/dict.txt and dir2/dict.txt can be accessed by jobs using the symbolic names dict1 and dict2 respectively. The archive mytar.tgz will be placed and unarchived into a directory by the name &quot;tgzdir&quot;.</p>
+<p>Please note that the <i>-libjars</i> option supported by the Hadoop command line is not supported by Oozie.</p>
+</div>
+<div class="section"><h3><a name="Launcher-Mapper">Launcher-Mapper</a>: How Oozie Launches Actions in Workflow</h3>
+<p>A common misunderstanding among users is that the Oozie server launches the MapReduce/Pig jobs by itself. The following diagram shows what actually happens when Oozie launches the actions in a workflow:</p>
+<img src="images/Launcher.png" alt="Launcher Job" /><ol type="1"><li>Oozie server contacts the JobTracker first and submits the MapReduce launcher job.</li>
+<li>Job Tracker then initiates a <i>map only</i> job called the <b>Launcher Job</b>.</li>
+<li>This Launcher job then creates the various MapReduce jobs on Hadoop.</li>
+<li>The Launcher job exits after all jobs are done.<p>The reasons for using a Launcher job as an intermediate step are:</p>
+<ul><li>To prevent Oozie-server from becoming a performance bottleneck and thus single point of failure.</li>
+<li>To help Oozie become more scalable.</li>
+</ul>
+</li>
+</ol>
+</div>
+<div class="section"><h3>Workflow Specification</h3>
+<p>To begin translating your tasks into equivalent Oozie jobs, utilizing the different actions and writing workflows that incorporate the above features, refer to the <a class="externalLink" href="http://yahoo.github.com/oozie/releases/3.0.0/WorkflowFunctionalSpec.html">Workflow Specification</a>, which covers the different Oozie actions.</p>
+<p>Detailed use cases and composition:</p>
+<ul><li>Map-Reduce Action - <a href="map-reduce-cookbook.html">Map-Reduce Action Cookbook</a></li>
+<li>Pig Action - <a href="pig-cookbook.html">Pig Action Cookbook</a></li>
+</ul>
+</div>
+</div>
+
+      </div>
+    </div>
+    <div class="clear">
+      <hr/>
+    </div>
+    <div id="footer">
+      <div class="xright">
+        &#169;            2012
+              Apache Software Foundation
+            
+                       - <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
+        Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
+      </div>
+      <div class="clear">
+        <hr/>
+      </div>
+    </div>
+  </body>
+</html>

Added: incubator/oozie/site/publish/pig-cookbook.html
URL: http://svn.apache.org/viewvc/incubator/oozie/site/publish/pig-cookbook.html?rev=1241607&view=auto
==============================================================================
--- incubator/oozie/site/publish/pig-cookbook.html (added)
+++ incubator/oozie/site/publish/pig-cookbook.html Tue Feb  7 20:37:39 2012
@@ -0,0 +1,345 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<!-- Generated by Apache Maven Doxia at Feb 7, 2012 -->
+<html xmlns="http://www.w3.org/1999/xhtml">
+  <head>
+    <title>Apache Oozie - Pig Cookbook</title>
+    <style type="text/css" media="all">
+      @import url("./css/maven-base.css");
+      @import url("./css/maven-theme.css");
+      @import url("./css/site.css");
+    </style>
+    <link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
+        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+      </head>
+  <body class="composite">
+    <div id="banner">
+                  <span id="bannerLeft">
+                 
+                </span>
+                    <div class="clear">
+        <hr/>
+      </div>
+    </div>
+    <div id="breadcrumbs">
+            
+                                <div class="xleft">
+        Last Published: 2012-02-07
+                          |                   <a href="index.html">Apache Oozie</a>
+        &gt;
+    Apache Oozie - Pig Cookbook
+              </div>
+            <div class="xright">            <a href="http://www.apache.org/" class="externalLink">ASF</a>
+              
+                                 Version: 3.1.0-SNAPSHOT
+      </div>
+      <div class="clear">
+        <hr/>
+      </div>
+    </div>
+    <div id="leftColumn">
+      <div id="navcolumn">
+             
+                                                <h5>Project</h5>
+                  <ul>
+                  <li class="none">
+                  <a href="./index.html">Home</a>
+            </li>
+                  <li class="none">
+                  <a href="./Downloads.html">Downloads</a>
+            </li>
+                  <li class="none">
+                  <a href="./Credits.html">Credits</a>
+            </li>
+                  <li class="none">
+                  <a href="./MailingLists.html">Mailing Lists</a>
+            </li>
+                  <li class="none">
+                  <a href="./IssueTracking.html">Issue Tracking</a>
+            </li>
+                  <li class="none">
+                  <a href="./IRCChannel.html">IRC Channel</a>
+            </li>
+          </ul>
+                       <h5>Developers</h5>
+                  <ul>
+                  <li class="none">
+                  <a href="./VersionControl.html">Version Control</a>
+            </li>
+                  <li class="none">
+                  <a href="./HowToContribute.html">How To Contribute</a>
+            </li>
+                  <li class="none">
+                  <a href="HowToRelease.html">How to Release</a>
+            </li>
+          </ul>
+                       <h5>Documentation</h5>
+                  <ul>
+                  <li class="none">
+                  <a href="./QuickStart.html">Quick start</a>
+            </li>
+                  <li class="none">
+                  <a href="./overview.html">Overview</a>
+            </li>
+                  <li class="none">
+                  <a href="./map-reduce-cookbook.html">MapReduce Cookbook</a>
+            </li>
+                  <li class="none">
+                  <a href="./pig-cookbook.html">Pig Cookbook</a>
+            </li>
+          </ul>
+                                 <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
+          <img alt="Built by Maven" src="./images/logos/maven-feather.png"/>
+        </a>
+                       
+                            </div>
+    </div>
+    <div id="bodyColumn">
+      <div id="contentBox">
+        <div class="section"><h2>Pig Cookbook</h2>
+<p>This document comprehensively describes the procedure for running a Pig job using Oozie. Its target audience is anyone who installs, uses, or operates Oozie.</p>
+<div class="section"><h3>Overview</h3>
+<p>Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Refer to <a class="externalLink" href="http://pig.apache.org">Pig documentation</a> for information on Pig.</p>
+<p>Although Pig jobs can be launched independently, there are obvious advantages to submitting them via Oozie, such as:</p>
+<ul><li>Managing complex workflow dependencies</li>
+<li>Frequency-based execution</li>
+<li>Operational flexibility<p>An execution of a Pig job is referred to as a Pig action in Oozie. A Pig action can be specified in the workflow definition (XML) file. The workflow job will wait until the Pig job completes before continuing to the next action.</p>
+<p>The Pig action has to be configured with the Pig script and the necessary parameters and configuration to run the Pig job. A Pig script contains Pig Latin statements and Pig commands in a single file. For configuration related to job-tracker, name-node, job-xml etc., refer to the section <a href="overview.html#Configuring">Configuring Actions in the Workflow</a>.</p>
+</li>
+<li><b>Syntax of Pig action</b><div class="source"><pre>&lt;workflow-app&gt;
+...
+&lt;action name=&quot;[NODE-NAME]&quot;&gt;
+&lt;pig&gt;
+...
+&lt;script&gt;[PIG-SCRIPT]&lt;/script&gt;
+&lt;argument&gt;[ARGUMENT-VALUE]&lt;/argument&gt;
+...
+&lt;argument&gt;[ARGUMENT-VALUE]&lt;/argument&gt;
+
+...
+&lt;/pig&gt;
+&lt;ok to=&quot;[NODE-NAME]&quot;/&gt;
+&lt;error to=&quot;[NODE-NAME]&quot;/&gt;
+&lt;/action&gt;
+...
+&lt;/workflow-app&gt;</pre>
+</div>
+<p>The &quot;script&quot; element contains the Pig script to execute.</p>
+<p>The &quot;argument&quot; element, if present, contains arguments to be passed to the Pig script. This can be used for <a class="externalLink" href="http://wiki.apache.org/pig/ParameterSubstitution">parameter substitution</a> and other purposes.</p>
+<p>As with Hadoop map-reduce jobs, it is possible to add files and archives to be available to the Pig job; refer to the section <a href="overview.html#Adding">Adding Files and Archives for your Job</a>.</p>
+</li>
+</ul>
+</div>
+<div class="section"><h3>Use cases</h3>
+<ul><li><b>CASE 1: Launch a simple Pig job</b><p>Oozie allows the user to run a Pig job by specifying the Pig script and other necessary arguments. A command line way to launch a Pig job is:</p>
+<div class="source"><pre>pig -Dmapred.job.queue.name=myqueue -file script.pig</pre>
+</div>
+<p>To accomplish the same using Oozie, the complete xml file for specifying the workflow is below:</p>
+<div class="source"><pre>        &lt;workflow-app name='pig-wf' xmlns=&quot;uri:oozie:workflow:0.3&quot;&gt;
+            &lt;start to='pig-node'/&gt;
+            &lt;action name='pig-node'&gt;
+               &lt;pig&gt;
+                    &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
+                    &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
+                    &lt;prepare&gt; &lt;delete path=&quot;${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/pig&quot;/&gt;&lt;/prepare&gt;
+                    &lt;configuration&gt;
+                        &lt;property&gt;
+                            &lt;name&gt;mapred.job.queue.name&lt;/name&gt;
+                            &lt;value&gt;${queueName}&lt;/value&gt;
+                        &lt;/property&gt;
+                    &lt;/configuration&gt;
+                    &lt;script&gt;script.pig&lt;/script&gt;
+               &lt;/pig&gt;
+               &lt;ok to=&quot;end&quot;/&gt;
+                   &lt;error to=&quot;fail&quot;/&gt;
+                &lt;/action&gt;
+                &lt;kill name=&quot;fail&quot;&gt;
+                     &lt;message&gt;Pig failed, error message[${wf:errorMessage(wf:lastErrorNode())}]&lt;/message&gt;
+                &lt;/kill&gt;
+                &lt;end name=&quot;end&quot;/&gt;
+        &lt;/workflow-app&gt;</pre>
+</div>
+<ul><li>The <b>&lt;job-tracker&gt;</b> element specifies the URL of the Hadoop JobTracker.<p><i>Format</i>: jobtracker_hostname:port_number</p>
+<p><i>Example</i>: localhost:9001, abc.xyz.yahoo.com:50300</p>
+</li>
+<li>The <b>&lt;name-node&gt;</b> element specifies the URL of the Hadoop NameNode.<p><i>Format</i>: hdfs://namenode_hostname:port_number</p>
+<p><i>Example</i>: hdfs://localhost:9000, hdfs://abc.xyz.yahoo.com:8020</p>
+<p>The job-tracker and name-node values need to be the same as the ones defined in the Hadoop configuration files; if they differ, they need to be updated.</p>
+</li>
+<li>The <b>&lt;prepare&gt;</b> element specifies a list of operations to be performed before the action begins, such as deleting an existing output directory (<b>&lt;delete&gt;</b>) or creating a new one (<b>&lt;mkdir&gt;</b>).</li>
+<li>The <b>&lt;configuration&gt;</b> element specifies key/value properties. Some common properties include:</li>
+<li><i>mapred.job.queue.name</i> specifies the queue the job will be submitted to. If not set, the queue <i>default</i> is assumed.</li>
+<li>The <b>&lt;script&gt;</b> element specifies the Pig script.</li>
+</ul>
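+<p>The workflow above references the EL parameters <i>${jobTracker}</i>, <i>${nameNode}</i>, <i>${queueName}</i> and <i>${examplesRoot}</i>. A <i>job.properties</i> sketch supplying them at submission time (host names and paths are illustrative):</p>
+<div class="source"><pre>nameNode=hdfs://localhost:9000
+jobTracker=localhost:9001
+queueName=default
+examplesRoot=examples
+oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/pig</pre>
+</div>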
+</li>
+<li><b>CASE 2: Running a Pig job by passing parameters through command line</b><p>Many users want to create a template Pig script and run it with different parameters. This can be accomplished using the <i>-param</i> construct in Pig. For more information, refer to <a class="externalLink" href="http://wiki.apache.org/pig/ParameterSubstitution">parameter substitution</a>. A command line way to run a Pig job by using the <i>-param</i> construct is:</p>
+<div class="source"><pre>pig -file script.pig -param INPUT=inputdir -param OUTPUT=outputdir</pre>
+</div>
+<p>In order to accomplish the same using Oozie, the <i>&lt;argument&gt;</i> element needs to be included inside the <i>&lt;pig&gt;</i> action.</p>
+<p>A partial xml file for such a Pig action:</p>
+<div class="source"><pre>&lt;script&gt;script.pig&lt;/script&gt;
+&lt;argument&gt;-param&lt;/argument&gt;
+&lt;argument&gt;INPUT=inputdir&lt;/argument&gt;
+&lt;argument&gt;-param&lt;/argument&gt;
+&lt;argument&gt;OUTPUT=outputdir&lt;/argument&gt;</pre>
+</div>
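+<p>The script can then reference these parameters using Pig's <tt>$</tt>-substitution. An illustrative <i>script.pig</i>:</p>
+<div class="source"><pre>A = load '$INPUT' using PigStorage(':');
+B = foreach A generate $0 as id;
+store B into '$OUTPUT' USING PigStorage();</pre>
+</div>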
+</li>
+<li><b>CASE 3: Running a Pig job by passing parameters through a parameter file</b><p>A parameter file can also be used to pass parameters. It is primarily used when the number of parameters to be passed is high. A command line way to run a Pig job using the <i>-param_file</i> construct to pass parameters through a file is:</p>
+<div class="source"><pre>pig -file script.pig -param_file paramfile</pre>
+</div>
+<p>There are multiple ways of running such a Pig job through Oozie.</p>
+<ul><li><b>a) Using the absolute hdfs path</b><p>Partial xml file for Pig action:</p>
+<div class="source"><pre>&lt;script&gt;script.pig&lt;/script&gt;
+&lt;argument&gt;-param_file&lt;/argument&gt;
+&lt;argument&gt;hdfs://localhost:9000/user/ninja/paramfile.txt&lt;/argument&gt;</pre>
+</div>
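+<p>The parameter file itself is a plain text file listing one parameter per line, for example:</p>
+<div class="source"><pre>INPUT=inputdir
+OUTPUT=outputdir</pre>
+</div>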
+<p>Pig expects the parameter file to be a local file. As the Oozie launcher runs on a compute node, the location of the parameter file in HDFS should be specified.</p>
+<p>The Pig action requires the Pig JAR file to be in HDFS. Libraries used in the workflow can be stored in the <i>lib</i> directory. At runtime, the Oozie server picks up the contents of this directory and deploys them on the actual compute node using the Hadoop distributed cache. This <i>lib</i> directory has to be manually copied over to HDFS before the workflow can run.</p>
+<p>The <tt>hadoop fs -put</tt> command can be used to copy files to HDFS, and <tt>hadoop fs -ls</tt> to list them.</p>
+<p>Layout of application directory in hdfs:</p>
+<div class="source"><pre>/user/ninja/examples/apps/pig/workflow.xml
+/user/ninja/examples/apps/pig/script.pig
+/user/ninja/paramfile.txt
+/user/ninja/examples/apps/pig/lib/pig-0.9.jar</pre>
+</div>
+</li>
+<li><b>b) Storing the parameter file in &quot;lib&quot; directory</b><p>The parameter file can be stored in the <i>lib</i> directory as contents of this directory are automatically added to the classpath by Oozie server.</p>
+<p>Partial xml file for Pig action:</p>
+<div class="source"><pre>&lt;script&gt;script.pig&lt;/script&gt;
+&lt;argument&gt;-param_file&lt;/argument&gt;
+&lt;argument&gt;paramfile.txt&lt;/argument&gt;</pre>
+</div>
+<p>Layout of application directory in hdfs:</p>
+<div class="source"><pre>/user/ninja/examples/apps/pig/workflow.xml
+/user/ninja/examples/apps/pig/script.pig
+/user/ninja/examples/apps/pig/lib/pig-0.9.jar
+/user/ninja/examples/apps/pig/lib/paramfile.txt</pre>
+</div>
+</li>
+<li><b>c) Using the &lt;file&gt; element</b><p>Partial xml file for Pig action</p>
+<div class="source"><pre>&lt;script&gt;script.pig&lt;/script&gt;
+&lt;argument&gt;-param_file&lt;/argument&gt;
+&lt;argument&gt;symlink.txt&lt;/argument&gt;
+&lt;file&gt;/user/ninja/param/paramfile.txt#symlink.txt&lt;/file&gt;</pre>
+</div>
+<p>The files under the <i>&lt;file&gt;</i> element are added to the distributed cache. To force a symlink for a file, '#' is used followed by the symlink name. Hence, the file <i>/user/ninja/param/paramfile.txt</i> can be accessed locally using the symbolic name <i>symlink.txt</i>. For detailed usage of symbolic links in <i>&lt;file&gt;</i> and <i>&lt;archive&gt;</i>, refer to <a href="overview.html#Adding">Adding Files and Archives for your Job</a>.</p>
+<p>Layout of application directory in hdfs:</p>
+<div class="source"><pre>/user/ninja/examples/apps/pig/workflow.xml
+/user/ninja/examples/apps/pig/script.pig
+/user/ninja/param/paramfile.txt
+/user/ninja/examples/apps/pig/lib/pig-0.9.jar</pre>
+</div>
+</li>
+</ul>
+</li>
+<li><b>CASE 4: Pig Actions with UDF</b><p>Pig provides support for user-defined functions (UDFs) as a way to specify custom processing. Refer to <a class="externalLink" href="http://pig.apache.org/docs/r0.9.1/udf.html">Pig UDF</a> for information on Pig user-defined functions.</p>
+<p>Following is an example script file using a UDF:</p>
+<div class="source"><pre>REGISTER udfjar/tutorial.jar
+A = load '$INPUT/student_data' using PigStorage('\t') as (name: chararray, age: int, gpa: float);
+B = foreach A generate org.apache.pig.tutorial.UPPER(name); store B into '$OUTPUT' USING PigStorage();</pre>
+</div>
+<p>A command line way to run this Pig job is:</p>
+<div class="source"><pre>pig -file script.pig -param INPUT=inputdir -param OUTPUT=outputdir</pre>
+</div>
+<p>When running through Oozie, the UDF binary has to reside on the compute node.</p>
+<p>There are multiple ways of specifying a Pig UDF:</p>
+<ul><li><b>a) Using the &lt;archive&gt; element</b><p>Specify the name of the customized JAR under the <i>&lt;archive&gt;</i> element and use 'REGISTER' in the Pig script.</p>
+<p>Partial xml file using the <i>&lt;archive&gt;</i> element:</p>
+<div class="source"><pre>&lt;script&gt;script.pig&lt;/script&gt;
+&lt;argument&gt;-param&lt;/argument&gt;
+&lt;argument&gt;INPUT=inputdir &lt;/argument&gt;
+&lt;argument&gt;-param&lt;/argument&gt;
+&lt;argument&gt;OUTPUT=outputdir &lt;/argument&gt;
+&lt;archive&gt;archive/tutorial.jar#udfjar&lt;/archive&gt;</pre>
+</div>
+<p>The archive <i>tutorial.jar</i> will be placed into a directory by the name <i>udfjar</i> in the current working directory of the tasks. Hence the jar file in the distributed cache will be available locally to the Pig job.</p>
+<p>Layout of application directory in hdfs:</p>
+<div class="source"><pre>/examples/apps/pig/workflow.xml
+/examples/apps/pig/script.pig
+/examples/apps/pig/lib/pig-0.9.jar
+/examples/apps/pig/archive/tutorial.jar</pre>
+</div>
+</li>
+<li><b>b) Using the &lt;file&gt; element</b><p>Specify the name of the customized JAR under the <i>&lt;file&gt;</i> element and use 'REGISTER' in the Pig script.</p>
+<p>Following is an example script file:</p>
+<div class="source"><pre>REGISTER udfjar.jar
+A = load '$INPUT/student_data' using PigStorage('\t') as (name: chararray, age: int, gpa: float);
+B = foreach A generate org.apache.pig.tutorial.UPPER(name); store B into '$OUTPUT' USING PigStorage();</pre>
+</div>
+<p>Partial xml file using the <i>&lt;file&gt;</i> element</p>
+<div class="source"><pre>&lt;script&gt;script.pig&lt;/script&gt;
+&lt;argument&gt;-param&lt;/argument&gt;
+&lt;argument&gt;INPUT=inputdir &lt;/argument&gt;
+&lt;argument&gt;-param&lt;/argument&gt;
+&lt;argument&gt;OUTPUT=outputdir &lt;/argument&gt;
+&lt;file&gt;archive/tutorial.jar#udfjar.jar&lt;/file&gt;</pre>
+</div>
+<p>The files under the <i>&lt;file&gt;</i> element are added to the distributed cache. To force a symlink for a file, '#' is used followed by the symlink name. Hence, the file <i>archive/tutorial.jar</i> can be accessed locally using the symbolic name <i>udfjar.jar</i>. For detailed usage of symbolic links in <i>&lt;file&gt;</i> and <i>&lt;archive&gt;</i>, refer to <a href="overview.html#Adding">Adding Files and Archives for your Job</a>.</p>
+<p>Layout of application directory in hdfs:</p>
+<div class="source"><pre>/examples/apps/pig/workflow.xml
+/examples/apps/pig/script.pig
+/examples/apps/pig/lib/pig-0.9.jar
+/examples/apps/pig/archive/tutorial.jar</pre>
+</div>
+</li>
+<li><b>c) Storing the customized jar in &quot;lib&quot; directory</b><p>JARs in the <i>lib</i> directory are automatically added to the classpath by the Oozie server. So, if the customized JAR (tutorial.jar) is in the <i>lib</i> directory, it should not be listed in <i>&lt;archive&gt;</i>, and the &quot;REGISTER&quot; statement should be removed from the Pig script.</p>
+<p>Partial xml file:</p>
+<div class="source"><pre>&lt;script&gt;script.pig&lt;/script&gt;
+&lt;argument&gt;-param&lt;/argument&gt;
+&lt;argument&gt;INPUT=inputdir &lt;/argument&gt;
+&lt;argument&gt;-param&lt;/argument&gt;
+&lt;argument&gt;OUTPUT=outputdir &lt;/argument&gt;</pre>
+</div>
+<p>Layout of application directory in hdfs:</p>
+<div class="source"><pre>/examples/apps/pig/workflow.xml
+/examples/apps/pig/script.pig
+/examples/apps/pig/lib/pig-0.9.jar
+/examples/apps/pig/lib/tutorial.jar</pre>
+</div>
+</li>
+</ul>
+</li>
+</ul>
+</div>
+<div class="section"><h3>HOW TO USE THE OOZIE WEB-CONSOLE</h3>
+<p>The Oozie web console provides a way to view all submitted workflow and coordinator jobs in a browser. Each job can be examined in detail to reveal its job configuration, workflow definition and all the actions defined for it. The console can be accessed by visiting the URL used to submit the job, e.g. <a class="externalLink" href="http://localhost:4080/oozie">http://localhost:4080/oozie</a>.</p>
+<p><b>Note</b>: The web console is a read-only user interface; it cannot be used to submit a job or modify its status.</p>
+<p>Below are some screenshots describing how a job could be drilled down for further details using the web-console.</p>
+<p><i>All the jobs are listed in the grid with filters available above to view the desired job</i>.</p>
+<img src="images/step1.png" /><p><i>Clicking a job displays the job details and all actions defined under it</i>.</p>
+<img src="images/step2.png" /><p><i>Each action could be further drilled down by clicking on the browse icon beside the Console URL field</i>.</p>
+<img src="images/step3.png" /><p><i>Hadoop job logs can be viewed</i></p>
+<img src="images/step4.png" /><p><i>The map task of the launcher job can be accessed by clicking around</i></p>
+<img src="images/step5.png" /><p><i>Pig produces a sequence of Map-Reduce programs. The details of all these Map-Reduce jobs can be obtained through the task log files.</i></p>
+<img src="images/step6.png" /></div>
+<div class="section"><h3>FAQs</h3>
+<p>Question: How can one increase the memory for the Pig launcher job?</p>
+<p>Answer: You can define an <i>oozie.launcher.*</i> property in your action's <i>&lt;configuration&gt;</i> element:</p>
+<div class="source"><pre>        &lt;property&gt;
+                &lt;name&gt;oozie.launcher.mapred.child.java.opts&lt;/name&gt;
+                &lt;value&gt;-server -Xmx1G -Djava.net.preferIPv4Stack=true&lt;/value&gt;
+                &lt;description&gt;setting memory usage to 1024MB&lt;/description&gt;
+        &lt;/property&gt;</pre>
+</div>
+</div>
+</div>
+
+      </div>
+    </div>
+    <div class="clear">
+      <hr/>
+    </div>
+    <div id="footer">
+      <div class="xright">
+        &#169;            2012
+              Apache Software Foundation
+            
+                       - <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
+        Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
+      </div>
+      <div class="clear">
+        <hr/>
+      </div>
+    </div>
+  </body>
+</html>