Posted to commits@crunch.apache.org by bu...@apache.org on 2012/09/16 20:50:05 UTC

svn commit: r832180 [3/5] - in /websites/staging/crunch/trunk/content: ./ crunch/ crunch/css/ crunch/js/

Added: websites/staging/crunch/trunk/content/crunch/css/bootstrap-2.1.0.min.css
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/css/bootstrap-2.1.0.min.css (added)
+++ websites/staging/crunch/trunk/content/crunch/css/bootstrap-2.1.0.min.css Sun Sep 16 18:50:04 2012
@@ -0,0 +1,9 @@
+/*!
+ * Bootstrap v2.1.0
+ *
+ * Copyright 2012 Twitter, Inc
+ * Licensed under the Apache License v2.0
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Designed and built with all the love in the world @twitter by @mdo and @fat.

[... 2 lines stripped ...]
Added: websites/staging/crunch/trunk/content/crunch/css/crunch.css
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/css/crunch.css (added)
+++ websites/staging/crunch/trunk/content/crunch/css/crunch.css Sun Sep 16 18:50:04 2012
@@ -0,0 +1,4 @@
+.nav-list {
+  padding-left: 5px;
+  padding-right: 5px;
+}

Added: websites/staging/crunch/trunk/content/crunch/future-work.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/future-work.html (added)
+++ websites/staging/crunch/trunk/content/crunch/future-work.html Sun Sep 16 18:50:04 2012
@@ -0,0 +1,141 @@
+<!DOCTYPE html>
+
+
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
+  <head>
+    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <meta http-equiv="Content-Language" content="en" />
+
+    <title>Apache Crunch - Current Limitations and Future Work</title>
+
+    <link rel="stylesheet" href="/crunch/css/bootstrap-2.1.0.min.css" />
+    <link rel="stylesheet" href="/crunch/css/crunch.css" type="text/css">
+    <script type="text/javascript" src="/crunch/js/bootstrap-2.1.0.min.js"></script>
+  </head>
+  <body>
+
+    <div class="navbar navbar-inverse navbar-static-top">
+      
+        <div class="container-fluid">
+
+          <a class="nav pull-right brand" href="http://incubator.apache.org">
+            <img src="http://incubator.apache.org/images/egg-logo.png" alt="apache Incubator Logo" />
+          </a>
+
+        </div>
+      
+    </div>
+
+    <ul class="breadcrumb">
+      <li>
+        <a href="/">Incubator</a>
+	<span class="divider">&raquo;</span>
+      </li>
+      <li>
+        <a href="/crunch/">Crunch</a>
+      </li>
+      
+    </ul>
+
+    <div class="container-fluid">
+      <div class="row-fluid">
+
+        <!-- SIDEBAR AREA -->
+        <div class="span2">
+          <div class="sidebar-nav">
+            <ul class="nav nav-list">
+              
+                
+                  <li class="nav-header">Apache Crunch</li>
+                
+              
+                
+                  
+                    <li><a href="/crunch/index.html">Overview</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/apidocs/">API</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="https://cwiki.apache.org/confluence/display/CRUNCH/">Wiki</a></li>
+                  
+                
+              
+                
+                  <li class="nav-header">Project</li>
+                
+              
+                
+                  
+                    <li><a href="/crunch/source-repository.html">Source Code</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/mailing-lists.html">Mailing Lists</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://issues.apache.org/jira/browse/CRUNCH">Issue Tracking</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://apache.org/licenses/LICENSE-2.0.html">License</a></li>
+                  
+                
+              
+            </ul>
+          </div> <!-- /well -->
+        </div> <!-- /span -->
+
+        <!-- CONTENT AREA -->
+        <div class="span10">
+          <h1 class="title">
+            Current Limitations and Future Work
+            
+          </h1>
+
+          <p>This section contains an almost certainly incomplete list of known limitations of Crunch and plans for future work.</p>
+<ul>
+<li>We would like to have easy support for reading and writing data from/to HCatalog.</li>
+<li>The decision of how to split up processing tasks between dependent MapReduce jobs is very naive right now: we simply
+delegate all of the work to the reduce stage of the predecessor job. We should take advantage of information about the
+expected size of different PCollections to optimize this processing.</li>
+<li>The Crunch optimizer does not yet merge different groupByKey operations that run over the same input data into a single
+MapReduce job. Implementing this optimization will provide a major performance benefit for a number of problems.</li>
+</ul>
+        </div> <!-- /span -->
+
+      </div> <!-- /row-fluid -->
+
+    </div>
+
+    <hr/>
+
+    <footer>
+      <div class="container-fluid">
+        <div class="row span12">Copyright &copy; 2012
+          <a href="http://www.apache.org/">The Apache Software Foundation</a>,
+          licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+	  <p><small>Apache Incubator, Apache Hadoop, Hadoop, Apache, and the
+	  Apache feather logo are trademarks of The Apache Software Foundation.
+	  Other names appearing on the site may be trademarks of their
+	  respective owners.</small></p>
+        </div>
+      </div>
+    </footer>
+
+  </body>
+</html>

Modified: websites/staging/crunch/trunk/content/crunch/index.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/index.html (original)
+++ websites/staging/crunch/trunk/content/crunch/index.html Sun Sep 16 18:50:04 2012
@@ -1,56 +1,161 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
-<html lang="en">
-  <head>
-    <title>Home Page</title>
+<!DOCTYPE html>
 
-    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
-    <meta property="og:image" content="http://www.apache.org/images/asf_logo.gif" />
 
-    <link rel="stylesheet" type="text/css" media="screen" href="http://www.apache.org/css/style.css">
-    <link rel="stylesheet" type="text/css" media="screen" href="http://www.apache.org/css/code.css">
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
+  <head>
+    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <meta http-equiv="Content-Language" content="en" />
+
+    <title>Apache Crunch - Apache Crunch</title>
+
+    <link rel="stylesheet" href="/crunch/css/bootstrap-2.1.0.min.css" />
+    <link rel="stylesheet" href="/crunch/css/crunch.css" type="text/css">
+    <script type="text/javascript" src="/crunch/js/bootstrap-2.1.0.min.js"></script>
+  </head>
+  <body>
 
-    
+    <div class="navbar navbar-inverse navbar-static-top">
+      
+        <div class="container-fluid">
+
+          <a class="nav pull-right brand" href="http://incubator.apache.org">
+            <img src="http://incubator.apache.org/images/egg-logo.png" alt="apache Incubator Logo" />
+          </a>
 
-    
-    <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the &quot;License&quot;); you may not use this file except in compliance with the License.  You may obtain a copy of the License at . http://www.apache.org/licenses/LICENSE-2.0 . Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an &quot;AS IS&quot; BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License. -->
-  </head>
+        </div>
+      
+    </div>
 
-  <body>
-    <div id="page" class="container_16">
-      <div id="header" class="grid_8">
-        <img src="http://www.apache.org/images/feather-small.gif" alt="The Apache Software Foundation">
-        <h1>The Apache Software Foundation</h1>
-        <h2>Home Page</h2>
-      </div>
-      <div id="nav" class="grid_8">
-        <ul>
-          <!-- <li><a href="/" title="Welcome!">Home</a></li> -->
-          <li><a href="http://www.apache.org/foundation/" title="The Foundation">Foundation</a></li>
-          <li><a href="http://projects.apache.org" title="The Projects">Projects</a></li>
-          <li><a href="http://people.apache.org" title="The People">People</a></li>
-          <li><a href="http://www.apache.org/foundation/getinvolved.html" title="Get Involved">Get Involved</a></li>
-          <li><a href="http://www.apache.org/dyn/closer.cgi" title="Download">Download</a></li>
-          <li><a href="http://www.apache.org/foundation/sponsorship.html" title="Support Apache">Support Apache</a></li>
-        </ul>
-        <p><a href="/">Home</a>&nbsp;&raquo&nbsp;<a href="/crunch/">Crunch</a></p>
-        <form name="search" id="search" action="http://www.google.com/search" method="get">
-          <input value="*.apache.org" name="sitesearch" type="hidden"/>
-          <input type="text" name="q" id="query">
-          <input type="submit" id="submit" value="Search">
-        </form>
-      </div>
-      <div class="clear"></div>
-      <div id="content" class="grid_16"><div class="section-content"><h1 id="welcome">Welcome</h1>
-<p>Welcome to the Apache CMS.  Please see the following resources for further help:</p>
+    <ul class="breadcrumb">
+      <li>
+        <a href="/">Incubator</a>
+	<span class="divider">&raquo;</span>
+      </li>
+      <li>
+        <a href="/crunch/">Crunch</a>
+      </li>
+      
+    </ul>
+
+    <div class="container-fluid">
+      <div class="row-fluid">
+
+        <!-- SIDEBAR AREA -->
+        <div class="span2">
+          <div class="sidebar-nav">
+            <ul class="nav nav-list">
+              
+                
+                  <li class="nav-header">Apache Crunch</li>
+                
+              
+                
+                  
+                    <li><b>Overview</b></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/apidocs/">API</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="https://cwiki.apache.org/confluence/display/CRUNCH/">Wiki</a></li>
+                  
+                
+              
+                
+                  <li class="nav-header">Project</li>
+                
+              
+                
+                  
+                    <li><a href="/crunch/source-repository.html">Source Code</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/mailing-lists.html">Mailing Lists</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://issues.apache.org/jira/browse/CRUNCH">Issue Tracking</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://apache.org/licenses/LICENSE-2.0.html">License</a></li>
+                  
+                
+              
+            </ul>
+          </div> <!-- /well -->
+        </div> <!-- /span -->
+
+        <!-- CONTENT AREA -->
+        <div class="span10">
+          <h1 class="title">
+            Apache Crunch
+            
+              <small>Simple and Efficient MapReduce Pipelines</small>
+            
+          </h1>
+
+          <hr />
+<blockquote>
+<p><em>Apache Crunch (incubating)</em> is a Java library for writing, testing, and
+running MapReduce pipelines, based on Google's FlumeJava. Its goal is to make
+pipelines that are composed of many user-defined functions simple to write,
+easy to test, and efficient to run.</p>
+</blockquote>
+<hr />
+<p>Running on top of <a href="http://hadoop.apache.org/mapreduce/">Hadoop MapReduce</a>, Apache
+Crunch provides a simple Java API for tasks like joining and data aggregation
+that are tedious to implement on plain MapReduce. For Scala users, there is also
+Scrunch, an idiomatic Scala API to Crunch.</p>
+<h2 id="documentation">Documentation</h2>
 <ul>
-<li><a href="http://www.apache.org/dev/cmsref.html">http://www.apache.org/dev/cmsref.html</a></li>
-<li><a href="http://wiki.apache.org/general/ApacheCms2010">http://wiki.apache.org/general/ApacheCms2010</a></li>
-</ul></div></div>
-      <div class="clear"></div>
-    </div>
+<li><a href="intro.html">Introduction to Apache Crunch</a></li>
+<li><a href="scrunch.html">Introduction to Scrunch</a></li>
+<li><a href="future-work.html">Current Limitations and Future Work</a></li>
+</ul>
+<h2 id="disclaimer">Disclaimer</h2>
+<p>Apache Crunch is an effort undergoing incubation at <a href="http://apache.org/">The Apache Software Foundation
+(ASF)</a> sponsored by the <a href="http://incubator.apache.org/">Apache Incubator PMC</a>.
+Incubation is required of all newly accepted projects until a further review
+indicates that the infrastructure, communications, and decision making process
+have stabilized in a manner consistent with other successful ASF projects.
+While incubation status is not necessarily a reflection of the completeness or
+stability of the code, it does indicate that the project has yet to be fully
+endorsed by the ASF.</p>
+        </div> <!-- /span -->
+
+      </div> <!-- /row-fluid -->
 
-    <div id="copyright" class="container_16">
-      <p>Copyright &#169; 2011 The Apache Software Foundation, Licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.<br/>Apache and the Apache feather logo are trademarks of The Apache Software Foundation.</p>
     </div>
+
+    <hr/>
+
+    <footer>
+      <div class="container-fluid">
+        <div class="row span12">Copyright &copy; 2012
+          <a href="http://www.apache.org/">The Apache Software Foundation</a>,
+          licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+	  <p><small>Apache Incubator, Apache Hadoop, Hadoop, Apache, and the
+	  Apache feather logo are trademarks of The Apache Software Foundation.
+	  Other names appearing on the site may be trademarks of their
+	  respective owners.</small></p>
+        </div>
+      </div>
+    </footer>
+
   </body>
 </html>

Added: websites/staging/crunch/trunk/content/crunch/intro.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/intro.html (added)
+++ websites/staging/crunch/trunk/content/crunch/intro.html Sun Sep 16 18:50:04 2012
@@ -0,0 +1,298 @@
+<!DOCTYPE html>
+
+
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
+  <head>
+    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <meta http-equiv="Content-Language" content="en" />
+
+    <title>Apache Crunch - Introduction to Apache Crunch</title>
+
+    <link rel="stylesheet" href="/crunch/css/bootstrap-2.1.0.min.css" />
+    <link rel="stylesheet" href="/crunch/css/crunch.css" type="text/css">
+    <script type="text/javascript" src="/crunch/js/bootstrap-2.1.0.min.js"></script>
+  </head>
+  <body>
+
+    <div class="navbar navbar-inverse navbar-static-top">
+      
+        <div class="container-fluid">
+
+          <a class="nav pull-right brand" href="http://incubator.apache.org">
+            <img src="http://incubator.apache.org/images/egg-logo.png" alt="apache Incubator Logo" />
+          </a>
+
+        </div>
+      
+    </div>
+
+    <ul class="breadcrumb">
+      <li>
+        <a href="/">Incubator</a>
+	<span class="divider">&raquo;</span>
+      </li>
+      <li>
+        <a href="/crunch/">Crunch</a>
+      </li>
+      
+    </ul>
+
+    <div class="container-fluid">
+      <div class="row-fluid">
+
+        <!-- SIDEBAR AREA -->
+        <div class="span2">
+          <div class="sidebar-nav">
+            <ul class="nav nav-list">
+              
+                
+                  <li class="nav-header">Apache Crunch</li>
+                
+              
+                
+                  
+                    <li><a href="/crunch/index.html">Overview</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/apidocs/">API</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="https://cwiki.apache.org/confluence/display/CRUNCH/">Wiki</a></li>
+                  
+                
+              
+                
+                  <li class="nav-header">Project</li>
+                
+              
+                
+                  
+                    <li><a href="/crunch/source-repository.html">Source Code</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/mailing-lists.html">Mailing Lists</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://issues.apache.org/jira/browse/CRUNCH">Issue Tracking</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://apache.org/licenses/LICENSE-2.0.html">License</a></li>
+                  
+                
+              
+            </ul>
+          </div> <!-- /well -->
+        </div> <!-- /span -->
+
+        <!-- CONTENT AREA -->
+        <div class="span10">
+          <h1 class="title">
+            Introduction to Apache Crunch
+            
+          </h1>
+
+          <h2 id="build-and-installation">Build and Installation</h2>
+<p>To use Crunch you first have to build the source code using Maven and install
+it in your local repository:</p>
+<div class="codehilite"><pre><span class="n">mvn</span> <span class="n">clean</span> <span class="n">install</span>
+</pre></div>
+
+
+<p>This also runs the integration test suite, which will take a while. Afterwards
+you can run the bundled example applications:</p>
+<div class="codehilite"><pre><span class="n">hadoop</span> <span class="n">jar</span> <span class="n">examples</span><span class="sr">/target/c</span><span class="n">runch</span><span class="o">-</span><span class="n">examples</span><span class="o">-*-</span><span class="n">job</span><span class="o">.</span><span class="n">jar</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">examples</span><span class="o">.</span><span class="n">WordCount</span> <span class="sr">&lt;inputfile&gt;</span> <span class="sr">&lt;outputdir&gt;</span>
+</pre></div>
+
+
+<h2 id="high-level-concepts">High Level Concepts</h2>
+<h3 id="data-model-and-operators">Data Model and Operators</h3>
+<p>Crunch is centered around three interfaces that represent distributed datasets: <code>PCollection&lt;T&gt;</code>, <code>PTable&lt;K, V&gt;</code>, and <code>PGroupedTable&lt;K, V&gt;</code>.</p>
+<p>A <code>PCollection&lt;T&gt;</code> represents a distributed, unordered collection of elements of type T. For example, we represent a text file in Crunch as a
+<code>PCollection&lt;String&gt;</code> object. PCollection provides a method, <code>parallelDo</code>, that applies a function to each element in a PCollection in parallel,
+and returns a new PCollection as its result.</p>
+<p>A <code>PTable&lt;K, V&gt;</code> is a sub-interface of PCollection that represents a distributed, unordered multimap of its key type K to its value type V.
+In addition to the parallelDo operation, PTable provides a <code>groupByKey</code> operation that aggregates all of the values in the PTable that
+have the same key into a single record. It is the groupByKey operation that triggers the sort phase of a MapReduce job.</p>
+<p>The result of a groupByKey operation is a <code>PGroupedTable&lt;K, V&gt;</code> object, which is a distributed, sorted map of keys of type K to an Iterable
+collection of values of type V. In addition to parallelDo, the PGroupedTable provides a <code>combineValues</code> operation, which allows for
+a commutative and associative aggregation operator to be applied to the values of the PGroupedTable instance on both the map side and the
+reduce side of a MapReduce job.</p>
+<p>Finally, PCollection, PTable, and PGroupedTable all support a <code>union</code> operation, which takes a series of distinct PCollections and treats
+them as a single, virtual PCollection. The union operator is required for operations that combine multiple inputs, such as cogroups and
+joins.</p>
+<h3 id="pipeline-building-and-execution">Pipeline Building and Execution</h3>
+<p>Every Crunch pipeline starts with a <code>Pipeline</code> object that is used to coordinate building the pipeline and executing the underlying MapReduce
+jobs. For efficiency, Crunch uses lazy evaluation, so it will only construct MapReduce jobs from the different stages of the pipeline when
+the Pipeline object's <code>run</code> or <code>done</code> methods are called.</p>
+<h2 id="a-detailed-example">A Detailed Example</h2>
+<p>Here is the classic WordCount application using Crunch:</p>
+<div class="codehilite"><pre><span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">DoFn</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">Emitter</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">PCollection</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">PTable</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">Pipeline</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">impl</span><span class="o">.</span><span class="n">mr</span><span class="o">.</span><span class="n">MRPipeline</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">lib</span><span class="o">.</span><span class="n">Aggregate</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">types</span><span class="o">.</span><span class="n">writable</span><span class="o">.</span><span class="n">Writables</span><span class="p">;</span>
+
+<span class="n">public</span> <span class="n">class</span> <span class="n">WordCount</span> <span class="p">{</span>
+  <span class="n">public</span> <span class="n">static</span> <span class="n">void</span> <span class="n">main</span><span class="p">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="p">)</span> <span class="n">throws</span> <span class="n">Exception</span> <span class="p">{</span>
+    <span class="n">Pipeline</span> <span class="n">pipeline</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MRPipeline</span><span class="p">(</span><span class="n">WordCount</span><span class="o">.</span><span class="n">class</span><span class="p">);</span>
+    <span class="n">PCollection</span><span class="sr">&lt;String&gt;</span> <span class="n">lines</span> <span class="o">=</span> <span class="n">pipeline</span><span class="o">.</span><span class="n">readTextFile</span><span class="p">(</span><span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
+
+    <span class="n">PCollection</span><span class="sr">&lt;String&gt;</span> <span class="n">words</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">parallelDo</span><span class="p">(</span><span class="s">&quot;my splitter&quot;</span><span class="p">,</span> <span class="k">new</span> <span class="n">DoFn</span><span class="o">&lt;</span><span class="n">String</span><span class="p">,</span> <span class="n">String</span><span class="o">&gt;</span><span class="p">()</span> <span class="p">{</span>
+      <span class="n">public</span> <span class="n">void</span> <span class="n">process</span><span class="p">(</span><span class="n">String</span> <span class="n">line</span><span class="p">,</span> <span class="n">Emitter</span><span class="sr">&lt;String&gt;</span> <span class="n">emitter</span><span class="p">)</span> <span class="p">{</span>
+        <span class="k">for</span> <span class="p">(</span><span class="n">String</span> <span class="n">word</span> <span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="nb">split</span><span class="p">(</span><span class="s">&quot;\\s+&quot;</span><span class="p">))</span> <span class="p">{</span>
+          <span class="n">emitter</span><span class="o">.</span><span class="n">emit</span><span class="p">(</span><span class="n">word</span><span class="p">);</span>
+        <span class="p">}</span>
+      <span class="p">}</span>
+    <span class="p">},</span> <span class="n">Writables</span><span class="o">.</span><span class="n">strings</span><span class="p">());</span>
+
+    <span class="n">PTable</span><span class="o">&lt;</span><span class="n">String</span><span class="p">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="n">counts</span> <span class="o">=</span> <span class="n">Aggregate</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="n">words</span><span class="p">);</span>
+
+    <span class="n">pipeline</span><span class="o">.</span><span class="n">writeTextFile</span><span class="p">(</span><span class="n">counts</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>
+    <span class="n">pipeline</span><span class="o">.</span><span class="n">run</span><span class="p">();</span>
+  <span class="p">}</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<p>Let's walk through the example line by line.</p>
+<h3 id="step-1-creating-a-pipeline-and-referencing-a-text-file">Step 1: Creating a Pipeline and referencing a text file</h3>
+<p>The <code>MRPipeline</code> implementation of the Pipeline interface compiles the individual stages of a
+pipeline into a series of MapReduce jobs. The MRPipeline constructor takes a class argument
+that is used to tell Hadoop where to find the code that is used in the pipeline execution.</p>
+<p>We now need to tell the Pipeline about the inputs it will be consuming. The Pipeline interface
+defines a <code>readTextFile</code> method that takes in a String and returns a PCollection of Strings.
+In addition to text files, Crunch supports reading data from SequenceFiles and Avro container files,
+via the <code>SequenceFileSource</code> and <code>AvroFileSource</code> classes defined in the org.apache.crunch.io package.</p>
+<p>Note that each PCollection is a <em>reference</em> to a source of data; no data is actually loaded into a
+PCollection on the client machine.</p>
+<h3 id="step-2-splitting-the-lines-of-text-into-words">Step 2: Splitting the lines of text into words</h3>
+<p>Crunch defines a small set of primitive operations that can be composed in order to build complex data
+pipelines. The first of these primitives is the <code>parallelDo</code> function, which applies a function (defined
+by a subclass of <code>DoFn</code>) to every record in a PCollection, and returns a new PCollection that contains
+the results.</p>
+<p>The first argument to parallelDo is a string that is used to identify this step in the pipeline. When
+a pipeline is compiled into a series of MapReduce jobs, it is often the case that multiple stages will
+run within the same Mapper or Reducer. Having a string that identifies each processing step is useful
+for debugging errors that occur in a running pipeline.</p>
+<p>The second argument to parallelDo is an anonymous subclass of DoFn. Each DoFn subclass must override
+the <code>process</code> method, which takes in a record from the input PCollection and an <code>Emitter</code> object that
+may have any number of output values written to it. In this case, our DoFn splits each line up into
+words, using runs of whitespace as the separator, and emits the words from the split to the output PCollection.</p>
+<p>The last argument to parallelDo is an instance of the <code>PType</code> interface, which specifies how the data
+in the output PCollection is serialized. While Crunch takes advantage of Java Generics to provide
+compile-time type safety, the generic type information is not available at runtime. Crunch needs to know
+how to map the records stored in each PCollection into a Hadoop-supported serialization format in order
+to read and write data to disk. Two serialization implementations are supported in Crunch via the
+<code>PTypeFamily</code> interface: a Writable-based system that is defined in the org.apache.crunch.types.writable
+package, and an Avro-based system that is defined in the org.apache.crunch.types.avro package. Each
+implementation provides convenience methods for working with the common PTypes (Strings, longs, bytes, etc.)
+as well as utility methods for creating PTypes from existing Writable classes or Avro schemas.</p>
+<h3 id="step-3-counting-the-words">Step 3: Counting the words</h3>
+<p>Out of Crunch's simple primitive operations, we can build arbitrarily complex chains of operations in order
+to perform higher-level operations, like aggregations and joins, that can work on any type of input data.
+Let's look at the implementation of the <code>Aggregate.count</code> function:</p>
+<div class="codehilite"><pre><span class="nb">package</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">lib</span><span class="p">;</span>
+
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">CombineFn</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">MapFn</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">PCollection</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">PGroupedTable</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">PTable</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">Pair</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">crunch</span><span class="o">.</span><span class="n">types</span><span class="o">.</span><span class="n">PTypeFamily</span><span class="p">;</span>
+
+<span class="n">public</span> <span class="n">class</span> <span class="n">Aggregate</span> <span class="p">{</span>
+
+  <span class="n">private</span> <span class="n">static</span> <span class="n">class</span> <span class="n">Counter</span><span class="sr">&lt;S&gt;</span> <span class="n">extends</span> <span class="n">MapFn</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">Pair</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">Long</span><span class="o">&gt;&gt;</span> <span class="p">{</span>
+    <span class="n">public</span> <span class="n">Pair</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="nb">map</span><span class="p">(</span><span class="n">S</span> <span class="n">input</span><span class="p">)</span> <span class="p">{</span>
+      <span class="k">return</span> <span class="n">Pair</span><span class="o">.</span><span class="n">of</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="mi">1</span><span class="n">L</span><span class="p">);</span>
+    <span class="p">}</span>
+  <span class="p">}</span>
+
+  <span class="n">public</span> <span class="n">static</span> <span class="sr">&lt;S&gt;</span> <span class="n">PTable</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="n">count</span><span class="p">(</span><span class="n">PCollection</span><span class="sr">&lt;S&gt;</span> <span class="n">collect</span><span class="p">)</span> <span class="p">{</span>
+    <span class="n">PTypeFamily</span> <span class="n">tf</span> <span class="o">=</span> <span class="n">collect</span><span class="o">.</span><span class="n">getTypeFamily</span><span class="p">();</span>
+
+    <span class="sr">//</span> <span class="n">Create</span> <span class="n">a</span> <span class="n">PTable</span> <span class="n">from</span> <span class="n">the</span> <span class="n">PCollection</span> <span class="n">by</span> <span class="n">mapping</span> <span class="nb">each</span> <span class="n">element</span>
+    <span class="sr">//</span> <span class="n">to</span> <span class="n">a</span> <span class="n">key</span> <span class="n">of</span> <span class="n">the</span> <span class="n">PTable</span> <span class="n">with</span> <span class="n">the</span> <span class="n">value</span> <span class="n">equal</span> <span class="n">to</span> <span class="mi">1</span><span class="n">L</span>
+    <span class="n">PTable</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="n">withCounts</span> <span class="o">=</span> <span class="n">collect</span><span class="o">.</span><span class="n">parallelDo</span><span class="p">(</span><span class="s">&quot;count:&quot;</span> <span class="o">+</span> <span class="n">collect</span><span class="o">.</span><span class="n">getName</span><span class="p">(),</span>
+        <span class="k">new</span> <span class="n">Counter</span><span class="sr">&lt;S&gt;</span><span class="p">(),</span> <span class="n">tf</span><span class="o">.</span><span class="n">tableOf</span><span class="p">(</span><span class="n">collect</span><span class="o">.</span><span class="n">getPType</span><span class="p">(),</span> <span class="n">tf</span><span class="o">.</span><span class="n">longs</span><span class="p">()));</span>
+
+    <span class="sr">//</span> <span class="n">Group</span> <span class="n">the</span> <span class="n">records</span> <span class="n">of</span> <span class="n">the</span> <span class="n">PTable</span> <span class="n">based</span> <span class="n">on</span> <span class="n">their</span> <span class="n">key</span><span class="o">.</span>
+    <span class="n">PGroupedTable</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="n">grouped</span> <span class="o">=</span> <span class="n">withCounts</span><span class="o">.</span><span class="n">groupByKey</span><span class="p">();</span>
+
+    <span class="sr">//</span> <span class="n">Sum</span> <span class="n">the</span> <span class="mi">1</span><span class="n">L</span> <span class="nb">values</span> <span class="n">associated</span> <span class="n">with</span> <span class="n">the</span> <span class="nb">keys</span> <span class="n">to</span> <span class="n">get</span> <span class="n">the</span>
+    <span class="sr">//</span> <span class="n">count</span> <span class="n">of</span> <span class="nb">each</span> <span class="n">element</span> <span class="n">in</span> <span class="n">this</span> <span class="n">PCollection</span><span class="p">,</span> <span class="ow">and</span> <span class="k">return</span> <span class="n">it</span>
+    <span class="sr">//</span> <span class="n">as</span> <span class="n">a</span> <span class="n">PTable</span> <span class="n">so</span> <span class="n">that</span> <span class="n">it</span> <span class="n">may</span> <span class="n">be</span> <span class="n">processed</span> <span class="n">further</span> <span class="ow">or</span> <span class="n">written</span>
+    <span class="sr">//</span> <span class="n">out</span> <span class="k">for</span> <span class="n">storage</span><span class="o">.</span>
+    <span class="k">return</span> <span class="n">grouped</span><span class="o">.</span><span class="n">combineValues</span><span class="p">(</span><span class="n">CombineFn</span><span class="o">.</span><span class="sr">&lt;S&gt;</span><span class="n">SUM_LONGS</span><span class="p">());</span>
+  <span class="p">}</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<p>First, we get the PTypeFamily that is associated with the PType for the collection. The
+call to parallelDo converts each record in this PCollection into a Pair of the input record
+and the number one by extending the <code>MapFn</code> convenience subclass of DoFn, and uses the
+<code>tableOf</code> method of the PTypeFamily to specify that the returned PCollection should be a
+PTable instance, with the key being the PType of the PCollection and the value being the Long
+implementation for this PTypeFamily.</p>
+<p>The next line features the second of Crunch's four operations, <code>groupByKey</code>. The groupByKey
+operation may only be applied to a PTable, and returns an instance of the <code>PGroupedTable</code>
+interface, which references the grouping of all of the values in the PTable that have the same key.
+The groupByKey operation is what triggers the reduce phase of a MapReduce job within Crunch.</p>
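+<p>Conceptually, grouping by key collects every value that shares a key into one group. The sketch below
+shows that idea on an in-memory list using only the standard library. The <code>GroupSketch</code> class and its
+<code>groupByKey</code> method are hypothetical illustrations, not Crunch's implementation, which performs the
+grouping through the MapReduce shuffle rather than in a single process.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupSketch {
  // Collects the values of all (key, value) entries that share a key.
  // A TreeMap is used so that keys come back in sorted order.
  static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> table) {
    Map<K, List<V>> grouped = new TreeMap<>();
    for (Map.Entry<K, V> entry : table) {
      grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
    }
    return grouped;
  }
}
```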
+<p>The last line in the function returns the output of the third of Crunch's four operations,
+<code>combineValues</code>. The combineValues operator takes a <code>CombineFn</code> as an argument, which is a
+specialized subclass of DoFn that operates on an implementation of Java's Iterable interface. The
+use of combineValues (as opposed to parallelDo) signals to Crunch that the CombineFn may be used to
+aggregate values for the same key on the map side of a MapReduce job as well as the reduce side.</p>
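+<p>Map-side aggregation is safe for a function like summing because partial sums can be merged with the
+same function without changing the result. The sketch below demonstrates this in plain Java. The
+<code>CombineSketch</code> class and its two methods are hypothetical names used for illustration and are not part
+of Crunch.</p>

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CombineSketch {
  // "Map side": sum a count of 1 for each word within one chunk of input.
  static Map<String, Long> partialCounts(List<String> words) {
    Map<String, Long> counts = new HashMap<>();
    for (String word : words) {
      counts.merge(word, 1L, Long::sum);
    }
    return counts;
  }

  // "Reduce side": merge the partial results with the same summing function.
  static Map<String, Long> mergeCounts(List<Map<String, Long>> partials) {
    Map<String, Long> total = new HashMap<>();
    for (Map<String, Long> partial : partials) {
      for (Map.Entry<String, Long> entry : partial.entrySet()) {
        total.merge(entry.getKey(), entry.getValue(), Long::sum);
      }
    }
    return total;
  }
}
```

+<p>Counting two chunks separately and merging them gives the same totals as counting all of the words
+at once, which is the property that lets the same function run on both sides of the job.</p>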
+<h3 id="step-4-writing-the-output-and-running-the-pipeline">Step 4: Writing the output and running the pipeline</h3>
+<p>The Pipeline object also provides a <code>writeTextFile</code> convenience method for indicating that a
+PCollection should be written to a text file. There are also output targets for SequenceFiles and
+Avro container files, available in the org.apache.crunch.io package.</p>
+<p>After you are finished constructing a pipeline and specifying the output destinations, call the
+pipeline's blocking <code>run</code> method in order to compile the pipeline into one or more MapReduce
+jobs and execute them.</p>
+<h2 id="more-information">More Information</h2>
+<p><a href="pipelines.html">Writing Your Own Pipelines</a></p>
+        </div> <!-- /span -->
+
+      </div> <!-- /row-fluid -->
+
+    </div>
+
+    <hr/>
+
+    <footer>
+      <div class="container-fluid">
+        <div class="row span12">Copyright &copy; 2012
+          <a href="http://www.apache.org/">The Apache Software Foundation</a>,
+          licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+	  <p><small>Apache Incubator, Apache Hadoop, Hadoop, Apache, and the
+	  Apache feather logo are trademarks of The Apache Software Foundation.
+	  Other names appearing on the site may be trademarks of their
+	  respective owners.</small></p>
+        </div>
+      </div>
+    </footer>
+
+  </body>
+</html>