Posted to commits@crunch.apache.org by bu...@apache.org on 2013/11/26 02:28:28 UTC

svn commit: r888090 - in /websites/staging/crunch/trunk/content: ./ intro.html

Author: buildbot
Date: Tue Nov 26 01:28:28 2013
New Revision: 888090

Log:
Staging update by buildbot for crunch

Modified:
    websites/staging/crunch/trunk/content/   (props changed)
    websites/staging/crunch/trunk/content/intro.html

Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Tue Nov 26 01:28:28 2013
@@ -1 +1 @@
-1545156
+1545498

Modified: websites/staging/crunch/trunk/content/intro.html
==============================================================================
--- websites/staging/crunch/trunk/content/intro.html (original)
+++ websites/staging/crunch/trunk/content/intro.html Tue Nov 26 01:28:28 2013
@@ -247,8 +247,8 @@ we cannot know at runtime what type of d
 us with an object that contains this information: in our example word count application, the object that tells us that we are working with strings is
 returned by the <code>Writables.strings()</code> static method that is the third argument to the <code>parallelDo</code> function in <code>countWords</code>. Every <code>DoFn</code> instance must
 return a type that has an associated object, called a <code>PType&lt;T&gt;</code>, that contains instructions for how to serialize the data returned by that <code>DoFn</code>. By default, Crunch
-supports two serialization frameworks, called <em>type families</em>: one based on Hadoop's <code>Writable</code> interface, and another based on <code>Apache Avro</code>.
-You can read more about how to work with Crunch's serialization libraries here. TODO</p>
+supports two serialization frameworks, called <em>type families</em>: one based on Hadoop's <code>Writable</code> interface, and another based on <code>Apache Avro</code>. Details
+on the type families are contained in the section on "Serializing Data with PTypes" in this document.</p>
 <p>Because all of the core logic in our application is exposed via a single static method that operates on Crunch interfaces, we can use Crunch's
 in-memory API to test our business logic using a unit testing framework like JUnit. Let's look at an example unit test for the word count
 application:</p>
@@ -390,7 +390,30 @@ contained in this class satisfies the co
 interface, which is defined right alongside the CombineFn class in the top-level <code>org.apache.crunch</code> package. There are a number of implementations of the Aggregator
 interface defined via static factory methods in the <a href="apidocs/0.8.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a> class.</p>
 <h3 id="serializing-data-with-ptypes">Serializing Data with PTypes</h3>
-<p>Why PTypes Are Necessary, the two type families, the core methods and tuples.</p>
+<p>Every <code>PCollection&lt;T&gt;</code> has an associated <code>PType&lt;T&gt;</code> that encapsulates the information on how to serialize and deserialize the contents of that
+PCollection. PTypes are necessary because of <a href="http://docs.oracle.com/javase/tutorial/java/generics/erasure.html">type erasure</a>; at runtime, when
+the Crunch planner is mapping from PCollections to a series of MapReduce jobs, the type of a PCollection (that is, the <code>T</code> in <code>PCollection&lt;T&gt;</code>)
+is no longer available to us, and must be provided by the associated PType instance.</p>
+<p>Crunch supports two independent <em>type families</em>, each of which implements the <a href="apidocs/0.8.0/org/apache/crunch/types/PTypeFamily.html">PTypeFamily</a> interface:
+one for Hadoop's <a href="apidocs/0.8.0/org/apache/crunch/types/writable/WritableTypeFamily.html">Writable interface</a> and another based on
+<a href="apidocs/0.8.0/org/apache/crunch/types/avro/AvroTypeFamily.html">Apache Avro</a>. There are also classes that contain static factory methods for
+each PTypeFamily to allow for easy import and usage: one for <a href="apidocs/0.8.0/org/apache/crunch/types/writable/Writables.html">Writables</a> and one for
+<a href="apidocs/0.8.0/org/apache/crunch/types/avro/Avros.html">Avros</a>.</p>
+<p>The two different type families exist for historical reasons: Writables have long been the standard form for representing serializable data in Hadoop,
+but the Avro-based serialization scheme is compact and fast, and allows complex record schemas to evolve over time. It's fine (and even encouraged)
+to mix and match PCollections that use different PTypes in the same Crunch pipeline (e.g., you could
+read in Writable data, do a shuffle using Avro, and then write the output data as Writables), but each PCollection's PType must belong to a single
+type family; for example, you cannot have a PTable whose key is serialized as a Writable and whose value is serialized as an Avro record.</p>
+<h4 id="core-ptypes">Core PTypes</h4>
+<p>Both type families support a common set of primitive types (strings, longs, ints, floats, doubles, booleans, and bytes) as well as more complex
+PTypes that can be constructed out of other PTypes:</p>
+<ol>
+<li>Tuples of other PTypes (<code>pairs</code>, <code>trips</code>, <code>quads</code>, and <code>tuples</code> for arbitrary N),</li>
+<li>Collections of other PTypes (<code>collections</code> to create a <code>Collection&lt;T&gt;</code> and <code>maps</code> to return a <code>Map&lt;String, T&gt;</code>),</li>
+<li>and <code>tableOf</code> to construct a <code>PTableType&lt;K, V&gt;</code>, the PType used to distinguish a <code>PTable&lt;K, V&gt;</code> from a <code>PCollection&lt;Pair&lt;K, V&gt;&gt;</code>.</li>
+</ol>
+<p>Both of the type families have additional methods for working with records that are specific to each serialization format (for example, the
+AvroTypeFamily contains methods to support Generic and Specific records as well as Avro's reflection-based serialization).</p>
 <h4 id="extending-ptypes">Extending PTypes</h4>
 <p>The simplest way to create a new <code>PType&lt;T&gt;</code> for a data object is to create a <em>derived</em> PType from one of the built-in PTypes for the Avro
 and Writable type families. If we have a base <code>PType&lt;S&gt;</code>, we can create a derived <code>PType&lt;T&gt;</code> by implementing an input <code>MapFn&lt;S, T&gt;</code> and an