Posted to commits@crunch.apache.org by jw...@apache.org on 2013/11/26 02:28:17 UTC
svn commit: r1545498 - /crunch/site/trunk/content/intro.mdtext

Author: jwills
Date: Tue Nov 26 01:28:17 2013
New Revision: 1545498

URL: http://svn.apache.org/r1545498
Log:
Add details on PType serialization

Modified:
    crunch/site/trunk/content/intro.mdtext

Modified: crunch/site/trunk/content/intro.mdtext
URL: http://svn.apache.org/viewvc/crunch/site/trunk/content/intro.mdtext?rev=1545498&r1=1545497&r2=1545498&view=diff
==============================================================================
--- crunch/site/trunk/content/intro.mdtext (original)
+++ crunch/site/trunk/content/intro.mdtext Tue Nov 26 01:28:17 2013
@@ -135,8 +135,8 @@ we cannot know at runtime what type of d
 us with an object that contains this information: in our example word count application, the object that tells us that we are working with strings is
 returned by the `Writables.strings()` static method that is the third argument to the `parallelDo` function in `countWords`. Every `DoFn` instance must
 return a type that has an associated object, called a `PType<T>`, that contains instructions for how to serialize the data returned by that `DoFn`. By default, Crunch
-supports two serialization frameworks, called _type families_: one based on Hadoop's `Writable` interface, and another based on `Apache Avro`.
-You can read more about how to work with Crunch's serialization libraries here. TODO
+supports two serialization frameworks, called _type families_: one based on Hadoop's `Writable` interface, and another based on `Apache Avro`. Details
+on the type families appear in the "Serializing Data with PTypes" section of this document.
 
 Because all of the core logic in our application is exposed via a single static method that operates on Crunch interfaces, we can use Crunch's
 in-memory API to test our business logic using a unit testing framework like JUnit. Let's look at an example unit test for the word count
@@ -307,7 +307,34 @@ interface defined via static factory met
 
 ### Serializing Data with PTypes
 
-Why PTypes Are Necessary, the two type families, the core methods and tuples.
+Every `PCollection<T>` has an associated `PType<T>` that encapsulates the information on how to serialize and deserialize the contents of that
+PCollection. PTypes are necessary because of [type erasure](http://docs.oracle.com/javase/tutorial/java/generics/erasure.html); at runtime, when
+the Crunch planner is mapping from PCollections to a series of MapReduce jobs, the type of a PCollection (that is, the `T` in `PCollection<T>`)
+is no longer available to us, and must be provided by the associated PType instance.
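The type-erasure point in the paragraph above can be seen in plain Java, without Crunch on the classpath. The sketch below is only an illustration: `ErasureDemo` and `TypeTag` are hypothetical names, and `TypeTag` is a deliberately tiny stand-in for the role a PType plays, not Crunch's implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    /** Hypothetical stand-in for a PType: it only remembers the element class. */
    static final class TypeTag<T> {
        private final Class<T> elementClass;

        TypeTag(Class<T> elementClass) {
            this.elementClass = elementClass;
        }

        Class<T> elementClass() {
            return elementClass;
        }
    }

    /** Compares the runtime classes of two lists; type arguments play no part. */
    static boolean sameRuntimeClass(List<?> a, List<?> b) {
        return a.getClass() == b.getClass();
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<String>();
        List<Long> counts = new ArrayList<Long>();

        // Both lists are plain ArrayList instances at runtime, so this prints "true".
        System.out.println(sameRuntimeClass(words, counts));

        // The element type survives only because it was passed in explicitly.
        TypeTag<String> tag = new TypeTag<String>(String.class);
        System.out.println(tag.elementClass().getName()); // prints "java.lang.String"
    }
}
```

Because the `String` and `Long` type arguments are gone at runtime, anything that needs the element type later has to receive it through an explicit object, which is the gap a `PType<T>` fills for a `PCollection<T>`.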
+
+Crunch supports two independent _type families_, each of which implements the [PTypeFamily](apidocs/0.8.0/org/apache/crunch/types/PTypeFamily.html) interface:
+one for Hadoop's [Writable interface](apidocs/0.8.0/org/apache/crunch/types/writable/WritableTypeFamily.html) and another based on
+[Apache Avro](apidocs/0.8.0/org/apache/crunch/types/avro/AvroTypeFamily.html). There are also classes that contain static factory methods for
+each PTypeFamily to allow for easy import and usage: one for [Writables](apidocs/0.8.0/org/apache/crunch/types/writable/Writables.html) and one for
+[Avros](apidocs/0.8.0/org/apache/crunch/types/avro/Avros.html).
+
+The two type families exist for historical reasons: Writables have long been the standard form for representing serializable data in Hadoop,
+but the Avro-based serialization scheme is compact and fast, and it allows complex record schemas to evolve over time. It's fine (and even encouraged)
+to mix and match PCollections that use different PTypes in the same Crunch pipeline (e.g., you could
+read in Writable data, do a shuffle using Avro, and then write the output data as Writables), but each PCollection's PType must belong to a single
+type family; for example, you cannot have a PTable whose key is serialized as a Writable and whose value is serialized as an Avro record.
+
+#### Core PTypes
+
+Both type families support a common set of primitive types (strings, longs, ints, floats, doubles, booleans, and bytes) as well as more complex
+PTypes that can be constructed out of other PTypes:
+
+1. Tuples of other PTypes (`pairs`, `triples`, `quads`, and `tuples` for arbitrary N),
+2. Collections of other PTypes (`collections` to create a `Collection<T>` and `maps` to return a `Map<String, T>`), and
+3. `tableOf`, which constructs a `PTableType<K, V>`, the PType used to distinguish a `PTable<K, V>` from a `PCollection<Pair<K, V>>`.
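The idea that complex PTypes are built out of simpler ones can be sketched in plain Java as a small set of serializer combinators. Everything here is an assumption made for illustration: `CodecSketch`, `Codec`, and `roundTrip` are hypothetical names, `Map.Entry` stands in for Crunch's `Pair`, and none of this is Crunch's actual API or wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

public class CodecSketch {
    /** Hypothetical stand-in for a PType: knows how to write and read one value. */
    interface Codec<T> {
        void write(T value, DataOutput out) throws IOException;

        T read(DataInput in) throws IOException;
    }

    /** Primitive codec for strings. */
    static Codec<String> strings() {
        return new Codec<String>() {
            public void write(String value, DataOutput out) throws IOException {
                out.writeUTF(value);
            }

            public String read(DataInput in) throws IOException {
                return in.readUTF();
            }
        };
    }

    /** Primitive codec for longs. */
    static Codec<Long> longs() {
        return new Codec<Long>() {
            public void write(Long value, DataOutput out) throws IOException {
                out.writeLong(value);
            }

            public Long read(DataInput in) throws IOException {
                return in.readLong();
            }
        };
    }

    /** Composite codec: writes the first component, then the second. */
    static <A, B> Codec<Map.Entry<A, B>> pairs(final Codec<A> first, final Codec<B> second) {
        return new Codec<Map.Entry<A, B>>() {
            public void write(Map.Entry<A, B> value, DataOutput out) throws IOException {
                first.write(value.getKey(), out);
                second.write(value.getValue(), out);
            }

            public Map.Entry<A, B> read(DataInput in) throws IOException {
                A a = first.read(in);
                B b = second.read(in);
                return new SimpleImmutableEntry<A, B>(a, b);
            }
        };
    }

    /** Serializes a value to bytes and reads it back with the same codec. */
    static <T> T roundTrip(Codec<T> codec, T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            codec.write(value, new DataOutputStream(bytes));
            return codec.read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Codec<Map.Entry<String, Long>> wordCount = pairs(strings(), longs());
        Map.Entry<String, Long> copy =
            roundTrip(wordCount, new SimpleImmutableEntry<String, Long>("crunch", 3L));
        System.out.println(copy.getKey() + "=" + copy.getValue()); // prints "crunch=3"
    }
}
```

The composite codec never inspects its components' types; it only delegates to the codecs it was given, which is why the same combinator works for any pair of element types.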
+
+Both type families also have additional methods for working with records that are specific to each serialization format (for example, the
+AvroTypeFamily contains methods to support Generic and Specific records as well as Avro's reflection-based serialization).
 
 #### Extending PTypes