Posted to commits@crunch.apache.org by jw...@apache.org on 2012/12/14 02:27:35 UTC

svn commit: r1421632 - in /incubator/crunch/site/trunk/content/crunch: download.mdtext future-work.mdtext getting-started.mdtext index.mdtext intro.mdtext mailing-lists.mdtext pipelines.mdtext scrunch.mdtext source-repository.mdtext

Author: jwills
Date: Fri Dec 14 01:27:33 2012
New Revision: 1421632

URL: http://svn.apache.org/viewvc?rev=1421632&view=rev
Log:
Trademarkification of the website

Modified:
    incubator/crunch/site/trunk/content/crunch/download.mdtext
    incubator/crunch/site/trunk/content/crunch/future-work.mdtext
    incubator/crunch/site/trunk/content/crunch/getting-started.mdtext
    incubator/crunch/site/trunk/content/crunch/index.mdtext
    incubator/crunch/site/trunk/content/crunch/intro.mdtext
    incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext
    incubator/crunch/site/trunk/content/crunch/pipelines.mdtext
    incubator/crunch/site/trunk/content/crunch/scrunch.mdtext
    incubator/crunch/site/trunk/content/crunch/source-repository.mdtext

Modified: incubator/crunch/site/trunk/content/crunch/download.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/download.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/download.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/download.mdtext Fri Dec 14 01:27:33 2012
@@ -16,7 +16,7 @@ Notice:   Licensed to the Apache Softwar
           specific language governing permissions and limitations
           under the License.
 
-Apache Crunch is distributed under the [Apache License 2.0][license].
+The Apache Crunch (incubating) libraries are distributed under the [Apache License 2.0][license].
 
 The link in the Download column takes you to a list of mirrors based on
 your location. Checksum and signature are located on Apache's main

Modified: incubator/crunch/site/trunk/content/crunch/future-work.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/future-work.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/future-work.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/future-work.mdtext Fri Dec 14 01:27:33 2012
@@ -16,11 +16,9 @@ Notice:   Licensed to the Apache Softwar
           specific language governing permissions and limitations
           under the License.
 
-This section contains an almost certainly incomplete list of known limitations of Crunch and plans for future work.
+This section contains an almost certainly incomplete list of known limitations and plans for future work.
 
-* We would like to have easy support for reading and writing data from/to HCatalog.
-* The decision of how to split up processing tasks between dependent MapReduce jobs is very naive right now; we simply
-delegate all of the work to the reduce stage of the predecessor job. We should take advantage of information about the
-expected size of different PCollections to optimize this processing.
-* The Crunch optimizer does not yet merge different groupByKey operations that run over the same input data into a single
+* We would like to have easy support for reading and writing data from/to the Hive metastore via the HCatalog
+APIs.
+* The optimizer does not yet merge different groupByKey operations that run over the same input data into a single
 MapReduce job. Implementing this optimization will provide a major performance benefit for a number of problems.

Modified: incubator/crunch/site/trunk/content/crunch/getting-started.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/getting-started.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/getting-started.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/getting-started.mdtext Fri Dec 14 01:27:33 2012
@@ -16,13 +16,13 @@ Notice:   Licensed to the Apache Softwar
           specific language governing permissions and limitations
           under the License.
 
-Crunch is developed against Apache Hadoop version 1.0.3 and is also tested against
-Apache Hadoop 2.0.0-alpha. Crunch should work with any version of Hadoop
-after 1.0.3 or 2.0.0-alpha, and is also known to work with distributions from
-vendors like Cloudera, Hortonworks, and IBM. Crunch is _not_ compatible with
-versions of Hadoop prior to 1.0.x or 2.0.x, such as Apache Hadoop 0.20.x.
+The Apache Crunch (incubating) library is developed against version 1.0.3 of the Apache Hadoop library,
+and is also tested against version 2.0.0-alpha. The library should work with any version
+after 1.0.3 or 2.0.0-alpha, and is also known to work with distributions from vendors like Cloudera,
+Hortonworks, and IBM. The library is _not_ compatible with versions of Hadoop prior to 1.0.x or 2.0.x,
+such as version 0.20.x.
 
-The easiest way to get started with Crunch is to use its Maven archetype
+The easiest way to get started with the library is to use the Maven archetype
 to generate a simple project. The archetype is available from Maven Central;
 just enter the following command, answer a few questions, and you're ready to
 go:
@@ -30,7 +30,7 @@ go:
 <pre>
 $ <strong>mvn archetype:generate -Dfilter=org.apache.crunch:crunch-archetype</strong>
 [...]
-1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job for Apache Crunch.)
+1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job with the core library.)
 Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): : <strong>1</strong>
 Define value for property 'groupId': : <strong>com.example</strong>
 Define value for property 'artifactId': : <strong>crunch-demo</strong>
@@ -72,7 +72,7 @@ $ <strong>tree</strong>
                     `-- TokenizerTest.java
 </pre>
  
-The `WordCount.java` file contains the main class that defines a Crunch-based
+The `WordCount.java` file contains the main class that defines a pipeline
 application which is referenced from `pom.xml`.
 
 Build the code:
@@ -92,9 +92,9 @@ $ <strong>hadoop jar target/hadoop-job-d
 </pre>
 
 The `<in>` parameter references a text file or a directory containing text
-files, while `<out>` is a directory where Crunch writes the final results to.
+files, while `<out>` is a directory where the pipeline writes the final results.
 
-Crunch also lets you run applications from within an IDE, either as standalone
+The library also supports running applications from within an IDE, either as standalone
 Java applications or from unit tests. All required dependencies are on Maven's
 classpath so you can run the `WordCount` class directly without any additional
 setup.
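
For unit tests, one convenient option (an illustration using the library's in-memory pipeline; the sample strings are placeholders) is to build inputs with `MemPipeline`, which evaluates the same operations in local memory without a cluster:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.impl.mem.MemPipeline;

    // Build an in-memory PCollection for fast, cluster-free tests; the same
    // DoFns used in the real pipeline can be exercised against it.
    PCollection<String> lines =
        MemPipeline.collectionOf("hello world", "goodbye world");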

Modified: incubator/crunch/site/trunk/content/crunch/index.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/index.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/index.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/index.mdtext Fri Dec 14 01:27:33 2012
@@ -1,4 +1,4 @@
-Title:    Apache Crunch
+Title:    Apache Crunch &trade;
 Subtitle: Simple and Efficient MapReduce Pipelines
 Notice:   Licensed to the Apache Software Foundation (ASF) under one
           or more contributor license agreements.  See the NOTICE file
@@ -19,22 +19,25 @@ Notice:   Licensed to the Apache Softwar
 
 ---
 
-> *Apache Crunch (incubating)* is a Java library for writing, testing, and
-> running MapReduce pipelines, based on Google's FlumeJava. Its goal is to make
+> The *Apache Crunch (incubating)* Java library provides a framework for writing, testing, and
+> running MapReduce pipelines, and is based on Google's FlumeJava library. Its goal is to make
 > pipelines that are composed of many user-defined functions simple to write,
 > easy to test, and efficient to run.
 
 ---
 
-Running on top of [Hadoop MapReduce](http://hadoop.apache.org/mapreduce/), Apache
-Crunch provides a simple Java API for tasks like joining and data aggregation
-that are tedious to implement on plain MapReduce. For Scala users, there is also
-Scrunch, an idiomatic Scala API to Crunch.
+Running on top of [Hadoop MapReduce](http://hadoop.apache.org/mapreduce/), the Apache
+Crunch library provides a simple Java API for tasks like joining and data aggregation
+that are tedious to implement on plain MapReduce. The APIs are especially useful when
+processing data that does not fit naturally into the relational model, such as time series,
+serialized object formats like protocol buffers or Avro records, and HBase rows and columns.
+For Scala users, there is the Scrunch API, which is built on top of the Java APIs and
+includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
 
 ## Documentation
 
-  * [Introduction to Apache Crunch](intro.html)
-  * [Introduction to Scrunch](scrunch.html)
+  * [Introduction to the Apache Crunch API](intro.html)
+  * [Introduction to the Scrunch API](scrunch.html)
   * [Current Limitations and Future Work](future-work.html)
 
 ## Disclaimer

Modified: incubator/crunch/site/trunk/content/crunch/intro.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/intro.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/intro.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/intro.mdtext Fri Dec 14 01:27:33 2012
@@ -18,12 +18,15 @@ Notice:   Licensed to the Apache Softwar
 
 ## Build and Installation
 
-To use Crunch you first have to build the source code using Maven and install
+You can download the most recently released libraries from the [Download](download.html) page or from the Maven
+Central Repository.
+
+If you prefer, you can also build the libraries from the source code using Maven and install
 them in your local repository:
 
     mvn clean install
 
-This also runs the integration test suite which will take a while. Afterwards
+This also runs the integration test suite, which will take a while to complete. Afterwards
 you can run the bundled example applications such as WordCount:
 
     hadoop jar crunch-examples/target/crunch-examples-*-job.jar org.apache.crunch.examples.WordCount <inputfile> <outputdir>
@@ -36,9 +39,9 @@ crunch-examples/src/main/resources/acces
 
 ### Data Model and Operators
 
-Crunch is centered around three interfaces that represent distributed datasets: `PCollection<T>`, `PTable<K, V>`, and `PGroupedTable<K, V>`.
+The Java API is centered around three interfaces that represent distributed datasets: `PCollection<T>`, `PTable<K, V>`, and `PGroupedTable<K, V>`.
 
-A `PCollection<T>` represents a distributed, unordered collection of elements of type T. For example, we represent a text file in Crunch as a
+A `PCollection<T>` represents a distributed, unordered collection of elements of type T. For example, we represent a text file as a
 `PCollection<String>` object. PCollection provides a method, `parallelDo`, that applies a function to each element in a PCollection in parallel,
 and returns a new PCollection as its result.
 
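A minimal sketch of `parallelDo` in action (the `lines` collection and the choice of the Writable type family are assumptions for illustration, not part of the site source):

    import org.apache.crunch.MapFn;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.types.writable.Writables;

    // Apply a function to every element of an assumed PCollection<String>
    // named "lines"; the result is a new, derived PCollection.
    PCollection<String> upper = lines.parallelDo(new MapFn<String, String>() {
      @Override
      public String map(String input) {
        return input.toUpperCase();
      }
    }, Writables.strings());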
@@ -57,13 +60,13 @@ joins.
 
 ### Pipeline Building and Execution
 
-Every Crunch pipeline starts with a `Pipeline` object that is used to coordinate building the pipeline and executing the underlying MapReduce
-jobs. For efficiency, Crunch uses lazy evaluation, so it will only construct MapReduce jobs from the different stages of the pipelines when
+Every pipeline starts with a `Pipeline` object that is used to coordinate building the pipeline and executing the underlying MapReduce
+jobs. For efficiency, the library uses lazy evaluation, so it will only construct MapReduce jobs from the different stages of the pipeline when
 the Pipeline object's `run` or `done` methods are called.
 
 ## A Detailed Example
 
-Here is the classic WordCount application using Crunch:
+Here is the classic WordCount application using the APIs:
 
     import org.apache.crunch.DoFn;
     import org.apache.crunch.Emitter;
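A hedged sketch of the lazy-evaluation contract described above (the class name and paths are placeholders):

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;

    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile("in");  // no job runs yet
    // ... parallelDo/groupByKey stages are only planned here, not executed ...
    pipeline.done();  // now the MapReduce jobs are constructed and run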
@@ -104,7 +107,7 @@ that is used to tell Hadoop where to fin
 
 We now need to tell the Pipeline about the inputs it will be consuming. The Pipeline interface
 defines a `readTextFile` method that takes in a String and returns a PCollection of Strings.
-In addition to text files, Crunch supports reading data from SequenceFiles and Avro container files,
+In addition to text files, the library supports reading data from SequenceFiles and Avro container files,
 via the `SequenceFileSource` and `AvroFileSource` classes defined in the org.apache.crunch.io package.
 
 Note that each PCollection is a _reference_ to a source of data; no data is actually loaded into a
@@ -112,7 +115,7 @@ PCollection on the client machine.
 
 ### Step 2: Splitting the lines of text into words
 
-Crunch defines a small set of primitive operations that can be composed in order to build complex data
+The library defines a small set of primitive operations that can be composed in order to build complex data
 pipelines. The first of these primitives is the `parallelDo` function, which applies a function (defined
 by a subclass of `DoFn`) to every record in a PCollection, and returns a new PCollection that contains
 the results.
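A sketch of such a DoFn, along the lines of the Tokenizer this walkthrough goes on to describe (the class name is assumed for illustration):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    // Splits each line on blank spaces and emits every word to the output
    // PCollection; one input may yield any number of output values.
    public class Tokenizer extends DoFn<String, String> {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }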
@@ -128,8 +131,8 @@ may have any number of output values wri
 words, using a blank space as a separator, and emits the words from the split to the output PCollection.
 
 The last argument to parallelDo is an instance of the `PType` interface, which specifies how the data
-in the output PCollection is serialized. While Crunch takes advantage of Java Generics to provide
-compile-time type safety, the generic type information is not available at runtime. Crunch needs to know
+in the output PCollection is serialized. While the API takes advantage of Java Generics to provide
+compile-time type safety, the generic type information is not available at runtime. The job planner needs to know
 how to map the records stored in each PCollection into a Hadoop-supported serialization format in order
 to read and write data to disk. Two serialization implementations are supported in Crunch via the
 `PTypeFamily` interface: a Writable-based system that is defined in the org.apache.crunch.types.writable
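For example, a sketch of how the two type families declare the same logical type (assuming the standard `Writables` and `Avros` factory classes):

    import org.apache.crunch.types.PType;
    import org.apache.crunch.types.avro.Avros;
    import org.apache.crunch.types.writable.Writables;

    // Writable-based serialization for a collection of strings...
    PType<String> writableStrings = Writables.strings();
    // ...and the equivalent Avro-based declaration.
    PType<String> avroStrings = Avros.strings();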
@@ -139,7 +142,7 @@ as well as utility methods for creating 
 
 ### Step 3: Counting the words
 
-Out of Crunch's simple primitive operations, we can build arbitrarily complex chains of operations in order
+Out of the simple primitive operations, we can build arbitrarily complex chains of operations in order
 to perform higher-level operations, like aggregations and joins, that can work on any type of input data.
 Let's look at the implementation of the `Aggregate.count` function:
 
@@ -187,15 +190,15 @@ and the number one by extending the `Map
 PTable instance, with the key being the PType of the PCollection and the value being the Long
 implementation for this PTypeFamily.
 
-The next line features the second of Crunch's four operations, `groupByKey`. The groupByKey
+The next line features the second of the four primary operations, `groupByKey`. The groupByKey
 operation may only be applied to a PTable, and returns an instance of the `PGroupedTable`
 interface, which references the grouping of all of the values in the PTable that have the same key.
-The groupByKey operation is what triggers the reduce phase of a MapReduce within Crunch.
+The groupByKey operation is what triggers the reduce phase of a MapReduce.
 
-The last line in the function returns the output of the third of Crunch's four operations,
+The last line in the function returns the output of the third of the four primary operations,
 `combineValues`. The combineValues operator takes a `CombineFn` as an argument, which is a
 specialized subclass of DoFn that operates on an implementation of Java's Iterable interface. The
-use of combineValues (as opposed to parallelDo) signals to Crunch that the CombineFn may be used to
+use of combineValues (as opposed to parallelDo) signals to the planner that the CombineFn may be used to
 aggregate values for the same key on the map side of a MapReduce job as well as the reduce side.
 
 ### Step 4: Writing the output and running the pipeline
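This step is not shown in the diff; the usual pattern is a short sketch like the following (the `pipeline` and `counts` references and the output path are placeholders):

    // Declare where the result PTable should be written, then execute the
    // plan; nothing runs until done() (or run()) is called.
    pipeline.writeTextFile(counts, "/path/to/output");
    pipeline.done();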

Modified: incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext Fri Dec 14 01:27:33 2012
@@ -21,7 +21,7 @@ Notice:   Licensed to the Apache Softwar
   so we use plain HTML tables.
 -->
 
-There are several mailing lists for Apache Crunch. To subscribe or unsubscribe
+There are several mailing lists for the Apache Crunch project. To subscribe to or unsubscribe
 from a list, send mail to the respective administrative address given below. You
 will then receive a confirmation mail with further instructions.
 

Modified: incubator/crunch/site/trunk/content/crunch/pipelines.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/pipelines.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/pipelines.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/pipelines.mdtext Fri Dec 14 01:27:33 2012
@@ -16,7 +16,7 @@ Notice:   Licensed to the Apache Softwar
           specific language governing permissions and limitations
           under the License.
 
-This section discusses the different steps of creating your own Crunch pipelines in more detail.
+This section discusses the different steps of creating your own pipelines in more detail.
 
 ## Writing a DoFn
 
@@ -25,7 +25,7 @@ don't need them while still keeping them
 
 ### Serialization
 
-First, all DoFn instances are required to be `java.io.Serializable`. This is a key aspect of Crunch's design:
+First, all DoFn instances are required to be `java.io.Serializable`. This is a key aspect of the library's design:
 once a particular DoFn is assigned to the Map or Reduce stage of a MapReduce job, all of the state
 of that DoFn is serialized so that it may be distributed to all of the nodes in the Hadoop cluster that
 will be running that task. There are two important implications of this for developers:
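A sketch of the kind of care this requires (the class and field names are hypothetical):

    import java.util.HashSet;
    import java.util.Set;

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    // Any state a DoFn references travels with it to the cluster, so it must
    // be Serializable (or be marked transient and rebuilt on the task node).
    public class StopWordFilter extends DoFn<String, String> {
      private final Set<String> stopWords;  // HashSet is Serializable: OK

      public StopWordFilter(Set<String> stopWords) {
        this.stopWords = new HashSet<String>(stopWords);
      }

      @Override
      public void process(String word, Emitter<String> emitter) {
        if (!stopWords.contains(word)) {
          emitter.emit(word);
        }
      }
    }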
@@ -53,14 +53,14 @@ are associated with a MapReduce stage, s
 
 ### Performing Cogroups and Joins
 
-In Crunch, cogroups and joins are performed on PTable instances that have the same key type. This section walks through
-the basic flow of a cogroup operation, explaining how this higher-level operation is composed of Crunch's four primitives.
-In general, these common operations are provided as part of the core Crunch library or in extensions, you do not need
+Cogroups and joins are performed on PTable instances that have the same key type. This section walks through
+the basic flow of a cogroup operation, explaining how this higher-level operation is composed of the four primitive operations.
+In general, these common operations are provided as part of the core library or in extensions; you do not need
 to write them yourself. But it can be useful to understand how they work under the covers.
 
 Assume we have a `PTable<K, U>` named "a" and a different `PTable<K, V>` named "b" that we would like to combine into a
 single `PTable<K, Pair<Collection<U>, Collection<V>>>`. First, we need to apply parallelDo operations to a and b that
-convert them into the same Crunch type, `PTable<K, Pair<U, V>>`:
+convert them into the same PType, `PTable<K, Pair<U, V>>`:
 
     // Perform the "tagging" operation as a parallelDo on PTable a
     PTable<K, Pair<U, V>> aPrime = a.parallelDo("taga", new MapFn<Pair<K, U>, Pair<K, Pair<U, V>>>() {
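        // [The diff truncates the listing here. A plausible continuation,
        // reconstructed from the description above rather than copied from
        // the site source, with "ptf" an assumed PTypeFamily for the tables:]
        @Override
        public Pair<K, Pair<U, V>> map(Pair<K, U> input) {
          // Tag each value from "a" into the first slot of the value pair
          return Pair.of(input.first(), Pair.of(input.second(), (V) null));
        }
      }, ptf.tableOf(a.getKeyType(), ptf.pairs(a.getValueType(), b.getValueType())));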

Modified: incubator/crunch/site/trunk/content/crunch/scrunch.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/scrunch.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/scrunch.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/scrunch.mdtext Fri Dec 14 01:27:33 2012
@@ -1,5 +1,5 @@
 Title:    Scrunch
-Subtitle: A Scala Wrapper for Apache Crunch
+Subtitle: A Scala Wrapper for the Apache Crunch (incubating) Java API
 Notice:   Licensed to the Apache Software Foundation (ASF) under one
           or more contributor license agreements.  See the NOTICE file
           distributed with this work for additional information
@@ -19,16 +19,16 @@ Notice:   Licensed to the Apache Softwar
 
 ## Introduction
 
-Scrunch is an experimental Scala wrapper for Crunch, based on the same ideas as the
-[Cascade](http://days2011.scala-lang.org/node/138/282) project at Google, which created
-a Scala wrapper for FlumeJava.
+Scrunch is an experimental Scala wrapper for the Apache Crunch (incubating) Java API, based on the same ideas as the
+[Cascade](http://days2011.scala-lang.org/node/138/282) project at Google, which created a Scala wrapper for
+FlumeJava.
 
 ## Why Scala?
 
-In many ways, Scala is the perfect language for writing Crunch pipelines. Scala supports
+In many ways, Scala is the perfect language for writing MapReduce pipelines. Scala supports
 a mixture of functional and object-oriented programming styles and has powerful type-inference
 capabilities, allowing us to create complex pipelines using very few keystrokes. Here is
-the Scrunch analogue of the classic WordCount problem:
+an implementation of the classic WordCount problem using the Scrunch API:
 
 	import org.apache.crunch.io.{From => from}
 	import org.apache.crunch.scrunch._
@@ -46,7 +46,7 @@ the Scrunch analogue of the classic Word
 	}
 
 The Scala compiler can infer the return type of the flatMap function as an Array[String], and
-the Scrunch wrapper uses the type inference mechanism to figure out how to serialize the
+the Scrunch wrapper code uses the type inference mechanism to figure out how to serialize the
 data between the Map and Reduce stages. Here's a slightly more complex example, in which we
 get the word counts for two different files and compute the deltas of how often different
 words occur, and then only return the words where the first file had more occurrences than
@@ -60,14 +60,10 @@ the second:
 	  }
 	}
 
-Note that all of the functions are using Scala Tuples, not Crunch Tuples. Under the covers,
-Scrunch uses Scala's implicit type conversion mechanism to transparently convert data from the
-Crunch format to the Scala format and back again.
-
 ## Materializing Job Outputs
 
-Scrunch also incorporates Crunch's materialize functionality, which allows us to easily read
-the output of a Crunch pipeline into the client:
+The Scrunch API also incorporates the Java library's `materialize` functionality, which allows us to easily read
+the output of a MapReduce pipeline into the client:
 
 	class WordCountExample {
 	  def hasHamlet = wordGt("shakespeare.txt", "maugham.txt").materialize.exists(_ == "hamlet")
@@ -75,13 +71,8 @@ the output of a Crunch pipeline into the
 
 ## Notes and Thanks
 
-Scrunch is alpha-quality code, written by someone who was learning Scala on the fly. There will be bugs,
-rough edges, and non-idiomatic Scala usage all over the place. This will improve with time, and we welcome
-contributions from Scala experts who are interested in helping us make Scrunch into a first-class project.
-
 Scrunch emerged out of conversations with [Dmitriy Ryaboy](http://twitter.com/#!/squarecog),
 [Oscar Boykin](http://twitter.com/#!/posco), and [Avi Bryant](http://twitter.com/#!/avibryant) from Twitter.
 Many thanks to them for their feedback, guidance, and encouragement. We are also grateful to
 [Matei Zaharia](http://twitter.com/#!/matei_zaharia), whose [Spark Project](http://www.spark-project.org/)
-inspired much of our implementation and was kind enough to loan us the ClosureCleaner implementation
-Spark developed for use in Scrunch.
+inspired much of the original Scrunch API implementation.

Modified: incubator/crunch/site/trunk/content/crunch/source-repository.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/source-repository.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/source-repository.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/source-repository.mdtext Fri Dec 14 01:27:33 2012
@@ -16,7 +16,7 @@ Notice:   Licensed to the Apache Softwar
           specific language governing permissions and limitations
           under the License.
 
-Apache Crunch uses [Git](http://git-scm.com/) for version control. Run the
+The Apache Crunch (incubating) Project uses [Git](http://git-scm.com/) for version control. Run the
 following command to clone the repository:
 
     git clone https://git-wip-us.apache.org/repos/asf/incubator-crunch.git