You are viewing a plain text version of this content. The canonical link for it is here.

Posted to general@incubator.apache.org by Josh Wills <jw...@cloudera.com> on 2012/05/23 20:45:49 UTC

[VOTE] Accept Crunch into the Apache Incubator

I would like to call a vote for accepting "Apache Crunch" for
incubation in the Apache Incubator. The full proposal is available
below. We ask the Incubator PMC to sponsor it, with phunt as
Champion, and phunt, tomwhite, and acmurthy volunteering to be
Mentors.

Please cast your vote:

[ ] +1, bring Crunch into Incubator
[ ] +0, I don't care either way,
[ ] -1, do not bring Crunch into Incubator, because...

This vote will be open for 72 hours and only votes from the Incubator
PMC are binding.

http://wiki.apache.org/incubator/CrunchProposal

Proposal text from the wiki:
----------------------------------------------------------------------------------------------------------------------
= Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =

== Abstract ==

Crunch is a Java library for writing, testing, and running pipelines
of !MapReduce jobs on Apache Hadoop.

== Proposal ==

Crunch is a Java library for writing, testing, and running pipelines
of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
high-level API for writing and testing complex !MapReduce jobs that
require multiple processing stages. It has a simple, flexible, and
extensible data model that makes it ideal for processing data that
does not naturally fit into a relational structure, such as time
series and serialized object formats like JSON and Avro. It supports
running pipelines either as a series of !MapReduce jobs on an Apache
Hadoop cluster or in memory on a single machine for fast testing and
debugging.

== Background ==

Crunch was initially developed by Cloudera to simplify the process of
creating sequences of dependent !MapReduce jobs, especially jobs that
processed non-relational data like time series. Its design was based
on a paper Google published about a Java library they developed called
!FlumeJava that was created in order to solve a similar class of
problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
2.0 licensed project in October 2011. During this time Crunch has been
formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
(February 2012), with an incremental update to version 0.2.1 (March
2012) . These releases are also distributed by Cloudera as source and
binaries from Cloudera's Maven repository.

== Rationale ==

Most of the interesting analytical and data processing tasks that are
run on an Apache Hadoop cluster require a series of !MapReduce jobs to
be executed in sequence. Developers who are creating these pipelines
today need to manually assign the sequence of tasks to perform in a
dependent chain of !MapReduce jobs, even though there are a number of
well-known patterns for fusing dependent computations together into a
single !MapReduce stage and for performing common types of joins and
aggregations. This results in !MapReduce pipelines that are more
difficult to test, maintain, and extend to support new functionality.

Furthermore, the type of data that is being stored and processed using
Apache Hadoop is evolving. Although Hadoop was originally used for
storing large volumes of structured text in the form of webpages and
log files, it is now common for Hadoop to store complex, structured
data formats such as JSON, Apache Avro, and Apache Thrift. These
formats allow developers to work with serialized objects in
programming languages like Java, C++, and Python, and allow for new
types of analysis to be performed on complex data types. Hadoop has
also been adopted by the scientific research community, who are using
Hadoop to process time series data, structured binary files in the
HDF5 format, and large medical and satellite images.

Crunch addresses these challenges by providing a lightweight and
extensible Java API for defining the stages of a data processing
pipeline, which can then be run on an Apache Hadoop cluster as a
sequence of dependent !MapReduce jobs, or in-memory on a single
machine to facilitate fast testing and debugging. Crunch relies on a
small set of primitive abstractions that represent immutable,
distributed collections of objects. Developers define functions that
are applied to those objects in order to generate new immutable,
distributed collections of objects. Crunch also provides a library of
common !MapReduce patterns for performing efficient joins and
aggregation operations over these distributed collections that
developers may integrate into their own pipelines. Crunch also
provides native support for processing structured binary data formats
like JSON, Apache Avro, and Apache Thrift, and is designed to be
extensible to support working with any kind of data format that Java
supports in its native form.

== Initial Goals ==

Crunch is currently in its first major release with a considerable
number of enhancement requests, tasks, and issues recorded towards its
future development. The initial goal of this project will be to
continue to build community in the spirit of the "Apache Way", and to
address the highly requested features and bug-fixes towards the next
dot release.

Some goals include:
* To stand up a sustaining Apache-based community around the Crunch codebase.
* Improved documentation of Java libraries and best practices.
* Support the ability to "fuse" logically independent pipeline stages
that aggregate the same data in different ways into a single
!MapReduce job.
* Performance, usability, and robustness improvements.
* Improving diagnostic reporting and debugging for individual !MapReduce jobs.
* Providing a centralized place for contributed extensions and
domain-specific applications.

= Current Status =

== Meritocracy ==

Crunch was initially developed by Josh Wills in September 2011 at
Cloudera. Developers external to Cloudera provided feedback, suggested
features and fixes and implemented extensions of Crunch. Cloudera's
engineering team has since maintained the project with Josh Wills, Tom
White, and Brock Noland dedicated towards its improvement.
Contributors to Crunch include developers from multiple organizations,
including businesses and universities.

== Community ==

Crunch is currently used by a number of organizations all over the
world. Crunch has an active and growing user and developer community
with active participation in
[[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]]
and [[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
mailing lists.

Since open sourcing the project, there have been eight individuals
from five organizations who have contributed code.

== Core Developers ==

The core developers for Crunch are:
* Brock Noland: Wrote many of the test cases, user documentation, and
contributed several bug fixes.
* Josh Wills: Josh wrote much of the original Crunch code.
* Gabriel Reid: Gabriel significantly improved Crunch's handling of
Avro data and has contributed several bug fixes for the core planner.
* Tom White: Tom added several libraries for common !MapReduce
pipeline operations, including the sort library and a library of set
operations.
* Christian Tzolov: Christian has contributed several bug fixes for
the Avro serialization module and the unit testing framework.
* Robert Chu: Robert did the left/right/outer join implementations
for Crunch and fixed several bugs in the runtime configuration logic.

Several of the core developers of Crunch have contributed towards
Hadoop or related Apache projects and are familiar with Apache
principles and philosophy for community driven software development.

== Alignment ==

Crunch complements several current Apache projects. It complements
Hadoop !MapReduce by providing a higher-level API for developing
complex data processing pipelines that require a sequence of
!MapReduce jobs to perform. Crunch also supports Apache HBase in order
to simplify the process of writing !MapReduce jobs that execute over
HBase tables. Crunch makes extensive use of the Apache Avro data
format as an internal data representation process that makes
!MapReduce jobs execute quickly and efficiently.

= Known Risks =

== Orphaned Products ==

Crunch is already deployed in production at multiple companies and
they are actively participating in creating new features. Crunch is
getting traction with developers and thus the risks of it being
orphaned are minimal.

== Inexperience with Open Source ==

All code developed for Crunch has been open sourced by Cloudera under
Apache 2.0 license. All committers to Crunch are intimately familiar
with the Apache model for open-source development and are experienced
with working with new contributors.

== Homogeneous Developers ==

The initial set of committers is from a reduced set of organizations.
However, we expect that once approved for incubation, the project will
attract new contributors from diverse organizations and will thus grow
organically. The submission of patches from developers from several
different organizations is a strong indication that Crunch will be
widely adopted.

== Reliance on Salaried Developers ==

It is expected that Crunch will be developed on salaried and volunteer
time, although all of the initial developers will work on it mainly on
salaried time.

== Relationships with Other Apache Products ==

Crunch depends upon other Apache Projects: Apache Hadoop, Apache
HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
Commons components. Its build depends upon Apache Maven.

Crunch's functionality has some indirect or direct overlap with the
functionality of Apache Pig and Apache Hive but has several
significant differences in terms of their user community and the types
of data they are designed to work with. Both Hive and Pig are
high-level languages that are designed to allow non-programmers to
quickly create and run !MapReduce jobs. Crunch is a Java library whose
primary community is Java developers who are creating scalable data
pipelines and !MapReduce-based applications. Additionally, Hive and
Pig both employ a relational, tuple-oriented data model on top of
HDFS, which introduces overhead and limits expressive power for
developers who are working with serialized objects and non-relational
data types. Crunch uses a lower-level data model that gives developers
the freedom to work with data in a format that is optimized for the
problem they are trying to solve.

== An Excessive Fascination with the Apache Brand ==

We would like Crunch to become an Apache project to further foster a
healthy community of contributors and consumers around the project.
Since Crunch directly interacts with many Apache Hadoop-related
projects and solves an important problem of many Hadoop users,
residing in the Apache Software Foundation will increase interaction
with the larger community.

= Documentation =

* Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
* Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
* Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/

= Initial Source =

* https://github.com/cloudera/crunch/tree/

== Source and Intellectual Property Submission Plan ==

* The initial source is already licensed under the Apache License,
Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt

== External Dependencies ==

The required external dependencies are all Apache License or
compatible licenses. Following components with non-Apache licenses are
enumerated:

* com.google.protobuf : New BSD
* org.hamcrest: New BSD
* org.slf4j: MIT-like License

Non-Apache build tools that are used by Crunch are as follows:

* Cobertura: GNU GPLv2

Note that Cobertura is optional and is only used for calculating unit
test coverage.

== Cryptography ==

Crunch uses standard APIs and tools for SSH and SSL communication
where necessary.

= Required Resources =

== Mailing lists ==

* crunch-private (with moderated subscriptions)
* crunch-dev
* crunch-commits
* crunch-user

== Github Repositories ==

http://github.com/apache/crunch
git://git.apache.org/crunch.git

== Issue Tracking ==

JIRA Crunch (CRUNCH)

== Other Resources ==

The existing code already has unit and integration tests so we would
like a Jenkins instance to run them whenever a new patch is submitted.
This can be added after project creation.

= Initial Committers =

* Brock Noland (brock at cloudera dot com)
* Josh Wills (jwills at cloudera dot com)
* Gabriel Reid (gabriel dot reid at gmail dot com)
* Tom White (tom at cloudera dot com)
* Christian Tzolov (christian dot tzolov at gmail dot com)
* Robert Chu (robert at wibidata dot com)
* Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)

= Affiliations =

* Brock Noland, Cloudera
* Josh Wills, Cloudera
* Gabriel Reid, !TomTom
* Tom White, Cloudera
* Christian Tzolov, !TomTom
* Robert Chu, !WibiData
* Vinod Kumar Vavilapalli, Hortonworks

= Sponsors =

== Champion ==

* Patrick Hunt

== Nominated Mentors ==

* Tom White
* Patrick Hunt
* Arun Murthy

== Sponsoring Entity ==

* Apache Incubator PMC

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Doug Cutting <cu...@apache.org>.

+1

Doug

On 05/23/2012 11:45 AM, Josh Wills wrote:
> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages.  It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) .  These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
>   * To stand up a sustaining Apache-based community around the Crunch codebase.
>   * Improved documentation of Java libraries and best practices.
>   * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
>   * Performance, usability, and robustness improvements.
>   * Improving diagnostic reporting and debugging for individual !MapReduce jobs.
>   * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]]
> and [[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
>   * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
>   * Josh Wills: Josh wrote much of the original Crunch code.
>   * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
>   * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
>   * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
>   * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license.  All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with.  Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
>   * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
>   * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
>   * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
>   * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
>   * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
>
>   * com.google.protobuf : New BSD
>   * org.hamcrest: New BSD
>   * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
>   * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required  Resources =
>
> == Mailing lists ==
>
>   * crunch-private (with moderated subscriptions)
>   * crunch-dev
>   * crunch-commits
>   * crunch-user
>
> == Github Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
>   * Brock Noland (brock at cloudera dot com)
>   * Josh Wills (jwills at cloudera dot com)
>   * Gabriel Reid (gabriel dot reid at gmail dot com)
>   * Tom White (tom at cloudera dot com)
>   * Christian Tzolov (christian dot tzolov at gmail dot com)
>   * Robert Chu (robert at wibidata dot com)
>   * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
>   * Brock Noland, Cloudera
>   * Josh Wills, Cloudera
>   * Gabriel Reid, !TomTom
>   * Tom White, Cloudera
>   * Christian Tzolov, !TomTom
>   * Robert Chu, !WibiData
>   * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
>   * Patrick Hunt
>
> == Nominated Mentors ==
>
>   * Tom White
>   * Patrick Hunt
>   * Arun Murthy
>
> == Sponsoring Entity ==
>
>   * Apache Incubator PMC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Benson Margulies <bi...@gmail.com>.

+1 (binding) ...

And a friendly reminder to the ppmc via their mentors -- in response
to that email about limiting the initial committers to a tight group.
They will soon be learning that the big challenge of incubation is not
writing a lot of code, its recruiting new faces. They'll want to
switch from 'just us chickens' to putting out the welcome mat as soon
as possible.

On Thu, May 24, 2012 at 2:29 AM, Bertrand Delacretaz
<bd...@apache.org> wrote:
>> [X ] +1, bring Crunch into Incubator
>
> -Bertrand
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Bertrand Delacretaz <bd...@apache.org>.

> [X ] +1, bring Crunch into Incubator

-Bertrand

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Luciano Resende <lu...@gmail.com>.

On Wed, May 23, 2012 at 11:45 AM, Josh Wills <jw...@cloudera.com> wrote:
> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>


+1, bring Crunch into Incubator


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Niall Pemberton <ni...@gmail.com>.

+1

Niall

On Wed, May 23, 2012 at 7:45 PM, Josh Wills <jw...@cloudera.com> wrote:
> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages.  It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) .  These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
>  * To stand up a sustaining Apache-based community around the Crunch codebase.
>  * Improved documentation of Java libraries and best practices.
>  * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
>  * Performance, usability, and robustness improvements.
>  * Improving diagnostic reporting and debugging for individual !MapReduce jobs.
>  * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]]
> and [[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
>  * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
>  * Josh Wills: Josh wrote much of the original Crunch code.
>  * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
>  * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
>  * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
>  * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license.  All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with.  Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
>  * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
>  * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
>  * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
>  * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
>  * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
>
>  * com.google.protobuf : New BSD
>  * org.hamcrest: New BSD
>  * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
>  * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required  Resources =
>
> == Mailing lists ==
>
>  * crunch-private (with moderated subscriptions)
>  * crunch-dev
>  * crunch-commits
>  * crunch-user
>
> == Github Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
>  * Brock Noland (brock at cloudera dot com)
>  * Josh Wills (jwills at cloudera dot com)
>  * Gabriel Reid (gabriel dot reid at gmail dot com)
>  * Tom White (tom at cloudera dot com)
>  * Christian Tzolov (christian dot tzolov at gmail dot com)
>  * Robert Chu (robert at wibidata dot com)
>  * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
>  * Brock Noland, Cloudera
>  * Josh Wills, Cloudera
>  * Gabriel Reid, !TomTom
>  * Tom White, Cloudera
>  * Christian Tzolov, !TomTom
>  * Robert Chu, !WibiData
>  * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
>  * Patrick Hunt
>
> == Nominated Mentors ==
>
>  * Tom White
>  * Patrick Hunt
>  * Arun Murthy
>
> == Sponsoring Entity ==
>
>  * Apache Incubator PMC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Wed, May 23, 2012 at 8:45 PM, Josh Wills <jw...@cloudera.com> wrote:
> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.

Sounds good to me!

  [x] +1, bring Crunch into Incubator

See the discuss thread for some comments on the details.

BR,

Jukka Zitting

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Brock Noland <br...@cloudera.com>.

[X] +1, bring Crunch into Incubator (non-binding)

On Wed, May 23, 2012 at 11:45 AM, Josh Wills <jw...@cloudera.com> wrote:

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
>
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages.  It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) .  These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
>  * To stand up a sustaining Apache-based community around the Crunch
> codebase.
>  * Improved documentation of Java libraries and best practices.
>  * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
>  * Performance, usability, and robustness improvements.
>  * Improving diagnostic reporting and debugging for individual !MapReduce
> jobs.
>  * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user
> ]]
> and [[
> https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer
> ]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
>  * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
>  * Josh Wills: Josh wrote much of the original Crunch code.
>  * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
>  * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
>  * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
>  * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license.  All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with.  Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
>  * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
>  * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
>  * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
>  * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
>  * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
>
>  * com.google.protobuf : New BSD
>  * org.hamcrest: New BSD
>  * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
>  * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required  Resources =
>
> == Mailing lists ==
>
>  * crunch-private (with moderated subscriptions)
>  * crunch-dev
>  * crunch-commits
>  * crunch-user
>
> == Github Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
>  * Brock Noland (brock at cloudera dot com)
>  * Josh Wills (jwills at cloudera dot com)
>  * Gabriel Reid (gabriel dot reid at gmail dot com)
>  * Tom White (tom at cloudera dot com)
>  * Christian Tzolov (christian dot tzolov at gmail dot com)
>  * Robert Chu (robert at wibidata dot com)
>  * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
>  * Brock Noland, Cloudera
>  * Josh Wills, Cloudera
>  * Gabriel Reid, !TomTom
>  * Tom White, Cloudera
>  * Christian Tzolov, !TomTom
>  * Robert Chu, !WibiData
>  * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
>  * Patrick Hunt
>
> == Nominated Mentors ==
>
>  * Tom White
>  * Patrick Hunt
>  * Arun Murthy
>
> == Sponsoring Entity ==
>
>  * Apache Incubator PMC
>



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by "Edward J. Yoon" <ed...@apache.org>.

+1

On Thu, May 24, 2012 at 7:16 AM, Patrick Hunt <ph...@apache.org> wrote:
> +1
>
> Patrick
>
> On Wed, May 23, 2012 at 2:49 PM, Alan D. Cabrera <li...@toolazydogs.com> wrote:
>> +1
>>
>>
>> Regards,
>> Alan
>>
>>
>> On May 23, 2012, at 11:45 AM, Josh Wills wrote:
>>
>>> I would like to call a vote for accepting "Apache Crunch" for
>>> incubation in the Apache Incubator. The full proposal is available
>>> below.  We ask the Incubator PMC to sponsor it, with phunt as
>>> Champion, and phunt, tomwhite, and acmurthy volunteering to be
>>> Mentors.
>>>
>>> Please cast your vote:
>>>
>>> [ ] +1, bring Crunch into Incubator
>>> [ ] +0, I don't care either way,
>>> [ ] -1, do not bring Crunch into Incubator, because...
>>>
>>> This vote will be open for 72 hours and only votes from the Incubator
>>> PMC are binding.
>>>
>>> http://wiki.apache.org/incubator/CrunchProposal
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Patrick Hunt <ph...@apache.org>.

+1

Patrick

On Wed, May 23, 2012 at 2:49 PM, Alan D. Cabrera <li...@toolazydogs.com> wrote:
> +1
>
>
> Regards,
> Alan
>
>
> On May 23, 2012, at 11:45 AM, Josh Wills wrote:
>
>> I would like to call a vote for accepting "Apache Crunch" for
>> incubation in the Apache Incubator. The full proposal is available
>> below.  We ask the Incubator PMC to sponsor it, with phunt as
>> Champion, and phunt, tomwhite, and acmurthy volunteering to be
>> Mentors.
>>
>> Please cast your vote:
>>
>> [ ] +1, bring Crunch into Incubator
>> [ ] +0, I don't care either way,
>> [ ] -1, do not bring Crunch into Incubator, because...
>>
>> This vote will be open for 72 hours and only votes from the Incubator
>> PMC are binding.
>>
>> http://wiki.apache.org/incubator/CrunchProposal
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by "Alan D. Cabrera" <li...@toolazydogs.com>.

+1


Regards,
Alan

 
On May 23, 2012, at 11:45 AM, Josh Wills wrote:

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
> 
> Please cast your vote:
> 
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
> 
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
> 
> http://wiki.apache.org/incubator/CrunchProposal


---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.

+1 (non-binding)

+Vinod

On May 23, 2012, at 11:45 AM, Josh Wills wrote:

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
> 
> Please cast your vote:
> 
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
> 
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
> 
> http://wiki.apache.org/incubator/CrunchProposal
> 
> Proposal text from the wiki:
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
> 
> == Abstract ==
> 
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
> 
> == Proposal ==
> 
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages.  It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
> 
> == Background ==
> 
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) .  These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
> 
> == Rationale ==
> 
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
> 
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
> 
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
> 
> == Initial Goals ==
> 
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
> 
> Some goals include:
> * To stand up a sustaining Apache-based community around the Crunch codebase.
> * Improved documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
> * Performance, usability, and robustness improvements.
> * Improving diagnostic reporting and debugging for individual !MapReduce jobs.
> * Providing a centralized place for contributed extensions and
> domain-specific applications.
> 
> = Current Status =
> 
> == Meritocracy ==
> 
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
> 
> == Community ==
> 
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]]
> and [[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
> mailing lists.
> 
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
> 
> == Core Developers ==
> 
> The core developers for Crunch are:
> * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
> 
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
> 
> == Alignment ==
> 
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
> 
> = Known Risks =
> 
> == Orphaned Products ==
> 
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
> 
> == Inexperience with Open Source ==
> 
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license.  All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
> 
> == Homogeneous Developers ==
> 
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
> 
> == Reliance on Salaried Developers ==
> 
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
> 
> == Relationships with Other Apache Products ==
> 
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
> 
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with.  Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
> 
> == An Excessive Fascination with the Apache Brand ==
> 
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
> 
> = Documentation =
> 
> * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
> * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
> * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
> 
> = Initial Source =
> 
> * https://github.com/cloudera/crunch/tree/
> 
> == Source and Intellectual Property Submission Plan ==
> 
> * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
> 
> == External Dependencies ==
> 
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
> 
> * com.google.protobuf : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
> 
> Non-Apache build tools that are used by Crunch are as follows:
> 
> * Cobertura: GNU GPLv2
> 
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
> 
> == Cryptography ==
> 
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
> 
> = Required  Resources =
> 
> == Mailing lists ==
> 
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
> 
> == Github Repositories ==
> 
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
> 
> == Issue Tracking ==
> 
> JIRA Crunch (CRUNCH)
> 
> == Other Resources ==
> 
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
> 
> = Initial Committers =
> 
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
> 
> = Affiliations =
> 
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
> 
> = Sponsors =
> 
> == Champion ==
> 
> * Patrick Hunt
> 
> == Nominated Mentors ==
> 
> * Tom White
> * Patrick Hunt
> * Arun Murthy
> 
> == Sponsoring Entity ==
> 
> * Apache Incubator PMC
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Ahmed Radwan <ah...@cloudera.com>.

[ ] +1, bring Crunch into Incubator

On Wed, May 23, 2012 at 11:45 AM, Josh Wills <jw...@cloudera.com> wrote:
> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages.  It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) .  These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
>  * To stand up a sustaining Apache-based community around the Crunch codebase.
>  * Improved documentation of Java libraries and best practices.
>  * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
>  * Performance, usability, and robustness improvements.
>  * Improving diagnostic reporting and debugging for individual !MapReduce jobs.
>  * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]]
> and [[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
>  * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
>  * Josh Wills: Josh wrote much of the original Crunch code.
>  * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
>  * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
>  * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
>  * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license.  All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with.  Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
>  * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
>  * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
>  * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
>  * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
>  * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
>
>  * com.google.protobuf : New BSD
>  * org.hamcrest: New BSD
>  * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
>  * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required  Resources =
>
> == Mailing lists ==
>
>  * crunch-private (with moderated subscriptions)
>  * crunch-dev
>  * crunch-commits
>  * crunch-user
>
> == Github Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
>  * Brock Noland (brock at cloudera dot com)
>  * Josh Wills (jwills at cloudera dot com)
>  * Gabriel Reid (gabriel dot reid at gmail dot com)
>  * Tom White (tom at cloudera dot com)
>  * Christian Tzolov (christian dot tzolov at gmail dot com)
>  * Robert Chu (robert at wibidata dot com)
>  * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
>  * Brock Noland, Cloudera
>  * Josh Wills, Cloudera
>  * Gabriel Reid, !TomTom
>  * Tom White, Cloudera
>  * Christian Tzolov, !TomTom
>  * Robert Chu, !WibiData
>  * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
>  * Patrick Hunt
>
> == Nominated Mentors ==
>
>  * Tom White
>  * Patrick Hunt
>  * Arun Murthy
>
> == Sponsoring Entity ==
>
>  * Apache Incubator PMC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Arvind Prabhakar <ar...@apache.org>.

[X] +1, bring Crunch into Incubator (non-binding)

Thanks,
Arvind Prabhakar

On Wed, May 23, 2012 at 11:45 AM, Josh Wills <jw...@cloudera.com> wrote:

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
>
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages.  It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) .  These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
>  * To stand up a sustaining Apache-based community around the Crunch
> codebase.
>  * Improved documentation of Java libraries and best practices.
>  * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
>  * Performance, usability, and robustness improvements.
>  * Improving diagnostic reporting and debugging for individual !MapReduce
> jobs.
>  * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user
> ]]
> and [[
> https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer
> ]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
>  * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
>  * Josh Wills: Josh wrote much of the original Crunch code.
>  * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
>  * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
>  * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
>  * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license.  All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with.  Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
>  * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
>  * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
>  * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
>  * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
>  * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
>
>  * com.google.protobuf : New BSD
>  * org.hamcrest: New BSD
>  * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
>  * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required  Resources =
>
> == Mailing lists ==
>
>  * crunch-private (with moderated subscriptions)
>  * crunch-dev
>  * crunch-commits
>  * crunch-user
>
> == Github Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
>  * Brock Noland (brock at cloudera dot com)
>  * Josh Wills (jwills at cloudera dot com)
>  * Gabriel Reid (gabriel dot reid at gmail dot com)
>  * Tom White (tom at cloudera dot com)
>  * Christian Tzolov (christian dot tzolov at gmail dot com)
>  * Robert Chu (robert at wibidata dot com)
>  * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
>  * Brock Noland, Cloudera
>  * Josh Wills, Cloudera
>  * Gabriel Reid, !TomTom
>  * Tom White, Cloudera
>  * Christian Tzolov, !TomTom
>  * Robert Chu, !WibiData
>  * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
>  * Patrick Hunt
>
> == Nominated Mentors ==
>
>  * Tom White
>  * Patrick Hunt
>  * Arun Murthy
>
> == Sponsoring Entity ==
>
>  * Apache Incubator PMC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Tommaso Teofili <to...@gmail.com>.

+1

Tommaso

2012/5/23 Josh Wills <jw...@cloudera.com>

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
>
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages.  It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) .  These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
>  * To stand up a sustaining Apache-based community around the Crunch
> codebase.
>  * Improved documentation of Java libraries and best practices.
>  * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
>  * Performance, usability, and robustness improvements.
>  * Improving diagnostic reporting and debugging for individual !MapReduce
> jobs.
>  * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user
> ]]
> and [[
> https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer
> ]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
>  * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
>  * Josh Wills: Josh wrote much of the original Crunch code.
>  * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
>  * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
>  * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
>  * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license.  All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with.  Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
>  * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
>  * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
>  * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
>  * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
>  * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
>
>  * com.google.protobuf : New BSD
>  * org.hamcrest: New BSD
>  * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
>  * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required  Resources =
>
> == Mailing lists ==
>
>  * crunch-private (with moderated subscriptions)
>  * crunch-dev
>  * crunch-commits
>  * crunch-user
>
> == Github Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
>  * Brock Noland (brock at cloudera dot com)
>  * Josh Wills (jwills at cloudera dot com)
>  * Gabriel Reid (gabriel dot reid at gmail dot com)
>  * Tom White (tom at cloudera dot com)
>  * Christian Tzolov (christian dot tzolov at gmail dot com)
>  * Robert Chu (robert at wibidata dot com)
>  * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
>  * Brock Noland, Cloudera
>  * Josh Wills, Cloudera
>  * Gabriel Reid, !TomTom
>  * Tom White, Cloudera
>  * Christian Tzolov, !TomTom
>  * Robert Chu, !WibiData
>  * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
>  * Patrick Hunt
>
> == Nominated Mentors ==
>
>  * Tom White
>  * Patrick Hunt
>  * Arun Murthy
>
> == Sponsoring Entity ==
>
>  * Apache Incubator PMC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>

[RESULTS] [VOTE] Accept Crunch into the Apache Incubator

Posted by Josh Wills <jw...@cloudera.com>.

The vote passes with twelve +1s from Incubator PMC members:

Arun Murthy
Alan Cabrera
Benson Margulies
Bertrand Delacretaz
Doug Cutting
Luciano Resende
Jukka Zitting
Matthew Franklin
Niall Pemberton
Patrick Hunt
Tom White
Tommaso Teofili

seven non-binding +1s, and one 0- from Incubator PMC member Steve
Loughran. Thank you everyone for voicing both your support and your
concerns. The discussion around the proposal was enlightening and
useful, and will hopefully benefit other projects that are considering
the Apache incubator.

Josh

On Sat, May 26, 2012 at 3:31 PM, Jakob Homan <jg...@gmail.com> wrote:
> +1.
>
> Josh and his cohorts have gotten a quick, unexpected introduction to
> the value of building a diverse community.  But the code's solid and
> the lessons learned.  Looking forward to this one.
>
> On Sat, May 26, 2012 at 7:16 AM, Steve Loughran
> <st...@gmail.com> wrote:
>> On 23 May 2012 19:45, Josh Wills <jw...@cloudera.com> wrote:
>>
>>> I would like to call a vote for accepting "Apache Crunch" for
>>> incubation in the Apache Incubator. The full proposal is available
>>> below.  We ask the Incubator PMC to sponsor it, with phunt as
>>> Champion, and phunt, tomwhite, and acmurthy volunteering to be
>>> Mentors.
>>>
>>> Please cast your vote:
>>>
>>> [ ] +1, bring Crunch into Incubator
>>> [ ] +0, I don't care either way,
>>> [ ] -1, do not bring Crunch into Incubator, because...
>>>
>>>
>> -0, binding.
>>
>> I think it's good to broaden the Hadoop stack beyond Java, which is what
>> Crunch proposes. I know it has a limited organisational breadth -but that
>> is one thing incubation is designed to address. I just fear that the
>> exclusion of Jakob putting things off to a bad start.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Jakob Homan <jg...@gmail.com>.

+1.

Josh and his cohorts have gotten a quick, unexpected introduction to
the value of building a diverse community.  But the code's solid and
the lessons learned.  Looking forward to this one.

On Sat, May 26, 2012 at 7:16 AM, Steve Loughran
<st...@gmail.com> wrote:
> On 23 May 2012 19:45, Josh Wills <jw...@cloudera.com> wrote:
>
>> I would like to call a vote for accepting "Apache Crunch" for
>> incubation in the Apache Incubator. The full proposal is available
>> below.  We ask the Incubator PMC to sponsor it, with phunt as
>> Champion, and phunt, tomwhite, and acmurthy volunteering to be
>> Mentors.
>>
>> Please cast your vote:
>>
>> [ ] +1, bring Crunch into Incubator
>> [ ] +0, I don't care either way,
>> [ ] -1, do not bring Crunch into Incubator, because...
>>
>>
> -0, binding.
>
> I think it's good to broaden the Hadoop stack beyond Java, which is what
> Crunch proposes. I know it has a limited organisational breadth -but that
> is one thing incubation is designed to address. I just fear that the
> exclusion of Jakob putting things off to a bad start.

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Steve Loughran <st...@gmail.com>.

On 23 May 2012 19:45, Josh Wills <jw...@cloudera.com> wrote:

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
>
-0, binding.

I think it's good to broaden the Hadoop stack beyond Java, which is what
Crunch proposes. I know it has a limited organisational breadth -but that
is one thing incubation is designed to address. I just fear that the
exclusion of Jakob putting things off to a bad start.

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Arun C Murthy <ac...@hortonworks.com>.

+1 (binding)

On May 23, 2012, at 11:45 AM, Josh Wills wrote:

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
> 
> Please cast your vote:
> 
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
> 
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
> 
> http://wiki.apache.org/incubator/CrunchProposal
> 
> Proposal text from the wiki:
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
> 
> == Abstract ==
> 
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
> 
> == Proposal ==
> 
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages.  It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
> 
> == Background ==
> 
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) .  These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
> 
> == Rationale ==
> 
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
> 
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
> 
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
> 
> == Initial Goals ==
> 
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
> 
> Some goals include:
> * To stand up a sustaining Apache-based community around the Crunch codebase.
> * Improved documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
> * Performance, usability, and robustness improvements.
> * Improving diagnostic reporting and debugging for individual !MapReduce jobs.
> * Providing a centralized place for contributed extensions and
> domain-specific applications.
> 
> = Current Status =
> 
> == Meritocracy ==
> 
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
> 
> == Community ==
> 
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]]
> and [[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
> mailing lists.
> 
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
> 
> == Core Developers ==
> 
> The core developers for Crunch are:
> * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
> 
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
> 
> == Alignment ==
> 
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
> 
> = Known Risks =
> 
> == Orphaned Products ==
> 
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
> 
> == Inexperience with Open Source ==
> 
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license.  All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
> 
> == Homogeneous Developers ==
> 
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
> 
> == Reliance on Salaried Developers ==
> 
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
> 
> == Relationships with Other Apache Products ==
> 
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
> 
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with.  Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
> 
> == An Excessive Fascination with the Apache Brand ==
> 
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
> 
> = Documentation =
> 
> * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
> * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
> * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
> 
> = Initial Source =
> 
> * https://github.com/cloudera/crunch/tree/
> 
> == Source and Intellectual Property Submission Plan ==
> 
> * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
> 
> == External Dependencies ==
> 
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
> 
> * com.google.protobuf : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
> 
> Non-Apache build tools that are used by Crunch are as follows:
> 
> * Cobertura: GNU GPLv2
> 
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
> 
> == Cryptography ==
> 
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
> 
> = Required  Resources =
> 
> == Mailing lists ==
> 
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
> 
> == Github Repositories ==
> 
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
> 
> == Issue Tracking ==
> 
> JIRA Crunch (CRUNCH)
> 
> == Other Resources ==
> 
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
> 
> = Initial Committers =
> 
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
> 
> = Affiliations =
> 
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
> 
> = Sponsors =
> 
> == Champion ==
> 
> * Patrick Hunt
> 
> == Nominated Mentors ==
> 
> * Tom White
> * Patrick Hunt
> * Arun Murthy
> 
> == Sponsoring Entity ==
> 
> * Apache Incubator PMC
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
> 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

RE: [VOTE] Accept Crunch into the Apache Incubator

Posted by "Franklin, Matthew B." <mf...@mitre.org>.

+1 (binding)

>-----Original Message-----
>From: Josh Wills [mailto:jwills@cloudera.com]
>Sent: Wednesday, May 23, 2012 2:46 PM
>To: general@incubator.apache.org
>Subject: [VOTE] Accept Crunch into the Apache Incubator
>
>I would like to call a vote for accepting "Apache Crunch" for
>incubation in the Apache Incubator. The full proposal is available
>below.  We ask the Incubator PMC to sponsor it, with phunt as
>Champion, and phunt, tomwhite, and acmurthy volunteering to be
>Mentors.
>
>Please cast your vote:
>
>[ ] +1, bring Crunch into Incubator
>[ ] +0, I don't care either way,
>[ ] -1, do not bring Crunch into Incubator, because...
>
>This vote will be open for 72 hours and only votes from the Incubator
>PMC are binding.
>
>http://wiki.apache.org/incubator/CrunchProposal
>
>Proposal text from the wiki:
>-----------------------------------------------------------------------------------------------
>-----------------------
>= Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
>== Abstract ==
>
>Crunch is a Java library for writing, testing, and running pipelines
>of !MapReduce jobs on Apache Hadoop.
>
>== Proposal ==
>
>Crunch is a Java library for writing, testing, and running pipelines
>of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
>high-level API for writing and testing complex !MapReduce jobs that
>require multiple processing stages.  It has a simple, flexible, and
>extensible data model that makes it ideal for processing data that
>does not naturally fit into a relational structure, such as time
>series and serialized object formats like JSON and Avro. It supports
>running pipelines either as a series of !MapReduce jobs on an Apache
>Hadoop cluster or in memory on a single machine for fast testing and
>debugging.
>
>== Background ==
>
>Crunch was initially developed by Cloudera to simplify the process of
>creating sequences of dependent !MapReduce jobs, especially jobs that
>processed non-relational data like time series. Its design was based
>on a paper Google published about a Java library they developed called
>!FlumeJava that was created in order to solve a similar class of
>problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
>2.0 licensed project in October 2011. During this time Crunch has been
>formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
>(February 2012), with an incremental update to version 0.2.1 (March
>2012) .  These releases are also distributed by Cloudera as source and
>binaries from Cloudera's Maven repository.
>
>== Rationale ==
>
>Most of the interesting analytical and data processing tasks that are
>run on an Apache Hadoop cluster require a series of !MapReduce jobs to
>be executed in sequence. Developers who are creating these pipelines
>today need to manually assign the sequence of tasks to perform in a
>dependent chain of !MapReduce jobs, even though there are a number of
>well-known patterns for fusing dependent computations together into a
>single !MapReduce stage and for performing common types of joins and
>aggregations. This results in !MapReduce pipelines that are more
>difficult to test, maintain, and extend to support new functionality.
>
>Furthermore, the type of data that is being stored and processed using
>Apache Hadoop is evolving. Although Hadoop was originally used for
>storing large volumes of structured text in the form of webpages and
>log files, it is now common for Hadoop to store complex, structured
>data formats such as JSON, Apache Avro, and Apache Thrift. These
>formats allow developers to work with serialized objects in
>programming languages like Java, C++, and Python, and allow for new
>types of analysis to be performed on complex data types. Hadoop has
>also been adopted by the scientific research community, who are using
>Hadoop to process time series data, structured binary files in the
>HDF5 format, and large medical and satellite images.
>
>Crunch addresses these challenges by providing a lightweight and
>extensible Java API for defining the stages of a data processing
>pipeline, which can then be run on an Apache Hadoop cluster as a
>sequence of dependent !MapReduce jobs, or in-memory on a single
>machine to facilitate fast testing and debugging. Crunch relies on a
>small set of primitive abstractions that represent immutable,
>distributed collections of objects. Developers define functions that
>are applied to those objects in order to generate new immutable,
>distributed collections of objects. Crunch also provides a library of
>common !MapReduce patterns for performing efficient joins and
>aggregation operations over these distributed collections that
>developers may integrate into their own pipelines. Crunch also
>provides native support for processing structured binary data formats
>like JSON, Apache Avro, and Apache Thrift, and is designed to be
>extensible to support working with any kind of data format that Java
>supports in its native form.
>
>== Initial Goals ==
>
>Crunch is currently in its first major release with a considerable
>number of enhancement requests, tasks, and issues recorded towards its
>future development. The initial goal of this project will be to
>continue to build community in the spirit of the "Apache Way", and to
>address the highly requested features and bug-fixes towards the next
>dot release.
>
>Some goals include:
> * To stand up a sustaining Apache-based community around the Crunch
>codebase.
> * Improved documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
>that aggregate the same data in different ways into a single
>!MapReduce job.
> * Performance, usability, and robustness improvements.
> * Improving diagnostic reporting and debugging for individual !MapReduce
>jobs.
> * Providing a centralized place for contributed extensions and
>domain-specific applications.
>
>= Current Status =
>
>== Meritocracy ==
>
>Crunch was initially developed by Josh Wills in September 2011 at
>Cloudera. Developers external to Cloudera provided feedback, suggested
>features and fixes and implemented extensions of Crunch. Cloudera's
>engineering team has since maintained the project with Josh Wills, Tom
>White, and Brock Noland dedicated towards its improvement.
>Contributors to Crunch include developers from multiple organizations,
>including businesses and universities.
>
>== Community ==
>
>Crunch is currently used by a number of organizations all over the
>world. Crunch has an active and growing user and developer community
>with active participation in
>[[https://groups.google.com/a/cloudera.org/group/crunch-
>users/topics|user]]
>and [[https://groups.google.com/a/cloudera.org/group/crunch-
>dev/topics|developer]]
>mailing lists.
>
>Since open sourcing the project, there have been eight individuals
>from five organizations who have contributed code.
>
>== Core Developers ==
>
>The core developers for Crunch are:
> * Brock Noland: Wrote many of the test cases, user documentation, and
>contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
>Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
>pipeline operations, including the sort library and a library of set
>operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
>the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert did the left/right/outer join implementations
>for Crunch and fixed several bugs in the runtime configuration logic.
>
>Several of the core developers of Crunch have contributed towards
>Hadoop or related Apache projects and are familiar with Apache
>principles and philosophy for community driven software development.
>
>== Alignment ==
>
>Crunch complements several current Apache projects. It complements
>Hadoop !MapReduce by providing a higher-level API for developing
>complex data processing pipelines that require a sequence of
>!MapReduce jobs to perform. Crunch also supports Apache HBase in order
>to simplify the process of writing !MapReduce jobs that execute over
>HBase tables. Crunch makes extensive use of the Apache Avro data
>format as an internal data representation process that makes
>!MapReduce jobs execute quickly and efficiently.
>
>= Known Risks =
>
>== Orphaned Products ==
>
>Crunch is already deployed in production at multiple companies and
>they are actively participating in creating new features. Crunch is
>getting traction with developers and thus the risks of it being
>orphaned are minimal.
>
>== Inexperience with Open Source ==
>
>All code developed for Crunch has been open sourced by Cloudera under
>Apache 2.0 license.  All committers to Crunch are intimately familiar
>with the Apache model for open-source development and are experienced
>with working with new contributors.
>
>== Homogeneous Developers ==
>
>The initial set of committers is from a reduced set of organizations.
>However, we expect that once approved for incubation, the project will
>attract new contributors from diverse organizations and will thus grow
>organically. The submission of patches from developers from several
>different organizations is a strong indication that Crunch will be
>widely adopted.
>
>== Reliance on Salaried Developers ==
>
>It is expected that Crunch will be developed on salaried and volunteer
>time, although all of the initial developers will work on it mainly on
>salaried time.
>
>== Relationships with Other Apache Products ==
>
>Crunch depends upon other Apache Projects: Apache Hadoop, Apache
>HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
>Commons components. Its build depends upon Apache Maven.
>
>Crunch's functionality has some indirect or direct overlap with the
>functionality of Apache Pig and Apache Hive but has several
>significant differences in terms of their user community and the types
>of data they are designed to work with.  Both Hive and Pig are
>high-level languages that are designed to allow non-programmers to
>quickly create and run !MapReduce jobs. Crunch is a Java library whose
>primary community is Java developers who are creating scalable data
>pipelines and !MapReduce-based applications. Additionally, Hive and
>Pig both employ a relational, tuple-oriented data model on top of
>HDFS, which introduces overhead and limits expressive power for
>developers who are working with serialized objects and non-relational
>data types. Crunch uses a lower-level data model that gives developers
>the freedom to work with data in a format that is optimized for the
>problem they are trying to solve.
>
>== An Excessive Fascination with the Apache Brand ==
>
>We would like Crunch to become an Apache project to further foster a
>healthy community of contributors and consumers around the project.
>Since Crunch directly interacts with many Apache Hadoop-related
>projects and solves an important problem of many Hadoop users,
>residing in the Apache Software Foundation will increase interaction
>with the larger community.
>
>= Documentation =
>
> * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
> * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
> * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
>= Initial Source =
>
> * https://github.com/cloudera/crunch/tree/
>
>== Source and Intellectual Property Submission Plan ==
>
> * The initial source is already licensed under the Apache License,
>Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
>== External Dependencies ==
>
>The required external dependencies are all Apache License or
>compatible licenses. Following components with non-Apache licenses are
>enumerated:
>
> * com.google.protobuf : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
>
>Non-Apache build tools that are used by Crunch are as follows:
>
> * Cobertura: GNU GPLv2
>
>Note that Cobertura is optional and is only used for calculating unit
>test coverage.
>
>== Cryptography ==
>
>Crunch uses standard APIs and tools for SSH and SSL communication
>where necessary.
>
>= Required  Resources =
>
>== Mailing lists ==
>
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
>
>== Github Repositories ==
>
>http://github.com/apache/crunch
>git://git.apache.org/crunch.git
>
>== Issue Tracking ==
>
>JIRA Crunch (CRUNCH)
>
>== Other Resources ==
>
>The existing code already has unit and integration tests so we would
>like a Jenkins instance to run them whenever a new patch is submitted.
>This can be added after project creation.
>
>= Initial Committers =
>
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
>= Affiliations =
>
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
>
>= Sponsors =
>
>== Champion ==
>
> * Patrick Hunt
>
>== Nominated Mentors ==
>
> * Tom White
> * Patrick Hunt
> * Arun Murthy
>
>== Sponsoring Entity ==
>
> * Apache Incubator PMC
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>For additional commands, e-mail: general-help@incubator.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Tom White <to...@apache.org>.

+1

Tom

On Wed, May 23, 2012 at 1:45 PM, Josh Wills <jw...@cloudera.com> wrote:
> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages.  It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) .  These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
>  * To stand up a sustaining Apache-based community around the Crunch codebase.
>  * Improved documentation of Java libraries and best practices.
>  * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
>  * Performance, usability, and robustness improvements.
>  * Improving diagnostic reporting and debugging for individual !MapReduce jobs.
>  * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]]
> and [[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
>  * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
>  * Josh Wills: Josh wrote much of the original Crunch code.
>  * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
>  * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
>  * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
>  * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license.  All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with.  Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
>  * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
>  * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
>  * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
>  * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
>  * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
>
>  * com.google.protobuf : New BSD
>  * org.hamcrest: New BSD
>  * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
>  * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required  Resources =
>
> == Mailing lists ==
>
>  * crunch-private (with moderated subscriptions)
>  * crunch-dev
>  * crunch-commits
>  * crunch-user
>
> == Github Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
>  * Brock Noland (brock at cloudera dot com)
>  * Josh Wills (jwills at cloudera dot com)
>  * Gabriel Reid (gabriel dot reid at gmail dot com)
>  * Tom White (tom at cloudera dot com)
>  * Christian Tzolov (christian dot tzolov at gmail dot com)
>  * Robert Chu (robert at wibidata dot com)
>  * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
>  * Brock Noland, Cloudera
>  * Josh Wills, Cloudera
>  * Gabriel Reid, !TomTom
>  * Tom White, Cloudera
>  * Christian Tzolov, !TomTom
>  * Robert Chu, !WibiData
>  * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
>  * Patrick Hunt
>
> == Nominated Mentors ==
>
>  * Tom White
>  * Patrick Hunt
>  * Arun Murthy
>
> == Sponsoring Entity ==
>
>  * Apache Incubator PMC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Mike Percy <mp...@apache.org>.

[x] +1, bring Crunch into Incubator (non-binding)

Regards,
Mike


On Wednesday, May 23, 2012 at 11:45 AM, Josh Wills wrote:

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below. We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
> ------------------------------
----------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages. It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) . These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
> * To stand up a sustaining Apache-based community around the Crunch
codebase.
> * Improved documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
> * Performance, usability, and robustness improvements.
> * Improving diagnostic reporting and debugging for individual !MapReduce
jobs.
> * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user
]]
> and [[
https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
> * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license. All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with. Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
> * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
> * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
> * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
> * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
> * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
>
> * com.google.protobuf : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
> * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required Resources =
>
> == Mailing lists ==
>
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
>
> == Github Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git (http://git.apache.org/crunch.git)
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
> * Patrick Hunt
>
> == Nominated Mentors ==
>
> * Tom White
> * Patrick Hunt
> * Arun Murthy
>
> == Sponsoring Entity ==
>
> * Apache Incubator PMC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org (mailto:
general-unsubscribe@incubator.apache.org)
> For additional commands, e-mail: general-help@incubator.apache.org(mailto:
general-help@incubator.apache.org)

Re: [VOTE] Accept Crunch into the Apache Incubator

Posted by Mike Percy <mp...@apache.org>.

[x] +1, bring Crunch into Incubator (non-binding)

Regards,
Mike


On Wednesday, May 23, 2012 at 11:45 AM, Josh Wills wrote:

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below. We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
> ------------------------------
----------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages. It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) . These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
> * To stand up a sustaining Apache-based community around the Crunch
codebase.
> * Improved documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
> * Performance, usability, and robustness improvements.
> * Improving diagnostic reporting and debugging for individual !MapReduce
jobs.
> * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user
]]
> and [[
https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
> * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license. All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with. Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
> * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
> * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
> * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
> * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
> * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
>
> * com.google.protobuf : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
> * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required Resources =
>
> == Mailing lists ==
>
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
>
> == Github Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git (http://git.apache.org/crunch.git)
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
> * Patrick Hunt
>
> == Nominated Mentors ==
>
> * Tom White
> * Patrick Hunt
> * Arun Murthy
>
> == Sponsoring Entity ==
>
> * Apache Incubator PMC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org (mailto:
general-unsubscribe@incubator.apache.org)
> For additional commands, e-mail: general-help@incubator.apache.org(mailto:
general-help@incubator.apache.org)