Posted to general@incubator.apache.org by Chris Aniszczyk <ca...@gmail.com> on 2014/05/18 23:15:15 UTC

[VOTE] Accept Parquet into the incubator

Based on the results of the discussion thread:
http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E

I would like to call a vote on accepting Parquet into the incubator.
https://wiki.apache.org/incubator/ParquetProposal

[ ] +1 Accept Parquet into the Incubator
[ ] +0 Indifferent to the acceptance of Parquet
[ ] -1 Do not accept Parquet because ...

The vote will be open until Thursday May 22nd 18:00 UTC.

= Parquet Proposal =

== Abstract ==
Parquet is a columnar storage format for Hadoop.

== Proposal ==

We created Parquet to make the advantages of compressed, efficient columnar
data representation available to any project in the Hadoop ecosystem,
regardless of the choice of data processing framework, data model, or
programming language.

== Background ==

Parquet is built from the ground up with complex nested data structures in
mind, and uses the repetition/definition level approach to encoding such
data structures, as popularized by Google Dremel (
https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We believe
this approach is superior to simple flattening of nested name spaces.
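
As a rough illustration of the repetition/definition level idea, here is a
minimal sketch for the simplest possible case: a single repeated integer
field per record (maximum repetition level 1, maximum definition level 1).
The function names `shred` and `assemble` are hypothetical and are not taken
from the Parquet codebase; real schemas with deeper nesting need more levels.

```python
from typing import List, Optional, Tuple

# (repetition level, definition level, value)
Triple = Tuple[int, int, Optional[int]]


def shred(records: List[List[int]]) -> List[Triple]:
    """Flatten one repeated int field into (r, d, value) triples."""
    out: List[Triple] = []
    for values in records:
        if not values:
            # Empty list: the field is not defined for this record (d = 0).
            out.append((0, 0, None))
            continue
        for i, v in enumerate(values):
            # r = 0 starts a new record; r = 1 continues the current list.
            out.append((0 if i == 0 else 1, 1, v))
    return out


def assemble(triples: List[Triple]) -> List[List[int]]:
    """Rebuild the nested records from the flat column of triples."""
    records: List[List[int]] = []
    for r, d, v in triples:
        if r == 0:
            records.append([])      # a new record begins
        if d == 1:
            records[-1].append(v)   # the value is actually present
    return records


column = shred([[1, 2, 3], [], [4]])
restored = assemble(column)
```

The flat column keeps enough information to rebuild the nesting, so the empty
list in the second record survives the round trip instead of being lost, as
it would be with simple flattening.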

Parquet is built to support very efficient compression and encoding
schemes. Parquet allows compression schemes to be specified on a per-column
level, and is future-proofed to allow adding more encodings as they are
invented and implemented. We separate the concepts of encoding and
compression, allowing Parquet consumers to implement operators that work
directly on encoded data without paying a decompression and decoding penalty
when possible.
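
To make the separation of encoding and compression concrete, the following
sketch dictionary-encodes a string column, answers a simple query on the
integer codes without decoding them back to strings, and treats compression
as an independent step on the encoded bytes. It is an assumption-laden toy:
`zlib` stands in for whatever codec a column might use, and none of the
function names or the byte layout come from Parquet itself.

```python
import struct
import zlib
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Encoding step: replace each value by its index in a dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def count_equal(dictionary: List[str], codes: List[int], target: str) -> int:
    """Operate on encoded data: one dictionary lookup, then integer compares."""
    if target not in dictionary:
        return 0
    code = dictionary.index(target)
    return sum(1 for c in codes if c == code)


def compress_codes(codes: List[int], level: int = 6) -> bytes:
    """Compression step: independent of how the values were encoded."""
    raw = struct.pack(f"<{len(codes)}I", *codes)
    return zlib.compress(raw, level)


def decompress_codes(blob: bytes) -> List[int]:
    """Undo compression only; the result is still dictionary-encoded."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))


dictionary, codes = dictionary_encode(["us", "fr", "us", "de", "us"])
blob = compress_codes(codes)
matches = count_equal(dictionary, decompress_codes(blob), "us")
```

Because the two steps are separate, a different codec (or none) could be
chosen per column without touching `count_equal`, which only ever sees the
dictionary and the integer codes.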

== Rationale ==

Parquet is built to be used by anyone. We believe that an efficient,
well-implemented columnar storage substrate should be useful to all
frameworks without the cost of extensive, difficult-to-set-up
dependencies.

Furthermore, the rapid growth of the Parquet community is empowered by open
source. We believe the Apache Software Foundation is a great fit as the
long-term home for Parquet, as it provides an established process for
community-driven development and decision making by consensus. This is
exactly the model we want for future Parquet development.

== Initial Goals ==

 * Move the existing codebase to Apache
 * Integrate with the Apache development process
 * Ensure all dependencies are compliant with Apache License version 2.0
 * Incremental development and releases per Apache guidelines

== Current Status ==

Parquet has undergone 2 major releases of the core format
(https://github.com/Parquet/parquet-format/releases) and 22 releases of the
supporting set of Java libraries
(https://github.com/Parquet/parquet-mr/releases).

The Parquet source is currently hosted at GitHub, which will seed the
Apache git repository.

=== Meritocracy ===

We plan to invest in supporting a meritocracy. We will discuss the
requirements in an open forum. Several companies have already expressed
interest in this project, and we intend to invite additional developers to
participate. We will encourage and monitor community participation so that
privileges can be extended to those that contribute.

=== Community ===

There is a large need for an advanced columnar storage format for Hadoop.
Parquet is being used in production by many organizations (see
https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md)

 * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
 * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
 * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
 * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
 * Twitter: https://twitter.com/J_/statuses/315844725611581441

By bringing Parquet into Apache, we believe that the community will grow
even bigger.

=== Core Developers ===

Parquet was initially developed as a collaboration between Twitter,
Cloudera and Criteo.

See
https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop

=== Alignment ===

We believe that having Parquet at Apache will help further the growth of
the big-data community, as it will encourage cooperation within the greater
ecosystem of projects spawned by Apache Hadoop. The alignment is also
beneficial to other Apache communities (such as Hadoop, Hive, Avro).

== Known Risks ==

=== Orphaned Products ===

The risk of the Parquet project being abandoned is minimal. There are many
organizations using Parquet in production, including Twitter, Cloudera,
Stripe, and Salesforce (
http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).

=== Inexperience with Open Source ===

Parquet has existed as a healthy open source project for one year. During
that time, we have curated an open-source community successfully, attracting
over 40 contributors (see
https://github.com/Parquet/parquet-mr/graphs/contributors) from a diverse
group of companies.
Several of the core contributors to the project are deeply familiar with
OSS and Apache specifically: Julien Le Dem was until recently the PMC Chair
for Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney
are also Apache Pig committers with contributions to several other Apache
projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
multiple other related projects. Brock Noland is a Hive committer.

=== Homogeneous Developers ===

The initial committers come from a number of companies and countries.
Parquet has an active community of developers, and we are committed to
recruiting additional committers based on their contributions to the
project. The Java library component alone has contributions from 31
individual GitHub accounts, 14 of which contributed over 1000 lines of code.

=== Reliance on Salaried Developers ===

It is expected that Parquet development will occur on both salaried time
and on volunteer time, after hours. The majority of initial committers are
paid by their employers to contribute to this project. However, they are
all passionate about the project, and we are confident that the project
will continue even if no salaried developers contribute to the project. As
evidence of this statement, we present the GitHub punchcard (see
https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that a lot
of activity happens on weekends. We are committed to recruiting additional
committers including non-salaried developers.

=== Relationships with Other Apache Products ===

As mentioned in the Alignment section, Parquet is closely related to
Hadoop. It provides an API that has allowed it to be easily integrated with
many other Apache projects: Pig, Hive, Avro, Thrift, Spark, Drill, Crunch,
and Tajo. Some of the features it provides are similar to those of the ORC
file format, which is part of the Hive project. However, Parquet focuses on
being framework-agnostic and language-independent, and it has been very
successful to that end. In addition to the Apache projects mentioned above,
Parquet is also integrated with other open source projects, including
Protocol Buffers, Cloudera Impala, and Scrooge. We look forward to
continuing to collaborate with those communities, as well as other Apache
communities.

=== An Excessive Fascination with the Apache Brand ===

Parquet is an already healthy and well-known open source project. This
proposal is not for the purpose of generating publicity. Rather, the
primary benefits to joining Apache are those outlined in the Rationale
section.

== Documentation ==

Documentation is currently located as README markdown files:

 * https://github.com/Parquet/parquet-format
 * https://github.com/Parquet/parquet-mr

== Source and Intellectual Property Submission Plan ==

The Parquet codebase is currently hosted on GitHub:
https://github.com/Parquet.

These are the codebases that we would migrate to the Apache foundation.

== External Dependencies ==


 * Junit: EPL
 * Apache Commons: ALv2
 * Apache Thrift: ALv2
 * Apache Maven: ALv2
 * Apache Avro: ALv2
 * Apache Hadoop: ALv2
 * Google Guava: ALv2
 * Google Protobuf: New BSD License

== Cryptography ==

We do not expect Parquet to be a controlled export item due to the use of
encryption.

== Required Resources ==

=== Mailing lists ===

 * private@parquet.incubator.apache.org
 * commits@parquet.incubator.apache.org
 * dev@parquet.incubator.apache.org

== Subversion Directory ==

Git is the preferred source control system:

 * git://git.apache.org/parquet-format
 * git://git.apache.org/parquet-mr

== Issue Tracking ==

We'd like to keep using the Git review and issue tracking tools, with
pull requests closed through git commit messages in git.apache.org.

== Initial Committers ==

 * Aniket Mokashi <an...@gmail.com>
 * Brock Noland <br...@apache.org>
 * Chris Aniszczyk <ca...@gmail.com>
 * Dmitriy Ryaboy <dv...@apache.org>
 * Jake Farrell <jf...@apache.org>
 * Jonathan Coveney <jc...@gmail.com>
 * Julien Le Dem <ju...@apache.org>
 * Lukas Nalezenec <lu...@gmail.com>
 * Marcel Kornacker <ma...@cloudera.com>
 * Mickael Lacour
 * Nong Li <no...@cloudera.com>
 * Remy Pecqueur
 * Ryan Blue <bl...@cloudera.com>
 * Tianshuo Deng <de...@gmail.com>
 * Tom White <to...@apache.org>
 * Wesley Peck

== Affiliations ==

 * Aniket Mokashi - Twitter
 * Brock Noland - Cloudera
 * Chris Aniszczyk - Twitter
 * Dmitriy Ryaboy - Twitter
 * Jake Farrell
 * Jonathan Coveney - Twitter
 * Julien Le Dem - Twitter
 * Lukas Nalezenec
 * Marcel Kornacker - Cloudera
 * Mickael Lacour - Criteo
 * Nong Li - Cloudera
 * Remy Pecqueur - Criteo
 * Ryan Blue - Cloudera
 * Tianshuo Deng - Twitter
 * Tom White - Cloudera
 * Wesley Peck - ARRIS, Inc.

== Sponsors ==

=== Champion ===

 * Todd Lipcon

=== Nominated Mentors ===

 * Tom White
 * Chris Mattmann
 * Jake Farrell
 * Roman Shaposhnik

=== Sponsoring Entity ===

The Apache Incubator

-- 
Cheers,

Chris Aniszczyk
http://aniszczyk.org
+1 512 961 6719

Re: [VOTE] Accept Parquet into the incubator

Posted by Andrei Savu <sa...@gmail.com>.

+1 (binding)

-- Andrei Savu (from mobile)

Re: [VOTE] Accept Parquet into the incubator

Posted by Hyunsik Choi <hy...@apache.org>.

+1

(non binding)

On Tue, May 20, 2014 at 11:58 AM, Julien Le Dem <ju...@ledem.net> wrote:
> [X] +1 Accept Parquet into the Incubator
> (non binding)
> Julien

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Parquet into the incubator

Posted by Julien Le Dem <ju...@ledem.net>.

[X] +1 Accept Parquet into the Incubator
(non binding)
Julien


Re: [VOTE] Accept Parquet into the incubator

Posted by Roman Shaposhnik <rv...@apache.org>.

On Sun, May 18, 2014 at 2:15 PM, Chris Aniszczyk <ca...@gmail.com> wrote:
> [ ] +1 Accept Parquet into the Incubator
> [ ] +0 Indifferent to the acceptance of Parquet
> [ ] -1 Do not accept Parquet because ...

+1 (binding)

Thanks,
Roman.

Re: [VOTE] Accept Parquet into the incubator

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

+1 from me (binding)!

Cheers,
Chris



Re: [VOTE] Accept Parquet into the incubator

Posted by Timothy Chen <tn...@gmail.com>.

+1 non-binding.

Tim


> On May 18, 2014, at 6:14 PM, Jake Farrell <jf...@apache.org> wrote:
> 
> +1 (binding)
> 
> -Jake


Re: [VOTE] Accept Parquet into the incubator

Posted by Jake Farrell <jf...@apache.org>.

+1 (binding)

-Jake




Re: [VOTE] Accept Parquet into the incubator

Posted by Mark Struberg <st...@yahoo.de>.

+1 (binding)



LieGrue,
strub






---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Parquet into the incubator

Posted by Olivier Lamy <ol...@apache.org>.

+1

On 19 May 2014 07:15, Chris Aniszczyk <ca...@gmail.com> wrote:
> Based on the results of the discussion thread:
> http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E
>
> I would like to call a vote on accepting Parquet into the incubator.
> https://wiki.apache.org/incubator/ParquetProposal
>
> [ ] +1 Accept Parquet into the Incubator
> [ ] +0 Indifferent to the acceptance of Parquet
> [ ] -1 Do not accept Parquet because ...
>
> The vote will be open until Thursday May 22nd 18:00 UTC.



-- 
Olivier Lamy
Ecetera: http://ecetera.com.au
http://twitter.com/olamy | http://linkedin.com/in/olamy

Re: [VOTE] Accept Parquet into the incubator

Posted by Henry Saputra <he...@gmail.com>.

+1 binding

On Sunday, May 18, 2014, Chris Aniszczyk <ca...@gmail.com> wrote:

> Based on the results of the discussion thread:
>
> http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E
>
> I would like to call a vote on accepting Parquet into the incubator.
> https://wiki.apache.org/incubator/ParquetProposal
>
> [ ] +1 Accept Parquet into the Incubator
> [ ] +0 Indifferent to the acceptance of Parquet
> [ ] -1 Do not accept Parquet because ...
>
> The vote will be open until Thursday May 22nd 18:00 UTC.

Re: [VOTE] Accept Parquet into the incubator

Posted by Jarek Jarcec Cecho <ja...@apache.org>.

> [ X ] +1 Accept Parquet into the Incubator

(non-binding)

Jarcec

On Sun, May 18, 2014 at 02:15:15PM -0700, Chris Aniszczyk wrote:
> Based on the results of the discussion thread:
> http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E
> 
> I would like to call a vote on accepting Parquet into the incubator.
> https://wiki.apache.org/incubator/ParquetProposal
> 
> [ ] +1 Accept Parquet into the Incubator
> [ ] +0 Indifferent to the acceptance of Parquet
> [ ] -1 Do not accept Parquet because ...
> 
> The vote will be open until Thursday May 22nd 18:00 UTC.
> 
> = Parquet Proposal =
> 
> == Abstract ==
> Parquet is a columnar storage format for Hadoop.
> 
> == Proposal ==
> 
> We created Parquet to make the advantages of compressed, efficient columnar
> data representation available to any project in the Hadoop ecosystem,
> regardless of the choice of data processing framework, data model, or
> programming language.
> 
> == Background ==
> 
> Parquet is built from the ground up with complex nested data structures in
> mind, and uses the repetition/definition level approach to encoding such
> data structures, as popularized by Google Dremel (
> https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We believe
> this approach is superior to simple flattening of nested name spaces.
> 
> Parquet is built to support very efficient compression and encoding
> schemes. Parquet allows compression schemes to be specified on a per-column
> level, and is future-proofed to allow adding more encodings as they are
> invented and implemented. We separate the concepts of encoding and
> compression, allowing parquet consumers to implement operators that work
> directly on encoded data without paying decompression and decoding penalty
> when possible.
> 
> == Rationale ==
> 
> Parquet is built to be used by anyone. We believe that an efficient,
> well-implemented columnar storage substrate should be useful to all
> frameworks without the cost of extensive and difficult to set up
> dependencies.
> 
> Furthermore, the rapid growth of Parquet community is empowered by open
> source. We believe the Apache foundation is a great fit as the long-term
> home for Parquet, as it provides an established process for
> community-driven development and decision making by consensus. This is
> exactly the model we want for future Parquet development.
> 
> == Initial Goals ==
> 
>  * Move the existing codebase to Apache
>  * Integrate with the Apache development process
>  * Ensure all dependencies are compliant with Apache License version 2.0
>  * Incremental development and releases per Apache guidelines
> 
> == Current Status ==
> 
> Parquet has undergone 2 major releases:
> https://github.com/Parquet/parquet-format/releases of the core format and
> 22 releases: https://github.com/Parquet/parquet-mr/releases of the
> supporting set of Java libraries.
> 
> The Parquet source is currently hosted at GitHub, which will seed the
> Apache git repository.
> 
> === Meritocracy ===
> 
> We plan to invest in supporting a meritocracy. We will discuss the
> requirements in an open forum. Several companies have already expressed
> interest in this project, and we intend to invite additional developers to
> participate. We will encourage and monitor community participation so that
> privileges can be extended to those that contribute.
> 
> === Community ===
> 
> There is a large need for an advanced columnar storage format for Hadoop.
> Parquet is being used in production by many organizations (see
> https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md)
> 
>  * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
>  * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
>  * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
>  * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
>  * Twitter: https://twitter.com/J_/statuses/315844725611581441
> 
> By bringing Parquet into Apache, we believe that the community will grow
> even bigger.
> 
> === Core Developers ===
> 
> Parquet was initially developed as a collaboration between Twitter,
> Cloudera and Criteo.
> 
> See
> https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
> 
> === Alignment ===
> 
> We believe that having Parquet at Apache will help further the growth of
> the big-data community, as it will encourage cooperation within the greater
> ecosystem of projects spawned by Apache Hadoop. The alignment is also
> beneficial to other Apache communities (such as Hadoop, Hive, Avro).
> 
> == Known Risks ==
> 
> === Orphaned Products ===
> 
> The risk of the Parquet project being abandoned is minimal. There are many
> organizations using Parquet in production, including Twitter, Cloudera,
> Stripe, and Salesforce (
> http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
> 
> === Inexperience with Open Source ===
> 
> Parquet has existed as a healthy open source for one year. During that
> time, we have curated an open-source community successfully, attracting
> over 40 contributors (see
> https://github.com/Parquet/parquet-mr/graphs/contributors) from a diverse
> group of companies.
> Several of the core contributors to the project are deeply familiar with
> OSS and Apache specifically: Julien Le Dem was until recently the PMC Chair
> for Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney
> are also Apache Pig committers with contributions to several other Apache
> projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
> multiple other related projects. Brock Noland is a Hive committer.
> 
> === Homogenous Developers ===
> 
> The initial committers come from a number of companies and countries.
> Parquet has an active community of developers, and we are committed to
> recruiting additional committers based on their contributions to the
> project. The java library component alone has contributions from 31
> individual github accounts, 14 of which contributed over 1000 lines of code.
> 
> === Reliance on Salaried Developers ===
> 
> It is expected that Parquet development will occur on both salaried time
> and on volunteer time, after hours. The majority of initial committers are
> paid by their employers to contribute to this project. However, they are
> all passionate about the project, and we are confident that the project
> will continue even if no salaried developers contribute to the project. As
> evidence of this statement, we present the GitHub punchcard (see
> https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that a lot
> of activity happens on weekends. We are committed to recruiting additional
> committers including non-salaried developers.
> 
> === Relationships with Other Apache Products ===
> 
> As mentioned in the Alignment section, Parquet is closely related to
> Hadoop. It provides an API that allowed it to be easily integrated with
> many other apache projects: Pig, Hive, Avro, Thrift, Spark, Drill, Crunch,
> Tajo. Some of the features it provides are similar to the ORC file format
> which is part of the Hive project. However Parquet focused on being
> framework agnostic and language independent and has been really successful
> to that end. On top of the Apache projects mentioned above, Parquet is also
> integrated with other open source projects, including Protocol Buffers,
> Cloudera Impala or Scrooge. We look forward to continue collaborating with
> those communities, as well as other Apache communities.
> 
> === An Excessive Fascination with the Apache Brand ===
> 
> Parquet is an already healthy and well known open source project. This
> proposal is not for the purpose of generating publicity. Rather, the
> primary benefits to joining Apache are those outlined in the Rationale
> section.
> 
> == Documentation ==
> 
> Documentation is currently located as README markdown files:
> 
>  * https://github.com/Parquet/parquet-format
>  * https://github.com/Parquet/parquet-mr
> 
> == Source and Intellectual Property Submission Plan ==
> 
> The Parquet codebase is currently hosted on Github:
> https://github.com/Parquet.
> 
> These are the codebases that we would migrate to the Apache foundation.
> 
> == External Dependencies ==
> 
> 
>  * Junit: EPL
>  * Apache Commons: ALv2
>  * Apache Thrift: ALv2
>  * Apache Maven: ALv2
>  * Apache Avro: ALv2
>  * Apache Hadoop: ALv2
>  * Google Guava: ALv2
>  * Google Protobuf: New BSD License
> 
> == Cryptography ==
> 
> We do not expect Parquet to be a controlled export item due to the use of
> encryption.
> 
> == Required Resources ==
> 
> === Mailing lists ===
> 
>  * private@parquet.incubator.apache.org
>  * commits@parquet.incubator.apache.org
>  * dev@parquet.incubator.apache.org
> 
> == Subversion Directory ==
> 
> Git is the preferred source control system:
> 
>  * git://git.apache.org/parquet-format
>  * git://git.apache.org/parquet-mr
> 
> == Issue Tracking ==
> 
> We'd like to keep using the Git review and issue tracking tools,
> controlling the closing of pull requests through Git commit messages on
> git.apache.org.
> 
> == Initial Committers ==
> 
>  * Aniket Mokashi <an...@gmail.com>
>  * Brock Noland <br...@apache.org>
>  * Chris Aniszczyk <ca...@gmail.com>
>  * Dmitriy Ryaboy <dv...@apache.org>
>  * Jake Farrell <jf...@apache.org>
>  * Jonathan Coveney <jc...@gmail.com>
>  * Julien Le Dem <ju...@apache.org>
>  * Lukas Nalezenec <lu...@gmail.com>
>  * Marcel Kornacker <ma...@cloudera.com>
>  * Mickael Lacour
>  * Nong Li <no...@cloudera.com>
>  * Remy Pecqueur
>  * Ryan Blue <bl...@cloudera.com>
>  * Tianshuo Deng <de...@gmail.com>
>  * Tom White <to...@apache.org>
>  * Wesley Peck
> 
> == Affiliations ==
> 
>  * Aniket Mokashi - Twitter
>  * Brock Noland - Cloudera
>  * Chris Aniszczyk - Twitter
>  * Dmitriy Ryaboy - Twitter
>  * Jake Farrell
>  * Jonathan Coveney - Twitter
>  * Julien Le Dem - Twitter
>  * Lukas Nalezenec
>  * Marcel Kornacker - Cloudera
>  * Mickael Lacour - Criteo
>  * Nong Li - Cloudera
>  * Remy Pecqueur - Criteo
>  * Ryan Blue - Cloudera
>  * Tianshuo Deng - Twitter
>  * Tom White - Cloudera
>  * Wesley Peck - ARRIS, Inc.
> 
> == Sponsors ==
> 
> === Champion ===
> 
>  * Todd Lipcon
> 
> === Nominated Mentors ===
> 
>  * Tom White
>  * Chris Mattmann
>  * Jake Farrell
>  * Roman Shaposhnik
> 
> === Sponsoring Entity ===
> 
> The Apache Incubator
> 
> -- 
> Cheers,
> 
> Chris Aniszczyk
> http://aniszczyk.org
> +1 512 961 6719

Re: [VOTE] Accept Parquet into the incubator

Posted by Bertrand Delacretaz <bd...@apache.org>.

On Sun, May 18, 2014 at 11:15 PM, Chris Aniszczyk <ca...@gmail.com> wrote:
> ...I would like to call a vote on accepting Parquet into the incubator.
> https://wiki.apache.org/incubator/ParquetProposal..

+1

-Bertrand

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Parquet into the incubator

Posted by Tom White <to...@gmail.com>.

+1

Tom

On Mon, May 19, 2014 at 9:15 AM, Chris Aniszczyk <ca...@gmail.com> wrote:
> Based on the results of the discussion thread:
> http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E
>
> I would like to call a vote on accepting Parquet into the incubator.
> https://wiki.apache.org/incubator/ParquetProposal
>
> [ ] +1 Accept Parquet into the Incubator
> [ ] +0 Indifferent to the acceptance of Parquet
> [ ] -1 Do not accept Parquet because ...
>
> The vote will be open until Thursday May 22nd 18:00 UTC.

Re: [VOTE] Accept Parquet into the incubator

Posted by Hitesh Shah <hi...@apache.org>.

+1 (non-binding)

— Hitesh

On May 18, 2014, at 2:15 PM, Chris Aniszczyk <ca...@gmail.com> wrote:

> Based on the results of the discussion thread:
> http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E
> 
> I would like to call a vote on accepting Parquet into the incubator.
> https://wiki.apache.org/incubator/ParquetProposal
> 
> [ ] +1 Accept Parquet into the Incubator
> [ ] +0 Indifferent to the acceptance of Parquet
> [ ] -1 Do not accept Parquet because ...
> 
> The vote will be open until Thursday May 22nd 18:00 UTC.


Re: [VOTE] Accept Parquet into the incubator

Posted by Henry Saputra <he...@gmail.com>.

Hi Chris, could you re-send the tallied VOTE result with the subject
prefixed with [RESULT]?


- Henry

On Wed, May 21, 2014 at 3:56 PM, Chris Aniszczyk <ca...@gmail.com> wrote:
> With 18 +1 votes (and 10+ as binding votes), I'll consider this vote a
> success.
>
> I'll proceed with the next steps.
>
> Thank you!
>
>
>
> On Sun, May 18, 2014 at 3:57 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> +1 from me (the proposed Champion)
>>
>> -Todd
>>
>>
>> On Sun, May 18, 2014 at 2:15 PM, Chris Aniszczyk <caniszczyk@gmail.com
>> >wrote:
>>
>> > Based on the results of the discussion thread:
>> >
>> >
>> http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E
>> >
>> > I would like to call a vote on accepting Parquet into the incubator.
>> > https://wiki.apache.org/incubator/ParquetProposal
>> >
>> > [ ] +1 Accept Parquet into the Incubator
>> > [ ] +0 Indifferent to the acceptance of Parquet
>> > [ ] -1 Do not accept Parquet because ...
>> >
>> > The vote will be open until Thursday May 22nd 18:00 UTC.
>> >
>> > = Parquet Proposal =
>> >
>> > == Abstract ==
>> > Parquet is a columnar storage format for Hadoop.
>> >
>> > == Proposal ==
>> >
>> > We created Parquet to make the advantages of compressed, efficient
>> columnar
>> > data representation available to any project in the Hadoop ecosystem,
>> > regardless of the choice of data processing framework, data model, or
>> > programming language.
>> >
>> > == Background ==
>> >
>> > Parquet is built from the ground up with complex nested data structures
>> in
>> > mind, and uses the repetition/definition level approach to encoding such
>> > data structures, as popularized by Google Dremel (
>> > https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We
>> believe
>> > this approach is superior to simple flattening of nested name spaces.
>> >
>> > Parquet is built to support very efficient compression and encoding
>> > schemes. Parquet allows compression schemes to be specified on a
>> per-column
>> > level, and is future-proofed to allow adding more encodings as they are
>> > invented and implemented. We separate the concepts of encoding and
>> > compression, allowing parquet consumers to implement operators that work
>> > directly on encoded data without paying decompression and decoding
>> penalty
>> > when possible.
>> >
>> > == Rationale ==
>> >
>> > Parquet is built to be used by anyone. We believe that an efficient,
>> > well-implemented columnar storage substrate should be useful to all
>> > frameworks without the cost of extensive and difficult to set up
>> > dependencies.
>> >
>> > Furthermore, the rapid growth of Parquet community is empowered by open
>> > source. We believe the Apache foundation is a great fit as the long-term
>> > home for Parquet, as it provides an established process for
>> > community-driven development and decision making by consensus. This is
>> > exactly the model we want for future Parquet development.
>> >
>> > == Initial Goals ==
>> >
>> >  * Move the existing codebase to Apache
>> >  * Integrate with the Apache development process
>> >  * Ensure all dependencies are compliant with Apache License version 2.0
>> >  * Incremental development and releases per Apache guidelines
>> >
>> > == Current Status ==
>> >
>> > Parquet has undergone 2 major releases:
>> > https://github.com/Parquet/parquet-format/releases of the core format
>> and
>> > 22 releases: https://github.com/Parquet/parquet-mr/releases of the
>> > supporting set of Java libraries.
>> >
>> > The Parquet source is currently hosted at GitHub, which will seed the
>> > Apache git repository.
>> >
>> > === Meritocracy ===
>> >
>> > We plan to invest in supporting a meritocracy. We will discuss the
>> > requirements in an open forum. Several companies have already expressed
>> > interest in this project, and we intend to invite additional developers
>> to
>> > participate. We will encourage and monitor community participation so
>> that
>> > privileges can be extended to those that contribute.
>> >
>> > === Community ===
>> >
>> > There is a large need for an advanced columnar storage format for Hadoop.
>> > Parquet is being used in production by many organizations (see
>> > https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md)
>> >
>> >  * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
>> >  * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
>> >  * Salesforce:
>> https://twitter.com/TwitterOSS/statuses/392734610116726784
>> >  * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
>> >  * Twitter: https://twitter.com/J_/statuses/315844725611581441
>> >
>> > By bringing Parquet into Apache, we believe that the community will grow
>> > even bigger.
>> >
>> > === Core Developers ===
>> >
>> > Parquet was initially developed as a collaboration between Twitter,
>> > Cloudera and Criteo.
>> >
>> > See
>> >
>> >
>> https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
>> >
>> > === Alignment ===
>> >
>> > We believe that having Parquet at Apache will help further the growth of
>> > the big-data community, as it will encourage cooperation within the
>> greater
>> > ecosystem of projects spawned by Apache Hadoop. The alignment is also
>> > beneficial to other Apache communities (such as Hadoop, Hive, Avro).
>> >
>> > == Known Risks ==
>> >
>> > === Orphaned Products ===
>> >
>> > The risk of the Parquet project being abandoned is minimal. There are
>> many
>> > organizations using Parquet in production, including Twitter, Cloudera,
>> > Stripe, and Salesforce (
>> > http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
>> >
>> > === Inexperience with Open Source ===
>> >
>> > Parquet has existed as a healthy open source for one year. During that
>> > time, we have curated an open-source community successfully, attracting
>> > over 40 contributors (see


Re: [VOTE] Accept Parquet into the incubator

Posted by Chris Aniszczyk <ca...@gmail.com>.

With 18 +1 votes (10+ of them binding), I'll consider this vote a
success.

I'll proceed with the next steps.

Thank you!



-- 
Cheers,

Chris Aniszczyk
http://aniszczyk.org
+1 512 961 6719

Re: [VOTE] Accept Parquet into the incubator

Posted by Todd Lipcon <to...@cloudera.com>.

+1 from me (the proposed Champion)

-Todd


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: [VOTE] Accept Parquet into the incubator

Posted by Brock Noland <br...@cloudera.com>.

[X] +1 Accept Parquet into the Incubator

non-binding


On Mon, May 19, 2014 at 11:24 AM, Andrew Purtell <ap...@apache.org>wrote:

> +1 (binding)
>
>
> On Sun, May 18, 2014 at 2:15 PM, Chris Aniszczyk <caniszczyk@gmail.com
> >wrote:
>
> > Based on the results of the discussion thread:
> >
> >
> http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E
> >
> > I would like to call a vote on accepting Parquet into the incubator.
> > https://wiki.apache.org/incubator/ParquetProposal
> >
> > [ ] +1 Accept Parquet into the Incubator
> > [ ] +0 Indifferent to the acceptance of Parquet
> > [ ] -1 Do not accept Parquet because ...
> >
> > The vote will be open until Thursday May 22nd 18:00 UTC.
> >
> > = Parquet Proposal =
> >
> > == Abstract ==
> > Parquet is a columnar storage format for Hadoop.
> >
> > == Proposal ==
> >
> > We created Parquet to make the advantages of compressed, efficient
> columnar
> > data representation available to any project in the Hadoop ecosystem,
> > regardless of the choice of data processing framework, data model, or
> > programming language.
> >
> > == Background ==
> >
> > Parquet is built from the ground up with complex nested data structures
> in
> > mind, and uses the repetition/definition level approach to encoding such
> > data structures, as popularized by Google Dremel (
> > https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We
> believe
> > this approach is superior to simple flattening of nested name spaces.
> >
> > Parquet is built to support very efficient compression and encoding
> > schemes. Parquet allows compression schemes to be specified on a
> per-column
> > level, and is future-proofed to allow adding more encodings as they are
> > invented and implemented. We separate the concepts of encoding and
> > compression, allowing parquet consumers to implement operators that work
> > directly on encoded data without paying decompression and decoding
> penalty
> > when possible.
> >
> > == Rationale ==
> >
> > Parquet is built to be used by anyone. We believe that an efficient,
> > well-implemented columnar storage substrate should be useful to all
> > frameworks without the cost of extensive and difficult to set up
> > dependencies.
> >
> > Furthermore, the rapid growth of Parquet community is empowered by open
> > source. We believe the Apache foundation is a great fit as the long-term
> > home for Parquet, as it provides an established process for
> > community-driven development and decision making by consensus. This is
> > exactly the model we want for future Parquet development.
> >
> > == Initial Goals ==
> >
> >  * Move the existing codebase to Apache
> >  * Integrate with the Apache development process
> >  * Ensure all dependencies are compliant with Apache License version 2.0
> >  * Incremental development and releases per Apache guidelines
> >
> > == Current Status ==
> >
> > Parquet has undergone 2 major releases:
> > https://github.com/Parquet/parquet-format/releases of the core format
> and
> > 22 releases: https://github.com/Parquet/parquet-mr/releases of the
> > supporting set of Java libraries.
> >
> > The Parquet source is currently hosted at GitHub, which will seed the
> > Apache git repository.
> >
> > === Meritocracy ===
> >
> > We plan to invest in supporting a meritocracy. We will discuss the
> > requirements in an open forum. Several companies have already expressed
> > interest in this project, and we intend to invite additional developers
> to
> > participate. We will encourage and monitor community participation so
> that
> > privileges can be extended to those that contribute.
> >
> > === Community ===
> >
> > There is a large need for an advanced columnar storage format for Hadoop.
> > Parquet is being used in production by many organizations (see
> > https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md)
> >
> >  * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
> >  * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
> >  * Salesforce:
> https://twitter.com/TwitterOSS/statuses/392734610116726784
> >  * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
> >  * Twitter: https://twitter.com/J_/statuses/315844725611581441
> >
> > By bringing Parquet into Apache, we believe that the community will grow
> > even bigger.
> >
> > === Core Developers ===
> >
> > Parquet was initially developed as a collaboration between Twitter,
> > Cloudera and Criteo.
> >
> > See
> >
> >
> https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
> >
> > === Alignment ===
> >
> > We believe that having Parquet at Apache will help further the growth of
> > the big-data community, as it will encourage cooperation within the
> greater
> > ecosystem of projects spawned by Apache Hadoop. The alignment is also
> > beneficial to other Apache communities (such as Hadoop, Hive, Avro).
> >
> > == Known Risks ==
> >
> > === Orphaned Products ===
> >
> > The risk of the Parquet project being abandoned is minimal. There are
> many
> > organizations using Parquet in production, including Twitter, Cloudera,
> > Stripe, and Salesforce (
> > http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
> >
> > === Inexperience with Open Source ===
> >
> > Parquet has existed as a healthy open source project for one year. During
> > that time, we have curated an open-source community successfully,
> > attracting over 40 contributors (see
> > https://github.com/Parquet/parquet-mr/graphs/contributors) from a diverse
> > group of companies.
> > Several of the core contributors to the project are deeply familiar with
> > OSS and Apache specifically: Julien Le Dem was until recently the PMC Chair
> > for Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney
> > are also Apache Pig committers with contributions to several other Apache
> > projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
> > multiple other related projects. Brock Noland is a Hive committer.
> >
> > === Homogenous Developers ===
> >
> > The initial committers come from a number of companies and countries.
> > Parquet has an active community of developers, and we are committed to
> > recruiting additional committers based on their contributions to the
> > project. The Java library component alone has contributions from 31
> > individual GitHub accounts, 14 of which contributed over 1000 lines of
> > code.
> >
> > === Reliance on Salaried Developers ===
> >
> > It is expected that Parquet development will occur on both salaried time
> > and on volunteer time, after hours. The majority of initial committers are
> > paid by their employers to contribute to this project. However, they are
> > all passionate about the project, and we are confident that the project
> > will continue even if no salaried developers contribute to the project. As
> > evidence of this statement, we present the GitHub punchcard (see
> > https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that a
> > lot of activity happens on weekends. We are committed to recruiting
> > additional committers including non-salaried developers.
> >
> > === Relationships with Other Apache Products ===
> >
> > As mentioned in the Alignment section, Parquet is closely related to
> > Hadoop. It provides an API that has allowed it to be easily integrated with
> > many other Apache projects: Pig, Hive, Avro, Thrift, Spark, Drill, Crunch,
> > and Tajo. Some of the features it provides are similar to the ORC file
> > format, which is part of the Hive project. However, Parquet has focused on
> > being framework agnostic and language independent and has been really
> > successful to that end. On top of the Apache projects mentioned above,
> > Parquet is also integrated with other open source projects, including
> > Protocol Buffers, Cloudera Impala, and Scrooge. We look forward to
> > continuing to collaborate with those communities, as well as other Apache
> > communities.
> >
> > === An Excessive Fascination with the Apache Brand ===
> >
> > Parquet is an already healthy and well known open source project. This
> > proposal is not for the purpose of generating publicity. Rather, the
> > primary benefits to joining Apache are those outlined in the Rationale
> > section.
> >
> > == Documentation ==
> >
> > Documentation is currently located as README markdown files:
> >
> >  * https://github.com/Parquet/parquet-format
> >  * https://github.com/Parquet/parquet-mr
> >
> > == Source and Intellectual Property Submission Plan ==
> >
> > The Parquet codebase is currently hosted on Github:
> > https://github.com/Parquet.
> >
> > These are the codebases that we would migrate to the Apache foundation.
> >
> > == External Dependencies ==
> >
> >
> >  * Junit: EPL
> >  * Apache Commons: ALv2
> >  * Apache Thrift: ALv2
> >  * Apache Maven: ALv2
> >  * Apache Avro: ALv2
> >  * Apache Hadoop: ALv2
> >  * Google Guava: ALv2
> >  * Google Protobuf: New BSD License
> >
> > == Cryptography ==
> >
> > We do not expect Parquet to be a controlled export item due to the use of
> > encryption.
> >
> > == Required Resources ==
> >
> > === Mailing lists ===
> >
> >  * private@parquet.incubator.apache.org
> >  * commits@parquet.incubator.apache.org
> >  * dev@parquet.incubator.apache.org
> >
> > == Subversion Directory ==
> >
> > Git is the preferred source control system:
> >
> >  * git://git.apache.org/parquet-format
> >  * git://git.apache.org/parquet-mr
> >
> > == Issue Tracking ==
> >
> > We'd like to keep using the Git review and issue tracking tools,
> > controlling the closing of pull requests through git commit messages in
> > git.apache.org.
> >
> > == Initial Committers ==
> >
> >  * Aniket Mokashi <an...@gmail.com>
> >  * Brock Noland <br...@apache.org>
> >  * Chris Aniszczyk <ca...@gmail.com>
> >  * Dmitriy Ryaboy <dv...@apache.org>
> >  * Jake Farrell <jf...@apache.org>
> >  * Jonathan Coveney <jc...@gmail.com>
> >  * Julien Le Dem <ju...@apache.org>
> >  * Lukas Nalezenec <lu...@gmail.com>
> >  * Marcel Kornacker <ma...@cloudera.com>
> >  * Mickael Lacour
> >  * Nong Li <no...@cloudera.com>
> >  * Remy Pecqueur
> >  * Ryan Blue <bl...@cloudera.com>
> >  * Tianshuo Deng <de...@gmail.com>
> >  * Tom White <to...@apache.org>
> >  * Wesley Peck
> >
> > == Affiliations ==
> >
> >  * Aniket Mokashi - Twitter
> >  * Brock Noland - Cloudera
> >  * Chris Aniszczyk - Twitter
> >  * Dmitriy Ryaboy - Twitter
> >  * Jake Farrell
> >  * Jonathan Coveney - Twitter
> >  * Julien Le Dem - Twitter
> >  * Lukas Nalezenec
> >  * Marcel Kornacker - Cloudera
> >  * Mickael Lacour - Criteo
> >  * Nong Li - Cloudera
> >  * Remy Pecqueur - Criteo
> >  * Ryan Blue - Cloudera
> >  * Tianshuo Deng - Twitter
> >  * Tom White - Cloudera
> >  * Wesley Peck - ARRIS, Inc.
> >
> > == Sponsors ==
> >
> > === Champion ===
> >
> >  * Todd Lipcon
> >
> > === Nominated Mentors ===
> >
> >  * Tom White
> >  * Chris Mattmann
> >  * Jake Farrell
> >  * Roman Shaposhnik
> >
> > === Sponsoring Entity ===
> >
> > The Apache Incubator
> >
> > --
> > Cheers,
> >
> > Chris Aniszczyk
> > http://aniszczyk.org
> > +1 512 961 6719
> >
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>

Re: [VOTE] Accept Parquet into the incubator

Posted by Andrew Purtell <ap...@apache.org>.

+1 (binding)





-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: [VOTE] Accept Parquet into the incubator

Posted by Arvind Prabhakar <ar...@apache.org>.

+1 (binding)

Regards,
Arvind Prabhakar

