Posted to general@incubator.apache.org by Matthew Hayes <mh...@linkedin.com> on 2013/12/18 23:49:24 UTC

[PROPOSAL] DataFu for Incubation

Hi all,

I would like to share our draft ASF incubation proposal for DataFu, a library that makes it easier to solve data problems in Hadoop and high level languages based on it.

The proposal can be found here:

https://wiki.apache.org/incubator/DataFuProposal

The source code is available on GitHub:

https://github.com/linkedin/datafu.

The text of the proposal is copied below.  Feedback is appreciated!

Thanks,
Matt

== Abstract ==

DataFu makes it easier to solve data problems using Hadoop and higher level languages based on it.

== Proposal ==

DataFu provides a collection of Hadoop MapReduce jobs and functions in higher level languages based on it to perform data analysis.  It provides functions for common statistics tasks (e.g. quantiles, sampling), PageRank, stream sessionization, and set and bag operations.  DataFu also provides Hadoop jobs for incremental data processing in MapReduce.
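
To make "stream sessionization" concrete, here is a minimal Java sketch of the general idea. It is not taken from the DataFu source and does not reflect DataFu's API; the class name, method name, and timeout parameter are hypothetical. It assumes events arrive as timestamps sorted in ascending order and starts a new session whenever the gap between consecutive events exceeds a timeout.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not DataFu code or its API.
final class SessionizeSketch {

    /**
     * Assigns a zero-based session index to each event.
     * Assumes sortedTimestamps is in ascending order (milliseconds).
     * A new session starts when the gap to the previous event
     * is strictly greater than timeoutMillis.
     */
    static List<Integer> sessionize(long[] sortedTimestamps, long timeoutMillis) {
        List<Integer> sessionIds = new ArrayList<>(sortedTimestamps.length);
        int session = 0;
        for (int i = 0; i < sortedTimestamps.length; i++) {
            if (i > 0 && sortedTimestamps[i] - sortedTimestamps[i - 1] > timeoutMillis) {
                session++;
            }
            sessionIds.add(session);
        }
        return sessionIds;
    }
}
```

With a ten-minute timeout, events at 0 s, 1 s, and 2 s would share one session, and events at 700 s and 701 s would fall into the next one.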

== Background ==

DataFu began two years ago as a set of UDFs developed internally at LinkedIn, coming from our desire to solve common problems with reusable components.  Recognizing that the community could benefit from such a library, we added documentation, an extensive suite of unit tests, and open sourced the code.  Since then there have been steady contributions to DataFu as we encountered common problems not yet solved by it.  Others outside LinkedIn have contributed as well.  More recently we recognized the challenges with efficient incremental processing of data in Hadoop and have contributed a set of Hadoop MapReduce jobs as a solution.

DataFu began as a project at LinkedIn, but it has also proven useful to other organizations and developers who have faced similar problems.  We would like to share DataFu with the ASF and begin developing a community of developers and users within Apache.

== Rationale ==

There is a strong need for well tested libraries that help developers solve common data problems in Hadoop and higher level languages such as Pig, Hive, Crunch, Scalding, etc.

== Current Status ==

=== Meritocracy ===

Our intent with this incubator proposal is to start building a diverse developer community around DataFu following the Apache meritocracy model.  Since DataFu was initially open sourced in 2011, it has received contributions from both within and outside LinkedIn.  We plan to continue support for new contributors and work with those who contribute significantly to the project to make them committers.

=== Community ===

DataFu has been building a community of developers for two years.  It began with contributors from LinkedIn and has received contributions from developers at Cloudera since very early on.  It has been included in Cloudera’s Hadoop Distribution and Apache Bigtop.  We hope to extend our contributor base significantly and invite all those who are interested in solving large-scale data processing problems to participate.

=== Core Developers ===

DataFu has a strong base of developers at LinkedIn.  Matthew Hayes initiated the project in 2011, and aside from continued contributions to DataFu has also contributed the sub-project Hourglass for incremental MapReduce processing.  Separate from DataFu he has also open sourced the White Elephant project.  Sam Shah contributed a significant portion of the original code and continues to contribute to the project.  William Vaughan has been contributing regularly to DataFu for the past two years.  Evion Kim has been contributing to DataFu for the past year.  Xiangrui Meng recently contributed implementations of scalable sampling algorithms based on research from a paper he published.  Chris Lloyd has provided some important bug fixes and unit tests.  Mitul Tiwari has also contributed to DataFu.  Mathieu Bastian has been developing MapReduce jobs that we hope to include in DataFu.  In addition he also leads the open source Gephi project.

=== Alignment ===

The ASF is the natural choice to host the DataFu project as its goal of encouraging community-driven open-source projects fits with our vision for DataFu.  Additionally, other projects DataFu integrates with, such as Apache Pig and Apache Hadoop, and in the future Apache Hive and Apache Crunch, are hosted by the ASF and we will benefit and provide benefit by close proximity to them.

== Known Risks ==

=== Orphaned Products ===

The core developers have been contributing to DataFu for the past two years.  There is very little risk of DataFu being abandoned given its widespread use within LinkedIn.

=== Inexperience with Open Source ===

DataFu was started as an open source project in 2011 and has remained so for two years.  Matt initiated the project, and additionally is the creator of the open source White Elephant project.  He has also contributed patches to Apache Pig.  Most recently he has released Hourglass as a sub-project of DataFu.  Sam contributed much of the original code and continues to contribute to the project.  Will has been contributing to DataFu since it was first open sourced.  Evion has been contributing for the past year.  Mathieu leads the open source Gephi project.  Jakob has been actively involved with the ASF as a full-time Hadoop committer and PMC member.

=== Homogeneous Developers ===

The current core developers are all from LinkedIn.  DataFu has also received contributions from other corporations such as Cloudera.  Two of these developers are among the Initial Committers listed below.  We hope to establish a developer community that includes contributors from several other corporations and we are actively encouraging new contributors via presentations and blog posts.

=== Reliance on Salaried Developers ===

The current core developers are salaried employees of LinkedIn; however, they are not paid specifically to work on DataFu.  Contributions to DataFu arise from the developers solving problems they encounter in their various projects.  The purpose of DataFu is to share these solutions so that others may benefit and to build a community of developers striving to solve common problems together.  Furthermore, once the project has a community built around it, we expect to get committers, developers and contributions from outside the current core developers.

=== Relationships with Other Apache Products ===

DataFu is deeply integrated with Apache products.  It began as a library of user-defined functions for Apache Pig.  It has grown to also include Hadoop jobs for incremental data processing and in the future will include code for other higher level languages built on top of Apache Hadoop.

=== An Excessive Obsession with the Apache Brand ===

While we respect the reputation of the Apache brand and have no doubts that it will attract contributors and users, our interest is primarily to give DataFu a solid home as an open source project following an established development model.

== Documentation ==

Information on DataFu can be found at:

https://github.com/LinkedIn/DataFu/blob/master/README.md

== Initial Source ==

The initial source is available at:

https://github.com/LinkedIn/DataFu

== Source and Intellectual Property Submission Plan ==

 * The DataFu library source code, available on GitHub.

== External Dependencies ==

The initial source has the following external dependencies that are either included in the final DataFu library or required in order to use it:

 * fastutil (Apache 2.0)
 * joda-time (Apache 2.0)
 * commons-math (Apache 2.0)
 * guava (Apache 2.0)
 * stream (Apache 2.0)
 * jsr-305 (BSD)
 * log4j (Apache 2.0)
 * json (The JSON License)
 * avro (Apache 2.0)

In addition, the following external libraries are used either in building, developing, or testing the project:

 * pig (Apache 2.0)
 * hadoop (Apache 2.0)
 * jline (BSD)
 * antlr (BSD)
 * commons-io (Apache 2.0)
 * testng (Apache 2.0)
 * maven (Apache 2.0)
 * jsr-311 (CDDL-1.0)
 * slf4j (MIT)
 * eclipse (Eclipse Public License 1.0)
 * autojar (GPLv2)
 * jarjar (Apache 2.0)

== Cryptography ==

DataFu has user-defined functions that use MD5 and SHA provided by Java’s java.security.MessageDigest.
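
For readers unfamiliar with this API, the following is a minimal sketch of how a digest can be computed with java.security.MessageDigest. It is not DataFu's UDF code; the class and method names are hypothetical, and the choice of UTF-8 encoding and lowercase hex output is an assumption made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: a hypothetical helper, not DataFu code or its API.
final class DigestSketch {

    /**
     * Returns the lowercase hex digest of the UTF-8 bytes of input,
     * using a standard algorithm name such as "MD5", "SHA-1", or "SHA-256".
     */
    static String hexDigest(String algorithm, String input) {
        final MessageDigest md;
        try {
            md = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            // Rethrow unchecked so callers need not declare the checked exception.
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        }
        byte[] hash = md.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            // Mask to 0..255 so negative bytes format as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

For example, hexDigest("MD5", "hello") yields 5d41402abc4b2a76b9719d911017c592. Only the JDK's built-in digest implementations are used in this sketch.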

== Required Resources ==

=== Mailing Lists ===

 * DataFu-private for private PMC discussions (with moderated subscriptions)
 * DataFu-dev
 * DataFu-commits

=== Subversion Directory ===

Git is the preferred source control system: git://git.apache.org/DataFu

=== Issue Tracking ===

JIRA DataFu (DataFu)

=== Other Resources ===

The existing code already has unit tests, so we would like a Hudson instance to run them whenever a new patch is submitted. This can be added after project creation.

== Initial Committers ==

 * Matthew Hayes
 * William Vaughan
 * Evion Kim
 * Sam Shah
 * Xiangrui Meng
 * Christopher Lloyd
 * Mathieu Bastian
 * Mitul Tiwari
 * Josh Wills
 * Jarek Jarcec Cecho

== Affiliations ==

 * Matthew Hayes (LinkedIn)
 * William Vaughan (LinkedIn)
 * Evion Kim (LinkedIn)
 * Sam Shah (LinkedIn)
 * Xiangrui Meng (LinkedIn)
 * Christopher Lloyd (LinkedIn)
 * Mathieu Bastian (LinkedIn)
 * Mitul Tiwari (LinkedIn)
 * Josh Wills (Cloudera)
 * Jarek Jarcec Cecho (Cloudera)

== Sponsors ==

=== Champion ===

Jakob Homan (Apache Member)

=== Nominated Mentors ===

 * Ashutosh Chauhan <hashutosh at apache dot org>
 * Roman Shaposhnik <rvs at apache dot org>
 * Ted Dunning <tdunning at apache dot org>

=== Sponsoring Entity ===

We are requesting the Incubator to sponsor this project.


Re: [PROPOSAL] DataFu for Incubation

Posted by Roman Shaposhnik <rv...@apache.org>.
On Wed, Dec 18, 2013 at 3:57 PM, sebb <se...@gmail.com> wrote:
> On 18 December 2013 22:49, Matthew Hayes <mh...@linkedin.com> wrote:
>> Hi all,
>>
>> I would like to share our draft ASF incubation proposal for DataFu, a library that makes it easier to solve data problems in Hadoop and high level languages based on it.
>
> Am I the only person to think that the last part of the name has
> unfortunate connotations?
> c.f. SNAFU which has the same last two characters.

Personally, I see it as playful and not crossing any lines.

Thanks,
Roman.

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Re: [PROPOSAL] DataFu for Incubation

Posted by sebb <se...@gmail.com>.
OK, but the first association that came to my mind was SNAFU - perhaps
because it shares the last 3 letters with DataFu.

I just thought you ought to be aware that the name could have negative
connotations for some people.

On 19 December 2013 00:16, Matthew Hayes <mh...@linkedin.com> wrote:
> When we came up with the name a couple years ago, it was inspired by "kung fu", in a playful way as Roman mentioned.  Sort of like saying your Java Fu or Python Fu is excellent.
>
> -Matt


Re: [PROPOSAL] DataFu for Incubation

Posted by Jakob Homan <jg...@gmail.com>.
This proposal has been up for nearly two weeks, long enough to accommodate
the holiday lull.  We can be extra careful during the podling name search
to address Sebb's concerns.  I'll call a vote for the incubation.
-Jakob


On Wed, Dec 18, 2013 at 6:01 PM, Craig L Russell <cr...@oracle.com> wrote:

> Just a bystander...
>
> I don't associate any negatives with "fu" or "foo". Now, "F... yoU" does,
> but that's not an issue here, is it???
>
> I've seen lots of e.g.  "I don't have enough xxx-fu to comment".
>
> Your fu rocks.
>
> Craig
>
> On Dec 18, 2013, at 4:16 PM, Matthew Hayes wrote:
>
> > When we came up with the name a couple years ago, it was inspired by
> "kung fu", in a playful way as Roman mentioned.  Sort of like saying your
> Java Fu or Python Fu is excellent.
> >
> > -Matt
> > ________________________________________
> > From: sebb [sebbaz@gmail.com]
> > Sent: Wednesday, December 18, 2013 3:57 PM
> > To: general@incubator.apache.org
> > Subject: Re: [PROPOSAL] DataFu for Incubation
> >
> > On 18 December 2013 22:49, Matthew Hayes <mh...@linkedin.com> wrote:
> >> Hi all,
> >>
> >> I would like to share our draft ASF incubation proposal for DataFu, a
> library that makes it easier to solve data problems in Hadoop and high
> level languages based on it.
> >
> > I am the only person to think that the last part of the name has
> > unfortunate connotations?
> > c.f. SNAFU which has the same last two characters.
> >
> >> The proposal can be found here:
> >>
> >> https://wiki.apache.org/incubator/DataFuProposal
> >>
> >> The source code is available on GitHub:
> >>
> >> https://github.com/linkedin/datafu.
> >>
> >> The text of the proposal is copied below.  Feedback is appreciated!
> >>
> >> Thanks,
> >> Matt
> >>
> >> == Abstract ==
> >>
> >> Data``Fu makes it easier to solve data problems using Hadoop and higher
> level languages based on it.
> >>
> >> == Proposal ==
> >>
> >> Data``Fu provides a collection of Hadoop Map``Reduce jobs and functions
> in higher level languages based on it to perform data analysis.  It
> provides functions for common statistics tasks (e.g. quantiles, sampling),
> Page``Rank, stream sessionization, and set and bag operations.  Data``Fu
> also provides Hadoop jobs for incremental data processing in Map``Reduce.
> >>
> >> == Background ==
> >>
> >> Data``Fu began two years ago as set of UDFs developed internally at
> Linked``In, coming from our desire to solve common problems with reusable
> components.  Recognizing that the community could benefit from such a
> library, we added documentation, an extensive suite of unit tests, and open
> sourced the code.  Since then there have been steady contributions to
> Data``Fu as we encountered common problems not yet solved by it.  Others
> outside Linked``In have contributed as well.  More recently we recognized
> the challenges with efficient incremental processing of data in Hadoop and
> have contributed a set of Hadoop Map``Reduce jobs as a solution.
> >>
> >> Data``Fu began as a project at Linked``In, but it has shown itself to
> be useful to other organizations and developers as well as they have faced
> similar problems.  We would like to share Data``Fu with the ASF and begin
> developing a community of developers and users within Apache.
> >>
> >> == Rationale ==
> >>
> >> There is a strong need for well tested libraries that help developers
> solve common data problems in Hadoop and higher level languages such as
> Pig, Hive, Crunch, Scalding, etc.
> >>
> >> == Current Status ==
> >>
> >> === Meritocracy ===
> >>
> >> Our intent with this incubator proposal is to start building a diverse
> developer community around Data``Fu following the Apache meritocracy model.
>  Since Data``Fu was initially open sourced in 2011, it has received
> contributions from both within and outside Linked``In.  We plan to continue
> support for new contributors and work with those who contribute
> significantly to the project to make them committers.
> >>
> >> === Community ===
> >>
> >> Data``Fu has been building a community of developers for two years.  It
> began with contributors from Linked``In and has received contributions from
> developers at Cloudera since very early on.  It has been included included
> in Cloudera’s Hadoop Distribution and Apache Bigtop.  We hope to extend our
> contributor base significantly and invite all those who are interested in
> solving large-scale data processing problems to participate.
> >>
> >> === Core Developers ===
> >>
> >> Data``Fu has a strong base of developers at Linked``In.  Matthew Hayes
> initiated the project in 2011, and aside from continued contributions to
> Data``Fu has also contributed the sub-project Hourglass for incremental
> Map``Reduce processing.  Separate from Data``Fu he has also open sourced
> the White Elephant project.  Sam Shah contributed a significant portion of
> the original code and continues to contribute to the project.  William
> Vaughan has been contributing regularly to Data``Fu for the past two years.
>  Evion Kim has been contributing to Data``Fu for the past year.  Xiangrui
> Meng recently contributed implementations of scalable sampling algorithms
> based on research from a paper he published.  Chris Lloyd has provided some
> important bug fixes and unit tests.  Mitul Tiwari has also contributed to
> Data``Fu.  Mathieu Bastian has been developing Map``Reduce jobs that we
> hope to include in Data``Fu.  In addition he also leads the open source
> Gephi project.
> >>
> >> === Alignment ===
> >>
> >> The ASF is the natural choice to host the Data``Fu project as its goal
> of encouraging community-driven open-source projects fits with our vision
> for Data``Fu.  Additionally, other projects Data``Fu integrates with, such
> as Apache Pig and Apache Hadoop, and in the future Apache Hive and Apache
> Crunch, are hosted by the ASF and we will benefit and provide benefit by
> close proximity to them.
> >>
> >> == Known Risks ==
> >>
> >> === Orphaned Products ===
> >>
> >> The core developers have been contributing to Data``Fu for the past two
> years.  There is very little risk of Data``Fu being abandoned given its
> widespread use within Linked``In.
> >>
> >> === Inexperience with Open Source ===
> >>
> >> Data``Fu was started as an open source project in 2011 and has remained
> so for two years.  Matt initiated the project, and additionally is the
> creator of the open source White Elephant project.  He has also contributed
> patches to Apache Pig.  Most recently he has released Hourglass as a
> sub-project of Data``Fu.  Sam contributed much of the original code and
> continues to contribute to the project.  Will has been contributing to
> Data``Fu since it was first open sourced.  Evion has been contributing for
> the past year.  Mathieu leads the open source Gephi project.  Jakob has
> been actively involved with the ASF as a full-time Hadoop committer and PMC
> member.
> >>
> >> === Homogeneous Developers ===
> >>
> >> The current core developers are all from Linked``In.  Data``Fu has also
> received contributions from other corporations such as Cloudera.  Two of
> these developers are among the Initial Committers listed below.  We hope to
> establish a developer community that includes contributors from several
> other corporations and we are actively encouraging new contributors via
> presentations and blog posts.
> >>
> >> === Reliance on Salaried Developers ===
> >>
> >> The current core developers are salaried employees of Linked``In,
> however they are not paid specifically to work on Data``Fu.  Contributions
> to Data``Fu arise from the developers solving problems they encounter in
> their various projects.  The purpose of Data``Fu is to share these
> solutions so that others may benefit and build a community of developers
> striving to solve common problems together.  Furthermore, once the project
> has a community built around it, we expect to get committers, developers
> and contributions from outside the current core developers.
> >>
> >> === Relationships with Other Apache Products ===
> >>
> >> Data``Fu is deeply integrated with Apache products.  It began as a
> library of user-defined functions for Apache Pig.  It has grown to also
> include Hadoop jobs for incremental data processing and in the future will
> include code for other higher level languages built on top of Apache Hadoop.
> >>
> >> === An Excessive Obsession with the Apache Brand ===
> >>
> >> While we respect the reputation of the Apache brand and have no doubts
> that it will attract contributors and users, our interest is primarily to
> give Data``Fu a solid home as an open source project following an
> established development model.
> >>
> >> == Documentation ==
> >>
> >> Information on Data``Fu can be found at:
> >>
> >> https://github.com/LinkedIn/DataFu/blob/master/README.md
> >>
> >> == Initial Source ==
> >>
> >> The initial source is available at:
> >>
> >> https://github.com/LinkedIn/DataFu
> >>
> >> == Source and Intellectual Property Submission Plan ==
> >>
> >> * The Data``Fu library source code, available on Git``Hub.
> >>
> >> == External Dependencies ==
> >>
> >> The initial source has the following external dependencies that are
> either included in the final Data``Fu library or required in order to use
> it:
> >>
> >> * fastutil (Apache 2.0)
> >> * joda-time (Apache 2.0)
> >> * commons-math (Apache 2.0)
> >> * guava (Apache 2.0)
> >> * stream (Apache 2.0)
> >> * jsr-305 (BSD)
> >> * log4j (Apache 2.0)
> >> * json (The JSON License)
> >> * avro (Apache 2.0)
> >>
> >> In addition, the following external libraries are used either in
> building, developing, or testing the project:
> >>
> >> * pig (Apache 2.0)
> >> * hadoop (Apache 2.0)
> >> * jline (BSD)
> >> * antlr (BSD)
> >> * commons-io (Apache 2.0)
> >> * testng (Apache 2.0)
> >> * maven (Apache 2.0)
> >> * jsr-311 (CDDL-1.0)
> >> * slf4j (MIT)
> >> * eclipse (Eclipse Public License 1.0)
> >> * autojar (GPLv2)
> >> * jarjar (Apache 2.0)
> >>
> >> == Cryptography ==
> >>
> >> Data``Fu has user-defined functions that use MD5 and SHA provided by
> Java’s java.security.Message``Digest.
> >>
> >> == Required Resources ==
> >>
> >> === Mailing Lists ===
> >>
> >> * Data``Fu-private for private PMC discussions (with moderated subscriptions)
> >> * Data``Fu-dev
> >> * Data``Fu-commits
> >>
> >> === Subversion Directory ===
> >>
> >> Git is the preferred source control system: git://git.apache.org/DataFu
> >>
> >> === Issue Tracking ===
> >>
> >> JIRA Data``Fu (Data``Fu)
> >>
> >> === Other Resources ===
> >>
> >> The existing code already has unit tests, so we would like a Hudson
> instance to run them whenever a new patch is submitted. This can be added
> after project creation.
> >>
> >> == Initial Committers ==
> >>
> >> * Matthew Hayes
> >> * William Vaughan
> >> * Evion Kim
> >> * Sam Shah
> >> * Xiangrui Meng
> >> * Christopher Lloyd
> >> * Mathieu Bastian
> >> * Mitul Tiwari
> >> * Josh Wills
> >> * Jarek Jarcec Cecho
> >>
> >> == Affiliations ==
> >>
> >> * Matthew Hayes (Linked``In)
> >> * William Vaughan (Linked``In)
> >> * Evion Kim (Linked``In)
> >> * Sam Shah (Linked``In)
> >> * Xiangrui Meng (Linked``In)
> >> * Christopher Lloyd (Linked``In)
> >> * Mathieu Bastian (Linked``In)
> >> * Mitul Tiwari (Linked``In)
> >> * Josh Wills (Cloudera)
> >> * Jarek Jarcec Cecho (Cloudera)
> >>
> >> == Sponsors ==
> >>
> >> === Champion ===
> >>
> >> Jakob Homan (Apache Member)
> >>
> >> === Nominated Mentors ===
> >>
> >> * Ashutosh Chauhan <hashutosh at apache dot org>
> >> * Roman Shaposhnik <rvs at apache dot org>
> >> * Ted Dunning <tdunning at apache dot org>
> >>
> >> === Sponsoring Entity ===
> >>
> >> We are requesting the Incubator to sponsor this project.
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > For additional commands, e-mail: general-help@incubator.apache.org
> >
>
> Craig L Russell
> Architect, Oracle
> http://db.apache.org/jdo
> 408 276-5638 mailto:Craig.Russell@oracle.com
> P.S. A good JDO? O, Gasp!
>
>

Re: [PROPOSAL] DataFu for Incubation

Posted by Craig L Russell <cr...@oracle.com>.
Just a bystander...

I don't associate any negatives with "fu" or "foo". Now, "F... yoU" does, but that's not an issue here, is it???

I've seen lots of e.g.  "I don't have enough xxx-fu to comment". 

Your fu rocks.

Craig

On Dec 18, 2013, at 4:16 PM, Matthew Hayes wrote:

> When we came up with the name a couple years ago, it was inspired by "kung fu", in a playful way as Roman mentioned.  Sort of like saying your Java Fu or Python Fu is excellent.
> 
> -Matt
> ________________________________________
> From: sebb [sebbaz@gmail.com]
> Sent: Wednesday, December 18, 2013 3:57 PM
> To: general@incubator.apache.org
> Subject: Re: [PROPOSAL] DataFu for Incubation
> 
> On 18 December 2013 22:49, Matthew Hayes <mh...@linkedin.com> wrote:
>> Hi all,
>> 
>> I would like to share our draft ASF incubation proposal for DataFu, a library that makes it easier to solve data problems in Hadoop and high level languages based on it.
> 
> Am I the only person to think that the last part of the name has
> unfortunate connotations?
> cf. SNAFU, which has the same last two characters.

Craig L Russell
Architect, Oracle
http://db.apache.org/jdo
408 276-5638 mailto:Craig.Russell@oracle.com
P.S. A good JDO? O, Gasp!



RE: [PROPOSAL] DataFu for Incubation

Posted by Matthew Hayes <mh...@linkedin.com>.
When we came up with the name a couple years ago, it was inspired by "kung fu", in a playful way as Roman mentioned.  Sort of like saying your Java Fu or Python Fu is excellent.

-Matt
________________________________________
From: sebb [sebbaz@gmail.com]
Sent: Wednesday, December 18, 2013 3:57 PM
To: general@incubator.apache.org
Subject: Re: [PROPOSAL] DataFu for Incubation

On 18 December 2013 22:49, Matthew Hayes <mh...@linkedin.com> wrote:
> Hi all,
>
> I would like to share our draft ASF incubation proposal for DataFu, a library that makes it easier to solve data problems in Hadoop and high level languages based on it.

Am I the only person to think that the last part of the name has
unfortunate connotations?
cf. SNAFU, which has the same last two characters.




Re: [PROPOSAL] DataFu for Incubation

Posted by sebb <se...@gmail.com>.
On 18 December 2013 22:49, Matthew Hayes <mh...@linkedin.com> wrote:
> Hi all,
>
> I would like to share our draft ASF incubation proposal for DataFu, a library that makes it easier to solve data problems in Hadoop and high level languages based on it.

Am I the only person to think that the last part of the name has
unfortunate connotations?
cf. SNAFU, which has the same last two characters.

> The proposal can be found here:
>
> https://wiki.apache.org/incubator/DataFuProposal
>
> The source code is available on GitHub:
>
> https://github.com/linkedin/datafu.
>
> The text of the proposal is copied below.  Feedback is appreciated!
>
> Thanks,
> Matt
>
> == Abstract ==
>
> Data``Fu makes it easier to solve data problems using Hadoop and higher level languages based on it.
>
> == Proposal ==
>
> Data``Fu provides a collection of Hadoop Map``Reduce jobs and functions in higher level languages based on it to perform data analysis.  It provides functions for common statistics tasks (e.g. quantiles, sampling), Page``Rank, stream sessionization, and set and bag operations.  Data``Fu also provides Hadoop jobs for incremental data processing in Map``Reduce.
>
> == Background ==
>
> Data``Fu began two years ago as a set of UDFs developed internally at Linked``In, coming from our desire to solve common problems with reusable components.  Recognizing that the community could benefit from such a library, we added documentation, an extensive suite of unit tests, and open sourced the code.  Since then there have been steady contributions to Data``Fu as we encountered common problems not yet solved by it.  Others outside Linked``In have contributed as well.  More recently we recognized the challenges with efficient incremental processing of data in Hadoop and have contributed a set of Hadoop Map``Reduce jobs as a solution.
>
> Data``Fu began as a project at Linked``In, but it has shown itself to be useful to other organizations and developers as well as they have faced similar problems.  We would like to share Data``Fu with the ASF and begin developing a community of developers and users within Apache.
>
> == Rationale ==
>
> There is a strong need for well tested libraries that help developers solve common data problems in Hadoop and higher level languages such as Pig, Hive, Crunch, Scalding, etc.
>
> == Current Status ==
>
> === Meritocracy ===
>
> Our intent with this incubator proposal is to start building a diverse developer community around Data``Fu following the Apache meritocracy model.  Since Data``Fu was initially open sourced in 2011, it has received contributions from both within and outside Linked``In.  We plan to continue support for new contributors and work with those who contribute significantly to the project to make them committers.
>
> === Community ===
>
> Data``Fu has been building a community of developers for two years.  It began with contributors from Linked``In and has received contributions from developers at Cloudera since very early on.  It has been included in Cloudera’s Hadoop Distribution and Apache Bigtop.  We hope to extend our contributor base significantly and invite all those who are interested in solving large-scale data processing problems to participate.
>
> === Core Developers ===
>
> Data``Fu has a strong base of developers at Linked``In.  Matthew Hayes initiated the project in 2011, and aside from continued contributions to Data``Fu has also contributed the sub-project Hourglass for incremental Map``Reduce processing.  Separate from Data``Fu he has also open sourced the White Elephant project.  Sam Shah contributed a significant portion of the original code and continues to contribute to the project.  William Vaughan has been contributing regularly to Data``Fu for the past two years.  Evion Kim has been contributing to Data``Fu for the past year.  Xiangrui Meng recently contributed implementations of scalable sampling algorithms based on research from a paper he published.  Chris Lloyd has provided some important bug fixes and unit tests.  Mitul Tiwari has also contributed to Data``Fu.  Mathieu Bastian has been developing Map``Reduce jobs that we hope to include in Data``Fu.  In addition he also leads the open source Gephi project.
>
> === Alignment ===
>
> The ASF is the natural choice to host the Data``Fu project as its goal of encouraging community-driven open-source projects fits with our vision for Data``Fu.  Additionally, other projects Data``Fu integrates with, such as Apache Pig and Apache Hadoop, and in the future Apache Hive and Apache Crunch, are hosted by the ASF and we will benefit and provide benefit by close proximity to them.
>
> == Known Risks ==
>
> === Orphaned Products ===
>
> The core developers have been contributing to Data``Fu for the past two years.  There is very little risk of Data``Fu being abandoned given its widespread use within Linked``In.
>
> === Inexperience with Open Source ===
>
> Data``Fu was started as an open source project in 2011 and has remained so for two years.  Matt initiated the project, and additionally is the creator of the open source White Elephant project.  He has also contributed patches to Apache Pig.  Most recently he has released Hourglass as a sub-project of Data``Fu.  Sam contributed much of the original code and continues to contribute to the project.  Will has been contributing to Data``Fu since it was first open sourced.  Evion has been contributing for the past year.  Mathieu leads the open source Gephi project.  Jakob has been actively involved with the ASF as a full-time Hadoop committer and PMC member.
>
> === Homogeneous Developers ===
>
> The current core developers are all from Linked``In.  Data``Fu has also received contributions from other corporations such as Cloudera.  Two of these developers are among the Initial Committers listed below.  We hope to establish a developer community that includes contributors from several other corporations and we are actively encouraging new contributors via presentations and blog posts.
>
> === Reliance on Salaried Developers ===
>
> The current core developers are salaried employees of Linked``In; however, they are not paid specifically to work on Data``Fu.  Contributions to Data``Fu arise from the developers solving problems they encounter in their various projects.  The purpose of Data``Fu is to share these solutions so that others may benefit and build a community of developers striving to solve common problems together.  Furthermore, once the project has a community built around it, we expect to get committers, developers and contributions from outside the current core developers.
>
> === Relationships with Other Apache Products ===
>
> Data``Fu is deeply integrated with Apache products.  It began as a library of user-defined functions for Apache Pig.  It has grown to also include Hadoop jobs for incremental data processing and in the future will include code for other higher level languages built on top of Apache Hadoop.
>
> === An Excessive Obsession with the Apache Brand ===
>
> While we respect the reputation of the Apache brand and have no doubts that it will attract contributors and users, our interest is primarily to give Data``Fu a solid home as an open source project following an established development model.
>
> == Documentation ==
>
> Information on Data``Fu can be found at:
>
> https://github.com/LinkedIn/DataFu/blob/master/README.md
>
> == Initial Source ==
>
> The initial source is available at:
>
> https://github.com/LinkedIn/DataFu
>
> == Source and Intellectual Property Submission Plan ==
>
>  * The Data``Fu library source code, available on Git``Hub.
>
> == External Dependencies ==
>
> The initial source has the following external dependencies that are either included in the final Data``Fu library or required in order to use it:
>
>  * fastutil (Apache 2.0)
>  * joda-time (Apache 2.0)
>  * commons-math (Apache 2.0)
>  * guava (Apache 2.0)
>  * stream (Apache 2.0)
>  * jsr-305 (BSD)
>  * log4j (Apache 2.0)
>  * json (The JSON License)
>  * avro (Apache 2.0)
>
> In addition, the following external libraries are used either in building, developing, or testing the project:
>
>  * pig (Apache 2.0)
>  * hadoop (Apache 2.0)
>  * jline (BSD)
>  * antlr (BSD)
>  * commons-io (Apache 2.0)
>  * testng (Apache 2.0)
>  * maven (Apache 2.0)
>  * jsr-311 (CDDL-1.0)
>  * slf4j (MIT)
>  * eclipse (Eclipse Public License 1.0)
>  * autojar (GPLv2)
>  * jarjar (Apache 2.0)
>
> == Cryptography ==
>
> Data``Fu has user-defined functions that use MD5 and SHA provided by Java’s java.security.Message``Digest.
>
> == Required Resources ==
>
> === Mailing Lists ===
>
>  * Data``Fu-private for private PMC discussions (with moderated subscriptions)
>  * Data``Fu-dev
>  * Data``Fu-commits
>
> === Subversion Directory ===
>
> Git is the preferred source control system: git://git.apache.org/DataFu
>
> === Issue Tracking ===
>
> JIRA Data``Fu (Data``Fu)
>
> === Other Resources ===
>
> The existing code already has unit tests, so we would like a Hudson instance to run them whenever a new patch is submitted. This can be added after project creation.
>
> == Initial Committers ==
>
>  * Matthew Hayes
>  * William Vaughan
>  * Evion Kim
>  * Sam Shah
>  * Xiangrui Meng
>  * Christopher Lloyd
>  * Mathieu Bastian
>  * Mitul Tiwari
>  * Josh Wills
>  * Jarek Jarcec Cecho
>
> == Affiliations ==
>
>  * Matthew Hayes (Linked``In)
>  * William Vaughan (Linked``In)
>  * Evion Kim (Linked``In)
>  * Sam Shah (Linked``In)
>  * Xiangrui Meng (Linked``In)
>  * Christopher Lloyd (Linked``In)
>  * Mathieu Bastian (Linked``In)
>  * Mitul Tiwari (Linked``In)
>  * Josh Wills (Cloudera)
>  * Jarek Jarcec Cecho (Cloudera)
>
> == Sponsors ==
>
> === Champion ===
>
> Jakob Homan (Apache Member)
>
> === Nominated Mentors ===
>
>  * Ashutosh Chauhan <hashutosh at apache dot org>
>  * Roman Shaposhnik <rvs at apache dot org>
>  * Ted Dunning <tdunning at apache dot org>
>
> === Sponsoring Entity ===
>
> We are requesting the Incubator to sponsor this project.
>
