Posted to general@incubator.apache.org by Mohammad Islam <mi...@yahoo.com> on 2011/06/29 21:10:26 UTC

[VOTE] Oozie to join the Incubator

Hi All,

The discussion about the Oozie proposal has settled down. Therefore I would like to 
initiate a vote to accept Oozie as an Apache Incubator project.

The latest proposal is pasted at the end, and it can also be found on the wiki:
 
http://wiki.apache.org/incubator/OozieProposal


The related discussion thread is at:
http://www.mail-archive.com/general@incubator.apache.org/msg29633.html


Please cast your votes:

[  ] +1 Accept Oozie for incubation
[  ] +0 Indifferent to Oozie incubation
[  ] -1 Reject Oozie for incubation

This vote will close 72 hours  from now.

Regards,
Mohammad


Abstract
Oozie is a server-based workflow scheduling and coordination system to manage 
data processing jobs for Apache Hadoop™. 

Proposal
Oozie is an  extensible, scalable and reliable system to define, manage, 
schedule,  and execute complex Hadoop workloads via web services. More  
specifically, this includes: 

	* XML-based declarative framework to specify a job or a complex workflow of 
dependent jobs. 
	* Support for different job types such as Hadoop Map-Reduce, Pipes, Streaming, 
Pig, Hive, and custom Java applications. 
	* Workflow scheduling based on frequency and/or data availability. 
	* Monitoring capability, automatic retry, and failure handling of jobs. 
	* Extensible and pluggable architecture to allow arbitrary grid programming 
paradigms. 
	* Authentication, authorization, and capacity-aware load throttling to allow 
multi-tenant software as a service. 

Background
Most data  processing applications require multiple jobs to achieve their goals,  
with inherent dependencies among the jobs. A dependency could be  sequential, 
where one job can only start after another job has finished.  Or it could be 
conditional, where the execution of a job depends on the  return value or status 
of another job. In other cases, parallel  execution of multiple jobs may be 
permitted – or desired – to exploit  the massive pool of compute nodes provided 
by Hadoop. 

These  job dependencies are often expressed as a Directed Acyclic Graph, also  
called a workflow. A node in the workflow is typically a job (a  computation on 
the grid) or another type of action such as an email notification. Computations 
can be expressed in map/reduce, Pig, Hive or  any other programming paradigm 
available on the grid. Edges of the graph  represent transitions from one node 
to the next, as the execution of a  workflow proceeds. 
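
As a rough illustration of the graph model described above (not Oozie's own 
implementation or syntax), the following Python sketch orders a small workflow 
by its dependencies. The node names and the execution_waves helper are 
hypothetical; the ordering itself comes from the standard-library graphlib 
module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a node, and each value is the set of
# nodes that must finish before that node may start.
WORKFLOW = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate_by_user": {"clean"},
    "aggregate_by_page": {"clean"},
    "notify_by_email": {"aggregate_by_user", "aggregate_by_page"},
}


def execution_waves(workflow):
    """Group nodes into waves; nodes within one wave do not depend on each
    other, so they could run in parallel."""
    sorter = TopologicalSorter(workflow)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(WORKFLOW), start=1):
        print(number, wave)
```

The two aggregation nodes land in the same wave, which corresponds to the 
parallel execution mentioned earlier; the email notification runs only after 
both have finished.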

Describing  a workflow in a declarative way has the advantage of decoupling job  
dependencies and execution control from application logic. Furthermore,  the 
workflow is modularized into jobs that can be reused within the same  workflow 
or across different workflows. Execution of the workflow is  then driven by a 
runtime system without understanding the application  logic of the jobs. This 
runtime system specializes in reliable and  predictable execution: It can retry 
actions that have failed or invoke a  cleanup action after termination of the 
workflow; it can monitor  progress, success, or failure of a workflow, and send 
appropriate alerts  to an administrator. The application developer is relieved 
from  implementing these generic procedures. 
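
The retry and cleanup behaviour described above can be sketched generically. 
This is only an illustration under stated assumptions: run_action, its 
max_attempts parameter, and the cleanup callback are hypothetical names, not 
part of Oozie.

```python
def run_action(action, max_attempts=3, cleanup=None):
    """Call `action` until it succeeds or `max_attempts` is exhausted.

    `cleanup`, if given, runs exactly once afterwards, whether the action
    finally succeeded or failed.
    """
    last_error = None
    try:
        for _ in range(max_attempts):
            try:
                return action()
            except Exception as error:  # any failure triggers a retry
                last_error = error
        raise RuntimeError(
            f"action failed after {max_attempts} attempts"
        ) from last_error
    finally:
        if cleanup is not None:
            cleanup()
```

Because the retry loop lives in the runtime rather than in each job, the job 
code itself stays free of this generic error handling.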

Furthermore,  some applications or workflows need to run in periodic intervals 
or  when dependent data is available. For example, a workflow could be  executed 
every day as soon as output data from the previous 24 instances  of another, 
hourly workflow is available. The workflow coordinator  provides such scheduling 
features, along with prioritization, load  balancing and throttling to optimize 
utilization of resources in the  cluster. This makes it easier to maintain, 
control, and coordinate  complex data applications. 
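
The daily-after-24-hourly example above amounts to a data-availability check. 
A minimal Python sketch follows, assuming each hourly instance is identified 
by its start time and that the set of completed instances is known; the 
function names are hypothetical and this is not the Oozie coordinator API.

```python
from datetime import datetime, timedelta


def required_hourly_instances(nominal_time, count=24):
    """Return the start times of the `count` hourly instances that
    immediately precede `nominal_time`, oldest first."""
    return [nominal_time - timedelta(hours=h) for h in range(count, 0, -1)]


def is_ready(nominal_time, available):
    """True when every required hourly instance has output available."""
    return all(t in available for t in required_hourly_instances(nominal_time))


if __name__ == "__main__":
    run_at = datetime(2011, 6, 30, 0, 0)
    done = {datetime(2011, 6, 29, hour, 0) for hour in range(24)}
    print(is_ready(run_at, done))
```

A scheduler built this way would re-evaluate is_ready as new hourly outputs 
arrive and start the daily workflow the first time it returns True.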

Nearly  three years ago, a team of Yahoo! developers addressed these critical  
requirements for Hadoop-based data processing systems by developing a  new 
workflow management and scheduling system called Oozie. While it was  initially 
developed as a Yahoo!-internal project, it was designed and  implemented with 
the intention of open-sourcing. Oozie was released as a GitHub project in early 
2010. Oozie is used in production within Yahoo!, and since it was 
open-sourced it has been gaining adoption among external developers. 
Rationale
Commonly,  applications that run on Hadoop require multiple Hadoop jobs in order 
to  obtain the desired results. Furthermore, these Hadoop jobs are commonly  a 
combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes  
map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs  and shell 
scripts. 

Because  of this, developers find themselves writing ad-hoc glue programs to  
combine these Hadoop jobs. These ad-hoc programs are difficult to  schedule, 
manage, monitor and recover. 

Workflow  management and scheduling is an essential feature for large-scale data  
processing applications. Each application could build its own customized solution, 
but that would incur separate development, operational, and maintenance overhead. 
Since it is a prevalent use-case for data  processing, the application developer 
would surely prefer a generalized  solution with little or no such overhead. 
Oozie addresses the challenge  by providing an execution framework to flexibly 
specify the job  dependency, data dependency, and time dependency. In addition, 
Oozie  provides a multi-tenant-based centralized service and the opportunity to  
optimize load and utilization while respecting SLAs. 

Oozie is built on Apache Hadoop™ to schedule jobs related to various Apache 
projects such as Hadoop, Pig, and Hive. As an Apache open source project, Oozie 
is expected to attract the larger and more diversified community that currently 
uses such Apache-sponsored projects. Additionally, users of the Hadoop 
ecosystem can influence Oozie’s roadmap and contribute to it. Likewise, Oozie, 
as part of the Apache Hadoop™ ecosystem, will be a great benefit to the current 
Hadoop/Pig/Hive/HBase/HCatalog community. 

Current Status
Meritocracy
Oozie is currently a GitHub-based open source project where developers from 
multiple companies are contributing to the project. Our intent with this  
incubator proposal is to further extend this diverse developer  community around 
Oozie following the Apache meritocracy model. We plan  to continue to provide 
adequate support to new developers and to quickly  recruit those who make solid 
contributions to committer status. In  addition, Oozie will expect, accept, and 
work to attract contributions  from amateurs as well. 

Community
While an  efficient workflow management and scheduling system is critical for  
large companies with huge data processing in multi-tenant clusters, it  is 
equally necessary for any non-trivial deployment. Different companies  are 
currently using Oozie as a workflow scheduler for Hadoop-based data  processing. 
At Yahoo! it is being used extensively in production  clusters to process 
thousands of jobs. Like the Oozie user community, the Oozie developer community 
is also very strong. Developers from Yahoo!  provided the initial code base, and 
they are still the most active  contributors. In late 2010, developers from 
Cloudera also started  contributing, and currently other companies (e.g., IBM) 
are beginning to  participate. 

We currently use JIRA for issue tracking, GitHub for code hosting, and Yahoo! 
Groups for developer and user communications. 

Core Developers
Oozie is  currently being designed and developed by four engineers from Yahoo! –  
Mohammad Islam, Angelo Huang, Mayank Bansal, and Andreas Neumann. In  addition, 
many outside contributors are actively contributing in design  and development. 
Among them, Alejandro Abdelnur from Cloudera and Chao  Wang from IBM are very 
important contributors. All of these core  developers have deep expertise in 
Hadoop and the Hadoop Ecosystem in  general. 

Alignment
The ASF is a  natural host for Oozie given that it is already the home of 
Hadoop,  Pig, Hive, and other emerging cloud software projects. Oozie was  
designed to support Hadoop from the beginning in order to solve data  processing 
challenges in Hadoop clusters. Oozie complements the existing  Apache cloud 
computing projects by providing a flexible framework for  managing complex data 
processing tasks. 

Known Risks
Orphaned Products
The core  developers plan to work full time on the project. There is very little  
risk of Oozie getting orphaned since large companies like Yahoo! are  
extensively using it on their production Hadoop clusters. For example,  there 
are nearly 400 Yahoo! internal Oozie users and thousands of jobs  are processed 
hourly through Oozie in production. In addition, there are  nearly 400 active 
users (including Yahoo! internal and external) in the  email community where 
nearly 15 emails are exchanged per day.  Furthermore, there were more than 1500 
downloads of the Oozie binary in  the last eight months from the github site and 
a large number of  downloads were conducted by other companies such as Cloudera. 
Oozie has  three major releases and more than 15 patch releases in the last 
couple  of years which further demonstrates Oozie as a very active project. We  
plan to extend and diversify this community further through Apache. 

Inexperience with Open Source
The core  developers are all active users and followers of open source. They are  
already committers and contributors to the Oozie GitHub project. In addition, 
they are very familiar with Apache principles and philosophy for 
community-driven software development. 

Homogeneous Developers
The core developers are from Yahoo! as well as from several other corporations, 
including Cloudera and IBM. 

Reliance on Salaried Developers
Currently,  the developers are paid to do work on Oozie. Companies like Yahoo! 
and  Cloudera are invested in Oozie as the solution to the workflow  management 
and scheduling problem in Hadoop clusters, and that is not  likely to change. In 
addition, since workflow management is very important for most Hadoop-based 
data processing, non-salaried developers  and researchers from various 
institutes are expected to contribute to  the project. 

Relationships with Other Apache Products
Oozie is  based on Apache Hadoop to manage jobs created by different Apache  
projects such as Hadoop, Pig, and Hive. Users of these products are  extensively 
using Oozie as their workflow scheduler. 

An Excessive Fascination with the Apache Brand
We deeply  respect the reputation of Apache and have had great success with 
other  Apache projects such as Pig and HCatalog. We are motivated to expand and  
increase the adoption and development of Oozie following Apache’s  established 
open source model. We have also given reasons in the  Rationale and Alignment 
sections. 

Documentation
Information about Oozie can be found at http://yahoo.github.com/oozie/. The 
following links provide more information about Oozie in open source: 

	* Codebase at GitHub: https://github.com/yahoo/oozie. 
	* JIRA : http://oozie-jira.hadoop.developer.yahoo.net 
	* Continuous Integration (CI)  build: 
http://oozie-ci.hadoop.developer.yahoo.net/ 

	* Yahoo user community: http://tech.groups.yahoo.com/group/Oozie-users/ 
Initial Source
Oozie has been under development since 2009 by a team of engineers at Yahoo!. It 
is currently hosted on GitHub under an Apache license at 
https://github.com/yahoo/oozie. 

External Dependencies
The required external dependencies are all under the Apache License or compatible 
licenses. The components with non-Apache licenses are enumerated below: 

	* HSQLDB License: HSQLDB 
	* JDOM license: JDOM 
	* BSD: Serp 
	* CDDL v1: jaxb-api, ejb, JAF 
NOTE: With the exception of HSQLDB and JDOM, which are used directly by Oozie, 
the other listed components are transitive dependencies of other Apache  
components used by Oozie. 

Cryptography
Oozie supports the Kerberos authentication mechanism to access secured Hadoop 
services. 

Required Resources
Mailing Lists
	* oozie-private for private PMC discussions (with moderated subscriptions) 
	* oozie-dev 
	* oozie-commits 
	* oozie-user 
Subversion Directory
https://svn.apache.org/repos/asf/incubator/oozie 
Issue Tracking
JIRA Oozie (OOZIE) 
Other Resources
The  existing code already has unit tests, so we would like a Hudson instance  
to run them whenever a new patch is submitted. This can be added after  project 
creation. 

Initial Committers
	* Mohammad K Islam (mislam77 at yahoo  dot com) 
	* Angelo K Huang (angelohuang at gmail dot com) 
	* Mayank Bansal (mabansal at gmail dot com) 
	* Andreas Neumann (neunand at gmail dot com) 
	* Alejandro Abdelnur (tucu00 at gmail dot com) 
	* Chao Wang (brookwc at gmail dot com) 
Affiliations
	* Mohammad K Islam (Yahoo!) 
	* Angelo Huang (Yahoo!) 
	* Mayank Bansal (Yahoo!) 
	* Andreas Neumann (Yahoo!) 
	* Alejandro Abdelnur (Cloudera) 
	* Chao Wang (IBM) 
Sponsors
Champion
Alan Gates 
Nominated Mentors
	* Owen O'Malley (Incubator PMC member) 
	* Alan Gates (Incubator PMC member) 
	* Christopher Douglas (Incubator PMC member) 
	* Devaraj Das (Hadoop PMC member) 
Sponsoring Entity
We are requesting the Incubator to sponsor this project. 

Re: [VOTE] Oozie to join the Incubator

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Wed, Jun 29, 2011 at 8:10 PM, Mohammad Islam <mi...@yahoo.com> wrote:

<snip>

> [ X] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation

Robert

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Re: [VOTE] Oozie to join the Incubator

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Wed, Jun 29, 2011 at 9:10 PM, Mohammad Islam <mi...@yahoo.com> wrote:
>...
> [X  ] +1 Accept Oozie for incubation
>...

-Bertrand



Re: [VOTE] Oozie to join the Incubator

Posted by Mayank Bansal <ma...@gmail.com>.
+1 (non-binding)

On Wed, Jun 29, 2011 at 12:10 PM, Mohammad Islam <mi...@yahoo.com> wrote:

> Hi All,
>
> The discussion about Oozie proposal is settling down. Therefore I would
> like to
> initiate a vote to accept Oozie as an Apache Incubator project.
>
> The latest proposal is pasted at the end and it could be found in the wiki
> as
> well:
>
> http://wiki.apache.org/incubator/OozieProposal
>
>
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
>
>
> Please cast your votes:
>
> [  ] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation
>
> This vote will close 72 hours  from now.
>
> Regards,
> Mohammad
>
>
> <snip>
>



-- 
Thanks and Regards,
Mayank
Cell: 408-718-9370

Re: [VOTE] Oozie to join the Incubator

Posted by Nigel Daley <nd...@mac.com>.
+1 (binding)

Sent from my iPad

On Jun 29, 2011, at 12:10 PM, Mohammad Islam <mi...@yahoo.com> wrote:

> Hi All,
> 
> The discussion about Oozie proposal is settling down. Therefore I would like to 
> initiate a vote to accept Oozie as an Apache Incubator project.
> 
> The latest proposal is pasted at the end and it could be found in the wiki as 
> well:
> 
> http://wiki.apache.org/incubator/OozieProposal
> 
> 
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
> 
> 
> Please cast your votes:
> 
> [  ] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation
> 
> This vote will close 72 hours  from now.
> 
> Regards,
> Mohammad
> 
> 
> Abstract
> Oozie is a server-based workflow scheduling and coordination system to manage 
> data processing jobs for Apache HadoopTM. 
> 
> Proposal
> Oozie is an  extensible, scalable and reliable system to define, manage, 
> schedule,  and execute complex Hadoop workloads via web services. More  
> specifically, this includes: 
> 
>    * XML-based declarative framework to specify a job or a complex workflow of 
> dependent jobs. 
> 
>    * Support different types of job such as Hadoop Map-Reduce, Pipe, Streaming, 
> Pig, Hive and custom java applications. 
> 
>    * Workflow scheduling based on frequency and/or data availability. 
>    * Monitoring capability, automatic retry and failure handing of jobs. 
>    * Extensible and pluggable architecture to allow arbitrary grid programming 
> paradigms. 
> 
>    * Authentication, authorization, and capacity-aware load throttling to allow 
> multi-tenant software as a service. 



Re: [VOTE] Oozie to join the Incubator

Posted by "Alan D. Cabrera" <li...@toolazydogs.com>.
+1 binding


Regards,
Alan

On Jun 29, 2011, at 12:10 PM, Mohammad Islam wrote:

> Hi All,
> 
> The discussion about the Oozie proposal is settling down. Therefore, I would like to
> initiate a vote to accept Oozie as an Apache Incubator project.
> 
> The latest proposal is pasted at the end and can also be found on the wiki:
> 
> http://wiki.apache.org/incubator/OozieProposal
> 
> 
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
> 
> 
> Please cast your votes:
> 
> [  ] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation
> 
> This vote will close 72 hours from now.
> 
> Regards,
> Mohammad
> 
> 
> Abstract
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache Hadoop™.
> 
> Proposal
> Oozie is an extensible, scalable, and reliable system to define, manage,
> schedule, and execute complex Hadoop workloads via web services. More
> specifically, this includes:
> 
> 	* An XML-based declarative framework to specify a job or a complex workflow of
> dependent jobs.
> 
> 	* Support for different types of jobs, such as Hadoop Map-Reduce, Pipes, Streaming,
> Pig, Hive, and custom Java applications.
> 
> 	* Workflow scheduling based on frequency and/or data availability.
> 	* Monitoring capability, automatic retry, and failure handling of jobs.
> 	* An extensible and pluggable architecture to allow arbitrary grid programming
> paradigms.
> 
> 	* Authentication, authorization, and capacity-aware load throttling to allow
> multi-tenant software as a service.




Re: [VOTE] Oozie to join the Incubator

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
+1 (binding). Good luck guys!

Cheers,
Chris 


On Jun 29, 2011, at 12:10 PM, "Mohammad Islam" <mi...@yahoo.com> wrote:

> Hi All,
> 
> The discussion about the Oozie proposal is settling down. Therefore, I would like to
> initiate a vote to accept Oozie as an Apache Incubator project.
> 
> The latest proposal is pasted at the end and can also be found on the wiki:
> 
> http://wiki.apache.org/incubator/OozieProposal
> 
> 
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
> 
> 
> Please cast your votes:
> 
> [  ] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation
> 
> This vote will close 72 hours from now.
> 
> Regards,
> Mohammad
> 
> 
> Abstract
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache Hadoop™.
> 
> Proposal
> Oozie is an extensible, scalable, and reliable system to define, manage,
> schedule, and execute complex Hadoop workloads via web services. More
> specifically, this includes:
> 
>        * An XML-based declarative framework to specify a job or a complex workflow of
> dependent jobs.
> 
>        * Support for different types of jobs, such as Hadoop Map-Reduce, Pipes, Streaming,
> Pig, Hive, and custom Java applications.
> 
>        * Workflow scheduling based on frequency and/or data availability.
>        * Monitoring capability, automatic retry, and failure handling of jobs.
>        * An extensible and pluggable architecture to allow arbitrary grid programming
> paradigms.
> 
>        * Authentication, authorization, and capacity-aware load throttling to allow
> multi-tenant software as a service.
> 
> Known Risks
> Orphaned Products
> The core  developers plan to work full time on the project. There is very little
> risk of Oozie getting orphaned since large companies like Yahoo! are
> extensively using it on their production Hadoop clusters. For example,  there
> are nearly 400 Yahoo! internal Oozie users and thousands of jobs  are processed
> hourly through Oozie in production. In addition, there are  nearly 400 active
> users (including Yahoo! internal and external) in the  email community where
> nearly 15 emails are exchanged per day.  Furthermore, there were more than 1500
> downloads of the Oozie binary in  the last eight months from the github site and
> a large number of  downloads were conducted by other companies such as Cloudera.
> Oozie has  three major releases and more than 15 patch releases in the last
> couple  of years which further demonstrates Oozie as a very active project. We
> plan to extend and diversify this community further through Apache.
> 
> Inexperience with Open Source
> The core  developers are all active users and followers of open source. They are
> already committers and contributors to the Oozie Github project. In  addition,
> they are very familiar with Apache principals and philosophy  for community
> driven software development.
> 
> Homogeneous Developers
> The core developers are from Yahoo! as well as from several other corporations,
> including Cloudera and IBM.
> 
> Reliance on Salaried Developers
> Currently,  the developers are paid to do work on Oozie. Companies like Yahoo!
> and  Cloudera are invested in Oozie as the solution to the workflow  management
> and scheduling problem in Hadoop clusters, and that is not  likely to change. In
> addition, since workflow management is very  important for most hadoop based
> data processing, non-salaried developers  and researchers from various
> institutes are expected to contribute to  the project.
> 
> Relationships with Other Apache Products
> Oozie is  based on Apache Hadoop to manage jobs created by different Apache
> projects such as Hadoop, Pig, and Hive. Users of these products are  extensively
> using Oozie as their workflow scheduler.
> 
> An Excessive Fascination with the Apache Brand
> We deeply  respect the reputation of Apache and have had great success with
> other  Apache projects such as Pig and HCatalog. We are motivated to expand and
> increase the adoption and development of Oozie following Apache’s  established
> open source model. We have also given reasons in the  Rationale and Alignment
> sections.
> 
> Documentation
> Information about Oozie can be found at http://yahoo.github.com/oozie/. The
> following links provide more information about Oozie in open source:
> 
>        * Codebase at GitHub: https://github.com/yahoo/oozie.
>        * JIRA : http://oozie-jira.hadoop.developer.yahoo.net
>        * Continuous Integration (CI)  build:
> http://oozie-ci.hadoop.developer.yahoo.net/
> 
>        * Yahoo user community: http://tech.groups.yahoo.com/group/Oozie-users/
> Initial Source
> Oozie has been under development since 2009 by a team of engineers at Yahoo!. It
> is currently hosted on GitHub under an Apache license at
> https://github.com/yahoo/oozie.
> 
> External Dependencies
> The required  external dependencies are all Apache License or compatible
> licenses.  Following the components with non-Apache licenses are enumerated:
> 
>        * HSQLDB License: HSQLDB
>        * JDOM license: JDOM
>        * BSD: Serp
>        * CCDL v1: jaxb-api, ejb, JAF
> NOTE:  With the exception of HSQLDB and JDOM that are directly used by Oozie,
> the other listed components are transitive dependencies of other Apache
> components used by Oozie.
> 
> Cryptography
> Oozie supports the Kerberos authentication mechanism to access secured Hadoop
> services.
> 
> Required Resources
> Mailing Lists
>        * oozie-private for private PMC discussions (with moderated subscriptions)
>        * oozie-dev
>        * oozie-commits
>        * oozie-user
> Subversion Directory
> https://svn.apache.org/repos/asf/incubator/oozie
> Issue Tracking
> JIRA Oozie (OOZIE)
> Other Resources
> The  existing code already has unit tests, so we would like a Hudson instance
> to run them whenever a new patch is submitted. This can be added after  project
> creation.
> 
> Initial Committers
>        * Mohammad K Islam (mislam77 at yahoo  dot com)
>        * Angelo K Huang (angelohuang at gmail dot com)
>        * Mayank Bansal (mabansal at gmail dot com)
>        * Andreas Neumann (neunand at gmail dot com)
>        * Alejandro Abdelnur (tucu00 at gmail dot com)
>        * Chao Wang (brookwc at gmail dot com)
> Affiliations
>        * Mohammad K Islam (Yahoo!)
>        * Angelo Huang (Yahoo!)
>        * Mayank Bansal (Yahoo!)
>        * Andreas Neumann (Yahoo!)
>        * Alejandro Abdelnur (Cloudera)
>        * Chao Wang (IBM)
> Sponsors
> Champion
> Alan Gates
> Nominated Mentors
>        * Owen O'Malley (Incubator PMC member)
>        * Alan Gates (Incubator PMC member)
>        * Christopher Douglas(Incubator PMC member)
>        * Devaraj Das (Hadoop PMC member)
> Sponsoring EntityWe are requesting the Incubator to sponsor this project.

Re: [VOTE] Oozie to join the Incubator

Posted by Doug Cutting <cu...@apache.org>.
+1

Doug

On 06/29/2011 12:10 PM, Mohammad Islam wrote:
> Hi All,
> 
> The discussion about Oozie proposal is settling down. Therefore I would like to 
> initiate a vote to accept Oozie as an Apache Incubator project.

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Re: [VOTE] Oozie to join the Incubator

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.
+1 (non-binding)

On Wed, Jun 29, 2011 at 12:10 PM, Mohammad Islam <mi...@yahoo.com> wrote:
> Hi All,
>
> The discussion about Oozie proposal is settling down. Therefore I would like to
> initiate a vote to accept Oozie as an Apache Incubator project.
>
> The latest proposal is pasted at the end and it could be found in the wiki as
> well:
>
> http://wiki.apache.org/incubator/OozieProposal
>
>
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
>
>
> Please cast your votes:
>
> [  ] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation
>
> This vote will close 72 hours  from now.
>
> Regards,
> Mohammad
>
>
> Abstract
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache HadoopTM.
>
> Proposal
> Oozie is an  extensible, scalable and reliable system to define, manage,
> schedule,  and execute complex Hadoop workloads via web services. More
> specifically, this includes:
>
>        * XML-based declarative framework to specify a job or a complex workflow of
> dependent jobs.
>
>        * Support different types of job such as Hadoop Map-Reduce, Pipe, Streaming,
> Pig, Hive and custom java applications.
>
>        * Workflow scheduling based on frequency and/or data availability.
>        * Monitoring capability, automatic retry and failure handing of jobs.
>        * Extensible and pluggable architecture to allow arbitrary grid programming
> paradigms.
>
>        * Authentication, authorization, and capacity-aware load throttling to allow
> multi-tenant software as a service.
>
> Background
> Most data  processing applications require multiple jobs to achieve their goals,
> with inherent dependencies among the jobs. A dependency could be  sequential,
> where one job can only start after another job has finished.  Or it could be
> conditional, where the execution of a job depends on the  return value or status
> of another job. In other cases, parallel  execution of multiple jobs may be
> permitted – or desired – to exploit  the massive pool of compute nodes provided
> by Hadoop.
>
> These  job dependencies are often expressed as a Directed Acyclic Graph, also
> called a workflow. A node in the workflow is typically a job (a  computation on
> the grid) or another type of action such as an eMail  notification. Computations
> can be expressed in map/reduce, Pig, Hive or  any other programming paradigm
> available on the grid. Edges of the graph  represent transitions from one node
> to the next, as the execution of a  workflow proceeds.
>
> Describing  a workflow in a declarative way has the advantage of decoupling job
> dependencies and execution control from application logic. Furthermore,  the
> workflow is modularized into jobs that can be reused within the same  workflow
> or across different workflows. Execution of the workflow is  then driven by a
> runtime system without understanding the application  logic of the jobs. This
> runtime system specializes in reliable and  predictable execution: It can retry
> actions that have failed or invoke a  cleanup action after termination of the
> workflow; it can monitor  progress, success, or failure of a workflow, and send
> appropriate alerts  to an administrator. The application developer is relieved
> from  implementing these generic procedures.
>
> Furthermore,  some applications or workflows need to run in periodic intervals
> or  when dependent data is available. For example, a workflow could be  executed
> every day as soon as output data from the previous 24 instances  of another,
> hourly workflow is available. The workflow coordinator  provides such scheduling
> features, along with prioritization, load  balancing and throttling to optimize
> utilization of resources in the  cluster. This makes it easier to maintain,
> control, and coordinate  complex data applications.
>
> Nearly  three years ago, a team of Yahoo! developers addressed these critical
> requirements for Hadoop-based data processing systems by developing a  new
> workflow management and scheduling system called Oozie. While it was  initially
> developed as a Yahoo!-internal project, it was designed and  implemented with
> the intention of open-sourcing. Oozie was released as a GitHub project in early
> 2010. Oozie is used in production within Yahoo and  since it has been
> open-sourced it has been gaining adoption with  external developers
>
> Rationale
> Commonly,  applications that run on Hadoop require multiple Hadoop jobs in order
> to  obtain the desired results. Furthermore, these Hadoop jobs are commonly  a
> combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes
> map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs  and shell
> scripts.
>
> Because  of this, developers find themselves writing ad-hoc glue programs to
> combine these Hadoop jobs. These ad-hoc programs are difficult to  schedule,
> manage, monitor and recover.
>
> Workflow  management and scheduling is an essential feature for large-scale data
> processing applications. Such applications could write the customized  solution
> that would require separate development, operational, and  maintenance overhead.
> Since it is a prevalent use-case for data  processing, the application developer
> would surely prefer a generalized  solution with little or no such overhead.
> Oozie addresses the challenge  by providing an execution framework to flexibly
> specify the job  dependency, data dependency, and time dependency. In addition,
> Oozie  provides a multi-tenant-based centralized service and the opportunity to
> optimize load and utilization while respecting SLAs.
>
> Oozie is built on Apache HadoopTM to schedule jobs related to various Apache
> projects such as Hadoop,  Pig, and Hive. As an Apache Open source project, Oozie
> is expected to  attract the larger and more diversified community that currently
> uses  such Apache sponsored projects. Additionally, users of the Hadoop
> ecosystem can influence Oozie’s roadmap, and contribute to it. Likewise,  Oozie,
> as part of the Apache Hadoop TMecosystem, will be a great benefit to the current
> Hadoop/Pig/Hive/HBase/HCatalog community.
>
> Current Status
> Meritocracy
> Oozie  currently is a github-based open sourced project where developers from
> multiple companies are contributing to the project. Our intent with this
> incubator proposal is to further extend this diverse developer  community around
> Oozie following the Apache meritocracy model. We plan  to continue to provide
> adequate support to new developers and to quickly  recruit those who make solid
> contributions to committer status. In  addition, Oozie will expect, accept, and
> work to attract contributions  from amateurs as well.
>
> Community
> While an  efficient workflow management and scheduling system is critical for
> large companies with huge data processing in multi-tenant clusters, it  is
> equally necessary for any non-trivial deployment. Different companies  are
> currently using Oozie as a workflow scheduler for Hadoop-based data  processing.
> At Yahoo! it is being used extensively in production  clusters to process
> thousand of jobs. Like the Oozie user community, the  Oozie developer community
> is also very strong. Developers from Yahoo!  provided the initial code base, and
> they are still the most active  contributors. In late 2010, developers from
> Cloudera also started  contributing, and currently other companies (e.g., IBM)
> are beginning to  participate.
>
> We currently use JIRA for issue tracking, github for code hosting and Yahoo!
> Groups for developer and user communications.
>
> Core Developers
> Oozie is  currently being designed and developed by four engineers from Yahoo! –
> Mohammad Islam, Angelo Huang, Mayank Bansal, and Andreas Neumann. In  addition,
> many outside contributors are actively contributing in design  and development.
> Among them, Alejandro Abdelnur from Cloudera and Chao  Wang from IBM are very
> important contributors. All of these core  developers have deep expertise in
> Hadoop and the Hadoop Ecosystem in  general.
>
> Alignment
> The ASF is a  natural host for Oozie given that it is already the home of
> Hadoop,  Pig, Hive, and other emerging cloud software projects. Oozie was
> designed to support Hadoop from the beginning in order to solve data  processing
> challenges in Hadoop clusters. Oozie complements the existing  Apache cloud
> computing projects by providing a flexible framework for  managing complex data
> processing tasks.
>
> Known Risks
> Orphaned Products
> The core  developers plan to work full time on the project. There is very little
> risk of Oozie getting orphaned since large companies like Yahoo! are
> extensively using it on their production Hadoop clusters. For example,  there
> are nearly 400 Yahoo! internal Oozie users and thousands of jobs  are processed
> hourly through Oozie in production. In addition, there are  nearly 400 active
> users (including Yahoo! internal and external) in the  email community where
> nearly 15 emails are exchanged per day.  Furthermore, there were more than 1500
> downloads of the Oozie binary in  the last eight months from the github site and
> a large number of  downloads were conducted by other companies such as Cloudera.
> Oozie has  three major releases and more than 15 patch releases in the last
> couple  of years which further demonstrates Oozie as a very active project. We
> plan to extend and diversify this community further through Apache.
>
> Inexperience with Open Source
> The core  developers are all active users and followers of open source. They are
> already committers and contributors to the Oozie Github project. In  addition,
> they are very familiar with Apache principals and philosophy  for community
> driven software development.
>
> Homogeneous Developers
> The core developers are from Yahoo! as well as from several other corporations,
> including Cloudera and IBM.
>
> Reliance on Salaried Developers
> Currently,  the developers are paid to do work on Oozie. Companies like Yahoo!
> and  Cloudera are invested in Oozie as the solution to the workflow  management
> and scheduling problem in Hadoop clusters, and that is not  likely to change. In
> addition, since workflow management is very  important for most hadoop based
> data processing, non-salaried developers  and researchers from various
> institutes are expected to contribute to  the project.
>
> Relationships with Other Apache Products
> Oozie is built on Apache Hadoop and manages jobs created by different Apache
> projects such as Hadoop, Pig, and Hive. Users of these products are extensively
> using Oozie as their workflow scheduler.
>
> An Excessive Fascination with the Apache Brand
> We deeply respect the reputation of Apache and have had great success with
> other Apache projects such as Pig and HCatalog. We are motivated to expand and
> increase the adoption and development of Oozie following Apache’s established
> open source model. We have also given reasons in the Rationale and Alignment
> sections.
>
> Documentation
> Information about Oozie can be found at http://yahoo.github.com/oozie/. The
> following links provide more information about Oozie in open source:
>
>        * Codebase at GitHub: https://github.com/yahoo/oozie
>        * JIRA: http://oozie-jira.hadoop.developer.yahoo.net
>        * Continuous Integration (CI) build: http://oozie-ci.hadoop.developer.yahoo.net/
>        * Yahoo user community: http://tech.groups.yahoo.com/group/Oozie-users/
> Initial Source
> Oozie has been under development since 2009 by a team of engineers at Yahoo!. It
> is currently hosted on GitHub under an Apache license at
> https://github.com/yahoo/oozie.
>
> External Dependencies
> The required external dependencies are all under the Apache License or
> compatible licenses. The components with non-Apache licenses are listed below:
>
>        * HSQLDB License: HSQLDB
>        * JDOM license: JDOM
>        * BSD: Serp
>        * CDDL v1: jaxb-api, ejb, JAF
> NOTE: With the exception of HSQLDB and JDOM, which are used directly by Oozie,
> the other listed components are transitive dependencies of other Apache
> components used by Oozie.
>
> Cryptography
> Oozie supports the Kerberos authentication mechanism to access secured Hadoop
> services.
>
> Required Resources
> Mailing Lists
>        * oozie-private for private PMC discussions (with moderated subscriptions)
>        * oozie-dev
>        * oozie-commits
>        * oozie-user
> Subversion Directory
> https://svn.apache.org/repos/asf/incubator/oozie
> Issue Tracking
> JIRA Oozie (OOZIE)
> Other Resources
> The existing code already has unit tests, so we would like a Hudson instance
> to run them whenever a new patch is submitted. This can be added after project
> creation.
>
> Initial Committers
>        * Mohammad K Islam (mislam77 at yahoo dot com)
>        * Angelo K Huang (angelohuang at gmail dot com)
>        * Mayank Bansal (mabansal at gmail dot com)
>        * Andreas Neumann (neunand at gmail dot com)
>        * Alejandro Abdelnur (tucu00 at gmail dot com)
>        * Chao Wang (brookwc at gmail dot com)
> Affiliations
>        * Mohammad K Islam (Yahoo!)
>        * Angelo Huang (Yahoo!)
>        * Mayank Bansal (Yahoo!)
>        * Andreas Neumann (Yahoo!)
>        * Alejandro Abdelnur (Cloudera)
>        * Chao Wang (IBM)
> Sponsors
> Champion
> Alan Gates
> Nominated Mentors
>        * Owen O'Malley (Incubator PMC member)
>        * Alan Gates (Incubator PMC member)
>        * Christopher Douglas (Incubator PMC member)
>        * Devaraj Das (Hadoop PMC member)
> Sponsoring Entity
> We are requesting the Incubator to sponsor this project.
>



Re: [VOTE] Oozie to join the Incubator

Posted by Eric Sammer <es...@cloudera.com>.
+1 (non-binding)

On Wed, Jun 29, 2011 at 12:10 PM, Mohammad Islam <mi...@yahoo.com> wrote:
> Hi All,
>
> The discussion about Oozie proposal is settling down. Therefore I would like to
> initiate a vote to accept Oozie as an Apache Incubator project.
>
> The latest proposal is pasted at the end and it could be found in the wiki as
> well:
>
> http://wiki.apache.org/incubator/OozieProposal
>
>
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
>
>
> Please cast your votes:
>
> [  ] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation
>
> This vote will close 72 hours  from now.
>
> Regards,
> Mohammad



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com



Re: [VOTE] Oozie to join the Incubator

Posted by Mohammad Nour El-Din <no...@gmail.com>.
+1 (Binding)

On Thu, Jun 30, 2011 at 10:04 AM, Chris Douglas <cd...@apache.org> wrote:
> +1 (binding) -C
>
> On Wednesday, June 29, 2011, Mohammad Islam <mi...@yahoo.com> wrote:
>> Hi All,
>>
>> The discussion about Oozie proposal is settling down. Therefore I would like to
>> initiate a vote to accept Oozie as an Apache Incubator project.
>>
>> The latest proposal is pasted at the end and it could be found in the wiki as
>> well:
>>
>> http://wiki.apache.org/incubator/OozieProposal
>>
>>
>> The related discussion thread is at:
>> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
>>
>>
>> Please cast your votes:
>>
>> [  ] +1 Accept Oozie for incubation
>> [  ] +0 Indifferent to Oozie incubation
>> [  ] -1 Reject Oozie for incubation
>>
>> This vote will close 72 hours  from now.
>>
>> Regards,
>> Mohammad



-- 
Thanks
- Mohammad Nour
  Author of (WebSphere Application Server Community Edition 2.0 User Guide)
  http://www.redbooks.ibm.com/abstracts/sg247585.html
- LinkedIn: http://www.linkedin.com/in/mnour
- Blog: http://tadabborat.blogspot.com
----
"Life is like riding a bicycle. To keep your balance you must keep moving"
- Albert Einstein

"Writing clean code is what you must do in order to call yourself a
professional. There is no reasonable excuse for doing anything less
than your best."
- Clean Code: A Handbook of Agile Software Craftsmanship

"Stay hungry, stay foolish."
- Steve Jobs



Re: [VOTE] Oozie to join the Incubator

Posted by Chris Douglas <cd...@apache.org>.
+1 (binding) -C




Re: [VOTE] Oozie to join the Incubator

Posted by Ahmed Radwan <ah...@cloudera.com>.
+1 (non-binding)
Good luck


Re: [VOTE] Oozie to join the Incubator

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
+1 (non-binding)

On Wed, Jun 29, 2011 at 10:18 PM, Ashish <pa...@gmail.com> wrote:

> +1 (non-binding)
>
> On Thu, Jun 30, 2011 at 12:40 AM, Mohammad Islam <mi...@yahoo.com>
> wrote:
>
> > Hi All,
> >
> > The discussion about Oozie proposal is settling down. Therefore I would
> > like to initiate a vote to accept Oozie as an Apache Incubator project.
> >
> > The latest proposal is pasted at the end and it could be found in the
> > wiki as well:
> >
> > http://wiki.apache.org/incubator/OozieProposal
> >
> >
> > The related discussion thread is at:
> > http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
> >
> >
> > Please cast your votes:
> >
> > [  ] +1 Accept Oozie for incubation
> > [  ] +0 Indifferent to Oozie incubation
> > [  ] -1 Reject Oozie for incubation
> >
> > This vote will close 72 hours  from now.
> >
> > Regards,
> > Mohammad
> >
> >
> > Abstract
> > Oozie is a server-based workflow scheduling and coordination system to
> > manage data processing jobs for Apache Hadoop™.
> >
> > Proposal
> > Oozie is an  extensible, scalable and reliable system to define, manage,
> > schedule,  and execute complex Hadoop workloads via web services. More
> > specifically, this includes:
> >
> >        * XML-based declarative framework to specify a job or a complex
> >          workflow of dependent jobs.
> >        * Support for different job types such as Hadoop Map-Reduce, Pipes,
> >          Streaming, Pig, Hive and custom Java applications.
> >        * Workflow scheduling based on frequency and/or data availability.
> >        * Monitoring capability, automatic retry and failure handling of
> >          jobs.
> >        * Extensible and pluggable architecture to allow arbitrary grid
> >          programming paradigms.
> >        * Authentication, authorization, and capacity-aware load throttling
> >          to allow multi-tenant software as a service.
> >
> > Background
> > Most data processing applications require multiple jobs to achieve their
> > goals, with inherent dependencies among the jobs. A dependency could be
> > sequential, where one job can only start after another job has finished.
> > Or it could be conditional, where the execution of a job depends on the
> > return value or status of another job. In other cases, parallel execution
> > of multiple jobs may be permitted – or desired – to exploit the massive
> > pool of compute nodes provided by Hadoop.
> >
> > These job dependencies are often expressed as a Directed Acyclic Graph,
> > also called a workflow. A node in the workflow is typically a job (a
> > computation on the grid) or another type of action such as an email
> > notification. Computations can be expressed in map/reduce, Pig, Hive or
> > any other programming paradigm available on the grid. Edges of the graph
> > represent transitions from one node to the next, as the execution of a
> > workflow proceeds.
> >
> > Describing  a workflow in a declarative way has the advantage of
> decoupling
> > job
> > dependencies and execution control from application logic. Furthermore,
> >  the
> > workflow is modularized into jobs that can be reused within the same
> >  workflow
> > or across different workflows. Execution of the workflow is  then driven
> by
> > a
> > runtime system without understanding the application  logic of the jobs.
> > This
> > runtime system specializes in reliable and  predictable execution: It can
> > retry
> > actions that have failed or invoke a  cleanup action after termination of
> > the
> > workflow; it can monitor  progress, success, or failure of a workflow,
> and
> > send
> > appropriate alerts  to an administrator. The application developer is
> > relieved
> > from  implementing these generic procedures.
> >
> > Furthermore,  some applications or workflows need to run in periodic
> > intervals
> > or  when dependent data is available. For example, a workflow could be
> >  executed
> > every day as soon as output data from the previous 24 instances  of
> > another,
> > hourly workflow is available. The workflow coordinator  provides such
> > scheduling
> > features, along with prioritization, load  balancing and throttling to
> > optimize
> > utilization of resources in the  cluster. This makes it easier to
> maintain,
> > control, and coordinate  complex data applications.
> >
> > Nearly  three years ago, a team of Yahoo! developers addressed these
> > critical
> > requirements for Hadoop-based data processing systems by developing a
>  new
> > workflow management and scheduling system called Oozie. While it was
> >  initially
> > developed as a Yahoo!-internal project, it was designed and  implemented
> > with
> > the intention of open-sourcing. Oozie was released as a GitHub project in
> > early
> > 2010. Oozie is used in production within Yahoo and  since it has been
> > open-sourced it has been gaining adoption with  external developers
> >
> > Rationale
> > Commonly,  applications that run on Hadoop require multiple Hadoop jobs
> in
> > order
> > to  obtain the desired results. Furthermore, these Hadoop jobs are
> commonly
> >  a
> > combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes
> > map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs  and
> > shell
> > scripts.
> >
> > Because  of this, developers find themselves writing ad-hoc glue programs
> > to
> > combine these Hadoop jobs. These ad-hoc programs are difficult to
> >  schedule,
> > manage, monitor and recover.
> >
> > Workflow  management and scheduling is an essential feature for
> large-scale
> > data
> > processing applications. Such applications could write the customized
> >  solution
> > that would require separate development, operational, and  maintenance
> > overhead.
> > Since it is a prevalent use-case for data  processing, the application
> > developer
> > would surely prefer a generalized  solution with little or no such
> > overhead.
> > Oozie addresses the challenge  by providing an execution framework to
> > flexibly
> > specify the job  dependency, data dependency, and time dependency. In
> > addition,
> > Oozie  provides a multi-tenant-based centralized service and the
> > opportunity to
> > optimize load and utilization while respecting SLAs.
> >
> > Oozie is built on Apache HadoopTM to schedule jobs related to various
> > Apache
> > projects such as Hadoop,  Pig, and Hive. As an Apache Open source
> project,
> > Oozie
> > is expected to  attract the larger and more diversified community that
> > currently
> > uses  such Apache sponsored projects. Additionally, users of the Hadoop
> > ecosystem can influence Oozie’s roadmap, and contribute to it. Likewise,
> >  Oozie,
> > as part of the Apache Hadoop TMecosystem, will be a great benefit to the
> > current
> > Hadoop/Pig/Hive/HBase/HCatalog community.
> >
> > Current Status
> > Meritocracy
> > Oozie  currently is a github-based open sourced project where developers
> > from
> > multiple companies are contributing to the project. Our intent with this
> > incubator proposal is to further extend this diverse developer  community
> > around
> > Oozie following the Apache meritocracy model. We plan  to continue to
> > provide
> > adequate support to new developers and to quickly  recruit those who make
> > solid
> > contributions to committer status. In  addition, Oozie will expect,
> accept,
> > and
> > work to attract contributions  from amateurs as well.
> >
> > Community
> > While an  efficient workflow management and scheduling system is critical
> > for
> > large companies with huge data processing in multi-tenant clusters, it
>  is
> > equally necessary for any non-trivial deployment. Different companies
>  are
> > currently using Oozie as a workflow scheduler for Hadoop-based data
> >  processing.
> > At Yahoo! it is being used extensively in production  clusters to process
> > thousand of jobs. Like the Oozie user community, the  Oozie developer
> > community
> > is also very strong. Developers from Yahoo!  provided the initial code
> > base, and
> > they are still the most active  contributors. In late 2010, developers
> from
> > Cloudera also started  contributing, and currently other companies (e.g.,
> > IBM)
> > are beginning to  participate.
> >
> > We currently use JIRA for issue tracking, github for code hosting and
> > Yahoo!
> > Groups for developer and user communications.
> >
> > Core Developers
> > Oozie is  currently being designed and developed by four engineers from
> > Yahoo! –
> > Mohammad Islam, Angelo Huang, Mayank Bansal, and Andreas Neumann. In
> >  addition,
> > many outside contributors are actively contributing in design  and
> > development.
> > Among them, Alejandro Abdelnur from Cloudera and Chao  Wang from IBM are
> > very
> > important contributors. All of these core  developers have deep expertise
> > in
> > Hadoop and the Hadoop Ecosystem in  general.
> >
> > Alignment
> > The ASF is a  natural host for Oozie given that it is already the home of
> > Hadoop,  Pig, Hive, and other emerging cloud software projects. Oozie was
> > designed to support Hadoop from the beginning in order to solve data
> >  processing
> > challenges in Hadoop clusters. Oozie complements the existing  Apache
> cloud
> > computing projects by providing a flexible framework for  managing
> complex
> > data
> > processing tasks.
> >
> > Known Risks
> > Orphaned Products
> > The core  developers plan to work full time on the project. There is very
> > little
> > risk of Oozie getting orphaned since large companies like Yahoo! are
> > extensively using it on their production Hadoop clusters. For example,
> >  there
> > are nearly 400 Yahoo! internal Oozie users and thousands of jobs  are
> > processed
> > hourly through Oozie in production. In addition, there are  nearly 400
> > active
> > users (including Yahoo! internal and external) in the  email community
> > where
> > nearly 15 emails are exchanged per day.  Furthermore, there were more
> than
> > 1500
> > downloads of the Oozie binary in  the last eight months from the github
> > site and
> > a large number of  downloads were conducted by other companies such as
> > Cloudera.
> > Oozie has  three major releases and more than 15 patch releases in the
> last
> > couple  of years which further demonstrates Oozie as a very active
> project.
> > We
> > plan to extend and diversify this community further through Apache.
> >
> > Inexperience with Open Source
> > The core  developers are all active users and followers of open source.
> > They are
> > already committers and contributors to the Oozie Github project. In
> >  addition,
> > they are very familiar with Apache principals and philosophy  for
> community
> > driven software development.
> >
> > Homogeneous Developers
> > The core developers are from Yahoo! as well as from several other
> > corporations,
> > including Cloudera and IBM.
> >
> > Reliance on Salaried Developers
> > Currently,  the developers are paid to do work on Oozie. Companies like
> > Yahoo!
> > and  Cloudera are invested in Oozie as the solution to the workflow
> >  management
> > and scheduling problem in Hadoop clusters, and that is not  likely to
> > change. In
> > addition, since workflow management is very  important for most hadoop
> > based
> > data processing, non-salaried developers  and researchers from various
> > institutes are expected to contribute to  the project.
> >
> > Relationships with Other Apache Products
> > Oozie is  based on Apache Hadoop to manage jobs created by different
> Apache
> > projects such as Hadoop, Pig, and Hive. Users of these products are
> >  extensively
> > using Oozie as their workflow scheduler.
> >
> > An Excessive Fascination with the Apache Brand
> > We deeply  respect the reputation of Apache and have had great success
> with
> > other  Apache projects such as Pig and HCatalog. We are motivated to
> expand
> > and
> > increase the adoption and development of Oozie following Apache’s
> >  established
> > open source model. We have also given reasons in the  Rationale and
> > Alignment
> > sections.
> >
> > Documentation
> > Information about Oozie can be found at http://yahoo.github.com/oozie/.
> > The
> > following links provide more information about Oozie in open source:
> >
> >        * Codebase at GitHub: https://github.com/yahoo/oozie.
> >        * JIRA : http://oozie-jira.hadoop.developer.yahoo.net
> >        * Continuous Integration (CI)  build:
> > http://oozie-ci.hadoop.developer.yahoo.net/
> >
> >        * Yahoo user community:
> > http://tech.groups.yahoo.com/group/Oozie-users/
> > Initial Source
> > Oozie has been under development since 2009 by a team of engineers at
> > Yahoo!. It
> > is currently hosted on GitHub under an Apache license at
> > https://github.com/yahoo/oozie.
> >
> > External Dependencies
> > The required  external dependencies are all Apache License or compatible
> > licenses.  Following the components with non-Apache licenses are
> > enumerated:
> >
> >        * HSQLDB License: HSQLDB
> >        * JDOM license: JDOM
> >        * BSD: Serp
> >        * CCDL v1: jaxb-api, ejb, JAF
> > NOTE:  With the exception of HSQLDB and JDOM that are directly used by
> > Oozie,
> > the other listed components are transitive dependencies of other Apache
> > components used by Oozie.
> >
> > Cryptography
> > Oozie supports the Kerberos authentication mechanism to access secured
> > Hadoop
> > services.
> >
> > Required Resources
> > Mailing Lists
> >        * oozie-private for private PMC discussions (with moderated
> > subscriptions)
> >        * oozie-dev
> >        * oozie-commits
> >        * oozie-user
> > Subversion Directory
> > https://svn.apache.org/repos/asf/incubator/oozie
> > Issue Tracking
> > JIRA Oozie (OOZIE)
> > Other Resources
> > The  existing code already has unit tests, so we would like a Hudson
> > instance
> > to run them whenever a new patch is submitted. This can be added after
> >  project
> > creation.
> >
> > Initial Committers
> >        * Mohammad K Islam (mislam77 at yahoo  dot com)
> >        * Angelo K Huang (angelohuang at gmail dot com)
> >        * Mayank Bansal (mabansal at gmail dot com)
> >        * Andreas Neumann (neunand at gmail dot com)
> >        * Alejandro Abdelnur (tucu00 at gmail dot com)
> >        * Chao Wang (brookwc at gmail dot com)
> > Affiliations
> >        * Mohammad K Islam (Yahoo!)
> >        * Angelo Huang (Yahoo!)
> >        * Mayank Bansal (Yahoo!)
> >        * Andreas Neumann (Yahoo!)
> >        * Alejandro Abdelnur (Cloudera)
> >        * Chao Wang (IBM)
> > Sponsors
> > Champion
> > Alan Gates
> > Nominated Mentors
> >        * Owen O'Malley (Incubator PMC member)
> >        * Alan Gates (Incubator PMC member)
> >        * Christopher Douglas(Incubator PMC member)
> >        * Devaraj Das (Hadoop PMC member)
> > Sponsoring EntityWe are requesting the Incubator to sponsor this project.
> >
>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>

Re: [VOTE] Oozie to join the Incubator

Posted by Ashish <pa...@gmail.com>.
+1 (non-binding)

On Thu, Jun 30, 2011 at 12:40 AM, Mohammad Islam <mi...@yahoo.com> wrote:

> Hi All,
>
> The discussion about Oozie proposal is settling down. Therefore I would
> like to
> initiate a vote to accept Oozie as an Apache Incubator project.
>
> The latest proposal is pasted at the end and it could be found in the wiki
> as
> well:
>
> http://wiki.apache.org/incubator/OozieProposal
>
>
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
>
>
> Please cast your votes:
>
> [  ] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation
>
> This vote will close 72 hours  from now.
>
> Regards,
> Mohammad
>
>



-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

Re: [VOTE] Oozie to join the Incubator

Posted by Suresh Marru <sm...@apache.org>.
Hi Mohammad,

I am interested in contributing to this project. Since no one has voted yet, may I add my name to the Initial Committers?

Thanks,
Suresh

On Jun 29, 2011, at 3:10 PM, Mohammad Islam wrote:

> Hi All,
> 
> The discussion about Oozie proposal is settling down. Therefore I would like to 
> initiate a vote to accept Oozie as an Apache Incubator project.
> 
> The latest proposal is pasted at the end and it could be found in the wiki as 
> well:
> 
> http://wiki.apache.org/incubator/OozieProposal
> 
> 
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
> 
> 
> Please cast your votes:
> 
> [  ] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation
> 
> This vote will close 72 hours  from now.
> 
> Regards,
> Mohammad
> 
> 
> Abstract
> Oozie is a server-based workflow scheduling and coordination system to manage 
> data processing jobs for Apache Hadoop™.
> 
> Proposal
> Oozie is an  extensible, scalable and reliable system to define, manage, 
> schedule,  and execute complex Hadoop workloads via web services. More  
> specifically, this includes: 
> 
> 	* XML-based declarative framework to specify a job or a complex workflow of 
> dependent jobs. 
> 
> 	* Support different types of job such as Hadoop Map-Reduce, Pipe, Streaming, 
> Pig, Hive and custom java applications. 
> 
> 	* Workflow scheduling based on frequency and/or data availability. 
> 	* Monitoring capability, automatic retry and failure handling of jobs.
> 	* Extensible and pluggable architecture to allow arbitrary grid programming 
> paradigms. 
> 
> 	* Authentication, authorization, and capacity-aware load throttling to allow 
> multi-tenant software as a service. 
> 
> Background
> Most data  processing applications require multiple jobs to achieve their goals,  
> with inherent dependencies among the jobs. A dependency could be  sequential, 
> where one job can only start after another job has finished.  Or it could be 
> conditional, where the execution of a job depends on the  return value or status 
> of another job. In other cases, parallel  execution of multiple jobs may be 
> permitted – or desired – to exploit  the massive pool of compute nodes provided 
> by Hadoop. 
> 
> These  job dependencies are often expressed as a Directed Acyclic Graph, also  
> called a workflow. A node in the workflow is typically a job (a  computation on 
> the grid) or another type of action such as an eMail  notification. Computations 
> can be expressed in map/reduce, Pig, Hive or  any other programming paradigm 
> available on the grid. Edges of the graph  represent transitions from one node 
> to the next, as the execution of a  workflow proceeds. 
> 
> Describing  a workflow in a declarative way has the advantage of decoupling job  
> dependencies and execution control from application logic. Furthermore,  the 
> workflow is modularized into jobs that can be reused within the same  workflow 
> or across different workflows. Execution of the workflow is  then driven by a 
> runtime system without understanding the application  logic of the jobs. This 
> runtime system specializes in reliable and  predictable execution: It can retry 
> actions that have failed or invoke a  cleanup action after termination of the 
> workflow; it can monitor  progress, success, or failure of a workflow, and send 
> appropriate alerts  to an administrator. The application developer is relieved 
> from  implementing these generic procedures. 
> 
> Furthermore,  some applications or workflows need to run in periodic intervals 
> or  when dependent data is available. For example, a workflow could be  executed 
> every day as soon as output data from the previous 24 instances  of another, 
> hourly workflow is available. The workflow coordinator  provides such scheduling 
> features, along with prioritization, load  balancing and throttling to optimize 
> utilization of resources in the  cluster. This makes it easier to maintain, 
> control, and coordinate  complex data applications. 
> 
> Nearly  three years ago, a team of Yahoo! developers addressed these critical  
> requirements for Hadoop-based data processing systems by developing a  new 
> workflow management and scheduling system called Oozie. While it was  initially 
> developed as a Yahoo!-internal project, it was designed and  implemented with 
> the intention of open-sourcing. Oozie was released as a GitHub project in early 
> 2010. Oozie is used in production within Yahoo! and, since it was
> open-sourced, it has been gaining adoption among external developers.
> 
> Rationale
> Commonly,  applications that run on Hadoop require multiple Hadoop jobs in order 
> to  obtain the desired results. Furthermore, these Hadoop jobs are commonly  a 
> combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes  
> map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs  and shell 
> scripts. 
> 
> Because  of this, developers find themselves writing ad-hoc glue programs to  
> combine these Hadoop jobs. These ad-hoc programs are difficult to  schedule, 
> manage, monitor and recover. 
> 
> Workflow management and scheduling is an essential feature for large-scale data
> processing applications. Each application could write its own customized solution,
> but that would incur separate development, operational, and maintenance overhead.
> Since this is a prevalent use case for data processing, application developers
> would prefer a generalized solution with little or no such overhead.
> Oozie addresses the challenge  by providing an execution framework to flexibly 
> specify the job  dependency, data dependency, and time dependency. In addition, 
> Oozie  provides a multi-tenant-based centralized service and the opportunity to  
> optimize load and utilization while respecting SLAs. 
> 
> Oozie is built on Apache Hadoop™ to schedule jobs related to various Apache
> projects such as Hadoop, Pig, and Hive. As an Apache open source project, Oozie
> is expected to attract the larger and more diversified community that currently
> uses such Apache-sponsored projects. Additionally, users of the Hadoop
> ecosystem can influence Oozie’s roadmap and contribute to it. Likewise, Oozie,
> as part of the Apache Hadoop™ ecosystem, will be a great benefit to the current
> Hadoop/Pig/Hive/HBase/HCatalog community.
> 
> Current Status
> Meritocracy
> Oozie is currently a GitHub-based open source project where developers from
> multiple companies are contributing to the project. Our intent with this
> incubator proposal is to further extend this diverse developer  community around 
> Oozie following the Apache meritocracy model. We plan  to continue to provide 
> adequate support to new developers and to quickly  recruit those who make solid 
> contributions to committer status. In  addition, Oozie will expect, accept, and 
> work to attract contributions  from amateurs as well. 
> 
> Community
> While an  efficient workflow management and scheduling system is critical for  
> large companies with huge data processing in multi-tenant clusters, it  is 
> equally necessary for any non-trivial deployment. Different companies  are 
> currently using Oozie as a workflow scheduler for Hadoop-based data  processing. 
> At Yahoo! it is being used extensively in production  clusters to process 
> thousands of jobs. Like the Oozie user community, the Oozie developer community
> is also very strong. Developers from Yahoo!  provided the initial code base, and 
> they are still the most active  contributors. In late 2010, developers from 
> Cloudera also started  contributing, and currently other companies (e.g., IBM) 
> are beginning to  participate. 
> 
> We currently use JIRA for issue tracking, GitHub for code hosting and Yahoo!
> Groups for developer and user communications. 
> 
> Core Developers
> Oozie is  currently being designed and developed by four engineers from Yahoo! –  
> Mohammad Islam, Angelo Huang, Mayank Bansal, and Andreas Neumann. In  addition, 
> many outside contributors are actively contributing in design  and development. 
> Among them, Alejandro Abdelnur from Cloudera and Chao  Wang from IBM are very 
> important contributors. All of these core  developers have deep expertise in 
> Hadoop and the Hadoop Ecosystem in  general. 
> 
> Alignment
> The ASF is a  natural host for Oozie given that it is already the home of 
> Hadoop,  Pig, Hive, and other emerging cloud software projects. Oozie was  
> designed to support Hadoop from the beginning in order to solve data  processing 
> challenges in Hadoop clusters. Oozie complements the existing  Apache cloud 
> computing projects by providing a flexible framework for  managing complex data 
> processing tasks. 
> 
> Known Risks
> Orphaned Products
> The core  developers plan to work full time on the project. There is very little  
> risk of Oozie getting orphaned since large companies like Yahoo! are  
> extensively using it on their production Hadoop clusters. For example,  there 
> are nearly 400 Yahoo! internal Oozie users and thousands of jobs  are processed 
> hourly through Oozie in production. In addition, there are  nearly 400 active 
> users (including Yahoo! internal and external) in the  email community where 
> nearly 15 emails are exchanged per day. Furthermore, there were more than 1500
> downloads of the Oozie binary in the last eight months from the GitHub site, and
> a large number of downloads were conducted by other companies such as Cloudera.
> Oozie has had three major releases and more than 15 patch releases in the last
> couple of years, which further demonstrates that Oozie is a very active project. We
> plan to extend and diversify this community further through Apache. 
> 
> Inexperience with Open Source
> The core  developers are all active users and followers of open source. They are  
> already committers and contributors to the Oozie GitHub project. In addition,
> they are very familiar with Apache principles and philosophy for
> community-driven software development.
> 
> Homogeneous Developers
> The core developers are from Yahoo! as well as from several other corporations, 
> including Cloudera and IBM. 
> 
> Reliance on Salaried Developers
> Currently,  the developers are paid to do work on Oozie. Companies like Yahoo! 
> and  Cloudera are invested in Oozie as the solution to the workflow  management 
> and scheduling problem in Hadoop clusters, and that is not  likely to change. In 
> addition, since workflow management is very important for most Hadoop-based
> data processing, non-salaried developers  and researchers from various 
> institutes are expected to contribute to  the project. 
> 
> Relationships with Other Apache Products
> Oozie is  based on Apache Hadoop to manage jobs created by different Apache  
> projects such as Hadoop, Pig, and Hive. Users of these products are  extensively 
> using Oozie as their workflow scheduler. 
> 
> An Excessive Fascination with the Apache Brand
> We deeply  respect the reputation of Apache and have had great success with 
> other  Apache projects such as Pig and HCatalog. We are motivated to expand and  
> increase the adoption and development of Oozie following Apache’s  established 
> open source model. We have also given reasons in the  Rationale and Alignment 
> sections. 
> 
> Documentation
> Information about Oozie can be found at http://yahoo.github.com/oozie/. The 
> following links provide more information about Oozie in open source: 
> 
> 	* Codebase at GitHub: https://github.com/yahoo/oozie. 
> 	* JIRA : http://oozie-jira.hadoop.developer.yahoo.net 
> 	* Continuous Integration (CI)  build: 
> http://oozie-ci.hadoop.developer.yahoo.net/ 
> 
> 	* Yahoo user community: http://tech.groups.yahoo.com/group/Oozie-users/ 
> Initial Source
> Oozie has been under development since 2009 by a team of engineers at Yahoo!. It 
> is currently hosted on GitHub under an Apache license at 
> https://github.com/yahoo/oozie. 
> 
> External Dependencies
> The required external dependencies are all under the Apache License or compatible
> licenses. The components with non-Apache licenses are enumerated below:
> 
> 	* HSQLDB License: HSQLDB 
> 	* JDOM license: JDOM 
> 	* BSD: Serp 
> 	* CDDL v1: jaxb-api, ejb, JAF
> NOTE: With the exception of HSQLDB and JDOM, which are used directly by Oozie,
> the other listed components are transitive dependencies of other Apache
> components used by Oozie.
> 
> Cryptography
> Oozie supports the Kerberos authentication mechanism to access secured Hadoop 
> services. 
> 
> Required Resources
> Mailing Lists
> 	* oozie-private for private PMC discussions (with moderated subscriptions) 
> 	* oozie-dev 
> 	* oozie-commits 
> 	* oozie-user 
> Subversion Directory
> https://svn.apache.org/repos/asf/incubator/oozie 
> Issue Tracking
> JIRA Oozie (OOZIE) 
> Other Resources
> The  existing code already has unit tests, so we would like a Hudson instance  
> to run them whenever a new patch is submitted. This can be added after  project 
> creation. 
> 
> Initial Committers
> 	* Mohammad K Islam (mislam77 at yahoo  dot com) 
> 	* Angelo K Huang (angelohuang at gmail dot com) 
> 	* Mayank Bansal (mabansal at gmail dot com) 
> 	* Andreas Neumann (neunand at gmail dot com) 
> 	* Alejandro Abdelnur (tucu00 at gmail dot com) 
> 	* Chao Wang (brookwc at gmail dot com) 
> Affiliations
> 	* Mohammad K Islam (Yahoo!) 
> 	* Angelo Huang (Yahoo!) 
> 	* Mayank Bansal (Yahoo!) 
> 	* Andreas Neumann (Yahoo!) 
> 	* Alejandro Abdelnur (Cloudera) 
> 	* Chao Wang (IBM) 
> Sponsors
> Champion
> Alan Gates 
> Nominated Mentors
> 	* Owen O'Malley (Incubator PMC member) 
> 	* Alan Gates (Incubator PMC member) 
> 	* Christopher Douglas (Incubator PMC member)
> 	* Devaraj Das (Hadoop PMC member) 
> Sponsoring Entity
> We are requesting the Incubator to sponsor this project.




Re: [VOTE] Oozie to join the Incubator

Posted by "Edward J. Yoon" <ed...@apache.org>.
Cool project, +1

On Thu, Jun 30, 2011 at 2:23 PM, Arvind Prabhakar <ar...@apache.org> wrote:
> +1 (non-binding)
>
> Thanks,
> Arvind



-- 
Best Regards, Edward J. Yoon
@eddieyoon



Re: [VOTE] Oozie to join the Incubator

Posted by Arvind Prabhakar <ar...@apache.org>.
+1 (non-binding)

Thanks,
Arvind

On Wed, Jun 29, 2011 at 12:10 PM, Mohammad Islam <mi...@yahoo.com> wrote:
> Hi All,
>
> The discussion about Oozie proposal is settling down. Therefore I would like to
> initiate a vote to accept Oozie as an Apache Incubator project.
>
> The latest proposal is pasted at the end and it could be found in the wiki as
> well:
>
> http://wiki.apache.org/incubator/OozieProposal
>
>
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
>
>
> Please cast your votes:
>
> [  ] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation
>
> This vote will close 72 hours  from now.
>
> Regards,
> Mohammad
>
>
> Abstract
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache Hadoop™.
>
> Proposal
> Oozie is an  extensible, scalable and reliable system to define, manage,
> schedule,  and execute complex Hadoop workloads via web services. More
> specifically, this includes:
>
>        * XML-based declarative framework to specify a job or a complex workflow of
>          dependent jobs.
>        * Support for different types of jobs such as Hadoop Map-Reduce, Pipes,
>          Streaming, Pig, Hive and custom Java applications.
>        * Workflow scheduling based on frequency and/or data availability.
>        * Monitoring capability, automatic retry and failure handling of jobs.
>        * Extensible and pluggable architecture to allow arbitrary grid programming
>          paradigms.
>        * Authentication, authorization, and capacity-aware load throttling to allow
>          multi-tenant software as a service.
>
> Background
> Most data  processing applications require multiple jobs to achieve their goals,
> with inherent dependencies among the jobs. A dependency could be  sequential,
> where one job can only start after another job has finished.  Or it could be
> conditional, where the execution of a job depends on the  return value or status
> of another job. In other cases, parallel  execution of multiple jobs may be
> permitted – or desired – to exploit  the massive pool of compute nodes provided
> by Hadoop.
>
> These  job dependencies are often expressed as a Directed Acyclic Graph, also
> called a workflow. A node in the workflow is typically a job (a  computation on
> the grid) or another type of action such as an email notification. Computations
> can be expressed in map/reduce, Pig, Hive or  any other programming paradigm
> available on the grid. Edges of the graph  represent transitions from one node
> to the next, as the execution of a  workflow proceeds.
>
> Describing  a workflow in a declarative way has the advantage of decoupling job
> dependencies and execution control from application logic. Furthermore,  the
> workflow is modularized into jobs that can be reused within the same  workflow
> or across different workflows. Execution of the workflow is  then driven by a
> runtime system without understanding the application  logic of the jobs. This
> runtime system specializes in reliable and  predictable execution: It can retry
> actions that have failed or invoke a  cleanup action after termination of the
> workflow; it can monitor  progress, success, or failure of a workflow, and send
> appropriate alerts  to an administrator. The application developer is relieved
> from  implementing these generic procedures.
>
> Furthermore, some applications or workflows need to run at periodic intervals
> or  when dependent data is available. For example, a workflow could be  executed
> every day as soon as output data from the previous 24 instances  of another,
> hourly workflow is available. The workflow coordinator  provides such scheduling
> features, along with prioritization, load  balancing and throttling to optimize
> utilization of resources in the  cluster. This makes it easier to maintain,
> control, and coordinate  complex data applications.
>
> Nearly  three years ago, a team of Yahoo! developers addressed these critical
> requirements for Hadoop-based data processing systems by developing a  new
> workflow management and scheduling system called Oozie. While it was  initially
> developed as a Yahoo!-internal project, it was designed and  implemented with
> the intention of open-sourcing. Oozie was released as a GitHub project in early
> 2010. Oozie is used in production within Yahoo! and, since being
> open-sourced, has been gaining adoption with external developers.
>
> Rationale
> Commonly,  applications that run on Hadoop require multiple Hadoop jobs in order
> to  obtain the desired results. Furthermore, these Hadoop jobs are commonly  a
> combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes
> map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs  and shell
> scripts.
>
> Because  of this, developers find themselves writing ad-hoc glue programs to
> combine these Hadoop jobs. These ad-hoc programs are difficult to  schedule,
> manage, monitor and recover.
>
> Workflow management and scheduling is an essential feature for large-scale data
> processing applications. Each application could write its own customized solution,
> but that would incur separate development, operational, and maintenance overhead.
> Since this is a prevalent use case for data processing, application developers
> would prefer a generalized solution with little or no such overhead.
> Oozie addresses the challenge  by providing an execution framework to flexibly
> specify the job  dependency, data dependency, and time dependency. In addition,
> Oozie  provides a multi-tenant-based centralized service and the opportunity to
> optimize load and utilization while respecting SLAs.
>
> Oozie is built on Apache Hadoop™ to schedule jobs related to various Apache
> projects such as Hadoop, Pig, and Hive. As an Apache open source project, Oozie
> is expected to attract the larger and more diversified community that currently
> uses such Apache-sponsored projects. Additionally, users of the Hadoop
> ecosystem can influence Oozie’s roadmap and contribute to it. Likewise, Oozie,
> as part of the Apache Hadoop™ ecosystem, will be a great benefit to the current
> Hadoop/Pig/Hive/HBase/HCatalog community.
>
> Current Status
> Meritocracy
> Oozie is currently a GitHub-based open source project where developers from
> multiple companies are contributing to the project. Our intent with this
> incubator proposal is to further extend this diverse developer  community around
> Oozie following the Apache meritocracy model. We plan  to continue to provide
> adequate support to new developers and to quickly  recruit those who make solid
> contributions to committer status. In  addition, Oozie will expect, accept, and
> work to attract contributions  from amateurs as well.
>
> Community
> While an  efficient workflow management and scheduling system is critical for
> large companies with huge data processing in multi-tenant clusters, it  is
> equally necessary for any non-trivial deployment. Different companies  are
> currently using Oozie as a workflow scheduler for Hadoop-based data  processing.
> At Yahoo! it is being used extensively in production  clusters to process
> thousands of jobs. Like the Oozie user community, the Oozie developer community
> is also very strong. Developers from Yahoo!  provided the initial code base, and
> they are still the most active  contributors. In late 2010, developers from
> Cloudera also started  contributing, and currently other companies (e.g., IBM)
> are beginning to  participate.
>
> We currently use JIRA for issue tracking, GitHub for code hosting and Yahoo!
> Groups for developer and user communications.
>
> Core Developers
> Oozie is  currently being designed and developed by four engineers from Yahoo! –
> Mohammad Islam, Angelo Huang, Mayank Bansal, and Andreas Neumann. In  addition,
> many outside contributors are actively contributing in design  and development.
> Among them, Alejandro Abdelnur from Cloudera and Chao  Wang from IBM are very
> important contributors. All of these core  developers have deep expertise in
> Hadoop and the Hadoop Ecosystem in  general.
>
> Alignment
> The ASF is a  natural host for Oozie given that it is already the home of
> Hadoop,  Pig, Hive, and other emerging cloud software projects. Oozie was
> designed to support Hadoop from the beginning in order to solve data  processing
> challenges in Hadoop clusters. Oozie complements the existing  Apache cloud
> computing projects by providing a flexible framework for  managing complex data
> processing tasks.
>
> Known Risks
> Orphaned Products
> The core  developers plan to work full time on the project. There is very little
> risk of Oozie getting orphaned since large companies like Yahoo! are
> extensively using it on their production Hadoop clusters. For example,  there
> are nearly 400 Yahoo! internal Oozie users and thousands of jobs  are processed
> hourly through Oozie in production. In addition, there are  nearly 400 active
> users (including Yahoo! internal and external) in the  email community where
> nearly 15 emails are exchanged per day.  Furthermore, there were more than 1500
> downloads of the Oozie binary in  the last eight months from the github site and
> a large number of  downloads were conducted by other companies such as Cloudera.
> Oozie has had three major releases and more than 15 patch releases in the last
> couple of years, which further demonstrates that Oozie is a very active project. We
> plan to extend and diversify this community further through Apache.
>
> Inexperience with Open Source
> The core  developers are all active users and followers of open source. They are
> already committers and contributors to the Oozie GitHub project. In addition,
> they are very familiar with Apache principles and philosophy for
> community-driven software development.
>
> Homogeneous Developers
> The core developers are from Yahoo! as well as from several other corporations,
> including Cloudera and IBM.
>
> Reliance on Salaried Developers
> Currently,  the developers are paid to do work on Oozie. Companies like Yahoo!
> and  Cloudera are invested in Oozie as the solution to the workflow  management
> and scheduling problem in Hadoop clusters, and that is not  likely to change. In
> addition, since workflow management is very important for most Hadoop-based
> data processing, non-salaried developers  and researchers from various
> institutes are expected to contribute to  the project.
>
> Relationships with Other Apache Products
> Oozie is  based on Apache Hadoop to manage jobs created by different Apache
> projects such as Hadoop, Pig, and Hive. Users of these products are  extensively
> using Oozie as their workflow scheduler.
>
> An Excessive Fascination with the Apache Brand
> We deeply  respect the reputation of Apache and have had great success with
> other  Apache projects such as Pig and HCatalog. We are motivated to expand and
> increase the adoption and development of Oozie following Apache’s  established
> open source model. We have also given reasons in the  Rationale and Alignment
> sections.
>
> Documentation
> Information about Oozie can be found at http://yahoo.github.com/oozie/. The
> following links provide more information about Oozie in open source:
>
>        * Codebase at GitHub: https://github.com/yahoo/oozie
>        * JIRA: http://oozie-jira.hadoop.developer.yahoo.net
>        * Continuous Integration (CI) build:
>          http://oozie-ci.hadoop.developer.yahoo.net/
>        * Yahoo user community: http://tech.groups.yahoo.com/group/Oozie-users/
>
> Initial Source
> Oozie has been under development since 2009 by a team of engineers at Yahoo!. It
> is currently hosted on GitHub under an Apache license at
> https://github.com/yahoo/oozie.
>
> External Dependencies
> The required external dependencies are all under the Apache License or
> compatible licenses. The components with non-Apache licenses are listed below:
>
>        * HSQLDB License: HSQLDB
>        * JDOM license: JDOM
>        * BSD: Serp
>        * CDDL v1: jaxb-api, ejb, JAF
> NOTE: With the exception of HSQLDB and JDOM, which are used directly by Oozie,
> the other listed components are transitive dependencies of other Apache
> components used by Oozie.
>
> Cryptography
> Oozie supports the Kerberos authentication mechanism to access secured Hadoop
> services.
>
> Required Resources
> Mailing Lists
>        * oozie-private for private PMC discussions (with moderated subscriptions)
>        * oozie-dev
>        * oozie-commits
>        * oozie-user
> Subversion Directory
> https://svn.apache.org/repos/asf/incubator/oozie
> Issue Tracking
> JIRA Oozie (OOZIE)
> Other Resources
> The  existing code already has unit tests, so we would like a Hudson instance
> to run them whenever a new patch is submitted. This can be added after  project
> creation.
>
> Initial Committers
>        * Mohammad K Islam (mislam77 at yahoo  dot com)
>        * Angelo K Huang (angelohuang at gmail dot com)
>        * Mayank Bansal (mabansal at gmail dot com)
>        * Andreas Neumann (neunand at gmail dot com)
>        * Alejandro Abdelnur (tucu00 at gmail dot com)
>        * Chao Wang (brookwc at gmail dot com)
> Affiliations
>        * Mohammad K Islam (Yahoo!)
>        * Angelo Huang (Yahoo!)
>        * Mayank Bansal (Yahoo!)
>        * Andreas Neumann (Yahoo!)
>        * Alejandro Abdelnur (Cloudera)
>        * Chao Wang (IBM)
> Sponsors
> Champion
> Alan Gates
> Nominated Mentors
>        * Owen O'Malley (Incubator PMC member)
>        * Alan Gates (Incubator PMC member)
>        * Christopher Douglas (Incubator PMC member)
>        * Devaraj Das (Hadoop PMC member)
> Sponsoring Entity
> We are requesting the Incubator to sponsor this project.
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Re: [VOTE] Oozie to join the Incubator

Posted by Tom White <to...@apache.org>.
+1

Tom

On Wed, Jun 29, 2011 at 8:10 PM, Mohammad Islam <mi...@yahoo.com> wrote:
> Hi All,
>
> The discussion about the Oozie proposal is settling down. Therefore, I would like to
> initiate a vote to accept Oozie as an Apache Incubator project.
>
> The latest proposal is pasted at the end and can also be found on the
> wiki:
>
> http://wiki.apache.org/incubator/OozieProposal
>
>
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
>
>
> Please cast your votes:
>
> [  ] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation
>
> This vote will close 72 hours  from now.
>
> Regards,
> Mohammad
>
>
> Abstract
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache Hadoop™.
>
> Proposal
> Oozie is an  extensible, scalable and reliable system to define, manage,
> schedule,  and execute complex Hadoop workloads via web services. More
> specifically, this includes:
>
>        * XML-based declarative framework to specify a job or a complex workflow of
> dependent jobs.
>
>        * Support for different types of jobs such as Hadoop Map-Reduce, Pipes, Streaming,
> Pig, Hive and custom Java applications.
>
>        * Workflow scheduling based on frequency and/or data availability.
>        * Monitoring capability, automatic retry and failure handling of jobs.
>        * Extensible and pluggable architecture to allow arbitrary grid programming
> paradigms.
>
>        * Authentication, authorization, and capacity-aware load throttling to allow
> multi-tenant software as a service.
>
> Background
> Most data  processing applications require multiple jobs to achieve their goals,
> with inherent dependencies among the jobs. A dependency could be  sequential,
> where one job can only start after another job has finished.  Or it could be
> conditional, where the execution of a job depends on the  return value or status
> of another job. In other cases, parallel  execution of multiple jobs may be
> permitted – or desired – to exploit  the massive pool of compute nodes provided
> by Hadoop.
>
> These  job dependencies are often expressed as a Directed Acyclic Graph, also
> called a workflow. A node in the workflow is typically a job (a  computation on
> the grid) or another type of action such as an email notification. Computations
> can be expressed in map/reduce, Pig, Hive or  any other programming paradigm
> available on the grid. Edges of the graph  represent transitions from one node
> to the next, as the execution of a  workflow proceeds.
>
> Describing  a workflow in a declarative way has the advantage of decoupling job
> dependencies and execution control from application logic. Furthermore,  the
> workflow is modularized into jobs that can be reused within the same  workflow
> or across different workflows. Execution of the workflow is  then driven by a
> runtime system without understanding the application  logic of the jobs. This
> runtime system specializes in reliable and  predictable execution: It can retry
> actions that have failed or invoke a  cleanup action after termination of the
> workflow; it can monitor  progress, success, or failure of a workflow, and send
> appropriate alerts  to an administrator. The application developer is relieved
> from  implementing these generic procedures.
>
> Furthermore,  some applications or workflows need to run in periodic intervals
> or  when dependent data is available. For example, a workflow could be  executed
> every day as soon as output data from the previous 24 instances  of another,
> hourly workflow is available. The workflow coordinator  provides such scheduling
> features, along with prioritization, load  balancing and throttling to optimize
> utilization of resources in the  cluster. This makes it easier to maintain,
> control, and coordinate  complex data applications.
>
> Nearly  three years ago, a team of Yahoo! developers addressed these critical
> requirements for Hadoop-based data processing systems by developing a  new
> workflow management and scheduling system called Oozie. While it was  initially
> developed as a Yahoo!-internal project, it was designed and  implemented with
> the intention of open-sourcing. Oozie was released as a GitHub project in early
> 2010. Oozie is used in production within Yahoo!, and since it was
> open-sourced it has been gaining adoption among external developers.
>
> Rationale
> Commonly,  applications that run on Hadoop require multiple Hadoop jobs in order
> to  obtain the desired results. Furthermore, these Hadoop jobs are commonly  a
> combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes
> map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs  and shell
> scripts.
>
> Because  of this, developers find themselves writing ad-hoc glue programs to
> combine these Hadoop jobs. These ad-hoc programs are difficult to  schedule,
> manage, monitor and recover.
>
> Workflow  management and scheduling is an essential feature for large-scale data
> processing applications. Such applications could implement a customized solution,
> but that would incur separate development, operational, and maintenance overhead.
> Since it is a prevalent use-case for data  processing, the application developer
> would surely prefer a generalized  solution with little or no such overhead.
> Oozie addresses the challenge  by providing an execution framework to flexibly
> specify the job  dependency, data dependency, and time dependency. In addition,
> Oozie  provides a multi-tenant-based centralized service and the opportunity to
> optimize load and utilization while respecting SLAs.
>
> Oozie is built on Apache Hadoop™ to schedule jobs related to various Apache
> projects such as Hadoop, Pig, and Hive. As an Apache open source project, Oozie
> is expected to  attract the larger and more diversified community that currently
> uses  such Apache sponsored projects. Additionally, users of the Hadoop
> ecosystem can influence Oozie’s roadmap and contribute to it. Likewise, Oozie,
> as part of the Apache Hadoop™ ecosystem, will be a great benefit to the current
> Hadoop/Pig/Hive/HBase/HCatalog community.
>
> Current Status
> Meritocracy
> Oozie is currently a GitHub-based open source project where developers from
> multiple companies are contributing to the project. Our intent with this
> incubator proposal is to further extend this diverse developer  community around
> Oozie following the Apache meritocracy model. We plan  to continue to provide
> adequate support to new developers and to quickly  recruit those who make solid
> contributions to committer status. In  addition, Oozie will expect, accept, and
> work to attract contributions  from amateurs as well.
>
> Community
> While an  efficient workflow management and scheduling system is critical for
> large companies with huge data processing in multi-tenant clusters, it  is
> equally necessary for any non-trivial deployment. Different companies  are
> currently using Oozie as a workflow scheduler for Hadoop-based data  processing.
> At Yahoo! it is being used extensively in production clusters to process
> thousands of jobs. Like the Oozie user community, the Oozie developer community
> is also very strong. Developers from Yahoo!  provided the initial code base, and
> they are still the most active  contributors. In late 2010, developers from
> Cloudera also started  contributing, and currently other companies (e.g., IBM)
> are beginning to  participate.
>
> We currently use JIRA for issue tracking, GitHub for code hosting, and Yahoo!
> Groups for developer and user communications.
>
> Core Developers
> Oozie is  currently being designed and developed by four engineers from Yahoo! –
> Mohammad Islam, Angelo Huang, Mayank Bansal, and Andreas Neumann. In  addition,
> many outside contributors are actively contributing to design and development.
> Among them, Alejandro Abdelnur from Cloudera and Chao  Wang from IBM are very
> important contributors. All of these core  developers have deep expertise in
> Hadoop and the Hadoop Ecosystem in  general.
>
> Alignment
> The ASF is a  natural host for Oozie given that it is already the home of
> Hadoop,  Pig, Hive, and other emerging cloud software projects. Oozie was
> designed to support Hadoop from the beginning in order to solve data  processing
> challenges in Hadoop clusters. Oozie complements the existing  Apache cloud
> computing projects by providing a flexible framework for  managing complex data
> processing tasks.
>
> Known Risks
> Orphaned Products
> The core  developers plan to work full time on the project. There is very little
> risk of Oozie getting orphaned since large companies like Yahoo! are
> extensively using it on their production Hadoop clusters. For example,  there
> are nearly 400 Yahoo! internal Oozie users and thousands of jobs  are processed
> hourly through Oozie in production. In addition, there are  nearly 400 active
> users (including Yahoo! internal and external) in the  email community where
> nearly 15 emails are exchanged per day.  Furthermore, there were more than 1500
> downloads of the Oozie binary in  the last eight months from the github site and
> a large number of  downloads were conducted by other companies such as Cloudera.
> Oozie has had three major releases and more than 15 patch releases in the last
> couple of years, which further demonstrates that Oozie is a very active project. We
> plan to extend and diversify this community further through Apache.
>
> Inexperience with Open Source
> The core  developers are all active users and followers of open source. They are
> already committers and contributors to the Oozie GitHub project. In addition,
> they are very familiar with Apache principles and philosophy for
> community-driven software development.
>
> Homogeneous Developers
> The core developers are from Yahoo! as well as from several other corporations,
> including Cloudera and IBM.
>
> Reliance on Salaried Developers
> Currently,  the developers are paid to do work on Oozie. Companies like Yahoo!
> and  Cloudera are invested in Oozie as the solution to the workflow  management
> and scheduling problem in Hadoop clusters, and that is not  likely to change. In
> addition, since workflow management is very important for most Hadoop-based
> data processing, non-salaried developers  and researchers from various
> institutes are expected to contribute to  the project.
>
> Relationships with Other Apache Products
> Oozie is  based on Apache Hadoop to manage jobs created by different Apache
> projects such as Hadoop, Pig, and Hive. Users of these products are  extensively
> using Oozie as their workflow scheduler.
>
> An Excessive Fascination with the Apache Brand
> We deeply  respect the reputation of Apache and have had great success with
> other  Apache projects such as Pig and HCatalog. We are motivated to expand and
> increase the adoption and development of Oozie following Apache’s  established
> open source model. We have also given reasons in the  Rationale and Alignment
> sections.
>
> Documentation
> Information about Oozie can be found at http://yahoo.github.com/oozie/. The
> following links provide more information about Oozie in open source:
>
>        * Codebase at GitHub: https://github.com/yahoo/oozie
>        * JIRA: http://oozie-jira.hadoop.developer.yahoo.net
>        * Continuous Integration (CI) build:
>          http://oozie-ci.hadoop.developer.yahoo.net/
>        * Yahoo user community: http://tech.groups.yahoo.com/group/Oozie-users/
>
> Initial Source
> Oozie has been under development since 2009 by a team of engineers at Yahoo!. It
> is currently hosted on GitHub under an Apache license at
> https://github.com/yahoo/oozie.
>
> External Dependencies
> The required external dependencies are all under the Apache License or
> compatible licenses. The components with non-Apache licenses are listed below:
>
>        * HSQLDB License: HSQLDB
>        * JDOM license: JDOM
>        * BSD: Serp
>        * CDDL v1: jaxb-api, ejb, JAF
> NOTE: With the exception of HSQLDB and JDOM, which are used directly by Oozie,
> the other listed components are transitive dependencies of other Apache
> components used by Oozie.
>
> Cryptography
> Oozie supports the Kerberos authentication mechanism to access secured Hadoop
> services.
>
> Required Resources
> Mailing Lists
>        * oozie-private for private PMC discussions (with moderated subscriptions)
>        * oozie-dev
>        * oozie-commits
>        * oozie-user
> Subversion Directory
> https://svn.apache.org/repos/asf/incubator/oozie
> Issue Tracking
> JIRA Oozie (OOZIE)
> Other Resources
> The  existing code already has unit tests, so we would like a Hudson instance
> to run them whenever a new patch is submitted. This can be added after  project
> creation.
>
> Initial Committers
>        * Mohammad K Islam (mislam77 at yahoo  dot com)
>        * Angelo K Huang (angelohuang at gmail dot com)
>        * Mayank Bansal (mabansal at gmail dot com)
>        * Andreas Neumann (neunand at gmail dot com)
>        * Alejandro Abdelnur (tucu00 at gmail dot com)
>        * Chao Wang (brookwc at gmail dot com)
> Affiliations
>        * Mohammad K Islam (Yahoo!)
>        * Angelo Huang (Yahoo!)
>        * Mayank Bansal (Yahoo!)
>        * Andreas Neumann (Yahoo!)
>        * Alejandro Abdelnur (Cloudera)
>        * Chao Wang (IBM)
> Sponsors
> Champion
> Alan Gates
> Nominated Mentors
>        * Owen O'Malley (Incubator PMC member)
>        * Alan Gates (Incubator PMC member)
>        * Christopher Douglas (Incubator PMC member)
>        * Devaraj Das (Hadoop PMC member)
> Sponsoring Entity
> We are requesting the Incubator to sponsor this project.
>
