Posted to general@hadoop.apache.org by Eli Collins <el...@cloudera.com> on 2010/05/21 22:42:28 UTC

[DISCUSSION] Proposal for making core Hadoop changes

As HDFS and MapReduce have matured the cost and complexity of
introducing features has grown. Each new feature has to consider
interactions with a growing set of existing features, a growing user
base (upgrades, backwards compatibility) and additional use cases
(more and more projects now build on them). At the same time we don't
want the high bar for contribution to unnecessarily hinder new
development and releases.

Many projects at a similar stage address this by adopting a more
formal way to describe, socialize and shepherd enhancements to their
platforms. Today, new features are often discussed via an umbrella
jira, which may have an attached design document. There are a number
of issues with this approach. The design documents vary in format and
quality, and are often reviewed by a limited audience. They aren't
version controlled. Sometimes the proposal is only partially
specified. Jiras are often ignored. Understanding a proposal and its
implications through a series of threads in the jira comments is
difficult. It's hard for contributors and users to find these
top-level jiras and follow their status.

I'd like to propose that core Hadoop adopts something similar to
Python's PEP (Python Enhancement Proposal) [1]. A "HEP" would be a
single primary mechanism for proposing new features, incorporating
community feedback, and recording decisions. The author of the HEP
would be responsible for building consensus and moving the feature
forward. Similarly, some subset of the community would be responsible
for reviewing HEPs in a timely manner and identifying missing pieces
in the proposal. Discussion would occur before patches showed up on
jira. People interested in the core Hadoop roadmap could keep an eye
on the HEPs without the overhead of following jira traffic.
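
To make that concrete, a HEP could carry a PEP-style preamble that
records its status and history in one place. The fields below are only
illustrative (loosely adapted from PEP 1; the exact set would be part
of what we work out):

  HEP: <number>
  Title: <short name for the feature>
  Author: <the shepherd responsible for building consensus>
  Status: Draft | Accepted | Rejected | Final
  Type: Feature | Process
  Created: <date>
  Discussion: <link to the general@ thread>
  Jira: <umbrella jira, once patches start to land>

Since HEPs would be version controlled, the preamble and the revisions
behind it stay easy to find and follow.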

Why base this on the PEP? The format has proven useful to a
substantial existing project, and I think the workflow is not too
heavy-weight, and well-suited to a community such as ours. That being
said, we could discuss other models (eg Java's JSR).

Before we get into specifics, is this something the community would
like to adopt in some form? Does adapting the PEP and its workflow to
our projects, community and bylaws seem reasonable?

Thanks,
Eli

1. http://www.python.org/dev/peps/pep-0001

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Eli Collins <el...@cloudera.com>.
Hey Konstantin,

Apologies for the delay, busy with stuff for the summit.  I'll get a
concrete proposal out to general this week, based on our discussion at
the contributor's meeting.

Thanks,
Eli

On Mon, Jun 28, 2010 at 5:50 PM, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:
> Eli,
>
> Just checking on the status of this proposal.
>
> In the past I was hesitant about introducing more formalities.
> I now think we really need some mechanism for
> new feature and project proposals, also tracking decisions.
> For exactly the reasons you describe in your email.
> Whether it is going to be HEP or something else, it is best
> if we adopt it soon.
>
> Thanks,
> --Konstantin
>
>
> On 5/21/2010 1:42 PM, Eli Collins wrote:
>>
>> As HDFS and MapReduce have matured the cost and complexity of
>> introducing features has grown. Each new feature has to consider
>> interactions with a growing set of existing features, a growing user
>> base (upgrades, backwards compatibility) and additional use cases
>> (more and more projects now build on them). At the same time we don't
>> want the high bar for contribution to unnecessarily hinder new
>> development and releases.
>>
>> Many projects at a similar stage address this by adopting a more
>> formal way to describe, socialize and shepherd enhancements to their
>> platforms. Today, new features are often discussed via an umbrella
>> jira, which may have an attached design document. There are a number
>> of issues with this approach. The design documents vary in format and
>> quality, and are often reviewed by a limited audience. They aren't
>> version controlled. Sometimes the proposal is only partially
>> specified. Jiras are often ignored. Understanding a proposal and its
>> implications through a series of threads in the jira comments is
>> difficult. It's hard for contributors and users to find these
>> top-level jiras and follow their status.
>>
>> I'd like to propose that core Hadoop adopts something similar to
>> Python's PEP (Python Enhancement Proposal) [1]. A "HEP" would be a
>> single primary mechanism for proposing new features, incorporating
>> community feedback, and recording decisions. The author of the HEP
>> would be responsible for building consensus and moving the feature
>> forward. Similarly, some subset of the community would be responsible
>> for reviewing HEPs in a timely manner and identifying missing pieces
>> in the proposal. Discussion would occur before patches showed up on
>> jira. People interested in the core Hadoop roadmap could keep an eye
>> on the HEPs without the overhead of following jira traffic.
>>
>> Why base this on the PEP? The format has proven useful to a
>> substantial existing project, and I think the workflow is not too
>> heavy-weight, and well-suited to a community such as ours. That being
>> said, we could discuss other models (eg Java's JSR).
>>
>> Before we get into specifics, is this something the community would
>> like to adopt in some form? Does adapting the PEP and its workflow to
>> our projects, community and bylaws seem reasonable?
>>
>> Thanks,
>> Eli
>>
>> 1. http://www.python.org/dev/peps/pep-0001
>>
>
>

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Bernd Fondermann <be...@googlemail.com>.
On Tue, Jun 29, 2010 at 19:11, Jay Booth <ja...@gmail.com> wrote:
> Well, if people decide that some system more organized than email
> threads is better for keeping track of major project proposals, it may
> help with some aspects of the project.

As long as this system is under the control of our infra team, this is ok.

> Or it may not.  The fact that
> a pro forma vote by email may also be required at some points to make
> something "official" shouldn't be a major reason against such a
> system, if it's otherwise a good idea.

Discussions must also take place on-list.
There is no such thing as "pro forma" on-list activity.

  Bernd

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Jay Booth <ja...@gmail.com>.
Well, if people decide that some system more organized than email
threads is better for keeping track of major project proposals, it may
help with some aspects of the project.  Or it may not.  The fact that
a pro forma vote by email may also be required at some points to make
something "official" shouldn't be a major reason against such a
system, if it's otherwise a good idea.

On Tue, Jun 29, 2010 at 11:29 AM, Bernd Fondermann
<be...@googlemail.com> wrote:
> On Tue, Jun 29, 2010 at 02:50, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:
>> Eli,
>>
>> Just checking on the status of this proposal.
>>
>> In the past I was hesitant about introducing more formalities.
>> I now think we really need some mechanism for
>> new feature and project proposals, also tracking decisions.
>
> Making and tracking decisions at Apache is done via public ASF mailing
> lists, exclusively.
> Any other means of communication, including face-to-face, JIRA, IRC
> etc., is *not binding*.
> Every community member has equal say (only PMC members' votes are
> binding though).
> Committers can veto commits and commit to svn. PMC members have
> special rights and duties, too, as described in our Bylaws.
>
> That's about it.
>
> If Hadoop has issues tracking and making decisions, you won't fix that
> by introducing any formalities.
>
>  Bernd
>

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Bernd Fondermann <be...@googlemail.com>.
On Tue, Jun 29, 2010 at 20:02, Eli Collins <el...@cloudera.com> wrote:
> On Tuesday, June 29, 2010, Bernd Fondermann
> <be...@googlemail.com> wrote:
>> On Tue, Jun 29, 2010 at 02:50, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:
>>> Eli,
>>>
>>> Just checking on the status of this proposal.
>>>
>>> In the past I was hesitant about introducing more formalities.
>>> I now think we really need some mechanism for
>>> new feature and project proposals, also tracking decisions.
>>
>> Making and tracking decisions at Apache is done via public ASF mailing
>> lists, exclusively.
>
> All proposals will be discussed and voted on the public lists.  Per
> the original mail the proposal must be compatible with current bylaws.

Thanks for the clarification.

  Bernd

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Eli Collins <el...@cloudera.com>.
On Tuesday, June 29, 2010, Bernd Fondermann
<be...@googlemail.com> wrote:
> On Tue, Jun 29, 2010 at 02:50, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:
>> Eli,
>>
>> Just checking on the status of this proposal.
>>
>> In the past I was hesitant about introducing more formalities.
>> I now think we really need some mechanism for
>> new feature and project proposals, also tracking decisions.
>
> Making and tracking decisions at Apache is done via public ASF mailing
> lists, exclusively.

All proposals will be discussed and voted on the public lists.  Per
the original mail the proposal must be compatible with current bylaws.

Thanks,
Eli

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Bernd Fondermann <be...@googlemail.com>.
On Tue, Jun 29, 2010 at 02:50, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:
> Eli,
>
> Just checking on the status of this proposal.
>
> In the past I was hesitant about introducing more formalities.
> I now think we really need some mechanism for
> new feature and project proposals, also tracking decisions.

Making and tracking decisions at Apache is done via public ASF mailing
lists, exclusively.
Any other means of communication, including face-to-face, JIRA, IRC
etc., is *not binding*.
Every community member has equal say (only PMC members' votes are
binding though).
Committers can veto commits and commit to svn. PMC members have
special rights and duties, too, as described in our Bylaws.

That's about it.

If Hadoop has issues tracking and making decisions, you won't fix that
by introducing any formalities.

  Bernd

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Eli,

Just checking on the status of this proposal.

In the past I was hesitant about introducing more formalities.
I now think we really need some mechanism for
new feature and project proposals, also tracking decisions.
For exactly the reasons you describe in your email.
Whether it is going to be HEP or something else, it is best
if we adopt it soon.

Thanks,
--Konstantin


On 5/21/2010 1:42 PM, Eli Collins wrote:
> As HDFS and MapReduce have matured the cost and complexity of
> introducing features has grown. Each new feature has to consider
> interactions with a growing set of existing features, a growing user
> base (upgrades, backwards compatibility) and additional use cases
> (more and more projects now build on them). At the same time we don't
> want the high bar for contribution to unnecessarily hinder new
> development and releases.
>
> Many projects at a similar stage address this by adopting a more
> formal way to describe, socialize and shepherd enhancements to their
> platforms. Today, new features are often discussed via an umbrella
> jira, which may have an attached design document. There are a number
> of issues with this approach. The design documents vary in format and
> quality, and are often reviewed by a limited audience. They aren't
> version controlled. Sometimes the proposal is only partially
> specified. Jiras are often ignored. Understanding a proposal and its
> implications through a series of threads in the jira comments is
> difficult. It's hard for contributors and users to find these
> top-level jiras and follow their status.
>
> I'd like to propose that core Hadoop adopts something similar to
> Python's PEP (Python Enhancement Proposal) [1]. A "HEP" would be a
> single primary mechanism for proposing new features, incorporating
> community feedback, and recording decisions. The author of the HEP
> would be responsible for building consensus and moving the feature
> forward. Similarly, some subset of the community would be responsible
> for reviewing HEPs in a timely manner and identifying missing pieces
> in the proposal. Discussion would occur before patches showed up on
> jira. People interested in the core Hadoop roadmap could keep an eye
> on the HEPs without the overhead of following jira traffic.
>
> Why base this on the PEP? The format has proven useful to a
> substantial existing project, and I think the workflow is not too
> heavy-weight, and well-suited to a community such as ours. That being
> said, we could discuss other models (eg Java's JSR).
>
> Before we get into specifics, is this something the community would
> like to adopt in some form? Does adapting the PEP and its workflow to
> our projects, community and bylaws seem reasonable?
>
> Thanks,
> Eli
>
> 1. http://www.python.org/dev/peps/pep-0001
>


Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Sure, each project can choose to use the framework in the way they see fit
on Launchpad. I wanted to call out their use of metadata as being
particularly nice. We may want to consider similar fields and applications
of those fields for HEPs.

On Tue, Jun 1, 2010 at 2:28 PM, Eli Collins <el...@cloudera.com> wrote:

> Hey Jeff,
>
> Blueprints (it's a Launchpad thing) is more of an issue tracking
> system (Launchpad doesn't put features/enhancements in its bug
> database); e.g. Drizzle has lots of blueprints, including blueprints
> for cleaning up code, adding config flags, etc. We'll use jira for that
> kind of stuff; the HEP is for larger stuff that needs more upfront
> discussion.
>
> Thanks,
> Eli
>
> On Mon, May 31, 2010 at 10:16 AM, Jeff Hammerbacher <ha...@cloudera.com>
> wrote:
> > A far more lightweight example of multi-issue feature planning in an open
> > source project comes from Drizzle and their "blueprints":
> > https://blueprints.launchpad.net/drizzle.
> >
> > Each "spec" has a drafter, an approver, and an assignee; declares the
> > other specs on which it depends; points to the relevant branches in
> > the source tree and issues in the issue tracker; and has a priority,
> > definition state, and implementation state.
> >
> > I don't know how it's working out for them in practice, but on paper it
> > looks quite nice.
> >
> > On Wed, May 26, 2010 at 9:13 AM, Eli Collins <el...@cloudera.com> wrote:
> >
> >> > No, but I'd estimate the cost of merging at 1-2 days work a week
> >> > just to pull in the code *and identify why the tests are failing*.
> >> > Git may be better at merging in changes, but if Hadoop doesn't work
> >> > on my machine after the merge, I need to identify whether it's my
> >> > code, the merged code, some machine quirk, etc. It's the testing
> >> > that is the problem for me, not the merge effort. That's Hadoop's
> >> > own tests and my own functional test suites, the ones that bring up
> >> > clusters and push work through. Those are the troublespots, as they
> >> > do things that hadoop's own tests don't do, like ask for all the
> >> > JSP pages.
> >>
> >> I've lived off a git branch of common/hdfs for half a year with a big
> >> uncommitted patch; it's nowhere near 1-2 days of effort per week to
> >> merge in changes from trunk. If the tests are passing on trunk and
> >> they fail after your merge, then those are real test failures due to
> >> your change (and therefore should require effort). The issue with
> >> your internal tests failing due to changes on trunk is the same
> >> whether you merge or just do an update - you have to update before
> >> checking in the patch anyway - so that issue is about the state of
> >> trunk when you merge or update, rather than about being on a branch.
> >>
> >> >
> >> >> Might find the
> >> >> following interesting:
> >> >> http://incubator.apache.org/learn/rules-for-revolutionaries.html
> >> >
> >> > There's a long story behind JDD's paper. I'm glad you have read it;
> >> > it does lay out what is effectively the ASF process for effecting
> >> > significant change, but it doesn't imply that's the only process
> >> > for making changes.
> >> >
> >>
> >> Just to be clear, I don't mean to imply that branches are the only
> >> process for making changes. Interesting that this is considered the
> >> effective ASF process; it hasn't seemed to me that recent big features
> >> on hadoop have used it. The only one I'm aware of that was done on a
> >> branch was append.
> >>
> >> > I think gradual evolution in trunk is good; it lets people play
> >> > with what's coming in. Having lots of separate branches and
> >> > everyone's private release being a merge of many patches that you
> >> > choose is bad.
> >>
> >> Agreed.  Personally I don't think people should release from branches.
> >> And in practice I don't think you'll see lots of branches, people can
> >> and would still develop on trunk. Getting changes merged from a branch
> >> back to trunk before the whole branch is merged is a good thing; the
> >> whole branch may never be merged and that's OK too. Branches are a
> >> mechanism, releases are policy.
> >>
> >> Thanks,
> >> Eli
> >>
> >
>

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Eli Collins <el...@cloudera.com>.
Hey Jeff,

Blueprints (it's a Launchpad thing) is more of an issue tracking
system (Launchpad doesn't put features/enhancements in its bug
database); e.g. Drizzle has lots of blueprints, including blueprints
for cleaning up code, adding config flags, etc. We'll use jira for that
kind of stuff; the HEP is for larger stuff that needs more upfront
discussion.

Thanks,
Eli

On Mon, May 31, 2010 at 10:16 AM, Jeff Hammerbacher <ha...@cloudera.com> wrote:
> A far more lightweight example of multi-issue feature planning in an open
> source project comes from Drizzle and their "blueprints":
> https://blueprints.launchpad.net/drizzle.
>
> Each "spec" has a drafter, an approver, and an assignee; declares the other
> specs on which it depends; points to the relevant branches in the source
> tree and issues in the issue tracker; and has a priority, definition state,
> and implementation state.
>
> I don't know how it's working out for them in practice, but on paper it
> looks quite nice.
>
> On Wed, May 26, 2010 at 9:13 AM, Eli Collins <el...@cloudera.com> wrote:
>
>> > No, but I'd estimate the cost of merging at 1-2 days work a week just to
>> > pull in the code *and identify why the tests are failing*. Git may be
>> > better at merging in changes, but if Hadoop doesn't work on my machine
>> > after the merge, I need to identify whether it's my code, the merged
>> > code, some machine quirk, etc. It's the testing that is the problem for
>> > me, not the merge effort. That's Hadoop's own tests and my own
>> > functional test suites, the ones that bring up clusters and push work
>> > through. Those are the troublespots, as they do things that hadoop's
>> > own tests don't do, like ask for all the JSP pages.
>>
>> I've lived off a git branch of common/hdfs for half a year with a big
>> uncommitted patch; it's nowhere near 1-2 days of effort per week to
>> merge in changes from trunk. If the tests are passing on trunk and
>> they fail after your merge, then those are real test failures due to
>> your change (and therefore should require effort). The issue with
>> your internal tests failing due to changes on trunk is the same
>> whether you merge or just do an update - you have to update before
>> checking in the patch anyway - so that issue is about the state of
>> trunk when you merge or update, rather than about being on a branch.
>>
>> >
>> >> Might find the
>> >> following interesting:
>> >> http://incubator.apache.org/learn/rules-for-revolutionaries.html
>> >
>> > There's a long story behind JDD's paper. I'm glad you have read it;
>> > it does lay out what is effectively the ASF process for effecting
>> > significant change, but it doesn't imply that's the only process for
>> > making changes.
>> >
>>
>> Just to be clear, I don't mean to imply that branches are the only
>> process for making changes. Interesting that this is considered the
>> effective ASF process; it hasn't seemed to me that recent big features
>> on hadoop have used it. The only one I'm aware of that was done on a
>> branch was append.
>>
>> > I think gradual evolution in trunk is good; it lets people play with
>> > what's coming in. Having lots of separate branches and everyone's
>> > private release being a merge of many patches that you choose is bad.
>>
>> Agreed.  Personally I don't think people should release from branches.
>> And in practice I don't think you'll see lots of branches, people can
>> and would still develop on trunk. Getting changes merged from a branch
>> back to trunk before the whole branch is merged is a good thing; the
>> whole branch may never be merged and that's OK too. Branches are a
>> mechanism, releases are policy.
>>
>> Thanks,
>> Eli
>>
>

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
A far more lightweight example of multi-issue feature planning in an open
source project comes from Drizzle and their "blueprints":
https://blueprints.launchpad.net/drizzle.

Each "spec" has a drafter, an approver, and an assignee; declares the other
specs on which it depends; points to the relevant branches in the source
tree and issues in the issue tracker; and has a priority, definition state,
and implementation state.

I don't know how it's working out for them in practice, but on paper it
looks quite nice.
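
To sketch how their metadata might map onto a HEP (the field names and
values here are hypothetical, not Launchpad's actual schema):

  Drafter:        writes and revises the spec
  Approver:       signs off before implementation starts
  Assignee:       drives the implementation
  Depends-On:     other specs this one builds on
  Branch:         relevant branch in the source tree
  Jira:           issues in the issue tracker
  Priority:       e.g. Essential / High / Medium / Low
  Definition:     Drafting -> Pending approval -> Approved
  Implementation: Not started -> In progress -> Implemented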

On Wed, May 26, 2010 at 9:13 AM, Eli Collins <el...@cloudera.com> wrote:

> > No, but I'd estimate the cost of merging at 1-2 days work a week just to
> > pull in the code *and identify why the tests are failing*. Git may be
> > better at merging in changes, but if Hadoop doesn't work on my machine
> > after the merge, I need to identify whether it's my code, the merged
> > code, some machine quirk, etc. It's the testing that is the problem for
> > me, not the merge effort. That's Hadoop's own tests and my own
> > functional test suites, the ones that bring up clusters and push work
> > through. Those are the troublespots, as they do things that hadoop's
> > own tests don't do, like ask for all the JSP pages.
>
> I've lived off a git branch of common/hdfs for half a year with a big
> uncommitted patch; it's nowhere near 1-2 days of effort per week to
> merge in changes from trunk. If the tests are passing on trunk and
> they fail after your merge, then those are real test failures due to
> your change (and therefore should require effort). The issue with
> your internal tests failing due to changes on trunk is the same
> whether you merge or just do an update - you have to update before
> checking in the patch anyway - so that issue is about the state of
> trunk when you merge or update, rather than about being on a branch.
>
> >
> >> Might find the
> >> following interesting:
> >> http://incubator.apache.org/learn/rules-for-revolutionaries.html
> >
> > There's a long story behind JDD's paper. I'm glad you have read it;
> > it does lay out what is effectively the ASF process for effecting
> > significant change, but it doesn't imply that's the only process for
> > making changes.
> >
>
> Just to be clear, I don't mean to imply that branches are the only
> process for making changes. Interesting that this is considered the
> effective ASF process; it hasn't seemed to me that recent big features
> on hadoop have used it. The only one I'm aware of that was done on a
> branch was append.
>
> > I think gradual evolution in trunk is good; it lets people play with
> > what's coming in. Having lots of separate branches and everyone's
> > private release being a merge of many patches that you choose is bad.
>
> Agreed.  Personally I don't think people should release from branches.
> And in practice I don't think you'll see lots of branches, people can
> and would still develop on trunk. Getting changes merged from a branch
> back to trunk before the whole branch is merged is a good thing; the
> whole branch may never be merged and that's OK too. Branches are a
> mechanism, releases are policy.
>
> Thanks,
> Eli
>

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Eli Collins <el...@cloudera.com>.
> No, but I'd estimate the cost of merging at 1-2 days work a week just to
> pull in the code *and identify why the tests are failing*. Git may be better
> at merging in changes, but if Hadoop doesn't work on my machine after the
> merge, I need to identify whether it's my code, the merged code, some machine
> quirk, etc. It's the testing that is the problem for me, not the
> merge effort. That's Hadoop's own tests and my own functional test suites,
> the ones that bring up clusters and push work through. Those are the
> troublespots, as they do things that hadoop's own tests don't do, like ask
> for all the JSP pages.

I've lived off a git branch of common/hdfs for half a year with a big
uncommitted patch; it's nowhere near 1-2 days of effort per week to
merge in changes from trunk. If the tests are passing on trunk and
they fail after your merge, then those are real test failures due to
your change (and therefore should require effort). The issue with
your internal tests failing due to changes on trunk is the same
whether you merge or just do an update - you have to update before
checking in the patch anyway - so that issue is about the state of
trunk when you merge or update, rather than about being on a branch.
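
(Concretely, the week-to-week routine I'm describing is roughly the
following, assuming a git clone of the Apache mirror that tracks trunk
as origin/trunk; the branch name and test target are illustrative:

  git checkout my-big-patch   # the long-lived feature branch
  git fetch origin            # pick up what has landed on trunk
  git merge origin/trunk      # fold trunk into the feature branch
  ant test                    # new failures here are merge fallout

The merge itself is rarely the expensive part; rerunning the tests is.)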

>
>> Might find the
>> following interesting:
>> http://incubator.apache.org/learn/rules-for-revolutionaries.html
>
> There's a long story behind JDD's paper. I'm glad you have read it; it does
> lay out what is effectively the ASF process for effecting significant change,
> but it doesn't imply that's the only process for making changes.
>

Just to be clear, I don't mean to imply that branches are the only process
for making changes. Interesting that this is considered the effective
ASF process; it hasn't seemed to me that recent big features on hadoop
have used it. The only one I'm aware of that was done on a branch was
append.

> I think gradual evolution in trunk is good; it lets people play with what's
> coming in. Having lots of separate branches and everyone's private release
> being a merge of many patches that you choose is bad.

Agreed.  Personally I don't think people should release from branches.
And in practice I don't think you'll see lots of branches, people can
and would still develop on trunk. Getting changes merged from a branch
back to trunk before the whole branch is merged is a good thing; the
whole branch may never be merged and that's OK too. Branches are a
mechanism, releases are policy.

Thanks,
Eli

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Steve Loughran <st...@apache.org>.
Eli Collins wrote:

> The cost of adding features has gotten high anyway (even without
> branching). It's a classic trade-off -- merge overhead vs moving
> faster without burdening others -- as the overhead imposed on others
> increases and tools (git) make it easier to live and collaborate on
> branches, it makes more sense

maybe, but if you are trying to keep >1 branch in sync, all the low-cost
refactorings become expensive to perform:
  -renaming variables
  -hitting the reformat-code button to align the code with the project
layout rules
  -moving methods around
Life is simplest if you own the entire codebase and can move stuff
around without any discussion. Closed source projects can do that, but
even then it annoys other team members. In any OSS project, keeping
stuff more stable makes it easier to take in third-party patches, and
ensures that stack traces from various versions all point to roughly the
same code, which is always handy. Once you try to keep multiple branches
alive, it becomes very hard to do big changes in trunk.

> (you don't need a team of engineers or
> a dedicated merge engineer to maintain the branch).

No, but I'd estimate the cost of merging at 1-2 days work a week just to
pull in the code *and identify why the tests are failing*. Git may be
better at merging in changes, but if Hadoop doesn't work on my machine
after the merge, I need to identify whether it's my code, the merged
code, some machine quirk, etc. It's the testing that is the problem for
me, not the merge effort. That's Hadoop's own tests and my own functional
test suites, the ones that bring up clusters and push work through. Those
are the troublespots, as they do things that hadoop's own tests don't do,
like ask for all the JSP pages.

> Might find the
> following interesting:
> http://incubator.apache.org/learn/rules-for-revolutionaries.html

There's a long story behind JDD's paper. I'm glad you have read it; it
does lay out what is effectively the ASF process for effecting
significant change, but it doesn't imply that's the only process for
making changes.

One of the big issues is that in any successful project it becomes hard
to do a big rewrite, and you end up with what was done early on, despite
known issues. The "Some Thoughts on Ant 1.3 and 2.0" discussion is
related to this: we -and I wasn't a committer at this time, just a user-
weren't able to do the big rework, so we are left today with the design
errors of the past (like the way undefined properties just get retained
as ${undefined.property} instead of some kind of error appearing):
http://www.mail-archive.com/ant-dev@jakarta.apache.org/msg05984.html

I think gradual evolution in trunk is good; it lets people play with
what's coming in. Having lots of separate branches and everyone's
private release being a merge of many patches that you choose is bad,
because it means my version != your version != anyone else's, which
implies that your tests mean nothing to me unless I also test at scale.
I can do that, but with different hardware and network configs from
other people, it's still tricky to assign blame. Is it my merge that
isn't working, is it some quirk of virtualisation underneath, or is it
just this week's trunk playing up?


Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Eli Collins <el...@cloudera.com>.
On Tue, May 25, 2010 at 9:28 AM, Steve Loughran <st...@apache.org> wrote:
> Jeff Hammerbacher wrote:
>>>
>>> For comparison, anyone have any references to similar processes?
>>>
>>
>> Java has the Java Community Process: http://jcp.org/en/home/index
>>
>
> a process that nobody liked; see for example this comment by GregW of the
> jetty team on Servlet 3.0:
> http://blogs.webtide.com/gregw/entry/servlet_3_0_public_review
>
> JCP has some advantages over standards bodies I've been in:
>  * they recognise the value of tests.
>  * better remote collaboration
>  * more open to interested third parties
> But that's it. Very vendor-managed (Sun was usually in charge), and you'd be
> hard-pressed to find anyone on the Apache jcp-open list (yes, we have one!)
> who is happy.

The JCP seems heavyweight; we'll want to make sure the pendulum
doesn't swing too far in the opposite direction.
It would be interesting if there are other good lightweight alternatives
to the PEP; I looked and didn't turn up many.

> * evolution in the codebase is a good way of getting stuff to meet people's
> needs. If you have to have big branches until things are perfect, you have
> the cost of maintaining branches, and it's harder for people to experiment
> with your stuff.
> * If the cost of adding features is high -and maintaining branches, merging,
> identifying test failures is high- the barrier to participation is pretty
> steep: you need a team of engineers to work on every feature

The cost of adding features has gotten high anyway (even without
branching). It's a classic trade-off -- merge overhead vs moving
faster without burdening others -- as the overhead imposed on others
increases and tools (git) make it easier to live and collaborate on
branches, it makes more sense (you don't need a team of engineers or
a dedicated merge engineer to maintain the branch). Might find the
following interesting:
http://incubator.apache.org/learn/rules-for-revolutionaries.html

Thanks,
Eli

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Steve Loughran <st...@apache.org>.
Jeff Hammerbacher wrote:
>> For comparison, anyone have any references to similar processes?
>>
> 
> Java has the Java Community Process: http://jcp.org/en/home/index
> 

a process that nobody liked; see for example this comment by GregW of the
jetty team on Servlet 3.0:
http://blogs.webtide.com/gregw/entry/servlet_3_0_public_review

JCP has some advantages over standards bodies I've been in:
  * they recognise the value of tests.
  * better remote collaboration
  * more open to interested third parties
But that's it. Very vendor-managed (Sun was usually in charge), and you'd
be hard-pressed to find anyone on the Apache jcp-open list (yes, we have
one!) who is happy.

I haven't looked at Eli's proposal in enough detail to comment; here
are my thoughts from working on the lifecycle stuff, and on other ASF
projects.

* evolution in the codebase is a good way of getting stuff to meet
people's needs. If you have to have big branches until things are
perfect, you have the cost of maintaining branches, and it's harder for
people to experiment with your stuff.
* If the cost of adding features is high -and maintaining branches,
merging, identifying test failures is high- the barrier to participation
is pretty steep: you need a team of engineers to work on every feature

-steve

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
> For comparison, anyone have any references to similar processes?
>

Java has the Java Community Process: http://jcp.org/en/home/index

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Nigel Daley <nd...@mac.com>.
+1 to better process around feature enhancements. I like that PEP
includes process enhancements too.

For comparison, anyone have any references to similar processes?

Cheers,
Nige

On May 21, 2010, at 1:42 PM, Eli Collins <el...@cloudera.com> wrote:

> As HDFS and MapReduce have matured the cost and complexity of
> introducing features has grown. Each new feature has to consider
> interactions with a growing set of existing features, a growing user
> base (upgrades, backwards compatibility) and additional use cases
> (more and more projects now build on them). At the same time we don't
> want the high bar for contribution to unnecessarily hinder new
> development and releases.
>
> Many projects at a similar stage address this by adopting a more
> formal way to describe, socialize and shepherd enhancements to their
> platforms. Today, new features are often discussed via an umbrella
> jira, which may have an attached design document. There are a number
> of issues with this approach. The design documents vary in format and
> quality, and are often reviewed by a limited audience. They aren't
> version controlled. Sometimes the proposal is only partially
> specified. Jiras are often ignored. Understanding a proposal and its
> implications through a series of threads in the jira comments is
> difficult. It's hard for contributors and users to find these
> top-level jiras and follow their status.
>
> I'd like to propose that core Hadoop adopts something similar to
> Python's PEP (Python Enhancement Proposal) [1]. A "HEP" would be a
> single primary mechanism for proposing new features, incorporating
> community feedback, and recording decisions. The author of the HEP
> would be responsible for building consensus and moving the feature
> forward. Similarly, some subset of the community would be responsible
> for reviewing HEPs in a timely manner and identifying missing pieces
> in the proposal. Discussion would occur before patches showed up on
> jira. People interested in the core Hadoop roadmap could keep an eye
> on the HEPs without the overhead of following jira traffic.
>
> Why base this on the PEP? The format has proven useful to a
> substantial existing project, and I think the workflow is not too
> heavy-weight, and well-suited to a community such as ours. That being
> said, we could discuss other models (eg Java's JSR).
>
> Before we get into specifics, is this something the community would
> like to adopt in some form? Does adapting the PEP and its workflow to
> our projects, community and bylaws seem reasonable?
>
> Thanks,
> Eli
>
> 1. http://www.python.org/dev/peps/pep-0001

Re: [DISCUSSION] Proposal for making core Hadoop changes

Posted by Amr Awadallah <aa...@cloudera.com>.
> Does adapting the PEP and its workflow to our projects, community and
> bylaws seem reasonable?

+1

On 5/21/2010 1:42 PM, Eli Collins wrote:
> As HDFS and MapReduce have matured the cost and complexity of
> introducing features has grown. Each new feature has to consider
> interactions with a growing set of existing features, a growing user
> base (upgrades, backwards compatibility) and additional use cases
> (more and more projects now build on them). At the same time we don't
> want the high bar for contribution to unnecessarily hinder new
> development and releases.
>
> Many projects at a similar stage address this by adopting a more
> formal way to describe, socialize and shepherd enhancements to their
> platforms. Today, new features are often discussed via an umbrella
> jira, which may have an attached design document. There are a number
> of issues with this approach. The design documents vary in format and
> quality, and are often reviewed by a limited audience. They aren't
> version controlled. Sometimes the proposal is only partially
> specified. Jiras are often ignored. Understanding a proposal and its
> implications through a series of threads in the jira comments is
> difficult. It's hard for contributors and users to find these
> top-level jiras and follow their status.
>
> I'd like to propose that core Hadoop adopts something similar to
> Python's PEP (Python Enhancement Proposal) [1]. A "HEP" would be a
> single primary mechanism for proposing new features, incorporating
> community feedback, and recording decisions. The author of the HEP
> would be responsible for building consensus and moving the feature
> forward. Similarly, some subset of the community would be responsible
> for reviewing HEPs in a timely manner and identifying missing pieces
> in the proposal. Discussion would occur before patches showed up on
> jira. People interested in the core Hadoop roadmap could keep an eye
> on the HEPs without the overhead of following jira traffic.
>
> Why base this on the PEP? The format has proven useful to a
> substantial existing project, and I think the workflow is not too
> heavy-weight, and well-suited to a community such as ours. That being
> said, we could discuss other models (eg Java's JSR).
>
> Before we get into specifics, is this something the community would
> like to adopt in some form? Does adapting the PEP and its workflow to
> our projects, community and bylaws seem reasonable?
>
> Thanks,
> Eli
>
> 1. http://www.python.org/dev/peps/pep-0001
>