Posted to general@hadoop.apache.org by Eric Yang <ey...@yahoo-inc.com> on 2011/05/05 04:39:51 UTC

[DISCUSSION] development process of Hadoop

If we reflect back, we can see how the Hadoop development community ended up in its current state.  Development is happening rapidly and being tested in all kinds of organizations.  However, Hadoop committers are only committing code that interests their sponsoring companies.  People are coding defensively to ensure that only self-serving patches get committed, while helping others and resolving merge problems are always treated as secondary priorities.  While the world demands agility, the "review then commit" (RTC) process is preventing progress from happening.  Committers are afraid to commit patches because review hasn't taken place.  By the time a patch is reviewed, it no longer applies cleanly.  People end up having to generate multiple versions of a patch to ensure the code can be applied.  The long lag between patch generation and review is taking a significant toll on the community and its progress.

Yahoo has a great team of developers who improve Hadoop at a faster pace with their own fork of the source code.  Yahoo was able to deliver features faster because it could use source code repository tools properly.  Unfortunately for Yahoo, its source code repository was not Apache svn trunk.  I applaud Owen and Arun's effort in putting in the manpower to backport and forward-port the changes between the Yahoo github and Apache svn.  There might be some jiras that need to be merged into the Hadoop 0.20.203 branch to ensure the lineage is correct.  The community should offer to help with a detailed list of what is missing rather than vote -1 without concise reasoning about what is missing.

JIRA is meant as a discussion and collaboration tool, but the Hadoop community tends to use it as the source code version control system, with a manpowered diff maker.  While I was spending time in the incubator with another project, the mentors explained that it is not the ASF's philosophy to use "review then commit".  The Hadoop community should rethink whether it is using the right tools for the right tasks.

Use JIRA when there is a large feature set that requires brainstorming, and give developers the ability to make small incremental changes without RTC.  This will ensure developers help each other rather than police each other.

Any thoughts?

Regards,
Eric

Re: [DISCUSSION] development process of Hadoop

Posted by Eric Yang <ey...@yahoo-inc.com>.
Instead of depending on the review-then-commit practice being the norm, Hadoop committers can probably take advantage of the svn jira plugin.  People can actively commit to svn as long as a jira number is referenced in the commit.  The commit message will show up in JIRA and leave a trail of activity for reference.  Future committers can refer back to the code history to see why the code was written the way it was.  This is less error prone than maintaining patch increments.  This seems like a problem that can be solved by tweaking the behavior of the Hadoop committers.
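
As a rough illustration of the "jira number is referenced in the commit" rule, here is a minimal Python sketch of the check a commit hook could apply to a log message.  It is not an existing Hadoop or ASF hook: the project keys in the pattern and the function names are assumptions made up for the example, and wiring it into svn (for instance, reading the message in a pre-commit hook) is left out.

```python
import re

# Assumed issue-key format: a project key, a hyphen, and a number,
# e.g. HADOOP-1234.  The list of project keys is hypothetical.
JIRA_KEY = re.compile(r"\b(?:HADOOP|HDFS|MAPREDUCE)-\d+\b")


def referenced_issues(message):
    """Return every issue key mentioned in a commit message, in order."""
    return JIRA_KEY.findall(message)


def commit_allowed(message):
    """Accept a commit only if its message references at least one issue."""
    return len(referenced_issues(message)) > 0


if __name__ == "__main__":
    for msg in ("HADOOP-1234. Fix typo in javadoc", "fix typo"):
        print(msg, "->", "accept" if commit_allowed(msg) else "reject")
```

A hook like this only enforces the reference; whether a review happens before or after the commit remains a policy choice.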

Regards,
Eric


Re: [DISCUSSION] development process of Hadoop

Posted by Eli Collins <el...@cloudera.com>.
On Wed, May 4, 2011 at 7:39 PM, Eric Yang <ey...@yahoo-inc.com> wrote:
> If we reflect back, we can see how the Hadoop development community ended up in its current state.  Development is happening rapidly and being tested in all kinds of organizations.  However, Hadoop committers are only committing code that interests their sponsoring companies.  People are coding defensively to ensure that only self-serving patches get committed, while helping others and resolving merge problems are always treated as secondary priorities.  While the world demands agility, the "review then commit" (RTC) process is preventing progress from happening.  Committers are afraid to commit patches because review hasn't taken place.  By the time a patch is reviewed, it no longer applies cleanly.  People end up having to generate multiple versions of a patch to ensure the code can be applied.  The long lag between patch generation and review is taking a significant toll on the community and its progress.
>
> Yahoo has a great team of developers who improve Hadoop at a faster pace with their own fork of the source code.  Yahoo was able to deliver features faster because it could use source code repository tools properly.  Unfortunately for Yahoo, its source code repository was not Apache svn trunk.  I applaud Owen and Arun's effort in putting in the manpower to backport and forward-port the changes between the Yahoo github and Apache svn.  There might be some jiras that need to be merged into the Hadoop 0.20.203 branch to ensure the lineage is correct.  The community should offer to help with a detailed list of what is missing rather than vote -1 without concise reasoning about what is missing.
>
> JIRA is meant as a discussion and collaboration tool, but the Hadoop community tends to use it as the source code version control system, with a manpowered diff maker.  While I was spending time in the incubator with another project, the mentors explained that it is not the ASF's philosophy to use "review then commit".

ASF's policy is that projects make this decision for themselves:
http://www.apache.org/dev/project-creation.html

The Hadoop bylaws specify that code changes are lazy consensus, i.e. you
need a +1 from a committer. Technically the code doesn't have to be
reviewed before committing it; that's just been the norm.

I don't think jira is technically required either; it's just been the
norm. The vote for the patch has to happen on the lists, which happens
as a side effect of jira traffic going to the dev lists.

> The Hadoop community should rethink whether it is using the right tools for the right tasks.
>
> Use JIRA when there is a large feature set that requires brainstorming, and give developers the ability to make small incremental changes without RTC.  This will ensure developers help each other rather than police each other.
>
> Any thoughts?
>

I think you can move quickly with RTC or CTR (commit then review); I've
worked on RTC projects that have moved quickly. It requires that people
dedicate bandwidth to reviewing changes. If you do want all your code
reviewed (at some point) then you're ultimately limited by review
bandwidth, with either RTC or CTR.

The time it takes to file a jira is normally insignificant compared to
the time to create and test a change. The idea with using jira is that
you propose/discuss a change before creating code. You could do that
on the lists too. I agree using just a code review tool for small
stuff would be faster, e.g. things that don't require a bug #, release
note, etc.

Thanks,
Eli