Posted to general@hadoop.apache.org by Arun C Murthy <ac...@yahoo-inc.com> on 2010/08/24 02:27:14 UTC

[DISCUSS] Hadoop Security Release off Yahoo! patchset

Even with the work on hadoop-0.22 (trunk) starting in earnest it is  
fairly obvious, given our past history, that it will take a while for  
us to get it stable and deployable - e.g. it took us nearly 6  
months to deploy hadoop-0.20.

In the interim I'd like to propose we push a hadoop-0.20-security  
release off the Yahoo! patchset (http://github.com/yahoo/hadoop-common).
This will ensure the community benefits from all the work  
done at Yahoo! for over 12 months *now*, and ensures that we do not  
have to wait until hadoop-0.22 which has all of these patches.

Some salient aspects:
a) Full-fledged security implementation deployed at scale (4000 nodes)  
in production.
b) Lots of work on stabilizing and optimizing the NameNode and
JobTracker for over 12 months. This has been critical in deploying
Hadoop at scale, i.e. clusters of 4000 nodes. For example, we have a 50%
improvement in CPU utilization on the JobTracker vis-a-vis the
hadoop-0.20.2 release.
c) Several new features in the scheduler (CapacityScheduler), the
Map-Reduce framework, better support for multi-tenancy etc.
d) Several performance and stability improvements to the system e.g.  
iterative ls, robustness against rogue clients/jobs/users etc.

Also, given the huge number of features and enhancements I'd like to  
propose we create a new 0.20-security branch and commit the Yahoo  
patchset there for the release.

This has been proposed earlier by Doug and did not get far due to  
concerns about the effect this would have on development on trunk.  
However, I believe, we have a case for demonstrable progress on trunk  
now, and it would be useful to have an interim, fully-tested Apache  
Hadoop release available to the community.

  Conceivably, one could imagine a Hadoop Security + Append release  
soon after. At this point a Hadoop Security release alone would add  
tremendous value for the reasons above. Presently we would like to get  
this release out quickly to focus the majority of our efforts on trunk.

Thoughts?

Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

Let me second Arun here.

This is incremental work on 0.20.  We're happy to support any branch naming strategy the community likes, but sticking with 20.<minor> seems like the right default approach.  

Let's discuss 1.0 issues on another thread.  Our priority is to get our work into other folks' hands.

Thanks!
E14

On Jan 12, 2011, at 1:34 PM, Arun C Murthy wrote:

> I'm willing to discuss any and all options, for a very short period.
> 
> Technically you have a reasonable point, Doug has suggested this in  
> the past too. If everyone agrees, fine; if not, I do not want to get hung
> up on a release number. I just *do not* want a controversy.
> 
> As I mentioned, I'm looking to finish this up in a couple of weeks;  
> so, I could do without a long discussion on the critical path.
> 
> I'm happy to go with a reasonable compromise, if not, hadoop-0.20.100  
> is what I'm priming for.
> 
> Heck, if Stack wants to call the append release (not sure how far  
> ahead he is) as hadoop-0.20.100, I'm willing to call this  
> hadoop-0.20.200.
> 
> All I care about is having a distinct release number from 0.20.2 (our  
> last stable release). Again, I just want to get a release into the  
> hands of our users. Please, let's resolve this quickly. Please.
> 
> Arun
> 
> On Jan 12, 2011, at 1:10 PM, Owen O'Malley wrote:
> 
>> 
>> On Jan 11, 2011, at 9:09 PM, Arun C Murthy wrote:
>> 
>>> I'm open to suggestions - how about something like 20.100 to show
>>> that it's a big jump? Anything else?
>> 
>> 
>> Although I'm not wild about any of the potential release names, this
>> patch set is neither a subset nor a superset of the 0.21 or 0.22
>> branches. Given that, I think that a new major release number makes
>> the most sense. It is also relatively likely that additional minor
>> releases will be made off of this branch while 0.22 is stabilizing.
>> We've talked about declaring 0.20 a 1.0 for a long time and this feels
>> like backing into the decision, but technically, I believe it to be
>> the right name for such a release.
>> 
>> Thoughts?
>> 
>> -- Owen
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Nigel Daley <nd...@mac.com>.

On Jan 12, 2011, at 11:07 PM, Arun C Murthy wrote:

> 
> On Jan 12, 2011, at 2:56 PM, Nigel Daley wrote:
> 
>> +1 for 0.20.x, where x >= 100.  I agree that the 1.0 moniker would involve more discussion.
> 
> Ok, seems like we are converging; we can continue talking. I've created the branch to get the ball rolling.
> 
>> Will this be a jumbo patch attached to a Jira and then committed to the branch?  Just curious.
> 
> I'm afraid that the svn log of the branch from github Y! branch is fairly useless since a single JIRA might have multiple commits in the Y! branch (bugfix on top of a bugfix). We have done that in several cases (but the patches committed to trunk have a single patch which is the result of forward porting a complete feature/bugfix). IAC, this branch and 0.22 have diverged so much that almost no non-trivial patch would apply without a significant amount of work.
> 
> Thus, I think a jumbo patch should suffice. It will also ensure this can be done quickly so that the community can then concentrate on 0.22 and beyond.
> 
> However, I will (manually) ensure all relevant jiras are referenced in the CHANGES.txt and Release Notes for folks to see the contents of the release. This is the hardest part of the exercise. Also, this ensures that we can track these jiras for 0.22 as Eli suggested.
> 
> Does that seem like a reasonable way forward? I'm happy to brainstorm.

+1.  If it turns out to be insufficient to figure out how to apply similar changes to trunk/0.22 then we can address that as needed.

Thanks Arun!

Nige

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Jan 17, 2011, at 8:40 PM, Jeff Hammerbacher wrote:

>>
>> Apache Hadoop hasn't had a stable, updated release in a while.
>>
>
> That's what 0.22 is for?

Every single Hadoop release in the recent past, and I have worked on  
pretty much every single Hadoop release since forever, has taken at  
least 3-4 months to stabilize.

So we are, at a minimum, looking at June 2011 for 0.22.

This could be a good intermediate release, no? This need not be the  
only one either, please work on 20.append or 20.append+security and  
release it, I fully support it.

IAC, would calling this release something other than 0.20.* be ok?

Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Jeff Hammerbacher <ha...@cloudera.com>.

>
> Apache Hadoop hasn't had a stable, updated release in a while.
>

That's what 0.22 is for?

> However, it does remedy the critical problem - a stable, updated Apache
> Hadoop release.
>

Again, isn't that what 0.22 is for?


> An appeal: Let's use a bit of common sense and get the project moving
> forward with a release. Folks are welcome to put forward an append release
> and an append+security release and so forth (I've strongly supported that),
> not to mention 0.22 and beyond. IMHO, more than one release is definitely
> better than none.
>

Yes, 0.22 has both append and security. 0.22 also has the nice feature of
following the Apache release guidelines rather than relying on the patch set
of an independent entity, whether it's Cloudera, Facebook, or Yahoo.


> Let's get the ball rolling, please!
>

Agreed! Nigel has done a great job getting the ball rolling on the 0.22
release. I'm looking forward to seeing everyone burn down the blockers that
have been identified.

Regards,
Jeff

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On 1/17/11 12:11 PM, "Doug Cutting" <cu...@apache.org> wrote:
>
> We would not release this until each change in it has been reviewed by
> the community, right?  Otherwise we may end up with changes in a 0.20
> release that don't get approved when they're contributed to trunk and
> cause trunk to regress.  So I don't yet see the point of committing the
> mega patch since the community needs to review each individual change
> anyway, so we might wait until each is reviewed to commit it.

My take is straight-forward:

Apache Hadoop hasn't had a stable, updated release in a while.

As a result there is too much confusion for the user-community. There  
are too many releases done by too many entities and nothing is  
available from Apache, for a long while now. This is a situation we  
need to rectify, urgently!

Engaging in community review of these patches will distract the  
developer community's attention from 0.22 and the future. Not to  
mention, it will take forever and keep users hanging. Yes, the  
mechanics are important - but not more important than the end result.

IAC:
a) The vast majority of these patches are already on jira, and have  
been for several, several months now.
b) The vast majority of these patches have already been committed to  
trunk i.e. 0.22.

Sure, some patches may be missing from 0.22 or jira; my proposal is not
ideal, and I don't think anyone is pretending it is.

However, it does remedy the critical problem - a stable, updated  
Apache Hadoop release.

We can remedy backward or forward compatibility by being clever with  
our release versions or names.

An appeal: Let's use a bit of common sense and get the project moving  
forward with a release. Folks are welcome to put forward an append  
release and an append+security release and so forth (I've strongly  
supported that), not to mention 0.22 and beyond. IMHO, more than one  
release is definitely better than none.

Let's get the ball rolling, please!

thanks,
Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Ian Holsman <ha...@holsman.net>.

Arun also mentioned that he has created a separate branch ( http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-security-patches/ ) for the individual patches.. so he is doing both.. so everyone should be happy.

--I
On Jan 25, 2011, at 2:05 AM, Eric Baldeschwieler wrote:

> Hi Ian,
> 
> No votes have been called for at the moment.  Right now all arun's done is create a branch and ask for feedback from folks who want to try it.  Most of the forward ports are already either committed or in a patch available state, as arun mentioned.  We'll work through the others as individual JIRAs to allow everyone to kick the tires.  That should avoid issues with 0.22.  
> 
> I don't anticipate votes etc, unless folks do want to try it and do provide feedback.  This is what runs at yahoo at the moment.  I hope people think it is worth trying.
> 
> Thanks,
> 
> E14
> 
> On Jan 22, 2011, at 6:22 AM, Ian Holsman wrote:
> 
>> 
>> On Jan 19, 2011, at 1:12 PM, Konstantin Shvachko wrote:
>> 
>>> On Tue, Jan 18, 2011 at 11:49 PM, Ian Holsman <ha...@holsman.net> wrote:
>>> 
>>>> I think Roy's suggestion of applying the commits individually to the branch
>>>> from your current working branch would help with this.
>>>> 
>>> 
>>> I am sure this is not what Roy suggested, Ian. I think the idea is simple.
>> 
>> to Quote Roy:
>> b) create a branch off of some prior Apache release point in svn
>>  and replay the internal Y! commits on that branch until the branch
>>  source code is identical to what you have tested locally.  Then
>>  RM a tarball based on that branch and start a release vote.
>>  Since the history is now in svn, others could do the RM bit if
>>  you don't have time.
>> 
>> 
>> Arun has chosen option (c), that Roy also mentioned as a valid way of doing it.
>> 
>>> If you decide to donate to a non-profit organization you are free to choose
>>> the form of your donation.
>> 
>> 
>> I think you are confusing a non-profit with a dumping ground.
>> Any organization (non-profit or for-profit) has responsibilities, and there is always a tradeoff between features and risk. Any organization can choose to not accept a donation. It comes down to give-and-take.
>> 
>> 
>> As Roy also mentioned, option (c) will be harder for others to test, and to get consensus about whether it is release-worthy, let alone merge into 0.22.
>> 
>> 
>>> 
>>> Thanks,
>>> --Konstantin
>> 
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Jan 25, 2011, at 2:05 AM, Eric Baldeschwieler wrote:
>
> I don't anticipate votes etc, unless folks do want to try it and do  
> provide feedback.  This is what runs at yahoo at the moment.  I hope  
> people think it is worth trying.
>

I've put up the bits at http://people.apache.org/~acmurthy/hadoop-0.20.100-rc0/
for interested folks. Please do provide feedback if you try it.

thanks,
Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

Hi Ian,

No votes have been called for at the moment.  Right now all arun's done is create a branch and ask for feedback from folks who want to try it.  Most of the forward ports are already either committed or in a patch available state, as arun mentioned.  We'll work through the others as individual JIRAs to allow everyone to kick the tires.  That should avoid issues with 0.22.  

I don't anticipate votes etc, unless folks do want to try it and do provide feedback.  This is what runs at yahoo at the moment.  I hope people think it is worth trying.

Thanks,

E14

On Jan 22, 2011, at 6:22 AM, Ian Holsman wrote:

> 
> On Jan 19, 2011, at 1:12 PM, Konstantin Shvachko wrote:
> 
>> On Tue, Jan 18, 2011 at 11:49 PM, Ian Holsman <ha...@holsman.net> wrote:
>> 
>>> I think Roy's suggestion of applying the commits individually to the branch
>>> from your current working branch would help with this.
>>> 
>> 
>> I am sure this is not what Roy suggested, Ian. I think the idea is simple.
> 
> to Quote Roy:
> b) create a branch off of some prior Apache release point in svn
>   and replay the internal Y! commits on that branch until the branch
>   source code is identical to what you have tested locally.  Then
>   RM a tarball based on that branch and start a release vote.
>   Since the history is now in svn, others could do the RM bit if
>   you don't have time.
> 
> 
> Arun has chosen option (c), that Roy also mentioned as a valid way of doing it.
> 
>> If you decide to donate to a non-profit organization you are free to choose
>> the form of your donation.
> 
> 
> I think you are confusing a non-profit with a dumping ground.
> Any organization (non-profit or for-profit) has responsibilities, and there is always a tradeoff between features and risk. Any organization can choose to not accept a donation. It comes down to give-and-take.
> 
> 
> As Roy also mentioned, option (c) will be harder for others to test, and to get consensus about whether it is release-worthy, let alone merge into 0.22.
> 
> 
>> 
>> Thanks,
>> --Konstantin
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Ian Holsman <ha...@holsman.net>.

On Jan 19, 2011, at 1:12 PM, Konstantin Shvachko wrote:

> On Tue, Jan 18, 2011 at 11:49 PM, Ian Holsman <ha...@holsman.net> wrote:
> 
>> I think Roy's suggestion of applying the commits individually to the branch
>> from your current working branch would help with this.
>> 
> 
> I am sure this is not what Roy suggested, Ian. I think the idea is simple.

to Quote Roy:
b) create a branch off of some prior Apache release point in svn
   and replay the internal Y! commits on that branch until the branch
   source code is identical to what you have tested locally.  Then
   RM a tarball based on that branch and start a release vote.
   Since the history is now in svn, others could do the RM bit if
   you don't have time.

Arun has chosen option (c), that Roy also mentioned as a valid way of doing it.

> If you decide to donate to a non-profit organization you are free to choose
> the form of your donation.

I think you are confusing a non-profit with a dumping ground.
Any organization (non-profit or for-profit) has responsibilities, and there is always a tradeoff between features and risk. Any organization can choose to not accept a donation. It comes down to give-and-take.

As Roy also mentioned, option (c) will be harder for others to test, and to get consensus about whether it is release-worthy, let alone merge into 0.22.

> 
> Thanks,
> --Konstantin

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Konstantin Shvachko <sh...@gmail.com>.

On Tue, Jan 18, 2011 at 11:49 PM, Ian Holsman <ha...@holsman.net> wrote:

> I think Roy's suggestion of applying the commits individually to the branch
> from your current working branch would help with this.
>

I am sure this is not what Roy suggested, Ian. I think the idea is simple.
If you decide to donate to a non-profit organization you are free to choose
the form of your donation.
There is a tradeoff between its usability to the community and the effort
put into making it such.
The community can evaluate the quality of the donation and decide on how to
consume it.

Also, this is different from the 0.20-append discussion, imo, as it doesn't
require additional community resources. I can see many of them are dedicated
to building 0.22 as we speak.

Thanks,
--Konstantin

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Ian Holsman <ha...@holsman.net>.

That's what scares me, and highlights one of the points I made before.

If someone wants to just use the capacity scheduling improvements, and not other parts, they will find it hard. I think Roy's suggestion of applying the commits individually to the branch from your current working branch would help with this.

regards
Ian

On Jan 19, 2011, at 1:22 AM, Eric Baldeschwieler wrote:

> Sounds like you are agreeing with me in your own way Allen ;-)
> 
> We're getting > 2x better throughput and stability from the capacity scheduler in this branch.  I'd love to get your feedback on that.
> 
> The more nodes and users in your deployment, the more improvements you will see. 
> 
> ---
> E14 - via iPhone
> 
> On Jan 18, 2011, at 11:10 AM, "Allen Wittenauer" <aw...@linkedin.com> wrote:
> 
>> On Jan 17, 2011, at 2:56 PM, Eric Baldeschwieler wrote:
>>> 
>>> Right now you don't have the choice of an Apache release if you are looking for a stabilized modern version of Hadoop.  
>> 
>> 
>>   Can we ratchet down the hyperbole to at least a point where I don't want to vomit? Thanks.
>> 
>>   [For the record, some of us quite like our stable, almost-a-year Apache Hadoop 0.20.2 w/3 patches installations, thankyouverymuch.  (Those patches are for portability and capacity scheduler fixes, since the one in 0.20.2 is completely useless.)]

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

Sounds like you are agreeing with me in your own way Allen ;-)

We're getting > 2x better throughput and stability from the capacity scheduler in this branch.  I'd love to get your feedback on that.

The more nodes and users in your deployment, the more improvements you will see. 

---
E14 - via iPhone

On Jan 18, 2011, at 11:10 AM, "Allen Wittenauer" <aw...@linkedin.com> wrote:

> On Jan 17, 2011, at 2:56 PM, Eric Baldeschwieler wrote:
>> 
>> Right now you don't have the choice of an Apache release if you are looking for a stabilized modern version of Hadoop.  
> 
> 
>    Can we ratchet down the hyperbole to at least a point where I don't want to vomit? Thanks.
> 
>    [For the record, some of us quite like our stable, almost-a-year Apache Hadoop 0.20.2 w/3 patches installations, thankyouverymuch.  (Those patches are for portability and capacity scheduler fixes, since the one in 0.20.2 is completely useless.)]

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Allen Wittenauer <aw...@linkedin.com>.

On Jan 17, 2011, at 2:56 PM, Eric Baldeschwieler wrote:
> 
>  Right now you don't have the choice of an Apache release if you are looking for a stabilized modern version of Hadoop.  


	Can we ratchet down the hyperbole to at least a point where I don't want to vomit? Thanks.

	[For the record, some of us quite like our stable, almost-a-year Apache Hadoop 0.20.2 w/3 patches installations, thankyouverymuch.  (Those patches are for portability and capacity scheduler fixes, since the one in 0.20.2 is completely useless.)]

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Chris Douglas <cd...@apache.org>.

On Mon, Jan 17, 2011 at 8:30 PM, Jeff Hammerbacher <ha...@cloudera.com> wrote:
> We had this exact same discussion about the 0.20-append branch a few weeks
> ago. A few organizations have tested that code at scale and feel strongly
> that it's stable. We decided not to release it because it does not meet the
> Apache guidelines for a release.

It does meet the guidelines. The last validation and development of
the append work wasn't done openly, but that doesn't prevent it from
being contributed and released. That discussion died out because no
committer stepped up, rolled a release of that branch, and called a
vote. If it gets a majority of votes on the PMC, it could even be
released off the 0.20 branch. Individuals may have their reservations
and vote accordingly, but the PMC has the authority to release
0.20-append and 0.20.100 (or whatever).

> A few weeks later, we now have another organization claiming that their
> 0.20-based branch is tested at scale and should be released. It's claimed
> that 0.20.100 will be "more stable, performant and more useful to our
> users"; the same can be said of the 0.20-append branch. Neither branch,
> however, is a bugfix release and thus does not meet the Apache guidelines
> for a release. That's too bad; we should work to avoid this situation again
> in the future, but let's not try to change the rules because we did a poor
> job in the past of getting our work released via Apache.

The rules from Apache are *far* more flexible than what we've
practiced. Our rigidity has contributed to the current state of the
project by pushing active development behind the walls of contributing
organizations. We aren't obligated to live with a fractured community,
nor do some nebulous Apache guidelines prohibit us from finding a way
forward.

> As Nigel mentions, and as was done with 0.20-append, I would fully support a
> "a code-only drop into a branch w/ no formal Apache release". That's fully
> compliant with the Apache process.

This helps nobody except those who would cut releases outside of
Apache, which is precisely what we're trying to curtail.

> All of these discussions will be moot once we get 0.22 out the door and stop
> arguing over which organization has the most magical 0.20-based bits. I'm
> looking forward to seeing all of the Apache Hadoop contributors working full
> time on that release process once these bits are committed to the 0.20.100
> branch.

+1 I'm sure I'm not alone in turning ill at the thought of working on
a 0.20 branch again. But others may have different priorities,
different interests, and may actually prefer the architectural
decisions in 0.20 to those made since then. There's obviously enough
interest in a stable branch for all the major contributors to solve
those same problems independently. Opening up space in Apache for that
work is what we should have done a year ago. Since we don't yet have a
credible release, it still makes sense today. -C

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

Bringing 'organizations' into this discussion is very disingenuous.

Doug, credit to him, was the first person to propose this release:
http://www.mail-archive.com/general@hadoop.apache.org/msg01427.html

I have supported the append-release:
http://www.mail-archive.com/general@hadoop.apache.org/msg02584.html

So, stop coloring arguments in this manner.

Arun


On Jan 17, 2011, at 8:30 PM, Jeff Hammerbacher wrote:

> Hey,
>
> We had this exact same discussion about the 0.20-append branch a few weeks
> ago. A few organizations have tested that code at scale and feel strongly
> that it's stable. We decided not to release it because it does not meet the
> Apache guidelines for a release. The Apache process has its pros and cons;
> we've all accepted them, so the community moved on and focused its energy on
> the 0.22 release.
>
> A few weeks later, we now have another organization claiming that their
> 0.20-based branch is tested at scale and should be released. It's claimed
> that 0.20.100 will be "more stable, performant and more useful to our
> users"; the same can be said of the 0.20-append branch. Neither branch,
> however, is a bugfix release and thus does not meet the Apache guidelines
> for a release. That's too bad; we should work to avoid this situation again
> in the future, but let's not try to change the rules because we did a poor
> job in the past of getting our work released via Apache.
>
> As Nigel mentions, and as was done with 0.20-append, I would fully support a
> "a code-only drop into a branch w/ no formal Apache release". That's fully
> compliant with the Apache process.
>
> All of these discussions will be moot once we get 0.22 out the door and stop
> arguing over which organization has the most magical 0.20-based bits. I'm
> looking forward to seeing all of the Apache Hadoop contributors working full
> time on that release process once these bits are committed to the 0.20.100
> branch.
>
> Thanks,
> Jeff
>
> On Mon, Jan 17, 2011 at 6:55 PM, Todd Papaioannou <toddp@yahoo-inc.com> wrote:
>
>> That's only true if you plan to pull forward the changes wholesale into
>> .21, .22 and beyond. And that is not what is being proposed.
>>
>> If the plan is to just land an updated and more stable version of .20 that
>> is completely backwards compatible, then this can be done within that code
>> line without any impact to the end users. Any changes that the community
>> wish to pull forward can be identified, isolated and reviewed per the normal
>> process. Or they can remain in the .20.100 release for eternity, without any
>> impact on the future.
>>
>> Either way, the .20 release will be more stable, performant and more useful
>> to our users, and the community at large can focus on releasing .22, which
>> we all believe is the right goal.
>>
>> ToddP
>>
>> From: Doug Cutting <cu...@apache.org>
>> Reply-To: "general@hadoop.apache.org" <general@hadoop.apache.org>
>> Date: Mon, 17 Jan 2011 15:49:51 -0800
>> To: "general@hadoop.apache.org" <general@hadoop.apache.org>
>> Subject: Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset
>>
>>
>> Backwards compatibility has been a goal, so
>> with luck we will not ID regressions.
>>
>> My point was that, in addition to back-compatibility with prior 0.20
>> releases, we must also consider the forward-compatibility of each change
>> with 0.21, 0.22 and trunk.
>>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Jeff Hammerbacher <ha...@cloudera.com>.

Hey,

We had this exact same discussion about the 0.20-append branch a few weeks
ago. A few organizations have tested that code at scale and feel strongly
that it's stable. We decided not to release it because it does not meet the
Apache guidelines for a release. The Apache process has its pros and cons;
we've all accepted them, so the community moved on and focused its energy on
the 0.22 release.

A few weeks later, we now have another organization claiming that their
0.20-based branch is tested at scale and should be released. It's claimed
that 0.20.100 will be "more stable, performant and more useful to our
users"; the same can be said of the 0.20-append branch. Neither branch,
however, is a bugfix release and thus does not meet the Apache guidelines
for a release. That's too bad; we should work to avoid this situation again
in the future, but let's not try to change the rules because we did a poor
job in the past of getting our work released via Apache.

As Nigel mentions, and as was done with 0.20-append, I would fully support a
"a code-only drop into a branch w/ no formal Apache release". That's fully
compliant with the Apache process.

All of these discussions will be moot once we get 0.22 out the door and stop
arguing over which organization has the most magical 0.20-based bits. I'm
looking forward to seeing all of the Apache Hadoop contributors working full
time on that release process once these bits are committed to the 0.20.100
branch.

Thanks,
Jeff

On Mon, Jan 17, 2011 at 6:55 PM, Todd Papaioannou <to...@yahoo-inc.com> wrote:

> That's only true if you plan to pull forward the changes wholesale into
> .21, .22 and beyond. And that is not what is being proposed.
>
> If the plan is to just land an updated and more stable version of .20 that
> is completely backwards compatible, then this can be done within that code
> line without any impact to the end users. Any changes that the community
> wish to pull forward can be identified, isolated and reviewed per the normal
> process. Or they can remain in the .20.100 release for eternity, without any
> impact on the future.
>
> Either way, the .20 release will be more stable, performant and more useful
> to our users, and the community at large can focus on releasing .22, which
> we all believe is the right goal.
>
> ToddP
>
> From: Doug Cutting <cu...@apache.org>
> Reply-To: "general@hadoop.apache.org" <general@hadoop.apache.org>
> Date: Mon, 17 Jan 2011 15:49:51 -0800
> To: "general@hadoop.apache.org" <general@hadoop.apache.org>
> Subject: Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset
>
>
> Backwards compatibility has been a goal, so
> with luck we will not ID regressions.
>
> My point was that, in addition to back-compatibility with prior 0.20
> releases, we must also consider the forward-compatibility of each change
> with 0.21, 0.22 and trunk.
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Todd Papaioannou <to...@yahoo-inc.com>.

That's only true if you plan to pull forward the changes wholesale into .21, .22 and beyond. And that is not what is being proposed.

If the plan is to just land an updated and more stable version of .20 that is completely backwards compatible, then this can be done within that code line without any impact to the end users. Any changes that the community wish to pull forward can be identified, isolated and reviewed per the normal process. Or they can remain in the .20.100 release for eternity, without any impact on the future.

Either way, the .20 release will be more stable, performant and more useful to our users, and the community at large can focus on releasing .22, which we all believe is the right goal.

ToddP

From: Doug Cutting <cu...@apache.org>
Reply-To: "general@hadoop.apache.org" <ge...@hadoop.apache.org>
Date: Mon, 17 Jan 2011 15:49:51 -0800
To: "general@hadoop.apache.org" <ge...@hadoop.apache.org>
Subject: Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset


Backwards compatibility has been a goal, so
with luck we will not ID regressions.

My point was that, in addition to back-compatibility with prior 0.20
releases, we must also consider the forward-compatibility of each change
with 0.21, 0.22 and trunk.

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Doug Cutting <cu...@apache.org>.

On 01/17/2011 02:56 PM, Eric Baldeschwieler wrote:
> 1) To doug's point - Yes, absolutely, we want folks to review this.
> The patch is now available.  Lets work together to get it formatted
> as folks like in subversion and reviewed.  Where there are issues,
> let's work to resolve them.  With luck folks will find this work
> consistent and useful.

The question I was addressing was whether to commit the mega-patch 
as-is, or attempt to linearize it into a sequence of patches, one per 
issue addressed by the mega patch.  I believe that, as-is, it is 
probably too big to review as a unit.  I don't see that merely naming 
the changes in it makes it substantially easier to review.  Rather, each 
issue probably needs to be associated with a distinct patch in order to 
permit independent review.

> Backwards compatibility has been a goal, so
> with luck we will not ID regressions.

My point was that, in addition to back-compatibility with prior 0.20 
releases, we must also consider the forward-compatibility of each change 
with 0.21, 0.22 and trunk.

> As todd mentioned earlier
> point, a lot of this work has already been merged into CDH and all of
> it has been reviewed by several apache committers already.

Right, but review must be in public, so that everyone in the community 
has an equal chance to be involved in the development of each change.

Doug

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

Hi Folks,

We are very interested in sharing what we are doing with the community.  I think we can separate this into multiple stages.

1) To Doug's point - Yes, absolutely, we want folks to review this.  The patch is now available.  Let's work together to get it formatted as folks like in subversion and reviewed.  Where there are issues, let's work to resolve them.  With luck folks will find this work consistent and useful.  Backwards compatibility has been a goal, so with luck we will not ID regressions.  As Todd mentioned earlier, a lot of this work has already been merged into CDH and all of it has been reviewed by several Apache committers already.

2) This code works; it is the best Hadoop we know of.  If you run a business on Hadoop, I think you would benefit from using it.  Right now you don't have the choice of an Apache release if you are looking for a stabilized modern version of Hadoop.  We would like to make Apache releases based on it, source and binary, incorporating bugfixes from everyone.  To do that we would of course need to follow the Apache Hadoop release process, which requires the release master to produce a release candidate and the PMC to vote on the release.  Since that will require a formal future vote, no one will be surprised!

3) To Nigel's point - I don't think this should distract anyone from working on 22 or other Hadoop contributions.  The 22 team will have the option of incorporating this work.   We think it will be a better release if they do, but that is their choice.  The majority of our effort at Yahoo is not going into 0.20 (this branch); we are working on future features for Hadoop.  This branch is the stable code we use while we are waiting for a new release.

Thanks,

E14

On Jan 17, 2011, at 12:21 PM, Nigel Daley wrote:

> 
> On Jan 17, 2011, at 12:11 PM, Doug Cutting wrote:
> 
>> On 01/12/2011 11:07 PM, Arun C Murthy wrote:
>>> Thus, I think a jumbo patch should suffice. It will also ensure this can
>>> done quickly so that the community can then concentrate on 0.22 and beyond.
>>> 
>>> However, I will (manually) ensure all relevant jiras are referenced in
>>> the CHANGES.txt and Release Notes for folks to see the contents of the
>>> release. This is the hardest part of the exercise. Also, this ensures
>>> that we can track these jiras for 0.22 as Eli suggested.
>>> 
>>> Does that seem like a reasonable way forward? I'm happy to brainstorm.
>> 
>> We would not release this until each change in it has been reviewed by the community, right?  Otherwise we may end up with changes in a 0.20 release that don't get approved when they're contributed to trunk and cause trunk to regress.  So I don't yet see the point of committing the mega patch since the community needs to review each individual change anyway, so we might wait until each is reviewed to commit it.
> 
> Unless this is a code-only drop into a branch w/ no formal Apache release.  If that's the case then I'm +1 on letting them commit in this way this one time so we can all move on to 0.22.
> 
> Nige
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Nigel Daley <nd...@mac.com>.

On Jan 17, 2011, at 12:11 PM, Doug Cutting wrote:

> On 01/12/2011 11:07 PM, Arun C Murthy wrote:
>> Thus, I think a jumbo patch should suffice. It will also ensure this can
>> done quickly so that the community can then concentrate on 0.22 and beyond.
>> 
>> However, I will (manually) ensure all relevant jiras are referenced in
>> the CHANGES.txt and Release Notes for folks to see the contents of the
>> release. This is the hardest part of the exercise. Also, this ensures
>> that we can track these jiras for 0.22 as Eli suggested.
>> 
>> Does that seem like a reasonable way forward? I'm happy to brainstorm.
> 
> We would not release this until each change in it has been reviewed by the community, right?  Otherwise we may end up with changes in a 0.20 release that don't get approved when they're contributed to trunk and cause trunk to regress.  So I don't yet see the point of committing the mega patch since the community needs to review each individual change anyway, so we might wait until each is reviewed to commit it.

Unless this is a code-only drop into a branch w/ no formal Apache release.  If that's the case then I'm +1 on letting them commit in this way this one time so we can all move on to 0.22.

Nige

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Chris Douglas <cd...@apache.org>.

On Mon, Jan 17, 2011 at 12:11 PM, Doug Cutting <cu...@apache.org> wrote:
> We would not release this until each change in it has been reviewed by the
> community, right?  Otherwise we may end up with changes in a 0.20 release
> that don't get approved when they're contributed to trunk and cause trunk to
> regress.  So I don't yet see the point of committing the mega patch since
> the community needs to review each individual change anyway, so we might
> wait until each is reviewed to commit it.

I share this concern. Releasing an omnibus pile of commits in the 0.20
series will create an impossible situation for the mainline. Worse,
the alternative sifts through this pile over months, as the
refinements wrought by consensus require remerging and revalidating of
each issue. Every subsequent issue must also be reconsidered. The
product must then be deployed, tested, and its bugs fixed, just to get
a release as battle-hardened as this one. Signing up for all this work
when most every developer and user would rather see trunk proceed
would be madness.

However, the status quo is also unacceptable. Running any version of
Apache Hadoop is rare, when compared to the popularity of its
variants. We must find a solution to that. Hadoop is not in good shape
right now, and exceptional actions to correct it should not be cast
off lightly by valuing consistency over its future.

To address Nigel and Doug's concerns about compatibility, we should
consider a different release series. We wanted to postpone 1.0
discussions, but that would be one solution. If a secure 0.20 could be
released as 1.0, then if interest in this branch persists, append
could be a 1.1 release on this series,* etc. while 0.22 and its
successors can be 2.0 (as a rare benefit to the project split, one
could argue that "Hadoop" is the unified set, and the Common, HDFS,
and MapReduce projects could continue to release on the 0.x series
until we want to declare those a stable successor to 0.20). Version
numbers are pretty cheap, when compared to our time and focus.

* In the interim, a 0.20-append release would make all kinds of sense,
and fie on the niceties of naming.

> That said, posting the mega patch is useful, so that folks can start to pick
> it apart into separate issues.  Pushing your internal commits to a public
> github branch might also make that review process easier.

Pushing to github caused this problem. CDH rebased on YDH, and today
Apache Hadoop is considered less stable, less tested, and less usable
than either one of them. Why one would expect things to work
differently this time around is not clear. I assume we all agree it's
a poor outcome.

Arun already volunteered to break up the commits and push individual
patches to the repository, so the history is manageable. We allow CTR
for branches, though it's predicated on the assumption that that work
will be spread over weeks or months; development should not be batched
this way. However, by adding obstacles to an unambiguously positive
outcome, collaborators will be skeptical of engaging more deeply with
this community. Let's focus on making forward progress, not on
ensuring the requisite pain is felt. -C

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Doug Cutting <cu...@apache.org>.

On 01/12/2011 11:07 PM, Arun C Murthy wrote:
> Thus, I think a jumbo patch should suffice. It will also ensure this can
> done quickly so that the community can then concentrate on 0.22 and beyond.
>
> However, I will (manually) ensure all relevant jiras are referenced in
> the CHANGES.txt and Release Notes for folks to see the contents of the
> release. This is the hardest part of the exercise. Also, this ensures
> that we can track these jiras for 0.22 as Eli suggested.
>
> Does that seem like a reasonable way forward? I'm happy to brainstorm.

We would not release this until each change in it has been reviewed by 
the community, right?  Otherwise we may end up with changes in a 0.20 
release that don't get approved when they're contributed to trunk and 
cause trunk to regress.  So I don't yet see the point of committing the 
mega patch since the community needs to review each individual change 
anyway, so we might wait until each is reviewed to commit it.

That said, posting the mega patch is useful, so that folks can start to 
pick it apart into separate issues.  Pushing your internal commits to a 
public github branch might also make that review process easier.

Doug

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by "Tsz Wo (Nicholas), Sze" <s2...@yahoo.com>.

The text below is copied from http://httpd.apache.org/dev/release.html.  Not sure if it
helps.

What power does the RM yield?
Regarding what makes it into a release, the RM is the unquestioned authority. No 
one can contest what makes it into the release. The community will judge the 
release's quality after it has been issued, but the community can not force the 
RM to include a feature that they feel uncomfortable adding. Remember that this 
document is only a guideline to the community and future RMs - each RM may run a 
release in a different way. If you don't like what an RM is doing, start 
preparing for your own competing release.

Nicholas

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

Hi Eli,

Thanks for the suggestion.

+1 to Nigel and Arun's proposal.

I completely support the idea of creating a version of 20 with append for HBASE.  However, the append issue is very complicated and there does not exist any version of append that is certified against a workload as diverse as what this branch has been tested against.  I think you are trying to cross too many streams here.   If you have resources to help integrate any version of Hadoop 0.20 with append, package and test it, I fully support you doing so.  But that effort is not aligned with the goal of this branch, which is to share a substantial amount of fully integrated and tested work.  Members of the community have expressed interest in seeing this tested work get checked into Apache and I would like to share it.  Mashing it up with other patches would invalidate months of testing, defeating the purpose of the exercise.

If you are interested in integrating Append with this branch, why not create a 20.200 branch and do so?

Unless you are vetoing the sharing of work as is on a branch (the purpose of the branch), I suggest we move on.

Thanks,

E14

On Jan 13, 2011, at 8:23 PM, Arun C Murthy wrote:

> 
> On Jan 13, 2011, at 6:50 PM, Eli Collins wrote:
> 
>> The cdh3 patch set Todd is talking about is not vanilla 104.3, it's
>> 104.3 re-based onto 20.2 plus patches from branch-20 and trunk (the
>> performance and stability fixes I think you're referring to, at least
>> the ones that have been posted to Apache jira).
>> 
>> Can you post a pointer to the version you're referring to, eg on
>> github?  If there isn't a big delta between it and the cdh3 patch set
>> (which should have the 20-based patches from jira) perhaps you and
>> Todd could easily merge in the delta to create 0.20.x?
>> 
> 
> I can guarantee it will need work to merge the enhancements since  
> 20.104.3, it's over 6 months of development. The enhancements includes  
> work on stability such as iterative ls, limits on JT to prevent single  
> jobs/users from taking it down etc. and lots of bug-fixes to security.  
> So, unfortunately the delta is pretty large.
> 
> I'm working on a CHANGES.txt which should reflect all the changes i.e.  
> bug-fixes and enhancements.
> 
>>> The version I'm offering to push to the community has fixed all of  
>>> them,
>>> *plus* the added benefit of several stability and performance fixes  
>>> we have
>>> done since 20.104.3, almost 10 internal releases. This is a battle  
>>> tested
>>> and hardened version which we have deployed on 40,000+ nodes. It is a
>>> significant upgrade on 0.20.104.3 which we never deployed. I'm  
>>> pretty sure
>>> *some* users will find that valuable. ;)
>> 
>> Definitely, but better to hit two birds with one stone right?  Instead
>> of a security + enhancements release and an append release we could
>> have a single security + append + enhancements release and users don't
>> have to choose.
>> 
> 
> 
> We are discussing two options:
> 20 + security + enhancements
> 20 + security + append
> 
> I think the value we provide via 20+security+enhancements release is  
> that it's stable, tested and deployed at scale. Doing any more work  
> merging 6 months of work at Yahoo (again, I guarantee it's a lot of  
> work) will need a lots of cycles to validate, test and stabilize.
> 
> I feel the alternative is a distraction for me, I'd rather work on 0.22.
> 
> I can get 20+security+enhancements done very, very, quickly precisely  
> because I don't have to spend cycles testing it.
> 
> Does that make sense? Thanks for being patient and bearing with me...
> 
> Arun
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Nigel Daley <nd...@mac.com>.

Eric, Arun, I'd like to explicitly clarify one aspect of this branch and what you mean by 'release' -- it can have many meanings.

Are you asking to actually create an Apache release from this branch (binary & source)?  Or, as I was assuming, simply commit all this code to this branch and leave it there without a formal release so others can roll their own binary if they wish?

Thanks,
Nige


On Jan 14, 2011, at 10:30 AM, Eric Baldeschwieler wrote:

> Yup. Letting people who want to contribute, do so a good meme!
> 
> A stable next release would be great. But orgs do sustaining on stable code releases for a lot of very good reasons. 
> 
> A next Hadoop 21+ of this code quality is almost a year away in my opinion. 
> 
> ---
> E14 - via iPhone
> 
> On Jan 14, 2011, at 10:05 AM, "Jakob Homan" <jg...@gmail.com> wrote:
> 
>>> On another thread discussing hadoop-0.20-append as a separate branch, most people agreed that new features shouldn't be added to 0.20, now we have a major feature and we are all gung ho for it..
>> 
>> Not all are.  I'm against it for the all the same reasons I was
>> against 20 append.  This is also being used as a wedge to get the
>> append work in as .200.  My position is that every iota effort of
>> releasing another 20 branch is an iota not spent on getting us a
>> kick-ass 22.  20 was great, and we had a lot of wonderful times
>> together, but it's time to move on and see other releases.
>> 
>> But, this is a volunteer effort, and if others want to put the effort
>> in, they're free to do so.
>> -jg
>> 
>> On Fri, Jan 14, 2011 at 9:32 AM, Nigel Daley <nd...@mac.com> wrote:
>>> Yup, I'll say it again.  The process ain't perfect but it's good enough IMO. Thank you Yahoo! for your contribution.
>>> 
>>> Clearly these patch will need review before commit when going into trunk.
>>> 
>>> Let's move on to 0.22.
>>> 
>>> Nige
>>> 
>>> On Jan 14, 2011, at 9:20 AM, Konstantin Boudnik wrote:
>>> 
>>>> I tend to second most of Ian's points here.
>>>> 
>>>> On Fri, Jan 14, 2011 at 06:14, Ian Holsman <ha...@holsman.net> wrote:
>>>>> (with my Apache hat on)
>>>>> I'm -0.5 on doing this as one big mega-patch and not including append (as opposed to a series of smaller patches).
>>>> 
>>>> #1: we are creating a precedent of a "brain-dump" here. Although, it
>>>> isn't the first one in the history of OSS. Infamous Apple "patch" to
>>>> OpenBSD is another one ;)
>>>> 
>>>> #2: How to spell 'back door' any one?
>>>> 
>>>> #5: "almost 10 internal releases" Arun has mentioned above might be,
>>>> perhaps, considered as a great quality control effort. Also, not to
>>>> mention virtual impossibility to create a test plan to validate a
>>>> giant features patch.
>>>> 
>>>>> BTW, I'd like to point out a discrepancy here:
>>>>> 
>>>>> On another thread discussing hadoop-0.20-append as a separate branch, most people agreed that new features shouldn't be added to 0.20, now we have a major feature and we are all gung ho for it..
>>>> 
>>>> And this ^^^
>>>> 
>>>> But, hey I guess it's totally worth it!
>>>> Cos
>>>> 
>>>>> --Ian
>>>>> 
>>>>> On Jan 14, 2011, at 2:21 AM, Arun C Murthy wrote:
>>>>> 
>>>>>> 
>>>>>> On Jan 13, 2011, at 10:59 PM, Stack wrote:
>>>>>> 
>>>>>>> (Man, it was looking good there for a second when 0.20.100 was about
>>>>>>> security+append!)
>>>>>>> 
>>>>>>> Good luck w/ the release Arun.
>>>>>>> 
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>>> We might be following your 0.20.100 with a 0.20.200 append.
>>>>>>> 
>>>>>> 
>>>>>> Super!
>>>>>> 
>>>>>> Arun
>>>>> 
>>>>> 
>>> 
>>>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

Yup. Letting people who want to contribute do so is a good meme!

A stable next release would be great. But orgs do sustaining work on stable code releases for a lot of very good reasons.

A next Hadoop 21+ of this code quality is almost a year away in my opinion. 

---
E14 - via iPhone

On Jan 14, 2011, at 10:05 AM, "Jakob Homan" <jg...@gmail.com> wrote:

>> On another thread discussing hadoop-0.20-append as a separate branch, most people agreed that new features shouldn't be added to 0.20, now we have a major feature and we are all gung ho for it..
> 
> Not all are.  I'm against it for the all the same reasons I was
> against 20 append.  This is also being used as a wedge to get the
> append work in as .200.  My position is that every iota effort of
> releasing another 20 branch is an iota not spent on getting us a
> kick-ass 22.  20 was great, and we had a lot of wonderful times
> together, but it's time to move on and see other releases.
> 
> But, this is a volunteer effort, and if others want to put the effort
> in, they're free to do so.
> -jg
> 
> On Fri, Jan 14, 2011 at 9:32 AM, Nigel Daley <nd...@mac.com> wrote:
>> Yup, I'll say it again.  The process ain't perfect but it's good enough IMO. Thank you Yahoo! for your contribution.
>> 
>> Clearly these patch will need review before commit when going into trunk.
>> 
>> Let's move on to 0.22.
>> 
>> Nige
>> 
>> On Jan 14, 2011, at 9:20 AM, Konstantin Boudnik wrote:
>> 
>>> I tend to second most of Ian's points here.
>>> 
>>> On Fri, Jan 14, 2011 at 06:14, Ian Holsman <ha...@holsman.net> wrote:
>>>> (with my Apache hat on)
>>>> I'm -0.5 on doing this as one big mega-patch and not including append (as opposed to a series of smaller patches).
>>> 
>>> #1: we are creating a precedent of a "brain-dump" here. Although, it
>>> isn't the first one in the history of OSS. Infamous Apple "patch" to
>>> OpenBSD is another one ;)
>>> 
>>> #2: How to spell 'back door' any one?
>>> 
>>> #5: "almost 10 internal releases" Arun has mentioned above might be,
>>> perhaps, considered as a great quality control effort. Also, not to
>>> mention virtual impossibility to create a test plan to validate a
>>> giant features patch.
>>> 
>>>> BTW, I'd like to point out a discrepancy here:
>>>> 
>>>> On another thread discussing hadoop-0.20-append as a separate branch, most people agreed that new features shouldn't be added to 0.20, now we have a major feature and we are all gung ho for it..
>>> 
>>> And this ^^^
>>> 
>>> But, hey I guess it's totally worth it!
>>>  Cos
>>> 
>>>> --Ian
>>>> 
>>>> On Jan 14, 2011, at 2:21 AM, Arun C Murthy wrote:
>>>> 
>>>>> 
>>>>> On Jan 13, 2011, at 10:59 PM, Stack wrote:
>>>>> 
>>>>>> (Man, it was looking good there for a second when 0.20.100 was about
>>>>>> security+append!)
>>>>>> 
>>>>>> Good luck w/ the release Arun.
>>>>>> 
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>>> We might be following your 0.20.100 with a 0.20.200 append.
>>>>>> 
>>>>> 
>>>>> Super!
>>>>> 
>>>>> Arun
>>>> 
>>>> 
>> 
>>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Jakob Homan <jg...@gmail.com>.

> On another thread discussing hadoop-0.20-append as a separate branch, most people agreed that new features shouldn't be added to 0.20, now we have a major feature and we are all gung ho for it..

Not all are.  I'm against it for all the same reasons I was
against 20 append.  This is also being used as a wedge to get the
append work in as .200.  My position is that every iota of effort spent
releasing another 20 branch is an iota not spent on getting us a
kick-ass 22.  20 was great, and we had a lot of wonderful times
together, but it's time to move on and see other releases.

But, this is a volunteer effort, and if others want to put the effort
in, they're free to do so.
-jg

On Fri, Jan 14, 2011 at 9:32 AM, Nigel Daley <nd...@mac.com> wrote:
> Yup, I'll say it again.  The process ain't perfect but it's good enough IMO. Thank you Yahoo! for your contribution.
>
> Clearly these patch will need review before commit when going into trunk.
>
> Let's move on to 0.22.
>
> Nige
>
> On Jan 14, 2011, at 9:20 AM, Konstantin Boudnik wrote:
>
>> I tend to second most of Ian's points here.
>>
>> On Fri, Jan 14, 2011 at 06:14, Ian Holsman <ha...@holsman.net> wrote:
>>> (with my Apache hat on)
>>> I'm -0.5 on doing this as one big mega-patch and not including append (as opposed to a series of smaller patches).
>>
>> #1: we are creating a precedent of a "brain-dump" here. Although, it
>> isn't the first one in the history of OSS. Infamous Apple "patch" to
>> OpenBSD is another one ;)
>>
>> #2: How to spell 'back door' any one?
>>
>> #5: "almost 10 internal releases" Arun has mentioned above might be,
>> perhaps, considered as a great quality control effort. Also, not to
>> mention virtual impossibility to create a test plan to validate a
>> giant features patch.
>>
>>> BTW, I'd like to point out a discrepancy here:
>>>
>>> On another thread discussing hadoop-0.20-append as a separate branch, most people agreed that new features shouldn't be added to 0.20, now we have a major feature and we are all gung ho for it..
>>
>> And this ^^^
>>
>> But, hey I guess it's totally worth it!
>>  Cos
>>
>>> --Ian
>>>
>>> On Jan 14, 2011, at 2:21 AM, Arun C Murthy wrote:
>>>
>>>>
>>>> On Jan 13, 2011, at 10:59 PM, Stack wrote:
>>>>
>>>>> (Man, it was looking good there for a second when 0.20.100 was about
>>>>> security+append!)
>>>>>
>>>>> Good luck w/ the release Arun.
>>>>>
>>>>
>>>> Thanks!
>>>>
>>>>> We might be following your 0.20.100 with a 0.20.200 append.
>>>>>
>>>>
>>>> Super!
>>>>
>>>> Arun
>>>
>>>
>
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Ian Holsman <ha...@holsman.net>.

On Jan 14, 2011, at 12:32 PM, Nigel Daley wrote:

> Yup, I'll say it again.  The process ain't perfect but it's good enough IMO. Thank you Yahoo! for your contribution.

agree 100%.

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Nigel Daley <nd...@mac.com>.

Yup, I'll say it again.  The process ain't perfect but it's good enough IMO. Thank you Yahoo! for your contribution.

Clearly these patches will need review before commit when going into trunk.

Let's move on to 0.22.

Nige

On Jan 14, 2011, at 9:20 AM, Konstantin Boudnik wrote:

> I tend to second most of Ian's points here.
> 
> On Fri, Jan 14, 2011 at 06:14, Ian Holsman <ha...@holsman.net> wrote:
>> (with my Apache hat on)
>> I'm -0.5 on doing this as one big mega-patch and not including append (as opposed to a series of smaller patches).
> 
> #1: we are creating a precedent of a "brain-dump" here. Although, it
> isn't the first one in the history of OSS. Infamous Apple "patch" to
> OpenBSD is another one ;)
> 
> #2: How to spell 'back door' any one?
> 
> #5: "almost 10 internal releases" Arun has mentioned above might be,
> perhaps, considered as a great quality control effort. Also, not to
> mention virtual impossibility to create a test plan to validate a
> giant features patch.
> 
>> BTW, I'd like to point out a discrepancy here:
>> 
>> On another thread discussing hadoop-0.20-append as a separate branch, most people agreed that new features shouldn't be added to 0.20, now we have a major feature and we are all gung ho for it..
> 
> And this ^^^
> 
> But, hey I guess it's totally worth it!
>  Cos
> 
>> --Ian
>> 
>> On Jan 14, 2011, at 2:21 AM, Arun C Murthy wrote:
>> 
>>> 
>>> On Jan 13, 2011, at 10:59 PM, Stack wrote:
>>> 
>>>> (Man, it was looking good there for a second when 0.20.100 was about
>>>> security+append!)
>>>> 
>>>> Good luck w/ the release Arun.
>>>> 
>>> 
>>> Thanks!
>>> 
>>>> We might be following your 0.20.100 with a 0.20.200 append.
>>>> 
>>> 
>>> Super!
>>> 
>>> Arun
>> 
>>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Konstantin Boudnik <co...@apache.org>.

I tend to second most of Ian's points here.

On Fri, Jan 14, 2011 at 06:14, Ian Holsman <ha...@holsman.net> wrote:
> (with my Apache hat on)
> I'm -0.5 on doing this as one big mega-patch and not including append (as opposed to a series of smaller patches).

#1: we are creating a precedent of a "brain-dump" here. Although, it
isn't the first one in the history of OSS. Infamous Apple "patch" to
OpenBSD is another one ;)

#2: How to spell 'back door', anyone?

#5: "almost 10 internal releases" Arun has mentioned above might be,
perhaps, considered as a great quality control effort. Also, not to
mention the virtual impossibility of creating a test plan to validate a
giant feature patch.

> BTW, I'd like to point out a discrepancy here:
>
> On another thread discussing hadoop-0.20-append as a separate branch, most people agreed that new features shouldn't be added to 0.20, now we have a major feature and we are all gung ho for it..

And this ^^^

But, hey I guess it's totally worth it!
  Cos

> --Ian
>
> On Jan 14, 2011, at 2:21 AM, Arun C Murthy wrote:
>
>>
>> On Jan 13, 2011, at 10:59 PM, Stack wrote:
>>
>>> (Man, it was looking good there for a second when 0.20.100 was about
>>> security+append!)
>>>
>>> Good luck w/ the release Arun.
>>>
>>
>> Thanks!
>>
>>> We might be following your 0.20.100 with a 0.20.200 append.
>>>
>>
>> Super!
>>
>> Arun
>
>

RE: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by "Severance, Steve" <ss...@ebay.com>.

I want to thank Yahoo! for this release. At eBay we are very excited about the opportunity to test a build of Hadoop that has already been extensively field tested on large clusters. At eBay we are primarily concerned with cluster availability and throughput so having a build like this available to the community is a huge win.

Hats off to Arun, Eric and everyone at Yahoo! for releasing this.

Steve

-----Original Message-----
From: Eric Baldeschwieler [mailto:eric14@yahoo-inc.com] 
Sent: Friday, January 14, 2011 10:25 AM
To: general@hadoop.apache.org
Cc: general@hadoop.apache.org
Subject: Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Hi Ian,

Thanks for holding off on that last .5. I've been working on a big email giving more context on this. Let me preview some issues.

Our goal with this branch is twofold: 1) get the code out in a branch quickly so we can collaborate on it with the community. 2) not change the character of the code. See testing below. We're happy to compromise on any other dimension, as long as we can do 1&2 above.

1) I agree this is not a good precedent. We don't support mega-patches in general. We are doing this as part of discontinuing the "yahoo distribution of Hadoop".  We don't plan to continue doing 30 person year projects outside apache and then merging them in!!

2) Append is hard. It is so hard we rewrote the entire write pipeline (5 person-years of work) in trunk after giving up on the codeline you are suggesting we merge in. That work is what distinguishes all post-20 releases from 20 releases in my mind. I don't trust the 20 append code line. We've been hurt badly by it.  We did the rewrite only after losing a bunch of production data a bunch of times with the previous code line.  I think the various 20 append patch lines may be fine for specialized HBase clusters, but they don't have the rigor behind them to bet your business on them.

3) I think having a very stable recent codeline available for teams coming into Hadoop who want to run big business apps and contribute code back is very helpful. I've been talking to folks in other orgs and they've expressed a huge amount of interest in this work, but begged us to put it into apache, so their oversight bodies will let them use it. 

4) We're happy to incorporate ideas on how best to merge the work into trunk. Let's find the most cost effective way to preserve the most devel data possible.

5) Testing. Ian, I think you do us a disservice when you talk about us just testing in our environments. If you look at the history of the project, we've been the force behind every stable release of Apache Hadoop.  And all the non-Apache Hadoop releases have been tracking this patch set. We fully support the community developing independent testing capabilities.  We plan to contribute to that effort.  But we are the organization with far and away the best record for testing Hadoop.

We are proud of this release; we want to share it. Help us sort out how.

Thanks!

---
E14 - via iPhone

On Jan 14, 2011, at 6:15 AM, "Ian Holsman" <ha...@holsman.net> wrote:

> (with my Apache hat on)
> I'm -0.5 on doing this as one big mega-patch and not including append (as opposed to a series of smaller patches).
> 
> for the following reasons:
> 
> 1. It encourages bad behavior. We want discussion (and development) to happen on the lists, not in some office. By allowing these large code-dumps it condones this behavior, and we will likely see it again and again. Like it or not, this is not the apache model of open source governance. 
> 
> 2. There is a risk that some code that is not in a JIRA or separate patch creeps in unwittingly. This isn't a major deal per se, but we don't really have the proper paper trail, or the documentation on what bug it fixed etc etc.
> 
> 3. Other groups (Facebook for example) are running with their own set of patches. They currently have the luxury of examining each individual patch to decide if they want to integrate it (and test it) in their environment. We are forcing them to do the work of finding the bits they want in this huge patch.
> 
> 4. By not including the append patch, we are making this release unusable for a large portion of our community who run hbase.
> 
> 5. It makes it very hard to test. While It makes me comfortable that it has gone through Yahoo!'s QA and is running in their environments, it doesn't mean that it will work in other organizations who have different workload mixes and software running on them. With one huge patch it makes it all or nothing.. either they take the code-drop and perform a large QA-integration effort, or they forgo the whole patch together.
> 
> 
> **BUT** we have both the Yahoo! & Cloudera guys happy to do it, and to spend their time doing it.. so I think having the code-drop will put us in a better place then where we are.
> 
> 
> BTW, I'd like to point out a discrepancy here:
> 
> On another thread discussing hadoop-0.20-append as a separate branch, most people agreed that new features shouldn't be added to 0.20, now we have a major feature and we are all gung ho for it.. 
> 
> --Ian
> 
> On Jan 14, 2011, at 2:21 AM, Arun C Murthy wrote:
> 
>> 
>> On Jan 13, 2011, at 10:59 PM, Stack wrote:
>> 
>>> (Man, it was looking good there for a second when 0.20.100 was about
>>> security+append!)
>>> 
>>> Good luck w/ the release Arun.
>>> 
>> 
>> Thanks!
>> 
>>> We might be following your 0.20.100 with a 0.20.200 append.
>>> 
>> 
>> Super!
>> 
>> Arun
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

Hi Stack,

I feel your pain.  We're running a 700-node HBASE cluster containing a HUGE collection of all web pages.  Both versions of append were started by engineers working at Yahoo and we've put A LOT of investment into both.  I really, really want to see the append issue solved for HBASE!!  

My point is simply that we need to separate our concerns.  I would 300% support a community of folks building a 0.20-derived version of Hadoop with append, and we know that any new release post 0.20 will contain an append solution.  This branch is more backwards facing.  We are simply trying to share our last two years of 0.20 experience with the community, so that a) folks can use it if they find value in it, and b) this work can be merged into future Hadoop releases (that will have append).

We want to share what we have tested, since we believe that the testing is a good chunk of our contribution.

Thanks,

E14

On Jan 16, 2011, at 2:57 PM, Stack wrote:

> On Fri, Jan 14, 2011 at 10:25 AM, Eric Baldeschwieler
> <er...@yahoo-inc.com> wrote:
>> 2) append is hard. It is so hard we rewrote the entire write pipeline (5 person-years work) in trunk after giving up on the codeline you are suggesting we merge in. That work is what distinguishes all post 20 releases from 20 releases in my mind. I dont trust the 20 append code line. We've been hurt badly by it.  We did the rewrite only after losing a bunch of production data a bunch of times with the previous code line.  I think the various 20 append patch lines may be fine for specialized hbase clusters, but they doesn't have the rigor behind them to bet your business in them.
>> 
> 
> Eric:
> 
> A few comments on the above:
> 
> + Append has had a bunch of work done on it since the Y! dataloss of a
> few years ago on an ancestor of the branch-0.20-append codebase (IIRC
> the issue you refer to in particular -- the 'dataloss' because
> partially written blocks were done up in tmp dirs, and on cluster
> restart, tmp data was cleared -- has been fixed in
> branch-0.20.append).
> + You may not trust 0.20-append (or its close cousin over in CDH) but
> a bunch of HBasers do. On the one hand, we have little choice.  Until
> the *new* append becomes available in a stable Hadoop the HBase
> project has had to sustain itself (What you think?, 3-6 months before
> we see 0.22?  HBase project can't hold its breath that long).  On
> other hand, the branch-0.20-append work has been carried out by lads
> (and lasses!) who know their HDFS.  Its true that it will not have
> been tested with Y! rigor but near-derivatives -- CDH or the FB
> branches -- already do HDFS-200-based append in production.
> 
> St.Ack
> P.S. Don't get me wrong.  HBase is looking forward to *new* append.
> We just need something to suck on meantime.

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Stack <st...@duboce.net>.

On Fri, Jan 14, 2011 at 10:25 AM, Eric Baldeschwieler
<er...@yahoo-inc.com> wrote:
> 2) append is hard. It is so hard we rewrote the entire write pipeline (5 person-years work) in trunk after giving up on the codeline you are suggesting we merge in. That work is what distinguishes all post 20 releases from 20 releases in my mind. I dont trust the 20 append code line. We've been hurt badly by it. We did the rewrite only after losing a bunch of production data a bunch of times with the previous code line. I think the various 20 append patch lines may be fine for specialized hbase clusters, but they doesn't have the rigor behind them to bet your business in them.
>

Eric:

A few comments on the above:

+ Append has had a bunch of work done on it since the Y! dataloss of a
few years ago on an ancestor of the branch-0.20-append codebase (IIRC
the issue you refer to in particular -- the 'dataloss' because
partially written blocks were done up in tmp dirs, and on cluster
restart, tmp data was cleared -- has been fixed in
branch-0.20-append).
+ You may not trust 0.20-append (or its close cousin over in CDH) but
a bunch of HBasers do. On the one hand, we have little choice. Until
the *new* append becomes available in a stable Hadoop, the HBase
project has had to sustain itself (What do you think? 3-6 months before
we see 0.22? The HBase project can't hold its breath that long). On
the other hand, the branch-0.20-append work has been carried out by lads
(and lasses!) who know their HDFS. It's true that it will not have
been tested with Y! rigor but near-derivatives -- CDH or the FB
branches -- already do HDFS-200-based append in production.

St.Ack
P.S. Don't get me wrong. HBase is looking forward to *new* append.
We just need something to suck on in the meantime.

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Scott Carey <sc...@richrelevance.com>.

On 1/14/11 11:24 AM, "Dhruba Borthakur" <dh...@gmail.com> wrote:

>>
>>
>> 1) I agree this is not a good precedent. We don't support mega-patches
>>in
>> general. We are doing this as part of discontinuing the "yahoo
>>distribution
>> of Hadoop".  We don't plan to continue doing 30 person year projects
>>outside
>> apache and then merging them in!!
>>
>>
>I think this is a very dangerous precedent and completely unwarranted.
>mega-patches are bad and is totally not the Apache way to go. I think if
>you
>want to contribute it back to Apache, you should avoid the mega-patch
>completely.
>

The mega-patch is not being applied to Trunk, or even the common 0.20.x
branch, so its danger is significantly mitigated.

If there is still a lot of worry about the mega-patch, there is one other
compromise:

* Take Cloudera's linearization of the Y! patches that go from 0.20.2 to
0.20.104.3 and commit them individually.
* Then take a mini-mega patch from there to the latest Y!.

That shouldn't be too hard, and meets Arun's goal of not changing the
character of the code so that testing is minimized/eliminated.  And it
incorporates some hard work on the Cloudera side that will be useful if
debugging on that branch is necessary.

I want to see as much work as possible on 0.22 -- there are major
improvements there that all can share and get the community more unified
again.  One drawback of this release is that it could encourage the community
to squat on 0.20 even longer...  But sharing all that work can also be seen as
a necessary step to being able to let go of 0.20 and move on.

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Milind Bhandarkar <mb...@linkedin.com>.

Dhruba,

While I do not think that the releasability of a branch should be determined by the market-cap (either on nasdaq or second-market) of the contributing company, I think a well-tested release is beneficial to the community.

So, I support two releases: 20.100 now, which has security, and 20.200 later, which incorporates appends (depending on the 0.22+appends timeline). That way, a large percentage of the community is covered in 2011.

The reasons are these:

1. The proposed 20.100 is perhaps the most tested at scale of all 0.20 branches; in fact, of *all* Hadoop releases in the last 5 years. I know first hand that it causes the least disruption for users: the migration from 0.20 to 0.20.10x was the smoothest, while adding a valuable feature.

2. HBase (running on hadoop 0.20 with append) has also been scale tested at Y!, but on much less than 4000 nodes, and certainly not for varied workloads (where the bugs tend to surface). (To my knowledge, the largest HBase instance is at Y! in production.)

3. Operations folks need to get some experience with raw Hadoop first for any release, before adding other products on top of Hadoop, and only then hand over the installation to users. So, there is still time for HBase+0.20.100, and that can be addressed in a separate release.

4. It is not as if the community hasn't had a preview of this mega-patch already. A large portion of the sub-patches are already in cdh3bx, and many of them have already been committed one-by-one to 0.22.

- Milind

On Jan 14, 2011, at 11:24 AM, Dhruba Borthakur wrote:

>> 
>> 
>> 1) I agree this is not a good precedent. We don't support mega-patches in
>> general. We are doing this as part of discontinuing the "yahoo distribution
>> of Hadoop".  We don't plan to continue doing 30 person year projects outside
>> apache and then merging them in!!
>> 
>> 
> I think this is a very dangerous precedent and completely unwarranted.
> mega-patches are bad and is totally not the Apache way to go. I think if you
> want to contribute it back to Apache, you should avoid the mega-patch
> completely.
> 
> 
>  I think the various 20 append patch lines may be fine for specialized
>> hbase clusters, but they doesn't have the rigor behind them to bet your
>> business in them.
>> 
>> 
> I think you are completely off-track here and jumping to conclusions. Big
> business are already betting on it. HBase is becoming a big user of Hadoop
> (dunno whether Y! uses HBase) and I completely agree with Ian that all
> business have to anyway test their release themselves before using it,
> otherwise you could land up with data loss like the type you mentioned.
> 
> thanks,
> dhruba

---
Milind Bhandarkar
mbhandarkar@linkedin.com

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Dhruba Borthakur <dh...@gmail.com>.

>
>
> 1) I agree this is not a good precedent. We don't support mega-patches in
> general. We are doing this as part of discontinuing the "yahoo distribution
> of Hadoop".  We don't plan to continue doing 30 person year projects outside
> apache and then merging them in!!
>
>
I think this is a very dangerous precedent and completely unwarranted.
Mega-patches are bad and are totally not the Apache way to go. I think if you
want to contribute it back to Apache, you should avoid the mega-patch
completely.


> I think the various 20 append patch lines may be fine for specialized
> hbase clusters, but they doesn't have the rigor behind them to bet your
> business in them.
>
>
I think you are completely off-track here and jumping to conclusions. Big
businesses are already betting on it. HBase is becoming a big user of Hadoop
(dunno whether Y! uses HBase) and I completely agree with Ian that all
businesses have to test their release themselves anyway before using it,
otherwise you could land up with data loss like the type you mentioned.

 thanks,
dhruba

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

Hi Ian,

Thanks for holding off on that last .5. I've been working on a big email giving more context on this. Let me preview some issues. 

Our goal with this branch is twofold: 1) get the code out in a branch quickly so we can collaborate on it with the community. 2) not change the character of the code. See testing below. We're happy to compromise on any other dimension, as long as we can do 1&2 above. 

1) I agree this is not a good precedent. We don't support mega-patches in general. We are doing this as part of discontinuing the "yahoo distribution of Hadoop".  We don't plan to continue doing 30 person year projects outside apache and then merging them in!!

2) Append is hard. It is so hard we rewrote the entire write pipeline (5 person-years of work) in trunk after giving up on the codeline you are suggesting we merge in. That work is what distinguishes all post-20 releases from 20 releases in my mind. I don't trust the 20 append code line. We've been hurt badly by it.  We did the rewrite only after losing a bunch of production data a bunch of times with the previous code line.  I think the various 20 append patch lines may be fine for specialized HBase clusters, but they don't have the rigor behind them to bet your business on them.

3) I think having a very stable recent codeline available for teams coming into Hadoop who want to run big business apps and contribute code back is very helpful. I've been talking to folks in other orgs and they've expressed a huge amount of interest in this work, but begged us to put it into apache, so their oversight bodies will let them use it. 

4) We're happy to incorporate ideas on how best to merge the work into trunk. Let's find the most cost-effective way to preserve the most devel data possible. 

5) Testing. Ian, I think you do us a disservice when you talk about us just testing in our environments. If you look at the history of the project, we've been the force behind every stable release of Apache Hadoop.  And all the non-Apache Hadoop releases have been tracking this patch set. We fully support the community developing independent testing capabilities.  We plan to contribute to that effort.  But we are the organization with far and away the best record for testing Hadoop. 

We are proud of this release and we want to share it. Help us sort out how. 

Thanks!

---
E14 - via iPhone

On Jan 14, 2011, at 6:15 AM, "Ian Holsman" <ha...@holsman.net> wrote:

> (with my Apache hat on)
> I'm -0.5 on doing this as one big mega-patch and not including append (as opposed to a series of smaller patches).
> 
> for the following reasons:
> 
> 1. It encourages bad behavior. We want discussion (and development) to happen on the lists, not in some office. By allowing these large code-dumps it condones this behavior, and we will likely see it again and again. Like it or not, this is not the apache model of open source governance. 
> 
> 2. There is a risk that some code that is not in a JIRA or separate patch creeps in unwittingly. This isn't a major deal per se, but we don't really have the proper paper trail, or the documentation on what bug it fixed etc etc.
> 
> 3. Other groups (Facebook for example) are running with their own set of patches. They currently have the luxury of examining each individual patch to decide if they want to integrate it (and test it) in their environment. We are forcing them to do the work of finding the bits they want in this huge patch.
> 
> 4. By not including the append patch, we are making this release unusable for a large portion of our community who run hbase.
> 
> 5. It makes it very hard to test. While It makes me comfortable that it has gone through Yahoo!'s QA and is running in their environments, it doesn't mean that it will work in other organizations who have different workload mixes and software running on them. With one huge patch it makes it all or nothing.. either they take the code-drop and perform a large QA-integration effort, or they forgo the whole patch together.
> 
> 
> **BUT** we have both the Yahoo! & Cloudera guys happy to do it, and to spend their time doing it.. so I think having the code-drop will put us in a better place then where we are.
> 
> 
> BTW, I'd like to point out a discrepancy here:
> 
> On another thread discussing hadoop-0.20-append as a separate branch, most people agreed that new features shouldn't be added to 0.20, now we have a major feature and we are all gung ho for it.. 
> 
> --Ian
> 
> On Jan 14, 2011, at 2:21 AM, Arun C Murthy wrote:
> 
>> 
>> On Jan 13, 2011, at 10:59 PM, Stack wrote:
>> 
>>> (Man, it was looking good there for a second when 0.20.100 was about
>>> security+append!)
>>> 
>>> Good luck w/ the release Arun.
>>> 
>> 
>> Thanks!
>> 
>>> We might be following your 0.20.100 with a 0.20.200 append.
>>> 
>> 
>> Super!
>> 
>> Arun
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Ian Holsman <ha...@holsman.net>.

(with my Apache hat on)
I'm -0.5 on doing this as one big mega-patch and not including append (as opposed to a series of smaller patches).

for the following reasons:

1. It encourages bad behavior. We want discussion (and development) to happen on the lists, not in some office. Allowing these large code-dumps condones this behavior, and we will likely see it again and again. Like it or not, this is not the Apache model of open source governance. 

2. There is a risk that some code that is not in a JIRA or separate patch creeps in unwittingly. This isn't a major deal per se, but we don't really have the proper paper trail, or the documentation on what bug it fixed etc etc.

3. Other groups (Facebook for example) are running with their own set of patches. They currently have the luxury of examining each individual patch to decide if they want to integrate it (and test it) in their environment. We are forcing them to do the work of finding the bits they want in this huge patch.

4. By not including the append patch, we are making this release unusable for a large portion of our community who run hbase.

5. It makes it very hard to test. While it makes me comfortable that it has gone through Yahoo!'s QA and is running in their environments, it doesn't mean that it will work in other organizations that have different workload mixes and software running on them. One huge patch makes it all or nothing: either they take the code-drop and perform a large QA-integration effort, or they forgo the whole patch altogether.

**BUT** we have both the Yahoo! & Cloudera guys happy to do it, and to spend their time doing it.. so I think having the code-drop will put us in a better place than where we are.

BTW, I'd like to point out a discrepancy here:

On another thread discussing hadoop-0.20-append as a separate branch, most people agreed that new features shouldn't be added to 0.20; now we have a major feature and we are all gung ho for it.. 

--Ian

On Jan 14, 2011, at 2:21 AM, Arun C Murthy wrote:

> 
> On Jan 13, 2011, at 10:59 PM, Stack wrote:
> 
>> (Man, it was looking good there for a second when 0.20.100 was about
>> security+append!)
>> 
>> Good luck w/ the release Arun.
>> 
> 
> Thanks!
> 
>> We might be following your 0.20.100 with a 0.20.200 append.
>> 
> 
> Super!
> 
> Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Jan 13, 2011, at 10:59 PM, Stack wrote:

> (Man, it was looking good there for a second when 0.20.100 was about
> security+append!)
>
> Good luck w/ the release Arun.
>

Thanks!

> We might be following your 0.20.100 with a 0.20.200 append.
>

Super!

Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

I'd love to see that!

On Jan 13, 2011, at 10:59 PM, Stack wrote:

> (Man, it was looking good there for a second when 0.20.100 was about
> security+append!)
> 
> Good luck w/ the release Arun.
> 
> We might be following your 0.20.100 with a 0.20.200 append.
> 
> St.Ack

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Stack <st...@duboce.net>.

(Man, it was looking good there for a second when 0.20.100 was about
security+append!)

Good luck w/ the release Arun.

We might be following your 0.20.100 with a 0.20.200 append.

St.Ack

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ar...@yahoo-inc.com>.

No worries. Thanks to both Eli & Todd for the discussion. 

I look forward to getting this done and moving ahead to 0.22 and beyond.

thanks,
Arun

On Jan 13, 2011, at 10:29 PM, "Eli Collins" <el...@cloudera.com> wrote:

> Sorry for rattling you guys, definitely wasn't discussing a veto.  I'm
> absolutely not opposed, just thought the alternative Todd raised was
> worth a couple emails since users have requested both security and
> append, and such a branch that includes both of those plus
> enhancements and substantial testing exists.
> 
> Arun - I appreciate all the info, looking forward to the release.
> 
> Thanks,
> Eli
> 
> On Thu, Jan 13, 2011 at 10:21 PM, Arun C Murthy <ar...@yahoo-inc.com> wrote:
>> *nod* Ok.
>> 
>> Arun
>> 
>> On Jan 13, 2011, at 10:08 PM, "Nigel Daley" <nd...@mac.com> wrote:
>> 
>>> I say just do it.  Eli said it wasn't a blocker. Sure it ain't perfect, but it's good enough.
>>> 
>>> Let's move on to 0.22 and beyond.
>>> 
>>> Nige
>>> 
>>> On Jan 13, 2011, at 8:23 PM, Arun C Murthy wrote:
>>> 
>>>> 
>>>> On Jan 13, 2011, at 6:50 PM, Eli Collins wrote:
>>>> 
>>>>> The cdh3 patch set Todd is talking about is not vanilla 104.3, it's
>>>>> 104.3 re-based onto 20.2 plus patches from branch-20 and trunk (the
>>>>> performance and stability fixes I think you're referring to, at least
>>>>> the ones that have been posted to Apache jira).
>>>>> 
>>>>> Can you post a pointer to the version you're referring to, eg on
>>>>> github?  If there isn't a big delta between it and the cdh3 patch set
>>>>> (which should have the 20-based patches from jira) perhaps you and
>>>>> Todd could easily merge in the delta to create 0.20.x?
>>>>> 
>>>> 
>>>> I can guarantee it will need work to merge the enhancements since 20.104.3, it's over 6 months of development. The enhancements includes work on stability such as iterative ls, limits on JT to prevent single jobs/users from taking it down etc. and lots of bug-fixes to security. So, unfortunately the delta is pretty large.
>>>> 
>>>> I'm working on a CHANGES.txt which should reflect all the changes i.e. bug-fixes and enhancements.
>>>> 
>>>>>> The version I'm offering to push to the community has fixed all of them,
>>>>>> *plus* the added benefit of several stability and performance fixes we have
>>>>>> done since 20.104.3, almost 10 internal releases. This is a battle tested
>>>>>> and hardened version which we have deployed on 40,000+ nodes. It is a
>>>>>> significant upgrade on 0.20.104.3 which we never deployed. I'm pretty sure
>>>>>> *some* users will find that valuable. ;)
>>>>> 
>>>>> Definitely, but better to hit two birds with one stone right?  Instead
>>>>> of a security + enhancements release and an append release we could
>>>>> have a single security + append + enhancements release and users don't
>>>>> have to choose.
>>>>> 
>>>> 
>>>> 
>>>> We are discussing two options:
>>>> 20 + security + enhancements
>>>> 20 + security + append
>>>> 
>>>> I think the value we provide via 20+security+enhancements release is that it's stable, tested and deployed at scale. Doing any more work merging 6 months of work at Yahoo (again, I guarantee it's a lot of work) will need a lots of cycles to validate, test and stabilize.
>>>> 
>>>> I feel the alternative is a distraction for me, I'd rather work on 0.22.
>>>> 
>>>> I can get 20+security+enhancements done very, very, quickly precisely because I don't have to spend cycles testing it.
>>>> 
>>>> Does that make sense? Thanks for being patient and bearing with me...
>>>> 
>>>> Arun
>>>> 
>>> 
>>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Todd Lipcon <to...@cloudera.com>.

On Thu, Jan 13, 2011 at 10:29 PM, Eli Collins <el...@cloudera.com> wrote:

> Sorry for rattling you guys, definitely wasn't discussing a veto.  I'm
> absolutely not opposed, just thought the alternative Todd raised was
> worth a couple emails since users have requested both security and
> append, and such a branch that includes both of those plus
> enhancements and substantial testing exists.
>
> Arun - I appreciate all the info, looking forward to the release.
>
>
Same here.

Back to the patch queue for me! 0.22 here we come.

-Todd


>  On Thu, Jan 13, 2011 at 10:21 PM, Arun C Murthy <ar...@yahoo-inc.com>
> wrote:
> > *nod* Ok.
> >
> > Arun
> >
> > On Jan 13, 2011, at 10:08 PM, "Nigel Daley" <nd...@mac.com> wrote:
> >
> >> I say just do it.  Eli said it wasn't a blocker. Sure it ain't perfect,
> but it's good enough.
> >>
> >> Let's move on to 0.22 and beyond.
> >>
> >> Nige
> >>
> >> On Jan 13, 2011, at 8:23 PM, Arun C Murthy wrote:
> >>
> >>>
> >>> On Jan 13, 2011, at 6:50 PM, Eli Collins wrote:
> >>>
> >>>> The cdh3 patch set Todd is talking about is not vanilla 104.3, it's
> >>>> 104.3 re-based onto 20.2 plus patches from branch-20 and trunk (the
> >>>> performance and stability fixes I think you're referring to, at least
> >>>> the ones that have been posted to Apache jira).
> >>>>
> >>>> Can you post a pointer to the version you're referring to, eg on
> >>>> github?  If there isn't a big delta between it and the cdh3 patch set
> >>>> (which should have the 20-based patches from jira) perhaps you and
> >>>> Todd could easily merge in the delta to create 0.20.x?
> >>>>
> >>>
> >>> I can guarantee it will need work to merge the enhancements since
> 20.104.3, it's over 6 months of development. The enhancements includes work
> on stability such as iterative ls, limits on JT to prevent single jobs/users
> from taking it down etc. and lots of bug-fixes to security. So,
> unfortunately the delta is pretty large.
> >>>
> >>> I'm working on a CHANGES.txt which should reflect all the changes i.e.
> bug-fixes and enhancements.
> >>>
> >>>>> The version I'm offering to push to the community has fixed all of
> them,
> >>>>> *plus* the added benefit of several stability and performance fixes
> we have
> >>>>> done since 20.104.3, almost 10 internal releases. This is a battle
> tested
> >>>>> and hardened version which we have deployed on 40,000+ nodes. It is a
> >>>>> significant upgrade on 0.20.104.3 which we never deployed. I'm pretty
> sure
> >>>>> *some* users will find that valuable. ;)
> >>>>
> >>>> Definitely, but better to hit two birds with one stone right?  Instead
> >>>> of a security + enhancements release and an append release we could
> >>>> have a single security + append + enhancements release and users don't
> >>>> have to choose.
> >>>>
> >>>
> >>>
> >>> We are discussing two options:
> >>> 20 + security + enhancements
> >>> 20 + security + append
> >>>
> >>> I think the value we provide via 20+security+enhancements release is
> that it's stable, tested and deployed at scale. Doing any more work merging
> 6 months of work at Yahoo (again, I guarantee it's a lot of work) will need
> a lots of cycles to validate, test and stabilize.
> >>>
> >>> I feel the alternative is a distraction for me, I'd rather work on
> 0.22.
> >>>
> >>> I can get 20+security+enhancements done very, very, quickly precisely
> because I don't have to spend cycles testing it.
> >>>
> >>> Does that make sense? Thanks for being patient and bearing with me...
> >>>
> >>> Arun
> >>>
> >>
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eli Collins <el...@cloudera.com>.

Sorry for rattling you guys, definitely wasn't discussing a veto.  I'm
absolutely not opposed, just thought the alternative Todd raised was
worth a couple emails since users have requested both security and
append, and such a branch that includes both of those plus
enhancements and substantial testing exists.

Arun - I appreciate all the info, looking forward to the release.

Thanks,
Eli

On Thu, Jan 13, 2011 at 10:21 PM, Arun C Murthy <ar...@yahoo-inc.com> wrote:
> *nod* Ok.
>
> Arun
>
> On Jan 13, 2011, at 10:08 PM, "Nigel Daley" <nd...@mac.com> wrote:
>
>> I say just do it.  Eli said it wasn't a blocker. Sure it ain't perfect, but it's good enough.
>>
>> Let's move on to 0.22 and beyond.
>>
>> Nige
>>
>> On Jan 13, 2011, at 8:23 PM, Arun C Murthy wrote:
>>
>>>
>>> On Jan 13, 2011, at 6:50 PM, Eli Collins wrote:
>>>
>>>> The cdh3 patch set Todd is talking about is not vanilla 104.3, it's
>>>> 104.3 re-based onto 20.2 plus patches from branch-20 and trunk (the
>>>> performance and stability fixes I think you're referring to, at least
>>>> the ones that have been posted to Apache jira).
>>>>
>>>> Can you post a pointer to the version you're referring to, eg on
>>>> github?  If there isn't a big delta between it and the cdh3 patch set
>>>> (which should have the 20-based patches from jira) perhaps you and
>>>> Todd could easily merge in the delta to create 0.20.x?
>>>>
>>>
>>> I can guarantee it will need work to merge the enhancements since 20.104.3, it's over 6 months of development. The enhancements includes work on stability such as iterative ls, limits on JT to prevent single jobs/users from taking it down etc. and lots of bug-fixes to security. So, unfortunately the delta is pretty large.
>>>
>>> I'm working on a CHANGES.txt which should reflect all the changes i.e. bug-fixes and enhancements.
>>>
>>>>> The version I'm offering to push to the community has fixed all of them,
>>>>> *plus* the added benefit of several stability and performance fixes we have
>>>>> done since 20.104.3, almost 10 internal releases. This is a battle tested
>>>>> and hardened version which we have deployed on 40,000+ nodes. It is a
>>>>> significant upgrade on 0.20.104.3 which we never deployed. I'm pretty sure
>>>>> *some* users will find that valuable. ;)
>>>>
>>>> Definitely, but better to hit two birds with one stone right?  Instead
>>>> of a security + enhancements release and an append release we could
>>>> have a single security + append + enhancements release and users don't
>>>> have to choose.
>>>>
>>>
>>>
>>> We are discussing two options:
>>> 20 + security + enhancements
>>> 20 + security + append
>>>
>>> I think the value we provide via 20+security+enhancements release is that it's stable, tested and deployed at scale. Doing any more work merging 6 months of work at Yahoo (again, I guarantee it's a lot of work) will need a lots of cycles to validate, test and stabilize.
>>>
>>> I feel the alternative is a distraction for me, I'd rather work on 0.22.
>>>
>>> I can get 20+security+enhancements done very, very, quickly precisely because I don't have to spend cycles testing it.
>>>
>>> Does that make sense? Thanks for being patient and bearing with me...
>>>
>>> Arun
>>>
>>
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ar...@yahoo-inc.com>.

*nod* Ok.

Arun

On Jan 13, 2011, at 10:08 PM, "Nigel Daley" <nd...@mac.com> wrote:

> I say just do it.  Eli said it wasn't a blocker. Sure it ain't perfect, but it's good enough.
> 
> Let's move on to 0.22 and beyond.
> 
> Nige
> 
> On Jan 13, 2011, at 8:23 PM, Arun C Murthy wrote:
> 
>> 
>> On Jan 13, 2011, at 6:50 PM, Eli Collins wrote:
>> 
>>> The cdh3 patch set Todd is talking about is not vanilla 104.3, it's
>>> 104.3 re-based onto 20.2 plus patches from branch-20 and trunk (the
>>> performance and stability fixes I think you're referring to, at least
>>> the ones that have been posted to Apache jira).
>>> 
>>> Can you post a pointer to the version you're referring to, eg on
>>> github?  If there isn't a big delta between it and the cdh3 patch set
>>> (which should have the 20-based patches from jira) perhaps you and
>>> Todd could easily merge in the delta to create 0.20.x?
>>> 
>> 
>> I can guarantee it will need work to merge the enhancements since 20.104.3, it's over 6 months of development. The enhancements includes work on stability such as iterative ls, limits on JT to prevent single jobs/users from taking it down etc. and lots of bug-fixes to security. So, unfortunately the delta is pretty large.
>> 
>> I'm working on a CHANGES.txt which should reflect all the changes i.e. bug-fixes and enhancements.
>> 
>>>> The version I'm offering to push to the community has fixed all of them,
>>>> *plus* the added benefit of several stability and performance fixes we have
>>>> done since 20.104.3, almost 10 internal releases. This is a battle tested
>>>> and hardened version which we have deployed on 40,000+ nodes. It is a
>>>> significant upgrade on 0.20.104.3 which we never deployed. I'm pretty sure
>>>> *some* users will find that valuable. ;)
>>> 
>>> Definitely, but better to hit two birds with one stone right?  Instead
>>> of a security + enhancements release and an append release we could
>>> have a single security + append + enhancements release and users don't
>>> have to choose.
>>> 
>> 
>> 
>> We are discussing two options:
>> 20 + security + enhancements
>> 20 + security + append
>> 
>> I think the value we provide via 20+security+enhancements release is that it's stable, tested and deployed at scale. Doing any more work merging 6 months of work at Yahoo (again, I guarantee it's a lot of work) will need a lots of cycles to validate, test and stabilize.
>> 
>> I feel the alternative is a distraction for me, I'd rather work on 0.22.
>> 
>> I can get 20+security+enhancements done very, very, quickly precisely because I don't have to spend cycles testing it.
>> 
>> Does that make sense? Thanks for being patient and bearing with me...
>> 
>> Arun
>> 
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Nigel Daley <nd...@mac.com>.

I say just do it.  Eli said it wasn't a blocker. Sure it ain't perfect, but it's good enough.

Let's move on to 0.22 and beyond.

Nige

On Jan 13, 2011, at 8:23 PM, Arun C Murthy wrote:

> 
> On Jan 13, 2011, at 6:50 PM, Eli Collins wrote:
> 
>> The cdh3 patch set Todd is talking about is not vanilla 104.3, it's
>> 104.3 re-based onto 20.2 plus patches from branch-20 and trunk (the
>> performance and stability fixes I think you're referring to, at least
>> the ones that have been posted to Apache jira).
>> 
>> Can you post a pointer to the version you're referring to, eg on
>> github?  If there isn't a big delta between it and the cdh3 patch set
>> (which should have the 20-based patches from jira) perhaps you and
>> Todd could easily merge in the delta to create 0.20.x?
>> 
> 
> I can guarantee it will need work to merge the enhancements since 20.104.3, it's over 6 months of development. The enhancements includes work on stability such as iterative ls, limits on JT to prevent single jobs/users from taking it down etc. and lots of bug-fixes to security. So, unfortunately the delta is pretty large.
> 
> I'm working on a CHANGES.txt which should reflect all the changes i.e. bug-fixes and enhancements.
> 
>>> The version I'm offering to push to the community has fixed all of them,
>>> *plus* the added benefit of several stability and performance fixes we have
>>> done since 20.104.3, almost 10 internal releases. This is a battle tested
>>> and hardened version which we have deployed on 40,000+ nodes. It is a
>>> significant upgrade on 0.20.104.3 which we never deployed. I'm pretty sure
>>> *some* users will find that valuable. ;)
>> 
>> Definitely, but better to hit two birds with one stone right?  Instead
>> of a security + enhancements release and an append release we could
>> have a single security + append + enhancements release and users don't
>> have to choose.
>> 
> 
> 
> We are discussing two options:
> 20 + security + enhancements
> 20 + security + append
> 
> I think the value we provide via 20+security+enhancements release is that it's stable, tested and deployed at scale. Doing any more work merging 6 months of work at Yahoo (again, I guarantee it's a lot of work) will need a lots of cycles to validate, test and stabilize.
> 
> I feel the alternative is a distraction for me, I'd rather work on 0.22.
> 
> I can get 20+security+enhancements done very, very, quickly precisely because I don't have to spend cycles testing it.
> 
> Does that make sense? Thanks for being patient and bearing with me...
> 
> Arun
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Jan 13, 2011, at 6:50 PM, Eli Collins wrote:

> The cdh3 patch set Todd is talking about is not vanilla 104.3, it's
> 104.3 re-based onto 20.2 plus patches from branch-20 and trunk (the
> performance and stability fixes I think you're referring to, at least
> the ones that have been posted to Apache jira).
>
> Can you post a pointer to the version you're referring to, eg on
> github?  If there isn't a big delta between it and the cdh3 patch set
> (which should have the 20-based patches from jira) perhaps you and
> Todd could easily merge in the delta to create 0.20.x?
>

I can guarantee it will need work to merge the enhancements since
20.104.3; it's over 6 months of development. The enhancements include
work on stability such as iterative ls, limits on the JT to prevent single
jobs/users from taking it down etc. and lots of bug-fixes to security.
So, unfortunately, the delta is pretty large.

I'm working on a CHANGES.txt which should reflect all the changes i.e.  
bug-fixes and enhancements.

>> The version I'm offering to push to the community has fixed all of  
>> them,
>> *plus* the added benefit of several stability and performance fixes  
>> we have
>> done since 20.104.3, almost 10 internal releases. This is a battle  
>> tested
>> and hardened version which we have deployed on 40,000+ nodes. It is a
>> significant upgrade on 0.20.104.3 which we never deployed. I'm  
>> pretty sure
>> *some* users will find that valuable. ;)
>
> Definitely, but better to hit two birds with one stone right?  Instead
> of a security + enhancements release and an append release we could
> have a single security + append + enhancements release and users don't
> have to choose.
>

We are discussing two options:
20 + security + enhancements
20 + security + append

I think the value we provide via a 20+security+enhancements release is
that it's stable, tested and deployed at scale. Doing any more work
merging 6 months of work at Yahoo (again, I guarantee it's a lot of
work) will need a lot of cycles to validate, test and stabilize.

I feel the alternative is a distraction for me, I'd rather work on 0.22.

I can get 20+security+enhancements done very, very, quickly precisely  
because I don't have to spend cycles testing it.

Does that make sense? Thanks for being patient and bearing with me...

Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Jan 18, 2011, at 1:59 AM, Arun C Murthy wrote:
>
> IAC, I agree - we've spent too much time talking and too little doing
> actual work. Let me get the job done and folks can then weigh-in on
> the release at later point, folks might be willing to consider this
> more positively once they see the  branch, the change-log etc.
>
> Of course we need to get the small number of remaining patches into
> trunk asap for 0.22 and beyond.

FYI - I've merged changes to Common's branch-0.20-security (http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security).

Some statistics:
# Total of 441 jiras are covered across Common (164), HDFS (90) and  
Map-Reduce (187).
# 413 jiras out of the 441 are already committed to trunk.
# 28 open jiras: Common (6), HDFS (1), Map-Reduce (21).
# Of the 28 open jiras 7 are Patch Available.
# I've ensured all commits have a jira - I had to open 3 jiras (one
in each sub-project), they are included in the above stats.
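
The figures above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the list above, the variable names are made up for the example, and the two derived figures at the end are not stated in the message.

```python
# Counts as reported in the statistics above.
covered = {"Common": 164, "HDFS": 90, "Map-Reduce": 187}
open_jiras = {"Common": 6, "HDFS": 1, "Map-Reduce": 21}
committed_to_trunk = 413
patch_available = 7

total = sum(covered.values())
total_open = sum(open_jiras.values())

# The per-project figures should add up to the stated totals.
assert total == 441
assert total_open == 28
assert committed_to_trunk + total_open == total

# Derived figures (not stated in the message): the share of jiras already
# in trunk, and the open jiras that are not yet in Patch Available state.
trunk_share = committed_to_trunk / total
open_without_patch = total_open - patch_available

print(f"{trunk_share:.1%} of jiras already committed to trunk")
print(f"{open_without_patch} open jiras not yet Patch Available")
```

The assertions pass, so the per-project breakdowns are consistent with the totals of 441 and 28.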

Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

Thanks for the clarifications Roy.

I considered both b) and c).

As I mentioned, the reason I think b) wasn't useful in this context is
that we have, in several cases, 5-6 patches per jira (bug-fix on
top of a bug-fix), and several jiras didn't make sense for trunk since
the bug didn't exist in trunk, etc. Also, I was considering a
scenario where I would squash relevant patches together to produce a
minimal set of coherent patches. Then there is work to remove Yahoo!
specific commits.
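
As a rough illustration of the squashing idea, commits can be grouped by the jira key in their message so that each group becomes one candidate squashed patch. This is a minimal sketch, not the process actually used: the function name, the commit messages and the issue numbers are placeholders invented for the example, and it assumes every relevant commit message mentions its jira key.

```python
import re

# Project keys used by the Hadoop sub-projects in Apache jira.
JIRA_KEY = re.compile(r"\b(HADOOP|HDFS|MAPREDUCE)-\d+\b")


def group_by_jira(commit_messages):
    """Group commit messages by the first jira key each one mentions.

    Commits without a recognizable key are collected under None so they
    can be reviewed by hand (for example, internal-only changes).
    Insertion order is preserved, so groups appear in the order in which
    their first commit was made.
    """
    groups = {}
    for message in commit_messages:
        match = JIRA_KEY.search(message)
        key = match.group(0) if match else None
        groups.setdefault(key, []).append(message)
    return groups


# Hypothetical commit log, oldest first; the issue numbers are placeholders.
log = [
    "HDFS-1001. Initial fix",
    "MAPREDUCE-2002. Add limit on tasks per job",
    "HDFS-1001. Fix for the fix",
    "Internal build tweak",
    "HDFS-1001. Another follow-up",
]

groups = group_by_jira(log)
for key, messages in groups.items():
    print(key, len(messages))
```

Groups with more than one entry are the "bug-fix on top of a bug-fix" cases, and the None group is where internal-only commits would surface for removal.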

IAC, I agree - we've spent too much time talking and too little doing
actual work. Let me get the job done and folks can then weigh in on
the release at a later point; folks might be willing to consider this
more positively once they see the branch, the change-log etc.

Of course we need to get the small number of remaining patches into  
trunk asap for 0.22 and beyond.

Arun

On Jan 18, 2011, at 12:20 AM, Roy T. Fielding wrote:

> I thought that this discussion would have reached some sensible
> understanding of how Apache projects work by now, but it seems not.
>
> On Jan 13, 2011, at 6:12 PM, Arun C Murthy wrote:
>> The version I'm offering to push to the community has fixed all of  
>> them, *plus* the added benefit of several stability and performance  
>> fixes we have done since 20.104.3, almost 10 internal releases.  
>> This is a battle tested and hardened version which we have deployed  
>> on 40,000+ nodes. It is a significant upgrade on 0.20.104.3 which  
>> we never deployed. I'm pretty sure *some* users will find that  
>> valuable. ;)
>>
>> Also, I've offered to push individual patches as a background  
>> activity on a branch - that should suffice, no? Or, do you consider  
>> this a blocker?
>>
>> Again, my goal in this exercise is to get a stable, improved  
>> version of Hadoop into the hands of our users asap, and focus on  
>> 0.22 and beyond.
>
> So, you have a bunch of changes that you want to contribute.
> Please do so.  There are several ways:
>
> a) break the changes down into a sequence of patches, create jira
>    issues for each one (or append to the existing issue), and then
>    provide the group with a list of the issue links so that people
>    can quickly +1 each one.  When it seems worthwhile to you, create
>    a branch off of some prior Apache release point in svn and commit
>    each patch to it until the branch is identical to (or, in your own
>    opinion, better than) the source code that you have tested locally.
>    Then RM a tarball and start a release vote.  Since all of this is
>    being done in jira and svn, others can help you do all but the
>    first part (breaking down the big patch).
>
> or
>
> b) create a branch off of some prior Apache release point in svn
>    and replay the internal Y! commits on that branch until the branch
>    source code is identical to what you have tested locally.  Then
>    RM a tarball based on that branch and start a release vote.
>    Since the history is now in svn, others could do the RM bit if
>    you don't have time.
>
> or
>
> c) create a branch off of some prior Apache release point in svn
>    and apply one big ugly patch to it.  Then RM a tarball based
>    on that branch and ask for a release vote.
>
> You will note that none of the above requires a discussion on this
> list prior to the release vote, though (a) would likely result in
> more +1s than (b), and (b) would likely receive more +1s than (c).
> Regardless, the release vote is a lazy majority decision.
>
> I do not believe that there is any rational reason to apply a
> single big patch.  "It takes too long" is nonsense -- you have
> already spent far more time discussing it than would be required
> to do it.  "It is too hard" is also nonsense -- use your version
> control system to extract the set of changes and just replay them
> (with appropriate changelog editing).  "It has already been tested
> at Y!" is simply irrelevant -- the source code has been tested, not
> the order in which the patches have been applied, so all you should
> care about is that the final branch code is comparable to the tested
> source code (i.e., use diff).
>
> Nevertheless, all contributions at Apache are voluntary.  Do what
> you have time for, now, with the understanding that others may or
> may not complete the task, and may or may not vote for the release.
>
> You can make a branch, apply the big patch, and stand by
> while the rest of the group chooses whether to just accept it
> as a big change.  Someone else might create a parallel branch to
> apply the specific changes in an orderly matter, or perhaps you'll
> discover an easy way to do that a few days from now.  Or it
> might just sit there and never be released.
>
> There is no need for the group to agree to a plan up front, just
> as there is no need for the group to approve a release just because
> someone did the work of RMing a tarball.  Sure, it might save
> a lot of time if potential disagreements can be resolved before
> work is done, but the fact is that people tend to disagree less
> with actual work products than with abstract plans.  After all,
> everyone has a plan.  It is also far easier to convince people
> to fix their own problems if the problem is right in front of them.
>
> When the release vote happens, encourage folks to test and +1
> the release.  If it passes, woohoo!  If not, then listen to the
> reasons given by the other PMC members and see if you can make
> enough changes to the release to get those extra +1s.
>
> In other words, collaborate.
>
> ....Roy

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by "Roy T. Fielding" <fi...@gbiv.com>.

I thought that this discussion would have reached some sensible
understanding of how Apache projects work by now, but it seems not.

On Jan 13, 2011, at 6:12 PM, Arun C Murthy wrote:
> The version I'm offering to push to the community has fixed all of them, *plus* the added benefit of several stability and performance fixes we have done since 20.104.3, almost 10 internal releases. This is a battle tested and hardened version which we have deployed on 40,000+ nodes. It is a significant upgrade on 0.20.104.3 which we never deployed. I'm pretty sure *some* users will find that valuable. ;)
>
> Also, I've offered to push individual patches as a background activity on a branch - that should suffice, no? Or, do you consider this a blocker?
>
> Again, my goal in this exercise is to get a stable, improved version of Hadoop into the hands of our users asap, and focus on 0.22 and beyond.

So, you have a bunch of changes that you want to contribute.
Please do so. There are several ways:

a) break the changes down into a sequence of patches, create jira
issues for each one (or append to the existing issue), and then
provide the group with a list of the issue links so that people
can quickly +1 each one. When it seems worthwhile to you, create
a branch off of some prior Apache release point in svn and commit
each patch to it until the branch is identical to (or, in your own
opinion, better than) the source code that you have tested locally.
Then RM a tarball and start a release vote. Since all of this is
being done in jira and svn, others can help you do all but the
first part (breaking down the big patch).

b) create a branch off of some prior Apache release point in svn
and replay the internal Y! commits on that branch until the branch
source code is identical to what you have tested locally. Then
RM a tarball based on that branch and start a release vote.
Since the history is now in svn, others could do the RM bit if
you don't have time.

c) create a branch off of some prior Apache release point in svn
and apply one big ugly patch to it. Then RM a tarball based
on that branch and ask for a release vote.

You will note that none of the above requires a discussion on this
list prior to the release vote, though (a) would likely result in
more +1s than (b), and (b) would likely receive more +1s than (c).
Regardless, the release vote is a lazy majority decision.

I do not believe that there is any rational reason to apply a
single big patch. "It takes too long" is nonsense -- you have
already spent far more time discussing it than would be required
to do it. "It is too hard" is also nonsense -- use your version
control system to extract the set of changes and just replay them
(with appropriate changelog editing). "It has already been tested
at Y!" is simply irrelevant -- the source code has been tested, not
the order in which the patches have been applied, so all you should
care about is that the final branch code is comparable to the tested
source code (i.e., use diff).
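
The "use diff" check can be sketched in Python with the standard filecmp module: compare the tested source tree against the rebuilt branch and list every path that differs. This is a minimal sketch under stated assumptions; the function name, directory names and sample files are hypothetical, and a plain recursive diff on the command line would serve the same purpose.

```python
import filecmp
import os
import tempfile


def tree_differences(left, right, ignore=(".svn", ".git")):
    """Return a sorted list of (kind, relative_path) differences.

    Files are compared by content (shallow=False), not just by size and
    modification time. Version-control metadata directories are skipped.
    """
    differences = []

    def walk(cmp, prefix):
        for name in cmp.left_only:
            differences.append(("only in left", os.path.join(prefix, name)))
        for name in cmp.right_only:
            differences.append(("only in right", os.path.join(prefix, name)))
        _, mismatch, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False
        )
        for name in mismatch + errors:
            differences.append(("differs", os.path.join(prefix, name)))
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(left, right, ignore=list(ignore)), "")
    return sorted(differences)


def write(path, text):
    with open(path, "w") as handle:
        handle.write(text)


# Small self-contained demonstration with two throwaway trees.
with tempfile.TemporaryDirectory() as root:
    tested = os.path.join(root, "tested")   # stands in for the tested source
    branch = os.path.join(root, "branch")   # stands in for the rebuilt branch
    for tree in (tested, branch):
        os.makedirs(os.path.join(tree, "src"))
        write(os.path.join(tree, "src", "A.java"), "class A {}\n")
    write(os.path.join(tested, "src", "B.java"), "class B { int x; }\n")
    write(os.path.join(branch, "src", "B.java"), "class B { int y; }\n")
    write(os.path.join(branch, "CHANGES.txt"), "placeholder change-log\n")
    result = tree_differences(tested, branch)

for kind, path in result:
    print(kind, path)
```

An empty result would mean the branch matches the tested source file for file, regardless of the order in which the patches were applied.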

Nevertheless, all contributions at Apache are voluntary. Do what
you have time for, now, with the understanding that others may or
may not complete the task, and may or may not vote for the release.

You can make a branch, apply the big patch, and stand by
while the rest of the group chooses whether to just accept it
as a big change. Someone else might create a parallel branch to
apply the specific changes in an orderly manner, or perhaps you'll
discover an easy way to do that a few days from now. Or it
might just sit there and never be released.

There is no need for the group to agree to a plan up front, just
as there is no need for the group to approve a release just because
someone did the work of RMing a tarball. Sure, it might save
a lot of time if potential disagreements can be resolved before
work is done, but the fact is that people tend to disagree less
with actual work products than with abstract plans. After all,
everyone has a plan. It is also far easier to convince people
to fix their own problems if the problem is right in front of them.

When the release vote happens, encourage folks to test and +1
the release. If it passes, woohoo! If not, then listen to the
reasons given by the other PMC members and see if you can make
enough changes to the release to get those extra +1s.

In other words, collaborate.

....Roy

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eli Collins <el...@cloudera.com>.

On Thu, Jan 13, 2011 at 6:12 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
>
> On Jan 13, 2011, at 5:35 PM, Eli Collins wrote:
>>
>> Given that Todd has already done the work to rebase the 0.20.104.3
>> patch set on 0.20.2, and in a way that doesn't require one big change,
>> and his patch set includes branch20-append which the HBase guys want
>> an Apache release of wouldn't it make sense to go this route?  What do
>> others think? Seems better to have one 0.20.100 release than multiple
>> ones for security and append.
>
>
> My concern around 0.20.104.3 is that it has serious security holes including
> a root exploit that we have since fixed. I'm sure you guys are aware of
> them, Todd has helped to fix some.
>

The cdh3 patch set Todd is talking about is not vanilla 104.3, it's
104.3 re-based onto 20.2 plus patches from branch-20 and trunk (the
performance and stability fixes I think you're referring to, at least
the ones that have been posted to Apache jira).

Can you post a pointer to the version you're referring to, eg on
github?  If there isn't a big delta between it and the cdh3 patch set
(which should have the 20-based patches from jira) perhaps you and
Todd could easily merge in the delta to create 0.20.x?

> The version I'm offering to push to the community has fixed all of them,
> *plus* the added benefit of several stability and performance fixes we have
> done since 20.104.3, almost 10 internal releases. This is a battle tested
> and hardened version which we have deployed on 40,000+ nodes. It is a
> significant upgrade on 0.20.104.3 which we never deployed. I'm pretty sure
> *some* users will find that valuable. ;)

Definitely, but better to kill two birds with one stone, right?  Instead
of a security + enhancements release and an append release we could
have a single security + append + enhancements release and users don't
have to choose.

> Also, I've offered to push individual patches as a background activity on a
> branch - that should suffice, no? Or, do you consider this a blocker?

Definitely not a blocker.

> Again, my goal in this exercise is to get a stable, improved version of
> Hadoop into the hands of our users asap, and focus on 0.22 and beyond.

Agree, that's everyone's goal.  My point is that a release that's
already been re-based on 20.2, doesn't require a separate HBase
release, and doesn't require you spend time on a background task to
break up the big change into smaller ones seems like a faster way
forward.

Thanks,
Eli

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Jan 13, 2011, at 5:35 PM, Eli Collins wrote:
> Given that Todd has already done the work to rebase the 0.20.104.3
> patch set on 0.20.2, and in a way that doesn't require one big change,
> and his patch set includes branch20-append which the HBase guys want
> an Apache release of wouldn't it make sense to go this route?  What do
> others think? Seems better to have one 0.20.100 release than multiple
> ones for security and append.

My concern around 0.20.104.3 is that it has serious security holes  
including a root exploit that we have since fixed. I'm sure you guys  
are aware of them, Todd has helped to fix some.

The version I'm offering to push to the community has fixed all of  
them, *plus* the added benefit of several stability and performance  
fixes we have done since 20.104.3, almost 10 internal releases. This  
is a battle tested and hardened version which we have deployed on  
40,000+ nodes. It is a significant upgrade on 0.20.104.3 which we  
never deployed. I'm pretty sure *some* users will find that valuable. ;)

Also, I've offered to push individual patches as a background activity  
on a branch - that should suffice, no? Or, do you consider this a  
blocker?

Again, my goal in this exercise is to get a stable, improved version  
of Hadoop into the hands of our users asap, and focus on 0.22 and  
beyond.

thanks,
Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eli Collins <el...@cloudera.com>.

On Thursday, January 13, 2011, Arun C Murthy <ac...@yahoo-inc.com> wrote:
>
> On Jan 13, 2011, at 3:34 PM, Todd Lipcon wrote:
>
>
> On Thu, Jan 13, 2011 at 3:05 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
>
>
> Since this could be applied as a linear set of patches instead of a big
>
> lump, would there be interest in using this as the 0.20.>100 Apache
> release?
> I can take the time to remove any patches that are cloudera specific or
> not
> yet applied upstream.
>
>
>
> Interesting discussion, thanks.
>
> I'm sure it took you a fair amount of work to squash patches (which I tried
> too, btw).
>
>
>
> Yep, I had a great summer ;-)
>
>
>
> That, plus the fact that we would need to do a similar amount of work for
> the 10 or so releases we have done after 0.20.100.3 scares me.
>
>
>
> Sorry, I actually meant 0.20.104.3. Have there been many releases since
> then? That's the last version available on the Yahoo github, and that's the
> version we incorporated/linearized.
>
>
> Yep. I had a great summer! ;-)
>
>
>
> As we Nigel and I discussed here, the jumbo  patch and an up-to-date
> CHANGES.txt provides almost all of the benefits we seek and allows all of us
> to get this done very quickly to focus on hadoop-0.22 and beyond.
>
>
>
> In my opinion here are the downsides to this plan:
>
>
>
> I agree there are downsides, I think I did point them out at the outset! :)
>
>
> - a mondo "merge" patch is a big pain when trying to do debugging. It may be
> sufficient for a user to look at CHANGES.txt, but I find myself using
> blame/log/etc on individual files to understand code lineage on a daily
> basis. If all of the merge shows up as a big patch it will be very difficult
> (at least the way I work with code) to help users debug issues or understand
> which JIRA a certain regression may have come from.
>
>
>
> Right, no question. Which is why I offered to do this as a background activity right after... this ensures that the source of truth is *always* a branch in Apache subversion.
>
> I feel that we could get a usable release out of door quickly for our users. Also, please remember that almost every patch we have committed is available on relevant jiras. I understand the devs have a problem and I feel we can bear with it for a little while. Again, I agree this isn't an ideal solution, I'm just trying to expedite the release for the users.
>
>
>
> To clarify my position a bit here - I definitely appreciate your
> volunteering to do the work, and wouldn't *block* the proposal as you've put
> it forth. I just think it will have limited utility for the community by
> being opaque (if contributed as a giant patch) and by not including the sync
> feature which is critical for a large segment of users. Given those
> downsides I'd rather see the effort diverted towards making a killer 0.22
> release that we can all jump on.
>
>
>
> Thanks for understanding.
>
> Again, I completely agree this isn't an ideal situation, but I do hope it has a bit more than *limited utility* for our end-users. Who knows, I maybe hopelessly deluded! *smile*
>
> Also, I'm trying to do exactly what you suggested - spend very little time on this so that everyone, including me, can focus on the future.
>
> thanks,
> Arun
>

Given that Todd has already done the work to rebase the 0.20.104.3
patch set on 0.20.2, and in a way that doesn't require one big change,
and his patch set includes branch20-append, which the HBase guys want
an Apache release of, wouldn't it make sense to go this route?  What do
others think? Seems better to have one 0.20.100 release than multiple
ones for security and append.

Thanks,
Eli

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Jan 13, 2011, at 3:34 PM, Todd Lipcon wrote:

> On Thu, Jan 13, 2011 at 3:05 PM, Arun C Murthy <ac...@yahoo-inc.com>  
> wrote:
>
>> Since this could be applied as a linear set of patches instead of a  
>> big
>>> lump, would there be interest in using this as the 0.20.>100 Apache
>>> release?
>>> I can take the time to remove any patches that are cloudera  
>>> specific or
>>> not
>>> yet applied upstream.
>>>
>>>
>> Interesting discussion, thanks.
>>
>> I'm sure it took you a fair amount of work to squash patches (which  
>> I tried
>> too, btw).
>
>
> Yep, I had a great summer ;-)
>
>
>> That, plus the fact that we would need to do a similar amount of  
>> work for
>> the 10 or so releases we have done after 0.20.100.3 scares me.
>>
>
> Sorry, I actually meant 0.20.104.3. Have there been many releases  
> since
> then? That's the last version available on the Yahoo github, and  
> that's the
> version we incorporated/linearized.

Yep. I had a great summer! ;-)

>>
>> As we Nigel and I discussed here, the jumbo  patch and an up-to-date
>> CHANGES.txt provides almost all of the benefits we seek and allows  
>> all of us
>> to get this done very quickly to focus on hadoop-0.22 and beyond.
>>
>>
> In my opinion here are the downsides to this plan:
>

I agree there are downsides, I think I did point them out at the  
outset! :)

> - a mondo "merge" patch is a big pain when trying to do debugging.  
> It may be
> sufficient for a user to look at CHANGES.txt, but I find myself using
> blame/log/etc on individual files to understand code lineage on a  
> daily
> basis. If all of the merge shows up as a big patch it will be very  
> difficult
> (at least the way I work with code) to help users debug issues or  
> understand
> which JIRA a certain regression may have come from.
>

Right, no question. Which is why I offered to do this as a background  
activity right after... this ensures that the source of truth is  
*always* a branch in Apache subversion.

I feel that we could get a usable release out the door quickly for our
users. Also, please remember that almost every patch we have committed
is available on the relevant jiras. I understand the devs have a problem
and I feel we can bear with it for a little while. Again, I agree this
isn't an ideal solution; I'm just trying to expedite the release for
the users.

>
> To clarify my position a bit here - I definitely appreciate your
> volunteering to do the work, and wouldn't *block* the proposal as  
> you've put
> it forth. I just think it will have limited utility for the  
> community by
> being opaque (if contributed as a giant patch) and by not including  
> the sync
> feature which is critical for a large segment of users. Given those
> downsides I'd rather see the effort diverted towards making a killer  
> 0.22
> release that we can all jump on.
>

Thanks for understanding.

Again, I completely agree this isn't an ideal situation, but I do hope
it has a bit more than *limited utility* for our end-users. Who knows,
I may be hopelessly deluded! *smile*

Also, I'm trying to do exactly what you suggested - spend very little  
time on this so that everyone, including me, can focus on the future.

thanks,
Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Todd Lipcon <to...@cloudera.com>.

On Thu, Jan 13, 2011 at 3:05 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:

> Since this could be applied as a linear set of patches instead of a big
>> lump, would there be interest in using this as the 0.20.>100 Apache
>> release?
>> I can take the time to remove any patches that are cloudera specific or
>> not
>> yet applied upstream.
>>
>>
> Interesting discussion, thanks.
>
> I'm sure it took you a fair amount of work to squash patches (which I tried
> too, btw).

Yep, I had a great summer ;-)

> That, plus the fact that we would need to do a similar amount of work for
> the 10 or so releases we have done after 0.20.100.3 scares me.
>

Sorry, I actually meant 0.20.104.3. Have there been many releases since
then? That's the last version available on the Yahoo github, and that's the
version we incorporated/linearized.

If there is a large sequence of patches after this that you're planning on
including, it would be good to see them in your git repo.

> As we Nigel and I discussed here, the jumbo  patch and an up-to-date
> CHANGES.txt provides almost all of the benefits we seek and allows all of us
> to get this done very quickly to focus on hadoop-0.22 and beyond.
>
>
In my opinion here are the downsides to this plan:

- a mondo "merge" patch is a big pain when trying to do debugging. It may be
sufficient for a user to look at CHANGES.txt, but I find myself using
blame/log/etc on individual files to understand code lineage on a daily
basis. If all of the merge shows up as a big patch it will be very difficult
(at least the way I work with code) to help users debug issues or understand
which JIRA a certain regression may have come from.

- CHANGES.txt traditionally doesn't reference which patch file from a JIRA
was checked in. So we may know that a given JIRA has been included, but
often there are several revisions of patches on the JIRA and it's difficult
to be sure that we have the most up-to-date version. By looking at change
history it's usually easy to pick this out, but if it's one giant patch
apply, this isn't possible.

- the proposal to use the YDH distro certainly solves the Security issue,
but doesn't help out HBase at all. Given HBase has been asking for a long
time to get a real release of the append branch, I think it would be better
to have one 20-based release which has both of these features, rather than
further fragmenting the community into 0.20.2, 0.20.2+security,
0.20.2+append.

I think the first two points could be addressed if you push your git tree
either to github or an apache-hosted git, and then include in SVN as a mondo
patch. It's not ideal, but at least when trying to debug issues and
understand the history of this branch there will be a publicly available
change history to reference.

To clarify my position a bit here - I definitely appreciate your
volunteering to do the work, and wouldn't *block* the proposal as you've put
it forth. I just think it will have limited utility for the community by
being opaque (if contributed as a giant patch) and by not including the sync
feature which is critical for a large segment of users. Given those
downsides I'd rather see the effort diverted towards making a killer 0.22
release that we can all jump on.

Thanks
-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

Todd,

On Jan 13, 2011, at 2:04 PM, Todd Lipcon wrote:

> Hi Arun, all,
>
> When we merged YDH and CDH for CDH3b3, we went through the effort of
> "linearizing" all of the YDH patches and squashing multiple commits into
> single ones corresponding to a single JIRA where possible. So, we have a
> 100% linear set of patches that applies on top of the 0.20.2 source tree and
> includes Yahoo 0.20.100.3 as well as almost all the patches from 0.20-append
> and a number of other backports.
>
> Since this could be applied as a linear set of patches instead of a big
> lump, would there be interest in using this as the 0.20.>100 Apache release?
> I can take the time to remove any patches that are cloudera specific or not
> yet applied upstream.
>

Interesting discussion, thanks.

I'm sure it took you a fair amount of work to squash patches (which I  
tried too, btw). That, plus the fact that we would need to do a  
similar amount of work for the 10 or so releases we have done after  
0.20.100.3 scares me.

As Nigel and I discussed here, the jumbo patch and an up-to-date
CHANGES.txt provide almost all of the benefits we seek and allow all
of us to get this done very quickly to focus on hadoop-0.22 and beyond.

What do you think?

OTOH, I could get this release done and start squashing patches for  
the sake of completeness as a background activity.

Thoughts?

thanks,
Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Todd Lipcon <to...@cloudera.com>.

Hi Arun, all,

When we merged YDH and CDH for CDH3b3, we went through the effort of
"linearizing" all of the YDH patches and squashing multiple commits into
single ones corresponding to a single JIRA where possible. So, we have a
100% linear set of patches that applies on top of the 0.20.2 source tree and
includes Yahoo 0.20.100.3 as well as almost all the patches from 0.20-append
and a number of other backports.

Since this could be applied as a linear set of patches instead of a big
lump, would there be interest in using this as the 0.20.>100 Apache release?
I can take the time to remove any patches that are cloudera specific or not
yet applied upstream.

Thanks
-Todd

On Wed, Jan 12, 2011 at 11:07 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:

>
> On Jan 12, 2011, at 2:56 PM, Nigel Daley wrote:
>
>> +1 for 0.20.x, where x >= 100.  I agree that the 1.0 moniker would involve
>> more discussion.
>>
>
> Ok, seems like we are converging; we can continue talking. I've created the
> branch to get the ball rolling.
>
>
>> Will this be a jumbo patch attached to a Jira and then committed to the
>> branch?  Just curious.
>>
>
> I'm afraid that the svn log of the branch from github Y! branch is fairly
> useless since a single JIRA might have multiple commits in the Y! branch
> (bugfix on top of a bugfix). We have done that in several cases (but the
> patches committed to trunk have a single patch which is the result of
> forward porting a complete feature/bugfix). IAC this branch and 0.22
> have diverged so much that almost no non-trivial patch would apply without a
> significant amount of work.
>
> Thus, I think a jumbo patch should suffice. It will also ensure this can
> be done quickly so that the community can then concentrate on 0.22 and beyond.
>
> However, I will (manually) ensure all relevant jiras are referenced in the
> CHANGES.txt and Release Notes for folks to see the contents of the release.
> This is the hardest part of the exercise. Also, this ensures that we can
> track these jiras for 0.22 as Eli suggested.
>
> Does that seem like a reasonable way forward? I'm happy to brainstorm.
>
> thanks,
> Arun
>
>

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Jan 12, 2011, at 2:56 PM, Nigel Daley wrote:

> +1 for 0.20.x, where x >= 100.  I agree that the 1.0 moniker would  
> involve more discussion.

Ok, seems like we are converging; we can continue talking. I've  
created the branch to get the ball rolling.

> Will this be a jumbo patch attached to a Jira and then committed to  
> the branch?  Just curious.

I'm afraid that the svn log of the branch from github Y! branch is  
fairly useless since a single JIRA might have multiple commits in the  
Y! branch (bugfix on top of a bugfix). We have done that in several  
cases (but the patches committed to trunk have a single patch which is  
the result of forward porting a complete feature/bugfix). IAC this
branch and 0.22 have diverged so much that almost no non-trivial patch  
would apply without a significant amount of work.

Thus, I think a jumbo patch should suffice. It will also ensure this
can be done quickly so that the community can then concentrate on 0.22
and beyond.

However, I will (manually) ensure all relevant jiras are referenced in  
the CHANGES.txt and Release Notes for folks to see the contents of the  
release. This is the hardest part of the exercise. Also, this ensures  
that we can track these jiras for 0.22 as Eli suggested.
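
A minimal sketch of how that cross-check could be automated, assuming the
branch's commit messages and CHANGES.txt both mention JIRA keys (such as
HDFS-599) in plain text. The function names and the sample strings below are
hypothetical and are not part of any Hadoop tooling.

```python
import re

# Assumed, illustrative set of project prefixes; extend as needed.
JIRA_KEY = re.compile(r"\b(?:HADOOP|HDFS|MAPREDUCE)-\d+\b")


def jira_keys(text):
    """Return the set of JIRA keys mentioned anywhere in the given text."""
    return set(JIRA_KEY.findall(text))


def missing_from_changes(commit_log, changes_txt):
    """Return the sorted keys found in the commit log but not in CHANGES.txt."""
    return sorted(jira_keys(commit_log) - jira_keys(changes_txt))


if __name__ == "__main__":
    # Made-up sample input; real input would be the branch's commit log
    # and the contents of the release's CHANGES.txt.
    sample_log = "HDFS-599 first fix\nHADOOP-1 some change\nHDFS-599 follow-up fix"
    sample_changes = "HADOOP-1. Some change."
    print(missing_from_changes(sample_log, sample_changes))
```

Because keys are collected into a set, a JIRA with several commits (bugfix on
top of a bugfix) is counted once, which matches the one-entry-per-JIRA
convention of CHANGES.txt.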

Does that seem like a reasonable way forward? I'm happy to brainstorm.

thanks,
Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Nigel Daley <nd...@mac.com>.

> Nigel - could we make all the patches in this branch that have not
> been committed up stream (that need to be) blockers for 22?   This way
> 22 is not a regression against 0.20.x.

I sure hope so.  May be difficult to untangle if it's a jumbo patch -- answer is in the details of how it's contributed. 

Cheers,
Nige


On Jan 12, 2011, at 3:02 PM, Eli Collins wrote:

> +1 on 0.20.x   (where x is a J > 3)
> 
> Nigel - could we make all the patches in this branch that have not
> been committed up stream (that need to be) blockers for 22?   This way
> 22 is not a regression against 0.20.x.
> 
> Thanks,
> Eli
> 
> On Wed, Jan 12, 2011 at 2:56 PM, Nigel Daley <nd...@mac.com> wrote:
>> +1 for 0.20.x, where x >= 100.  I agree that the 1.0 moniker would involve more discussion.
>> 
>> Will this be a jumbo patch attached to a Jira and then committed to the branch?  Just curious.
>> 
>> Cheers,
>> Nige
>> 
>> 
>> On Jan 12, 2011, at 1:34 PM, Arun C Murthy wrote:
>> 
>>> I'm willing to discuss any and all options, for a very short period.
>>> 
>>> Technically you have a reasonable point, Doug has suggested this in the past too. If everyone agrees, fine; if not, I do not want to get hung up on a release number. I just *do not* want a controversy.
>>> 
>>> As I mentioned, I'm looking to finish this up in a couple of weeks; so, I could do without a long discussion on the critical path.
>>> 
>>> I'm happy to go with a reasonable compromise, if not, hadoop-0.20.100 is what I'm priming for.
>>> 
>>> Heck, if Stack wants to call the append release (not sure how far ahead he is) as hadoop-0.20.100, I'm willing to call this hadoop-0.20.200.
>>> 
>>> All I care about is having a distinct release number from 0.20.2 (our last stable release). Again, I just want to get a release into the hands of our users. Please, let's resolve this quickly. Please.
>>> 
>>> Arun
>>> 
>>> On Jan 12, 2011, at 1:10 PM, Owen O'Malley wrote:
>>> 
>>>> 
>>>> On Jan 11, 2011, at 9:09 PM, Arun C Murthy wrote:
>>>> 
>>>>> I'm open to suggestions - how about something like 20.100 to show
>>>>> that it's a big jump? Anything else?
>>>> 
>>>> 
>>>> Although I'm not wild about any of the potential release names, this
>>>> patch set is neither a subset or superset of the 0.21 or 0.22
>>>> branches. Given that, I think that a new major release number makes
>>>> the most sense. It is also relatively likely that additional minor
>>>> releases will be made off of this branch while 0.22 is stabilizing.
>>>> We've talked about declaring 0.20 a 1.0 for a long time and this feels
>>>> like backing into the decision, but technically, I believe it to be
>>>> the right name for such a release.
>>>> 
>>>> Thoughts?
>>>> 
>>>> -- Owen
>>> 
>> 
>>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Ian Holsman <ha...@holsman.net>.

So what is the plan with 20.3 that Owen volunteered to RM? 
Should we do that, or just integrate the security code with that and call it 20.x?

---
Ian Holsman - 703 879-3128

I saw the angel in the marble and carved until I set him free -- Michelangelo

On 12/01/2011, at 6:02 PM, Eli Collins <el...@cloudera.com> wrote:

> +1 on 0.20.x   (where x is a J > 3)
> 
> Nigel - could we make all the patches in this branch that have not
> been committed up stream (that need to be) blockers for 22?   This way
> 22 is not a regression against 0.20.x.
> 
> Thanks,
> Eli
> 
> On Wed, Jan 12, 2011 at 2:56 PM, Nigel Daley <nd...@mac.com> wrote:
>> +1 for 0.20.x, where x >= 100.  I agree that the 1.0 moniker would involve more discussion.
>> 
>> Will this be a jumbo patch attached to a Jira and then committed to the branch?  Just curious.
>> 
>> Cheers,
>> Nige
>> 
>> 
>> On Jan 12, 2011, at 1:34 PM, Arun C Murthy wrote:
>> 
>>> I'm willing to discuss any and all options, for a very short period.
>>> 
>>> Technically you have a reasonable point, Doug has suggested this in the past too. If everyone agrees, fine; if not, I do not want to get hung up on a release number. I just *do not* want a controversy.
>>> 
>>> As I mentioned, I'm looking to finish this up in a couple of weeks; so, I could do without a long discussion on the critical path.
>>> 
>>> I'm happy to go with a reasonable compromise, if not, hadoop-0.20.100 is what I'm priming for.
>>> 
>>> Heck, if Stack wants to call the append release (not sure how far ahead he is) as hadoop-0.20.100, I'm willing to call this hadoop-0.20.200.
>>> 
>>> All I care about is having a distinct release number from 0.20.2 (our last stable release). Again, I just want to get a release into the hands of our users. Please, let's resolve this quickly. Please.
>>> 
>>> Arun
>>> 
>>> On Jan 12, 2011, at 1:10 PM, Owen O'Malley wrote:
>>> 
>>>> 
>>>> On Jan 11, 2011, at 9:09 PM, Arun C Murthy wrote:
>>>> 
>>>>> I'm open to suggestions - how about something like 20.100 to show
>>>>> that it's a big jump? Anything else?
>>>> 
>>>> 
>>>> Although I'm not wild about any of the potential release names, this
>>>> patch set is neither a subset or superset of the 0.21 or 0.22
>>>> branches. Given that, I think that a new major release number makes
>>>> the most sense. It is also relatively likely that additional minor
>>>> releases will be made off of this branch while 0.22 is stabilizing.
>>>> We've talked about declaring 0.20 a 1.0 for a long time and this feels
>>>> like backing into the decision, but technically, I believe it to be
>>>> the right name for such a release.
>>>> 
>>>> Thoughts?
>>>> 
>>>> -- Owen
>>> 
>> 
>>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Eli Collins <el...@cloudera.com>.

+1 on 0.20.x   (where x is a J > 3)

Nigel - could we make all the patches in this branch that have not
been committed up stream (that need to be) blockers for 22?   This way
22 is not a regression against 0.20.x.

Thanks,
Eli

On Wed, Jan 12, 2011 at 2:56 PM, Nigel Daley <nd...@mac.com> wrote:
> +1 for 0.20.x, where x >= 100.  I agree that the 1.0 moniker would involve more discussion.
>
> Will this be a jumbo patch attached to a Jira and then committed to the branch?  Just curious.
>
> Cheers,
> Nige
>
>
> On Jan 12, 2011, at 1:34 PM, Arun C Murthy wrote:
>
>> I'm willing to discuss any and all options, for a very short period.
>>
>> Technically you have a reasonable point, Doug has suggested this in the past too. If everyone agrees, fine; if not, I do not want to get hung up on a release number. I just *do not* want a controversy.
>>
>> As I mentioned, I'm looking to finish this up in a couple of weeks; so, I could do without a long discussion on the critical path.
>>
>> I'm happy to go with a reasonable compromise, if not, hadoop-0.20.100 is what I'm priming for.
>>
>> Heck, if Stack wants to call the append release (not sure how far ahead he is) as hadoop-0.20.100, I'm willing to call this hadoop-0.20.200.
>>
>> All I care about is having a distinct release number from 0.20.2 (our last stable release). Again, I just want to get a release into the hands of our users. Please, let's resolve this quickly. Please.
>>
>> Arun
>>
>> On Jan 12, 2011, at 1:10 PM, Owen O'Malley wrote:
>>
>>>
>>> On Jan 11, 2011, at 9:09 PM, Arun C Murthy wrote:
>>>
>>>> I'm open to suggestions - how about something like 20.100 to show
>>>> that it's a big jump? Anything else?
>>>
>>>
>>> Although I'm not wild about any of the potential release names, this
>>> patch set is neither a subset or superset of the 0.21 or 0.22
>>> branches. Given that, I think that a new major release number makes
>>> the most sense. It is also relatively likely that additional minor
>>> releases will be made off of this branch while 0.22 is stabilizing.
>>> We've talked about declaring 0.20 a 1.0 for a long time and this feels
>>> like backing into the decision, but technically, I believe it to be
>>> the right name for such a release.
>>>
>>> Thoughts?
>>>
>>> -- Owen
>>
>
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Nigel Daley <nd...@mac.com>.

+1 for 0.20.x, where x >= 100.  I agree that the 1.0 moniker would involve more discussion.

Will this be a jumbo patch attached to a Jira and then committed to the branch?  Just curious.

Cheers,
Nige


On Jan 12, 2011, at 1:34 PM, Arun C Murthy wrote:

> I'm willing to discuss any and all options, for a very short period.
> 
> Technically you have a reasonable point, Doug has suggested this in the past too. If everyone agrees, fine; if not, I do not want to get hung up on a release number. I just *do not* want a controversy.
> 
> As I mentioned, I'm looking to finish this up in a couple of weeks; so, I could do without a long discussion on the critical path.
> 
> I'm happy to go with a reasonable compromise, if not, hadoop-0.20.100 is what I'm priming for.
> 
> Heck, if Stack wants to call the append release (not sure how far ahead he is) as hadoop-0.20.100, I'm willing to call this hadoop-0.20.200.
> 
> All I care about is having a distinct release number from 0.20.2 (our last stable release). Again, I just want to get a release into the hands of our users. Please, let's resolve this quickly. Please.
> 
> Arun
> 
> On Jan 12, 2011, at 1:10 PM, Owen O'Malley wrote:
> 
>> 
>> On Jan 11, 2011, at 9:09 PM, Arun C Murthy wrote:
>> 
>>> I'm open to suggestions - how about something like 20.100 to show
>>> that it's a big jump? Anything else?
>> 
>> 
>> Although I'm not wild about any of the potential release names, this
>> patch set is neither a subset or superset of the 0.21 or 0.22
>> branches. Given that, I think that a new major release number makes
>> the most sense. It is also relatively likely that additional minor
>> releases will be made off of this branch while 0.22 is stabilizing.
>> We've talked about declaring 0.20 a 1.0 for a long time and this feels
>> like backing into the decision, but technically, I believe it to be
>> the right name for such a release.
>> 
>> Thoughts?
>> 
>> -- Owen
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

I'm willing to discuss any and all options, for a very short period.

Technically you have a reasonable point, Doug has suggested this in
the past too. If everyone agrees, fine; if not, I do not want to get
hung up on a release number. I just *do not* want a controversy.

As I mentioned, I'm looking to finish this up in a couple of weeks;
so, I could do without a long discussion on the critical path.

I'm happy to go with a reasonable compromise, if not, hadoop-0.20.100  
is what I'm priming for.

Heck, if Stack wants to call the append release (not sure how far  
ahead he is) as hadoop-0.20.100, I'm willing to call this  
hadoop-0.20.200.

All I care about is having a distinct release number from 0.20.2 (our  
last stable release). Again, I just want to get a release into the  
hands of our users. Please, let's resolve this quickly. Please.

Arun

On Jan 12, 2011, at 1:10 PM, Owen O'Malley wrote:

>
> On Jan 11, 2011, at 9:09 PM, Arun C Murthy wrote:
>
>> I'm open to suggestions - how about something like 20.100 to show
>> that it's a big jump? Anything else?
>
>
> Although I'm not wild about any of the potential release names, this
> patch set is neither a subset or superset of the 0.21 or 0.22
> branches. Given that, I think that a new major release number makes
> the most sense. It is also relatively likely that additional minor
> releases will be made off of this branch while 0.22 is stabilizing.
> We've talked about declaring 0.20 a 1.0 for a long time and this feels
> like backing into the decision, but technically, I believe it to be
> the right name for such a release.
>
> Thoughts?
>
> -- Owen

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Chris Douglas <cd...@apache.org>.

I had exactly the same reaction when this came up in the past:

http://s.apache.org/l9
http://s.apache.org/5Gv

but our experience with myriad 0.20 variants has demonstrated that
Hadoop can support both a stable branch and a development branch.
Trying to direct effort away from 0.20 by preventing it from happening
in Apache didn't work, and I was wrong to advocate for it. The
interest in a more slow-moving, stable version of Hadoop will exist
whether we give it an outlet in Apache or not, most of us work on both
anyway, so we might as well collaborate in both fora. -C

On Wed, Jan 12, 2011 at 1:26 PM, Ian Holsman <ha...@holsman.net> wrote:
> so if 0.20 becomes 1.0, what does 0.22 become ?
>
> I'm still not sure if we shouldn't just add security to 0.22, and leave the 0.20 in maintenance mode from here on.
>
> On Jan 12, 2011, at 4:10 PM, Owen O'Malley wrote:
>
>>
>> On Jan 11, 2011, at 9:09 PM, Arun C Murthy wrote:
>>
>>> I'm open to suggestions - how about something like 20.100 to show that it's a big jump? Anything else?
>>
>>
>> Although I'm not wild about any of the potential release names, this patch set is neither a subset or superset of the 0.21 or 0.22 branches. Given that, I think that a new major release number makes the most sense. It is also relatively likely that additional minor releases will be made off of this branch while 0.22 is stabilizing. We've talked about declaring 0.20 a 1.0 for a long time and this feels like backing into the decision, but technically, I believe it to be the right name for such a release.
>>
>> Thoughts?
>>
>> -- Owen
>
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Ian Holsman <ha...@holsman.net>.

so if 0.20 becomes 1.0, what does 0.22 become ?

I'm still not sure if we shouldn't just add security to 0.22, and leave the 0.20 in maintenance mode from here on.

On Jan 12, 2011, at 4:10 PM, Owen O'Malley wrote:

> 
> On Jan 11, 2011, at 9:09 PM, Arun C Murthy wrote:
> 
>> I'm open to suggestions - how about something like 20.100 to show that it's a big jump? Anything else?
> 
> 
> Although I'm not wild about any of the potential release names, this patch set is neither a subset or superset of the 0.21 or 0.22 branches. Given that, I think that a new major release number makes the most sense. It is also relatively likely that additional minor releases will be made off of this branch while 0.22 is stabilizing. We've talked about declaring 0.20 a 1.0 for a long time and this feels like backing into the decision, but technically, I believe it to be the right name for such a release.
> 
> Thoughts?
> 
> -- Owen

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Owen O'Malley <om...@apache.org>.

On Jan 11, 2011, at 9:09 PM, Arun C Murthy wrote:

> I'm open to suggestions - how about something like 20.100 to show  
> that it's a big jump? Anything else?

Although I'm not wild about any of the potential release names, this
patch set is neither a subset nor a superset of the 0.21 or 0.22
branches. Given that, I think that a new major release number makes
the most sense. It is also relatively likely that additional minor  
releases will be made off of this branch while 0.22 is stabilizing.  
We've talked about declaring 0.20 a 1.0 for a long time and this feels  
like backing into the decision, but technically, I believe it to be  
the right name for such a release.

Thoughts?

-- Owen

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Patrick Angeles <pa...@cloudera.com>.

You're gonna call your kid 20.100?

:)

Congratz

On Wed, Jan 12, 2011 at 12:09 AM, Arun C Murthy <ar...@yahoo-inc.com> wrote:

> On Jan 11, 2011, at 11:14 AM, "Stack" <st...@duboce.net> wrote:
>
> >> I'm back now and plan to start work on this. Hopefully I can get this over
> >> with quickly, in a couple of weeks, to focus on the next release(s).
> >>
> >
> > What you thinking?  What'll you call it?
> >
> > Good on you,
> > St.Ack
>
> Thanks Stack.
>
> I'm open to suggestions - how about something like 20.100 to show that it's
> a big jump? Anything else?
>
> Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Mahadev Konar <ma...@yahoo-inc.com>.

+1. I like the idea of 20.100.

Thanks
mahadev


On 1/11/11 9:09 PM, "Arun C Murthy" <ar...@yahoo-inc.com> wrote:

> On Jan 11, 2011, at 11:14 AM, "Stack" <st...@duboce.net> wrote:
> 
>>> I'm back now and plan to start work on this. Hopefully I can get this over
>>> with quickly, in a couple of weeks, to focus on the next release(s).
>>> 
>> 
>> What you thinking?  What'll you call it?
>> 
>> Good on you,
>> St.Ack
> 
> Thanks Stack.
> 
> I'm open to suggestions - how about something like 20.100 to show that it's a
> big jump? Anything else?
> 
> Arun 
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ar...@yahoo-inc.com>.

On Jan 11, 2011, at 11:14 AM, "Stack" <st...@duboce.net> wrote:

>> I'm back now and plan to start work on this. Hopefully I can get this over
>> with quickly, in a couple of weeks, to focus on the next release(s).
>> 
> 
> What you thinking?  What'll you call it?
> 
> Good on you,
> St.Ack

Thanks Stack. 

I'm open to suggestions - how about something like 20.100 to show that it's a big jump? Anything else?

Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Stack <st...@duboce.net>.

On Mon, Jan 10, 2011 at 11:11 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
> Things stalled, my apologies. Turns out having a kid is a lot of work, who
> knew! *smile*
>

Really (smile -- congrats Arun).

> I'm back now and plan to start work on this. Hopefully I can get this over
> with quickly, in a couple of weeks, to focus on the next release(s).
>

What you thinking?  What'll you call it?

Good on you,
St.Ack

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On 10/15/2010 02:28 PM, Doug Cutting wrote:

> On 08/30/2010 02:14 PM, Arun C Murthy wrote:
>> Since most people seemed to think of it as a reasonable idea, I'm going
>> to create the hadoop-0.20-security branch and start the necessary work.
>
> I don't yet see this branch.  Are you still intending to do this?
>
> Doug

Things stalled, my apologies. Turns out having a kid is a lot of work,  
who knew! *smile*

I'm back now and plan to start work on this. Hopefully I can get this  
over with quickly, in a couple of weeks, to focus on the next  
release(s).

thanks,
Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Doug Cutting <cu...@apache.org>.

On 08/30/2010 02:14 PM, Arun C Murthy wrote:
> Since most people seemed to think of it as a reasonable idea, I'm going
> to create the hadoop-0.20-security branch and start the necessary work.

I don't yet see this branch.  Are you still intending to do this?

Doug

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@apache.org>.

On Aug 23, 2010, at 5:27 PM, Arun C Murthy wrote:
> In the interim I'd like to propose we push a hadoop-0.20-security
> release off the Yahoo! patchset (http://github.com/yahoo/hadoop-
> common). This will ensure the community benefits from all the work
> done at Yahoo! for over 12 months *now*, and ensures that we do not
> have to wait until hadoop-0.22 which has all of these patches.

Since most people seemed to think of it as a reasonable idea, I'm  
going to create the hadoop-0.20-security branch and start the  
necessary work.

thanks,
Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Tom White <to...@cloudera.com>.

Hi Arun,

I think it would be good to have a shared 0.20 Apache security branch.
Since security isn't in 0.21, and the 0.22 release is some way off
as you mention, this would be useful for folks who want the security
features sooner (and want to use an Apache release).

Thanks,
Tom

On Mon, Aug 23, 2010 at 5:27 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
> Even with the work on hadoop-0.22 (trunk) starting in earnest it is fairly
> obvious, given our past history, that it will take a while for us to get it
> stable and deployable - for e.g. it took us nearly 6 months to deploy
> hadoop-0.20.
>
> In the interim I'd like to propose we push a hadoop-0.20-security release
> off the Yahoo! patchset (http://github.com/yahoo/hadoop-common). This will
> ensure the community benefits from all the work done at Yahoo! for over 12
> months *now*, and ensures that we do not have to wait until hadoop-0.22
> which has all of these patches.
>
> Some salient aspects:
> a) Full-fledged security implementation deployed at scale (4000 nodes) in
> production.
> b) Lots of work on the stabilizing and optimizing the NameNode and
> JobTracker for over 12 months. This has been critical in deploying Hadoop at
> scale i.e. clusters of 4000 nodes. For e.g. we have a 50% improvement in CPU
> utilization on the JobTracker vis-a-vis the hadoop-0.20.2 release.
> c) Several new features in the scheduler (CapacityScheduler), Map-Reduce
> framework, better support for multi-tenancy etc.
> d) Several performance and stability improvements to the system e.g.
> iterative ls, robustness against rogue clients/jobs/users etc.
>
> Also, given the huge number of features and enhancements I'd like to propose
> we create a new 0.20-security branch and commit the Yahoo patchset there for
> the release.
>
> This has been proposed earlier by Doug and did not get far due to concerns
> about the effect this would have on development on trunk. However, I
> believe, we have a case for demonstrable progress on trunk now, and it would
> be useful to have an interim, fully-tested Apache Hadoop release available
> to the community.
>
>  Conceivably, one could imagine a Hadoop Security + Append release soon
> after. At this point a Hadoop Security release alone would add tremendous
> value for the reasons above. Presently we would like to get this release out
> quickly to focus the majority of our efforts on trunk.
>
> Thoughts?
>
> Arun
>
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Steve Loughran <st...@apache.org>.

On 26/08/10 17:09, Arun C Murthy wrote:
>
> On Aug 26, 2010, at 7:11 AM, Steve Loughran wrote:
>
>> On 25/08/10 18:59, Arun C Murthy wrote:
>>> On Aug 25, 2010, at 10:46 AM, Hemanth Yamijala wrote:
>>>
>>>> Arun,
>>>>
>>>> How much time do you think it would take to have a version of 0.20
>>>> with the security features in it ready ? In a different thread, Owen
>>>> has started discussing plans around 0.22. Do you think this effort
>>>> would affect 0.22 release ?
>>>>
>>>
>>> I think it should be fairly trivial to get this release out - most of
>>> the effort is just the mechanics of committing the patches to an Apache
>>> branch from the yahoo git repository, creating a release candidate and
>>> calling it a success! *smile*
>>
>> oh, and testing it..
>>
>
> Already is! *smile*
> It's running on 4k clusters in production at this point...
>

+1 then, ship it.

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Aug 26, 2010, at 7:11 AM, Steve Loughran wrote:

> On 25/08/10 18:59, Arun C Murthy wrote:
>> On Aug 25, 2010, at 10:46 AM, Hemanth Yamijala wrote:
>>
>>> Arun,
>>>
>>> How much time do you think it would take to have a version of 0.20
>>> with the security features in it ready ? In a different thread, Owen
>>> has started discussing plans around 0.22. Do you think this effort
>>> would affect 0.22 release ?
>>>
>>
>> I think it should be fairly trivial to get this release out - most of
>> the effort is just the mechanics of committing the patches to an Apache
>> branch from the yahoo git repository, creating a release candidate and
>> calling it a success! *smile*
>
> oh, and testing it..
>

Already is! *smile*
It's running on 4k clusters in production at this point...

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Steve Loughran <st...@apache.org>.

On 25/08/10 18:59, Arun C Murthy wrote:
> On Aug 25, 2010, at 10:46 AM, Hemanth Yamijala wrote:
>
>> Arun,
>>
>> How much time do you think it would take to have a version of 0.20
>> with the security features in it ready ? In a different thread, Owen
>> has started discussing plans around 0.22. Do you think this effort
>> would affect 0.22 release ?
>>
>
> I think it should be fairly trivial to get this release out - most of
> the effort is just the mechanics of committing the patches to an Apache
> branch from the yahoo git repository, creating a release candidate and
> calling it a success! *smile*

oh, and testing it..



Which scalability patches, such as HDFS-599, are in?

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Aug 25, 2010, at 10:46 AM, Hemanth Yamijala wrote:

> Arun,
>
> How much time do you think it would take to have a version of 0.20
> with the security features in it ready ? In a different thread, Owen
> has started discussing plans around 0.22. Do you think this effort
> would affect 0.22 release ?
>

I think it should be fairly trivial to get this release out - most of  
the effort is just the mechanics of committing the patches to an  
Apache branch from the yahoo git repository, creating a release  
candidate and calling it a success! *smile*

I think doing this quickly is critical in ensuring that we do not lose  
focus on 0.22, but I believe this will definitely help the community.

> I do agree that this would be very useful for folks who want security
> sooner. And the fact that Yahoo! have been running it at scale for a
> good while now is also assuring.

Just to clarify - this has security and a bunch of other enhancements  
(which are either in 0.21 or 0.22 or both).

Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Devaraj Das <dd...@yahoo-inc.com>.

>As has been mentioned a few times, part of the security features are dependent upon Yahoo!-type operations.

Allen, could you please list them here again (for the benefit of the community)? Or are you referring only to the cluster-wide start scripts?

On 8/25/10 1:25 PM, "Allen Wittenauer" <aw...@linkedin.com> wrote:

On Aug 25, 2010, at 10:46 AM, Hemanth Yamijala wrote:
> I do agree that this would be very useful for folks who want security
> sooner. And the fact that Yahoo! have been running it at scale for a
> good while now is also reassuring.

As has been mentioned a few times, part of the security features are dependent upon Yahoo!-type operations.  Those would need to get replaced or a decision would need to be made that we are removing/regressing certain features (the cluster-wide start scripts).

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Allen Wittenauer <aw...@linkedin.com>.

On Aug 25, 2010, at 10:46 AM, Hemanth Yamijala wrote:
> I do agree that this would be very useful for folks who want security
> sooner. And the fact that Yahoo! have been running it at scale for a
> good while now is also reassuring.

As has been mentioned a few times, part of the security features are dependent upon Yahoo!-type operations.  Those would need to get replaced or a decision would need to be made that we are removing/regressing certain features (the cluster-wide start scripts).

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Hemanth Yamijala <yh...@gmail.com>.

Arun,

How much time do you think it would take to have a version of 0.20
with the security features in it ready? In a different thread, Owen
has started discussing plans around 0.22. Do you think this effort
would affect the 0.22 release?

I do agree that this would be very useful for folks who want security
sooner. And the fact that Yahoo! have been running it at scale for a
good while now is also reassuring.

Thanks
hemanth

On Tue, Aug 24, 2010 at 5:57 AM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
> Even with the work on hadoop-0.22 (trunk) starting in earnest it is fairly
> obvious, given our past history, that it will take a while for us to get it
> stable and deployable - for e.g. it took us nearly 6 months to deploy
> hadoop-0.20.
>
> In the interim I'd like to propose we push a hadoop-0.20-security release
> off the Yahoo! patchset (http://github.com/yahoo/hadoop-common). This will
> ensure the community benefits from all the work done at Yahoo! for over 12
> months *now*, and ensures that we do not have to wait until hadoop-0.22
> which has all of these patches.
>
> Some salient aspects:
> a) Full-fledged security implementation deployed at scale (4000 nodes) in
> production.
> b) Lots of work on stabilizing and optimizing the NameNode and
> JobTracker for over 12 months. This has been critical in deploying Hadoop at
> scale i.e. clusters of 4000 nodes. For e.g. we have a 50% improvement in CPU
> utilization on the JobTracker vis-a-vis the hadoop-0.20.2 release.
> c) Several new features in the scheduler (CapacityScheduler), Map-Reduce
> framework, better support for multi-tenancy etc.
> d) Several performance and stability improvements to the system e.g.
> iterative ls, robustness against rogue clients/jobs/users etc.
>
> Also, given the huge number of features and enhancements I'd like to propose
> we create a new 0.20-security branch and commit the Yahoo patchset there for
> the release.
>
> This has been proposed earlier by Doug and did not get far due to concerns
> about the effect this would have on development on trunk. However, I
> believe, we have a case for demonstrable progress on trunk now, and it would
> be useful to have an interim, fully-tested Apache Hadoop release available
> to the community.
>
>  Conceivably, one could imagine a Hadoop Security + Append release soon
> after. At this point a Hadoop Security release alone would add tremendous
> value for the reasons above. Presently we would like to get this release out
> quickly to focus the majority of our efforts on trunk.
>
> Thoughts?
>
> Arun
>
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Owen O'Malley <om...@apache.org>.

On Thu, Aug 26, 2010 at 12:08 PM, Stack <st...@duboce.net> wrote:
> Sounds good to me.  What will this release be called?  hadoop-0.20.3-security?

It is a new branch, so the question is what is the branch name. I'd
propose calling it 0.20-security and the releases would be
0.20-security.0, etc.

> Well, it'd probably be better if we just did an append release first?

I don't think the ordering matters. 0.20-security is a different branch
that isn't comparable to 0.20-append.

0.20 < 0.20-security < 0.22
0.20 < 0.20-append < 0.21 < 0.22

It does make a bit of a mess.

-- Owen
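
The two orderings in the message above describe a partial order rather
than a single version line: 0.20-security and 0.20-append both descend
from 0.20 and both precede 0.22, but neither precedes the other. A
minimal Python sketch of that relationship follows. The edge list is
taken from the two lines in the message; the function names (precedes,
comparable) are illustrative assumptions and not part of any Hadoop
tooling.

```python
# Branch ordering as stated in the message above: (a, b) means a < b.
EDGES = [
    ("0.20", "0.20-security"),
    ("0.20-security", "0.22"),
    ("0.20", "0.20-append"),
    ("0.20-append", "0.21"),
    ("0.21", "0.22"),
]


def precedes(a, b, edges=EDGES):
    """Return True if branch a is strictly before branch b."""
    stack = [a]
    seen = set()
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                if dst == b:
                    return True
                seen.add(dst)
                stack.append(dst)
    return False


def comparable(a, b):
    """Return True if the two branches are ordered relative to each other."""
    return a == b or precedes(a, b) or precedes(b, a)


if __name__ == "__main__":
    print(precedes("0.20", "0.22"))
    print(comparable("0.20-security", "0.20-append"))
```

Under these assumptions, comparable("0.20-security", "0.20-append") is
false, which is the "mess" the message refers to: a release carrying
both feature sets would need its own merge work.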

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Aug 26, 2010, at 4:30 PM, Ted Yu wrote:

> This would imply hadoop-0.20-security-append or
> hadoop-0.20-append-security release be created which contains
> security and append features.

As I mentioned in my initial proposal - it's conceivable, not imminent.
The community might decide that it is a valuable direction and folks  
may work on integrating the two.

At this point, I am signing up to shepherd hadoop-0.20-security. I'd  
like to do it quickly and move on to working on Hadoop trunk, others  
are welcome to take this and run further.

Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Ted Yu <yu...@gmail.com>.

This would imply hadoop-0.20-security-append or hadoop-0.20-append-security
release be created which contains security and append features.

On Thu, Aug 26, 2010 at 4:22 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:

>
> On Aug 26, 2010, at 12:08 PM, Stack wrote:
>
>  On Mon, Aug 23, 2010 at 5:27 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
>>
>>> In the interim I'd like to propose we push a hadoop-0.20-security release
>>> off the Yahoo! patchset (http://github.com/yahoo/hadoop-common). This will
>>> ensure the community benefits from all the work done at Yahoo! for over 12
>>> months *now*, and ensures that we do not have to wait until hadoop-0.22
>>> which has all of these patches.
>>>
>>>
>> Sounds good to me.  What will this release be called?  hadoop-0.20.3-security?
>>
>
> hadoop-0.20-security. I want to ensure hadoop-0.20 be a separate line, so
> as to not confuse people.
>
>
>
>>> Conceivably, one could imagine a Hadoop Security + Append release soon
>>> after.
>>>
>>
>> Well, it'd probably be better if we just did an append release first?
>> A good few of us have been banging on the 0.20-append branch for a
>> while now and it's for sure doing append better than 0.20 did (smile).
>>
>
> I think these are orthogonal and both can run their own course.
>
> Arun
>

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Aug 26, 2010, at 12:08 PM, Stack wrote:

> On Mon, Aug 23, 2010 at 5:27 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
>> In the interim I'd like to propose we push a hadoop-0.20-security release
>> off the Yahoo! patchset (http://github.com/yahoo/hadoop-common). This will
>> ensure the community benefits from all the work done at Yahoo! for over 12
>> months *now*, and ensures that we do not have to wait until hadoop-0.22
>> which has all of these patches.
>>
>
> Sounds good to me.  What will this release be called?  hadoop-0.20.3-security?

hadoop-0.20-security. I want to ensure hadoop-0.20 be a separate line,  
so as to not confuse people.


>
>>  Conceivably, one could imagine a Hadoop Security + Append release soon
>> after.
>
> Well, it'd probably be better if we just did an append release first?
> A good few of us have been banging on the 0.20-append branch for a
> while now and it's for sure doing append better than 0.20 did (smile).

I think these are orthogonal and both can run their own course.

Arun

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset

Posted by Stack <st...@duboce.net>.

On Mon, Aug 23, 2010 at 5:27 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
> In the interim I'd like to propose we push a hadoop-0.20-security release
> off the Yahoo! patchset (http://github.com/yahoo/hadoop-common). This will
> ensure the community benefits from all the work done at Yahoo! for over 12
> months *now*, and ensures that we do not have to wait until hadoop-0.22
> which has all of these patches.
>

Sounds good to me.  What will this release be called?  hadoop-0.20.3-security?

>  Conceivably, one could imagine a Hadoop Security + Append release soon
> after.

Well, it'd probably be better if we just did an append release first?
A good few of us have been banging on the 0.20-append branch for a
while now and it's for sure doing append better than 0.20 did (smile).

St.Ack