Posted to general@hadoop.apache.org by Ian Holsman <ha...@holsman.net> on 2010/10/21 21:13:16 UTC

bringing the codebases back in line

Hi guys.

I wanted to start a conversation about how we could merge the cloudera +
yahoo distributions of hadoop into our codebase,
and what would be required.

Re: bringing the codebases back in line

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Oct 21, 2010, at 2:53 PM, Ian Holsman wrote:

> yep.. I've heard it's a source of contention...

	Sure. Maybe like 8 months ago to anyone who was paying attention.

> In discussing it with people, I've heard that a major issue (not the only
> one i'm sure) is lack of resources to actually test the apache releases on
> large clusters, and
> 
> So I thought I would start the thread to see if we could at least identify
> what people think the problems are.

	I think you've asked the wrong question to begin with, and I think the discussion you've started is going to be completely counterproductive and put everyone on the defensive. [It is pretty clear it already has.] The team is finally making progress towards a common goal (0.22) and (mostly) getting along.

	This question/issue

>  that it is very hard getting this done in short cycles
> (hence the large gap between 20.x and 21).

	is really the crux of the problem.

	We wouldn't have had major divergences between the distributions if the mainline Apache distribution had been making more frequent releases. IMNSHO, as long as the distributions don't break compatibility (forward or backward) without a word of warning... who cares?


Re: bringing the codebases back in line

Posted by Steve Loughran <st...@apache.org>.
On 21/10/10 22:53, Ian Holsman wrote:
> yep.. I've heard it's a source of contention...
>
> but I'd like to see how we can get it so the number of patches that the
> large companies apply on top of the current production apache release gets
> minimized, and the large installations are all running nearly identical code
> on their clusters, and that we wouldn't need to have a yahoo or cloudera
> repo of their patch sets made available.
>
> So ideally I'd like to hear what kind of things apache needs to do to help get
> these kinds of things less divergent.
>
> In discussing it with people, I've heard that a major issue (not the only
> one i'm sure) is lack of resources to actually test the apache releases on
> large clusters, and that it is very hard getting this done in short cycles
> (hence the large gap between 20.x and 21).
>
> So I thought I would start the thread to see if we could at least identify
> what people think the problems are.

A big issue is the $ value of the data in a production cluster, the size
of the large clusters, and the fact that they are in use. The only way
to test on a few hundred nodes -especially now that 12-24TB/node is
possible- is when you are bringing up a cluster of this size, which is a
pretty rare event. Lots of us have small real or virtual clusters, but
they don't stress the NN or JT, and don't raise emergent problems like the
increasing cost of rebalancing 24TB nodes if they go down, etc.


Re: bringing the codebases back in line

Posted by Konstantin Boudnik <co...@apache.org>.
On Fri, Oct 22, 2010 at 12:40PM, Steve Loughran wrote:
> On 22/10/10 01:10, Konstantin Boudnik wrote:
> >
> >The only way, IMO, to have a reasonable testing done on a system as complex as
> >Hadoop is to invest into automatic validation of builds at system level. This
> >requires a few things (resources, if you will):
> >   - extra hardware (the easiest and cheapest problem)
> >   - automatic deployment, testing, and analysis
> >   - system tests development which is able to control and observe cluster
> >     behavior (in other words something more sophisticated than just shell
> >     scripts)
> >
> +1 for testing, I would like to help with this, but my test stuff
> depends on my lifecycle stuff which I need to sit down, sync up with
> trunk and work out how to get in.
> 
> One thing you can do in a virtual world which you can't do in the
> physical space is reconfigure the LAN on the fly, to see what
> happens. For example, I could set up VLANs of two racks and a switch
> between them, then partition the two and see what happens -while a
> simulated external load (separate issue) hits the NN with the same
> amount of traffic. Fun things.

Awesome idea! I guess it is well aligned with Herriot's ability to do fault
injection on real (or virtual) hardware.


Re: bringing the codebases back in line

Posted by Steve Loughran <st...@apache.org>.
On 22/10/10 01:10, Konstantin Boudnik wrote:
> On Thu, Oct 21, 2010 at 05:53PM, Ian Holsman wrote:
>> In discussing it with people, I've heard that a major issue (not the only
>> one i'm sure) is lack of resources to actually test the apache releases on
>> large clusters, and that it is very hard getting this done in short cycles
>> (hence the large gap between 20.x and 21).
>
> I do agree that the lack of resources for testing Hadoop is a problem. However,
> there might be some slight difference in the meaning of the word 'resources' ;)
>
> The only way, IMO, to have a reasonable testing done on a system as complex as
> Hadoop is to invest into automatic validation of builds at system level. This
> requires a few things (resources, if you will):
>    - extra hardware (the easiest and cheapest problem)
>    - automatic deployment, testing, and analysis
>    - system tests development which is able to control and observe cluster
>      behavior (in other words something more sophisticated than just shell
>      scripts)
>
> And for semi-adequate system testing you don't need a large cluster: 10-20
> nodes will be sufficient in most cases. But the automation of all the
> processes, starting from deployment, is the key. Test automation is in
> slightly better shape, since Hadoop has a system test framework called Herriot
> (part of the Hadoop code base for about 7 months now), but it still needs
> further extending.
>

+1 for testing, I would like to help with this, but my test stuff 
depends on my lifecycle stuff which I need to sit down, sync up with 
trunk and work out how to get in.

One thing you can do in a virtual world which you can't do in the
physical space is reconfigure the LAN on the fly, to see what happens.
For example, I could set up VLANs of two racks and a switch between
them, then partition the two and see what happens -while a simulated
external load (separate issue) hits the NN with the same amount of
traffic. Fun things.
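
A rack partition of the kind described here could also be staged without touching switches at all, e.g. by generating host-level firewall rules. The sketch below only builds `iptables` command strings for a made-up two-rack layout; the addresses, and the per-node-rules approach (as opposed to VLAN reconfiguration), are illustrative assumptions, not anything from the thread.

```python
# Sketch: compute, per node, the iptables DROP rules that would
# cut all traffic to the opposite rack, simulating a partition
# between two racks of a virtual cluster. Commands are generated,
# not executed; a test harness would apply them on each node and
# later flush them to heal the partition.

def partition_rules(rack_a, rack_b):
    """Map each node address to the DROP rules it should apply."""
    rules = {}
    for node in rack_a:
        rules[node] = [f"iptables -A INPUT -s {peer} -j DROP"
                       for peer in rack_b]
    for node in rack_b:
        rules[node] = [f"iptables -A INPUT -s {peer} -j DROP"
                       for peer in rack_a]
    return rules

# Hypothetical layout: two racks of two nodes each.
rules = partition_rules(["10.0.1.1", "10.0.1.2"],
                        ["10.0.2.1", "10.0.2.2"])
for node, cmds in sorted(rules.items()):
    print(node, cmds)
```

Generating rather than executing the commands keeps the partition plan inspectable, and healing is just flushing the added rules on each node.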

Re: bringing the codebases back in line

Posted by Konstantin Boudnik <co...@apache.org>.
On Thu, Oct 21, 2010 at 05:53PM, Ian Holsman wrote:
> In discussing it with people, I've heard that a major issue (not the only
> one i'm sure) is lack of resources to actually test the apache releases on
> large clusters, and that it is very hard getting this done in short cycles
> (hence the large gap between 20.x and 21).

I do agree that the lack of resources for testing Hadoop is a problem. However,
there might be some slight difference in the meaning of the word 'resources' ;)

The only way, IMO, to have a reasonable testing done on a system as complex as
Hadoop is to invest into automatic validation of builds at system level. This
requires a few things (resources, if you will):
  - extra hardware (the easiest and cheapest problem)
  - automatic deployment, testing, and analysis
  - system tests development which is able to control and observe cluster
    behavior (in other words something more sophisticated than just shell
    scripts)

And for semi-adequate system testing you don't need a large cluster: 10-20
nodes will be sufficient in most cases. But the automation of all the
processes, starting from deployment, is the key. Test automation is in
slightly better shape, since Hadoop has a system test framework called Herriot
(part of the Hadoop code base for about 7 months now), but it still needs
further extending.
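
The resource list above boils down to an unattended deploy/test/analyse loop. A minimal sketch of that orchestration follows; every stage is a hypothetical stub (Herriot itself is a Java framework, so this Python driver only illustrates the shape, not any real API):

```python
# Sketch of an automated system-level build validation loop:
# deploy a build to a small (10-20 node) cluster, drive system
# tests against it, and summarise the results. Every function
# body here is a stub standing in for real tooling.

def deploy(build_id, nodes):
    """Stub: push the build to a test cluster and start the daemons."""
    return {"build": build_id, "nodes": nodes}

def run_system_tests(cluster):
    """Stub: exercise the cluster and observe its behavior,
    the way a framework like Herriot would."""
    return [{"name": "hdfs-write-read", "passed": True},
            {"name": "jobtracker-submit", "passed": True}]

def analyse(results):
    """Reduce raw results so a human only looks at failures."""
    failed = [r["name"] for r in results if not r["passed"]]
    return {"total": len(results), "failed": failed}

def validate_build(build_id, nodes=10):
    cluster = deploy(build_id, nodes)
    return analyse(run_system_tests(cluster))

print(validate_build("0.22-rc0"))
```

The point is not the stubs but the shape: once deployment is scripted, the same driver can validate every build unattended.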

Hopefully this gives you an overview of the cluster-testing side of the issue.
  Cos

> So I thought I would start the thread to see if we could at least identify
> what the people think are the problems are.
> 
> 
> On Thu, Oct 21, 2010 at 3:30 PM, Allen Wittenauer
> <aw...@linkedin.com> wrote:
> 
> >
> > On Oct 21, 2010, at 12:13 PM, Ian Holsman wrote:
> >
> > > Hi guys.
> > >
> > > I wanted to start a conversation about how we could merge the
> > cloudera +
> > > yahoo distributions of hadoop into our codebase,
> > > and what would be required.
> >
> >
> > *grabs popcorn*
> >
> >

Re: bringing the codebases back in line

Posted by Ian Holsman <ha...@holsman.net>.
yep.. I've heard it's a source of contention...

but I'd like to see how we can get it so the number of patches that the
large companies apply on top of the current production apache release gets
minimized, and the large installations are all running nearly identical code
on their clusters, and that we wouldn't need to have a yahoo or cloudera
repo of their patch sets made available.

So ideally I'd like to hear what kind of things apache needs to do to help get
these kinds of things less divergent.

In discussing it with people, I've heard that a major issue (not the only
one i'm sure) is lack of resources to actually test the apache releases on
large clusters, and that it is very hard getting this done in short cycles
(hence the large gap between 20.x and 21).

So I thought I would start the thread to see if we could at least identify
what people think the problems are.


On Thu, Oct 21, 2010 at 3:30 PM, Allen Wittenauer
<aw...@linkedin.com> wrote:

>
> On Oct 21, 2010, at 12:13 PM, Ian Holsman wrote:
>
> > Hi guys.
> >
> > I wanted to start a conversation about how we could merge the
> cloudera +
> > yahoo distributions of hadoop into our codebase,
> > and what would be required.
>
>
> *grabs popcorn*
>
>

Re: bringing the codebases back in line

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Oct 21, 2010, at 12:13 PM, Ian Holsman wrote:

> Hi guys.
> 
> I wanted to start a conversation about how we could merge the cloudera +
> yahoo distributions of hadoop into our codebase,
> and what would be required.


*grabs popcorn*


Re: bringing the codebases back in line

Posted by Steve Loughran <st...@apache.org>.
On 22/10/10 01:42, Milind A Bhandarkar wrote:
>
>>
>> but the other question I have which hopefully you guys can answer is does
>> the yahoo distribution have ALL the patches from the trunk on it? because if
>> it doesn't I think that is problematic as well for other reasons.
>
>
> What are these "other" reasons ?
>
> yahoo distribution runs on our production clusters, and I do not see why any production cluster should run code from trunk.

Transient virtual clusters where the FS only exists for a couple of
hours can do this, but big live physical ones shouldn't: too much data
is at risk. So it depends on your view of "production". If it is "someone
wants a cluster for 3h to analyse their nightly logs", then you can get
away with it -it's the best testing you can do.

One problem I hit doing this is that if you upgrade every week and
the app starts failing, you can waste a lot of time trying to decide
whether it's the app code that's changed or trunk, and then, if it's trunk,
whether that's a regression or a change that's caused a bug in the app
code to surface. To be really strict, you should A/B test your virtual
clusters on the stable and trunk versions, see which finishes faster
-and whether there are any differences in the output.


that would be very slick for testing indeed.
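
That strict A/B discipline could look like the sketch below: run the same job against a stable-release cluster and a trunk cluster, then compare wall-clock time and an output checksum. `run_job` is a hypothetical stub (real code would submit the job and read its output back from HDFS), and the timings are invented for illustration.

```python
# Sketch: A/B-test the same workload on "stable" vs "trunk"
# clusters and report runtimes plus whether the outputs agree.
import hashlib

def run_job(cluster, job_input):
    """Stub standing in for real job submission; returns
    (elapsed_seconds, output_text)."""
    seconds = 42.0 if cluster == "stable" else 40.5  # invented timings
    output = f"wordcount-result-for-{job_input}"     # identical by design here
    return seconds, output

def ab_compare(job_input):
    t_stable, out_stable = run_job("stable", job_input)
    t_trunk, out_trunk = run_job("trunk", job_input)
    digest = lambda s: hashlib.md5(s.encode()).hexdigest()
    return {"stable_s": t_stable,
            "trunk_s": t_trunk,
            "outputs_match": digest(out_stable) == digest(out_trunk)}

print(ab_compare("nightly-logs"))
```

A false `outputs_match` is the interesting signal: it separates "trunk got slower" from "trunk changed the answer", which is exactly the regression-vs-surfaced-bug distinction above.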



Re: bringing the codebases back in line

Posted by Eli Collins <el...@cloudera.com>.
On Sat, Oct 23, 2010 at 1:48 AM, Bernd Fondermann <bf...@brainlounge.de> wrote:
> On 22.10.10 23:52, Eli Collins wrote:
>>
>> On Thu, Oct 21, 2010 at 5:54 PM, Ian Holsman <ha...@holsman.net> wrote:
>> Hey Ian,
>>
>> I think we're all in agreement that we need to re-convene on a common
>> branch that removes most of the deltas against an Apache release that
>> we have all accumulated (primarily security, append, trunk backports).
>> The open question is whether we try to come up with a common 20-based
>> branch or wait for 22.  That's been previously discussed on this list
>> and there were some concerns, re-iterated on this thread, that we
>> should invest in 22 rather than the current 20-based branches.
>>
>> Our current plan is to reconvene with everyone on 22 - a well-tested
>> release with security and append should get users off the current
>> 20-based branches. However if you and/or the Apache community feels
>> that there needs to be an Apache 20-based branch and release that
>> reflects what people are using (security, append, various backports in
>> YDH/CDH) we are willing to create and maintain this branch on Apache.
>>
>> Thanks,
>> Eli
>
> Eli,
>
> You are using "we" and "our" placeholders a lot in the last two paragraphs
> and my brain's parser is failing to resolve them correctly - please can you
> clarify where you speak of "we" the community/the
> PMC/Cloudera/Yahoo/somebody else respectively.
>
> Thank you,
>
>  Bernd
>

Hey Bernd,

Apologies for the confusion, I see your point. This is hopefully more clear:

I think the *Hadoop contributors* are all in agreement that we need to
re-convene on a common branch that removes most of the deltas against
an Apache release that we have all accumulated (primarily security,
append, trunk backports). The open question is whether *those of us
that maintain 20-based branches (used internally or distributed)* try
to come up with a common 20-based branch or wait for 22.  That's been
previously discussed on this list and there were some concerns,
re-iterated on this thread, that *Hadoop contributors* should invest
in 22 rather than the current 20-based branches.

*Those of us who maintain Cloudera's 20-based branch* would like to
reconvene with everyone on 22 - a well-tested release from trunk with
security and append should get users off the current 20-based
branches. However if you and/or the Apache community feels that there
needs to be an Apache 20-based branch and release that reflects what
people are using (security, append, various backports in YDH/CDH),
*those of us who maintain Cloudera's 20-based branch* are willing to
create and maintain this branch at Apache.

Thanks,
Eli

Re: bringing the codebases back in line

Posted by Bernd Fondermann <bf...@brainlounge.de>.
On 22.10.10 23:52, Eli Collins wrote:
> On Thu, Oct 21, 2010 at 5:54 PM, Ian Holsman <ha...@holsman.net> wrote:
> Hey Ian,
>
> I think we're all in agreement that we need to re-convene on a common
> branch that removes most of the deltas against an Apache release that
> we have all accumulated (primarily security, append, trunk backports).
> The open question is whether we try to come up with a common 20-based
> branch or wait for 22.  That's been previously discussed on this list
> and there were some concerns, re-iterated on this thread, that we
> should invest in 22 rather than the current 20-based branches.
>
> Our current plan is to reconvene with everyone on 22 - a well-tested
> release with security and append should get users off the current
> 20-based branches. However if you and/or the Apache community feels
> that there needs to be an Apache 20-based branch and release that
> reflects what people are using (security, append, various backports in
> YDH/CDH) we are willing to create and maintain this branch on Apache.
>
> Thanks,
> Eli

Eli,

You are using "we" and "our" placeholders a lot in the last two 
paragraphs and my brain's parser is failing to resolve them correctly - 
please can you clarify where you speak of "we" the community/the 
PMC/Cloudera/Yahoo/somebody else respectively.

Thank you,

   Bernd

Re: bringing the codebases back in line

Posted by Milind A Bhandarkar <mi...@yahoo-inc.com>.
> "as well" as in: I agree with the other people who want to have a 0.22 release,
> as opposed to wanting to have another 0.20 release.


+1 !

- Milind

--
Milind Bhandarkar
(mailto:milindb@yahoo-inc.com)
(phone: 408-203-5213 W)



Re: bringing the codebases back in line

Posted by Ian Holsman <ha...@holsman.net>.
On Fri, Oct 22, 2010 at 11:04 PM, Milind A Bhandarkar <milindb@yahoo-inc.com> wrote:

>
> On Oct 22, 2010, at 6:33 PM, Ian Holsman wrote:
>
> > I think we should push forward to 0.22 as well.
>
> "As Well" ? That means there is something else you want to do, right ?
>

"as well" as in: I agree with the other people who want to have a 0.22 release,
as opposed to wanting to have another 0.20 release.


>
> What is it ?
>
> You have said in earlier emails that "Yahoo distribution of hadoop not
> being the same as apache hadoop trunk will cause 'other' problems".
>

I'm picking on yahoo here, but the same could be said for cloudera as well.

>
> Let me ask a yes/no question, based on some of your ambiguous statements in
> this thread.
>
> Do you want Yahoo! distribution of Hadoop the same as trunk ?
>


I want the Yahoo & Cloudera distributions of hadoop to be as close as
possible to the released version of apache hadoop.

I want Yahoo (and others) to look at the apache release and be able to say we
can use this on our own cluster, and not have to maintain their 500 or so
patches on top of the standard release.

I want to get the 0.22 (and future) apache releases to a point where the
internal Yahoo developers start asking themselves if they should switch, and
if there is a need for them to maintain their github release at all.

and like Bernd says, I don't have the power to dictate what Yahoo runs on
their cluster internally, neither do I want it.

As a user I was quite pleased when Yahoo and Cloudera put their versions out
there. It was tremendously helpful to me getting my shit done, but by
releasing them (and by how different they were) they told me that I
shouldn't be running on a standard apache release.

To repeat, for those who think I write vaguely:
I want to remove the need for multiple distributions.


which brings us back to the original thread:

one approach suggested to resolve the multiple branches is to do releases
frequently, but in order to do that we need to have things in place to help
test the releases quickly, so as to ensure the quality is there.



> - milind
>
>
> --
> Milind Bhandarkar
> (mailto:milindb@yahoo-inc.com)
> (phone: 408-203-5213 W)
>
>
>

Re: bringing the codebases back in line

Posted by Milind A Bhandarkar <mi...@yahoo-inc.com>.
> 
> Side note: Here at Apache it is common to prefix a statement with "with 
> my PMC hat on...", "with my board hat on...", "with my $EMPLOYER hat on...".


I have my "user of yahoo distribution of hadoop" hat on.

- milind

--
Milind Bhandarkar
(mailto:milindb@yahoo-inc.com)
(phone: 408-203-5213 W)



Re: bringing the codebases back in line

Posted by Bernd Fondermann <bf...@brainlounge.de>.
On 23.10.10 05:04, Milind A Bhandarkar wrote:
>
> On Oct 22, 2010, at 6:33 PM, Ian Holsman wrote:
>
>> I think we should push forward to 0.22 as well.
>
> "As Well" ? That means there is something else you want to do, right ?
>
> What is it ?
>
> You have said in earlier emails that "Yahoo distribution of hadoop not being the same as apache hadoop trunk will cause 'other' problems".
>
> Let me ask a yes/no question, based on some of your ambiguous statements in this thread.
>
> Do you want Yahoo! distribution of Hadoop the same as trunk ?

(You didn't ask me but FWIW) I don't think the Hadoop community can
mandate or even should care what a company puts in their
downstream distributions. After all, this is Apache-licensed code.

Individuals who work on Hadoop and commercial derivatives of Hadoop at 
the same time might be in a different position, but that's basically 
their problem.

Side note: Here at Apache it is common to prefix a statement with "with 
my PMC hat on...", "with my board hat on...", "with my $EMPLOYER hat on...".

So, with my ASF member hat on, I'd say that the Hadoop project is only 
tasked with working on Apache premises (Apache mailing lists, svn, 
wikis, etc). Currently, this community is far too concerned with
what downstream projects are doing with the code. Reason: currently,
Hadoop looks like a downstream project of others. This is very, very
bad and has to be turned around.
How to get there?
1. Firmly put your Apache contributor hat on
2. Integrate all the pending code that committers are willing to contribute
3. Release.

   Bernd

Re: bringing the codebases back in line

Posted by Milind A Bhandarkar <mi...@yahoo-inc.com>.
On Oct 22, 2010, at 6:33 PM, Ian Holsman wrote:

> I think we should push forward to 0.22 as well.

"As Well" ? That means there is something else you want to do, right ?

What is it ?

You have said in earlier emails that "Yahoo distribution of hadoop not being the same as apache hadoop trunk will cause 'other' problems".

Let me ask a yes/no question, based on some of your ambiguous statements in this thread.

Do you want Yahoo! distribution of Hadoop the same as trunk ?

- milind


--
Milind Bhandarkar
(mailto:milindb@yahoo-inc.com)
(phone: 408-203-5213 W)



Re: bringing the codebases back in line

Posted by Bernd Fondermann <bf...@brainlounge.de>.
On 23.10.10 18:10, Owen O'Malley wrote:
>
> On Oct 23, 2010, at 2:28 AM, Bernd Fondermann wrote:
>
>> On 23.10.10 05:40, Owen O'Malley wrote:
>>>
>>> The current plan of record is to
>>> cut a branch next month, stabilize it, and release it.
>>
>> I'd like to revisit the mailing list thread where this decision was
>> made. Can you point me to it?
>
> Sure.
>
> http://www.mail-archive.com/common-dev@hadoop.apache.org/msg01388.html

Thanks, I wouldn't have been able to track that down myself so easily.

   Bernd

Re: bringing the codebases back in line

Posted by Owen O'Malley <om...@apache.org>.
On Oct 23, 2010, at 2:28 AM, Bernd Fondermann wrote:

> On 23.10.10 05:40, Owen O'Malley wrote:
>>
>> The current plan of record is to
>> cut a branch next month, stabilize it, and release it.
>
> I'd like to revisit the mailing list thread where this decision was  
> made. Can you point me to it?

Sure.

http://www.mail-archive.com/common-dev@hadoop.apache.org/msg01388.html

-- Owen

Re: bringing the codebases back in line

Posted by Bernd Fondermann <bf...@brainlounge.de>.
On 23.10.10 17:48, Allen Wittenauer wrote:
>
> On Oct 23, 2010, at 2:28 AM, Bernd Fondermann wrote:
>
>> On 23.10.10 05:40, Owen O'Malley wrote:
>>>
>>> The current plan of record is to
>>> cut a branch next month, stabilize it, and release it.
>>
>> I'd like to revisit the mailing list thread where this decision was made. Can you point me to it?
>
>
> 	While you are looking for threads, can you point me to the one where the ASF board announced the very heavy handed VP change to the Hadoop community?

Sure, but you'd have to be a committer at the ASF to have access to it:
Message-ID: <4C...@apache.org>
20.10.10 13:15 -0700 - committers@apache.org:
"ASF Board Meeting Summary - October 20, 2010"

PMC chair changes are never announced to a broader audience, AFAICR.
There are other things that currently trouble the ASF community far
more than this VP change, as every Apache committer can read there, and
which are the subject of a press release for everyone to read.

What I learned when talking to users of ASF projects over the years is 
that they mostly don't even know what PMCs are all about, and thus they 
don't care about VPs.

All they care about is features, releases, documentation and support.

   Bernd

Re: bringing the codebases back in line

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Oct 23, 2010, at 2:28 AM, Bernd Fondermann wrote:

> On 23.10.10 05:40, Owen O'Malley wrote:
>> 
>> The current plan of record is to
>> cut a branch next month, stabilize it, and release it.
> 
> I'd like to revisit the mailing list thread where this decision was made. Can you point me to it?


	While you are looking for threads, can you point me to the one where the ASF board announced the very heavy-handed VP change to the Hadoop community?

Re: bringing the codebases back in line

Posted by Bernd Fondermann <bf...@brainlounge.de>.
On 23.10.10 05:40, Owen O'Malley wrote:
>
>  The current plan of record is to
> cut a branch next month, stabilize it, and release it.

I'd like to revisit the mailing list thread where this decision was 
made. Can you point me to it?

Thanks a lot,

   Bernd

Re: bringing the codebases back in line

Posted by Eli Collins <el...@cloudera.com>.
On Fri, Oct 22, 2010 at 8:40 PM, Owen O'Malley <om...@apache.org> wrote:
>
> On Oct 22, 2010, at 6:33 PM, Ian Holsman wrote:
>
>> I think we should push forward to 0.22 as well.
>> The question then becomes what should be in it.
>
> I've had better luck with time-based releases rather than feature-based
> releases for open source projects. The current plan of record is to cut
> a branch next month, stabilize it, and release it.
>
> -- Owen

Great.  At the contributors' meetup it sounded like the plan of record
was to hold 22 for HDFS-1052.  Speaking of which, does someone have
notes to post to the list/wiki?

It would be great to see the project move to periodic time-based
releases.  In previous discussion it seemed like feature-based
releases were favored:
http://www.mail-archive.com/general@hadoop.apache.org/msg01022.html

Thanks,
Eli

Re: bringing the codebases back in line

Posted by Owen O'Malley <om...@apache.org>.
On Oct 22, 2010, at 6:33 PM, Ian Holsman wrote:

> I think we should push forward to 0.22 as well.
> The question then becomes what should be in it.

I've had better luck with time-based releases rather than feature-based
releases for open source projects. The current plan of record is
to cut a branch next month, stabilize it, and release it.

-- Owen

Re: bringing the codebases back in line

Posted by Ian Holsman <ha...@holsman.net>.
I think we should push forward to 0.22 as well.
The question then becomes what should be in it.

2010/10/22 Mahadev Konar <ma...@yahoo-inc.com>

> +1 for moving to 0.22 trunk.
>
> Thanks
> mahadev
>
>
> On 10/22/10 3:03 PM, "Konstantin Boudnik" <co...@apache.org> wrote:
>
> > +1 on moving forward to common 0.22 trunk. 0.20 was dragging on for quite
> a
> > long time and, in a sense, created a certain imbalance toward a 0.20-centric
> > Hadoop environment
> >
> > On Fri, Oct 22, 2010 at 02:52PM, Eli Collins wrote:
> >> On Thu, Oct 21, 2010 at 5:54 PM, Ian Holsman <ha...@holsman.net>
> wrote:
> >>> On Thu, Oct 21, 2010 at 8:42 PM, Milind A Bhandarkar
> >>> <mi...@yahoo-inc.com> wrote:
> >>>
> >>>>> but the other question I have which hopefully you guys can answer is
> does
> >>>>> the yahoo distribution have ALL the patches from the trunk on it?
> because
> >>>> if
> >>>>> it doesn't I think that is problematic as well for other reasons.
> >>>>
> >>>> What are these "other" reasons ?
> >>>
> >>> yahoo distribution runs on our production clusters, and I do not see
> why any
> >>>> production cluster should run code from trunk.
> >>>>
> >>>>
> >>> right.. the trunk is not for production use. I wasn't suggesting that.
> >>>
> >>> but the trunk is what will eventually become the next release.
> >>>
> >>> Then someone in yahoo will have to decide if they are going to move to
> >>> rebase their production cluster to 0.21, or just continue back-porting
> what
> >>> they need to the version they are running on their clusters.
> >>>
> >>> and if yahoo fixes a bug in their version, it would need to be
> >>> forward-ported over to the current trunk. which will get harder and
> harder
> >>> as the paths diverge.
> >>>
> >>> I'm sure you've seen it happen on other projects when a major branch
> lands
> >>> on the trunk, and the amount of effort it takes to reconcile them.
> >>
> >> Hey Ian,
> >>
> >> I think we're all in agreement that we need to re-convene on a common
> >> branch that removes most of the deltas against an Apache release that
> >> we have all accumulated (primarily security, append, trunk backports).
> >> The open question is whether we try to come up with a common 20-based
> >> branch or wait for 22.  That's been previously discussed on this list
> >> and there were some concerns, re-iterated on this thread, that we
> >> should invest in 22 rather than the current 20-based branches.
> >>
> >> Our current plan is to reconvene with everyone on 22 - a well-tested
> >> release with security and append should get users off the current
> >> 20-based branches. However if you and/or the Apache community feels
> >> that there needs to be an Apache 20-based branch and release that
> >> reflects what people are using (security, append, various backports in
> >> YDH/CDH) we are willing to create and maintain this branch on Apache.
> >>
> >> Thanks,
> >> Eli
> >>
> >>
> >>
> >>
> >>> - Milind
> >>>
> >>> --
> >>> Milind Bhandarkar
> >>> (mailto:milindb@yahoo-inc.com)
> >>> (phone: 408-203-5213 W)
> >>>
>
>

Re: bringing the codebases back in line

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
+1 for moving to 0.22 trunk.

Thanks
mahadev


On 10/22/10 3:03 PM, "Konstantin Boudnik" <co...@apache.org> wrote:

> +1 on moving forward to common 0.22 trunk. 0.20 was dragging on for quite a
> long time and, in a sense, created a certain imbalance toward a 0.20-centric
> Hadoop environment
> 
> On Fri, Oct 22, 2010 at 02:52PM, Eli Collins wrote:
>> On Thu, Oct 21, 2010 at 5:54 PM, Ian Holsman <ha...@holsman.net> wrote:
>>> On Thu, Oct 21, 2010 at 8:42 PM, Milind A Bhandarkar
>>> <mi...@yahoo-inc.com> wrote:
>>> 
>>>>> but the other question I have which hopefully you guys can answer is does
>>>>> the yahoo distribution have ALL the patches from the trunk on it? because
>>>> if
>>>>> it doesn't I think that is problematic as well for other reasons.
>>>> 
>>>> What are these "other" reasons ?
>>> 
>>> yahoo distribution runs on our production clusters, and I do not see why any
>>>> production cluster should run code from trunk.
>>>> 
>>>> 
>>> right.. the trunk is not for production use. I wasn't suggesting that.
>>> 
>>> but the trunk is what will eventually become the next release.
>>> 
>>> Then someone in yahoo will have to decide if they are going to move to
>>> rebase their production cluster to 0.21, or just continue back-porting what
>>> they need to the version they are running on their clusters.
>>> 
>>> and if yahoo fixes a bug in their version, it would need to be
>>> forward-ported over to the current trunk. which will get harder and harder
>>> as the paths diverge.
>>> 
>>> I'm sure you've seen it happen on other projects when a major branch lands
>>> on the trunk, and the amount of effort it takes to reconcile them.
>> 
>> Hey Ian,
>> 
>> I think we're all in agreement that we need to re-convene on a common
>> branch that removes most of the deltas against an Apache release that
>> we have all accumulated (primarily security, append, trunk backports).
>> The open question is whether we try to come up with a common 20-based
>> branch or wait for 22.  That's been previously discussed on this list
>> and there were some concerns, re-iterated on this thread, that we
>> should invest in 22 rather than the current 20-based branches.
>> 
>> Our current plan is to reconvene with everyone on 22 - a well-tested
>> release with security and append should get users off the current
>> 20-based branches. However if you and/or the Apache community feels
>> that there needs to be an Apache 20-based branch and release that
>> reflects what people are using (security, append, various backports in
>> YDH/CDH) we are willing to create and maintain this branch on Apache.
>> 
>> Thanks,
>> Eli
>> 
>> 
>> 
>> 
>>> - Milind
>>> 
>>> --
>>> Milind Bhandarkar
>>> (mailto:milindb@yahoo-inc.com)
>>> (phone: 408-203-5213 W)
>>> 


Re: bringing the codebases back in line

Posted by Konstantin Boudnik <co...@apache.org>.
+1 on moving forward to a common 0.22 trunk. 0.20 was dragging on for quite a
long time and, in a sense, created a certain imbalance toward a 0.20-centric
Hadoop environment

On Fri, Oct 22, 2010 at 02:52PM, Eli Collins wrote:
> On Thu, Oct 21, 2010 at 5:54 PM, Ian Holsman <ha...@holsman.net> wrote:
> > On Thu, Oct 21, 2010 at 8:42 PM, Milind A Bhandarkar
> > <mi...@yahoo-inc.com>wrote:
> >
> >> > but the other question I have which hopefully you guys can answer is does
> >> > the yahoo distribution have ALL the patches from the trunk on it? because
> >> if
> >> > it doesn't I think that is problematic as well for other reasons.
> >>
> >> What are these "other" reasons ?
> >
> > yahoo distribution runs on our production clusters, and I do not see why any
> >> production cluster should run code from trunk.
> >>
> >>
> > right.. the trunk is not for production use.  I wasn't suggesting that.
> >
> > but the trunk is what will eventually become the next release.
> >
> > Then someone in yahoo will have to decide if they are going to move to
> > rebase their production cluster to 0.21, or just continue back-porting what
> > they need to the version they are running on their clusters.
> >
> > and if yahoo fixes a bug in their version, it would need to be
> > forward-ported over to the current trunk. which will get harder and harder
> > as the paths diverge.
> >
> > I'm sure you've seen it happen on other projects when a major branch lands
> > on the trunk, and the amount of effort it takes to reconcile them.
> 
> Hey Ian,
> 
> I think we're all in agreement that we need to re-convene on a common
> branch that removes most of the deltas against an Apache release that
> we have all accumulated (primarily security, append, trunk backports).
> The open question is whether we try to come up with a common 20-based
> branch or wait for 22.  That's been previously discussed on this list
> and there were some concerns, re-iterated on this thread, that we
> should invest in 22 rather than the current 20-based branches.
> 
> Our current plan is to reconvene with everyone on 22 - a well-tested
> release with security and append should get users off the current
> 20-based branches. However if you and/or the Apache community feels
> that there needs to be an Apache 20-based branch and release that
> reflects what people are using (security, append, various backports in
> YDH/CDH) we are willing to create and maintain this branch on Apache.
> 
> Thanks,
> Eli
> 
> 
> 
> 
> >  - Milind
> >
> > --
> > Milind Bhandarkar
> > (mailto:milindb@yahoo-inc.com)
> > (phone: 408-203-5213 W)
> >

Re: bringing the codebases back in line

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Oct 21, 2010 at 5:54 PM, Ian Holsman <ha...@holsman.net> wrote:
> On Thu, Oct 21, 2010 at 8:42 PM, Milind A Bhandarkar
> <mi...@yahoo-inc.com>wrote:
>
>>
>> >
>> > but the other question I have which hopefully you guys can answer is does
>> > the yahoo distribution have ALL the patches from the trunk on it? because
>> if
>> > it doesn't I think that is problematic as well for other reasons.
>>
>>
>> What are these "other" reasons ?
>>
>
> yahoo distribution runs on our production clusters, and I do not see why any
>> production cluster should run code from trunk.
>>
>>
> right.. the trunk is not for production use.  I wasn't suggesting that.
>
> but the trunk is what will eventually become the next release.
>
> Then someone in yahoo will have to decide if they are going to move to
> rebase their production cluster to 0.21, or just continue back-porting what
> they need to the version they are running on their clusters.
>
> and if yahoo fixes a bug in their version, it would need to be
> forward-ported over to the current trunk. which will get harder and harder
> as the paths diverge.
>
> I'm sure you've seen it happen on other projects when a major branch lands
> on the trunk, and the amount of effort it takes to reconcile them.

Hey Ian,

I think we're all in agreement that we need to re-convene on a common
branch that removes most of the deltas against an Apache release that
we have all accumulated (primarily security, append, trunk backports).
The open question is whether we try to come up with a common 20-based
branch or wait for 22.  That's been previously discussed on this list
and there were some concerns, re-iterated on this thread, that we
should invest in 22 rather than the current 20-based branches.

Our current plan is to reconvene with everyone on 22 - a well-tested
release with security and append should get users off the current
20-based branches. However if you and/or the Apache community feels
that there needs to be an Apache 20-based branch and release that
reflects what people are using (security, append, various backports in
YDH/CDH) we are willing to create and maintain this branch on Apache.

Thanks,
Eli




>  - Milind
>
> --
> Milind Bhandarkar
> (mailto:milindb@yahoo-inc.com)
> (phone: 408-203-5213 W)
>

Re: bringing the codebases back in line

Posted by Sanjay Radia <sr...@yahoo-inc.com>.
On Oct 22, 2010, at 10:36 AM, Konstantin Shvachko wrote:

> Milind's point is valid: the PMC cannot demand or control what Yahoo,
> Facebook, et al. run in production, or what Cloudera sells to their
> customers  AS  LONG  AS  it is within the Apache licensing
> requirements.
>
> What Apache Hadoop can and should provide is a *steady* stream of base
> A-releases.
>
> I think the single fact that we failed to release Hadoop 0.21 late last
> year got us into the state we are in now. It let different Hadoop
> installations diverge drastically from each other, whether it was based
> on production or commercial reasons.
>
> Now that we have that, it would not be feasible or worthwhile to  
> find the
> common denominator based on the old 0.20 version, unless we want to  
> spend
> another year looking for it and diverging the individual  
> installations even
> more in the process.
>
> So the question imo is not "how we merge the cloudera and yahoo
> distributions", but when/how do we make the new 0.22 release.
> And how do we provide a steady release cycle after that.

+1

sanjay


>
> --Konstantin
>
> On Thu, Oct 21, 2010 at 9:29 PM, Milind A Bhandarkar
> <mi...@yahoo-inc.com>wrote:
>
>>>>
>>> right.. the trunk is not for production use.  I wasn't suggesting  
>>> that.
>>
>> So, what are you suggesting ? That Yahoo distribution of Hadoop  
>> should
>> *not* be the version we run on our production clusters ?
>>
>>>
>>> but the trunk is what will eventually become the next release.
>>
>>>
>>> Then someone in yahoo will have to decide if they are going to  
>>> move to
>>> rebase their production cluster to 0.21, or just continue back- 
>>> porting
>> what
>>> they need to the version they are running on their clusters.
>>
>> Yes, that is what we do now. If there are committed patches in  
>> trunk that
>> do not scale for our needs, or break existing applications, or are  
>> deemed
>> not worth the efforts needed to backport, we do not include them in  
>> our
>> deployments, and therefore do not include in Yahoo distribution.
>>
>>>
>>> and if yahoo fixes a bug in their version, it would need to be
>>> forward-ported over to the current trunk. which will get harder and
>> harder
>>> as the paths diverge.
>>
>> Yes, indeed. So, care must be taken that paths do not diverge too  
>> much. I
>> have seen some cases where the bug fixes did not need to be forward  
>> ported,
>> because that piece of code was completely re-written in trunk.
>>
>>>
>>> I'm sure you've seen it happen on other projects when a major branch
>> lands
>>> on the trunk, and the amount of effort it takes to reconcile them.
>>
>> Yes. And that results in delayed releases. An unexpected benefit for
>> application developers was that they could spend time adding  
>> features to
>> their applications, rather than porting same applications from
>> release-to-release, and validating releases. So, it's not always bad.
>>
>> - Milind
>>
>>
>> --
>> Milind Bhandarkar
>> (mailto:milindb@yahoo-inc.com)
>> (phone: 408-203-5213 W)
>>
>>
>>


Re: bringing the codebases back in line

Posted by Milind A Bhandarkar <mi...@yahoo-inc.com>.
+1

On Oct 22, 2010, at 10:36 AM, Konstantin Shvachko wrote:

> So the question imo is not "how we merge the cloudera and yahoo
> distributions", but when/how do we make the new 0.22 release.
> And how do we provide a steady release cycle after that.


--
Milind Bhandarkar
(mailto:milindb@yahoo-inc.com)
(phone: 408-203-5213 W)



Re: bringing the codebases back in line

Posted by Konstantin Shvachko <sh...@gmail.com>.
Milind's point is valid: the PMC cannot demand or control what Yahoo,
Facebook, et al. run in production, or what Cloudera sells to their
customers  AS  LONG  AS  it is within the Apache licensing requirements.

What Apache Hadoop can and should provide is a *steady* stream of base
A-releases.

I think the single fact that we failed to release Hadoop 0.21 late last
year got us into the state we are in now. It let different Hadoop
installations diverge drastically from each other, whether it was based
on production or commercial reasons.

Now that we have that, it would not be feasible or worthwhile to find the
common denominator based on the old 0.20 version, unless we want to spend
another year looking for it and diverging the individual installations even
more in the process.

So the question imo is not "how we merge the cloudera and yahoo
distributions", but when/how do we make the new 0.22 release.
And how do we provide a steady release cycle after that.

--Konstantin

On Thu, Oct 21, 2010 at 9:29 PM, Milind A Bhandarkar
<mi...@yahoo-inc.com>wrote:

> >>
> > right.. the trunk is not for production use.  I wasn't suggesting that.
>
> So, what are you suggesting ? That Yahoo distribution of Hadoop should
> *not* be the version we run on our production clusters ?
>
> >
> > but the trunk is what will eventually become the next release.
>
> >
> > Then someone in yahoo will have to decide if they are going to move to
> > rebase their production cluster to 0.21, or just continue back-porting
> what
> > they need to the version they are running on their clusters.
>
> Yes, that is what we do now. If there are committed patches in trunk that
> do not scale for our needs, or break existing applications, or are deemed
> not worth the efforts needed to backport, we do not include them in our
> deployments, and therefore do not include in Yahoo distribution.
>
> >
> > and if yahoo fixes a bug in their version, it would need to be
> > forward-ported over to the current trunk. which will get harder and
> harder
> > as the paths diverge.
>
> Yes, indeed. So, care must be taken that paths do not diverge too much. I
> have seen some cases where the bug fixes did not need to be forward ported,
> because that piece of code was completely re-written in trunk.
>
> >
> > I'm sure you've seen it happen on other projects when a major branch
> lands
> > on the trunk, and the amount of effort it takes to reconcile them.
>
> Yes. And that results in delayed releases. An unexpected benefit for
> application developers was that they could spend time adding features to
> their applications, rather than porting same applications from
> release-to-release, and validating releases. So, it's not always bad.
>
> - Milind
>
>
> --
> Milind Bhandarkar
> (mailto:milindb@yahoo-inc.com)
> (phone: 408-203-5213 W)
>
>
>

Re: bringing the codebases back in line

Posted by Milind A Bhandarkar <mi...@yahoo-inc.com>.
>> 
> right.. the trunk is not for production use.  I wasn't suggesting that.

So, what are you suggesting ? That Yahoo distribution of Hadoop should *not* be the version we run on our production clusters ?

> 
> but the trunk is what will eventually become the next release.

> 
> Then someone in yahoo will have to decide if they are going to move to
> rebase their production cluster to 0.21, or just continue back-porting what
> they need to the version they are running on their clusters.

Yes, that is what we do now. If there are committed patches in trunk that do not scale for our needs, or break existing applications, or are deemed not worth the efforts needed to backport, we do not include them in our deployments, and therefore do not include in Yahoo distribution.

> 
> and if yahoo fixes a bug in their version, it would need to be
> forward-ported over to the current trunk. which will get harder and harder
> as the paths diverge.

Yes, indeed. So, care must be taken that paths do not diverge too much. I have seen some cases where the bug fixes did not need to be forward ported, because that piece of code was completely re-written in trunk.

> 
> I'm sure you've seen it happen on other projects when a major branch lands
> on the trunk, and the amount of effort it takes to reconcile them.

Yes. And that results in delayed releases. An unexpected benefit for application developers was that they could spend time adding features to their applications, rather than porting the same applications from release to release and validating releases. So, it's not always bad.

- Milind


--
Milind Bhandarkar
(mailto:milindb@yahoo-inc.com)
(phone: 408-203-5213 W)



Re: bringing the codebases back in line

Posted by Ian Holsman <ha...@holsman.net>.
On Thu, Oct 21, 2010 at 8:42 PM, Milind A Bhandarkar
<mi...@yahoo-inc.com>wrote:

>
> >
> > but the other question I have which hopefully you guys can answer is does
> > the yahoo distribution have ALL the patches from the trunk on it? because
> if
> > it doesn't I think that is problematic as well for other reasons.
>
>
> What are these "other" reasons ?
>

yahoo distribution runs on our production clusters, and I do not see why any
> production cluster should run code from trunk.
>
>
right.. the trunk is not for production use.  I wasn't suggesting that.

but the trunk is what will eventually become the next release.

Then someone in yahoo will have to decide if they are going to move to
rebase their production cluster to 0.21, or just continue back-porting what
they need to the version they are running on their clusters.

and if yahoo fixes a bug in their version, it would need to be
forward-ported over to the current trunk. which will get harder and harder
as the paths diverge.

I'm sure you've seen it happen on other projects when a major branch lands
on the trunk, and the amount of effort it takes to reconcile them.
  - Milind

--
Milind Bhandarkar
(mailto:milindb@yahoo-inc.com)
(phone: 408-203-5213 W)

Re: bringing the codebases back in line

Posted by Milind A Bhandarkar <mi...@yahoo-inc.com>.
> 
> but the other question I have which hopefully you guys can answer is does
> the yahoo distribution have ALL the patches from the trunk on it? because if
> it doesn't I think that is problematic as well for other reasons.


What are these "other" reasons ?

yahoo distribution runs on our production clusters, and I do not see why any production cluster should run code from trunk.

- Milind

--
Milind Bhandarkar
(mailto:milindb@yahoo-inc.com)
(phone: 408-203-5213 W)



Re: bringing the codebases back in line

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
Ian,

On Oct 21, 2010, at 4:50 PM, Ian Holsman wrote:

> but the other question I have which hopefully you guys can answer is  
> does
> the yahoo distribution have ALL the patches from the trunk on it?  
> because if
> it doesn't I think that is problematic as well for other reasons.

Yahoo put security on Apache Hadoop-0.20.

Apache Hadoop trunk is very far from hadoop-0.20, there are lots of  
features in trunk which aren't part of yahoo-hadoop-0.20 simply  
because there wasn't a need or it wasn't worth our effort to backport  
them etc. I know, since I have a big hand in deciding it.

However, we have been very religious about porting all our changes to  
trunk, we might have missed a couple due to time pressure, human  
mistake etc.

Thus, it isn't feasible for yahoo distribution to be a superset of  
trunk. Even more because it takes a *huge* amount of effort to qualify  
trunk... we at Yahoo qualified Apache Hadoop 0.20 and have stuck with  
it for over a year now, same as Cloudera, Facebook etc. Again, I'll  
point out that we have been very good at porting nearly 4000 internal  
commits to trunk throughout this time.

Hope that helps.

thanks,
Arun


Re: bringing the codebases back in line

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Oct 21, 2010 at 4:50 PM, Ian Holsman <ha...@holsman.net> wrote:
> right.. Cloudera is bundling its add-ons into a single tarball to make it
> easier to install.

CDH contains a number of different projects, however each project has
a distinct tarball (and packages). The tarball is essentially an
Apache release (tarball) plus a directory that has a set of patches
that we've applied to the Apache release (our build process downloads
the Apache release and applies our set of patches).  For each version
of CDH we rebase our patch set on the latest Apache dot release
available at the time to minimize our delta with upstream.  Here's an
example tarball:
http://archive.cloudera.com/cdh/3/hadoop-0.20.2+737.tar.gz
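
Roughly, that rebuild step amounts to something like the following sketch (the function name, file names, and directory layout here are illustrative, not the actual build scripts):

```shell
#!/bin/sh
# Sketch: rebuild a patched source tree from a pristine upstream release
# plus an ordered series of patch files. All paths are hypothetical.
set -e

# apply_patch_series <source-tree> <patch-dir>
# Applies every *.patch in <patch-dir>, in lexical (series) order,
# on top of the unpacked source tree.
apply_patch_series() {
    tree=$1; patches=$2
    for p in "$patches"/*.patch; do
        # -p1 strips the leading a/ or b/ path component of
        # git-style unified diffs; -d applies inside the tree
        patch -p1 -d "$tree" < "$p"
    done
}

# Typical use: unpack the upstream tarball, then layer the patches.
# tar xzf hadoop-0.20.2.tar.gz
# apply_patch_series hadoop-0.20.2 cloudera/patches
```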

> In my ideal world, I'd like to be able to just download/buy any of those
> tools and have them run on a released apache hadoop tarball. and then if
> someone else comes along with a competing tool I would be free to choose it
> and have it also run on my apache hadoop tarball, not have to go through the
> pain of saying XXX tool needs their customized version of hadoop so I can't
> use it. (ie remove the lock-in that comes from a forked base).

All of our Apache projects are an Apache release plus a set of
patches, these are typically backports of bug fixes in trunk but not a
dot release. Except for Hadoop, the set of additional patches is very
small. Here's an example, the 16 changes not in Pig 0.7 that we've
included: http://archive.cloudera.com/cdh/3/pig-0.7.0+16.CHANGES.txt

> so what I'd like to see is both cloudera and yahoo running a minimal set of
> patches as a 'superset' of the apache hadoop stuff, with the apache hadoop
> very close to both of these. the only patches being in either being to fix
> bugs or performance issues that would be available in the next release of
> a-hadoop.

That's our goal as well. For all the Apache projects in CDH, except
for Hadoop, that is the case today.  For CDH3 we ended up adding large
additional patch sets (the  security patch set the append patch set to
support HBase), but for Apache 22 the majority of the delta that CDH
and YDH have against Apache 20 will go away (thanks to Y! contributing
security and append to trunk).

> And when a new release of a-hadoop comes, the vendors would switch to
> using that a-hadoop version as their baseline.
>
> I don't want to get into the situation that linux is in with redhat in that
> their kernel is dramatically different to the one on kernel.org.
>
> does that make sense?

Definitely.

Thanks,
Eli
>
>
>
>
> On Thu, Oct 21, 2010 at 6:42 PM, Owen O'Malley <om...@apache.org> wrote:
>
>>
>> On Oct 21, 2010, at 3:19 PM, Doug Cutting wrote:
>>
>>  Cloudera's distribution is based on Y!'s 0.20 distribution, together with
>>> patches from the Apache 0.20-append branch,
>>>
>>
>> Cloudera's Distribution of Hadoop includes many tools from outside of
>> Hadoop and even outside of Apache.
>>
>> -- Owen
>>
>>
>

Re: bringing the codebases back in line

Posted by Ian Holsman <ha...@holsman.net>.
right.. Cloudera is bundling its add-ons into a single tarball to make it
easier to install.

but my main bone of contention here is not in the bundling, but that in
order for those tools to work, they need to make changes to the base hadoop
package.

In my ideal world, I'd like to be able to just download/buy any of those
tools and have them run on a released apache hadoop tarball. and then if
someone else comes along with a competing tool I would be free to choose it
and have it also run on my apache hadoop tarball, not have to go through the
pain of saying XXX tool needs their customized version of hadoop so I can't
use it. (ie remove the lock-in that comes from a forked base).

but the other question I have which hopefully you guys can answer is does
the yahoo distribution have ALL the patches from the trunk on it? because if
it doesn't I think that is problematic as well for other reasons.

so what I'd like to see is both cloudera and yahoo running a minimal set of
patches as a 'superset' of the apache hadoop stuff, with the apache hadoop
very close to both of these. the only patches being in either being to fix
bugs or performance issues that would be available in the next release of
a-hadoop.
And when a new release of a-hadoop comes, the vendors would switch to
using that a-hadoop version as their baseline.

I don't want to get into the situation that linux is in with redhat in that
their kernel is dramatically different to the one on kernel.org.

does that make sense?




On Thu, Oct 21, 2010 at 6:42 PM, Owen O'Malley <om...@apache.org> wrote:

>
> On Oct 21, 2010, at 3:19 PM, Doug Cutting wrote:
>
>  Cloudera's distribution is based on Y!'s 0.20 distribution, together with
>> patches from the Apache 0.20-append branch,
>>
>
> Cloudera's Distribution of Hadoop includes many tools from outside of
> Hadoop and even outside of Apache.
>
> -- Owen
>
>

Re: bringing the codebases back in line

Posted by Owen O'Malley <om...@apache.org>.
On Oct 21, 2010, at 3:19 PM, Doug Cutting wrote:

> Cloudera's distribution is based on Y!'s 0.20 distribution, together  
> with patches from the Apache 0.20-append branch,

Cloudera's Distribution of Hadoop includes many tools from outside of  
Hadoop and even outside of Apache.

-- Owen


Re: bringing the codebases back in line

Posted by Doug Cutting <cu...@apache.org>.
On 10/21/2010 12:13 PM, Ian Holsman wrote:
> I wanted to start a conversation about how we could merge the cloudera +
> yahoo distributions of hadoop into our codebase,
> and what would be required.

Cloudera's distribution is based on Y!'s 0.20 distribution, together 
with patches from the Apache 0.20-append branch, patches cherry-picked 
from trunk and patches that have been submitted but not yet committed.

Please note that both Cloudera and Y!'s distributions are 0.20-based 
distributions.  Patches may have been committed to trunk, but nearly 
everyone is using 0.20 in production and will be for some time.  There 
is currently no 0.20 branch at Apache that contains what most folks are 
using in production.  There have been proposals to create such branches, 
most recently by Arun.  Do you think this is worthwhile?

Doug

Re: bringing the codebases back in line

Posted by Owen O'Malley <om...@apache.org>.
On Oct 21, 2010, at 2:00 PM, Ian Holsman wrote:

> so what do you think is required to get them into a release?

I'd planned to start making a release next month.

-- Owen

Re: bringing the codebases back in line

Posted by Ian Holsman <ha...@holsman.net>.
so what do you think is required to get them into a release?

On Thu, Oct 21, 2010 at 4:00 PM, Owen O'Malley <om...@apache.org> wrote:

>
> On Oct 21, 2010, at 12:13 PM, Ian Holsman wrote:
>
>  I wanted to start a conversation about how we could merge the cloudera
>> +
>> yahoo distributions of hadoop into our codebase,
>> and what would be required.
>>
>
> All of the patches that are the "yahoo distribution of hadoop" have been in
> Apache's trunk for months.
>
> -- Owen
>

Re: bringing the codebases back in line

Posted by Owen O'Malley <om...@apache.org>.
On Oct 21, 2010, at 2:54 PM, Eli Collins wrote:

> It's worth double checking.  When we added the YDH patch set to CDH3
> we ran a script to see which patches were in YDH but not yet in trunk
> and it turned up around 100 or so patches.

If you could generate a list, that would be useful for tracking down  
the differences. Certainly, everyone was supposed to have gotten  
everything checked in. I see that I managed to forget h-6832. I just  
uploaded the patch and marked it patch available.

-- Owen

Re: bringing the codebases back in line

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
I was merely pointing out, given the number of interested parties on  
that jira, that having Hadoop RPMs for Linux is very desirable.

We could have a technical discussion on the ways to go about doing  
RPMs, debs etc., but it is clear that there is a need for something  
more than tgz releases, even if they are Linux specific. This is  
particularly true given Linux specific components such as  
LinuxTaskController, jsvc for security etc.

Arun


On Oct 21, 2010, at 5:51 PM, Eli Collins wrote:

> On Thu, Oct 21, 2010 at 5:31 PM, Arun C Murthy <ac...@yahoo-inc.com>  
> wrote:
>>
>> On Oct 21, 2010, at 5:17 PM, Eli Collins wrote:
>>
>>> On Thu, Oct 21, 2010 at 3:30 PM, Jakob Homan <jhoman@yahoo- 
>>> inc.com> wrote:
>>>>
>>>> If the patch was just checking 1:1 Jira to patch, it would  
>>>> certainly not
>>>> work.  We were uploading multiple patches to the same JIRA to avoid
>>>> opening
>>>> extraneous issues before generating patches for trunk. Venerable  
>>>> old
>>>> HDFS-1150, for instance, went through about different patches  
>>>> applied to
>>>> Y!'s branch before the final version.
>>>
>>> That's what I meant by saying "a fair number of those may have been
>>> included in trunk but under a different jira".  There seemed to be a
>>> number of patches that are not in trunk under any jira (eg MR-1088,
>>> MR-1100 where the jira is still open).  We need to go through the
>>> patches in YDH and CDH and get them reviewed and checked in.
>>
>> It would be great to get HADOOP-6255, the functionality for  
>> creating RPMs,
>> from CDH to Apache Hadoop.
>>
>
> There are some challenges in contributing packaging up stream:
> - Packaging is typically versioned independently from the project code
> (we share packaging code across project major versions). This is why
> most projects have a separate repository for the packaging, and the
> packaging is done by a distribution.
> - The packaging source shares code across the 10 or so components,
> which is useful since we continuously make the packaging more
> consistent, so hosting the code on any single project repository
> doesn't fit well.
> - The packaging is Linux specific, we've gotten push back when trying
> to contribute modifications upstream with Linuxisms since Apache
> supports non-Linux platforms (namely Solaris).
>
> All of our packaging is Apache licensed (and we publish source RPMs)
> so there's no issue from that perspective.  But this is a digression
> from the subject at hand so we should probably table for a separate
> discussion.
>
> Thanks,
> Eli


Re: bringing the codebases back in line

Posted by Steve Loughran <st...@apache.org>.
On 22/10/10 01:51, Eli Collins wrote:

>
> There are some challenges in contributing packaging up stream:
> - Packaging is typically versioned independently from the project code
> (we share packaging code across project major versions). This is why
> most projects have a separate repository for the packaging, and the
> packaging is done by a distribution.

which is why I proposed a downstream integration project

> - The packaging source shares code across the 10 or so components,
> which is useful since we continuously make the packaging more
> consistent, so hosting the code on any single project repository
> doesn't fit well.

same reason

> - The packaging is Linux specific, we've gotten push back when trying
> to contribute modifications upstream with Linuxisms since Apache
> supports non-Linux platforms (namely Solaris).

I think the apple and oracle Java issues may allow those issues to be 
re-evaluated.

Again, something in the ASF that's downstream -and which then does the 
testing- would be good.

One thing people don't realise is how much effort goes into good RPM 
packaging -Eli does- but others may not. It is hard work to get right, 
very hard to test things like RPM upgrades of an existing installation, etc.

-steve

Re: bringing the codebases back in line

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Oct 21, 2010 at 11:46 PM, Allen Wittenauer
<aw...@linkedin.com> wrote:
>
> On Oct 21, 2010, at 5:51 PM, Eli Collins wrote:
>>
>> - The packaging is Linux specific, we've gotten push back when trying
>> to contribute modifications upstream with Linuxisms since Apache
>> supports non-Linux platforms (namely Solaris).
>
>        Oh come now Eli.  Just say it: I push everyone really hard on making things multiple platform to the point of being as trollish as this Ian fellow that started this thread.
>
>        Most of the Linux-specific modifications that I've seen that have been submitted are specifically to avoid configuring the system properly*, built in such a way that they weren't pluggable for other operating systems, or could be done in a much more POSIX/OS-agnostic way.  The fact that there are parts of Hadoop that don't work properly on Mac OS X (despite the overwhelming number of Mac laptops in use by Hadoop core devs) always struck me as particularly funny when people get frustrated with me when I point these problems out.
>
>        It is also worth mentioning that, AFAIK, only Linux and AIX have OS-specific code in Hadoop.  Attempts to get fixes (not even features!) for specific issues for other operating systems have been fully rejected with the cry of "we the community don't want to support specific OS issues in the core", even tho some of them are a direct hindrance to the proper operation of Hadoop on those OSes.  We have a very big double standard at work here.

Hey Allen,

Sorry for the slow response, Google marked this as spam ;)

Are there outstanding issues building Hadoop on OSX/Solaris?  I
thought we got all those in trunk.

Thanks,
Eli

Re: bringing the codebases back in line

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Oct 21, 2010, at 5:51 PM, Eli Collins wrote:
> 
> - The packaging is Linux specific, we've gotten push back when trying
> to contribute modifications upstream with Linuxisms since Apache
> supports non-Linux platforms (namely Solaris).

	Oh come now Eli.  Just say it: I push everyone really hard on making things multiple platform to the point of being as trollish as this Ian fellow that started this thread.

	Most of the Linux-specific modifications that I've seen that have been submitted are specifically to avoid configuring the system properly*, built in such a way that they weren't pluggable for other operating systems, or could be done in a much more POSIX/OS-agnostic way.  The fact that there are parts of Hadoop that don't work properly on Mac OS X (despite the overwhelming number of Mac laptops in use by Hadoop core devs) always struck me as particularly funny when people get frustrated with me when I point these problems out.

	It is also worth mentioning that, AFAIK, only Linux and AIX have OS-specific code in Hadoop.  Attempts to get fixes (not even features!) for specific issues for other operating systems have been fully rejected with the cry of "we the community don't want to support specific OS issues in the core", even tho some of them are a direct hindrance to the proper operation of Hadoop on those OSes.  We have a very big double standard at work here.

*  e.g., searching through x Linux distribution directories looking for a working JRE in the primary hadoop script to protect users from putting the location in hadoop-env.sh.   To me the great irony here is that this is the type of code that'd be great for a (mostly) single use tool to do... like, say, a packaging system....

Re: bringing the codebases back in line

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Oct 21, 2010 at 5:31 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
>
> On Oct 21, 2010, at 5:17 PM, Eli Collins wrote:
>
>> On Thu, Oct 21, 2010 at 3:30 PM, Jakob Homan <jh...@yahoo-inc.com> wrote:
>>>
>>> If the patch was just checking 1:1 Jira to patch, it would certainly not
>>> work.  We were uploading multiple patches to the same JIRA to avoid
>>> opening
>>> extraneous issues before generating patches for trunk. Venerable old
>>> HDFS-1150, for instance, went through about different patches applied to
>>> Y!'s branch before the final version.
>>
>> That's what I meant by saying "a fair number of those may have been
>> included in trunk but under a different jira".  There seemed to be a
>> number of patches that are not in trunk under any jira (eg MR-1088,
>> MR-1100 where the jira is still open).  We need to go through the
>> patches in YDH and CDH and get them reviewed and checked in.
>
> It would be great to get HADOOP-6255, the functionality for creating RPMs,
> from CDH to Apache Hadoop.
>

There are some challenges in contributing packaging upstream:
- Packaging is typically versioned independently from the project code
(we share packaging code across project major versions). This is why
most projects have a separate repository for the packaging, and the
packaging is done by a distribution.
- The packaging source shares code across the 10 or so components,
which is useful since we continuously make the packaging more
consistent, so hosting the code in any single project's repository
doesn't fit well.
- The packaging is Linux specific, we've gotten push back when trying
to contribute modifications upstream with Linuxisms since Apache
supports non-Linux platforms (namely Solaris).

All of our packaging is Apache licensed (and we publish source RPMs)
so there's no issue from that perspective.  But this is a digression
from the subject at hand, so we should probably table it for a
separate discussion.

Thanks,
Eli

Re: bringing the codebases back in line

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
On Oct 21, 2010, at 5:17 PM, Eli Collins wrote:

> On Thu, Oct 21, 2010 at 3:30 PM, Jakob Homan <jh...@yahoo-inc.com>  
> wrote:
>> If the patch was just checking 1:1 Jira to patch, it would  
>> certainly not
>> work.  We were uploading multiple patches to the same JIRA to avoid  
>> opening
>> extraneous issues before generating patches for trunk. Venerable old
>> HDFS-1150, for instance, went through about different patches  
>> applied to
>> Y!'s branch before the final version.
>
> That's what I meant by saying "a fair number of those may have been
> included in trunk but under a different jira".  There seemed to be a
> number of patches that are not in trunk under any jira (eg MR-1088,
> MR-1100 where the jira is still open).  We need to go through the
> patches in YDH and CDH and get them reviewed and checked in.

It would be great to get HADOOP-6255, the functionality for creating  
RPMs, from CDH to Apache Hadoop.

Arun



Re: bringing the codebases back in line

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Oct 21, 2010 at 3:30 PM, Jakob Homan <jh...@yahoo-inc.com> wrote:
>> It's worth double checking.  When we added the YDH patch set to CDH3
>> we ran a script to see which patches were in YDH but not yet in trunk
>> and it turned up around 100 or so patches.
>
> If the patch was just checking 1:1 Jira to patch, it would certainly not
> work.  We were uploading multiple patches to the same JIRA to avoid opening
> extraneous issues before generating patches for trunk. Venerable old
> HDFS-1150, for instance, went through about different patches applied to
> Y!'s branch before the final version.

That's what I meant by saying "a fair number of those may have been
included in trunk but under a different jira".  There seemed to be a
number of patches that are not in trunk under any jira (eg MR-1088,
MR-1100 where the jira is still open).  We need to go through the
patches in YDH and CDH and get them reviewed and checked in.

Thanks,
Eli


> -jg
>
>
> Eli Collins wrote:
>>
>> On Thu, Oct 21, 2010 at 1:00 PM, Owen O'Malley <om...@apache.org> wrote:
>>>
>>> On Oct 21, 2010, at 12:13 PM, Ian Holsman wrote:
>>>
>>>> I wanted to start a conversation about how we could merge the the
>>>> cloudera
>>>> +
>>>> yahoo distribtutions of hadoop into our codebase,
>>>> and what would be required.
>>>
>>> All of the patches that are the "yahoo distribution of hadoop" have been
>>> in
>>> Apache's trunk for months.
>>
>> A fair number of those may
>>
>> have been included in trunk but under a different jira, however some
>> (eg MR-1088, MR-1100) are definitely not in trunk.  Also, if I
>> remember correctly some of the 20-based patches are substantially
>> different than the versions for trunk.
>>
>> Thanks,
>> Eli
>
>

Re: bringing the codebases back in line

Posted by Jakob Homan <jh...@yahoo-inc.com>.
 > It's worth double checking.  When we added the YDH patch set to CDH3
 > we ran a script to see which patches were in YDH but not yet in trunk
 > and it turned up around 100 or so patches.

If the patch was just checking 1:1 Jira to patch, it would certainly not 
work.  We were uploading multiple patches to the same JIRA to avoid 
opening extraneous issues before generating patches for trunk. 
Venerable old HDFS-1150, for instance, went through about different 
patches applied to Y!'s branch before the final version.
-jg


Eli Collins wrote:
> On Thu, Oct 21, 2010 at 1:00 PM, Owen O'Malley <om...@apache.org> wrote:
>> On Oct 21, 2010, at 12:13 PM, Ian Holsman wrote:
>>
>>> I wanted to start a conversation about how we could merge the the cloudera
>>> +
>>> yahoo distribtutions of hadoop into our codebase,
>>> and what would be required.
>> All of the patches that are the "yahoo distribution of hadoop" have been in
>> Apache's trunk for months.
> 
> A fair number of those may
> have been included in trunk but under a different jira, however some
> (eg MR-1088, MR-1100) are definitely not in trunk.  Also, if I
> remember correctly some of the 20-based patches are substantially
> different than the versions for trunk.
> 
> Thanks,
> Eli


Re: bringing the codebases back in line

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Oct 21, 2010 at 1:00 PM, Owen O'Malley <om...@apache.org> wrote:
>
> On Oct 21, 2010, at 12:13 PM, Ian Holsman wrote:
>
>> I wanted to start a conversation about how we could merge the the cloudera
>> +
>> yahoo distribtutions of hadoop into our codebase,
>> and what would be required.
>
> All of the patches that are the "yahoo distribution of hadoop" have been in
> Apache's trunk for months.

It's worth double checking.  When we added the YDH patch set to CDH3
we ran a script to see which patches were in YDH but not yet in trunk
and it turned up around 100 or so patches.  A fair number of those may
have been included in trunk but under a different jira, however some
(eg MR-1088, MR-1100) are definitely not in trunk.  Also, if I
remember correctly some of the 20-based patches are substantially
different than the versions for trunk.
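
For reference, the kind of comparison described above can be sketched
roughly as follows. This is just an illustration, not the actual script
we ran; the patch-description strings and JIRA ID format are assumptions,
and (as noted elsewhere in this thread) a 1:1 JIRA-to-patch match misses
work that landed in trunk under a different jira, so hits still need
manual review.

```python
import re

# JIRA keys used across the Hadoop subprojects (MR- appears in some
# vendor patch names as shorthand for MAPREDUCE-).
JIRA_RE = re.compile(r'\b(?:HADOOP|HDFS|MAPREDUCE|MR)-\d+\b')

def jira_ids(lines):
    """Extract the set of JIRA IDs mentioned in an iterable of strings,
    e.g. patch file names or one-line commit summaries."""
    ids = set()
    for line in lines:
        ids.update(JIRA_RE.findall(line))
    return ids

def missing_from_trunk(branch_lines, trunk_lines):
    """IDs referenced on the vendor branch but never mentioned in trunk.

    Caveat: a patch may have been committed to trunk under a different
    JIRA, so every hit here needs a human to double-check."""
    return sorted(jira_ids(branch_lines) - jira_ids(trunk_lines))
```

Running something along these lines over the YDH patch queue versus the
trunk commit log is what produced the ~100-patch list mentioned above.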

Thanks,
Eli

Re: bringing the codebases back in line

Posted by Owen O'Malley <om...@apache.org>.
On Oct 21, 2010, at 12:13 PM, Ian Holsman wrote:

> I wanted to start a conversation about how we could merge the the  
> cloudera +
> yahoo distribtutions of hadoop into our codebase,
> and what would be required.

All of the patches that are the "yahoo distribution of hadoop" have  
been in Apache's trunk for months.

-- Owen