Posted to general@hadoop.apache.org by Nigel Daley <nd...@mac.com> on 2011/02/11 04:40:40 UTC

[VOTE] Abandon hdfsproxy HDFS contrib

I think the PMC should abandon the hdfsproxy HDFS contrib component.  Its last meaningful contribution was in August 2010:

HDFS-1340. A null delegation token is appended to the url if security is disabled when browsing filesystem.
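
(For context, that fix amounted to a null check before appending the delegation token to the browse URL. Below is a minimal sketch of the idea in Java; the names are hypothetical illustrations, not the actual HDFS code:)

    // Minimal sketch only -- hypothetical names, not the actual HDFS patch.
    // With security disabled, the token lookup yields null; the bug was that
    // the null still got appended to the browse URL as "&delegation=null".
    public final class BrowseUrlSketch {
      static String buildBrowseUrl(String baseUrl, String tokenString) {
        StringBuilder url = new StringBuilder(baseUrl);
        if (tokenString != null) {  // the guard the fix introduces
          url.append("&delegation=").append(tokenString);
        }
        return url.toString();
      }

      public static void main(String[] args) {
        // With a null token, the URL comes back without a delegation parameter.
        System.out.println(buildBrowseUrl(
            "http://namenode:50070/browseDirectory.jsp?dir=/", null));
      }
    }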

There are 7 unresolved contrib/hdfsproxy issues in Jira, none of them Patch Available.

Here is my +1.

Nige

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Stack <st...@duboce.net>.
+1

On Thu, Feb 10, 2011 at 7:40 PM, Nigel Daley <nd...@mac.com> wrote:
> I think the PMC should abandon the hdfsproxy HDFS contrib component.  Its last meaningful contribution was in August 2010:
>
> HDFS-1340. A null delegation token is appended to the url if security is disabled when browsing filesystem.
>
> There are 7 unresolved contrib/hdfsproxy issues in Jira, none of them Patch Available.
>
> Here is my +1.
>
> Nige
>

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Bernd Fondermann <bf...@brainlounge.de>.
On 18.02.11 17:11, Allen Wittenauer wrote:
>
> On Feb 18, 2011, at 2:11 AM, Bernd Fondermann wrote:
>> I don't know how many Y-employees are working on H internally. Only
>> the contributors can sort that out.
>
> Did Carol Bartz run over your puppy or something?

Something, definitely. I assume EOR.

> You don't appear to realize that pretty much all the
> major companies that house the major committers
> are doing essentially the same thing:
> work on a major feature patch with their internal
> teams then offer up the patch mostly complete, having
> dealt with all of the minor nits on the way due to
> having field tested it for real.

I realize that. I even criticized it: the fact that a closed group
develops code until they donate it to the project in smaller or larger
chunks. Whoever these closed groups are, I make no distinction between them.

> The patches still get discussed before getting committed.
> In some cases, this has resulted in either the patch
> getting rejected/not committed or reworked.

Sounds good.

> I sometimes wonder if the people who troll the team
> realize that Hadoop is essentially a distributed, networked
> operating system.

I realize that.

> That requires a higher level of quality control than a simple framework.

Which, again, is only accessible to a closed group, and which excludes
the rest of the committership.

   Bernd

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Feb 18, 2011, at 2:11 AM, Bernd Fondermann wrote:
> I don't know how many Y-employees are working on H internally. Only
> the contributors can sort that out.

	Did Carol Bartz run over your puppy or something?  You don't appear to realize that pretty much all the major companies that house the major committers are doing essentially the same thing:  work on a major feature patch with their internal teams then offer up the patch mostly complete, having dealt with all of the minor nits on the way due to having field tested it for real.   The patches still get discussed before getting committed.  In some cases, this has resulted in either the patch getting rejected/not committed or reworked.

	I sometimes wonder if the people who troll the team realize that Hadoop is essentially a distributed, networked operating system.  That requires a higher level of quality control than a simple framework. 


Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Bernd Fondermann <be...@googlemail.com>.
On Thu, Feb 17, 2011 at 19:43, Roy T. Fielding <fi...@gbiv.com> wrote:
> On Feb 17, 2011, at 4:43 AM, Bernd Fondermann wrote:
>> We have the very unfortunate situation here at Hadoop where Apache
>> Hadoop is not the primary and foremost place of Hadoop development.
>> Instead, code is developed internally at Yahoo and then contributed in
>> (smaller or larger) chunks to Hadoop.
>> This is open source development upside down.
>
> It is not open development.  The development community can do better,
> but it has to make up for past mistakes first.
>
>> It is not ok for people to diff ASF svn against their internal code
>> and provide the diff as a patch without reviewing IP first for every
>> line of code changed.
>
> That is simply untrue.  If the code came from one company's employees
> and they all signed an employment agreement with their employer and
> the employer approves of the contribution and the committer knows that
> when they commit (and logs the authors of the patch when committed),
> then all necessary IP clearance has been done.

I don't know how many Y-employees are working on H internally. Only
the contributors can sort that out.

> Committers are responsible
> for ensuring that they have permission to commit under their own CLA.

I just wanted to point that out, not stop someone from contributing.
Of course, your words are much more precise than mine.

>> For larger chunks I'd suggest to even go via the Incubator IP clearance process.
>
> Nonsense (with director hat on).
>
>> Only then will we force committers to primarily work here in the open
>> and return to what I'd consider a healthy project.
>
> No, you'll force people to work in the open by actually collaborating
> with them as they work and veto a patch for any technical faults it
> may contain.  Pestering them about your personal view of the Apache Way
> of development is not a contribution.

Getting personal is not a valuable contribution either.

  Bernd

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by "Roy T. Fielding" <fi...@gbiv.com>.
On Feb 17, 2011, at 4:43 AM, Bernd Fondermann wrote:
> We have the very unfortunate situation here at Hadoop where Apache
> Hadoop is not the primary and foremost place of Hadoop development.
> Instead, code is developed internally at Yahoo and then contributed in
> (smaller or larger) chunks to Hadoop.
> This is open source development upside down.

It is not open development.  The development community can do better,
but it has to make up for past mistakes first.

> It is not ok for people to diff ASF svn against their internal code
> and provide the diff as a patch without reviewing IP first for every
> line of code changed.

That is simply untrue.  If the code came from one company's employees
and they all signed an employment agreement with their employer and
the employer approves of the contribution and the committer knows that
when they commit (and logs the authors of the patch when committed),
then all necessary IP clearance has been done.  Committers are responsible
for ensuring that they have permission to commit under their own CLA.

> For larger chunks I'd suggest to even go via the Incubator IP clearance process.

Nonsense (with director hat on).

> Only then will we force committers to primarily work here in the open
> and return to what I'd consider a healthy project.

No, you'll force people to work in the open by actually collaborating
with them as they work and veto a patch for any technical faults it
may contain.  Pestering them about your personal view of the Apache Way
of development is not a contribution.

....Roy


Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Bernd Fondermann <be...@googlemail.com>.
Hi Eric,

On Fri, Feb 18, 2011 at 13:46, Eric Baldeschwieler <er...@yahoo-inc.com> wrote:
> Hi Bernd,
>
> Apache Hadoop is about scale. Most clusters will always be small, but Hadoop is going mainstream precisely because it scales to huge data and cluster sizes.
>
> There are lots of systems that work well on 10 node clusters. People select   Hadoop because they are confident that as their business / problem grows, Hadoop can grow with it.

Please note that I did not say that Hadoop should not scale.
I know that winning sorting contests is a great achievement and a huge
selling point.

I'm thinking along the lines of: how much scalability would the
majority of users be willing to trade for
a. more active committers (guess: 0%)
b. more regular releases
c. more non-scalability features (hot standby NN, security, you name it)

I myself, as a low-scale user, *would* trade a few percent for b. and c.

Thanks,

  Bernd

> ---
> E14 - via iPhone
>
> On Feb 17, 2011, at 7:25 AM, "Bernd Fondermann" <be...@googlemail.com> wrote:
>
>> On Thu, Feb 17, 2011 at 14:58, Ian Holsman <ha...@holsman.net> wrote:
>>> Hi Bernd.
>>>
>>> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
>>>>
>>>> We have the very unfortunate situation here at Hadoop where Apache
>>>> Hadoop is not the primary and foremost place of Hadoop development.
>>>> Instead, code is developed internally at Yahoo and then contributed in
>>>> (smaller or larger) chunks to Hadoop.
>>>
>>> This has been the situation in the past,
>>> but as you can see in the last month, this has changed.
>>>
>>> Yahoo! has publicly committed to move their development into the main code base, and you can see they have started doing this with the 20.100 branch,
>>> and their recent commits to trunk.
>>> Combine this with Nige taking on the 0.22 release branch (and shepherding it into a stable release), and I think we are addressing your concerns.
>>>
>>> They have also started bringing the discussions back on the list, see the recent discussion about Jobtracker-nextgen Arun has re-started in MAPREDUCE-279.
>>>
>>> I'm not saying it's perfect, but I think the major players understand there is an issue, and they are *ALL* moving in the right direction.
>>
>> I would enthusiastically like to see your optimism borne out.
>> Maybe I'm misreading the statements issued publicly, but I don't think
>> that this is fully understood. I agree, though, that it's a move in
>> the right direction.
>>
>>>> This is open source development upside down.
>>>> It is not ok for people to diff ASF svn against their internal code
>>>> and provide the diff as a patch without reviewing IP first for every
>>>> line of code changed.
>>>> For larger chunks I'd suggest to even go via the Incubator IP clearance process.
>>>> Only then will we force committers to primarily work here in the open
>>>> and return to what I'd consider a healthy project.
>>>>
>>>> To be honest: Hadoop is in the process of falling apart.
>>>> Contrib Code gets moved out of Apache instead of being maintained here.
>>>> Discussions are seldom consensus-driven.
>>>> Release branches stagnate.
>>>
>>> True. releases do take a long time. This is mainly due to it being extremely hard to test and verify that a release is stable.
>>> It's not enough to just run the thing on 4 machines, you need at least 50 to test some of the major problems. This requires some serious $ for someone to verify.
>>
>> It has been proposed on the list before, IIRC. Don't know how to get
>> there, but the project seriously needs access to a cluster of this
>> size.
>>
>>>> Downstream projects like HBase don't get proper support.
>>>> Production setups are made from 3rd party distributions.
>>>> Development is not happening here, but elsewhere behind corporate doors.
>>>> Discussion about future developments are started on corporate blogs (
>>>> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
>>>> ) instead of on the proper mailing list.
>>>> Hurdles for committing are way too high.
>>>> On the bright side, new committers and PMC members are added, this is
>>>> an improvement.
>>>>
>>>> I'd suggest to move away from relying on large code dumps from
>>>> corporations, and move back to the ASF-proven "individual committer
>>>> commits on trunk"-model where more committers can get involved.
>>>> If that means not to support high end cluster sizes for some months,
>>>> well, so be it.
>>>
>>>> Average committers cannot run - e.g. test - on high
>>>> end cluster sizes. If that would mean they cannot participate, then
>>>> the open source project better concentrate on small and medium sized
>>>> cluster instead.
>>>
>>>
>>> Well.. that's one approach.. but there are several companies out there who rely on apache's hadoop to power their large clusters, so I'd hate to see hadoop become something that only runs well on
>>> 10-nodes.. as I don't think that will help anyone either.
>>
>> But only looking at high-end scale doesn't help either.
>>
>> Let's face the fact that Hadoop is now moving from the early adopters
>> phase into a much broader market. I predict that small to medium sized
>> clusters will be the majority of Hadoop deployments in a few months'
>> time. 4000, or even 500, machines is the high-end range. If the open
>> source project Hadoop cannot support those users adequately (without
>> becoming defunct), the committership might be better off focusing on
>> the low-end and medium sized users.
>>
>> I'm not suggesting we turn away from the handful (?) of high-end
>> users. They certainly have the most valuable input. But also, *they*
>> obviously have the resources in terms of larger clusters and
>> developers to deal with their specific setups. Obviously, they don't
>> need to rely on the open source project to make releases. In fact,
>> they *do* work on their own Hadoop derivatives.
>> All the other users, the hundreds of boring small cluster users, don't
>> have that choice. They *depend* on the open source releases.
>>
>> Hadoop is an Apache project, to provide HDFS and MR free of charge to
>> the general public. Not only to me - nor to only one or two big
>> companies either.
>> Focus on all the users.
>>
>>  Bernd
>

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Eli Collins <el...@cloudera.com>.
Sounds like HDFS Proxy isn't being maintained. The HDFS Proxy tests
were disabled as part of HDFS-1666 because no one would maintain them.
If it's not going to be maintained we should remove it; currently
HDFS-1666 is a blocker for 0.22 because we don't want to release
bitrotted code to users as part of 0.22.

Let's remove the current hdfsproxy from contrib and then either add the
features to HDFS directly (which Sanjay suggested above might be an
option) or add a new non-contrib one (Alejandro built a nice working
one that he could contribute; it just needs docs and some more
tests).

Owen - is that an acceptable solution to you?  HDFS-1666 is the last
blocker for 0.22 and we'd like to make progress on the release.

Thanks,
Eli

On Sat, Apr 9, 2011 at 10:01 PM, Nigel Daley <nd...@mac.com> wrote:
> AFAICT, Owen was the one to -1 removal of HDFS Proxy.  Owen, are you guys maintaining this?
>
> Cheers,
> Nige
>
> On Apr 4, 2011, at 12:19 PM, Todd Lipcon wrote:
>
>> Could those of you who -1ed the removal of HDFS Proxy please look into the
>> test that has been failing our Hudson build for the last several months:
>> https://issues.apache.org/jira/browse/HDFS-1666
>>
>> It is one thing to say that
>> we "should" maintain a piece of code, but it's another to actually maintain
>> it. In my mind, part of maintaining a project involves addressing consistent
>> test failures as high priority items.
>>
>> -Todd
>>
>> On Tue, Feb 22, 2011 at 9:27 PM, Nigel Daley <nd...@mac.com> wrote:
>>
>>> For closure, this vote fails due to a couple binding -1 votes.
>>>
>>> Nige
>>>
>>> On Feb 18, 2011, at 4:46 AM, Eric Baldeschwieler wrote:
>>>
>>>> Hi Bernd,
>>>>
>>>> Apache Hadoop is about scale. Most clusters will always be small, but
>>> Hadoop is going mainstream precisely because it scales to huge data and
>>> cluster sizes.
>>>>
>>>> There are lots of systems that work well on 10 node clusters. People
>>> select   Hadoop because they are confident that as their business / problem
>>> grows, Hadoop can grow with it.
>>>>
>>>> ---
>>>> E14 - via iPhone
>>>>
>>>> On Feb 17, 2011, at 7:25 AM, "Bernd Fondermann" <
>>> bernd.fondermann@googlemail.com> wrote:
>>>>
>>>>> On Thu, Feb 17, 2011 at 14:58, Ian Holsman <ha...@holsman.net> wrote:
>>>>>> Hi Bernd.
>>>>>>
>>>>>> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
>>>>>>>
>>>>>>> We have the very unfortunate situation here at Hadoop where Apache
>>>>>>> Hadoop is not the primary and foremost place of Hadoop development.
>>>>>>> Instead, code is developed internally at Yahoo and then contributed in
>>>>>>> (smaller or larger) chunks to Hadoop.
>>>>>>
>>>>>> This has been the situation in the past,
>>>>>> but as you can see in the last month, this has changed.
>>>>>>
>>>>>> Yahoo! has publicly committed to move their development into the main
>>> code base, and you can see they have started doing this with the 20.100
>>> branch,
>>>>>> and their recent commits to trunk.
>>>>>> Combine this with Nige taking on the 0.22 release branch (and
>>> shepherding it into a stable release), and I think we are addressing your
>>> concerns.
>>>>>>
>>>>>> They have also started bringing the discussions back on the list, see
>>> the recent discussion about Jobtracker-nextgen Arun has re-started in
>>> MAPREDUCE-279.
>>>>>>
>>>>>> I'm not saying it's perfect, but I think the major players understand
>>> there is an issue, and they are *ALL* moving in the right direction.
>>>>>
>>>>> I would enthusiastically like to see your optimism borne out.
>>>>> Maybe I'm misreading the statements issued publicly, but I don't think
>>>>> that this is fully understood. I agree, though, that it's a move in
>>>>> the right direction.
>>>>>
>>>>>>> This is open source development upside down.
>>>>>>> It is not ok for people to diff ASF svn against their internal code
>>>>>>> and provide the diff as a patch without reviewing IP first for every
>>>>>>> line of code changed.
>>>>>>> For larger chunks I'd suggest to even go via the Incubator IP
>>> clearance process.
>>>>>>> Only then will we force committers to primarily work here in the open
>>>>>>> and return to what I'd consider a healthy project.
>>>>>>>
>>>>>>> To be honest: Hadoop is in the process of falling apart.
>>>>>>> Contrib Code gets moved out of Apache instead of being maintained
>>> here.
>>>>>>> Discussions are seldom consensus-driven.
>>>>>>> Release branches stagnate.
>>>>>>
>>>>>> True. releases do take a long time. This is mainly due to it being
>>> extremely hard to test and verify that a release is stable.
>>>>>> It's not enough to just run the thing on 4 machines, you need at least
>>> 50 to test some of the major problems. This requires some serious $ for
>>> someone to verify.
>>>>>
>>>>> It has been proposed on the list before, IIRC. Don't know how to get
>>>>> there, but the project seriously needs access to a cluster of this
>>>>> size.
>>>>>
>>>>>>> Downstream projects like HBase don't get proper support.
>>>>>>> Production setups are made from 3rd party distributions.
>>>>>>> Development is not happening here, but elsewhere behind corporate
>>> doors.
>>>>>>> Discussion about future developments are started on corporate blogs (
>>>>>>>
>>> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
>>>>>>> ) instead of on the proper mailing list.
>>>>>>> Hurdles for committing are way too high.
>>>>>>> On the bright side, new committers and PMC members are added, this is
>>>>>>> an improvement.
>>>>>>>
>>>>>>> I'd suggest to move away from relying on large code dumps from
>>>>>>> corporations, and move back to the ASF-proven "individual committer
>>>>>>> commits on trunk"-model where more committers can get involved.
>>>>>>> If that means not to support high end cluster sizes for some months,
>>>>>>> well, so be it.
>>>>>>
>>>>>>> Average committers cannot run - e.g. test - on high
>>>>>>> end cluster sizes. If that would mean they cannot participate, then
>>>>>>> the open source project better concentrate on small and medium sized
>>>>>>> cluster instead.
>>>>>>
>>>>>>
>>>>>> Well.. that's one approach.. but there are several companies out there
>>> who rely on apache's hadoop to power their large clusters, so I'd hate to
>>> see hadoop become something that only runs well on
>>>>>> 10-nodes.. as I don't think that will help anyone either.
>>>>>
>>>>> But only looking at high-end scale doesn't help either.
>>>>>
>>>>> Let's face the fact that Hadoop is now moving from the early adopters
>>>>> phase into a much broader market. I predict that small to medium sized
>>>>> clusters will be the majority of Hadoop deployments in a few months'
>>>>> time. 4000, or even 500, machines is the high-end range. If the open
>>>>> source project Hadoop cannot support those users adequately (without
>>>>> becoming defunct), the committership might be better off focusing on
>>>>> the low-end and medium sized users.
>>>>>
>>>>> I'm not suggesting we turn away from the handful (?) of high-end
>>>>> users. They certainly have the most valuable input. But also, *they*
>>>>> obviously have the resources in terms of larger clusters and
>>>>> developers to deal with their specific setups. Obviously, they don't
>>>>> need to rely on the open source project to make releases. In fact,
>>>>> they *do* work on their own Hadoop derivatives.
>>>>> All the other users, the hundreds of boring small cluster users, don't
>>>>> have that choice. They *depend* on the open source releases.
>>>>>
>>>>> Hadoop is an Apache project, to provide HDFS and MR free of charge to
>>>>> the general public. Not only to me - nor to only one or two big
>>>>> companies either.
>>>>> Focus on all the users.
>>>>>
>>>>> Bernd
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>
>

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Nigel Daley <nd...@mac.com>.
AFAICT, Owen was the one to -1 the removal of HDFS Proxy.  Owen, are you guys maintaining this?

Cheers,
Nige

On Apr 4, 2011, at 12:19 PM, Todd Lipcon wrote:

> Could those of you who -1ed the removal of HDFS Proxy please look into the
> test that has been failing our Hudson build for the last several months:
> https://issues.apache.org/jira/browse/HDFS-1666
> 
> It is one thing to say that
> we "should" maintain a piece of code, but it's another to actually maintain
> it. In my mind, part of maintaining a project involves addressing consistent
> test failures as high priority items.
> 
> -Todd
> 
> On Tue, Feb 22, 2011 at 9:27 PM, Nigel Daley <nd...@mac.com> wrote:
> 
>> For closure, this vote fails due to a couple binding -1 votes.
>> 
>> Nige
>> 
>> On Feb 18, 2011, at 4:46 AM, Eric Baldeschwieler wrote:
>> 
>>> Hi Bernd,
>>> 
>>> Apache Hadoop is about scale. Most clusters will always be small, but
>> Hadoop is going mainstream precisely because it scales to huge data and
>> cluster sizes.
>>> 
>>> There are lots of systems that work well on 10 node clusters. People
>> select   Hadoop because they are confident that as their business / problem
>> grows, Hadoop can grow with it.
>>> 
>>> ---
>>> E14 - via iPhone
>>> 
>>> On Feb 17, 2011, at 7:25 AM, "Bernd Fondermann" <
>> bernd.fondermann@googlemail.com> wrote:
>>> 
>>>> On Thu, Feb 17, 2011 at 14:58, Ian Holsman <ha...@holsman.net> wrote:
>>>>> Hi Bernd.
>>>>> 
>>>>> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
>>>>>> 
>>>>>> We have the very unfortunate situation here at Hadoop where Apache
>>>>>> Hadoop is not the primary and foremost place of Hadoop development.
>>>>>> Instead, code is developed internally at Yahoo and then contributed in
>>>>>> (smaller or larger) chunks to Hadoop.
>>>>> 
>>>>> This has been the situation in the past,
>>>>> but as you can see in the last month, this has changed.
>>>>> 
>>>>> Yahoo! has publicly committed to move their development into the main
>> code base, and you can see they have started doing this with the 20.100
>> branch,
>>>>> and their recent commits to trunk.
>>>>> Combine this with Nige taking on the 0.22 release branch (and
>> shepherding it into a stable release), and I think we are addressing your
>> concerns.
>>>>> 
>>>>> They have also started bringing the discussions back on the list, see
>> the recent discussion about Jobtracker-nextgen Arun has re-started in
>> MAPREDUCE-279.
>>>>> 
>>>>> I'm not saying it's perfect, but I think the major players understand
>> there is an issue, and they are *ALL* moving in the right direction.
>>>> 
>>>> I would enthusiastically like to see your optimism borne out.
>>>> Maybe I'm misreading the statements issued publicly, but I don't think
>>>> that this is fully understood. I agree, though, that it's a move in
>>>> the right direction.
>>>> 
>>>>>> This is open source development upside down.
>>>>>> It is not ok for people to diff ASF svn against their internal code
>>>>>> and provide the diff as a patch without reviewing IP first for every
>>>>>> line of code changed.
>>>>>> For larger chunks I'd suggest to even go via the Incubator IP
>> clearance process.
>>>>>> Only then will we force committers to primarily work here in the open
>>>>>> and return to what I'd consider a healthy project.
>>>>>> 
>>>>>> To be honest: Hadoop is in the process of falling apart.
>>>>>> Contrib Code gets moved out of Apache instead of being maintained
>> here.
>>>>>> Discussions are seldom consensus-driven.
>>>>>> Release branches stagnate.
>>>>> 
>>>>> True. releases do take a long time. This is mainly due to it being
>> extremely hard to test and verify that a release is stable.
>>>>> It's not enough to just run the thing on 4 machines, you need at least
>> 50 to test some of the major problems. This requires some serious $ for
>> someone to verify.
>>>> 
>>>> It has been proposed on the list before, IIRC. Don't know how to get
>>>> there, but the project seriously needs access to a cluster of this
>>>> size.
>>>> 
>>>>>> Downstream projects like HBase don't get proper support.
>>>>>> Production setups are made from 3rd party distributions.
>>>>>> Development is not happening here, but elsewhere behind corporate
>> doors.
>>>>>> Discussion about future developments are started on corporate blogs (
>>>>>> 
>> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
>>>>>> ) instead of on the proper mailing list.
>>>>>> Hurdles for committing are way too high.
>>>>>> On the bright side, new committers and PMC members are added, this is
>>>>>> an improvement.
>>>>>> 
>>>>>> I'd suggest to move away from relying on large code dumps from
>>>>>> corporations, and move back to the ASF-proven "individual committer
>>>>>> commits on trunk"-model where more committers can get involved.
>>>>>> If that means not to support high end cluster sizes for some months,
>>>>>> well, so be it.
>>>>> 
>>>>>> Average committers cannot run - e.g. test - on high
>>>>>> end cluster sizes. If that would mean they cannot participate, then
>>>>>> the open source project better concentrate on small and medium sized
>>>>>> cluster instead.
>>>>> 
>>>>> 
>>>>> Well.. that's one approach.. but there are several companies out there
>> who rely on apache's hadoop to power their large clusters, so I'd hate to
>> see hadoop become something that only runs well on
>>>>> 10-nodes.. as I don't think that will help anyone either.
>>>> 
>>>> But only looking at high-end scale doesn't help either.
>>>> 
>>>> Let's face the fact that Hadoop is now moving from the early adopters
>>>> phase into a much broader market. I predict that small to medium sized
>>>> clusters will be the majority of Hadoop deployments in a few months'
>>>> time. 4000, or even 500, machines is the high-end range. If the open
>>>> source project Hadoop cannot support those users adequately (without
>>>> becoming defunct), the committership might be better off focusing on
>>>> the low-end and medium sized users.
>>>>
>>>> I'm not suggesting we turn away from the handful (?) of high-end
>>>> users. They certainly have the most valuable input. But also, *they*
>>>> obviously have the resources in terms of larger clusters and
>>>> developers to deal with their specific setups. Obviously, they don't
>>>> need to rely on the open source project to make releases. In fact,
>>>> they *do* work on their own Hadoop derivatives.
>>>> All the other users, the hundreds of boring small cluster users, don't
>>>> have that choice. They *depend* on the open source releases.
>>>> 
>>>> Hadoop is an Apache project, to provide HDFS and MR free of charge to
>>>> the general public. Not only to me - nor to only one or two big
>>>> companies either.
>>>> Focus on all the users.
>>>> 
>>>> Bernd
>> 
>> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera


Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Todd Lipcon <to...@cloudera.com>.
Could those of you who -1ed the removal of HDFS Proxy please look into the
test that has been failing our Hudson build for the last several months:
https://issues.apache.org/jira/browse/HDFS-1666

It is one thing to say that
we "should" maintain a piece of code, but it's another to actually maintain
it. In my mind, part of maintaining a project involves addressing consistent
test failures as high priority items.

-Todd

On Tue, Feb 22, 2011 at 9:27 PM, Nigel Daley <nd...@mac.com> wrote:

> For closure, this vote fails due to a couple binding -1 votes.
>
> Nige
>
> On Feb 18, 2011, at 4:46 AM, Eric Baldeschwieler wrote:
>
> > Hi Bernd,
> >
> > Apache Hadoop is about scale. Most clusters will always be small, but
> Hadoop is going mainstream precisely because it scales to huge data and
> cluster sizes.
> >
> > There are lots of systems that work well on 10 node clusters. People
> select   Hadoop because they are confident that as their business / problem
> grows, Hadoop can grow with it.
> >
> > ---
> > E14 - via iPhone
> >
> > On Feb 17, 2011, at 7:25 AM, "Bernd Fondermann" <
> bernd.fondermann@googlemail.com> wrote:
> >
> >> On Thu, Feb 17, 2011 at 14:58, Ian Holsman <ha...@holsman.net> wrote:
> >>> Hi Bernd.
> >>>
> >>> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
> >>>>
> >>>> We have the very unfortunate situation here at Hadoop where Apache
> >>>> Hadoop is not the primary and foremost place of Hadoop development.
> >>>> Instead, code is developed internally at Yahoo and then contributed in
> >>>> (smaller or larger) chunks to Hadoop.
> >>>
> >>> This has been the situation in the past,
> >>> but as you can see in the last month, this has changed.
> >>>
> >>> Yahoo! has publicly committed to move their development into the main
> code base, and you can see they have started doing this with the 20.100
> branch,
> >>> and their recent commits to trunk.
> >>> Combine this with Nige taking on the 0.22 release branch (and
> shepherding it into a stable release), and I think we are addressing your
> concerns.
> >>>
> >>> They have also started bringing the discussions back on the list, see
> the recent discussion about Jobtracker-nextgen Arun has re-started in
> MAPREDUCE-279.
> >>>
> >>> I'm not saying it's perfect, but I think the major players understand
> there is an issue, and they are *ALL* moving in the right direction.
> >>
> >> I would enthusiastically like to see your optimism borne out.
> >> Maybe I'm misreading the statements issued publicly, but I don't think
> >> that this is fully understood. I agree, though, that it's a move in
> >> the right direction.
> >>
> >>>> This is open source development upside down.
> >>>> It is not ok for people to diff ASF svn against their internal code
> >>>> and provide the diff as a patch without reviewing IP first for every
> >>>> line of code changed.
> >>>> For larger chunks I'd suggest to even go via the Incubator IP
> clearance process.
> >>>> Only then will we force committers to primarily work here in the open
> >>>> and return to what I'd consider a healthy project.
> >>>>
> >>>> To be honest: Hadoop is in the process of falling apart.
> >>>> Contrib Code gets moved out of Apache instead of being maintained
> here.
> >>>> Discussions are seldom consensus-driven.
> >>>> Release branches stagnate.
> >>>
> >>> True. releases do take a long time. This is mainly due to it being
> extremely hard to test and verify that a release is stable.
> >>> It's not enough to just run the thing on 4 machines, you need at least
> 50 to test some of the major problems. This requires some serious $ for
> someone to verify.
> >>
> >> It has been proposed on the list before, IIRC. Don't know how to get
> >> there, but the project seriously needs access to a cluster of this
> >> size.
> >>
> >>>> Downstream projects like HBase don't get proper support.
> >>>> Production setups are made from 3rd party distributions.
> >>>> Development is not happening here, but elsewhere behind corporate
> doors.
> >>>> Discussion about future developments are started on corporate blogs (
> >>>>
> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
> >>>> ) instead of on the proper mailing list.
> >>>> Hurdles for committing are way too high.
> >>>> On the bright side, new committers and PMC members are added, this is
> >>>> an improvement.
> >>>>
> >>>> I'd suggest to move away from relying on large code dumps from
> >>>> corporations, and move back to the ASF-proven "individual committer
> >>>> commits on trunk"-model where more committers can get involved.
> >>>> If that means not to support high end cluster sizes for some months,
> >>>> well, so be it.
> >>>
> >>>> Average committers cannot run - e.g. test - on high
> >>>> end cluster sizes. If that would mean they cannot participate, then
> >>>> the open source project better concentrate on small and medium sized
> >>>> cluster instead.
> >>>
> >>>
> >>> Well.. that's one approach.. but there are several companies out there
> who rely on apache's hadoop to power their large clusters, so I'd hate to
> see hadoop become something that only runs well on
> >>> 10-nodes.. as I don't think that will help anyone either.
> >>
> >> But only looking at high-end scale doesn't help either.
> >>
> >> Let's face the fact that Hadoop is now moving from the early adopters
> >> phase into a much broader market. I predict that small to medium sized
> >> clusters will be the majority of Hadoop deployments in a few months'
> >> time. 4000, or even 500, machines is the high-end range. If the open
> >> source project Hadoop cannot support those users adequately (without
> >> becoming defunct), the committership might be better off focusing on
> >> the low-end and medium sized users.
> >>
> >> I'm not suggesting we turn away from the handful (?) of high-end
> >> users. They certainly have the most valuable input. But also, *they*
> >> obviously have the resources in terms of larger clusters and
> >> developers to deal with their specific setups. Obviously, they don't
> >> need to rely on the open source project to make releases. In fact,
> >> they *do* work on their own Hadoop derivatives.
> >> All the other users, the hundreds of boring small cluster users, don't
> >> have that choice. They *depend* on the open source releases.
> >>
> >> Hadoop is an Apache project, to provide HDFS and MR free of charge to
> >> the general public. Not only to me - nor to only one or two big
> >> companies either.
> >> Focus on all the users.
> >>
> >> Bernd
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Nigel Daley <nd...@mac.com>.
For closure, this vote fails due to a couple of binding -1 votes.

Nige

On Feb 18, 2011, at 4:46 AM, Eric Baldeschwieler wrote:

> Hi Bernd,
> 
> Apache Hadoop is about scale. Most clusters will always be small, but Hadoop is going mainstream precisely because it scales to huge data and cluster sizes. 
> 
> There are lots of systems that work well on 10 node clusters. People select   Hadoop because they are confident that as their business / problem grows, Hadoop can grow with it. 
> 
> ---
> E14 - via iPhone
> 
> On Feb 17, 2011, at 7:25 AM, "Bernd Fondermann" <be...@googlemail.com> wrote:
> 
>> On Thu, Feb 17, 2011 at 14:58, Ian Holsman <ha...@holsman.net> wrote:
>>> Hi Bernd.
>>> 
>>> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
>>>> 
>>>> We have the very unfortunate situation here at Hadoop where Apache
>>>> Hadoop is not the primary and foremost place of Hadoop development.
>>>> Instead, code is developed internally at Yahoo and then contributed in
>>>> (smaller or larger) chunks to Hadoop.
>>> 
>>> This has been the situation in the past,
>>> but as you can see in the last month, this has changed.
>>> 
>>> Yahoo! has publicly committed to move their development into the main code base, and you can see they have started doing this with the 20.100 branch,
>>> and their recent commits to trunk.
>>> Combine this with Nige taking on the 0.22 release branch (and shepherding it into a stable release), and I think we are addressing your concerns.
>>> 
>>> They have also started bringing the discussions back on the list, see the recent discussion about Jobtracker-nextgen Arun has re-started in MAPREDUCE-279.
>>> 
>>> I'm not saying it's perfect, but I think the major players understand there is an issue, and they are *ALL* moving in the right direction.
>> 
>> I would enthusiastically like to see your optimism borne out.
>> Maybe I'm misreading the statements issued publicly, but I don't think
>> that this is fully understood. I agree, though, that it's a move in
>> the right direction.
>> 
>>>> This is open source development upside down.
>>>> It is not ok for people to diff ASF svn against their internal code
>>>> and provide the diff as a patch without reviewing IP first for every
>>>> line of code changed.
>>>> For larger chunks I'd suggest to even go via the Incubator IP clearance process.
>>>> Only then will we force committers to primarily work here in the open
>>>> and return to what I'd consider a healthy project.
>>>> 
>>>> To be honest: Hadoop is in the process of falling apart.
>>>> Contrib Code gets moved out of Apache instead of being maintained here.
>>>> Discussions are seldom consensus-driven.
>>>> Release branches stagnate.
>>> 
>>> True. releases do take a long time. This is mainly due to it being extremely hard to test and verify that a release is stable.
>>> It's not enough to just run the thing on 4 machines, you need at least 50 to test some of the major problems. This requires some serious $ for someone to verify.
>> 
>> It has been proposed on the list before, IIRC. Don't know how to get
>> there, but the project seriously needs access to a cluster of this
>> size.
>> 
>>>> Downstream projects like HBase don't get proper support.
>>>> Production setups are made from 3rd party distributions.
>>>> Development is not happening here, but elsewhere behind corporate doors.
>>>> Discussion about future developments are started on corporate blogs (
>>>> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
>>>> ) instead of on the proper mailing list.
>>>> Hurdles for committing are way too high.
>>>> On the bright side, new committers and PMC members are added, this is
>>>> an improvement.
>>>> 
>>>> I'd suggest to move away from relying on large code dumps from
>>>> corporations, and move back to the ASF-proven "individual committer
>>>> commits on trunk"-model where more committers can get involved.
>>>> If that means not to support high end cluster sizes for some months,
>>>> well, so be it.
>>> 
>>>> Average committers cannot run - e.g. test - on high
>>>> end cluster sizes. If that would mean they cannot participate, then
>>>> the open source project better concentrate on small and medium sized
>>>> cluster instead.
>>> 
>>> 
>>> Well.. that's one approach.. but there are several companies out there who rely on apache's hadoop to power their large clusters, so I'd hate to see hadoop become something that only runs well on
>>> 10-nodes.. as I don't think that will help anyone either.
>> 
>> But only looking at high-end scale doesn't help either.
>> 
>> Let's face the fact that Hadoop is now moving from the early adopters
>> phase into a much broader market. I predict that small to medium sized
>> clusters will be the majority of Hadoop deployments in a few months'
>> time. 4000, or even 500, machines is the high-end range. If the open
>> source project Hadoop cannot support those users adequately (without
>> becoming defunct), the committership might be better off focusing on
>> the low-end and medium sized users.
>>
>> I'm not suggesting we turn away from the handful (?) of high-end
>> users. They certainly have the most valuable input. But also, *they*
>> obviously have the resources in terms of larger clusters and
>> developers to deal with their specific setups. Obviously, they don't
>> need to rely on the open source project to make releases. In fact,
>> they *do* work on their own Hadoop derivatives.
>> All the other users, the hundreds of boring small cluster users, don't
>> have that choice. They *depend* on the open source releases.
>> 
>> Hadoop is an Apache project, to provide HDFS and MR free of charge to
>> the general public. Not only to me - nor to only one or two big
>> companies either.
>> Focus on all the users.
>> 
>> Bernd


Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
Hi Bernd,

Apache Hadoop is about scale. Most clusters will always be small, but Hadoop is going mainstream precisely because it scales to huge data and cluster sizes. 

There are lots of systems that work well on 10-node clusters. People select Hadoop because they are confident that, as their business / problem grows, Hadoop can grow with it.

---
E14 - via iPhone

On Feb 17, 2011, at 7:25 AM, "Bernd Fondermann" <be...@googlemail.com> wrote:

> On Thu, Feb 17, 2011 at 14:58, Ian Holsman <ha...@holsman.net> wrote:
>> Hi Bernd.
>> 
>> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
>>> 
>>> We have the very unfortunate situation here at Hadoop where Apache
>>> Hadoop is not the primary and foremost place of Hadoop development.
>>> Instead, code is developed internally at Yahoo and then contributed in
>>> (smaller or larger) chunks to Hadoop.
>> 
>> This has been the situation in the past,
>> but as you can see in the last month, this has changed.
>> 
>> Yahoo! has publicly committed to move their development into the main code base, and you can see they have started doing this with the 20.100 branch,
>> and their recent commits to trunk.
>> Combine this with Nige taking on the 0.22 release branch (and shepherding it into a stable release), and I think we are addressing your concerns.
>> 
>> They have also started bringing the discussions back on the list, see the recent discussion about Jobtracker-nextgen Arun has re-started in MAPREDUCE-279.
>> 
>> I'm not saying it's perfect, but I think the major players understand there is an issue, and they are *ALL* moving in the right direction.
> 
> I would enthusiastically like to see your optimism borne out.
> Maybe I'm misreading the statements issued publicly, but I don't think
> that this is fully understood. I agree, though, that it's a move in
> the right direction.
> 
>>> This is open source development upside down.
>>> It is not ok for people to diff ASF svn against their internal code
>>> and provide the diff as a patch without reviewing IP first for every
>>> line of code changed.
>>> For larger chunks I'd suggest to even go via the Incubator IP clearance process.
>>> Only then will we force committers to primarily work here in the open
>>> and return to what I'd consider a healthy project.
>>> 
>>> To be honest: Hadoop is in the process of falling apart.
>>> Contrib Code gets moved out of Apache instead of being maintained here.
>>> Discussions are seldom consensus-driven.
>>> Release branches stagnate.
>> 
>> True. releases do take a long time. This is mainly due to it being extremely hard to test and verify that a release is stable.
>> It's not enough to just run the thing on 4 machines, you need at least 50 to test some of the major problems. This requires some serious $ for someone to verify.
> 
> It has been proposed on the list before, IIRC. Don't know how to get
> there, but the project seriously needs access to a cluster of this
> size.
> 
>>> Downstream projects like HBase don't get proper support.
>>> Production setups are made from 3rd party distributions.
>>> Development is not happening here, but elsewhere behind corporate doors.
>>> Discussion about future developments are started on corporate blogs (
>>> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
>>> ) instead of on the proper mailing list.
>>> Hurdles for committing are way too high.
>>> On the bright side, new committers and PMC members are added, this is
>>> an improvement.
>>> 
>>> I'd suggest to move away from relying on large code dumps from
>>> corporations, and move back to the ASF-proven "individual committer
>>> commits on trunk"-model where more committers can get involved.
>>> If that means not to support high end cluster sizes for some months,
>>> well, so be it.
>> 
>>> Average committers cannot run - e.g. test - on high
>>> end cluster sizes. If that would mean they cannot participate, then
>>> the open source project better concentrate on small and medium sized
>>> cluster instead.
>> 
>> 
>> Well.. that's one approach.. but there are several companies out there who rely on apache's hadoop to power their large clusters, so I'd hate to see hadoop become something that only runs well on
>> 10-nodes.. as I don't think that will help anyone either.
> 
> But only looking at high-end scale doesn't help either.
> 
> Let's face the fact that Hadoop is now moving from the early adopters
> phase into a much broader market. I predict that small to medium sized
> clusters will be the majority of Hadoop deployments in a few months'
> time. 4000, or even 500, machines is the high-end range. If the open
> source project Hadoop cannot support those users adequately (without
> becoming defunct), the committership might be better off focusing on
> the low-end and medium sized users.
>
> I'm not suggesting we turn away from the handful (?) of high-end
> users. They certainly have the most valuable input. But also, *they*
> obviously have the resources in terms of larger clusters and
> developers to deal with their specific setups. Obviously, they don't
> need to rely on the open source project to make releases. In fact,
> they *do* work on their own Hadoop derivatives.
> All the other users, the hundreds of boring small cluster users, don't
> have that choice. They *depend* on the open source releases.
> 
> Hadoop is an Apache project, to provide HDFS and MR free of charge to
> the general public. Not only to me - nor to only one or two big
> companies either.
> Focus on all the users.
> 
>  Bernd

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Ian Holsman <ha...@holsman.net>.
On Feb 17, 2011, at 10:23 AM, Bernd Fondermann wrote:

> On Thu, Feb 17, 2011 at 14:58, Ian Holsman <ha...@holsman.net> wrote:
>> Hi Bernd.
>> 
>> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
>>> 
>>> We have the very unfortunate situation here at Hadoop where Apache
>>> Hadoop is not the primary and foremost place of Hadoop development.
>>> Instead, code is developed internally at Yahoo and then contributed in
>>> (smaller or larger) chunks to Hadoop.
>> 
>> This has been the situation in the past,
>> but as you can see in the last month, this has changed.
>> 
>> Yahoo! has publicly committed to move their development into the main code base, and you can see they have started doing this with the 20.100 branch,
>> and their recent commits to trunk.
>> Combine this with Nige taking on the 0.22 release branch (and shepherding it into a stable release), and I think we are addressing your concerns.
>> 
>> They have also started bringing the discussions back on the list, see the recent discussion about Jobtracker-nextgen Arun has re-started in MAPREDUCE-279.
>> 
>> I'm not saying it's perfect, but I think the major players understand there is an issue, and they are *ALL* moving in the right direction.
> 
> I would enthusiastically like to see your optimism borne out.
> Maybe I'm misreading the statements issued publicly, but I don't think
> that this is fully understood. I agree, though, that it's a move in
> the right direction.

I hope to see more as well... and hopefully we will start seeing more and more of this from everyone.


> 
>  Bernd


Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Bernd Fondermann <be...@googlemail.com>.
On Thu, Feb 17, 2011 at 14:58, Ian Holsman <ha...@holsman.net> wrote:
> Hi Bernd.
>
> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
>>
>> We have the very unfortunate situation here at Hadoop where Apache
>> Hadoop is not the primary and foremost place of Hadoop development.
>> Instead, code is developed internally at Yahoo and then contributed in
>> (smaller or larger) chunks to Hadoop.
>
> This has been the situation in the past,
> but as you can see in the last month, this has changed.
>
> Yahoo! has publicly committed to move their development into the main code base, and you can see they have started doing this with the 20.100 branch,
> and their recent commits to trunk.
> Combine this with Nige taking on the 0.22 release branch (and shepherding it into a stable release), and I think we are addressing your concerns.
>
> They have also started bringing the discussions back on the list, see the recent discussion about Jobtracker-nextgen Arun has re-started in MAPREDUCE-279.
>
> I'm not saying it's perfect, but I think the major players understand there is an issue, and they are *ALL* moving in the right direction.

I would enthusiastically like to see your optimism borne out.
Maybe I'm misreading the statements issued publicly, but I don't think
that this is fully understood. I agree, though, that it's a move in
the right direction.

>> This is open source development upside down.
>> It is not ok for people to diff ASF svn against their internal code
>> and provide the diff as a patch without reviewing IP first for every
>> line of code changed.
>> For larger chunks I'd suggest to even go via the Incubator IP clearance process.
>> Only then will we force committers to primarily work here in the open
>> and return to what I'd consider a healthy project.
>>
>> To be honest: Hadoop is in the process of falling apart.
>> Contrib Code gets moved out of Apache instead of being maintained here.
>> Discussions are seldom consensus-driven.
>> Release branches stagnate.
>
> True. releases do take a long time. This is mainly due to it being extremely hard to test and verify that a release is stable.
> It's not enough to just run the thing on 4 machines, you need at least 50 to test some of the major problems. This requires some serious $ for someone to verify.

It has been proposed on the list before, IIRC. Don't know how to get
there, but the project seriously needs access to a cluster of this
size.

>> Downstream projects like HBase don't get proper support.
>> Production setups are made from 3rd party distributions.
>> Development is not happening here, but elsewhere behind corporate doors.
>> Discussion about future developments are started on corporate blogs (
>> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
>> ) instead of on the proper mailing list.
>> Hurdles for committing are way too high.
>> On the bright side, new committers and PMC members are added, this is
>> an improvement.
>>
>> I'd suggest to move away from relying on large code dumps from
>> corporations, and move back to the ASF-proven "individual committer
>> commits on trunk"-model where more committers can get involved.
>> If that means not to support high end cluster sizes for some months,
>> well, so be it.
>
>> Average committers cannot run - e.g. test - on high
>> end cluster sizes. If that would mean they cannot participate, then
>> the open source project better concentrate on small and medium sized
>> cluster instead.
>
>
> Well.. that's one approach.. but there are several companies out there who rely on apache's hadoop to power their large clusters, so I'd hate to see hadoop become something that only runs well on
> 10-nodes.. as I don't think that will help anyone either.

But only looking at high-end scale doesn't help either.

Let's face the fact that Hadoop is now moving from the early adopters
phase into a much broader market. I predict that small to medium sized
clusters will be the majority of Hadoop deployments in a few months'
time. 4000, or even 500, machines is the high-end range. If the open
source project Hadoop cannot support those users adequately (without
becoming defunct), the committership might be better off focusing on
the low-end and medium sized users.

I'm not suggesting we turn away from the handful (?) of high-end
users. They certainly have the most valuable input. But also, *they*
obviously have the resources in terms of larger clusters and
developers to deal with their specific setups. Obviously, they don't
need to rely on the open source project to make releases. In fact,
they *do* work on their own Hadoop derivatives.
All the other users, the hundreds of boring small cluster users, don't
have that choice. They *depend* on the open source releases.

Hadoop is an Apache project, meant to provide HDFS and MR free of
charge to the general public. Not only to me, and not only to one or
two big companies either.
Focus on all the users.

  Bernd

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Ian Holsman <ha...@holsman.net>.
Hi Bernd.

On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
> 
> We have the very unfortunate situation here at Hadoop where Apache
> Hadoop is not the primary and foremost place of Hadoop development.
> Instead, code is developed internally at Yahoo and then contributed in
> (smaller or larger) chunks to Hadoop.

This has been the situation in the past,
but as you can see in the last month, this has changed.

Yahoo! has publicly committed to move their development into the main code base, and you can see they have started doing this with the 20.100 branch,
and their recent commits to trunk. 
Combine this with Nige taking on the 0.22 release branch (and shepherding it into a stable release), and I think we are addressing your concerns.

They have also started bringing the discussions back on the list, see the recent discussion about Jobtracker-nextgen Arun has re-started in MAPREDUCE-279.

I'm not saying it's perfect, but I think the major players understand there is an issue, and they are *ALL* moving in the right direction.



> This is open source development upside down.
> It is not ok for people to diff ASF svn against their internal code
> and provide the diff as a patch without reviewing IP first for every
> line of code changed.
> For larger chunks I'd suggest to even go via the Incubator IP clearance process.
> Only then will we force committers to primarily work here in the open
> and return to what I'd consider a healthy project.
> 
> To be honest: Hadoop is in the process of falling apart.
> Contrib Code gets moved out of Apache instead of being maintained here.
> Discussions are seldom consense-driven.
> Release branches stagnate.

True, releases do take a long time. This is mainly due to it being extremely hard to test and verify that a release is stable.
It's not enough to just run the thing on 4 machines; you need at least 50 to test some of the major problems. This requires some serious $ for someone to verify.

> Downstream projects like HBase don't get proper support.
> Production setups are made from 3rd party distributions.
> Development is not happening here, but elsewhere behind corporate doors.
> Discussion about future developments are started on corporate blogs (
> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
> ) instead of on the proper mailing list.
> Hurdles for committing are way too high.
> On the bright side, new committers and PMC members are added, this is
> an improvement.
> 
> I'd suggest to move away from relying on large code dumps from
> corporations, and move back to the ASF-proven "individual committer
> commits on trunk"-model where more committers can get involved.
> If that means not to support high end cluster sizes for some months,
> well, so be it.

> Average committers cannot run - e.g. test - on high
> end cluster sizes. If that would mean they cannot participate, then
> the open source project better concentrate on small and medium sized
> cluster instead.


Well.. that's one approach.. but there are several companies out there who rely on Apache's Hadoop to power their large clusters, so I'd hate to see Hadoop become something that only runs well on 10 nodes.. as I don't think that will help anyone either.



> 
>  Bernd


Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Feb 17, 2011, at 4:43 AM, Bernd Fondermann wrote:
> To be honest: Hadoop is in the process of falling apart.

	We can thank the Apache Board for helping there as well.  Their high-handed interference basically set the project back six months to a year; we're still recovering from the general mistrust that their actions have instilled in the community.  (If we recover.  I'm not sure we will.)

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Bernd Fondermann <be...@googlemail.com>.
On Sat, Feb 12, 2011 at 09:18, Roy T. Fielding <fi...@gbiv.com> wrote:
> On Feb 11, 2011, at 2:28 AM, Bernd Fondermann wrote:
>> On Fri, Feb 11, 2011 at 07:33, Ian Holsman <ha...@holsman.net> wrote:
>>> They probably have patched it and mistakenly forgotten to submit the patches. Any chance of doing a diff on your version and submitting it?
>>
>> Please keep in mind: The original author(s) would need to submit it -
>> not a proxy.
>
> No, anyone with permission of the copyright owner can submit it
> if it is a separately copyrightable work.  If it is just a repair,
> then anyone can submit it, since repairs are not copyrightable.
> The original author should be noted in any credit given for the
> fix.

We have the very unfortunate situation here at Hadoop where Apache
Hadoop is not the primary and foremost place of Hadoop development.
Instead, code is developed internally at Yahoo and then contributed in
(smaller or larger) chunks to Hadoop.
This is open source development upside down.
It is not OK for people to diff ASF svn against their internal code
and provide the diff as a patch without first reviewing the IP of every
changed line of code.
For larger chunks, I'd even suggest going through the Incubator IP clearance process.
Only then will we force committers to primarily work here in the open
and return to what I'd consider a healthy project.

To be honest: Hadoop is in the process of falling apart.
Contrib code gets moved out of Apache instead of being maintained here.
Discussions are seldom consensus-driven.
Release branches stagnate.
Downstream projects like HBase don't get proper support.
Production setups are built from third-party distributions.
Development is not happening here, but elsewhere behind corporate doors.
Discussions about future developments are started on corporate blogs (
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
) instead of on the proper mailing list.
Hurdles for committing are way too high.
On the bright side, new committers and PMC members are being added; this is
an improvement.

I'd suggest moving away from relying on large code dumps from
corporations, and moving back to the ASF-proven "individual committer
commits on trunk" model, where more committers can get involved.
If that means not supporting high-end cluster sizes for some months,
well, so be it. Average committers cannot run - e.g., test - on high-end
cluster sizes. If that means they cannot participate, then
the open source project had better concentrate on small and medium-sized
clusters instead.

  Bernd

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by "Roy T. Fielding" <fi...@gbiv.com>.
On Feb 11, 2011, at 2:28 AM, Bernd Fondermann wrote:
> On Fri, Feb 11, 2011 at 07:33, Ian Holsman <ha...@holsman.net> wrote:
>> They probably have patched it and mistakenly forgotten to submit the patches. Any chance of doing a diff on your version and submitting it?
> 
> Please keep in mind: The original author(s) would need to submit it -
> not a proxy.

No, anyone with permission of the copyright owner can submit it
if it is a separately copyrightable work.  If it is just a repair,
then anyone can submit it, since repairs are not copyrightable.
The original author should be noted in any credit given for the
fix.

....Roy


Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Bernd Fondermann <be...@googlemail.com>.
On Fri, Feb 11, 2011 at 07:33, Ian Holsman <ha...@holsman.net> wrote:
>
> On Feb 11, 2011, at 5:11 PM, Nigel Daley wrote:
>
>>
>> On Feb 10, 2011, at 9:24 PM, Owen O'Malley wrote:
>>
>>> On Thu, Feb 10, 2011 at 7:40 PM, Nigel Daley <nd...@mac.com> wrote:
>>>
>>>> I think the PMC should abandon the hdfsproxy HDFS contrib component.  Its
>>>> last meaningful contribution was August 2010:
>>>>
>>>
>>> -1 we still use and are maintaining this.
>>
>>
>> Who's the 'we'?  Looking at HDFS-1164, it appears hdfsproxy had a failing unit test for 7 months.  This is exactly the reason we should thoughtfully consider whether it has a future within Hadoop.
>>
>> Perhaps Y! uses a different version of hdfsproxy?
>
> They probably have patched it and mistakenly forgotten to submit the patches. Any chance of doing a diff on your version and submitting it?

Please keep in mind: The original author(s) would need to submit it -
not a proxy.

  Bernd

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
Well put, Allen.

My thinking is to move this API into HDFS, removing the need for hdfsproxy. Then you can use the HTTP proxy of your choice if you need one (we do this for external bandwidth management, security, and IP space preservation reasons).
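
A rough sketch of what that could look like from the client side once the read API is plain HTTP (the proxy host, namenode host, and /data URL layout below are all hypothetical; the real layout depends on whatever API actually lands in HDFS):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.InetSocketAddress;
    import java.net.Proxy;
    import java.net.URL;

    public class ProxiedRead {
        public static void main(String[] args) throws Exception {
            // Any stock HTTP proxy can front the cluster -- no hdfsproxy needed.
            // Proxy host and port are made up for illustration.
            Proxy proxy = new Proxy(Proxy.Type.HTTP,
                    new InetSocketAddress("edge-proxy.example.com", 3128));

            // Hypothetical URL shape for reading a file over HTTP.
            URL url = new URL(
                    "http://namenode.example.com:50070/data/user/foo/part-00000");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);

            InputStream in = conn.getInputStream();
            try {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    System.out.write(buf, 0, n);   // stream file bytes to stdout
                }
            } finally {
                in.close();
                conn.disconnect();
            }
        }
    }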

I'm +1 for removing it. 

We can always revive the project in extras if the API changes hit a wall. 

---
E14 - via iPhone

On Feb 17, 2011, at 1:26 PM, "Allen Wittenauer" <aw...@linkedin.com> wrote:

> 
> On Feb 17, 2011, at 1:21 PM, Konstantin Shvachko wrote:
> 
>> hdfsproxy is a wrapper around HftpFileSystem (in its current state).
>> So you can always replace hdfsproxy with HftpFileSystem.
>> Also, it uses the pure FileSystem API, so it can successfully be maintained
>> outside of HDFS.
>> 
>> Therefore I am +1 for removing it from hdfs/contrib.
>> 
>> What is the use case for hdfsproxy anyway?
> 
>    A stable, secure HTTP GET interface for files.  (No, the normal web UI is not good enough.  Think firewalls.)
> 
> 
> 

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Feb 17, 2011, at 1:21 PM, Konstantin Shvachko wrote:

> hdfsproxy is a wrapper around HftpFileSystem (in its current state).
> So you can always replace hdfsproxy with HftpFileSystem.
> Also, it uses the pure FileSystem API, so it can successfully be maintained
> outside of HDFS.
> 
> Therefore I am +1 for removing it from hdfs/contrib.
> 
> What is the use case for hdfsproxy anyway?

	A stable, secure HTTP GET interface for files.  (No, the normal web UI is not good enough.  Think firewalls.)
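
	For reference, going through HftpFileSystem with the stock FileSystem API looks roughly like this (a minimal sketch assuming a 0.2x-era Hadoop classpath; the namenode host and file path are made up):

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HftpCat {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // hftp:// is a read-only FileSystem that goes over the cluster's
            // HTTP ports, which is what makes it usable through firewalls.
            FileSystem fs = FileSystem.get(
                    URI.create("hftp://namenode.example.com:50070"), conf);
            FSDataInputStream in = fs.open(new Path("/user/foo/part-00000"));
            try {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    System.out.write(buf, 0, n);   // stream file bytes to stdout
                }
            } finally {
                in.close();
            }
        }
    }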




Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Konstantin Shvachko <sh...@gmail.com>.
hdfsproxy is a wrapper around HftpFileSystem (in its current state).
So you can always replace hdfsproxy with HftpFileSystem.
Also, it uses the pure FileSystem API, so it can successfully be maintained
outside of HDFS.

Therefore I am +1 for removing it from hdfs/contrib.

What is the use case for hdfsproxy anyway?
Thanks,
--Konstantin


On Wed, Feb 16, 2011 at 8:31 PM, Sanjay Radia <sr...@yahoo-inc.com> wrote:

>
> On Feb 11, 2011, at 12:03 PM, Ian Holsman wrote:
>
>
>> On Feb 11, 2011, at 5:11 PM, Nigel Daley wrote:
>>
>>
>>> On Feb 10, 2011, at 9:24 PM, Owen O'Malley wrote:
>>>
>>>  On Thu, Feb 10, 2011 at 7:40 PM, Nigel Daley <nd...@mac.com> wrote:
>>>>
>>>>  I think the PMC should abandon the hdfsproxy HDFS contrib component.
>>>>>  Its
>>>>> last meaningful contribution was August 2010:
>>>>>
>>>>>
>>>> -1 we still use and are maintaining this.
>>>>
>>>
>>>
>>> Who's the 'we'?  Looking at HDFS-1164, it appears hdfsproxy had a failing
>>> unit test for 7 months.  This is exactly the reason we should thoughtfully
>>> consider whether it has a future within Hadoop.
>>>
>>> Perhaps Y! uses a different version of hdfsproxy?
>>>
>>
>> They probably have patched it and mistakenly forgotten to submit the patches
>>
> Yes, we have some updates that we haven't got around to pushing out yet
>
>  .. any chance of doing a diff on your version and submitting it?
>>
>
>
> I will check into this and get back.
> We are also working to incorporate some of the features of the proxy into
> HDFS proper -- if that work completes and is accepted by the community, then
> hdfsproxy may become less useful.
>
> Hence, for now, my vote is -1 for removing hdfsproxy.
>
> sanjay
>
>>
>> Regards
>> Ian
>>
>>
>>> Nige
>>>
>>>
>>
>

Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Sanjay Radia <sr...@yahoo-inc.com>.
On Feb 11, 2011, at 12:03 PM, Ian Holsman wrote:

>
> On Feb 11, 2011, at 5:11 PM, Nigel Daley wrote:
>
>>
>> On Feb 10, 2011, at 9:24 PM, Owen O'Malley wrote:
>>
>>> On Thu, Feb 10, 2011 at 7:40 PM, Nigel Daley <nd...@mac.com> wrote:
>>>
>>>> I think the PMC should abandon the hdfsproxy HDFS contrib  
>>>> component.  Its
>>>> last meaningful contribution was August 2010:
>>>>
>>>
>>> -1 we still use and are maintaining this.
>>
>>
>> Who's the 'we'?  Looking at HDFS-1164, it appears hdfsproxy had a
>> failing unit test for 7 months.  This is exactly the reason we
>> should thoughtfully consider whether it has a future within Hadoop.
>>
>> Perhaps Y! uses a different version of hdfsproxy?
>
> They probably have patched it and mistakenly forgotten to submit the patches
Yes, we have some updates that we haven't got around to pushing out yet

> .. any chance of doing a diff on your version and submitting it?


I will check into this and get back.
We are also working to incorporate some of the features of the proxy
into HDFS proper -- if that work completes and is accepted by the
community, then hdfsproxy may become less useful.

Hence, for now, my vote is -1 for removing hdfsproxy.

sanjay
>
> Regards
> Ian
>
>>
>> Nige
>>
>


Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Ian Holsman <ha...@holsman.net>.
On Feb 11, 2011, at 5:11 PM, Nigel Daley wrote:

> 
> On Feb 10, 2011, at 9:24 PM, Owen O'Malley wrote:
> 
>> On Thu, Feb 10, 2011 at 7:40 PM, Nigel Daley <nd...@mac.com> wrote:
>> 
>>> I think the PMC should abandon the hdfsproxy HDFS contrib component.  Its
>>> last meaningful contribution was August 2010:
>>> 
>> 
>> -1 we still use and are maintaining this.
> 
> 
> Who's the 'we'?  Looking at HDFS-1164, it appears hdfsproxy had a failing unit test for 7 months.  This is exactly the reason we should thoughtfully consider whether it has a future within Hadoop.
> 
> Perhaps Y! uses a different version of hdfsproxy? 

They probably have patched it and mistakenly forgotten to submit the patches. Any chance of doing a diff on your version and submitting it?

Regards
Ian

> 
> Nige
> 


Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Nigel Daley <nd...@mac.com>.
On Feb 10, 2011, at 9:24 PM, Owen O'Malley wrote:

> On Thu, Feb 10, 2011 at 7:40 PM, Nigel Daley <nd...@mac.com> wrote:
> 
>> I think the PMC should abandon the hdfsproxy HDFS contrib component.  Its
>> last meaningful contribution was August 2010:
>> 
> 
> -1 we still use and are maintaining this.


Who's the 'we'?  Looking at HDFS-1164, it appears hdfsproxy had a failing unit test for 7 months.  This is exactly the reason we should thoughtfully consider whether it has a future within Hadoop.

Perhaps Y! uses a different version of hdfsproxy? 

Nige


Re: [VOTE] Abandon hdfsproxy HDFS contrib

Posted by Owen O'Malley <om...@apache.org>.
On Thu, Feb 10, 2011 at 7:40 PM, Nigel Daley <nd...@mac.com> wrote:

> I think the PMC should abandon the hdfsproxy HDFS contrib component.  Its
> last meaningful contribution was August 2010:
>

-1 we still use and are maintaining this.

-- Owen