Posted to dev@hive.apache.org by Ashish Thusoo <at...@facebook.com> on 2010/04/19 20:57:14 UTC

[DISCUSSION] To be (or not to be) a TLP - that is the question

Hi Folks,

Recently the Apache Board asked the Hadoop PMC whether some subprojects could become top-level projects (TLPs). In the opinion of the board, big umbrella projects make it difficult to monitor the health of the communities within the subprojects. If Hive does become a TLP, then we would have to elect our own PMC and take on all the administrative tasks that the Hadoop PMC does for us, so there is definitely more administrative work involved as a TLP. The question is whether we should take on this additional work at this time, and what tangible advantages and disadvantages such a move would entail for the project. I would like to hear what the community thinks on this issue.

Thanks,
Ashish

PS: For reference on what is happening in the other subprojects: at this time Pig and ZooKeeper have decided NOT to become TLPs, whereas HBase and Avro have decided to become TLPs.

Re: [DISCUSSION] To be (or not to be) a TLP - that is the question

Posted by Edward Capriolo <ed...@gmail.com>.

On Wed, Apr 21, 2010 at 10:35 PM, Jeff Hammerbacher <ha...@cloudera.com> wrote:

> Hive already does the work to run on multiple versions of Hadoop, and the
> release cycle is independent of Hadoop's. I don't see why it should remain
> a subproject. I'm +1 on Hive becoming a TLP.
>
> On Tue, Apr 20, 2010 at 2:03 PM, Zheng Shao <zs...@gmail.com> wrote:
>
> > As a Hive committer, I don't feel the benefit we get from becoming a
> > TLP is big enough (compared with the cost) to make Hive a TLP.
> > From Chris's comment I see that the cost is not that big, but I still
> > wonder what benefit we will get from that.
> >
> > Also I didn't get the idea of the joke ("In fact, one could argue that
> > Pig opting not to be TLP yet is why Hive should go TLP"). I don't see
> > any reason that applies to Pig but not Hive.
> > We should continue the discussion here, but anything in Pig's
> > discussion should also be considered here.
> >
> > Zheng
> >
> > On Mon, Apr 19, 2010 at 5:48 PM, Amr Awadallah <aa...@cloudera.com> wrote:
> > > I am personally +1 on Hive being a TLP; I think it has reached the
> > > community adoption and maturity level required for that. In fact, one
> > > could argue that Pig opting not to be TLP yet is why Hive should go
> > > TLP :) (jk).
> > >
> > > The real question to ask is whether there is a volunteer to take care
> > > of the "administrative" tasks, which isn't a ton of work afaiu (I am
> > > willing to volunteer if nobody else is up to the task, but I am not a
> > > committer and have only contributed a minor patch for bash/cygwin).
> > >
> > > BTW, here is a very nice summary from Yahoo's Chris Douglas on TLP
> > > tradeoffs. I happen to agree with all he says, and frankly I couldn't
> > > have written it better myself. I highlight certain parts from his
> > > message, but I recommend you read the whole thing.
> > >
> > > ---------- Forwarded message ----------
> > > From: Chris Douglas <cd...@apache.org>
> > > Date: Tue, Apr 13, 2010 at 11:46 PM
> > > Subject: Subprojects and TLP status
> > > To: general@hadoop.apache.org, private@hadoop.apache.org
> > >
> > > Most of Hadoop's subprojects have discussed becoming top-level Apache
> > > projects (TLPs) in the last few weeks. Most have expressed a desire to
> > > remain in Hadoop. The salient parts of the discussions I've read tend
> > > to focus on three aspects: a technical dependence on Hadoop,
> > > additional overhead as a TLP, and visibility both within the Hadoop
> > > ecosystem and in the open source community generally.
> > >
> > > Life as a TLP: this is not much harder than being a Hadoop subproject,
> > > and the Apache preferences being tossed around- particularly
> > > "insufficiently diverse"- are not blockers. Every subproject needs to
> > > write a section of the report Hadoop sends to the board; almost the
> > > same report, sent to a new address. The initial cost is similarly
> > > light: copy bylaws, send a few notes to INFRA, and follow some
> > > directions. I think the estimated costs are far higher than they will
> > > be in practice. Inertia is a powerful force, but it should be
> > > overcome. The directions are here, and should not be intimidating:
> > >
> > > http://apache.org/dev/project-creation.html
> > >
> > > Visibility: the Hadoop site does not need to change. For each
> > > subproject, we can literally change the hyperlinks to point to the new
> > > page and be done. Long-term, linking to all ASF projects that run on
> > > Hadoop from a prominent page is something we all want. So particularly
> > > in the medium-term that most are considering: visibility through the
> > > website will not change. Each subproject will still be linked from the
> > > front page.
> > >
> > > Hadoop would not be nearly as popular as it is without Zookeeper,
> > > HBase, Hive, and Pig. All statistics on work in shared MapReduce
> > > clusters show that users vastly prefer running Pig and Hive queries to
> > > writing MapReduce jobs. HBase continues to push features in HDFS that
> > > increase its adoption and relevance outside MapReduce, while sharing
> > > some of its NoSQL limelight. Zookeeper is not only a linchpin in real
> > > workloads, but many proposals for future features require it. The
> > > bottom line is that MapReduce and HDFS need these projects for
> > > visibility and adoption in precisely the same way. I don't think
> > > separate TLPs will uncouple the broader community from one another.
> > >
> > > Technical dependence: this has two dimensions. First, influencing
> > > MapReduce and HDFS. This is nonsense. Earning influence by
> > > contributing to a subproject is the only way to push code changes;
> > > nobody from any of these projects has violated that by unilaterally
> > > committing to HDFS or MapReduce, anyway. And anyone cynical enough to
> > > believe that MapReduce and HDFS would deliberately screw over or
> > > ignore dependent projects because they don't have PMC members is
> > > plainly unsuited to community-driven development. I understand that
> > > these projects need to protect their users, but lobbying rights are
> > > not an actual benefit.
> > >
> > > Second, being a coherent part of the Hadoop ecosystem. It is (mostly)
> > > true that Hadoop currently offers a set of mutually compatible
> > > frameworks. It is not true that moving them to separate Apache
> > > projects would make solutions less coherent or affect existing or
> > > future users at all. The cohesion between projects' governance is
> > > sufficiently weak to justify independent units, but the real
> > > dependencies between the projects are strong enough to keep us engaged
> > > with one another. And it's not as if other projects- Cascading, for
> > > example- aren't also organisms adapted and specialized for life in
> > > Hadoop.
> > >
> > > Arguments on technical dependence are ignoring the nature of the
> > > existing interactions. Besides, weak technical dependencies are not a
> > > necessary prerequisite for a subproject's independence.
> > >
> > > As for what was *not* said in these discussions, there is no argument
> > > that every one of these subprojects has a distinct, autonomous
> > > community. There was also no argument that the Hadoop PMC offers any
> > > valuable oversight, given that the representatives of its fiefdoms are
> > > too consumed by provincial matters to participate in neighboring
> > > governance. Most releases I've voted on: I run the unit tests, check
> > > the signature, verify the checksum, and know literally nothing else
> > > about its content. I have often never heard the names of many proposed
> > > committers and even some proposed PMC members. Right now, subprojects
> > > with enough PMC members essentially vote out their own releases and
> > > vote in their own committers: TLPs in all but name.
> > >
> > > The Hadoop club- in conferences, meetups, technical debates, etc.- is
> > > broad, diverse, and intertwined, but communities of developers have
> > > already clustered around subprojects. Allowing that each cluster
> > > should govern itself is a dry, practical matter, not an existential
> > > crisis. -C
> > >
> > > On 4/19/2010 11:57 AM, Ashish Thusoo wrote:
> > >>
> > >> Hi Folks,
> > >>
> > >> Recently Apache Board asked the Hadoop PMC if some sub projects can
> > >> become top level projects. In the opinion of the board, big umbrella
> > >> projects make it difficult to monitor the health of the communities
> > >> within the sub projects. If Hive does become a TLP, then we would have
> > >> to elect our own PMC and take on all the administrative tasks that the
> > >> Hadoop PMC does for us. So there is definitely more administrative work
> > >> involved as a TLP. So the question is whether we should take on this
> > >> additional task keeping at this time and what tangible advantages and
> > >> disadvantages would such a move entail for the project. Would like to
> > >> hear what the community thinks on this issue.
> > >>
> > >> Thanks,
> > >> Ashish
> > >>
> > >> PS: As some reference to what is happening in the other subprojects, at
> > >> this time PIG and Zookeeper have decided NOT to become TLPs where as
> > >> Hbase and Avro have decided to become TLPs.
> > >>
> > >>
> > >
> >
> >
> >
> > --
> > Yours,
> > Zheng
> > http://www.linkedin.com/in/zshao
> >
>

I always thought that becoming a top-level project was something you did
after a while, like "moving out of your parents' house".

On a more serious note, the top bar of the Hadoop page is getting seriously
crowded: Avro, Chukwa, HBase, HDFS, MapReduce, Pig, ZooKeeper.

As Hive is now working on top of HBase, and possibly soon Cassandra, I can
see a day where Hive could even theoretically run without Hadoop if some
alternate JobTracker/TaskTracker implementation were ever written. (Crazy, I
know.)

I could see us wanting to use a different wiki or having other site features
that might be cumbersome to implement under Hadoop.

We should also look at this as load we would be taking off the Hadoop PMC.
Right now they have to handle our administration. Doing our own
administration would leave them freer to do whatever they need to do.

Either alternative is fine. As a preference I would say go TLP.

Edward

Re: [DISCUSSION] To be (or not to be) a TLP - that is the question

Posted by Dhruba Borthakur <dh...@gmail.com>.

I am definitely against moving Hive out of Hadoop. There is appreciable
representation of Hive inside the Hadoop PMC and, as far as I can tell,
keeping Hive inside Hadoop places no additional burden on the Hadoop PMC.

I respect Jeff's and Amr's viewpoints, but I beg to differ. I really do not
see any benefit in moving Hive out of Hadoop.

thanks,
dhruba

On Thu, Apr 22, 2010 at 10:09 AM, Ashish Thusoo <at...@facebook.com> wrote:

> What is the advantage of becoming a TLP to the project itself? I have heard
> that it is something that Apache wants, but considering that we are very
> comfortable with how Hive interacts with the Hadoop ecosystem as a subproject
> of Hadoop, there has to be some big incentive for the project to be a TLP,
> and nowhere have I seen how this would benefit Hive. Any thoughts on that?
>
> Ashish


-- 
Connect to me at http://www.facebook.com/dhruba

RE: [DISCUSSION] To be (or not to be) a TLP - that is the question

Posted by Ashish Thusoo <at...@facebook.com>.

What is the advantage of becoming a TLP to the project itself? I have heard that it is something that Apache wants, but considering that we are very comfortable with how Hive interacts with the Hadoop ecosystem as a subproject of Hadoop, there has to be some big incentive for the project to be a TLP, and nowhere have I seen how this would benefit Hive. Any thoughts on that?

Ashish

________________________________
From: Jeff Hammerbacher [mailto:hammer@cloudera.com]
Sent: Wednesday, April 21, 2010 7:35 PM
To: hive-dev@hadoop.apache.org
Cc: Ashish Thusoo
Subject: Re: [DISCUSSION] To be (or not to be) a TLP - that is the question

Hive already does the work to run on multiple versions of Hadoop, and the release cycle is independent of Hadoop's. I don't see why it should remain a subproject. I'm +1 on Hive becoming a TLP.


Re: [DISCUSSION] To be (or not to be) a TLP - that is the question

Posted by Jeff Hammerbacher <ha...@cloudera.com>.

Hive already does the work to run on multiple versions of Hadoop, and the
release cycle is independent of Hadoop's. I don't see why it should remain a
subproject. I'm +1 on Hive becoming a TLP.

On Tue, Apr 20, 2010 at 2:03 PM, Zheng Shao <zs...@gmail.com> wrote:

> As a Hive committer, I don't feel the benefit we get from becoming a
> TLP is big enough (compared with the cost) to make Hive a TLP.
> From Chris's comment I see that the cost is not that big, but I still
> wonder what benefit we will get from that.
>
> Also I didn't get the idea of the joke ("In fact, one could argue that
> Pig opting not to be TLP yet is why Hive should go TLP"). I don't see
> any reason that applies to Pig but not Hive.
> We should continue the discussion here, but anything in Pig's
> discussion should also be considered here.
>
> Zheng
>
> On Mon, Apr 19, 2010 at 5:48 PM, Amr Awadallah <aa...@cloudera.com> wrote:
> > I am personally +1 on Hive being a TLP; I think it has reached the
> > community adoption and maturity level required for that. In fact, one
> > could argue that Pig opting not to be TLP yet is why Hive should go
> > TLP :) (jk).
> >
> > The real question to ask is whether there is a volunteer to take care
> > of the "administrative" tasks, which isn't a ton of work afaiu (I am
> > willing to volunteer if nobody else is up to the task, but I am not a
> > committer and have only contributed a minor patch for bash/cygwin).
> >
> > BTW, here is a very nice summary from Yahoo's Chris Douglas on TLP
> > tradeoffs. I happen to agree with all he says, and frankly I couldn't
> > have written it better myself. I highlight certain parts from his
> > message, but I recommend you read the whole thing.
> >
> > ---------- Forwarded message ----------
> > From: Chris Douglas <cd...@apache.org>
> > Date: Tue, Apr 13, 2010 at 11:46 PM
> > Subject: Subprojects and TLP status
> > To: general@hadoop.apache.org, private@hadoop.apache.org
> >
> > Most of Hadoop's subprojects have discussed becoming top-level Apache
> > projects (TLPs) in the last few weeks. Most have expressed a desire to
> > remain in Hadoop. The salient parts of the discussions I've read tend
> > to focus on three aspects: a technical dependence on Hadoop,
> > additional overhead as a TLP, and visibility both within the Hadoop
> > ecosystem and in the open source community generally.
> >
> > Life as a TLP: this is not much harder than being a Hadoop subproject,
> > and the Apache preferences being tossed around- particularly
> > "insufficiently diverse"- are not blockers. Every subproject needs to
> > write a section of the report Hadoop sends to the board; almost the
> > same report, sent to a new address. The initial cost is similarly
> > light: copy bylaws, send a few notes to INFRA, and follow some
> > directions. I think the estimated costs are far higher than they will
> > be in practice. Inertia is a powerful force, but it should be
> > overcome. The directions are here, and should not be intimidating:
> >
> > http://apache.org/dev/project-creation.html
> >
> > Visibility: the Hadoop site does not need to change. For each
> > subproject, we can literally change the hyperlinks to point to the new
> > page and be done. Long-term, linking to all ASF projects that run on
> > Hadoop from a prominent page is something we all want. So particularly
> > in the medium-term that most are considering: visibility through the
> > website will not change. Each subproject will still be linked from the
> > front page.
> >
> > Hadoop would not be nearly as popular as it is without Zookeeper,
> > HBase, Hive, and Pig. All statistics on work in shared MapReduce
> > clusters show that users vastly prefer running Pig and Hive queries to
> > writing MapReduce jobs. HBase continues to push features in HDFS that
> > increase its adoption and relevance outside MapReduce, while sharing
> > some of its NoSQL limelight. Zookeeper is not only a linchpin in real
> > workloads, but many proposals for future features require it. The
> > bottom line is that MapReduce and HDFS need these projects for
> > visibility and adoption in precisely the same way. I don't think
> > separate TLPs will uncouple the broader community from one another.
> >
> > Technical dependence: this has two dimensions. First, influencing
> > MapReduce and HDFS. This is nonsense. Earning influence by
> > contributing to a subproject is the only way to push code changes;
> > nobody from any of these projects has violated that by unilaterally
> > committing to HDFS or MapReduce, anyway. And anyone cynical enough to
> > believe that MapReduce and HDFS would deliberately screw over or
> > ignore dependent projects because they don't have PMC members is
> > plainly unsuited to community-driven development. I understand that
> > these projects need to protect their users, but lobbying rights are
> > not an actual benefit.
> >
> > Second, being a coherent part of the Hadoop ecosystem. It is (mostly)
> > true that Hadoop currently offers a set of mutually compatible
> > frameworks. It is not true that moving them to separate Apache
> > projects would make solutions less coherent or affect existing or
> > future users at all. The cohesion between projects' governance is
> > sufficiently weak to justify independent units, but the real
> > dependencies between the projects are strong enough to keep us engaged
> > with one another. And it's not as if other projects- Cascading, for
> > example- aren't also organisms adapted and specialized for life in
> > Hadoop.
> >
> > Arguments on technical dependence are ignoring the nature of the
> > existing interactions. Besides, weak technical dependencies are not a
> > necessary prerequisite for a subproject's independence.
> >
> > As for what was *not* said in these discussions, there is no argument
> > that every one of these subprojects has a distinct, autonomous
> > community. There was also no argument that the Hadoop PMC offers any
> > valuable oversight, given that the representatives of its fiefdoms are
> > too consumed by provincial matters to participate in neighboring
> > governance. Most releases I've voted on: I run the unit tests, check
> > the signature, verify the checksum, and know literally nothing else
> > about its content. I have often never heard the names of many proposed
> > committers and even some proposed PMC members. Right now, subprojects
> > with enough PMC members essentially vote out their own releases and
> > vote in their own committers: TLPs in all but name.
> >
> > The Hadoop club- in conferences, meetups, technical debates, etc.- is
> > broad, diverse, and intertwined, but communities of developers have
> > already clustered around subprojects. Allowing that each cluster
> > should govern itself is a dry, practical matter, not an existential
> > crisis. -C
> >
> > On 4/19/2010 11:57 AM, Ashish Thusoo wrote:
> >>
> >> Hi Folks,
> >>
> >> Recently Apache Board asked the Hadoop PMC if some sub projects can become
> >> top level projects. In the opinion of the board, big umbrella projects make
> >> it difficult to monitor the health of the communities within the sub
> >> projects. If Hive does become a TLP, then we would have to elect our own PMC
> >> and take on all the administrative tasks that the Hadoop PMC does for us. So
> >> there is definitely more administrative work involved as a TLP. So the
> >> question is whether we should take on this additional task keeping at this
> >> time and what tangible advantages and disadvantages would such a move entail
> >> for the project. Would like to hear what the community thinks on this issue.
> >>
> >> Thanks,
> >> Ashish
> >>
> >> PS: As some reference to what is happening in the other subprojects, at
> >> this time PIG and Zookeeper have decided NOT to become TLPs where as Hbase
> >> and Avro have decided to become TLPs.
> >>
> >>
> >
>
>
>
> --
> Yours,
> Zheng
> http://www.linkedin.com/in/zshao
>

Re: [DISCUSSION] To be (or not to be) a TLP - that is the question

Posted by Zheng Shao <zs...@gmail.com>.

As a Hive committer, I don't feel the benefit we get from becoming a
TLP is big enough (compared with the cost) to make Hive a TLP.
From Chris's comment I see that the cost is not that big, but I still
wonder what benefit we will get from that.

Also I didn't get the idea of the joke ("In fact, one could argue that
Pig opting not to be TLP yet is why Hive should go TLP"). I don't see
any reason that applies to Pig but not Hive.
We should continue the discussion here, but anything in Pig's
discussion should also be considered here.

Zheng

On Mon, Apr 19, 2010 at 5:48 PM, Amr Awadallah <aa...@cloudera.com> wrote:
> I am personally +1 on Hive being a TLP, I think it did reach the community
> adoption and maturity level required for that. In fact, one could argue that
> Pig opting not to be TLP yet is why Hive should go TLP :) (jk).
>
> The real question to ask is whether there is a volunteer to take care of the
> "administrative" tasks, which isn't a ton of work afaiu (I am willing to
> volunteer if nobody else is up to the task, but I am not a committer and only
> contributed a minor patch for bash/cygwin).
>
> BTW, here is a very nice summary from Yahoo's Chris Douglas on TLP
> tradeoffs. I happen to agree with all he says, and frankly I couldn't have
> written it better myself. I highlight certain parts from his message, but I
> recommend you read the whole thing.
>
> ---------- Forwarded message ----------
> From: Chris Douglas <cd...@apache.org>
> Date: Tue, Apr 13, 2010 at 11:46 PM
> Subject: Subprojects and TLP status
> To: general@hadoop.apache.org, private@hadoop.apache.org
>
> Most of Hadoop's subprojects have discussed becoming top-level Apache
> projects (TLPs) in the last few weeks. Most have expressed a desire to
> remain in Hadoop. The salient parts of the discussions I've read tend
> to focus on three aspects: a technical dependence on Hadoop,
> additional overhead as a TLP, and visibility both within the Hadoop
> ecosystem and in the open source community generally.
>
> Life as a TLP: this is not much harder than being a Hadoop subproject,
> and the Apache preferences being tossed around- particularly
> "insufficiently diverse"- are not blockers. Every subproject needs to
> write a section of the report Hadoop sends to the board; almost the
> same report, sent to a new address. The initial cost is similarly
> light: copy bylaws, send a few notes to INFRA, and follow some
> directions. I think the estimated costs are far higher than they will
> be in practice. Inertia is a powerful force, but it should be
> overcome. The directions are here, and should not be intimidating:
>
> http://apache.org/dev/project-creation.html
>
> Visibility: the Hadoop site does not need to change. For each
> subproject, we can literally change the hyperlinks to point to the new
> page and be done. Long-term, linking to all ASF projects that run on
> Hadoop from a prominent page is something we all want. So particularly
> in the medium-term that most are considering: visibility through the
> website will not change. Each subproject will still be linked from the
> front page.
>
> Hadoop would not be nearly as popular as it is without Zookeeper,
> HBase, Hive, and Pig. All statistics on work in shared MapReduce
> clusters show that users vastly prefer running Pig and Hive queries to
> writing MapReduce jobs. HBase continues to push features in HDFS that
> increase its adoption and relevance outside MapReduce, while sharing
> some of its NoSQL limelight. Zookeeper is not only a linchpin in real
> workloads, but many proposals for future features require it. The
> bottom line is that MapReduce and HDFS need these projects for
> visibility and adoption in precisely the same way. I don't think
> separate TLPs will uncouple the broader community from one another.
>
> Technical dependence: this has two dimensions. First, influencing
> MapReduce and HDFS. This is nonsense. Earning influence by
> contributing to a subproject is the only way to push code changes;
> nobody from any of these projects has violated that by unilaterally
> committing to HDFS or MapReduce, anyway. And anyone cynical enough to
> believe that MapReduce and HDFS would deliberately screw over or
> ignore dependent projects because they don't have PMC members is
> plainly unsuited to community-driven development. I understand that
> these projects need to protect their users, but lobbying rights are
> not an actual benefit.
>
> Second, being a coherent part of the Hadoop ecosystem. It is (mostly)
> true that Hadoop currently offers a set of mutually compatible
> frameworks. It is not true that moving them to separate Apache
> projects would make solutions less coherent or affect existing or
> future users at all. The cohesion between projects' governance is
> sufficiently weak to justify independent units, but the real
> dependencies between the projects are strong enough to keep us engaged
> with one another. And it's not as if other projects- Cascading, for
> example- aren't also organisms adapted and specialized for life in
> Hadoop.
>
> Arguments on technical dependence are ignoring the nature of the
> existing interactions. Besides, weak technical dependencies are not a
> necessary prerequisite for a subproject's independence.
>
> As for what was *not* said in these discussions, there is no argument
> that every one of these subprojects has a distinct, autonomous
> community. There was also no argument that the Hadoop PMC offers any
> valuable oversight, given that the representatives of its fiefdoms are
> too consumed by provincial matters to participate in neighboring
> governance. Most releases I've voted on: I run the unit tests, check
> the signature, verify the checksum, and know literally nothing else
> about its content. I have often never heard the names of many proposed
> committers and even some proposed PMC members. Right now, subprojects
> with enough PMC members essentially vote out their own releases and
> vote in their own committers: TLPs in all but name.
>
> The Hadoop club- in conferences, meetups, technical debates, etc.- is
> broad, diverse, and intertwined, but communities of developers have
> already clustered around subprojects. Allowing that each cluster
> should govern itself is a dry, practical matter, not an existential
> crisis. -C
>
> On 4/19/2010 11:57 AM, Ashish Thusoo wrote:
>>
>> Hi Folks,
>>
>> Recently Apache Board asked the Hadoop PMC if some sub projects can become
>> top level projects. In the opinion of the board, big umbrella projects make
>> it difficult to monitor the health of the communities within the sub
>> projects. If Hive does become a TLP, then we would have to elect our own PMC
>> and take on all the administrative tasks that the Hadoop PMC does for us. So
>> there is definitely more administrative work involved as a TLP. So the
>> question is whether we should take on this additional task keeping at this
>> time and what tangible advantages and disadvantages would such a move entail
>> for the project. Would like to hear what the community thinks on this issue.
>>
>> Thanks,
>> Ashish
>>
>> PS: As some reference to what is happening in the other subprojects, at
>> this time PIG and Zookeeper have decided NOT to become TLPs where as Hbase
>> and Avro have decided to become TLPs.
>>
>>
>



-- 
Yours,
Zheng
http://www.linkedin.com/in/zshao

Re: [DISCUSSION] To be (or not to be) a TLP - that is the question

Posted by Amr Awadallah <aa...@cloudera.com>.

I am personally +1 on Hive being a TLP; I think it has reached the
community adoption and maturity level required for that. In fact, one
could argue that Pig opting not to be TLP yet is why Hive should go TLP 
:) (jk).

The real question to ask is whether there is a volunteer to take care of 
the "administrative" tasks, which isn't a ton of work afaiu (I am 
willing to volunteer if nobody else is up to the task, but I am not a
committer and only contributed a minor patch for bash/cygwin).

BTW, here is a very nice summary from Yahoo's Chris Douglas on TLP 
tradeoffs. I happen to agree with all he says, and frankly I couldn't 
have written it better myself. I highlight certain parts from his
message, but I recommend you read the whole thing.

---------- Forwarded message ----------
From: Chris Douglas <cd...@apache.org>
Date: Tue, Apr 13, 2010 at 11:46 PM
Subject: Subprojects and TLP status
To: general@hadoop.apache.org, private@hadoop.apache.org

Most of Hadoop's subprojects have discussed becoming top-level Apache
projects (TLPs) in the last few weeks. Most have expressed a desire to
remain in Hadoop. The salient parts of the discussions I've read tend
to focus on three aspects: a technical dependence on Hadoop,
additional overhead as a TLP, and visibility both within the Hadoop
ecosystem and in the open source community generally.

Life as a TLP: this is not much harder than being a Hadoop subproject,
and the Apache preferences being tossed around- particularly
"insufficiently diverse"- are not blockers. Every subproject needs to
write a section of the report Hadoop sends to the board; almost the
same report, sent to a new address. The initial cost is similarly
light: copy bylaws, send a few notes to INFRA, and follow some
directions. I think the estimated costs are far higher than they will
be in practice. Inertia is a powerful force, but it should be
overcome. The directions are here, and should not be intimidating:

http://apache.org/dev/project-creation.html

Visibility: the Hadoop site does not need to change. For each
subproject, we can literally change the hyperlinks to point to the new
page and be done. Long-term, linking to all ASF projects that run on
Hadoop from a prominent page is something we all want. So particularly
in the medium-term that most are considering: visibility through the
website will not change. Each subproject will still be linked from the
front page.

Hadoop would not be nearly as popular as it is without Zookeeper,
HBase, Hive, and Pig. All statistics on work in shared MapReduce
clusters show that users vastly prefer running Pig and Hive queries to
writing MapReduce jobs. HBase continues to push features in HDFS that
increase its adoption and relevance outside MapReduce, while sharing
some of its NoSQL limelight. Zookeeper is not only a linchpin in real
workloads, but many proposals for future features require it. The
bottom line is that MapReduce and HDFS need these projects for
visibility and adoption in precisely the same way. I don't think
separate TLPs will uncouple the broader community from one another.

Technical dependence: this has two dimensions. First, influencing
MapReduce and HDFS. This is nonsense. Earning influence by
contributing to a subproject is the only way to push code changes;
nobody from any of these projects has violated that by unilaterally
committing to HDFS or MapReduce, anyway. And anyone cynical enough to
believe that MapReduce and HDFS would deliberately screw over or
ignore dependent projects because they don't have PMC members is
plainly unsuited to community-driven development. I understand that
these projects need to protect their users, but lobbying rights are
not an actual benefit.

Second, being a coherent part of the Hadoop ecosystem. It is (mostly)
true that Hadoop currently offers a set of mutually compatible
frameworks. It is not true that moving them to separate Apache
projects would make solutions less coherent or affect existing or
future users at all. The cohesion between projects' governance is
sufficiently weak to justify independent units, but the real
dependencies between the projects are strong enough to keep us engaged
with one another. And it's not as if other projects- Cascading, for
example- aren't also organisms adapted and specialized for life in
Hadoop.

Arguments on technical dependence are ignoring the nature of the
existing interactions. Besides, weak technical dependencies are not a
necessary prerequisite for a subproject's independence.

As for what was *not* said in these discussions, there is no argument
that every one of these subprojects has a distinct, autonomous
community. There was also no argument that the Hadoop PMC offers any
valuable oversight, given that the representatives of its fiefdoms are
too consumed by provincial matters to participate in neighboring
governance. Most releases I've voted on: I run the unit tests, check
the signature, verify the checksum, and know literally nothing else
about its content. I have often never heard the names of many proposed
committers and even some proposed PMC members. Right now, subprojects
with enough PMC members essentially vote out their own releases and
vote in their own committers: TLPs in all but name.

The Hadoop club- in conferences, meetups, technical debates, etc.- is
broad, diverse, and intertwined, but communities of developers have
already clustered around subprojects. Allowing that each cluster
should govern itself is a dry, practical matter, not an existential
crisis. -C

On 4/19/2010 11:57 AM, Ashish Thusoo wrote:
> Hi Folks,
>
> Recently Apache Board asked the Hadoop PMC if some sub projects can become top level projects. In the opinion of the board, big umbrella projects make it difficult to monitor the health of the communities within the sub projects. If Hive does become a TLP, then we would have to elect our own PMC and take on all the administrative tasks that the Hadoop PMC does for us. So there is definitely more administrative work involved as a TLP. So the question is whether we should take on this additional task keeping at this time and what tangible advantages and disadvantages would such a move entail for the project. Would like to hear what the community thinks on this issue.
>
> Thanks,
> Ashish
>
> PS: As some reference to what is happening in the other subprojects, at this time PIG and Zookeeper have decided NOT to become TLPs where as Hbase and Avro have decided to become TLPs.
>
>