Posted to dev@hive.apache.org by Makoto Yui <yu...@gmail.com> on 2014/11/21 14:02:53 UTC

[PROPOSAL] Hivemall incubation

Hi all,

I am the principal developer of Hivemall, a scalable machine learning
library for Apache Hive.

  https://github.com/myui/hivemall

When I gave a talk at the last Hadoop Summit in San Jose [1], several
attendees asked me about the possibility of changing the software
license of Hivemall to Apache License v2; the license and the
sustainability of the project were their major concerns.

Since then, I have been thinking about proposing Hivemall as an Apache
Incubator project. Hivemall's position relative to Hive would be similar
to that of DataFu (an Apache Incubator project) relative to Apache Pig.

I believe that adding machine learning functionality on top of Apache
Hive could extend the range of applications of Apache Hive, and that
Hivemall could help existing Hive users in their large-scale data
analytics projects.

I have obtained approval from my employer (AIST) to change the license
of Hivemall to Apache License version 2 and to donate the code to the
Apache Software Foundation. I am now willing to propose Hivemall as an
Apache Incubator project, together with Hivemall contributors at NTT Corp.

I think the current Hivemall codebase is a bit too large to be included
in Hive contrib, so it would be better as a separate incubator project.
I would like to propose that Hivemall graduate as a subproject of
Apache Hive.

Is this strategy possible from the Hive PMC's point of view?
http://incubator.apache.org/guides/graduation.html#subproject-or-top-level

Before formulating a proposal, I would like to hear Hive developers’
opinions (e.g., feasibility, +1/-1, and missing pieces for incubation)
on incubating Hivemall.

BTW, I found this JIRA issue mentioning Hivemall.
https://issues.apache.org/jira/browse/HIVE-7940

Is there a possibility of cooperating with them in proposing Hivemall as
an Apache Incubator project? According to the incubation guides, I need
a mentor/champion for incubation.
http://incubator.apache.org/guides/proposal.html#formulating

Your help toward the incubation will be much appreciated.

Thanks,
Makoto

[1] http://www.slideshare.net/myui/hivemall-hadoop-summit-2014-san-jose

--
*******************************************
Makoto YUI <m....@aist.go.jp>
Information Technology Research Institute, AIST.
http://staff.aist.go.jp/m.yui/
*******************************************

Re: [PROPOSAL] Hivemall incubation

Posted by Nick Dimiduk <nd...@gmail.com>.
Thank you for humoring my questions. I do not know the mind of the DataFu
community. Your observations are quite clear; I have no further concerns.

-n


Re: [PROPOSAL] Hivemall incubation

Posted by Makoto Yui <yu...@gmail.com>.
Hi Nick,

Thank you for the comments.

(2014/11/22 3:42), Nick Dimiduk wrote:
> I would also encourage you to consider joining forces with DataFu,
> rather than "competing". I think there's a real appetite for a holistic
> toolbox of patterns and implementations that can span these projects.
> From my understanding, there's nothing about DataFu that's unique to
> Pig; they just need the work done to abstract away the Pig bits and
> implement the Hive interfaces.

My current understanding of DataFu is that it is a collection of UDFs
for Apache Pig. Since a Hive interface is not yet supported in DataFu,
is that direction (extending DataFu to Hive) a consensus in the DataFu
community?

My concern is that merging the Hivemall codebase into DataFu would make
DataFu's build and packaging process complex and the target/objective
of the project unclear.

I do not think that Hivemall competes with DataFu because
1) some users prefer Pig and others prefer Hive, and
2) Pig/DataFu is useful for tasks to which HiveQL is unsuited (e.g.,
complex feature engineering steps). After preprocessing with DataFu,
Hivemall can be applied for classification/regression in a scalable way
in Hive.

> Is there anything about Hivemall that's unique to Hive, that wouldn't be
> applicable to Pig as well?

The techniques used in Hivemall (e.g., training data amplification that
emulates iterative training, and machine learning algorithms implemented
as table-generating functions) could be applicable to Apache Pig.

However, I am not a heavy user of Pig, and porting Hivemall to Pig would
require a lot of work. So, I am currently planning to stick with HiveQL
interfaces (Hive, HCatalog, and Tez as the software stack of Hivemall)
in developing Hivemall, because an SQL-like interface is friendly to a
broader range of developers.

Thanks,
Makoto

-- 
*******************************************
Makoto YUI <m....@aist.go.jp>
Information Technology Research Institute, AIST.
https://staff.aist.go.jp/m.yui/index_e.html
*******************************************

Re: [PROPOSAL] Hivemall incubation

Posted by Nick Dimiduk <nd...@gmail.com>.
Hi Makoto,

I cannot speak for Hive PMC, only as a data tool user and occasional
contributor. I think the idea is very much a good one. Incubator takes a
lot of work because it's all about establishing a vibrant developer and
user community for the project. "Community before code," as they say.

I would also encourage you to consider joining forces with DataFu, rather
than "competing". I think there's a real appetite for a holistic toolbox
of patterns and implementations that can span these projects. From my
understanding, there's nothing about DataFu that's unique to Pig; they
just need the work done to abstract away the Pig bits and implement the
Hive interfaces.

Is there anything about Hivemall that's unique to Hive, that wouldn't be
applicable to Pig as well?

+Casey, as I believe he has some interest in seeing DataFu reach a wider
audience as well.

Good on you.
Nick
