Posted to dev@spark.apache.org by Xiaokai Wei <xw...@palantir.com> on 2014/06/18 02:00:00 UTC

Contributing to MLlib on GLM

Hi,

I am an intern at PalantirTech and we are building some stuff on top of
MLlib. In particular, GLMs are of great interest to us. Although
GeneralizedLinearModel in MLlib 1.0.0 covers some important GLMs such as
logistic regression and linear regression, other important GLMs such as
Poisson regression are still missing.

I am curious whether anyone is already working on other GLMs (e.g. Poisson,
Gamma). If not, we would like to contribute them to MLlib. Is adding more
GLMs on the MLlib roadmap?


Sincerely,

Xiaokai

Re: Contributing to MLlib on GLM

Posted by Gang Bai <ba...@staff.sina.com.cn>.

Poisson and Gamma regressions for modeling count data are definitely important in spark.mllib.regression. So don’t worry. Let’s change the updater to SquaredL2Updater as we discussed in the PR. Then we can ask Jenkins to run the test.
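
As background for the updater change mentioned above, the sketch below shows what a squared-L2 updater does in a single step: it shrinks the weights (L2 regularization) and then moves against the gradient. This is plain Python rather than MLlib's Scala, and the function name `squared_l2_update`, its parameters, and the step_size / sqrt(iteration) decay are assumptions chosen for illustration, not a copy of the MLlib source.

```python
import math


def squared_l2_update(weights, gradient, step_size, iteration, reg_param):
    """One gradient step with L2 regularization (hypothetical helper).

    Targets loss(w) + (reg_param / 2) * ||w||^2 by applying
        w <- w * (1 - eta * reg_param) - eta * gradient
    where eta = step_size / sqrt(iteration) is an assumed decay schedule.
    Returns the new weights and the regularization term at those weights.
    """
    eta = step_size / math.sqrt(iteration)
    shrink = 1.0 - eta * reg_param
    new_weights = [shrink * w - eta * g for w, g in zip(weights, gradient)]
    reg_value = 0.5 * reg_param * sum(w * w for w in new_weights)
    return new_weights, reg_value
```

With reg_param set to 0 the step reduces to an ordinary gradient step, which is one way to check that a test failure comes from the regularization rather than from the gradient itself.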

On Jul 8, 2014, at 3:00 AM, xwei <we...@gmail.com> wrote:

> Hi Gang,
> 
> No admin is looking at our patch :( Do you have any suggestions for getting
> our patch noticed by an admin?
> 
> Best regards,
> 
> Xiaokai

Re: Contributing to MLlib on GLM

Posted by xwei <we...@gmail.com>.

Hi Gang,

No admin is looking at our patch :( Do you have any suggestions for getting
our patch noticed by an admin?

Best regards,

Xiaokai


On Mon, Jun 30, 2014 at 8:18 PM, Gang Bai [via Apache Spark Developers
List] <ml...@n3.nabble.com> wrote:

> Thanks Xiaokai,
>
> I’ve created a pull request to merge the features from my PR into your repo.
> Please review it here: https://github.com/xwei-datageek/spark/pull/2
>
> As for GLMs, here at Sina we are working on predicting the number of
> visitors who read a particular news article or watch an online sports
> live stream in a given period. I’m trying to improve the prediction
> results by tuning features and incorporating new models, so I’ll try
> Gamma regression later. Thanks for the implementation.
>
> Cheers,
> -Gang

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-on-GLM-tp7033p7197.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Contributing to MLlib on GLM

Posted by Gang Bai <ba...@staff.sina.com.cn>.

Thanks Xiaokai,

I’ve created a pull request to merge the features from my PR into your repo. Please review it here: https://github.com/xwei-datageek/spark/pull/2

As for GLMs, here at Sina we are working on predicting the number of visitors who read a particular news article or watch an online sports live stream in a given period. I’m trying to improve the prediction results by tuning features and incorporating new models, so I’ll try Gamma regression later. Thanks for the implementation.

Cheers,
-Gang

On Jun 29, 2014, at 8:17 AM, xwei <we...@gmail.com> wrote:

> Hi Gang,
> 
> No worries! 
> 
> I agree LBFGS would converge faster and your test suite is more comprehensive. I'd like to merge my branch with yours.
> 
> I also agree with your point about redundancy. Different GLMs usually differ only in the gradient calculation, while the *Regression.scala files are otherwise quite similar. For example, linearRegressionSGD, logisticRegressionSGD, RidgeRegressionSGD, and poissonRegressionSGD all share quite a bit of common code in their class implementations. Since this redundancy already exists in the legacy code, merging only Poisson and Gamma would not help much. So I suggest we leave them as separate classes for the time being.
> 
> 
> Best regards,
> 
> Xiaokai

Re: Contributing to MLlib on GLM

Posted by xwei <we...@gmail.com>.

Hi Gang,

No worries! 

I agree LBFGS would converge faster and your test suite is more comprehensive. I'd like to merge my branch with yours.

I also agree with your point about redundancy. Different GLMs usually differ only in the gradient calculation, while the *Regression.scala files are otherwise quite similar. For example, linearRegressionSGD, logisticRegressionSGD, RidgeRegressionSGD, and poissonRegressionSGD all share quite a bit of common code in their class implementations. Since this redundancy already exists in the legacy code, merging only Poisson and Gamma would not help much. So I suggest we leave them as separate classes for the time being.
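
The observation that the regression classes differ only in their gradient can be sketched as a single training loop that takes the gradient as a parameter. The code below is a minimal plain-Python illustration, not the MLlib implementation; `sgd_train`, `least_squares_gradient`, and all parameter names are hypothetical, and sampling one example per iteration is a simplifying assumption.

```python
import math
import random


def sgd_train(data, gradient_fn, num_features, step_size=0.1, num_iters=200, seed=0):
    """Generic SGD loop; only `gradient_fn` changes from one GLM to another.

    data: list of (features, label) pairs.
    gradient_fn(features, label, weights) -> per-example gradient as a list.
    """
    rng = random.Random(seed)
    weights = [0.0] * num_features
    for iteration in range(1, num_iters + 1):
        features, label = rng.choice(data)          # one example per step
        grad = gradient_fn(features, label, weights)
        eta = step_size / math.sqrt(iteration)      # assumed decaying step size
        weights = [w - eta * g for w, g in zip(weights, grad)]
    return weights


def least_squares_gradient(features, label, weights):
    """Gradient of 0.5 * (w . x - y)^2, i.e. the linear regression case."""
    error = sum(w * x for w, x in zip(weights, features)) - label
    return [error * x for x in features]


data = [([1.0], 2.0), ([2.0], 4.0)]  # noise-free samples of y = 2x
weights = sgd_train(data, least_squares_gradient, num_features=1, step_size=0.2)
print(weights)  # close to [2.0]
```

Under this design a Poisson or Gamma variant would supply a different `gradient_fn` and reuse the loop unchanged, which is the redundancy the paragraph above describes.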


Best regards,

Xiaokai

On Jun 27, 2014, at 6:45 PM, Gang Bai [via Apache Spark Developers List] wrote:

> Hi Xiaokai, 
> 
> My bad. I didn't notice this thread before I created another PR for Poisson regression. The emails were filed as junk by the corporate mail system. Also, thanks for considering my comments and advice in your PR.
> 
> Adding my two cents here: 
> 
> * PoissonRegressionModel and GammaRegressionModel have the same fields and prediction method. Shall we use one class instead of two redundant ones? Say, a LogLinearModel.
> * The LBFGS optimizer takes fewer iterations and converges better than SGD. I implemented two GeneralizedLinearAlgorithm classes, using LBFGS and SGD respectively. You may take a look at them. If it's OK with you, I'd be happy to send a PR to your branch.
> * In addition to the generated test data, we may use some real-world data for testing. In my implementation, I added the test data from https://onlinecourses.science.psu.edu/stat504/node/223. Please check my test suite.
> 
> -Gang 
> Sent from my iPad 

Re: Contributing to MLlib on GLM

Posted by 白刚 <ba...@staff.sina.com.cn>.

Hi Xiaokai,

My bad. I didn't notice this thread before I created another PR for Poisson regression. The emails were filed as junk by the corporate mail system. Also, thanks for considering my comments and advice in your PR.

Adding my two cents here:

* PoissonRegressionModel and GammaRegressionModel have the same fields and prediction method. Shall we use one class instead of two redundant ones? Say, a LogLinearModel.
* The LBFGS optimizer takes fewer iterations and converges better than SGD. I implemented two GeneralizedLinearAlgorithm classes, using LBFGS and SGD respectively. You may take a look at them. If it's OK with you, I'd be happy to send a PR to your branch.
* In addition to the generated test data, we may use some real-world data for testing. In my implementation, I added the test data from https://onlinecourses.science.psu.edu/stat504/node/223. Please check my test suite.
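
The first point above, that the two models share fields and a prediction method, can be sketched as one class whose prediction is exp(w . x + intercept). This is a minimal plain-Python illustration under the assumption that both models use a log link; the class below only borrows the name LogLinearModel from the suggestion and is not the code from either PR.

```python
import math


class LogLinearModel:
    """Hypothetical shared model for GLMs with a log link (e.g. Poisson, Gamma).

    The predicted mean is exp(w . x + intercept).
    """

    def __init__(self, weights, intercept=0.0):
        self.weights = list(weights)
        self.intercept = intercept

    def predict(self, features):
        margin = sum(w * x for w, x in zip(self.weights, features)) + self.intercept
        return math.exp(margin)


model = LogLinearModel([0.5, -0.25], intercept=1.0)
# margin = 0.5 * 2.0 - 0.25 * 4.0 + 1.0 = 1.0, so the prediction is exp(1.0)
print(model.predict([2.0, 4.0]))
```

Which distribution produced the weights does not matter at prediction time in this sketch; the difference between the two regressions lives entirely in how the weights are fitted.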

-Gang
Sent from my iPad

> On June 27, 2014, at 6:03 PM, "xwei" <we...@gmail.com> wrote:
> 
> 
> Yes, that's what we did: we added two gradient functions to Gradient.scala
> and created PoissonRegression and GammaRegression using these gradients. We
> have opened a PR for this.

Re: Contributing to MLlib on GLM

Posted by xwei <we...@gmail.com>.

Yes, that's what we did: we added two gradient functions to Gradient.scala
and created PoissonRegression and GammaRegression using these gradients. We
have opened a PR for this.
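
To illustrate the kind of gradient function described above, here is a minimal plain-Python sketch of a Gamma regression loss and gradient. It assumes a log link (mean = exp(w . x)) and drops the shape parameter and constant terms; the function name `gamma_loss_and_gradient` is hypothetical, and the gradient in the actual PR may be parameterized differently.

```python
import math


def gamma_loss_and_gradient(features, label, weights):
    """Per-example Gamma negative log-likelihood under an assumed log link.

    With margin m = w . x and mean mu = exp(m), the loss (up to constants and
    the shape parameter) is y / mu + m, and its gradient is (1 - y / mu) * x.
    """
    margin = sum(w * x for w, x in zip(weights, features))
    mu = math.exp(margin)
    loss = label / mu + margin
    grad = [(1.0 - label / mu) * x for x in features]
    return loss, grad
```

The gradient vanishes when the predicted mean equals the label, which is a quick sanity check when wiring a new gradient into an optimizer.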




Re: Contributing to MLlib on GLM

Posted by Sung Hwan Chung <co...@cs.stanford.edu>.

Well, as you said, MLlib already supports GLMs in a sense, except that it
only supports two link functions: identity (linear regression) and logit
(logistic regression). It should not be too hard to add other link
functions, as all you have to do is add a different gradient function for
Poisson, Gamma, etc. Take a look at Gradient.scala in mllib.
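
The point about link functions can be made concrete: for GLMs with a canonical link, the per-example gradient of the negative log-likelihood has the form (mean - label) * x, and only the inverse link that computes the mean changes. The sketch below is plain Python, not the contents of Gradient.scala; `INVERSE_LINKS` and `canonical_glm_gradient` are hypothetical names, and unit dispersion is assumed.

```python
import math

# Inverse link (mean) functions for three canonical-link GLMs.
INVERSE_LINKS = {
    "identity": lambda m: m,                        # linear regression
    "logit": lambda m: 1.0 / (1.0 + math.exp(-m)),  # logistic regression
    "log": lambda m: math.exp(m),                   # Poisson regression
}


def canonical_glm_gradient(link, features, label, weights):
    """Gradient of the per-example negative log-likelihood: (mean - y) * x."""
    margin = sum(w * x for w, x in zip(weights, features))
    mean = INVERSE_LINKS[link](margin)
    return [(mean - label) * x for x in features]


# With zero weights the Poisson mean is exp(0) = 1, so the gradient is (1 - 3) * x.
print(canonical_glm_gradient("log", [1.0, 2.0], 3.0, [0.0, 0.0]))
```

Gamma regression with a log link does not fit this table, because the canonical link for the Gamma family is the inverse link, so it needs its own gradient function.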


Re: Contributing to MLlib on GLM

Posted by Andrew Ash <an...@andrewash.com>.

Hi Xiaokai,

Also take a look through Xiangrui's slides from HadoopSummit a few weeks
back: http://www.slideshare.net/xrmeng/m-llib-hadoopsummit  The roadmap
starting at slide 51 will probably be interesting to you.

Andrew


On Tue, Jun 17, 2014 at 7:37 PM, Sandy Ryza <sa...@cloudera.com> wrote:

> Hi Xiaokai,
>
> I think MLlib is definitely interested in supporting additional GLMs.  I'm
> not aware of anybody working on this at the moment.
>
> -Sandy

Re: Contributing to MLlib on GLM

Posted by Sandy Ryza <sa...@cloudera.com>.

Hi Xiaokai,

I think MLlib is definitely interested in supporting additional GLMs.  I'm
not aware of anybody working on this at the moment.

-Sandy

