Posted to dev@mahout.apache.org by Saikat Kanjilal <sx...@hotmail.com> on 2017/02/09 20:59:35 UTC

Contributing an algorithm for samsara

Trevor et al,

I'd like to contribute an algorithm or two to Samsara using Spark, as I would like to compare and contrast Mahout with R Server for a data science pipeline / machine learning repo that I'm working on. Looking at the list of algorithms (https://mahout.apache.org/users/basics/algorithms.html), is there an algorithm for Spark that would be beneficial to the community? My use cases are typically around clustering or real-time machine learning for building recommendations on the fly. The algorithms I see that could potentially be useful are: 1) Matrix factorization with ALS, 2) Logistic regression with SVD.
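For concreteness, the ALS option in 1) can be sketched in a few lines of plain Scala. This is a hypothetical toy (the `ToyALS` name and everything in it are invented for illustration, not the Mahout/Samsara API): alternately hold one factor matrix fixed and re-fit every row of the other by ridge-regularized least squares.

```scala
// Toy ALS: factor a dense m x n ratings matrix R into U (m x k) and V (n x k),
// minimizing ||R - U V^T||^2 + lambda (||U||^2 + ||V||^2).
object ToyALS {
  type Mat = Array[Array[Double]]

  // Gaussian elimination for the small k x k normal-equation systems.
  def solve(a0: Mat, b0: Array[Double]): Array[Double] = {
    val n = b0.length
    val a = a0.map(_.clone); val b = b0.clone
    for (p <- 0 until n; r <- p + 1 until n) {
      val f = a(r)(p) / a(p)(p)
      for (c <- p until n) a(r)(c) -= f * a(p)(c)
      b(r) -= f * b(p)
    }
    val x = new Array[Double](n)
    for (r <- n - 1 to 0 by -1) {
      var s = b(r)
      for (c <- r + 1 until n) s -= a(r)(c) * x(c)
      x(r) = s / a(r)(r)
    }
    x
  }

  // With V fixed, each row u_i solves (V^T V + lambda I) u_i = V^T r_i.
  def updateFactors(r: Mat, v: Mat, k: Int, lambda: Double): Mat =
    r.map { row =>
      val a = Array.tabulate(k, k) { (p, q) =>
        v.map(vj => vj(p) * vj(q)).sum + (if (p == q) lambda else 0.0)
      }
      val b = Array.tabulate(k)(p => v.indices.map(j => v(j)(p) * row(j)).sum)
      solve(a, b)
    }

  def factorize(r: Mat, k: Int, lambda: Double, iters: Int): (Mat, Mat) = {
    val rng = new scala.util.Random(42)
    var u = Array.fill(r.length, k)(rng.nextDouble())
    var v = Array.fill(r(0).length, k)(rng.nextDouble())
    val rT = Array.tabulate(r(0).length, r.length)((j, i) => r(i)(j))
    for (_ <- 0 until iters) {
      u = updateFactors(r, v, k, lambda)   // fix V, refit user rows
      v = updateFactors(rT, u, k, lambda)  // fix U, refit item rows
    }
    (u, v)
  }
}
```

A real Samsara version would express the same normal equations with the distributed algebra DSL over DRMs instead of local arrays; the alternation structure stays the same.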




Any thoughts/guidance or recommendations would be very helpful.
Thanks in advance.

Re: Contributing an algorithm for samsara

Posted by Saikat Kanjilal <sx...@hotmail.com>.
To start this off, I figure we should spend some time understanding the current implementations and theory before we dig into implementing this in Mahout:


1) https://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/



2) https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala




3) https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/decompositions/ALS.scala


4) https://datasciencemadesimpler.wordpress.com/tag/alternating-least-squares/




Jim, I would suggest we spend some time researching and digging into these resources and circle back next week to get this off the ground; let me know if you want to meet offline as well. I would recommend the next step be a design proposal to the dev list describing how the implementation will fit into the current Samsara algorithms. What do you think?

Regards

________________________________
From: Jim Jagielski <ji...@jaguNET.com>
Sent: Friday, February 17, 2017 8:18 AM
To: dev@mahout.apache.org
Subject: Re: Contributing an algorithm for samsara

Sounds good to me. +1

> On Feb 17, 2017, at 11:15 AM, Saikat Kanjilal <sx...@hotmail.com> wrote:
>
> Jim,
> What do you say we start with ALS and then tackle glm?
>
>
> Sent from my iPhone
>
>> On Feb 17, 2017, at 6:56 AM, Trevor Grant <tr...@gmail.com> wrote:
>>
>> Jim is right, and I would take it one further and say, it would be best to
>> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
>> from there a Logistic regression is a trivial extension.
>>
>> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping
>> in neck first for both Jim and Saikat...
>>
>> MAHOUT-1928 and MAHOUT-1929
>>
>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
>>
>> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
>> in there.
>>
>> If you have an algorithm you are particularly intimate with, or explicitly
>> need/want- feel free to open a JIRA and assign to yourself.
>>
>> There is also a case to be made for implementing the ALS...
>>
>> 1) It's a much better 'beginner' project.
>> 2) Mahout has some world-class recommenders; a toy ALS implementation might
>> help us think through how the other recommenders (e.g. CCO) will 'fit' into
>> the framework. E.g. ALS being the toy-prototype recommender that helps us
>> think through building out that section of the framework.
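Trevor's point that logistic regression falls out of a GLM almost for free can be illustrated with a toy sketch (plain Scala, invented `ToyGLM` name, not a Mahout API): only the inverse link (mean function) changes between families; the fitting loop does not. For simplicity this fits by full-batch gradient descent on the canonical link rather than IRLS.

```scala
// For canonical links the gradient of the negative log-likelihood is
// X^T (mu - y) regardless of family, so one loop serves linear and logistic.
object ToyGLM {
  def fit(x: Array[Array[Double]], y: Array[Double],
          invLink: Double => Double, lr: Double, iters: Int): Array[Double] = {
    val w = new Array[Double](x(0).length)
    for (_ <- 0 until iters) {
      val grad = new Array[Double](w.length)
      for (i <- x.indices) {
        val eta = w.indices.map(j => w(j) * x(i)(j)).sum  // linear predictor
        val err = invLink(eta) - y(i)                     // mu - y
        for (j <- w.indices) grad(j) += err * x(i)(j)
      }
      for (j <- w.indices) w(j) -= lr * grad(j) / x.length
    }
    w
  }
  // Gaussian family, identity link: ordinary linear regression.
  val identityLink: Double => Double = eta => eta
  // Binomial family, logit link: logistic regression.
  val logitInverse: Double => Double = eta => 1.0 / (1.0 + math.exp(-eta))
}
```

Swapping `identityLink` for `logitInverse` is the entire "extension" from linear to logistic; other families (Poisson, etc.) would slot in the same way.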
>>
>>
>>
>> Trevor Grant
>> Data Scientist
>> https://github.com/rawkintrevo
>> http://stackexchange.com/users/3002022/rawkintrevo
>> http://trevorgrant.org
>>
>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>
>>
>>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>>>
>>> My own thoughts are that logistic regression seems a more "generalized"
>>> and hence more useful algo to be factored in... At least in the
>>> use cases that I've been toying with.
>>>
>>> So I'd like to help out with that if wanted...
>>>
>>>> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
>>>>
>>>> Trevor et al,
>>>>
>>>> I'd like to contribute an algorithm or two in samsara using spark as I
>>> would like to do a compare and contrast with mahout with R server for a
>>> data science pipeline, machine learning repo that I'm working on, in
>>> looking at the list of algorithms (https://mahout.apache.org/
>>> users/basics/algorithms.html) is there an algorithm for spark that would
>>> be beneficial for the community, my use cases would typically be around
>>> clustering or real time machine learning for building recommendations on
>>> the fly.    The algorithms I see that could potentially be useful are: 1)
>>> Matrix Factorization with ALS 2) Logistic regression with SVD.
>>>>
>>>>
>>>>
>>>>
>>>> Any thoughts/guidance or recommendations would be very helpful.
>>>> Thanks in advance.
>>>
>>>


Re: Contributing an algorithm for samsara

Posted by Jim Jagielski <ji...@jaguNET.com>.
Sounds good to me. +1

> On Feb 17, 2017, at 11:15 AM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> 
> Jim,
> What do you say we start with ALS and then tackle glm?
> 
> 
> Sent from my iPhone
> 
>> On Feb 17, 2017, at 6:56 AM, Trevor Grant <tr...@gmail.com> wrote:
>> 
>> Jim is right, and I would take it one further and say, it would be best to
>> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
>> from there a Logistic regression is a trivial extension.
>> 
>> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping
>> in neck first for both Jim and Saikat...
>> 
>> MAHOUT-1928 and MAHOUT-1929
>> 
>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
>> 
>> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
>> in there.
>> 
>> If you have an algorithm you are particularly intimate with, or explicitly
>> need/want- feel free to open a JIRA and assign to yourself.
>> 
>> There is also a case to be made for implementing the ALS...
>> 
>> 1) It's a much better 'beginner' project.
>> 2) Mahout has some world class Recommenders, a toy ALS implementation might
>> help us think through how the other reccomenders (e.g. CCO) will 'fit' into
>> the framework. E.g. ALS being the toy-prototype reccomender that helps us
>> think through building out that section of the framework.
>> 
>> 
>> 
>> Trevor Grant
>> Data Scientist
>> https://github.com/rawkintrevo
>> http://stackexchange.com/users/3002022/rawkintrevo
>> http://trevorgrant.org
>> 
>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>> 
>> 
>>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>>> 
>>> My own thoughts are that logistic regression seems a more "generalized"
>>> and hence more useful algo to be factored in... At least in the
>>> use cases that I've been toying with.
>>> 
>>> So I'd like to help out with that if wanted...
>>> 
>>>> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
>>>> 
>>>> Trevor et al,
>>>> 
>>>> I'd like to contribute an algorithm or two in samsara using spark as I
>>> would like to do a compare and contrast with mahout with R server for a
>>> data science pipeline, machine learning repo that I'm working on, in
>>> looking at the list of algorithms (https://mahout.apache.org/
>>> users/basics/algorithms.html) is there an algorithm for spark that would
>>> be beneficial for the community, my use cases would typically be around
>>> clustering or real time machine learning for building recommendations on
>>> the fly.    The algorithms I see that could potentially be useful are: 1)
>>> Matrix Factorization with ALS 2) Logistic regression with SVD.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Any thoughts/guidance or recommendations would be very helpful.
>>>> Thanks in advance.
>>> 
>>> 


Re: Contributing an algorithm for samsara

Posted by Saikat Kanjilal <sx...@hotmail.com>.
Jim,
What do you say we start with ALS and then tackle glm?


Sent from my iPhone

> On Feb 17, 2017, at 6:56 AM, Trevor Grant <tr...@gmail.com> wrote:
> 
> Jim is right, and I would take it one further and say, it would be best to
> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
> from there a Logistic regression is a trivial extension.
> 
> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping
> in neck first for both Jim and Saikat...
> 
> MAHOUT-1928 and MAHOUT-1929
> 
> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
> 
> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
> in there.
> 
> If you have an algorithm you are particularly intimate with, or explicitly
> need/want- feel free to open a JIRA and assign to yourself.
> 
> There is also a case to be made for implementing the ALS...
> 
> 1) It's a much better 'beginner' project.
> 2) Mahout has some world class Recommenders, a toy ALS implementation might
> help us think through how the other reccomenders (e.g. CCO) will 'fit' into
> the framework. E.g. ALS being the toy-prototype reccomender that helps us
> think through building out that section of the framework.
> 
> 
> 
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
> 
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> 
> 
>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>> 
>> My own thoughts are that logistic regression seems a more "generalized"
>> and hence more useful algo to be factored in... At least in the
>> use cases that I've been toying with.
>> 
>> So I'd like to help out with that if wanted...
>> 
>>> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
>>> 
>>> Trevor et al,
>>> 
>>> I'd like to contribute an algorithm or two in samsara using spark as I
>> would like to do a compare and contrast with mahout with R server for a
>> data science pipeline, machine learning repo that I'm working on, in
>> looking at the list of algorithms (https://mahout.apache.org/
>> users/basics/algorithms.html) is there an algorithm for spark that would
>> be beneficial for the community, my use cases would typically be around
>> clustering or real time machine learning for building recommendations on
>> the fly.    The algorithms I see that could potentially be useful are: 1)
>> Matrix Factorization with ALS 2) Logistic regression with SVD.
>>> 
>>> 
>>> 
>>> 
>>> Any thoughts/guidance or recommendations would be very helpful.
>>> Thanks in advance.
>> 
>> 

Re: Contributing an algorithm for samsara

Posted by Jim Jagielski <ji...@jaguNET.com>.
Apologies for letting this slide... way too much life got in the way :)

> On Mar 3, 2017, at 3:36 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> And by formula yes i mean R syntax.
> 
> possible use case would be to take Spark DataFrame and formula (say, `age ~
> . -1`) and produce outputs of DrmLike[Int] (a distributed matrix type) that
> converts into predictors and target.
> 
> In this particular case, this formula means that the predictor matrix (X)
> would have all original variables except `age` (for categorical variables
> factor extraction is applied), with no bias column.
> 
> Some knowledge of R and SAS is required to pin the compatibility nuances
> there.
> 
> Maybe we could have reasonable simplifications or omissions compared to R
> stuff, if we can be reasonably convinced it is actually better that way
> than vanilla R contract, but IMO it would be really useful to retain 100%
> compatibility there since it is one of ideas there -- retain R-like-ness
> with these things.
> 
> 
> On Fri, Mar 3, 2017 at 12:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
>> 
>> 
>> On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>>> 
>>>> 
>>>> 
>>> 
>>>>> 
>>>>> 3) On the feature extraction per R like formula can you elaborate more
>>>> here, are you talking about feature extraction using R like dataframes and
>>>> operators?
>>>> 
>>> 
>>> 
>> Yes. I would start doing generic formula parser and then specific part
>> that works with backend-speicifc data frames. For spark, i don't see any
>> reason to write our own; we'd just had an adapter for the Spark native data
>> frames.
>> 


Re: Contributing an algorithm for samsara

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
And by formula, yes, I mean R syntax.

A possible use case would be to take a Spark DataFrame and a formula (say, `age ~
. -1`) and produce DrmLike[Int] outputs (a distributed matrix type) that
convert into predictors and a target.

In this particular case, this formula means that the predictor matrix (X)
would have all original variables except `age` (for categorical variables
factor extraction is applied), with no bias column.

Some knowledge of R and SAS is required to pin the compatibility nuances
there.

Maybe we could have reasonable simplifications or omissions compared to the R
behavior, if we can be reasonably convinced it is actually better that way
than the vanilla R contract, but IMO it would be really useful to retain 100%
compatibility there, since retaining R-like-ness is one of the ideas behind
these things.
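As a rough illustration of that contract (toy Scala over an in-memory column map; `ToyFormula` and its scope are invented for illustration and sidestep DataFrames, DrmLike, and most of R's formula grammar): parse `lhs ~ rhs`, honor `.` and `-1`, and one-hot expand string columns with a dropped first level as a simplified treatment contrast.

```scala
// Hypothetical formula -> design-matrix sketch: returns (target, rows of X,
// expanded column names). Not the R contract in full, just its shape.
object ToyFormula {
  def designMatrix(frame: Map[String, Seq[Any]], formula: String)
      : (Seq[Double], Seq[Seq[Double]], Seq[String]) = {
    val parts = formula.split("~").map(_.trim)
    val (lhs, rhs) = (parts(0), parts(1))
    val noIntercept = rhs.split("\\s+").contains("-1")
    val predictors =
      if (rhs.startsWith(".")) frame.keys.filterNot(_ == lhs).toSeq.sorted
      else rhs.split("\\+").map(_.trim).filterNot(c => c == "-1" || c.isEmpty).toSeq
    val y = frame(lhs).map(_.toString.toDouble)
    // Numeric columns pass through; string columns are one-hot encoded,
    // dropping the first level (simplified treatment contrast).
    val cols: Seq[(String, Seq[Double])] = predictors.flatMap { name =>
      frame(name) match {
        case vs if vs.head.isInstanceOf[String] =>
          val levels = vs.map(_.toString).distinct.sorted.drop(1)
          levels.map(l => (s"$name=$l", vs.map(v => if (v.toString == l) 1.0 else 0.0)))
        case vs => Seq((name, vs.map(_.toString.toDouble)))
      }
    }
    val all = if (noIntercept) cols else ("(Intercept)", Seq.fill(y.length)(1.0)) +: cols
    val x = y.indices.map(i => all.map(_._2(i)))
    (y, x, all.map(_._1))
  }
}
```

So `age ~ . -1` over columns {age, height, sex} yields a target from `age` and a bias-free predictor matrix over `height` plus the expanded `sex` factor, matching the behavior described above in miniature.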


On Fri, Mar 3, 2017 at 12:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

>
>
> On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>>
>>>
>>>
>>
>>> >
>>> > 3) On the feature extraction per R like formula can you elaborate more
>>> here, are you talking about feature extraction using R like dataframes and
>>> operators?
>>>
>>
>>
> Yes. I would start doing generic formula parser and then specific part
> that works with backend-speicifc data frames. For spark, i don't see any
> reason to write our own; we'd just had an adapter for the Spark native data
> frames.
>

Re: Contributing an algorithm for samsara

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>
>>
>>
>
>> >
>> > 3) On the feature extraction per R like formula can you elaborate more
>> here, are you talking about feature extraction using R like dataframes and
>> operators?
>>
>
>
Yes. I would start with a generic formula parser and then the specific part that
works with backend-specific data frames. For Spark, I don't see any reason
to write our own; we'd just have an adapter for the Spark-native data
frames.

Re: Contributing an algorithm for samsara

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I am getting a little bit lost about who asked what here; replies are inline.

On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <ji...@jagunet.com> wrote:

>
>
> Would it make sense to keep them as-is, and "pull them out", as
> it were, should they prove to be wanted/needed by the other algo users?
>

I would hope it is of some help (especially the math and the in-memory prototype)
as something to look back to. Still, I would try to plot it all anew; I find it
usually helps my focus to work with my own code from the ground up.

So no, I would not just take it as is, not without careful review.

Also, if you noticed, the distributed version is quasi-algebraic, i.e., it
contains direct Spark dependencies and code that relies on Spark. As such,
it cannot be put into our decompositions package in the mahout-math-scala
module, where most of the other distributed decompositions sit.

I suspect it could be made 100% algebraic with the current primitives available
in Samsara. This is a necessary condition for getting it into mahout-math-scala.
If it can't be done, then it has to live in the mahout-spark module as a
single-backend implementation only.


>
> >
> > 3) On the feature extraction per R like formula can you elaborate more
> here, are you talking about feature extraction using R like dataframes and
> operators?
>


> >
> >
> >
> > More later as I read through the papers.
>

I would really start there before anything else. (Moreover, this is the
most fun part of all of it, as far as I am concerned :) ).

Also, my adapted formulas are attached to the issue, as I mentioned. I
would look through the math to check that it is clear (for interpretation);
if not, let's discuss any questions.


> >

Re: Contributing an algorithm for samsara

Posted by Jim Jagielski <ji...@jaguNET.com>.
> On Feb 25, 2017, at 5:41 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> 
> Dmitry,
> 
> I have skimmed through the current samsara implementation and your input below and have some initial questions, for starters I would like to take advantage of the work you've already done and bring that into production state

+1. It looks v. impressive.

> , given that, here are some thoughts/questions:
> 
> 
> 1) What work does the pull request below still need done, unit tests, integration tests , seems like the implementation is complete from reading the code but I'm coming into this new so not sure here?
> 
> 2) It seems to be that your points 2 and 3 could be written as generic mahout modules that can be used by all algorithms as appropriate, what do you think?

Would it make sense to keep them as-is, and "pull them out", as
it were, should they prove to be wanted/needed by the other algo users?

> 
> 3) On the feature extraction per R like formula can you elaborate more here, are you talking about feature extraction using R like dataframes and operators?
> 
> 
> 
> More later as I read through the papers.
> 
> 
> ________________________________
> From: Dmitriy Lyubimov <dl...@gmail.com>
> Sent: Friday, February 17, 2017 1:45 PM
> To: dev@mahout.apache.org
> Subject: Re: Contributing an algorithm for samsara
> 
> in particular, this is the samsara implementation of double-weighed als :
> https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626
> 
> 
> 
> 
> 
> On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
>> Jim,
>> 
>> if ALS is of interest, and as far as weighed ALS is concerned (since we
>> already have trivial regularized ALS in the "decompositions" package),
>> here's uncommitted samsara-compatible patch from a while back:
>> https://issues.apache.org/jira/browse/MAHOUT-1365
> 
> 
> 
>> 
>> it combines weights on both data points (a.k.a "implicit feedback" als)
>> and regularization rates  (paper references are given). We combine both
>> approaches in one (which is novel, i guess, but yet simple enough).
>> Obviously the final solver can also be used as pure reg rate regularized if
>> wanted, making it equivalent to one of the papers.
>> 
>> You may know implicit feedback paper from mllib's implicit als, but unlike
>> it was done over there (as a use case sort problem that takes input before
>> even features were extracted), we split the problem into pure algebraic
>> solver (double-weighed ALS math) and leave the feature extraction outside
>> of this issue per se (it can be added as a separate adapter).
>> 
>> The reason for that is that the specific use-case oriented implementation
>> does not necessarily leave the space for feature extraction that is
>> different from the described use case of partially consumed streamed videos in
>> the paper. (E.g., instead of videos one could count visits or clicks or
>> add-to-cart events, which may need an additional hyperparameter found for them
>> as part of feature extraction and converting observations into "weights").
>> 
>> The biggest problem with these ALS methods, however, is that all
>> hyperparameters require multidimensional cross-validation and optimization.
>> I think I mentioned it before in the list of desired solutions; as it stands,
>> Mahout does not have a hyperparameter fitting routine.
>> 
>> In practice, when using these kinds of ALS, we have a case of
>> multidimensional hyperparameter optimization. One of them comes from the
>> fitter (reg rate, or base reg rate in the case of weighted regularization), and
>> the others come from the feature extraction process. E.g., in the original paper
>> they introduce (at least) 2 formulas to extract measure weights from the
>> streaming video observations, and each of them had one parameter, alpha,
>> which in the context of the whole problem becomes effectively yet another
>> hyperparameter to fit. In other use cases when your confidence measurement
>> may be coming from different sources and observations, the confidence
>> extraction may actually have even more hyperparameters to fit than just
>> one. And when we have a multidimensional case, simple approaches (like grid
>> or random search) become either cost prohibitive or ineffective, due to the
>> curse of dimensionality.
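For reference, the baseline that a Bayesian optimizer would improve on looks roughly like this (toy Scala, invented names; the loss function stands in for "fit the model with these hyperparameters and score held-out data"):

```scala
// Random search over two hyperparameters, sampled log-uniformly, the usual
// choice for rates that span several orders of magnitude.
object ToySearch {
  def randomSearch(loss: (Double, Double) => Double,
                   trials: Int, seed: Long): (Double, Double, Double) = {
    val rng = new scala.util.Random(seed)
    val tried = Seq.fill(trials) {
      val regRate = math.pow(10, -4 + 4 * rng.nextDouble()) // 1e-4 .. 1
      val alpha   = math.pow(10,      2 * rng.nextDouble()) // 1 .. 100
      (regRate, alpha, loss(regRate, alpha))
    }
    tried.minBy(_._3)  // best (regRate, alpha, loss) triple
  }
}
```

In two dimensions this is workable; as the quoted discussion notes, each extra hyperparameter multiplies the volume to cover, which is exactly why a model-based (e.g. Bayesian) optimizer becomes attractive.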
>> 
>> At the time i was contributing that method, i was using it in conjunction
>> with multidimensional bayesian optimizer, but the company that i wrote it
>> for did not have it approved for contribution (unlike weighed als) at that
>> time.
>> 
>> Anyhow, perhaps you could read the algebra in both ALS papers there and
>> ask questions, and we could worry about hyperparameter optimization a bit
>> later and performance a bit later.
>> 
>> On the feature extraction front (as in implicit feedback als per Koren
>> etc.), this is an ideal use case for more general R-like formula approach,
>> which is also on desired list of things to have.
>> 
>> So i guess we have 3 problems really here:
>> (1) double-weighed ALS
>> (2) bayesian optimization and crossvalidation in an n-dimensional
>> hyperparameter space
>> (3) feature extraction per (preferably R-like) formula.
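The core of problem (1) for a single user row can be written down directly: with a per-observation confidence c_j, the regularized normal equations become (VᵀCV + λI)u = VᵀCp. A toy fixed-rank (k = 2) sketch in plain Scala, with invented names, not the patch's API:

```scala
// One weighted row-solve: every term of the ridge system picks up the
// confidence weight c(j) of its observation (Hu-Koren-Volinsky style).
object WeightedALSRow {
  def solveUser(v: Array[Array[Double]],  // item factors, n x 2
                p: Array[Double],         // this user's preferences, length n
                c: Array[Double],         // confidence weights, length n
                lambda: Double): Array[Double] = {
    var a11 = lambda; var a12 = 0.0; var a22 = lambda
    var b1 = 0.0; var b2 = 0.0
    for (j <- v.indices) {
      a11 += c(j) * v(j)(0) * v(j)(0)
      a12 += c(j) * v(j)(0) * v(j)(1)
      a22 += c(j) * v(j)(1) * v(j)(1)
      b1  += c(j) * v(j)(0) * p(j)
      b2  += c(j) * v(j)(1) * p(j)
    }
    val det = a11 * a22 - a12 * a12   // 2x2 solve via Cramer's rule
    Array((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)
  }
}
```

With all confidences equal to one this reduces to the plain regularized ALS row-solve; the "double-weighted" variant in the patch additionally varies the regularization rate itself.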
>> 
>> 
>> -d
>> 
>> 
>> On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo <ap...@outlook.com>
>> wrote:
>> 
>>> +1 to glms
>>> 
>>> 
>>> 
>>> Sent from my Verizon Wireless 4G LTE smartphone
>>> 
>>> 
>>> -------- Original message --------
>>> From: Trevor Grant <tr...@gmail.com>
>>> Date: 02/17/2017 6:56 AM (GMT-08:00)
>>> To: dev@mahout.apache.org
>>> Subject: Re: Contributing an algorithm for samsara
>>> 
>>> Jim is right, and I would take it one further and say, it would be best to
>>> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
>>> from there a Logistic regression is a trivial extension.
>>> 
>>> Buyer beware- GLMs will be a bit of work- doable, but that would be
>>> jumping
>>> in neck first for both Jim and Saikat...
>>> 
>>> MAHOUT-1928 and MAHOUT-1929
>>> 
>>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=projec
>>> t%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%
>>> 20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%
>>> 20priority%20DESC%2C%20created%20ASC
>>> 
>>> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs
>>> are
>>> in there.
>>> 
>>> If you have an algorithm you are particularly intimate with, or explicitly
>>> need/want- feel free to open a JIRA and assign to yourself.
>>> 
>>> There is also a case to be made for implementing the ALS...
>>> 
>>> 1) It's a much better 'beginner' project.
>>> 2) Mahout has some world class Recommenders, a toy ALS implementation
>>> might
>>> help us think through how the other reccomenders (e.g. CCO) will 'fit'
>>> into
>>> the framework. E.g. ALS being the toy-prototype reccomender that helps us
>>> think through building out that section of the framework.
>>> 
>>> 
>>> 
>>> Trevor Grant
>>> Data Scientist
>>> https://github.com/rawkintrevo
>>> http://stackexchange.com/users/3002022/rawkintrevo
>>> http://trevorgrant.org
>>> 
>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>> 
>>> 
>>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>>> 
>>>> My own thoughts are that logistic regression seems a more "generalized"
>>>> and hence more useful algo to be factored in... At least in the
>>>> use cases that I've been toying with.
>>>> 
>>>> So I'd like to help out with that if wanted...
>>>> 
>>>>> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sx...@hotmail.com>
>>> wrote:
>>>>> 
>>>>> Trevor et al,
>>>>> 
>>>>> I'd like to contribute an algorithm or two in samsara using spark as I
>>>> would like to do a compare and contrast with mahout with R server for a
>>>> data science pipeline, machine learning repo that I'm working on, in
>>>> looking at the list of algorithms (https://mahout.apache.org/
>>>> users/basics/algorithms.html) is there an algorithm for spark that would
>>>> be beneficial for the community, my use cases would typically be around
>>>> clustering or real time machine learning for building recommendations on
>>>> the fly.    The algorithms I see that could potentially be useful are:
>>> 1)
>>>> Matrix Factorization with ALS 2) Logistic regression with SVD.
>>>>> 
>>>>> Apache Mahout: Scalable machine learning and data mining<
>>>> https://mahout.apache.org/users/basics/algorithms.html>
>>>>> mahout.apache.org
>>>>> Mahout 0.12.0 Features by Engine¶ Single Machine MapReduce Spark H2O
>>>> Flink; Mahout Math-Scala Core Library and Scala DSL
>>>>> 
>>>>> 
>>>>> 
>>>>> Any thoughts/guidance or recommendations would be very helpful.
>>>>> Thanks in advance.


Re: Contributing an algorithm for samsara

Posted by Saikat Kanjilal <sx...@hotmail.com>.
Dmitry,

I have skimmed through the current Samsara implementation and your input below and have some initial questions. For starters, I would like to take advantage of the work you've already done and bring it into production state. Given that, here are some thoughts/questions:

1) What work does the pull request below still need: unit tests, integration tests? From reading the code the implementation seems complete, but I'm coming into this new, so I'm not sure.

2) It seems that your points 2 and 3 could be written as generic Mahout modules usable by all algorithms as appropriate; what do you think?

3) On the feature extraction per R-like formula, can you elaborate more here? Are you talking about feature extraction using R-like dataframes and operators?



More later as I read through the papers.


________________________________
From: Dmitriy Lyubimov <dl...@gmail.com>
Sent: Friday, February 17, 2017 1:45 PM
To: dev@mahout.apache.org
Subject: Re: Contributing an algorithm for samsara

in particular, this is the Samsara implementation of double-weighted ALS:
https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626

On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Jim,
>
> if ALS is of interest, and as far as weighted ALS is concerned (since we
> already have trivial regularized ALS in the "decompositions" package),
> here's uncommitted samsara-compatible patch from a while back:
> https://issues.apache.org/jira/browse/MAHOUT-1365
>
> it combines weights on both data points (a.k.a. "implicit feedback" ALS)
> and regularization rates  (paper references are given). We combine both
> approaches in one (which is novel, i guess, but yet simple enough).
> Obviously the final solver can also be used as pure reg rate regularized if
> wanted, making it equivalent to one of the papers.
>
> You may know implicit feedback paper from mllib's implicit als, but unlike
> it was done over there (as a use case sort problem that takes input before
> even features were extracted), we split the problem into pure algebraic
> solver (double-weighted ALS math) and leave the feature extraction outside
> of this issue per se (it can be added as a separate adapter).
>
> The reason for that is that the specific use-case oriented implementation
> does not necessarily leave the space for feature extraction that is
> different from described use case of partially consumed streamed videos in
> the paper. (e.g., instead of videos one could count visits or clicks or
> add-to-cart events which may need additional hyperparameter found for them
> as part of feature extraction and converting observations into "weights").
>
> The biggest problem with these ALS methods however is that all
> hyperparameters require multidimensional crossvalidation and optimization.
> I think i mentioned it before as list of desired solutions, as it stands,
> Mahout does not have a hyperparameter fitting routine.
>
> In practice, when using these kind of ALS, we have a case of
> multidimensional hyperparameter optimization. One of them comes from the
> fitter (reg rate, or base reg rate in case of weighted regularization), and
> the others come from feature extraction process. E.g., in original paper
> they introduce (at least) 2 formulas to extract confidence weights from the
> streaming video observations, and each of them had one parameter, alpha,
> which in context of the whole problem becomes effectively yet another
> hyperparameter to fit. In other use cases when your confidence measurement
> may be coming from different sources and observations, the confidence
> extraction may actually have even more hyperparameters to fit than just
> one. And when we have a multidimensional case, simple approaches (like grid
> or random search) become either cost prohibitive or ineffective, due to the
> curse of dimensionality.
>
> At the time i was contributing that method, i was using it in conjunction
> with multidimensional bayesian optimizer, but the company that i wrote it
> for did not have it approved for contribution (unlike weighted ALS) at that
> time.
>
> Anyhow, perhaps you could read the algebra in both ALS papers there and
> ask questions, and we could worry about hyperparameter optimization a bit
> later and performance a bit later.
>
> On the feature extraction front (as in implicit feedback als per Koren
> etc.), this is an ideal use case for more general R-like formula approach,
> which is also on desired list of things to have.
>
> So i guess we have 3 problems really here:
> (1) double-weighted ALS
> (2) bayesian optimization and crossvalidation in an n-dimensional
> hyperparameter space
> (3) feature extraction per (preferably R-like) formula.
>
>
> -d
>
>
> On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo <ap...@outlook.com>
> wrote:
>
>> +1 to glms
>>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>> -------- Original message --------
>> From: Trevor Grant <tr...@gmail.com>
>> Date: 02/17/2017 6:56 AM (GMT-08:00)
>> To: dev@mahout.apache.org
>> Subject: Re: Contributing an algorithm for samsara
>>
>> Jim is right, and I would take it one further and say, it would be best to
>> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
>> from there a Logistic regression is a trivial extension.
>>
>> Buyer beware- GLMs will be a bit of work- doable, but that would be
>> jumping
>> in neck first for both Jim and Saikat...
>>
>> MAHOUT-1928 and MAHOUT-1929
>>
>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=projec
>> t%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%
>> 20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%
>> 20priority%20DESC%2C%20created%20ASC
>>
>> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs
>> are
>> in there.
>>
>> If you have an algorithm you are particularly intimate with, or explicitly
>> need/want- feel free to open a JIRA and assign to yourself.
>>
>> There is also a case to be made for implementing the ALS...
>>
>> 1) It's a much better 'beginner' project.
>> 2) Mahout has some world class Recommenders, a toy ALS implementation
>> might
>> help us think through how the other recommenders (e.g. CCO) will 'fit'
>> into
>> the framework. E.g. ALS being the toy-prototype recommender that helps us
>> think through building out that section of the framework.
>>
>>
>>
>> Trevor Grant
>> Data Scientist
>> https://github.com/rawkintrevo
>> http://stackexchange.com/users/3002022/rawkintrevo
>> http://trevorgrant.org
>>
>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>
>>
>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>>
>> > My own thoughts are that logistic regression seems a more "generalized"
>> > and hence more useful algo to be factored in... At least in the
>> > use cases that I've been toying with.
>> >
>> > So I'd like to help out with that if wanted...
>> >
>> > > On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sx...@hotmail.com>
>> wrote:
>> > >
>> > > Trevor et al,
>> > >
>> > > I'd like to contribute an algorithm or two in samsara using spark as I
>> > would like to do a compare and contrast of Mahout with R Server for a
>> > data science pipeline, machine learning repo that I'm working on, in
>> > looking at the list of algorithms (https://mahout.apache.org/
>> > users/basics/algorithms.html) is there an algorithm for spark that would
>> > be beneficial for the community, my use cases would typically be around
>> > clustering or real time machine learning for building recommendations on
>> > the fly.    The algorithms I see that could potentially be useful are:
>> 1)
>> > Matrix Factorization with ALS 2) Logistic regression with SVD.
>> > >
>> > >
>> > >
>> > >
>> > > Any thoughts/guidance or recommendations would be very helpful.
>> > > Thanks in advance.
>> >
>> >
>>
>
>

Re: Contributing an algorithm for samsara

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
in particular, this is the Samsara implementation of double-weighted ALS:
https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626


On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Jim,
>
> if ALS is of interest, and as far as weighted ALS is concerned (since we
> already have trivial regularized ALS in the "decompositions" package),
> here's uncommitted samsara-compatible patch from a while back:
> https://issues.apache.org/jira/browse/MAHOUT-1365
>
> it combines weights on both data points (a.k.a. "implicit feedback" ALS)
> and regularization rates  (paper references are given). We combine both
> approaches in one (which is novel, i guess, but yet simple enough).
> Obviously the final solver can also be used as pure reg rate regularized if
> wanted, making it equivalent to one of the papers.
>
> You may know implicit feedback paper from mllib's implicit als, but unlike
> it was done over there (as a use case sort problem that takes input before
> even features were extracted), we split the problem into pure algebraic
> solver (double-weighted ALS math) and leave the feature extraction outside
> of this issue per se (it can be added as a separate adapter).
>
> The reason for that is that the specific use-case oriented implementation
> does not necessarily leave the space for feature extraction that is
> different from described use case of partially consumed streamed videos in
> the paper. (e.g., instead of videos one could count visits or clicks or
> add-to-cart events which may need additional hyperparameter found for them
> as part of feature extraction and converting observations into "weights").
>
> The biggest problem with these ALS methods however is that all
> hyperparameters require multidimensional crossvalidation and optimization.
> I think i mentioned it before as list of desired solutions, as it stands,
> Mahout does not have a hyperparameter fitting routine.
>
> In practice, when using these kind of ALS, we have a case of
> multidimensional hyperparameter optimization. One of them comes from the
> fitter (reg rate, or base reg rate in case of weighted regularization), and
> the others come from feature extraction process. E.g., in original paper
> they introduce (at least) 2 formulas to extract confidence weights from the
> streaming video observations, and each of them had one parameter, alpha,
> which in context of the whole problem becomes effectively yet another
> hyperparameter to fit. In other use cases when your confidence measurement
> may be coming from different sources and observations, the confidence
> extraction may actually have even more hyperparameters to fit than just
> one. And when we have a multidimensional case, simple approaches (like grid
> or random search) become either cost prohibitive or ineffective, due to the
> curse of dimensionality.
>
> At the time i was contributing that method, i was using it in conjunction
> with multidimensional bayesian optimizer, but the company that i wrote it
> for did not have it approved for contribution (unlike weighted ALS) at that
> time.
>
> Anyhow, perhaps you could read the algebra in both ALS papers there and
> ask questions, and we could worry about hyperparameter optimization a bit
> later and performance a bit later.
>
> On the feature extraction front (as in implicit feedback als per Koren
> etc.), this is an ideal use case for more general R-like formula approach,
> which is also on desired list of things to have.
>
> So i guess we have 3 problems really here:
> (1) double-weighted ALS
> (2) bayesian optimization and crossvalidation in an n-dimensional
> hyperparameter space
> (3) feature extraction per (preferably R-like) formula.
>
>
> -d
>
>
> On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo <ap...@outlook.com>
> wrote:
>
>> +1 to glms
>>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>> -------- Original message --------
>> From: Trevor Grant <tr...@gmail.com>
>> Date: 02/17/2017 6:56 AM (GMT-08:00)
>> To: dev@mahout.apache.org
>> Subject: Re: Contributing an algorithm for samsara
>>
>> Jim is right, and I would take it one further and say, it would be best to
>> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
>> from there a Logistic regression is a trivial extension.
>>
>> Buyer beware- GLMs will be a bit of work- doable, but that would be
>> jumping
>> in neck first for both Jim and Saikat...
>>
>> MAHOUT-1928 and MAHOUT-1929
>>
>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=projec
>> t%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%
>> 20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%
>> 20priority%20DESC%2C%20created%20ASC
>>
>> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs
>> are
>> in there.
>>
>> If you have an algorithm you are particularly intimate with, or explicitly
>> need/want- feel free to open a JIRA and assign to yourself.
>>
>> There is also a case to be made for implementing the ALS...
>>
>> 1) It's a much better 'beginner' project.
>> 2) Mahout has some world class Recommenders, a toy ALS implementation
>> might
>> help us think through how the other recommenders (e.g. CCO) will 'fit'
>> into
>> the framework. E.g. ALS being the toy-prototype recommender that helps us
>> think through building out that section of the framework.
>>
>>
>>
>> Trevor Grant
>> Data Scientist
>> https://github.com/rawkintrevo
>> http://stackexchange.com/users/3002022/rawkintrevo
>> http://trevorgrant.org
>>
>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>
>>
>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>>
>> > My own thoughts are that logistic regression seems a more "generalized"
>> > and hence more useful algo to be factored in... At least in the
>> > use cases that I've been toying with.
>> >
>> > So I'd like to help out with that if wanted...
>> >
>> > > On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sx...@hotmail.com>
>> wrote:
>> > >
>> > > Trevor et al,
>> > >
>> > > I'd like to contribute an algorithm or two in samsara using spark as I
>> > would like to do a compare and contrast of Mahout with R Server for a
>> > data science pipeline, machine learning repo that I'm working on, in
>> > looking at the list of algorithms (https://mahout.apache.org/
>> > users/basics/algorithms.html) is there an algorithm for spark that would
>> > be beneficial for the community, my use cases would typically be around
>> > clustering or real time machine learning for building recommendations on
>> > the fly.    The algorithms I see that could potentially be useful are:
>> 1)
>> > Matrix Factorization with ALS 2) Logistic regression with SVD.
>> > >
>> > >
>> > >
>> > >
>> > > Any thoughts/guidance or recommendations would be very helpful.
>> > > Thanks in advance.
>> >
>> >
>>
>
>

Re: Contributing an algorithm for samsara

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Jim,

if ALS is of interest, and as far as weighted ALS is concerned (since we
already have a trivial regularized ALS in the "decompositions" package),
here's an uncommitted Samsara-compatible patch from a while back:
https://issues.apache.org/jira/browse/MAHOUT-1365

It combines weights on both data points (a.k.a. "implicit feedback" ALS) and
regularization rates (paper references are given). We combine both
approaches in one (which is novel, I guess, but simple enough). Obviously
the final solver can also be used as purely reg-rate regularized if wanted,
making it equivalent to one of the papers.

You may know the implicit feedback paper from MLlib's implicit ALS, but
unlike the way it was done over there (as a use-case-specific problem that
takes input before features were even extracted), we split the problem into
a pure algebraic solver (the double-weighted ALS math) and leave the feature
extraction outside of this issue per se (it can be added as a separate adapter).
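To make the "pure algebraic solver" part concrete: with the other side's factors held fixed, each user (or item) row reduces to a small k x k ridge system. Below is a toy, in-core sketch of one confidence-weighted, regularized half-sweep in plain Scala; all names are mine, and this is deliberately dense-matrix code rather than the Samsara DSL or the patch itself:

```scala
object WeightedAlsSketch {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Solve the small dense k x k system M x = b by Gaussian elimination
  // with partial pivoting (k is the factorization rank, so this is cheap).
  def solve(mIn: Mat, bIn: Vec): Vec = {
    val n = bIn.length
    val m = mIn.map(_.clone)
    val b = bIn.clone
    for (col <- 0 until n) {
      val p = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmpRow = m(col); m(col) = m(p); m(p) = tmpRow
      val tmpB = b(col); b(col) = b(p); b(p) = tmpB
      for (row <- col + 1 until n) {
        val f = m(row)(col) / m(col)(col)
        for (c <- col until n) m(row)(c) -= f * m(col)(c)
        b(row) -= f * b(col)
      }
    }
    val x = new Array[Double](n)
    for (row <- n - 1 to 0 by -1) {
      var s = b(row)
      for (c <- row + 1 until n) s -= m(row)(c) * x(c)
      x(row) = s / m(row)(row)
    }
    x
  }

  // One half-sweep of confidence-weighted, regularized ALS: holding the
  // item factors Y (items x k) fixed, each user row solves
  //   x_u = (Y' C_u Y + lambda I)^-1 Y' C_u p_u
  // where C_u is the diagonal of that user's per-observation confidences.
  def updateUsers(pref: Mat, conf: Mat, itemF: Mat, lambda: Double): Mat = {
    val k = itemF(0).length
    pref.indices.map { u =>
      val a = Array.fill(k, k)(0.0)    // accumulates Y' C_u Y
      val rhs = new Array[Double](k)   // accumulates Y' C_u p_u
      for (i <- itemF.indices; r <- 0 until k) {
        rhs(r) += itemF(i)(r) * conf(u)(i) * pref(u)(i)
        for (c <- 0 until k)
          a(r)(c) += itemF(i)(r) * conf(u)(i) * itemF(i)(c)
      }
      for (r <- 0 until k) a(r)(r) += lambda  // ridge (regularization) term
      solve(a, rhs)
    }.toArray
  }
}
```

A distributed version would do the same per-row solve, just with the Y'Y-style accumulations computed over a DRM instead of local arrays.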

The reason for that is that a use-case-specific implementation does not
necessarily leave room for feature extraction that differs from the paper's
described use case of partially consumed streamed videos. (E.g., instead of
videos one could count visits, clicks, or add-to-cart events, which may need
an additional hyperparameter found for them as part of feature extraction
and converting observations into "weights".)

The biggest problem with these ALS methods, however, is that all the
hyperparameters require multidimensional cross-validation and optimization.
I think I mentioned it before in the list of desired solutions; as it
stands, Mahout does not have a hyperparameter fitting routine.

In practice, when using this kind of ALS, we have a case of
multidimensional hyperparameter optimization. One parameter comes from the
fitter (the reg rate, or the base reg rate in the case of weighted
regularization), and the others come from the feature extraction process.
E.g., in the original paper they introduce (at least) two formulas to
extract confidence weights from the streaming video observations, and each
of them has one parameter, alpha, which in the context of the whole problem
effectively becomes yet another hyperparameter to fit. In other use cases,
when your confidence measurements come from different sources and
observations, the confidence extraction may have even more hyperparameters
to fit than just one. And when we have a multidimensional case, simple
approaches (like grid or random search) become either cost-prohibitive or
ineffective, due to the curse of dimensionality.
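For illustration, here is what the "simple approach" looks like as throwaway Scala; the objective below is an entirely made-up stand-in for a cross-validated error surface over a reg rate lambda and a confidence alpha:

```scala
object RandomSearchSketch {
  // Stand-in for a cross-validated model error over two hyperparameters,
  // minimized at lambda = 0.1, alpha = 40 (a fabricated surface, only for
  // demonstrating the search loop).
  def cvError(lambda: Double, alpha: Double): Double =
    math.pow(lambda - 0.1, 2) + math.pow(alpha - 40.0, 2) / 1600.0

  // Random search: sample n points uniformly over the box and keep the
  // best. Returns (best error, best lambda, best alpha).
  def search(n: Int, seed: Long): (Double, Double, Double) = {
    val rnd = new scala.util.Random(seed)
    (1 to n).map { _ =>
      val lambda = rnd.nextDouble()          // lambda in [0, 1)
      val alpha = rnd.nextDouble() * 100.0   // alpha in [0, 100)
      (cvError(lambda, alpha), lambda, alpha)
    }.minBy(_._1)
  }
}
```

A grid with g points per axis costs g^d evaluations as the number of hyperparameters d grows; random search keeps the budget fixed but provides no guidance, which is why a Bayesian optimizer (a model of the error surface steering the sampling) becomes attractive in the multidimensional case.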

At the time I contributed that method, I was using it in conjunction with a
multidimensional Bayesian optimizer, but the company I wrote it for did not
have it approved for contribution (unlike the weighted ALS) at that time.

Anyhow, perhaps you could read the algebra in both ALS papers there and ask
questions; we can worry about hyperparameter optimization and performance a
bit later.

On the feature extraction front (as in implicit feedback ALS per Koren et
al.), this is an ideal use case for a more general R-like formula approach,
which is also on the desired list of things to have.

So I guess we really have 3 problems here:
(1) double-weighted ALS
(2) Bayesian optimization and cross-validation in an n-dimensional
hyperparameter space
(3) feature extraction per (preferably R-like) formula.


-d


On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo <ap...@outlook.com> wrote:

> +1 to glms
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
> -------- Original message --------
> From: Trevor Grant <tr...@gmail.com>
> Date: 02/17/2017 6:56 AM (GMT-08:00)
> To: dev@mahout.apache.org
> Subject: Re: Contributing an algorithm for samsara
>
> Jim is right, and I would take it one further and say, it would be best to
> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
> from there a Logistic regression is a trivial extension.
>
> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping
> in neck first for both Jim and Saikat...
>
> MAHOUT-1928 and MAHOUT-1929
>
> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=
> project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%
> 20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%
> 20DESC%2C%20created%20ASC
>
> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
> in there.
>
> If you have an algorithm you are particularly intimate with, or explicitly
> need/want- feel free to open a JIRA and assign to yourself.
>
> There is also a case to be made for implementing the ALS...
>
> 1) It's a much better 'beginner' project.
> 2) Mahout has some world class Recommenders, a toy ALS implementation might
> help us think through how the other recommenders (e.g. CCO) will 'fit' into
> the framework. E.g. ALS being the toy-prototype recommender that helps us
> think through building out that section of the framework.
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>
> > My own thoughts are that logistic regression seems a more "generalized"
> > and hence more useful algo to be factored in... At least in the
> > use cases that I've been toying with.
> >
> > So I'd like to help out with that if wanted...
> >
> > > On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sx...@hotmail.com>
> wrote:
> > >
> > > Trevor et al,
> > >
> > > I'd like to contribute an algorithm or two in samsara using spark as I
> > would like to do a compare and contrast of Mahout with R Server for a
> > data science pipeline, machine learning repo that I'm working on, in
> > looking at the list of algorithms (https://mahout.apache.org/
> > users/basics/algorithms.html) is there an algorithm for spark that would
> > be beneficial for the community, my use cases would typically be around
> > clustering or real time machine learning for building recommendations on
> > the fly.    The algorithms I see that could potentially be useful are: 1)
> > Matrix Factorization with ALS 2) Logistic regression with SVD.
> > >
> > >
> > >
> > >
> > > Any thoughts/guidance or recommendations would be very helpful.
> > > Thanks in advance.
> >
> >
>

RE: Contributing an algorithm for samsara

Posted by Andrew Palumbo <ap...@outlook.com>.
+1 to glms



Sent from my Verizon Wireless 4G LTE smartphone


-------- Original message --------
From: Trevor Grant <tr...@gmail.com>
Date: 02/17/2017 6:56 AM (GMT-08:00)
To: dev@mahout.apache.org
Subject: Re: Contributing an algorithm for samsara

Jim is right, and I would take it one further and say it would be best to
implement GLMs (https://en.wikipedia.org/wiki/Generalized_linear_model);
from there, logistic regression is a trivial extension.

Buyer beware: GLMs will be a bit of work. Doable, but that would be jumping
in neck-first for both Jim and Saikat...

MAHOUT-1928 and MAHOUT-1929

https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
in there.

If you have an algorithm you are particularly intimate with, or explicitly
need/want- feel free to open a JIRA and assign to yourself.

There is also a case to be made for implementing the ALS...

1) It's a much better 'beginner' project.
2) Mahout has some world-class recommenders; a toy ALS implementation might
help us think through how the other recommenders (e.g. CCO) will 'fit' into
the framework, i.e. ALS as the toy prototype recommender that helps us
think through building out that section of the framework.



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <ji...@jagunet.com> wrote:

> My own thoughts are that logistic regression seems a more "generalized"
> and hence more useful algo to be factored in... At least in the
> use cases that I've been toying with.
>
> So I'd like to help out with that if wanted...
>
> > On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> >
> > Trevor et al,
> >
> > I'd like to contribute an algorithm or two in samsara using spark as I
> would like to do a compare and contrast of Mahout with R Server for a
> data science pipeline, machine learning repo that I'm working on, in
> looking at the list of algorithms (https://mahout.apache.org/
> users/basics/algorithms.html) is there an algorithm for spark that would
> be beneficial for the community, my use cases would typically be around
> clustering or real time machine learning for building recommendations on
> the fly.    The algorithms I see that could potentially be useful are: 1)
> Matrix Factorization with ALS 2) Logistic regression with SVD.
> >
> >
> >
> >
> > Any thoughts/guidance or recommendations would be very helpful.
> > Thanks in advance.
>
>

Re: Contributing an algorithm for samsara

Posted by Trevor Grant <tr...@gmail.com>.
Jim is right, and I would take it one further and say it would be best to
implement GLMs (https://en.wikipedia.org/wiki/Generalized_linear_model);
from there, logistic regression is a trivial extension.
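To sketch why GLMs subsume logistic regression: the family contributes an inverse link g^-1, and for canonical links the score has the same (y - mu) x shape, so a single fitter covers both linear and logistic regression. A toy in-core gradient-ascent sketch in plain Scala (all names are mine, not a proposed Mahout API):

```scala
object GlmSketch {
  // A GLM pushes the linear predictor eta = w . x through an inverse link.
  case class Family(invLink: Double => Double)
  val gaussian = Family(identity)                            // identity link
  val binomial = Family(eta => 1.0 / (1.0 + math.exp(-eta))) // logit link

  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (a, b) => a * b }.sum

  // For canonical links the log-likelihood gradient is the same shape
  // across families: sum_i (y_i - invLink(w . x_i)) x_i, so one fitting
  // loop serves linear (gaussian) and logistic (binomial) regression.
  def fit(xs: Array[Array[Double]], ys: Array[Double], fam: Family,
          lr: Double = 0.1, iters: Int = 2000): Array[Double] = {
    var w = new Array[Double](xs(0).length)
    for (_ <- 0 until iters) {
      val grad = new Array[Double](w.length)
      for ((x, y) <- xs.zip(ys)) {
        val mu = fam.invLink(dot(w, x))
        for (j <- w.indices) grad(j) += (y - mu) * x(j)
      }
      // Gradient ascent on the likelihood, averaged over observations.
      w = w.zip(grad).map { case (wi, g) => wi + lr * g / xs.length }
    }
    w
  }
}
```

Swapping `gaussian` for `binomial` is the whole difference between fitting a linear and a logistic model here, which is the sense in which logistic regression is a trivial extension of a GLM framework.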

Buyer beware: GLMs will be a bit of work. Doable, but that would be jumping
in neck-first for both Jim and Saikat...

MAHOUT-1928 and MAHOUT-1929

https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

^^ Currently open JIRAs around Algorithms; you'll see Logistic and GLMs are
in there.

If you have an algorithm you are particularly intimate with, or explicitly
need/want- feel free to open a JIRA and assign to yourself.

There is also a case to be made for implementing the ALS...

1) It's a much better 'beginner' project.
2) Mahout has some world-class recommenders; a toy ALS implementation might
help us think through how the other recommenders (e.g. CCO) will 'fit' into
the framework, i.e. ALS as the toy prototype recommender that helps us
think through building out that section of the framework.



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <ji...@jagunet.com> wrote:

> My own thoughts are that logistic regression seems a more "generalized"
> and hence more useful algo to be factored in... At least in the
> use cases that I've been toying with.
>
> So I'd like to help out with that if wanted...
>
> > On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> >
> > Trevor et al,
> >
> > I'd like to contribute an algorithm or two in samsara using spark as I
> would like to do a compare and contrast of Mahout with R Server for a
> data science pipeline, machine learning repo that I'm working on, in
> looking at the list of algorithms (https://mahout.apache.org/
> users/basics/algorithms.html) is there an algorithm for spark that would
> be beneficial for the community, my use cases would typically be around
> clustering or real time machine learning for building recommendations on
> the fly.    The algorithms I see that could potentially be useful are: 1)
> Matrix Factorization with ALS 2) Logistic regression with SVD.
> >
> >
> >
> >
> > Any thoughts/guidance or recommendations would be very helpful.
> > Thanks in advance.
>
>

Re: Contributing an algorithm for samsara

Posted by Jim Jagielski <ji...@jaguNET.com>.
My own thoughts are that logistic regression seems a more "generalized"
and hence more useful algo to be factored in... At least in the
use cases that I've been toying with.

So I'd like to help out with that if wanted...

> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> 
> Trevor et al,
> 
> I'd like to contribute an algorithm or two in Samsara using Spark, as I would like to do a compare and contrast of Mahout with R Server for a data science pipeline / machine learning repo that I'm working on. Looking at the list of algorithms (https://mahout.apache.org/users/basics/algorithms.html), is there an algorithm for Spark that would be beneficial for the community? My use cases would typically be around clustering or real-time machine learning for building recommendations on the fly. The algorithms I see that could potentially be useful are: 1) Matrix Factorization with ALS, 2) Logistic regression with SVD.
> 
> 
> 
> 
> Any thoughts/guidance or recommendations would be very helpful.
> Thanks in advance.