Posted to dev@mahout.apache.org by Jim Jagielski <ji...@jaguNET.com> on 2017/03/03 12:09:30 UTC

Re: Contributing an algorithm for samsara

> On Feb 25, 2017, at 5:41 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> 
> Dmitry,
> 
> I have skimmed through the current samsara implementation and your input below, and I have some initial questions. For starters, I would like to take advantage of the work you've already done and bring it into production state

+1. It looks v. impressive.

> Given that, here are some thoughts/questions:
> 
> 
> 1) What work does the pull request below still need: unit tests, integration tests? The implementation seems complete from reading the code, but I'm coming into this new, so I'm not sure.
> 
> 2) It seems to me that your points 2 and 3 could be written as generic mahout modules that can be used by all algorithms as appropriate, what do you think?

Would it make sense to keep them as-is, and "pull them out", as
it were, should they prove to be wanted/needed by the other algo users?

> 
> 3) On the feature extraction per R like formula can you elaborate more here, are you talking about feature extraction using R like dataframes and operators?
> 
> 
> 
> More later as I read through the papers.
> 
> 
> ________________________________
> From: Dmitriy Lyubimov <dl...@gmail.com>
> Sent: Friday, February 17, 2017 1:45 PM
> To: dev@mahout.apache.org
> Subject: Re: Contributing an algorithm for samsara
> 
> in particular, this is the samsara implementation of double-weighted ALS:
> https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626
> 
> 
> 
> 
> 
> On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
>> Jim,
>> 
>> if ALS is of interest, and as far as weighted ALS is concerned (since we
>> already have trivial regularized ALS in the "decompositions" package),
>> here's an uncommitted samsara-compatible patch from a while back:
>> https://issues.apache.org/jira/browse/MAHOUT-1365
> 
> 
> 
>> 
>> it combines weights on both data points (a.k.a. "implicit feedback" ALS)
>> and regularization rates (paper references are given). We combine both
>> approaches in one (which is novel, i guess, but simple enough). Obviously
>> the final solver can also be used as a purely reg-rate-regularized one if
>> wanted, making it equivalent to one of the papers.
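A minimal sketch of the double-weighted update described above, in Python with NumPy. All names here are illustrative, not Mahout's actual API: each user's factor row is solved from normal equations that carry both a per-observation confidence weight (the implicit-feedback idea) and a regularization rate scaled by that user's observation count (the ALS-WR idea).

```python
import numpy as np

def weighted_als_step(Y, P, C, base_lambda):
    """One half-iteration of double-weighted ALS (sketch).
    For each user u, solve
        (Y^T C_u Y + lambda * n_u * I) x_u = Y^T C_u p_u
    where C_u = diag of per-observation confidence weights and n_u is the
    user's observation count scaling the base regularization rate.
    Y: (n_items, k) item factors; P: (n_users, n_items) preferences;
    C: (n_users, n_items) confidence weights."""
    n_users, k = P.shape[0], Y.shape[1]
    X = np.zeros((n_users, k))
    for u in range(n_users):
        Cu = np.diag(C[u])
        n_u = max(int(np.count_nonzero(P[u])), 1)   # observations for user u
        A = Y.T @ Cu @ Y + base_lambda * n_u * np.eye(k)
        b = Y.T @ Cu @ P[u]
        X[u] = np.linalg.solve(A, b)
    return X
```

The distributed version in the patch operates on Samsara DRMs rather than dense NumPy arrays, but the per-row solve is the same algebra.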
>> 
>> You may know the implicit feedback paper from mllib's implicit ALS, but
>> unlike how it was done over there (as a use-case-oriented problem that
>> takes input before features were even extracted), we split the problem
>> into a pure algebraic solver (double-weighted ALS math) and leave the
>> feature extraction outside of this issue per se (it can be added as a
>> separate adapter).
>> 
>> The reason for that is that the specific use-case-oriented implementation
>> does not necessarily leave room for feature extraction different from the
>> use case described in the paper (partially consumed streamed videos).
>> E.g., instead of videos one could count visits, clicks, or add-to-cart
>> events, which may need an additional hyperparameter found for them as part
>> of feature extraction and converting observations into "weights".
>> 
>> The biggest problem with these ALS methods, however, is that all
>> hyperparameters require multidimensional cross-validation and optimization.
>> I think i mentioned it before in the list of desired solutions; as it
>> stands, Mahout does not have a hyperparameter fitting routine.
>> 
>> In practice, when using these kinds of ALS, we have a case of
>> multidimensional hyperparameter optimization. One hyperparameter comes
>> from the fitter (reg rate, or base reg rate in the case of weighted
>> regularization), and the others come from the feature extraction process.
>> E.g., in the original paper they introduce (at least) 2 formulas to
>> extract measure weights from the streaming video observations, and each of
>> them had one parameter, alpha, which in the context of the whole problem
>> becomes effectively yet another hyperparameter to fit. In other use cases,
>> when your confidence measurement may be coming from different sources and
>> observations, the confidence extraction may actually have even more
>> hyperparameters to fit than just one. And when we have a multidimensional
>> case, simple approaches (like grid or random search) become either cost
>> prohibitive or ineffective, due to the curse of dimensionality.
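For concreteness, here is what the "simple approach" looks like: a minimal random search over a hyperparameter box (names are illustrative; nothing like this exists in Mahout). It works for two or three hyperparameters but, as the paragraph above notes, degrades as dimensions are added, which is why a Bayesian optimizer is the desired solution.

```python
import random

def random_search(objective, bounds, n_trials, seed=0):
    """Minimal random search over a hyperparameter box (sketch).
    bounds: dict name -> (lo, hi); objective: params dict -> loss to minimize.
    Returns (best_params, best_score). Cost to cover the box well grows
    exponentially with len(bounds), the curse of dimensionality."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in bounds.items()}
        score = objective(params)
        if best is None or score < best[1]:
            best = (params, score)
    return best
```

E.g., fitting a base reg rate and an extraction alpha jointly would be `random_search(cv_loss, {"lam": (0.0, 1.0), "alpha": (0.0, 40.0)}, 200)`, where `cv_loss` runs cross-validation for one candidate setting.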
>> 
>> At the time i was contributing that method, i was using it in conjunction
>> with a multidimensional Bayesian optimizer, but the company that i wrote
>> it for did not have it approved for contribution (unlike weighted ALS) at
>> that time.
>> 
>> Anyhow, perhaps you could read the algebra in both ALS papers there and
>> ask questions, and we could worry about hyperparameter optimization and
>> performance a bit later.
>> 
>> On the feature extraction front (as in implicit feedback ALS per Koren
>> etc.), this is an ideal use case for a more general R-like formula
>> approach, which is also on the desired list of things to have.
>> 
>> So i guess we really have 3 problems here:
>> (1) double-weighted ALS
>> (2) Bayesian optimization and cross-validation in an n-dimensional
>> hyperparameter space
>> (3) feature extraction per a (preferably R-like) formula.
>> 
>> 
>> -d
>> 
>> 
>> On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo <ap...@outlook.com>
>> wrote:
>> 
>>> +1 to glms
>>> 
>>> 
>>> 
>>> Sent from my Verizon Wireless 4G LTE smartphone
>>> 
>>> 
>>> -------- Original message --------
>>> From: Trevor Grant <tr...@gmail.com>
>>> Date: 02/17/2017 6:56 AM (GMT-08:00)
>>> To: dev@mahout.apache.org
>>> Subject: Re: Contributing an algorithm for samsara
>>> 
>>> Jim is right, and I would take it one further and say it would be best to
>>> implement GLMs (https://en.wikipedia.org/wiki/Generalized_linear_model);
>>> from there, logistic regression is a trivial extension.
>>> 
>>> Buyer beware: GLMs will be a bit of work. Doable, but that would be
>>> jumping in neck-first for both Jim and Saikat...
>>> 
>>> MAHOUT-1928 and MAHOUT-1929
>>> 
>>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
>>> 
>>> ^^ currently open JIRAs around Algorithms. You'll see Logistic and GLMs
>>> are in there.
>>> 
>>> If you have an algorithm you are particularly intimate with, or explicitly
>>> need/want- feel free to open a JIRA and assign to yourself.
>>> 
>>> There is also a case to be made for implementing the ALS...
>>> 
>>> 1) It's a much better 'beginner' project.
>>> 2) Mahout has some world-class Recommenders; a toy ALS implementation
>>> might help us think through how the other recommenders (e.g. CCO) will
>>> 'fit' into the framework. E.g. ALS being the toy-prototype recommender
>>> that helps us think through building out that section of the framework.
>>> 
>>> 
>>> 
>>> Trevor Grant
>>> Data Scientist
>>> https://github.com/rawkintrevo
>>> http://stackexchange.com/users/3002022/rawkintrevo
>>> http://trevorgrant.org
>>> 
>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>> 
>>> 
>>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>>> 
>>>> My own thoughts are that logistic regression seems a more "generalized"
>>>> and hence more useful algo to be factored in... At least in the
>>>> use cases that I've been toying with.
>>>> 
>>>> So I'd like to help out with that if wanted...
>>>> 
>>>>> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sx...@hotmail.com>
>>> wrote:
>>>>> 
>>>>> Trevor et al,
>>>>> 
>>>>> I'd like to contribute an algorithm or two in samsara using Spark, as I
>>>>> would like to do a compare-and-contrast of mahout against R Server for a
>>>>> data science pipeline / machine learning repo that I'm working on.
>>>>> Looking at the list of algorithms
>>>>> (https://mahout.apache.org/users/basics/algorithms.html), is there an
>>>>> algorithm for Spark that would be beneficial to the community? My use
>>>>> cases would typically be around clustering or real-time machine learning
>>>>> for building recommendations on the fly. The algorithms I see that could
>>>>> potentially be useful are: 1) Matrix Factorization with ALS, 2) Logistic
>>>>> regression with SVD.
>>>>> 
>>>>> Any thoughts/guidance or recommendations would be very helpful.
>>>>> Thanks in advance.


Re: Contributing an algorithm for samsara

Posted by Jim Jagielski <ji...@jaguNET.com>.
Apologies for letting this slide... way too much life got in the way :)

> On Mar 3, 2017, at 3:36 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> And by formula, yes, i mean R syntax.
> 
> A possible use case would be to take a Spark DataFrame and a formula (say,
> `age ~ . -1`) and produce DrmLike[Int] (a distributed matrix type) outputs
> that convert into predictors and target.
> 
> In this particular case, this formula means that the predictor matrix (X)
> would have all original variables except `age` (for categorical variables
> factor extraction is applied), with no bias column.
> 
> Some knowledge of R and SAS is required to pin down the compatibility
> nuances there.
> 
> Maybe we could have reasonable simplifications or omissions compared to
> the R stuff, if we can be reasonably convinced it is actually better that
> way than the vanilla R contract, but IMO it would be really useful to
> retain 100% compatibility there, since that is one of the ideas here --
> retaining R-like-ness with these things.
> 
> 
> On Fri, Mar 3, 2017 at 12:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
>> 
>> 
>> On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>>> 
>>>> 
>>>> 
>>> 
>>>>> 
>>>>> 3) On the feature extraction per R like formula can you elaborate more
>>>> here, are you talking about feature extraction using R like dataframes and
>>>> operators?
>>>> 
>>> 
>>> 
>> Yes. I would start with a generic formula parser and then the specific
>> part that works with backend-specific data frames. For Spark, i don't see
>> any reason to write our own; we'd just add an adapter for the Spark
>> native data frames.
>> 


Re: Contributing an algorithm for samsara

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
And by formula, yes, i mean R syntax.

A possible use case would be to take a Spark DataFrame and a formula (say,
`age ~ . -1`) and produce DrmLike[Int] (a distributed matrix type) outputs
that convert into predictors and target.

In this particular case, this formula means that the predictor matrix (X)
would have all original variables except `age` (for categorical variables
factor extraction is applied), with no bias column.
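The `age ~ . -1` example above can be sketched with a toy parser covering just this subset of R formula syntax. The function name and scope are illustrative only: a real Mahout implementation would target Spark DataFrames and emit DrmLike[Int], and R's actual formula grammar (interactions, nesting, transformations) is far richer than this.

```python
def parse_r_formula(formula, columns):
    """Toy parser for a tiny subset of R model-formula syntax:
    `target ~ .` selects all other columns as predictors, `- col` drops
    a predictor, and `-1` drops the intercept (no bias column).
    Returns (target, predictor_columns, intercept)."""
    lhs, rhs = [s.strip() for s in formula.split('~')]
    terms = [t.strip() for t in rhs.replace('-', ' - ').split()]
    intercept = True
    predictors = []
    i = 0
    while i < len(terms):
        t = terms[i]
        if t == '-':
            nxt = terms[i + 1]
            if nxt == '1':
                intercept = False          # `-1`: no bias column
            elif nxt in predictors:
                predictors.remove(nxt)     # `- col`: drop a predictor
            i += 2
            continue
        if t == '.':
            predictors.extend(c for c in columns if c != lhs)
        elif t != '+':
            predictors.append(t)
        i += 1
    return lhs, predictors, intercept
```

So for a frame with columns age, height, weight, the formula `age ~ . -1` yields target `age`, predictors `height` and `weight`, and no intercept, matching the description above. Categorical factor extraction would then apply per predictor column.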

Some knowledge of R and SAS is required to pin down the compatibility
nuances there.

Maybe we could have reasonable simplifications or omissions compared to
the R stuff, if we can be reasonably convinced it is actually better that
way than the vanilla R contract, but IMO it would be really useful to
retain 100% compatibility there, since that is one of the ideas here --
retaining R-like-ness with these things.


On Fri, Mar 3, 2017 at 12:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

>
>
> On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>>
>>>
>>>
>>
>>> >
>>> > 3) On the feature extraction per R like formula can you elaborate more
>>> here, are you talking about feature extraction using R like dataframes and
>>> operators?
>>>
>>
>>
> Yes. I would start with a generic formula parser and then the specific
> part that works with backend-specific data frames. For Spark, i don't see
> any reason to write our own; we'd just add an adapter for the Spark
> native data frames.
>

Re: Contributing an algorithm for samsara

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>
>>
>>
>
>> >
>> > 3) On the feature extraction per R like formula can you elaborate more
>> here, are you talking about feature extraction using R like dataframes and
>> operators?
>>
>
>
Yes. I would start with a generic formula parser and then the specific
part that works with backend-specific data frames. For Spark, i don't see
any reason to write our own; we'd just add an adapter for the Spark
native data frames.

Re: Contributing an algorithm for samsara

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I am getting a little bit lost as to who asked what here; replies inline.

On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <ji...@jagunet.com> wrote:

>
>
> Would it make sense to keep them as-is, and "pull them out", as
> it were, should they prove to be wanted/needed by the other algo users?
>

I would hope it is of some help (especially the math and the in-memory
prototype) as something to look back to. But I would really try to plot it
all anew; I find it usually helps my focus if I work with my own code from
the ground up.

So no, i would not just try to take it as is. Not without careful review.

Also, if you noticed, the distributed version is quasi-algebraic, i.e., it
contains direct Spark dependencies and code that relies on Spark. As such,
it cannot be put into our decompositions package in the mahout-math-scala
module, where most of the other distributed decompositions sit.

I suspect it could be made 100% algebraic with the current primitives
available in Samsara. This is a necessary condition for getting it into
mahout-math-scala. If it can't be done, then it has to live in the
mahout-spark module as one backend implementation only.


>
> >
> > 3) On the feature extraction per R like formula can you elaborate more
> here, are you talking about feature extraction using R like dataframes and
> operators?
>


> >
> >
> >
> > More later as I read through the papers.
>

I would really start there before anything else. (Moreover, this is the
most fun part of all of it, as far as i am concerned :) ).

Also, my adapted formulas are attached to the issue, like i mentioned. I
would look through the math to see if it is clear (for interpretation);
if not, let's discuss any questions.

