Posted to dev@mahout.apache.org by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org> on 2013/11/12 22:22:17 UTC

[jira] [Comment Edited] (MAHOUT-1346) Spark Bindings (DRM)

    [ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820475#comment-13820475 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1346 at 11/12/13 9:21 PM:
--------------------------------------------------------------------

https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is still not there (I still haven't hashed it out with my boss), but there are some initial matrix algorithms to be picked up (even transposition can be blockified and improved).

Anyone want to give me a hand on this?

Please don't pick up weighted ALS-WR yet; I still hope to finish porting it.

There are more interesting questions there, like parameter validation and fitting. 
A common problem I have: suppose you take the implicit feedback approach. You then reformulate the input in terms of preference (P) and confidence (C) matrices. The original paper describes a specific scheme for forming C that includes one parameter they want to fit.
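
For reference, here is that scheme as I read it, written out as a throwaway Scala snippet. This is only a hedged sketch, assuming the paper in question is the usual implicit-feedback ALS paper with the linear confidence mapping c = 1 + alpha * r; none of these names exist in the codebase:

    object SingleAlphaScheme {
      // One observed interaction count r per (user, item) cell is turned into a
      // preference and a confidence; alpha is the single parameter to be fit.
      def toPrefConf(r: Double, alpha: Double): (Double, Double) = {
        val p = if (r > 0) 1.0 else 0.0 // preference: did the user interact at all
        val c = 1.0 + alpha * r         // confidence grows with observed activity
        (p, c)
      }
    }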

A more interesting question is: what if we have more than one parameter? I.e., what if we have a whole range of user behaviors, say item search, browse, click, add-to-cart, and finally acquisition? That is a whole bunch of parameters over the user's preference. Suppose we want to explore what each of them is worth. The natural way to do it is, again, through cross-validation.
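
Purely as an illustration of what such a multi-parameter mapping could look like (the event names and weights below are made up, not a proposal), each behavior type would carry its own confidence weight, and those weights are exactly what we would have to fit:

    object MultiSignalScheme {
      // Hypothetical per-event-type confidence weights; placeholders only.
      val weight: Map[String, Double] = Map(
        "search"      -> 0.1,
        "browse"      -> 0.2,
        "click"       -> 0.5,
        "add2cart"    -> 1.0,
        "acquisition" -> 3.0)

      // Aggregate one user's event counts for one item into a single (P, C) cell.
      def toPrefConf(eventCounts: Map[String, Int]): (Double, Double) = {
        val c = 1.0 + eventCounts.map { case (e, n) => weight.getOrElse(e, 0.0) * n }.sum
        val p = if (eventCounts.values.sum > 0) 1.0 else 0.0
        (p, c)
      }
    }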

However, since there are many parameters, the task becomes considerably less tractable. Since there is not much test data (we should still assume we will have just a handful of cross-validation runs), the various "online" convex search techniques like SGD or BFGS are not going to be very viable. What I was thinking: maybe we can run parallel trials and fit the resulting scores to a paraboloid (i.e. a second-degree polynomial regression without interaction terms). That might be a big assumption, but it would be enough. Of course, we may discover hyperbolic-paraboloid behavior along some parameter axes, in which case it would mean we got the preference wrong, so we flip the preference mapping (i.e. click = (P=1, C=0.5) would flip into click = (P=0, C=0...)) and re-validate. This is a kind of multidimensional variation of the one-parameter second-degree polynomial fitting that Raphael referred to once.
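
A hedged sketch of that fit in plain Scala (my own naive solver, nothing that exists in Mahout): treat every parallel trial as one sample (parameter vector, held-out score), fit the second-degree polynomial without interaction terms by ordinary least squares, and then look at the sign of each quadratic coefficient; a non-negative one signals saddle behavior along that axis, i.e. a candidate for flipping the preference mapping.

    object ParaboloidFit {
      // Model: score ~ b0 + sum_i (l_i * x_i + q_i * x_i^2), no interaction terms.

      // Design row [1, x_1, x_1^2, ..., x_d, x_d^2] for one trial.
      private def designRow(x: Array[Double]): Array[Double] =
        1.0 +: x.flatMap(v => Array(v, v * v))

      // Solve the normal equations (A'A) beta = A'y with naive Gauss-Jordan
      // (no pivoting; fine for the handful of parameters considered here).
      private def leastSquares(a: Array[Array[Double]], y: Array[Double]): Array[Double] = {
        val n = a(0).length
        val ata = Array.ofDim[Double](n, n)
        val aty = Array.ofDim[Double](n)
        for (k <- a.indices; i <- 0 until n) {
          aty(i) += a(k)(i) * y(k)
          for (j <- 0 until n) ata(i)(j) += a(k)(i) * a(k)(j)
        }
        for (p <- 0 until n) {
          val piv = ata(p)(p)
          for (j <- 0 until n) ata(p)(j) /= piv
          aty(p) /= piv
          for (r <- 0 until n if r != p) {
            val f = ata(r)(p)
            for (j <- 0 until n) ata(r)(j) -= f * ata(p)(j)
            aty(r) -= f * aty(p)
          }
        }
        aty
      }

      // trials: parameter vectors tried in parallel; scores: held-out metric per trial.
      def fitAndCheck(trials: Array[Array[Double]], scores: Array[Double]): Unit = {
        val beta = leastSquares(trials.map(designRow), scores)
        for (i <- trials(0).indices) {
          val l = beta(1 + 2 * i)
          val q = beta(2 + 2 * i)
          if (q >= 0)
            println(s"axis $i: non-negative curvature, consider flipping the preference mapping")
          else
            println(s"axis $i: candidate optimum at ${-l / (2 * q)}")
        }
      }
    }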

We are taking on a lot of assumptions here (parameter independence, existence of a good global maximum, etc.). Perhaps there is something better to automate that search?

Thanks.
-Dmitriy


> Spark Bindings (DRM)
> --------------------
>
>                 Key: MAHOUT-1346
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> Spark bindings for Mahout DRM. 
> DRM DSL. 
> Disclaimer. This will all be experimental at this point.
> The idea is to wrap DRM by Spark RDD with support of some basic functionality, perhaps some humble beginning of Cost-based optimizer 
> (0) Spark serialization support for Vector, Matrix 
> (1) Bagel transposition 
> (2) slim X'X
> (2a) not-so-slim X'X
> (3) blockify() (compose RDD containing vertical blocks of original input)
> (4) read/write Mahout DRM off HDFS
> (5) A'B
> ...



--
This message was sent by Atlassian JIRA
(v6.1#6144)