Posted to user@spark.apache.org by Debasish Das <de...@gmail.com> on 2014/01/03 00:16:33 UTC

Spark Matrix Factorization

Hi,

I am not seeing any DSGD implementation of ALS in Spark.

There are two ALS implementations.

org.apache.spark.examples.SparkALS does not run on large matrices and seems
more like demo code.

org.apache.spark.mllib.recommendation.ALS looks like a more robust version,
and I am experimenting with it.

References here are Jellyfish, Twitter's implementation of Jellyfish called
Scalafish, a Google paper called Sparkler, and a similar idea put forward in
an IBM paper by Gemulla et al. (large-scale matrix factorization with
distributed stochastic gradient descent).

https://github.com/azymnis/scalafish

Are there any plans to add DSGD to Spark, or is there an existing JIRA?

Thanks.
Deb
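For readers unfamiliar with DSGD, the stratification idea from the Gemulla et al. paper can be sketched in a few lines. This is an illustrative sketch of the block scheduling only, with a made-up 3x3 grid; it is not Spark, Jellyfish, or Scalafish code:

```python
# Sketch of DSGD stratum scheduling (after Gemulla et al.).
# Split an n_users x n_items rating matrix into a d x d grid of blocks.
# In sub-epoch s, the blocks (i, (i + s) % d) share no rows or columns,
# so their SGD updates never conflict and could run on parallel workers.

d = 3  # number of row/column blocks (hypothetical choice)

def strata(d):
    """Yield d strata; each stratum is a list of d non-conflicting blocks."""
    for s in range(d):
        yield [(i, (i + s) % d) for i in range(d)]

for s, stratum in enumerate(strata(d)):
    rows = [i for i, _ in stratum]
    cols = [j for _, j in stratum]
    # no block in a stratum repeats a row-block or column-block index
    assert len(set(rows)) == d and len(set(cols)) == d
    print("sub-epoch", s, "->", stratum)
```

The point of the diagonal schedule is that within one stratum every block touches disjoint user rows and item columns, so plain SGD applied inside each block stays correct when the blocks run concurrently.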

Re: Spark Matrix Factorization

Posted by Sebastian Schelter <ss...@apache.org>.
Just a minor correction: The Sparkler paper was done by IBM. IIRC they
not only implemented the algorithm but also modified Spark to tune it
for that use case.

--sebastian



Re: Spark Matrix Factorization

Posted by Krakna H <sh...@gmail.com>.
Hi Deb,

Putting your code on GitHub would be much appreciated -- it would give us a
good starting point to adapt for our purposes.

Regards.



Re: Spark Matrix Factorization

Posted by Debasish Das <de...@gmail.com>.
Factorization problems are non-convex, so both ALS and DSGD will
converge to local minima, and it is not clear which minimum will be better
than the other until we run both algorithms and see...

So I will still say get a DSGD version running in the test setup while you
experiment with the Spark ALS, so that you can see whether on your particular
dataset DSGD converges to a better minimum...

If you want, I can put the DSGD code base that I used for experimentation on
github...I am not sure if Professor Re already put it on github...
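The "run both and see" advice can be illustrated on a toy problem: plain SGD and exact alternating least squares fitting the same tiny rank-1 matrix, fully observed and unregularized. All constants here are invented for the sketch; neither loop is Spark's ALS or a real DSGD:

```python
# Toy comparison of plain SGD vs ALS on one small dense rank-1 matrix.
R = [[2.0, 4.0, 6.0],
     [1.0, 2.0, 3.0],
     [3.0, 6.0, 9.0]]          # exactly rank 1: R[i][j] = u[i] * v[j]
m, n = len(R), len(R[0])

def sq_err(u, v):
    return sum((R[i][j] - u[i] * v[j]) ** 2
               for i in range(m) for j in range(n))

# --- SGD: follow the gradient of each residual, entry by entry ---
u = [1.0] * m; v = [1.0] * n
for _ in range(200):
    for i in range(m):
        for j in range(n):
            e = R[i][j] - u[i] * v[j]
            u[i] += 0.05 * e * v[j]
            v[j] += 0.05 * e * u[i]
sgd_err = sq_err(u, v)

# --- ALS: alternate exact least-squares solves (closed form for rank 1) ---
u = [1.0] * m; v = [1.0] * n
for _ in range(20):
    u = [sum(R[i][j] * v[j] for j in range(n)) / sum(x * x for x in v)
         for i in range(m)]
    v = [sum(R[i][j] * u[i] for i in range(m)) / sum(x * x for x in u)
         for j in range(n)]
als_err = sq_err(u, v)

print("SGD squared error:", sgd_err)   # both should end up near zero
print("ALS squared error:", als_err)
```

On this convex-friendly toy both reach essentially the same fit; the interesting comparisons are on real sparse datasets where the local minima can differ, which is exactly the experiment suggested above.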



Re: Spark Matrix Factorization

Posted by Krakna H <sh...@gmail.com>.
Hi Deb,

Thanks so much for your response! At this point, we haven't determined
which of DSGD/ALS to go with and were waiting on guidance like yours to
tell us what the right option would be. It looks like ALS will be good
enough for our purposes.

Regards.



Re: Spark Matrix Factorization

Posted by Debasish Das <de...@gmail.com>.
Hi,

In my experiments with Jellyfish I did not see any substantial RMSE loss
over DSGD on the Netflix dataset...

So we decided to stick with ALS and implemented a family of Quadratic
Minimization solvers that stays in the ALS realm but can solve interesting
constraints (positivity, bounds, L1, equality-constrained bounds, etc.)...We
are going to show it at the Spark Summit...Also, the ALS structure is
favorable to matrix factorization use cases where missing entries mean zero
and you want to compute a global Gram matrix using broadcast and use that for
each Quadratic Minimization for all users/products...

Implementing DSGD in the data partitioning that Spark ALS uses would be
straightforward, but I would be more keen to see a dataset where DSGD
shows better RMSEs than ALS...

If you have a dataset where DSGD produces much better results, could you
please point us to it?

Also, you can use Jellyfish to run DSGD benchmarks to compare against
ALS...It is multithreaded, and if you have enough RAM you should be able to
run fairly large datasets...

Be careful with the default Jellyfish...it has been tuned for the Netflix
dataset (regularization, rating normalization, etc.)...So before you compare
RMSE, make sure ALS and Jellyfish are running the same algorithm
(L2-regularized quadratic loss)...

Thanks.
Deb
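The per-user quadratic minimization with a Gram matrix described above can be sketched as follows. With item factors V fixed, each user vector u solves the ridge system (V_S^T V_S + lambda*I) u = V_S^T r_S over that user's observed items S. Rank 2 here so the solve is a hand-written 2x2 inversion; the factors, ratings, and lambda are made-up example values, and this is only the normal-equations shape of one ALS half-step, not Spark's implementation:

```python
# One ALS half-step for a single user, via the normal equations.
lam = 0.1                                  # hypothetical regularization
V = [[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]]   # 3 items x rank 2 (made up)
ratings = {0: 3.0, 2: 1.0}                 # this user's item -> rating

# Accumulate the regularized Gram matrix A = V_S^T V_S + lam*I and b = V_S^T r_S.
A = [[lam, 0.0], [0.0, lam]]
b = [0.0, 0.0]
for j, r in ratings.items():
    for p in range(2):
        b[p] += V[j][p] * r
        for q in range(2):
            A[p][q] += V[j][p] * V[j][q]

# Closed-form 2x2 solve: u = A^{-1} b.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
u = [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
     (A[0][0] * b[1] - A[1][0] * b[0]) / det]
print("user factors:", u)
```

The "broadcast a global Gram matrix" idea in the message is an optimization of this same shape: when zeros count as observations, V^T V can be computed once and shared, and each per-user solve only adjusts it.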



Re: Spark Matrix Factorization

Posted by Krakna H <sh...@gmail.com>.
Hi all,

Just found this thread -- is there an update on including DSGD in Spark? We
have a project that entails topic modeling on a document-term matrix using
matrix factorization, and were wondering if we should use ALS or attempt
writing our own matrix factorization implementation on top of Spark.

Thanks.




Re: Spark Matrix Factorization

Posted by Ameet Talwalkar <am...@eecs.berkeley.edu>.
>
> Matrix factorization is a non-convex problem and ALS solves it using 2
> convex problems, DSGD solves the problem by finding a local minima.
>
>
ALS and SGD solve the same non-convex objective function, and thus both
yield local minima.  The following reference provides a nice overview (in
particular see equation 2 of this paper):

http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf
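The shared objective both methods minimize (equation 2 of the survey linked above) can be written out directly. The `objective` helper and the toy data below are invented for illustration, not MLlib code:

```python
# sum over observed (i, j) of (r_ij - u_i . v_j)^2
#   + lam * (sum_i ||u_i||^2 + sum_j ||v_j||^2)
def objective(ratings, U, V, lam):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    loss = sum((r - dot(U[i], V[j])) ** 2 for (i, j), r in ratings.items())
    reg = lam * (sum(dot(u, u) for u in U) + sum(dot(v, v) for v in V))
    return loss + reg

# Rank-1 factors that reproduce the observed entries exactly:
ratings = {(0, 0): 2.0, (0, 1): 4.0, (1, 1): 2.0}
U, V = [[2.0], [1.0]], [[1.0], [2.0]]
print(objective(ratings, U, V, lam=0.0))   # data term alone
```

ALS fixes U (or V) and minimizes this exactly in the other block; SGD takes noisy gradient steps on single terms of the sum. Both walk the same non-convex surface, hence both land in local minima.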


>
>
>
> On Thu, Jan 2, 2014 at 4:06 PM, Ameet Talwalkar <am...@eecs.berkeley.edu>wrote:
>
>> Hi Deb,
>>
>> Thanks for your email.  We currently do not have a DSGD implementation in
>> MLlib. Also, just to clarify, DSGD is not a variant of ALS, but rather a
>> different algorithm for solving the same the same bi-convex objective
>> function.
>>
>> It would be a good thing to do add, but to the best of my knowledge, no
>> one is actively working on this right now.
>>
>> Also, as you mentioned, the ALS implementation in mllib is more
>> robust/scalable than the one in spark.examples.
>>
>> -Ameet
>>
>>
>> On Thu, Jan 2, 2014 at 3:16 PM, Debasish Das <de...@gmail.com>wrote:
>>
>>> Hi,
>>>
>>> I am not noticing any DSGD implementation of ALS in Spark.
>>>
>>> There are two ALS implementations.
>>>
>>> org.apache.spark.examples.SparkALS does not run on large matrices and
>>> seems more like a demo code.
>>>
>>> org.apache.spark.mllib.recommendation.ALS looks feels more robust
>>> version and I am experimenting with it.
>>>
>>> References here are Jellyfish, Twitter's implementation of Jellyfish
>>> called Scalafish, Google paper called Sparkler and similar idea put forward
>>> by IBM paper by Gemulla et al. (large-scale matrix factorization with
>>> distributed stochastic gradient descent)
>>>
>>> https://github.com/azymnis/scalafish
>>>
>>> Are there any plans of adding DSGD in Spark or there are any existing
>>> JIRA ?
>>>
>>> Thanks.
>>> Deb
>>>
>>>
>>
>


Re: Spark Matrix Factorization

Posted by Debasish Das <de...@gmail.com>.
Hi Ameet,

Matrix factorization is a non-convex problem; ALS solves it by alternating
between two convex subproblems, while DSGD solves it by descending to a
local minimum.

I am experimenting with Spark Parallel ALS, but I intend to port Scalafish
https://github.com/azymnis/scalafish to Spark as well.

For bigger matrices, the jury is still out on which algorithm provides a
better local optimum within a given iteration budget. I believe it is also
highly dependent on the dataset.

Thanks.
Deb



On Thu, Jan 2, 2014 at 4:06 PM, Ameet Talwalkar <am...@eecs.berkeley.edu> wrote:

> Hi Deb,
>
> Thanks for your email. We currently do not have a DSGD implementation in
> MLlib. Also, just to clarify, DSGD is not a variant of ALS, but rather a
> different algorithm for solving the same bi-convex objective function.
>
> It would be a good thing to add, but to the best of my knowledge, no
> one is actively working on this right now.
>
> Also, as you mentioned, the ALS implementation in mllib is more
> robust/scalable than the one in spark.examples.
>
> -Ameet
>

Re: Spark Matrix Factorization

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
It's in Mahout 0.9, which should be in the very final stages now.



Re: Spark Matrix Factorization

Posted by Debasish Das <de...@gmail.com>.
Hi Dmitri,

We have a Mahout mirror from GitHub, but I don't see any of the math-scala
code.

Where do I see the math-scala code? I thought the GitHub mirror was updated
from the svn repo.

Thanks.
Deb



On Fri, Jan 3, 2014 at 10:43 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

>
> On Fri, Jan 3, 2014 at 10:28 AM, Sebastian Schelter <ss...@apache.org> wrote:
>
>> > I wonder if anyone might have a recommendation on a Scala-native
>> > implementation of SVD.
>>
>> Mahout has a Scala implementation of an SVD variant called Stochastic SVD:
>>
>> https://svn.apache.org/viewvc/mahout/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala?view=markup
>
>
> Mahout also has SVD and Eigen decompositions  mapped to scala as svd() and
> eigen(). Unfortunately i have not put it on wiki yet but the summary is
> available here https://issues.apache.org/jira/browse/MAHOUT-1297
>
> Mahout also has distributed PCA implementation (which is based on
> distributed Stochastic SVD and has a special provisions for sparse matrix
> cases). Unfortunately our wiki is in flux now due to migration off
> confluence to CMS so the SSVD page has not yet been migrated to CMS so
> confluence version is here
> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>
>
>>
>> Otherwise, all the major Java math libraries (mahout math, jblas,
>> commons-math) should provide an implementation that you can use from Scala.
>>
>> --sebastian
>>
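As rough intuition for the stochastic/randomized SVD discussed above: the core building block is power iteration on A^T A, which pulls a starting vector toward the top right singular vector. The sketch below is pure illustrative Python with made-up helper names and a tiny matrix; it is not Mahout's SSVD algorithm (which adds random projections, blocking, and distribution):

```python
import random

A = [[3.0, 1.0], [1.0, 3.0]]           # small symmetric example; top singular value is 4

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def top_singular_value(A, iters=100, seed=0):
    """Estimate sigma_1 of A by power iteration on A^T A."""
    rng = random.Random(seed)
    v = [rng.random() for _ in A[0]]
    for _ in range(iters):
        w = matvec(A, v)               # A v
        v = matvec(list(zip(*A)), w)   # A^T (A v)
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]      # v -> top right singular vector
    return sum(x * x for x in matvec(A, v)) ** 0.5  # ||A v|| = sigma_1

print(top_singular_value(A))
```

Randomized/stochastic SVD methods replace the single starting vector with a random block and only a couple of such passes, which is what makes them practical for large distributed matrices.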

Re: Spark Matrix Factorization

Posted by Ameet Talwalkar <am...@eecs.berkeley.edu>.
Hi all,

The following pull request implementing SVD in MLlib is highly relevant to
this discussion: https://github.com/apache/incubator-spark/pull/315

-Ameet


>> >>>
>> >>> I am not noticing any DSGD implementation of ALS in Spark.
>> >>>
>> >>> There are two ALS implementations.
>> >>>
>> >>> org.apache.spark.examples.SparkALS does not run on large matrices and
>> >>> seems more like a demo code.
>> >>>
>> >>> org.apache.spark.mllib.recommendation.ALS looks feels more robust
>> version
>> >>> and I am experimenting with it.
>> >>>
>> >>> References here are Jellyfish, Twitter's implementation of Jellyfish
>> >>> called Scalafish, Google paper called Sparkler and similar idea put
>> forward
>> >>> by IBM paper by Gemulla et al. (large-scale matrix factorization with
>> >>> distributed stochastic gradient descent)
>> >>>
>> >>> https://github.com/azymnis/scalafish
>> >>>
>> >>> Are there any plans of adding DSGD in Spark or there are any existing
>> >>> JIRA ?
>> >>>
>> >>> Thanks.
>> >>> Deb
>> >>>
>> >>>
>> >>
>> >
>> >
>>
>>
>

Re: Spark Matrix Factorization

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Jan 3, 2014 at 10:28 AM, Sebastian Schelter <ss...@apache.org> wrote:

> > I wonder if anyone might have recommendation on scala native
> implementation
> > of SVD.
>
> Mahout has a scala implementation of an SVD variant called Stochastic SVD:
>
>
> https://svn.apache.org/viewvc/mahout/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala?view=markup


Mahout also has SVD and Eigen decompositions mapped to Scala as svd() and
eigen(). Unfortunately I have not put it on the wiki yet, but a summary is
available here: https://issues.apache.org/jira/browse/MAHOUT-1297

Mahout also has a distributed PCA implementation (which is based on
distributed Stochastic SVD and has special provisions for the sparse matrix
case). Unfortunately our wiki is in flux due to the migration off Confluence
to the CMS, so the SSVD page has not yet been migrated; the Confluence
version is here:
https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition


>
> Otherwise, all the major java math libraries (mahout math, jblas,
> commons-math) should provide an implementation that you can use in scala.
>
> --sebastian
>
> > C
> >
> >
> >
> >
> > On Thu, Jan 2, 2014 at 7:06 PM, Ameet Talwalkar <ameet@eecs.berkeley.edu
> >wrote:
> >
> >> Hi Deb,
> >>
> >> Thanks for your email.  We currently do not have a DSGD implementation
> in
> >> MLlib. Also, just to clarify, DSGD is not a variant of ALS, but rather a
> >> different algorithm for solving the same the same bi-convex objective
> >> function.
> >>
> >> It would be a good thing to do add, but to the best of my knowledge, no
> >> one is actively working on this right now.
> >>
> >> Also, as you mentioned, the ALS implementation in mllib is more
> >> robust/scalable than the one in spark.examples.
> >>
> >> -Ameet
> >>
> >>
> >> On Thu, Jan 2, 2014 at 3:16 PM, Debasish Das <debasish.das83@gmail.com
> >wrote:
> >>
> >>> Hi,
> >>>
> >>> I am not noticing any DSGD implementation of ALS in Spark.
> >>>
> >>> There are two ALS implementations.
> >>>
> >>> org.apache.spark.examples.SparkALS does not run on large matrices and
> >>> seems more like a demo code.
> >>>
> >>> org.apache.spark.mllib.recommendation.ALS looks feels more robust
> version
> >>> and I am experimenting with it.
> >>>
> >>> References here are Jellyfish, Twitter's implementation of Jellyfish
> >>> called Scalafish, Google paper called Sparkler and similar idea put
> forward
> >>> by IBM paper by Gemulla et al. (large-scale matrix factorization with
> >>> distributed stochastic gradient descent)
> >>>
> >>> https://github.com/azymnis/scalafish
> >>>
> >>> Are there any plans of adding DSGD in Spark or there are any existing
> >>> JIRA ?
> >>>
> >>> Thanks.
> >>> Deb
> >>>
> >>>
> >>
> >
> >
>
>

Re: Spark Matrix Factorization

Posted by Sebastian Schelter <ss...@apache.org>.
> I wonder if anyone might have recommendation on scala native implementation
> of SVD.

Mahout has a Scala implementation of an SVD variant called Stochastic SVD:

https://svn.apache.org/viewvc/mahout/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala?view=markup

Otherwise, all the major Java math libraries (mahout-math, jblas,
commons-math) should provide an implementation that you can use from Scala.

--sebastian

> C
> 
> 
> 
> 
> On Thu, Jan 2, 2014 at 7:06 PM, Ameet Talwalkar <am...@eecs.berkeley.edu>wrote:
> 
>> Hi Deb,
>>
>> Thanks for your email.  We currently do not have a DSGD implementation in
>> MLlib. Also, just to clarify, DSGD is not a variant of ALS, but rather a
>> different algorithm for solving the same the same bi-convex objective
>> function.
>>
>> It would be a good thing to do add, but to the best of my knowledge, no
>> one is actively working on this right now.
>>
>> Also, as you mentioned, the ALS implementation in mllib is more
>> robust/scalable than the one in spark.examples.
>>
>> -Ameet
>>
>>
>> On Thu, Jan 2, 2014 at 3:16 PM, Debasish Das <de...@gmail.com>wrote:
>>
>>> Hi,
>>>
>>> I am not noticing any DSGD implementation of ALS in Spark.
>>>
>>> There are two ALS implementations.
>>>
>>> org.apache.spark.examples.SparkALS does not run on large matrices and
>>> seems more like a demo code.
>>>
>>> org.apache.spark.mllib.recommendation.ALS looks feels more robust version
>>> and I am experimenting with it.
>>>
>>> References here are Jellyfish, Twitter's implementation of Jellyfish
>>> called Scalafish, Google paper called Sparkler and similar idea put forward
>>> by IBM paper by Gemulla et al. (large-scale matrix factorization with
>>> distributed stochastic gradient descent)
>>>
>>> https://github.com/azymnis/scalafish
>>>
>>> Are there any plans of adding DSGD in Spark or there are any existing
>>> JIRA ?
>>>
>>> Thanks.
>>> Deb
>>>
>>>
>>
> 
> 


Re: Spark Matrix Factorization

Posted by Charles Earl <ch...@gmail.com>.
In a slightly related note, I am trying to write a distributed PCA based
upon
http://biglearn.org/2013/files/papers/biglearning2013_submission_18.pdf
The algorithm works by computing SVD locally then broadcasting the locally
computed principal components.
I wonder if anyone might have a recommendation on a Scala-native
implementation of SVD.
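
To make the combine step concrete, here is a toy sketch in plain Python. It
is not the paper's actual scheme of broadcasting locally computed principal
components, but the simpler exact-combine variant it can be compared
against: each partition reduces its rows to a small local scatter matrix,
the driver sums those summaries, and a power iteration stands in for the
local SVD. All data and names are illustrative.

```python
# Hedged sketch of a distributed PCA combine step: workers summarize their
# partitions with local scatter matrices, the driver sums them, and the
# leading principal component is extracted by power iteration (standing in
# for a full SVD). Assumes zero-mean 2-D data for brevity.
import math

partitions = [
    [(2.0, 1.9), (-1.0, -1.1)],
    [(0.5, 0.6), (-2.0, -1.8)],
]

def local_scatter(rows):
    """Per-partition sum of outer products x x^T (2x2 case)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in rows:
        for i in range(2):
            for j in range(2):
                s[i][j] += x[i] * x[j]
    return s

# driver step: sum the small local summaries
total = [[0.0, 0.0], [0.0, 0.0]]
for s in map(local_scatter, partitions):
    for i in range(2):
        for j in range(2):
            total[i][j] += s[i][j]

# leading eigenvector via power iteration, in place of a local SVD
v = [1.0, 0.0]
for _ in range(50):
    w = [total[0][0] * v[0] + total[0][1] * v[1],
         total[1][0] * v[0] + total[1][1] * v[1]]
    n = math.hypot(w[0], w[1])
    v = [w[0] / n, w[1] / n]
print([round(c, 3) for c in v])  # direction of dominant variance, ~(1,1)/sqrt(2)
```

In a real distributed setting, local_scatter would run inside a
mapPartitions-style operation, and only the k-by-k (here 2x2) summaries
would be shipped to the driver, which is what makes the combine cheap.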
C




On Thu, Jan 2, 2014 at 7:06 PM, Ameet Talwalkar <am...@eecs.berkeley.edu>wrote:

> Hi Deb,
>
> Thanks for your email.  We currently do not have a DSGD implementation in
> MLlib. Also, just to clarify, DSGD is not a variant of ALS, but rather a
> different algorithm for solving the same the same bi-convex objective
> function.
>
> It would be a good thing to do add, but to the best of my knowledge, no
> one is actively working on this right now.
>
> Also, as you mentioned, the ALS implementation in mllib is more
> robust/scalable than the one in spark.examples.
>
> -Ameet
>
>
> On Thu, Jan 2, 2014 at 3:16 PM, Debasish Das <de...@gmail.com>wrote:
>
>> Hi,
>>
>> I am not noticing any DSGD implementation of ALS in Spark.
>>
>> There are two ALS implementations.
>>
>> org.apache.spark.examples.SparkALS does not run on large matrices and
>> seems more like a demo code.
>>
>> org.apache.spark.mllib.recommendation.ALS looks feels more robust version
>> and I am experimenting with it.
>>
>> References here are Jellyfish, Twitter's implementation of Jellyfish
>> called Scalafish, Google paper called Sparkler and similar idea put forward
>> by IBM paper by Gemulla et al. (large-scale matrix factorization with
>> distributed stochastic gradient descent)
>>
>> https://github.com/azymnis/scalafish
>>
>> Are there any plans of adding DSGD in Spark or there are any existing
>> JIRA ?
>>
>> Thanks.
>> Deb
>>
>>
>


-- 
- Charles

Re: Spark Matrix Factorization

Posted by Debasish Das <de...@gmail.com>.
Hi Ameet,

Matrix factorization is a non-convex problem; ALS solves it by alternating
between two convex subproblems, while DSGD attacks it directly and
converges to a local minimum.

I am experimenting with Spark Parallel ALS, but I intend to port Scalafish
https://github.com/azymnis/scalafish to Spark as well.

For bigger matrices, the jury is still out on which algorithm provides a
better local optimum within a given iteration bound. It is also highly
dependent on the dataset, I believe.
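
For intuition, the local update DSGD performs inside each block can be
sketched in a few lines of plain Python (following Gemulla et al.'s
formulation; the step size, regularization, and data are illustrative, and
the blocking/stratification that makes it distributed is omitted):

```python
# Hedged toy sketch of the per-rating SGD update DSGD applies within a
# block: for an observed rating r, step the user factor u and item factor v
# along the gradient of the local loss
#   (r - u . v)^2 + lam * (||u||^2 + ||v||^2).

ETA, LAM = 0.05, 0.01  # illustrative step size and regularization

def sgd_step(u, v, r):
    pred = sum(a * b for a, b in zip(u, v))
    e = r - pred
    u_new = [a + ETA * (e * b - LAM * a) for a, b in zip(u, v)]
    v_new = [b + ETA * (e * a - LAM * b) for a, b in zip(u, v)]
    return u_new, v_new

# repeated updates pull the prediction toward the observed rating,
# settling near r up to a small regularization bias
u, v = [0.1, 0.1], [0.1, 0.1]
for _ in range(200):
    u, v = sgd_step(u, v, 1.0)
pred = sum(a * b for a, b in zip(u, v))
print(round(pred, 2))  # -> 0.99
```

DSGD's contribution is the observation that updates touching disjoint sets
of rows and columns are independent, so blocks along a diagonal stratum can
run these updates in parallel without conflicts.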

Thanks.
Deb



On Thu, Jan 2, 2014 at 4:06 PM, Ameet Talwalkar <am...@eecs.berkeley.edu>wrote:

> Hi Deb,
>
> Thanks for your email.  We currently do not have a DSGD implementation in
> MLlib. Also, just to clarify, DSGD is not a variant of ALS, but rather a
> different algorithm for solving the same the same bi-convex objective
> function.
>
> It would be a good thing to do add, but to the best of my knowledge, no
> one is actively working on this right now.
>
> Also, as you mentioned, the ALS implementation in mllib is more
> robust/scalable than the one in spark.examples.
>
> -Ameet
>
>
> On Thu, Jan 2, 2014 at 3:16 PM, Debasish Das <de...@gmail.com>wrote:
>
>> Hi,
>>
>> I am not noticing any DSGD implementation of ALS in Spark.
>>
>> There are two ALS implementations.
>>
>> org.apache.spark.examples.SparkALS does not run on large matrices and
>> seems more like a demo code.
>>
>> org.apache.spark.mllib.recommendation.ALS looks feels more robust version
>> and I am experimenting with it.
>>
>> References here are Jellyfish, Twitter's implementation of Jellyfish
>> called Scalafish, Google paper called Sparkler and similar idea put forward
>> by IBM paper by Gemulla et al. (large-scale matrix factorization with
>> distributed stochastic gradient descent)
>>
>> https://github.com/azymnis/scalafish
>>
>> Are there any plans of adding DSGD in Spark or there are any existing
>> JIRA ?
>>
>> Thanks.
>> Deb
>>
>>
>

Re: Spark Matrix Factorization

Posted by Ameet Talwalkar <am...@eecs.berkeley.edu>.
Hi Deb,

Thanks for your email.  We currently do not have a DSGD implementation in
MLlib. Also, just to clarify, DSGD is not a variant of ALS, but rather a
different algorithm for solving the same bi-convex objective function.

It would be a good thing to add, but to the best of my knowledge, no one
is actively working on this right now.

Also, as you mentioned, the ALS implementation in mllib is more
robust/scalable than the one in spark.examples.
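
For concreteness, the shared objective can be written down in a few lines
(a toy Python sketch, not MLlib code; the factor values and ratings below
are made up):

```python
# Toy sketch of the regularized squared-error objective that both ALS and
# DSGD minimize:
#   f(U, V) = sum over observed (i, j) of (r_ij - u_i . v_j)^2
#             + lam * (||U||_F^2 + ||V||_F^2)
# f is non-convex jointly in (U, V), but convex in U with V held fixed and
# vice versa -- the bi-convex structure that ALS exploits by alternating.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def objective(U, V, ratings, lam):
    err = sum((r - dot(U[i], V[j])) ** 2 for i, j, r in ratings)
    reg = lam * sum(dot(row, row) for row in U + V)
    return err + reg

U = [[1.0, 0.5], [0.2, 1.0]]                       # user factors (made up)
V = [[0.8, 0.1], [0.3, 0.9]]                       # item factors (made up)
ratings = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 0.5)]  # observed (i, j, r_ij)

print(round(objective(U, V, ratings, 0.1), 4))  # -> 2.1806
```

ALS minimizes f by solving a least-squares problem for U with V fixed,
then for V with U fixed; DSGD descends the same f with stochastic
gradient steps on individual ratings.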

-Ameet


On Thu, Jan 2, 2014 at 3:16 PM, Debasish Das <de...@gmail.com>wrote:

> Hi,
>
> I am not noticing any DSGD implementation of ALS in Spark.
>
> There are two ALS implementations.
>
> org.apache.spark.examples.SparkALS does not run on large matrices and
> seems more like a demo code.
>
> org.apache.spark.mllib.recommendation.ALS looks feels more robust version
> and I am experimenting with it.
>
> References here are Jellyfish, Twitter's implementation of Jellyfish
> called Scalafish, Google paper called Sparkler and similar idea put forward
> by IBM paper by Gemulla et al. (large-scale matrix factorization with
> distributed stochastic gradient descent)
>
> https://github.com/azymnis/scalafish
>
> Are there any plans of adding DSGD in Spark or there are any existing JIRA
> ?
>
> Thanks.
> Deb
>
>
