Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/03/26 17:05:02 UTC

Mahout on Spark

New name for a new thread.

A lot of the discussion on MAHOUT-1464 has been around integrating that feature with the Scala DSL. As Saikat says, this is of general interest since people seem to agree that this is a good place to integrate efforts.

I’m interested in what I think Dmitriy called data frames. Being a complete noob on Spark I may have gotten this wrong but let me take a shot so he can correct me.

There are a lot of problems that require a pipeline. The text input pipeline is an example, but almost any input to Mahout requires at least an id translation step. What I thought Dmitriy was suggesting was that by avoiding the disk write + read between steps we might get significant speedups. This has many implications, I’m sure.

For one, I think it means the non-serialized objects are shared by multiple parts of the pipeline and so are not subject to “translation”.
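To make the idea concrete, here is a minimal sketch (in Python, purely for illustration; the real thing would chain Spark RDD transformations through the Scala DSL) of a pipeline whose stages hand objects directly to one another instead of writing and re-reading intermediate results on disk. The stage names and the tiny id-translation step are invented for the example.

```python
# Illustrative pipeline: each stage lazily consumes the previous one,
# so no intermediate results are serialized to disk between steps.

def tokenize(docs):
    # docs: iterable of (doc_id, text)
    for doc_id, text in docs:
        yield doc_id, text.lower().split()

def translate_ids(records, dictionary):
    # Map external string tokens to dense integer ids,
    # extending the shared dictionary as new tokens appear.
    for doc_id, tokens in records:
        ids = [dictionary.setdefault(t, len(dictionary)) for t in tokens]
        yield doc_id, ids

def run_pipeline(docs):
    dictionary = {}
    # The stages are chained in memory -- no write + read between them.
    result = list(translate_ids(tokenize(docs), dictionary))
    return result, dictionary
```

For example, `run_pipeline([("d1", "Mahout on Spark"), ("d2", "Spark on Mesos")])` assigns ids as tokens are first seen, and the second document reuses the ids already assigned for "spark" and "on" without any serialization round trip.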

Dmitriy can you explain more? You mentioned a talk you have given, do you have slides somewhere or a PDF?


On Mar 26, 2014, at 7:15 AM, Ted Dunning <te...@gmail.com> wrote:

It would be great to have you.


(go ahead and start new threads when appropriate ... better than hijacking)


On Wed, Mar 26, 2014 at 6:00 AM, Hardik Pandya <sm...@gmail.com>wrote:

> Sorry to hijack the thread,
> 
> this seems like the first steps of getting mahout to work on spark
> 
> there are similar efforts going on with R+Spark aka Spark R
> 
> not sure if this helps; I played with the spark ec2 scripts and they bring up a
> multinode cluster using mesos and it's configurable - willing to contribute
> donations for mahout-dev
> 
> 
> 
> 
> 
> On Sun, Mar 23, 2014 at 11:22 PM, Saikat Kanjilal (JIRA) <jira@apache.org
>> wrote:
> 
>> 
>>    [
>> 
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944710#comment-13944710
> ]
>> 
>> Saikat Kanjilal commented on MAHOUT-1464:
>> -----------------------------------------
>> 
>> +1 on Andrew's suggestion on using AWS to do this. Andrew, is it possible
>> to have a shared account so mahout contributors can use this? I'd even be
>> willing to chip in donations :) to have a shared AWS account
>> 
>>> RowSimilarityJob on Spark
>>> -------------------------
>>> 
>>>                Key: MAHOUT-1464
>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>>>            Project: Mahout
>>>         Issue Type: Improvement
>>>         Components: Collaborative Filtering
>>>   Affects Versions: 0.9
>>>        Environment: hadoop, spark
>>>           Reporter: Pat Ferrel
>>>             Labels: performance
>>>            Fix For: 1.0
>>> 
>>>        Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
>> MAHOUT-1464.patch
>>> 
>>> 
>>> Create a version of RowSimilarityJob that runs on Spark. Ssc has a
>> prototype here: https://gist.github.com/sscdotopen/8314254. This should
>> be compatible with Mahout Spark DRM DSL so a DRM can be used as input.
>>> Ideally this would extend to cover MAHOUT-1422 which is a feature
>> request for RSJ on two inputs to calculate the similarity of rows of one
>> DRM with those of another. This cross-similarity has several applications
>> including cross-action recommendations.
>> 
>> 
>> 
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
>> 
> 

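The RowSimilarityJob described in the JIRA excerpt above computes pairwise similarity between the rows of a matrix, and MAHOUT-1422 extends that to cross-similarity between the rows of two matrices. A toy sketch of the computation (plain Python with cosine similarity; the actual job supports other similarity measures and runs distributed over a DRM):

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense row vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def row_similarity(a, b=None):
    # b is None  -> self-similarity of a's rows (RowSimilarityJob).
    # b is given -> cross-similarity of a's rows vs b's rows (MAHOUT-1422).
    b = a if b is None else b
    return [[cosine(u, v) for v in b] for u in a]
```

For cross-action recommendations, `a` and `b` would be interaction matrices for two different user actions, so `row_similarity(a, b)` relates items of one action to items of the other.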

RE: Mahout on Spark

Posted by Saikat Kanjilal <sx...@hotmail.com>.
Created some placeholders for the first two pieces:
https://issues.apache.org/jira/browse/MAHOUT-1489
https://issues.apache.org/jira/browse/MAHOUT-1490

@Dmitry feel free to add some more descriptions/use cases onto these. I'll read through the spark description and have some more questions for you around these.


Re: Mahout on Spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
No, we probably don't want to create them unless we have someone to assign
them to. You are more than welcome to create one if you want to take a stab
at any of those.

-d



RE: Mahout on Spark

Posted by Saikat Kanjilal <sx...@hotmail.com>.
@Dmitry Are there JIRA items created for the wanted pieces? I'd like to volunteer to take on the shell and the R bindings; should I create JIRA items for these?


Re: Mahout on Spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Sure.

@Saikat et al:

Check out the "Wanted" section at
http://mahout.apache.org/users/sparkbindings/home.html.

Of course, data frames and vectorization (feature prep) standardization are
very high priority there.
Another high priority is an interactive shell / scripting runner (just like
the spark shell) -- something very similar in spirit to R's interactive and
script runner modes. It is very important.

Re: data frames. Anyone familiar with R knows what they are: basically a set
of named columnar vectors (with rows named or enumerated as well), plus a set
of filtering/modifying DSL expressions similar to R's (I haven't really
thought about it in depth). The tricky part here is in-core data frame
support, of course, since data frames are based on vectors that go beyond
just the real (double) values we have right now. In R, vector values can be
integral, boolean and character (i.e. string) types as well. If we had
in-core support for that (or borrowed it from somewhere), the rest would be
easy -- it is just a matter of semantic elegance. Really, I suggest looking
at R's paradigms there; it is a pretty elegant way to work with closures.

Of course we could use off-the-shelf stuff such as Maps to support something
named, with string values. I don't know at this point. Scala itself comes a
long way to help out here.
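The R-style structure described above -- named columnar vectors whose values may be numeric, boolean, or string, with R-like filtering via closures -- can be sketched in a few lines. This is purely a toy illustration (in Python; the class and method names are invented, not a proposed Mahout design):

```python
# Toy in-core "data frame": a set of equal-length named columnar vectors.
# Columns may hold heterogeneous types (floats, booleans, strings),
# and rows can be filtered with a closure, R-style.

class Frame:
    def __init__(self, **columns):
        lengths = {len(v) for v in columns.values()}
        assert len(lengths) <= 1, "all columns must have the same length"
        self.columns = {name: list(vals) for name, vals in columns.items()}

    def filter(self, predicate):
        # Keep rows for which predicate(row_dict) is true,
        # returning a new Frame with the same columns.
        names = list(self.columns)
        rows = [dict(zip(names, vals)) for vals in zip(*self.columns.values())]
        kept = [r for r in rows if predicate(r)]
        return Frame(**{n: [r[n] for r in kept] for n in names})
```

For example, `Frame(user=["a", "b", "c"], score=[1.0, 2.5, 0.5], active=[True, False, True])` mixes string, double and boolean columns, and `frame.filter(lambda r: r["active"] and r["score"] >= 1.0)` is the closure-based row selection the R paradigm makes elegant.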

As for slides, they are of little interest in themselves since they mostly
re-interpret and summarize the working notes pdf in a bit more palatable way.
It is just an opportunity to deliver some content to folks who shy away from
reading docs for some reason *wink wink*. I will put them on the site after
the meetup if that is ok.





RE: Mahout on Spark

Posted by Saikat Kanjilal <sx...@hotmail.com>.
+1. In fact I would be very much indebted if someone (namely Dmitry :) ) could do a google hangout focused on spark where folks can ask questions and learn more. To this end I want to bring up something else: it'd be great if mahout itself, either through the apache project foundation or through committer means, had a hadoop cluster to test algorithms. It seems like folks have their own clusters to test on, but I think it'd be a benefit to the community to have a cluster that everyone can leverage.

> Subject: Mahout on Spark
> From: pat@occamsmachete.com
> Date: Wed, 26 Mar 2014 09:05:02 -0700
> To: dev@mahout.apache.org; dlieu.7@gmail.com
> 
> New name for a new thread.
> 
> A lot of the discussion on MAHOUT-1464 has been around integrating that feature with the Scala DSL. As Saikat says this is of general interest since people seem to agree that this is a good place to integrate efforts.
> 
> I’m interested in what I think Dmitriy called data frames. Being a complete noob on Spark I may have gotten this wrong but let me take a shot so he can correct me.
> 
> There are a lot of problems that require a pipeline. The text input pipeline is an example, but almost any input to Mahout requires at least an id translation step. What I thought Dmitriy was suggesting was that by avoiding the disk write + read between steps we might get significant speedups. This has many implications, I’m sure.
> 
> For one thing, I think it means that non-serialized objects are shared by multiple parts of the pipeline and so are not subject to “translation”.
> 
> Dmitriy can you explain more? You mentioned a talk you have given, do you have slides somewhere or a PDF?
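To make the pipeline idea above concrete, here is a minimal sketch in plain Scala. Ordinary Scala collections stand in for Spark RDDs, and all names are hypothetical; the point is only that each step consumes the previous step's in-memory output, so nothing round-trips through HDFS between stages.

```scala
// Hypothetical sketch: plain Scala collections standing in for Spark RDDs.
object PipelineSketch {

  // Step 1: the "id translation" step -- map string keys to dense integer ids.
  def translateIds(rows: Seq[(String, Seq[Double])])
      : (Map[String, Int], Seq[(Int, Seq[Double])]) = {
    val ids = rows.map(_._1).distinct.zipWithIndex.toMap
    (ids, rows.map { case (key, vec) => (ids(key), vec) })
  }

  // Step 2: normalize each row vector to unit length.
  def normalize(rows: Seq[(Int, Seq[Double])]): Seq[(Int, Seq[Double])] =
    rows.map { case (id, vec) =>
      val n = math.sqrt(vec.map(x => x * x).sum)
      (id, if (n == 0.0) vec else vec.map(_ / n))
    }

  def main(args: Array[String]): Unit = {
    val input = Seq("user-a" -> Seq(3.0, 4.0), "user-b" -> Seq(0.0, 2.0))
    // The steps are chained in memory; there is no intermediate
    // serialize-to-disk + read-back between them.
    val (dictionary, translated) = translateIds(input)
    val normalized = normalize(translated)
    println(dictionary)
    println(normalized)
  }
}
```

In a real Spark pipeline the same shape holds, with RDD transformations in place of the collection calls, and an explicit cache() where an intermediate result is reused.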
> 
> 
> On Mar 26, 2014, at 7:15 AM, Ted Dunning <te...@gmail.com> wrote:
> 
> It would be great to have you.
> 
> 
> (go ahead and start new threads when appropriate ... better than hijacking)
> 
> 
> On Wed, Mar 26, 2014 at 6:00 AM, Hardik Pandya <sm...@gmail.com>wrote:
> 
> > Sorry to hijack the thread,
> > 
> > this seems like the first steps of getting Mahout to work on Spark
> > 
> > there are similar efforts going on with R+Spark aka Spark R
> > 
> > not sure if this helps; I played with the spark-ec2 scripts, and they bring
> > up a multinode cluster using Mesos, and it's configurable - willing to
> > contribute donations for mahout-dev
> > 
> > 
> > 
> > 
> > 
> > On Sun, Mar 23, 2014 at 11:22 PM, Saikat Kanjilal (JIRA) <jira@apache.org
> >> wrote:
> > 
> >> 
> >>    [
> >> 
> > https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944710#comment-13944710
> > ]
> >> 
> >> Saikat Kanjilal commented on MAHOUT-1464:
> >> -----------------------------------------
> >> 
> >> +1 on Andrew's suggestion on using AWS to do this. Andrew, is it possible
> >> to have a shared account so Mahout contributors can use this? I'd even
> > be
> >> willing to chip in donations :) to have a shared AWS account
> >> 
> >>> RowSimilarityJob on Spark
> >>> -------------------------
> >>> 
> >>>                Key: MAHOUT-1464
> >>>                URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> >>>            Project: Mahout
> >>>         Issue Type: Improvement
> >>>         Components: Collaborative Filtering
> >>>   Affects Versions: 0.9
> >>>        Environment: hadoop, spark
> >>>           Reporter: Pat Ferrel
> >>>             Labels: performance
> >>>            Fix For: 1.0
> >>> 
> >>>        Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
> >> MAHOUT-1464.patch
> >>> 
> >>> 
> >>> Create a version of RowSimilarityJob that runs on Spark. Ssc has a
> >>> prototype here: https://gist.github.com/sscdotopen/8314254. This
> >>> should be compatible with the Mahout Spark DRM DSL so that a DRM can
> >>> be used as input.
> >>> Ideally this would extend to cover MAHOUT-1422, which is a feature
> >>> request for RSJ on two inputs, to calculate the similarity of rows of
> >>> one DRM with those of another. This cross-similarity has several
> >>> applications, including cross-action recommendations.
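To illustrate what row similarity and cross-similarity compute, here is a plain-Scala sketch of the math, not Mahout's implementation; cosine is used here, though it is only one of the similarity measures RSJ supports.

```scala
// Hypothetical sketch of the row-similarity math, using cosine similarity.
object RowSimilaritySketch {
  type Matrix = Array[Array[Double]]

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  private def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // Cosine similarity of every row of a against every row of b.
  // With b == a this is ordinary row similarity (MAHOUT-1464);
  // with two different inputs it is the cross-similarity of MAHOUT-1422,
  // e.g. rows of a "view" matrix against rows of a "purchase" matrix.
  def rowSimilarity(a: Matrix, b: Matrix): Matrix =
    a.map(ra => b.map(rb => dot(ra, rb) / (norm(ra) * norm(rb))))
}
```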
> >> 
> >> 
> >> 
> >> --
> >> This message was sent by Atlassian JIRA
> >> (v6.2#6252)
> >> 
> > 
>