Posted to dev@mahout.apache.org by Saikat Kanjilal <sx...@hotmail.com> on 2014/04/30 16:44:32 UTC

Helping out on spark efforts

Sebastian/Dmitriy, in looking through the current list of issues I didn't see
any other Mahout algorithms being discussed for porting to Spark. I was
wondering if there's any interest/need in porting or writing things like
LR/KMeans/SVM to use Spark; I'd like to help out in this area while working
on 1490. Also, are we planning to port the distributed versions of Taste to
Spark at some point?
Thanks in advance.

RE: Helping out on spark efforts

Posted by Saikat Kanjilal <sx...@hotmail.com>.
> The aim of this issue is to get the initial design right for a slim, but 
> powerful dataframe. I talked about narrowing the scope w.r.t. these 
> features that you proposed:
> 
> * "transactional operations between the dataFrame and a remote database"
> * introduction of "a generalized abstraction around a query==could 
> represent a sql/nosql query or an hdfs query"
> 
***Understood, will try to narrow down the scope even further and remove
these, assuming no one else is interested. I certainly understand the
sentiment of getting something rolling with as narrow a focus as possible to
make this successful; I'll keep working on the proposal based on your
feedback and send periodic updates.

> Date: Sun, 4 May 2014 17:26:39 +0200
> From: ssc@apache.org
> To: dev@mahout.apache.org
> Subject: Re: Helping out on spark efforts
> 
> Saikat,
> 
> The aim of this issue is to get the initial design right for a slim, but 
> powerful dataframe. I talked about narrowing the scope w.r.t. these 
> features that you proposed:
> 
>   * "transactional operations between the dataFrame and a remote database"
>   * introduction of "a generalized abstraction around a query==could 
> represent a sql/nosql query or an hdfs query"
> 
> At this point, I would veto any patch that tries to address these things.
> 
> --sebastian

Re: Helping out on spark efforts

Posted by Sebastian Schelter <ss...@apache.org>.
Saikat,

The aim of this issue is to get the initial design right for a slim, but 
powerful dataframe. I talked about narrowing the scope w.r.t. these 
features that you proposed:

  * "transactional operations between the dataFrame and a remote database"
  * introduction of "a generalized abstraction around a query==could 
represent a sql/nosql query or an hdfs query"

At this point, I would veto any patch that tries to address these things.

--sebastian


On 05/04/2014 04:31 PM, Saikat Kanjilal wrote:
> I'll add the example associated with MAHOUT-1518 in the integration API
> section. To be clear, per the initial feedback I tried to "narrow the scope"
> of this effort by adding more examples around the dplyr and mltables
> functionality that I felt would be relevant to the concept of a dataframe.
> Are there other things missing from the APIs I am suggesting? Would love to
> add them in one fell swoop :))). I wouldn't necessarily say the introduction
> of a connection to a remote datasource and manipulating its contents inside
> a dataframe is distracting; in fact dplyr is doing that now, and I think it
> might be useful to take an RDD in the context of Spark and bring a subset of
> it into a dataframe by applying a set of functions on top of it.
> Keep the feedback coming as you guys look through the API.


RE: Helping out on spark efforts

Posted by Saikat Kanjilal <sx...@hotmail.com>.
I'll add the example associated with MAHOUT-1518 in the integration API
section. To be clear, per the initial feedback I tried to "narrow the scope"
of this effort by adding more examples around the dplyr and mltables
functionality that I felt would be relevant to the concept of a dataframe.
Are there other things missing from the APIs I am suggesting? Would love to
add them in one fell swoop :))). I wouldn't necessarily say the introduction
of a connection to a remote datasource and manipulating its contents inside a
dataframe is distracting; in fact dplyr is doing that now, and I think it
might be useful to take an RDD in the context of Spark and bring a subset of
it into a dataframe by applying a set of functions on top of it.
Keep the feedback coming as you guys look through the API.
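
One way that last idea could look, as a rough Scala sketch: the DataFrame
wrapper below is made up, standing in for whatever MAHOUT-1490 ends up
defining.

  import org.apache.spark.rdd.RDD

  // Hypothetical stand-in for the proposed DataFrame: it just wraps an RDD
  // of typed rows; only a fromRDD constructor is sketched here.
  case class DataFrame[T](rdd: RDD[T])
  object DataFrame { def fromRDD[T](rdd: RDD[T]): DataFrame[T] = DataFrame(rdd) }

  case class Interaction(timestamp: Long, userId: String, itemId: String, action: String)

  // Take a raw interaction RDD, keep only the "like" events, and bring that
  // subset into a frame by applying ordinary RDD transformations first.
  def likesAsFrame(raw: RDD[Interaction]): DataFrame[Interaction] =
    DataFrame.fromRDD(raw.filter(_.action == "like"))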

> Date: Sun, 4 May 2014 13:20:03 +0200
> From: ssc@apache.org
> To: dev@mahout.apache.org
> Subject: Re: Helping out on spark efforts
> 
> I think we should concentrate on getting the core functionality right, 
> and test that on a few examples. We should narrow the scope of this and 
> avoid getting distracted by thinking about adding something that 
> generalizes NoSQL queries or so...
> 
> One thing that I would like to see is an example of how to handle the 
> input for a cooccurrence-based recommender in MAHOUT-1518.
> 
> Say the raw data looks like this:
> 
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> ...
> 
> 
> What we want in the end is two DRMs with int keys having users as rows 
> and items as columns. One DRM should contain all the views, the other 
> all the likes (e.g. for every userIdString, itemIdString pair present, 
> there is a 1 in the corresponding cell of the matrix).
> 
> The result of the cooccurrence analysis is a set of int-keyed item-item 
> matrices. We should be able to map the int keys back to the original 
> itemIdStrings.
> 
> Would love to see how that example would look in your proposed DataFrame.
> 
> 
> --sebastian

Re: Helping out on spark efforts

Posted by Sebastian Schelter <ss...@apache.org>.
I think we should concentrate on getting the core functionality right, 
and test that on a few examples. We should narrow the scope of this and 
avoid getting distracted by thinking about adding something that 
generalizes NoSQL queries or so...

One thing that I would like to see is an example of how to handle the 
input for a cooccurrence-based recommender in MAHOUT-1518.

Say the raw data looks like this:

timestamp1, userIdString1, itemIdString1, "view"
timestamp2, userIdString2, itemIdString1, "like"
...


What we want in the end is two DRMs with int keys having users as rows 
and items as columns. One DRM should contain all the views, the other 
all the likes (e.g. for every userIdString, itemIdString pair present, 
there is a 1 in the corresponding cell of the matrix).

The result of the cooccurrence analysis is a set of int-keyed item-item 
matrices. We should be able to map the int keys back to the original 
itemIdStrings.

Would love to see how that example would look in your proposed DataFrame.


--sebastian
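
As a strawman (not from the thread), here is the plumbing in plain Spark,
with RDDs of (row, col) cells standing in for the DRMs; the proposed
DataFrame would presumably hide most of this, and all names below are made up.

  import org.apache.spark.rdd.RDD

  // Build int dictionaries for the string ids, then split the interactions
  // into one 0/1 user-item matrix per action, each kept as an RDD of
  // (userInt, itemInt) cells. Wrapping the cells into actual DRMs is left out.
  def toIndexedMatrices(raw: RDD[(Long, String, String, String)])
    : (Map[String, RDD[(Int, Int)]], RDD[(Int, String)]) = {

    val userDict = raw.map(_._2).distinct().zipWithIndex().mapValues(_.toInt)
    val itemDict = raw.map(_._3).distinct().zipWithIndex().mapValues(_.toInt)

    // translate (userIdString, itemIdString, action) into (userInt, itemInt, action)
    val cells = raw.map { case (_, u, i, a) => (u, (i, a)) }
      .join(userDict)
      .map { case (_, ((i, a), uIdx)) => (i, (uIdx, a)) }
      .join(itemDict)
      .map { case (_, ((uIdx, a), iIdx)) => (a, (uIdx, iIdx)) }

    // one matrix per action; a present (user, item) pair means a 1 in that cell
    val perAction = Seq("view", "like").map { action =>
      action -> cells.filter(_._1 == action).values.distinct()
    }.toMap

    // reverse item dictionary, to map the int keys back to the itemIdStrings
    (perAction, itemDict.map(_.swap))
  }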



On 05/04/2014 07:17 AM, Saikat Kanjilal wrote:
> Me again :), added a subset of the definitions from the dplyr functionality
> to the integration API section as promised; examples include
> compute/filter/chain etc. My next steps will be adding concrete examples
> underneath each of the newly created Integration APIs. At a high level, here
> are the domain objects I am thinking will need to exist and be referenced in
> the DataFrame world:
> DataFrame (self explanatory)
> Query (a generalized abstraction around a query==could represent a sql/nosql
> query or an hdfs query)
> RDD (an important domain object that could be returned by one or more of our
> APIs)
> Destination (a remote data source, could be a table / a location in hdfs etc)
> Connection (a remote database connection used to perform transactional
> operations between the dataFrame and a remote database)
> Had an additional thought: might we at some point want to operate on
> matrices and mathematically perform operations between matrices and
> dataFrames? Would love to hear from committers as to whether this may be
> useful, and I can add APIs around this as well.
> One thing that I've also been pondering is whether and how to handle errors
> in any of these APIs. One thought I had was to introduce a generalized error
> object that can be reused across all of the APIs, maybe something that
> contains a message and an error code or something similar; an alternative
> idea is to leverage something already existing in the Spark bindings if
> possible.
> Would love for folks to take a look through the APIs as I expand them and
> add more examples, and to leave comments on the JIRA ticket. Also, since the
> stuff around slicing/CRUD functionality for dataFrames is pretty commonly
> understood, I'm thinking I may take those examples out and put in more
> examples around the APIs for dplyr and mltables.
> Blog: http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
> JIRA: https://issues.apache.org/jira/browse/MAHOUT-1490
>
> Regards


RE: Helping out on spark efforts

Posted by Saikat Kanjilal <sx...@hotmail.com>.
Me again :), added a subset of the definitions from the dplyr functionality
to the integration API section as promised; examples include
compute/filter/chain etc. My next steps will be adding concrete examples
underneath each of the newly created Integration APIs. At a high level, here
are the domain objects I am thinking will need to exist and be referenced in
the DataFrame world:
DataFrame (self explanatory)
Query (a generalized abstraction around a query==could represent a sql/nosql query or an hdfs query)
RDD (an important domain object that could be returned by one or more of our APIs)
Destination (a remote data source, could be a table / a location in hdfs etc)
Connection (a remote database connection used to perform transactional operations between the dataFrame and a remote database)
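
One possible shape for those objects, as a rough Scala sketch; every name and
signature below is hypothetical, not an agreed-upon Mahout API.

  import org.apache.spark.rdd.RDD

  case class Row(values: Map[String, Any])

  trait Query                             // could be a sql/nosql query or an hdfs query
  case class SqlQuery(sql: String) extends Query
  case class HdfsQuery(path: String) extends Query

  trait Destination                       // a table, a location in hdfs, etc.

  trait DataFrame {
    def filter(p: Row => Boolean): DataFrame
    def toRDD: RDD[Row]                   // an RDD can be returned by our APIs
  }

  trait Connection {                      // transactional ops against a remote database
    def execute(q: Query): DataFrame
    def write(df: DataFrame, dest: Destination): Unit
  }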
Had an additional thought: might we at some point want to operate on matrices
and mathematically perform operations between matrices and dataFrames? Would
love to hear from committers as to whether this may be useful, and I can add
APIs around this as well.
One thing that I've also been pondering is whether and how to handle errors
in any of these APIs. One thought I had was to introduce a generalized error
object that can be reused across all of the APIs, maybe something that
contains a message and an error code or something similar; an alternative
idea is to leverage something already existing in the Spark bindings if
possible.
Would love for folks to take a look through the APIs as I expand them and add
more examples, and to leave comments on the JIRA ticket. Also, since the
stuff around slicing/CRUD functionality for dataFrames is pretty commonly
understood, I'm thinking I may take those examples out and put in more
examples around the APIs for dplyr and mltables.
Blog: http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
JIRA: https://issues.apache.org/jira/browse/MAHOUT-1490

Regards



> From: sxk1969@hotmail.com
> To: dev@mahout.apache.org
> Subject: RE: Helping out on spark efforts
> Date: Sat, 3 May 2014 10:09:51 -0700
> 
> I've taken a stab at adding a subset of the functionality used by MLTable
> operators into the integration API section of the blog, on top of the R
> CRUD functionality I listed earlier. Please review and let me know your
> thoughts; I will be tackling the dplyr functionality next and adding that
> in. The blog is linked below; again, please see the integration API section
> for details:
> 
> http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
> 
> Look forward to hearing comments either on the list or on the jira ticket itself:
> https://issues.apache.org/jira/browse/MAHOUT-1490
> Thanks in advance.
> 

RE: Helping out on spark efforts

Posted by Saikat Kanjilal <sx...@hotmail.com>.
I've taken a stab at adding a subset of the functionality used by MLTable
operators into the integration API section of the blog, on top of the R CRUD
functionality I listed earlier. Please review and let me know your thoughts;
I will be tackling the dplyr functionality next and adding that in. The blog
is linked below; again, please see the integration API section for details:

http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html

Look forward to hearing comments either on the list or on the jira ticket itself:
https://issues.apache.org/jira/browse/MAHOUT-1490
Thanks in advance.

> Date: Wed, 30 Apr 2014 17:13:52 +0200
> From: ssc@apache.org
> To: ted.dunning@gmail.com; dev@mahout.apache.org
> Subject: Re: Helping out on spark efforts
> 
> I think getting the design right for MAHOUT-1490 is tough. Dmitriy 
> suggested updating the design example to Scala code and trying to work in 
> things that fit from R's dplyr and from MLTable. I'd love to see such a 
> design doc.
> 
> --sebastian
> 

Re: Helping out on spark efforts

Posted by Sebastian Schelter <ss...@apache.org>.
I think getting the design right for MAHOUT-1490 is tough. Dmitriy 
suggested updating the design example to Scala code and trying to work in 
things that fit from R's dplyr and from MLTable. I'd love to see such a 
design doc.

--sebastian

On 04/30/2014 05:02 PM, Ted Dunning wrote:
> +1 for foundations first.
>
> There are bunches of algorithms just behind that.  K-means.  SGD+Adagrad
> regression.  Autoencoders.  K-sparse encoding.  Lots of stuff.


Re: Helping out on spark efforts

Posted by Ted Dunning <te...@gmail.com>.
+1 for foundations first.

There are bunches of algorithms just behind that.  K-means.  SGD+Adagrad
regression.  Autoencoders.  K-sparse encoding.  Lots of stuff.



On Wed, Apr 30, 2014 at 4:52 PM, Sebastian Schelter <ss...@apache.org> wrote:

> I think you should concentrate on MAHOUT-1490; it is a highly important
> task that will be the foundation for a lot of stuff to be built on top.
> Let's focus on getting this thing right and then move on to other things.
>
> --sebastian
>

Re: Helping out on spark efforts

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Apr 30, 2014 at 9:24 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> To be a bit more specific, here's roughly what happens, assuming we have a
> column named "C1":
>
> (1) assess the levels and their number (in the R sense, aka R's "factor" type)
> (2) assume there are n total levels (i.e. distinct categories); assign each
> level, except one, to n-1 Bernoulli features named according to a certain
> convention, e.g. "C1_<level-name-prefix>".
> (3) repeat that for all categorical variables in the data frame.
> (4) generate the final dataframe, executing the category mappings
> established in (2) and (3) (set a predictor to 1 if the current categorical
> value matches the predictor's).
> (5) compute the resulting data frame summaries (mean, variance, quartiles).
>
> Seems simple enough, but how would it look?
>

Sounds good.  Minor nit in that 1-of-n coding should be allowed as well.

I would also expect that we could do random hashing encoding as well.

A similar problem statement is possible for values that are textual, in
addition to categorical.  The process is essentially the same in that you
have 0 or 1 passes to optionally agree on a dictionary and then another
pass to encode into n columns.
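
A tiny sketch of that hashing idea in plain Scala (nothing Mahout-specific,
all names made up): no dictionary pass at all, each value is hashed straight
into one of n columns, and collisions are the accepted trade-off.

  def hashEncode(values: Seq[String], n: Int): Array[Double] = {
    val v = new Array[Double](n)
    for (value <- values) {
      val idx = ((value.hashCode % n) + n) % n   // non-negative column index
      v(idx) += 1.0
    }
    v
  }

  // Categorical column: one value per row. Textual column: tokenize first
  // and encode every token of the row the same way.
  val encoded = hashEncode(Seq("C1=red") ++ "some textual value".split(" ").map("text=" + _), 1024)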

Re: Helping out on spark efforts

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Wed, Apr 30, 2014 at 11:42 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I would also suggest taking some guinea pigs to validate stuff.
>
> E.g., if I may make a suggestion, let's see how we'd do categorical
> variable vectorization into predictor variables in our would-be language
> here.
>

To be a bit more specific, here's roughly what happens, assuming we have a
column named "C1":

(1) assess the levels and their number (in the R sense, aka R's "factor" type)
(2) assume there are n total levels (i.e. distinct categories); assign each
level, except one, to n-1 Bernoulli features named according to a certain
convention, e.g. "C1_<level-name-prefix>".
(3) repeat that for all categorical variables in the data frame.
(4) generate the final dataframe, executing the category mappings established
in (2) and (3) (set a predictor to 1 if the current categorical value matches
the predictor's).
(5) compute the resulting data frame summaries (mean, variance, quartiles).

Seems simple enough, but how would it look?
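
One way steps (1)-(4) could look as a plain Spark sketch, standing in for the
would-be dataframe language; all names are hypothetical, and the step (5)
summaries are omitted.

  import org.apache.spark.rdd.RDD

  // Dummy-code a single categorical column "C1". Feature names follow the
  // "C1_<level>" convention; one level is dropped, giving n-1 Bernoulli
  // (0/1) predictor variables.
  def dummyEncode(c1: RDD[String]): (Seq[String], RDD[Array[Double]]) = {
    val levels   = c1.distinct().collect().sorted    // (1) assess the levels
    val retained = levels.drop(1)                    // (2) all levels except one
    val names    = retained.map("C1_" + _).toSeq
    val index    = retained.zipWithIndex.toMap
    val rows = c1.map { value =>                     // (4) map each row
      val v = new Array[Double](retained.length)
      index.get(value).foreach(i => v(i) = 1.0)      // 1 if the level matches
      v
    }
    (names, rows)
  }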


Re: Helping out on spark efforts

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I would also suggest taking some guinea pigs to validate stuff.

E.g., if I may make a suggestion, let's see how we'd do categorical variable
vectorization into predictor variables in our would-be language here.



Re: Helping out on spark efforts

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Wed, Apr 30, 2014 at 10:53 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> +1.
>
> And the greatest benefit of the data frames work is standardization of feature
> extraction in Mahout, not necessarily any particular algorithms. This has
> been the thorniest issue historically, and nobody does it well today as it
> stands.
>

Correction: nobody does it well in open source and in a distributed way, that
is.


> If we tackle feature prep techniques in an engine-agnostic way, this would
> be a truly unique differentiating factor for Mahout.
>

Re: Helping out on spark efforts

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
+1.

And the greatest benefit of the data frames work is standardization of feature
extraction in Mahout, not necessarily any particular algorithms. This has
been the thorniest issue historically, and nobody does it well today as it
stands. If we tackle feature prep techniques in an engine-agnostic way, this
would be a truly unique differentiating factor for Mahout.



On Wed, Apr 30, 2014 at 7:52 AM, Sebastian Schelter <ss...@apache.org> wrote:

> I think you should concentrate on MAHOUT-1490; it is a highly important
> task that will be the foundation for a lot of stuff to be built on top.
> Let's focus on getting this thing right and then move on to other things.
>
> --sebastian
>

Re: Helping out on spark efforts

Posted by Sebastian Schelter <ss...@apache.org>.
I think you should concentrate on MAHOUT-1490; it is a highly 
important task that will be the foundation for a lot of stuff to be 
built on top. Let's focus on getting this thing right and then move on 
to other things.

--sebastian

On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
> Sebastian/Dmitriy, in looking through the current list of issues I didn't
> see any other Mahout algorithms being discussed for porting to Spark. I was
> wondering if there's any interest/need in porting or writing things like
> LR/KMeans/SVM to use Spark; I'd like to help out in this area while working
> on 1490. Also, are we planning to port the distributed versions of Taste to
> Spark at some point?
> Thanks in advance.
>