Posted to dev@mahout.apache.org by Олег Зотов <ol...@gmail.com> on 2015/02/03 20:57:29 UTC

Extending spark-itemsimilarity for calculating multiple cross-indicators

Hello.
I'm developing a recommendation system using Mahout on Spark (1.0 snapshot). In
the process I have found that the spark-itemsimilarity driver does not allow
processing more than two action types. After reading the documentation, I
found that I should either run it multiple times or use the
SimilarityAnalysis.cooccurrence API. But running it multiple times is not
efficient, and writing Java/Scala code is not always convenient.

Furthermore, in the sources of ItemSimilarityDriver.scala (line 217) I
found this comment: "// todo: allow more than one cross-similarity matrix?"

This is my first experience working with open source, and I've heard that
writing here before creating an issue is preferred. So my question is: what
about extending the spark-itemsimilarity driver API with something like this?
mahout spark-itemsimilarity --main-filter purchase --secondary-filter
view,addToCart,like
(other parameters omitted)
The result would be one indicator matrix and a set of cross-indicator
matrices (one for each secondary action).

If this would be a helpful feature, I'll implement it.

P.S. Sorry for my poor English; it is not my native language.

Regards, Oleg.

Re: Extending spark-itemsimilarity for calculating multiple cross-indicators

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS To run the Mahout shell, one can use

MASTER=<master> mahout/bin/mahout spark-shell

The syntax to load scripts is retained from the Scala shell.

Ideally one also needs stuff like MAHOUT_OPTS=-Xmx5g, but as I mentioned it
is broken right now; you can do a quick hack.
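
For example, a minimal session might look like this (the master URL and the
script path are placeholders, and this assumes the shell's mahout> prompt):

MASTER=local[2] mahout/bin/mahout spark-shell
mahout> :load /path/to/indicators.scala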

On Tue, Feb 3, 2015 at 12:06 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> [Dmitriy's 12:06 PM reply and Oleg's original message quoted; the full
> texts appear elsewhere in this thread.]

Re: Extending spark-itemsimilarity for calculating multiple cross-indicators

Posted by Pat Ferrel <pa...@occamsmachete.com>.
BTW if you want to try it out quickly, the CLI can be run for each pair. This recalculates A’A multiple times but requires less node memory and no code changes.

Run it once for every A & B input where B is one of the secondary actions.
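
For example, against a single three-column action log, two such runs might
look like this (flag names as I recall them from the driver's help; the
column numbers and file names are illustrative, so verify with
mahout spark-itemsimilarity --help):

mahout spark-itemsimilarity --input actions.csv --output view-indicators \
    --filter1 purchase --filter2 view \
    --rowIDColumn 0 --filterColumn 1 --itemIDColumn 2

mahout spark-itemsimilarity --input actions.csv --output cart-indicators \
    --filter1 purchase --filter2 addToCart \
    --rowIDColumn 0 --filterColumn 1 --itemIDColumn 2

Each run writes the primary indicator matrix plus one cross-indicator
matrix, so only the A’B output is new after the first run.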


On Feb 3, 2015, at 12:33 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

[Pat’s 12:33 PM message and the thread it quotes elided; the full text follows below.]

Re: Extending spark-itemsimilarity for calculating multiple cross-indicators

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Yes, multiple cross-cooccurrence indicators are fully supported by the API.

Whether you write your own app/driver or use the shell, you can pass in as many inputs as you need. The driver CLI is already too complicated.

Passing a script to the shell doesn’t require creating a whole project, but it offers limited debugging capabilities. The script shouldn’t be too complicated, though.

I’m doing this for a client now. If you want to input tuples <userID, itemID> from separate files or directories of part-xxxx files, you can use TDIndexedDatasetReader#elementReader, which does a parallel read of csv-type text files and creates an IndexedDataset from each.

Pass these in to SimilarityAnalysis.cooccurrencesIDSs. It takes the primary action and a list of secondary actions and returns a list of indicators as IndexedDatasets. You can then use TDIndexedDatasetWriter to do parallel writes, creating directories full of csv part-xxxxx files for each indicator matrix.
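
A rough Scala sketch of that flow for the shell (names follow Pat’s
description; readElementsFrom, writeTo, the constructor arguments, and the
implicit-context handling are assumptions to verify against the snapshot
source):

// A sketch only (untested): read one element file per action, compute the
// indicators, and write one directory per indicator matrix. Assumes a Mahout
// spark-shell session, where a distributed context is implicitly in scope.
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.indexeddataset.Schema
import org.apache.mahout.drivers.{TextDelimitedIndexedDatasetReader, TextDelimitedIndexedDatasetWriter}

val readSchema = new Schema()  // defaults; set delimiters/columns for your csv layout
val reader     = new TextDelimitedIndexedDatasetReader(readSchema)

// Primary action first. N.B. the real driver reuses the primary's row (userID)
// dictionary when reading the secondary actions so that rows line up; that
// step is elided here.
val purchase = reader.readElementsFrom("purchase/")
val view     = reader.readElementsFrom("view/")
val cart     = reader.readElementsFrom("addToCart/")

// Returns one IndexedDataset per input: indicators(0) is A'A for purchase,
// the rest are cross-indicators A'B, one per secondary action.
val indicators = SimilarityAnalysis.cooccurrencesIDSs(Array(purchase, view, cart))

val writer = new TextDelimitedIndexedDatasetWriter(writeSchema)  // writeSchema: see below
for ((ids, name) <- indicators.zip(Seq("purchase", "view", "addToCart")))
  writer.writeTo(ids, s"indicators/$name")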

If you are going straight into a search engine, make sure to set omitScore in the schema. The LLR cooccurrence score is really only needed for downsampling; the search engine will re-weight the indicators using TF-IDF, which is good for recs.
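
For instance, using the keys of the default text-delimited write schema (the
key names here are an assumption; check them against the Schema defaults in
your build):

import org.apache.mahout.math.indexeddataset.Schema

val writeSchema = new Schema(
  "rowKeyDelim"           -> "\t",
  "columnIdStrengthDelim" -> ":",
  "elementDelim"          -> " ",
  "omitScore"             -> true)  // drop LLR strengths; the engine re-weights with TF-IDF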

Caveat emptor: doing more than one secondary input has not been thoroughly tested, but since I’m doing that myself you will get fast support. Also remember that IndexedDatasets keep HashMaps in memory on each cluster machine. There will be one per userID and itemID collection, so you need enough memory on each node to hold them.

Let me know how it goes; I’ll be doing the same thing.

On Feb 3, 2015, at 12:06 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

[Dmitriy’s 12:06 PM message and Oleg’s original message quoted; see Dmitriy’s reply below.]

Re: Extending spark-itemsimilarity for calculating multiple cross-indicators

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Tue, Feb 3, 2015 at 11:57 AM, Олег Зотов <ol...@gmail.com> wrote:

> [Oleg’s original message quoted: the driver handles at most two action
> types, multiple runs are inefficient, and writing Java/Scala code is not
> always convenient.]

Don't you think writing a script for the Spark shell is better for this type
of stuff? IDEA would give you full Scala syntax support even for Scala scripts.

(One problem with the shell is that there's a bug where the MAHOUT_OPTS
environment variable doesn't work for adjusting Spark application specifics
with -D...).
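
For reference, this is the kind of invocation the bug breaks (the particular
Spark property is only an illustration of the -D style meant here):

MAHOUT_OPTS="-Dspark.executor.memory=4g" mahout spark-shell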


> [The rest of Oleg’s message quoted: the todo comment at line 217 of
> ItemSimilarityDriver.scala, the proposed --main-filter/--secondary-filter
> extension, and the P.S. apologizing for his English.]
Your English is pretty decent, actually. Nothing to apologize for, IMO.
