Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2015/12/30 20:50:46 UTC

Some test results

As many of you know, Mahout-Samsara includes an interesting and important extension to cooccurrence similarity that supports cross-cooccurrence and log-likelihood downsampling. Combined with a search engine, this gives us a multimodal recommender. Some of us integrated Mahout with a DB and search engine to create what we call (humbly) the Universal Recommender.
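
For anyone new to the approach, here is a toy sketch of the query side, for illustration only: the per-item indicator fields are really indexed in a search engine and the user’s recent history is the query; the in-memory overlap scoring below is just a stand-in to show the shape of the data (all ids are made up).

// Toy stand-in: in the real system these fields live in a search engine and the
// match is a weighted OR-query, not this loop.
object MultimodalQuerySketch extends App {

  // One "document" per item; each field holds the ids that the cross-cooccurrence
  // calculation found significant for that item.
  val itemDocs: Map[String, Map[String, Set[String]]] = Map(
    "video-1" -> Map(
      "like"          -> Set("video-7", "video-42"),
      "dislike-genre" -> Set("genre-horror")),
    "video-2" -> Map(
      "like"          -> Set("video-42"),
      "dislike-genre" -> Set("genre-western")))

  // The query is the user's recent history, one field per indicator type.
  val userHistory: Map[String, Set[String]] = Map(
    "like"          -> Set("video-42"),
    "dislike-genre" -> Set("genre-horror"))

  // Crude overlap score per item across all indicator fields.
  val scored = itemDocs.map { case (item, fields) =>
    val score = userHistory.map { case (field, ids) =>
      fields.getOrElse(field, Set.empty[String]).intersect(ids).size
    }.sum
    item -> score
  }

  scored.toSeq.sortBy(-_._2).foreach { case (item, s) => println(s"$item score=$s") }
}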

We just completed a tool that measures the effect of what we call secondary events or indicators using the Universal Recommender. It calculates a ranking-based precision metric called mean average precision, MAP@k. We took a dataset of “fresh” and “rotten” reviews from the Rotten Tomatoes web site and combined it with data about the genres, casts, directors, and writers of the various video items. This gave us the indicators below:
like, video-id <== primary indicator
dislike, video-id
like-genre, genre-id
dislike-genre, genre-id
like-director, director-id
dislike-director, director-id
like-writer, writer-id
dislike-writer, writer-id
like-cast, cast-member-id
dislike-cast, cast-member-id
These aren’t necessarily the indicators we would have chosen if we were designing something from scratch, but they are what we could gather from public data.
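
For illustration, each indicator above arrives as a stream of (user, indicator, target-id) events; the shape below is hypothetical, not the tool’s actual input format.

// Hypothetical event shape, for illustration only.
object EventExample extends App {
  case class Event(user: String, indicator: String, targetId: String)

  val events = Seq(
    Event("reviewer-42", "like", "video-1137"),            // primary indicator
    Event("reviewer-42", "dislike-genre", "genre-horror"),
    Event("reviewer-7",  "like-director", "director-981"))

  events.foreach(println)
}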

This dataset has only ~5000 mostly professional reviewers and ~250k video items, but we are integrating a larger one. We are also writing a white paper and blog post with some deeper analysis; there are several tidbits of insight when you look deeper.

The bottom line is that, using most of the above indicators, we were able to get a 26% increase in MAP@1 over using only “like”. This is important because the vast majority of recommenders can really only ingest one type of indicator.
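
For reference, here is a minimal toy sketch of the MAP@k calculation (one common formulation; not the tool’s exact code).

// Minimal MAP@k sketch; toy data only.
object MapAtK extends App {
  // Average precision at k for one user: precision at each rank where a
  // held-out relevant item appears, averaged over min(k, #relevant).
  def averagePrecisionAtK(recommended: Seq[String], relevant: Set[String], k: Int): Double = {
    if (relevant.isEmpty) return 0.0
    var hits = 0
    var sumPrec = 0.0
    recommended.take(k).zipWithIndex.foreach { case (item, i) =>
      if (relevant.contains(item)) {
        hits += 1
        sumPrec += hits.toDouble / (i + 1)
      }
    }
    sumPrec / math.min(k, relevant.size)
  }

  // MAP@k is the mean of AP@k over users.
  def mapAtK(perUser: Seq[(Seq[String], Set[String])], k: Int): Double =
    perUser.map { case (recs, held) => averagePrecisionAtK(recs, held, k) }.sum / perUser.size

  // Toy data: (recommendations, held-out "like" items) for two users.
  val data = Seq(
    (Seq("v1", "v2", "v3"), Set("v1", "v3")),
    (Seq("v9", "v4", "v2"), Set("v2")))

  println(mapAtK(data, k = 1)) // MAP@1 = 0.5 on this toy data
}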

http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
https://github.com/actionml/template-scala-parallel-universal-recommendation

Re: Some test results

Posted by Suneel Marthi <sm...@apache.org>.
👍👏


Re: Some test results

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Nice!

using root LLR

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I understand the eyeball method, but I’m not sure users will, so I am working on a t-digest calculation of an LLR threshold. The goal is to maintain a certain sparsity at maximum “quality”. But I have a few questions.

You mention root LLR; OK, but that will produce negative numbers. I assume:

1) We should use the absolute value of root LLR for ranking when applying the max # of indicators. There seems to be no value in computing sqrt( |rootLLR| ) since the ranking will not change, but we can’t just use the value returned by the Java root LL function directly.
2) Likewise, we use the absolute value of root LLR to compare with the threshold. Put another way, without using the absolute value, a value fails the LLR threshold test when mean - threshold < value < mean + threshold.
3) However, both the positive and negative root LLR values would be fed into the t-digest quantile calculation, whose distribution would ideally have mean = 0.

This seems simple, but I’m just checking my understanding: are these correct?
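
To make sure I’m describing the same thing, here is a rough sketch of the shape I have in mind. It assumes Mahout’s LogLikelihood.rootLogLikelihoodRatio and the createAvlTreeDigest/add/quantile calls from the t-digest library; the counts, compression, and quantile choices are placeholders, not a proposal.

import org.apache.mahout.math.stats.LogLikelihood
import com.tdunning.math.stats.TDigest

object RootLlrThresholdSketch extends App {
  // Toy contingency counts (k11, k12, k21, k22) for a few candidate indicators.
  val candidates = Seq(
    (13L, 1000L, 1000L, 100000L),
    (1L, 1000L, 1000L, 10000L),   // cooccurs less than expected -> negative root LLR
    (100L, 900L, 900L, 98100L))

  // Signed root LLR straight from Mahout (question 3: the signed values feed the digest).
  val signed = candidates.map { case (k11, k12, k21, k22) =>
    LogLikelihood.rootLogLikelihoodRatio(k11, k12, k21, k22)
  }

  val digest = TDigest.createAvlTreeDigest(100.0)
  signed.foreach(v => digest.add(v))

  // Placeholder: a symmetric threshold from the outer quantiles of the
  // (roughly zero-mean) signed distribution.
  val threshold = math.max(math.abs(digest.quantile(0.025)), digest.quantile(0.975))

  // Questions 1 and 2: rank and threshold-test on the absolute value, i.e. a value
  // is rejected when mean - threshold < value < mean + threshold.
  val kept = signed.filter(v => math.abs(v) >= threshold).sortBy(v => -math.abs(v))
  println(s"threshold=$threshold, kept ${kept.size} of ${signed.size}")
}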


On Jan 2, 2016, at 3:17 PM, Ted Dunning <td...@maprtech.com> wrote:


I usually like to use a combination of a fixed threshold on LLR plus a max number of indicators.

The fixed threshold I use is typically around 20-30 for raw LLR, which corresponds to about 5 for root LLR. I often eyeball the lists of indicators for items that I understand, to find the point where the list becomes about half noise and half useful indicators.

On Sat, Jan 2, 2016 at 2:15 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
One interesting thing we saw is that like-genre was better discarded, while dislike-genre was better left in the mix.

This brings up a fundamental issue with how we use LLR to downsample in Mahout. By downsampling I mean llr(A’B), where we keep some max number of indicators per item based on the best LLR scores. For the primary action, something like “buy”, this works well since there are usually quite a lot of items, but B may have very few items; genre is an example. Using the same max # of indicators for A’A as well as all the rest (A’B, etc.) means that very little, if any, downsampling based on LLR score happens for A’B. So for A’B the result is really more like simple cross-cooccurrence.

This seems worth addressing, if only because in our analysis the effect made like-genre useless, when intuition says it should be useful. Our hypothesis is that since no downsampling happened, and a great many of the reviewers liked nearly all of the genres, the indicator had no differentiating value. If we had set the per-item max indicators to some smaller number, this might have left only strongly correlated like-genre indicators.

Assuming I’ve identified the issue correctly, the options I can think of are:
1) Use a fixed numeric LLR threshold for A’B or any other cross-cooccurrence indicator. This seems pretty impractical.
2) Add a max-indicators param for each of the secondary indicators. This would be fairly easy and could be based on the # of B items. Some method of choosing this might end up being ~100 for A’A (the default) and a function of the # of items in B, C, etc. The plus is that this would be easy and keep the calculation at O(n), but the function that returns 100 for A and some smaller number for B, C, and the rest is not clear to me (a sketch follows this list).
3) Create a threshold based on the distribution of llr(A’B). This could be based on a correlation confidence (actually, for LLR, a confidence of non-correlation). The downside is that we would need to calculate all of llr(A’B), which approaches O(n^2), and then downsample the complete llr(A’B). This removes the rather significant practical benefit of the current downsampling algorithm. Practically speaking, most indicators will either have dimensionality on the order of the # of A items or will be very much smaller, like the # of genres. So maybe calculating the distribution of llr(A’B) wouldn’t be too bad if it were only done when B has a small number of items. In the small-B case it would be O(n*m), where m is the number of items in B, n is the number of items in A, and m << n, so this would be nearly O(n). This could also be mixed with #2 and only calculated occasionally, since it probably won’t change much in any one application.
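
To make option 2 concrete, here is a purely hypothetical sketch; the cap function below is invented, and choosing it well is exactly the part that isn’t clear to me.

object PerIndicatorDownsampleSketch {

  // Hypothetical cap: the A'A default (~100), shrunk for low-cardinality
  // secondary indicators such as genre. Purely illustrative.
  def maxIndicatorsFor(numBItems: Int, default: Int = 100): Int =
    math.min(default, math.max(1, numBItems / 10))

  // The usual per-item downsampling applied to one row of llr(A'B):
  // keep only the top-N candidate indicators by |root LLR|.
  def downsample(rowScores: Map[String, Double], numBItems: Int): Map[String, Double] = {
    val n = maxIndicatorsFor(numBItems)
    rowScores.toSeq.sortBy { case (_, s) => -math.abs(s) }.take(n).toMap
  }
}

With something like this, genre (tens of items) would be capped at a handful of indicators per item instead of the A’A default of ~100.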

I guess I’d be inclined to test by trying a range of max # of indicators on our test data, since the number of genres is small. If some setting produces significantly better results, we could then try the confidence method and see whether it lets us calculate the optimal #. If so, we could implement this for very occasional calculation on live datasets.

Any advice?

> On Dec 30, 2015, at 2:26 PM, Ted Dunning <tdunning@maprtech.com> wrote:
>
> This is really nice work!

