Posted to user@mahout.apache.org by Pat Ferrel <pa...@gmail.com> on 2014/05/27 17:08:24 UTC

Indicator Matrix and Mahout + Solr recommender

I was talking with Ken Krugler off list about the Mahout + Solr recommender and he had an interesting request. 

When calculating the indicator/item similarity matrix using ItemSimilarityJob there is a --threshold option. Wouldn't it be better to have an option that specified the fraction of values kept in the entire matrix based on their similarity strength? This is very difficult to do with --threshold. It would be like expressing the threshold as a fraction of the total number of values rather than as a strength value. It seems like this would have the effect of tossing the least interesting similarities, whereas limiting per item (--maxSimilaritiesPerItem) could easily toss some of the most interesting.

At the very least it seems like a better way of expressing the threshold, doesn't it?
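To make the difference concrete, here is a small self-contained sketch (Scala, made-up scores; none of this is Mahout API) contrasting the existing strength threshold with the proposed keep-a-fraction cutoff:

// Hypothetical illustration (not Mahout code): contrast a fixed strength
// threshold with a keep-top-fraction cutoff over all similarity values.
object FractionVsThreshold extends App {
  // (itemA, itemB) -> similarity strength, e.g. LLR scores (made up)
  val sims = Map(
    ("a", "b") -> 9.2, ("a", "c") -> 0.4, ("b", "c") -> 3.1,
    ("b", "d") -> 7.7, ("c", "d") -> 1.5, ("a", "d") -> 5.0)

  // --threshold style: keep pairs whose strength clears a fixed bar.
  def byThreshold(t: Double) = sims.filter { case (_, s) => s >= t }

  // Proposed style: keep the strongest `fraction` of all values,
  // whatever strengths those turn out to be.
  def byFraction(fraction: Double) = {
    val keep = math.max(1, (sims.size * fraction).toInt)
    sims.toSeq.sortBy(-_._2).take(keep)
  }

  println(byThreshold(5.0)) // sensitive to the scale of the scores
  println(byFraction(0.5))  // scale-free: always keeps the top half
}

The point the sketch makes: a useful value for a strength threshold depends on the hard-to-guess scale of the scores, while a fraction is scale-free.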

Re: Indicator Matrix and Mahout + Solr recommender

Posted by Pat Ferrel <pa...@gmail.com>.
The input user preferences are boolean, 1 or 0 (actually 0 is not used, just assumed if not present). User ratings usually introduce more questions than answers. Most commercial recommenders do not care about predicting a user's rating; even Netflix doesn't really care about that anymore. Boolean preferences work very well for _ranking_ recommendations, which is the important thing.

This project uses intermediate non-boolean values for various things, but the input is expected to be the items a user preferred; we assume no preference for everything else. It is exactly the indicator matrix and input type described in "Practical Machine Learning: Innovations in Recommendation". The project has several additions to what is discussed in the book (like cross-action indicators) and does not include Solr integration, which is assumed to be handled by your preferred app architecture.

The project creates CSV text files that you can use as you wish. They can be indexed directly by Solr or loaded via some other method. For example, the demo site integrates Solr by putting the indicator matrix into a DB (MongoDB) and having Solr index it from there.
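For anyone following along, here is a minimal sketch (made-up users and items, not the project's code) of how boolean preferences become item-item cooccurrence counts, the entries of A'A from which the indicator scores are derived:

// A minimal sketch of how boolean preferences yield item-item
// cooccurrence counts, the raw material for indicators.
object Cooccurrence extends App {
  // user -> set of items the user preferred (boolean prefs; absence = 0)
  val prefs = Map(
    "u1" -> Set("ipad", "iphone"),
    "u2" -> Set("ipad", "iphone", "galaxy"),
    "u3" -> Set("galaxy", "nexus"),
    "u4" -> Set("iphone", "nexus"))

  // For every user, every unordered pair of their items cooccurs once;
  // summing over users gives the entries of A'A (diagonal excluded here).
  val cooc = prefs.values
    .flatMap(items => items.toSeq.sorted.combinations(2))
    .toSeq
    .groupBy(identity)
    .map { case (pair, occs) => pair -> occs.size }

  cooc.foreach { case (pair, n) => println(pair.mkString(" & ") + s" -> $n users") }
}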

On Jun 6, 2014, at 10:00 AM, Xavier Rampino <xr...@senscritique.com> wrote:

I have a related question about the Indicator Matrix. Is it possible to compute it using quantitative ratings, or maybe just good ratings taken as a single action (user 1 "liked" product 1)? I am referring to "Practical Machine Learning: Innovations in Recommendation", where you say that "The best choice of data may surprise you—it’s not user ratings [...]".

So basically, was this recommender designed specifically not for quantitative ratings, or is this just an empirical observation that visits work better than ratings for producing an indicator matrix that leads to the best recommendations?




Re: Indicator Matrix and Mahout + Solr recommender

Posted by Xavier Rampino <xr...@senscritique.com>.
I have a related question about the Indicator Matrix. Is it possible to compute it using quantitative ratings, or maybe just good ratings taken as a single action (user 1 "liked" product 1)? I am referring to "Practical Machine Learning: Innovations in Recommendation", where you say that "The best choice of data may surprise you—it’s not user ratings [...]".

So basically, was this recommender designed specifically not for quantitative ratings, or is this just an empirical observation that visits work better than ratings for producing an indicator matrix that leads to the best recommendations?



Re: Indicator Matrix and Mahout + Solr recommender

Posted by Ted Dunning <te...@gmail.com>.
I really think that a hard limit on the number of indicators is just fine. The points that I have seen raised regarding this include:

a) this doesn't limit the total size of the indicator matrix.

I agree with this. It doesn't. And it shouldn't. It does limit the size per item, which is really better for operational use.

b) an average would be better

Why? The hard limit winds up limiting almost all items to exactly the limit, which means the cap is very nearly the average anyway.
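This is easy to check with a toy simulation (illustrative numbers only, not drawn from any real catalog): when most items have more candidate similarities than the cap, clipping every list at k leaves the realized average very close to k.

// A toy check of the claim that a hard per-item cap behaves almost like
// an average: most items have far more candidate similarities than the
// cap, so nearly all of them get clipped to exactly k.
object CapVsAverage extends App {
  val rng = new scala.util.Random(42)
  val cap = 100

  // Hypothetical candidate-list sizes: 95% of items have hundreds of
  // candidate similar items; 5% are sparse and fall under the cap.
  val candidateCounts = Seq.fill(10000) {
    if (rng.nextDouble() < 0.05) rng.nextInt(cap) else cap + rng.nextInt(900)
  }

  val kept = candidateCounts.map(math.min(_, cap))
  println(f"average kept per item: ${kept.sum.toDouble / kept.size}%.1f (cap = $cap)")
}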





Re: Indicator Matrix and Mahout + Solr recommender

Posted by Pat Ferrel <pa...@gmail.com>.
That's what I thought; it's also why the total number of indicators is not limitable, right?

For the Spark version, should we allow something like an average number of indicators per item? We will only be supporting LLR with that, and as Ted and Ken point out, that is the interesting thing to limit. It will mean a non-trivial bit of added processing if specified, obviously.
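Since LLR is the one similarity that will be supported, here is a minimal sketch of the score itself in the usual G² (log-likelihood ratio) formulation over the 2x2 contingency counts for one item pair. The example counts are made up, and this follows the standard formulation rather than any particular Mahout source file:

object Llr extends App {
  def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x.toDouble)

  // unnormalized entropy of a list of counts: N*log(N) - sum of x*log(x)
  def entropy(counts: Long*): Double = xLogX(counts.sum) - counts.map(xLogX).sum

  // G² over the 2x2 table: k11 = users with both items, k12/k21 = users
  // with only one of the two, k22 = users with neither
  def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy    = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    2.0 * (rowEntropy + columnEntropy - matrixEntropy)
  }

  // made-up counts: 13 users saw both items, 1000 saw only one or the
  // other, 100000 saw neither
  println(llr(13, 1000, 1000, 100000))
}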



Re: Indicator Matrix and Mahout + Solr recommender

Posted by Sebastian Schelter <ss...@apache.org>.
I added the threshold merely as a way to increase the performance of RowSimilarityJob. If a threshold is given, some item pairs don't need to be looked at. A simple example: if you use cooccurrence count as the similarity measure and set a threshold of n cooccurrences, then any pair containing an item with fewer than n interactions can be ignored. IIRC similar techniques are implemented for cosine and Jaccard.
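To illustrate that pruning (a sketch with made-up counts, not RowSimilarityJob code): with cooccurrence count as the similarity, cooccur(x, y) can never exceed min(count(x), count(y)), so any item with fewer than n interactions can be dropped before any pairwise comparison happens.

object ThresholdPruning extends App {
  val n = 3 // cooccurrence threshold

  // item -> number of users who interacted with it (made-up counts)
  val interactions = Map("a" -> 10, "b" -> 2, "c" -> 5, "d" -> 1)

  // cooccur(x, y) <= min(count(x), count(y)), so items below the
  // threshold can be dropped before forming any pairs
  val candidates = interactions.filter { case (_, c) => c >= n }.keys.toSeq.sorted

  val pairsToScore = candidates.combinations(2).toList
  println(s"pairs left to score: $pairsToScore") // only List(a, c) survives
}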

Best,
Sebastian





Re: Indicator Matrix and Mahout + Solr recommender

Posted by Pat Ferrel <pa...@gmail.com>.
> 
> On May 27, 2014, at 8:15 AM, Ted Dunning <te...@gmail.com> wrote:
> 
> The threshold should not normally be used in the Mahout+Solr deployment
> style.

Understood, and that's why an alternative way of specifying a cutoff may be a good idea.

> 
> This need is better supported by specifying the maximum number of
> indicators.  This is mathematically equivalent to specifying a fraction of
> values, but is more meaningful to users since good values for this number
> are pretty consistent across different uses (50-100 are reasonable values
> for most needs; larger values are quite plausible).

I assume you mean 50-100 as the average number per item.

The total for the entire indicator matrix is what Ken was asking for. But I was thinking about the use with itemsimilarity, where the user may not know the dimensionality, since itemsimilarity assembles the matrix from individual prefs. The user probably knows the number of items in their catalog, but the indicator matrix dimensionality is arbitrarily smaller.

Currently the help reads:
--maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem    try to cap the number of similar items per  item to this number  (default: 100)  

If this were actually the average # per item it would do what you describe, but it looks like it's a literal cutoff per vector in the code.

A cutoff based on the highest scores in the entire matrix seems to imply a sort when the total is larger than the average would allow, and I don't see an obvious sort being done in the MR code.

Anyway, it looks like we could do this by:
1) total number of values in the matrix (what Ken was asking for). This requires that the user know the dimensionality of the indicator matrix to be very useful.
2) average number per item (what Ted describes). This seems the most intuitive and does not require that the dimensionality be known.
3) fraction of the values. This might be useful if you are more interested in downsampling by score; at least it seems more useful than --threshold as it is today, but maybe I'm missing some use cases? Is there really a need for a hard score threshold? (A sketch of all three follows below.)
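A sketch of what the three options would do over a toy scored matrix held as per-item lists of (similarItem, score); the data and the option framing are hypothetical, as none of these exist yet:

object Downsampling extends App {
  // item -> (similarItem, score), e.g. LLR scores (made up)
  val matrix: Map[String, Seq[(String, Double)]] = Map(
    "a" -> Seq(("b", 9.2), ("c", 0.4), ("d", 5.0)),
    "b" -> Seq(("a", 9.2), ("c", 3.1), ("d", 7.7)),
    "c" -> Seq(("a", 0.4), ("b", 3.1), ("d", 1.5)))

  // keep the k highest-scoring values across the whole matrix
  def keepTop(k: Int) = {
    val flat = matrix.toSeq.flatMap { case (i, ss) => ss.map { case (j, s) => (i, j, s) } }
    flat.sortBy(-_._3).take(k)
  }

  // 1) total number of values in the matrix (Ken's request)
  println(keepTop(4))

  // 2) average number per item (Ted's framing): total = avg * #items
  val avgPerItem = 2
  println(keepTop(avgPerItem * matrix.size))

  // 3) fraction of all the values (the original proposal)
  val fraction = 0.5
  val total = matrix.valuesIterator.map(_.size).sum
  println(keepTop((total * fraction).toInt))
}

Worth noting: all three reduce to a global top-k and differ only in how k is derived, which is why the average-per-item form is the easiest to specify without knowing the matrix dimensions.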



Re: Indicator Matrix and Mahout + Solr recommender

Posted by Ted Dunning <te...@gmail.com>.
The threshold should not normally be used in the Mahout+Solr deployment
style.

This need is better supported by specifying the maximum number of
indicators.  This is mathematically equivalent to specifying a fraction of
values, but is more meaningful to users since good values for this number
are pretty consistent across different uses (50-100 are reasonable values
for most needs; larger values are quite plausible).
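Back-of-envelope arithmetic (illustrative numbers only) on why the per-item cap and the global fraction name the same budget:

object CapFraction extends App {
  val items = 50000              // catalog size (hypothetical)
  val cap = 100                  // max indicators per item
  val candidates = 2000L * items // total candidate similarity values (hypothetical)
  val keptFraction = (cap.toLong * items).toDouble / candidates
  println(f"cap $cap per item ~= keeping ${keptFraction * 100}%.1f%% of all values")
}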



