Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/05/13 01:29:44 UTC

RowSimilarity

I tried an experiment running RowSimilarity with 16 docs of short
quotations on a similar subject. It looks to me that, using Tanimoto, the
largest pair-wise distance allowed for the similar docs was 0.4. Though
I asked for 10 similar docs I got 0 to 10. I see this same effect with
larger data sets but haven't seen an obvious cut-off point.

I was expecting to be able to make the decision about cut-off distance 
myself. In other words I was expecting to always get 20 similar docs 
when I asked for 20. It is useful to see what docs are at larger distances.

How is RowSimilarity deciding when to cut off the returned docs?


Re: RowSimilarity

Posted by Suneel Marthi <su...@yahoo.com>.
The consider() method in the distance measure (Tanimoto in your scenario) is the one that does the cut-off.
Almost all of the similarity measures have some implementation of consider() to cut off the returned results.

Have a look at Sebastian's explanation in https://issues.apache.org/jira/browse/MAHOUT-803.





Re: RowSimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Sorry, but I'm still confused. So the similarity magnitude has nothing to
do with one of Mahout's distance measures? The similarity class is used
only to specify the algorithm used to calculate this magnitude and does
not imply a connection between distance and similarity? I'm now a bit
unsure about how to read my results.

  * Using Tanimoto, for example, is a value of 0.0001 more similar than a
    value of 0.9? This seems to fit my results even though below you say
    "A lack of (term) cooccurrences is equivalent to a similarity of 0".
  * Is there a description somewhere of what the similarity magnitude
    describes?

Thanks,
Pat

On 5/14/12 10:35 AM, Sebastian Schelter wrote:
> "The cutoff is made based on lack of term cooccurrences not the distance
> measure."
>
> I'd rather use the term 'similarity measure', not 'distance measure',
> as a lot of the measures implemented are not metrics and the term
> 'distance' might be misleading.
>
> A lack of (term) cooccurrences is equivalent to a similarity of 0 by
> definition, therefore the "default cutoff" is also based on the
> similarity measure.
>
> --sebastian

Re: RowSimilarity

Posted by Sebastian Schelter <ss...@googlemail.com>.
"The cutoff is made based on lack of term cooccurrences not the distance
measure."

I'd rather use the term 'similarity measure', not 'distance measure',
as a lot of the measures implemented are not metrics and the term
'distance' might be misleading.

A lack of (term) cooccurrences is equivalent to a similarity of 0 by
definition, therefore the "default cutoff" is also based on the
similarity measure.
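
To make the reading of the values concrete, here is a minimal sketch in
plain Python (not Mahout code). It assumes the job reports a similarity,
so a larger value means "more alike" and 0 means "nothing in common"; a
cosine distance, by contrast, is commonly defined as 1 minus that
similarity. The function name and the toy documents are made up for
illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical term vectors: 'near' shares two terms with 'doc', 'far' shares none.
doc = {"cat": 1.0, "dog": 1.0}
near = {"cat": 1.0, "dog": 1.0, "fish": 1.0}
far = {"bird": 1.0}

sim_near = cosine_similarity(doc, near)  # high similarity
sim_far = cosine_similarity(doc, far)    # no shared terms, similarity 0

# The corresponding distances run the other way: small means alike.
dist_near = 1.0 - sim_near
dist_far = 1.0 - sim_far
```

Read as similarities, a value of 0.9 is more alike than 0.0001; read as
distances, the ordering is reversed.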

--sebastian




Re: RowSimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Thanks, this is quite clear and reasonable. The cutoff is made based on 
lack of term cooccurrences not the distance measure. The optional 
'threshold' is based on the distance measure.

BTW I assume the 'distance' returned is expressed in the distance
measure's units? So, using cosine as a distance measure, a value near 0
is actually quite similar because the measure is 1-(cosine of the angle
between the vectors)?


Re: RowSimilarity

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Pat,

RowSimilarityJob allows the use of a lot of different similarity
measures (cosine, Jaccard coefficient, number of cooccurrences, etc.),
all of which compute a single number for a pair of vectors that denotes
how similar they are. All these measures have the characteristic that two
vectors that do not share at least one non-zero value in a single
dimension are considered not similar (have similarity 0).
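
As an illustration of that property, here is a small sketch in plain
Python (not Mahout's implementation) of two such measures on term sets:
the Tanimoto/Jaccard coefficient and the cooccurrence count. The function
names and the example documents are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two term sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def cooccurrences(a, b):
    """Number of dimensions (terms) in which both rows are non-zero."""
    return len(set(a) & set(b))

# Hypothetical documents reduced to their sets of terms.
doc1 = {"time", "flies", "arrow"}
doc2 = {"time", "heals", "wounds"}
doc3 = {"fruit", "banana"}

shared = tanimoto(doc1, doc2)    # one shared term out of five distinct terms
disjoint = tanimoto(doc1, doc3)  # no shared term, so the similarity is 0
```

Both measures return 0 for doc1 and doc3, which is why such a pair never
shows up in the output regardless of how many similar rows were requested.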

In general, an all-pairs comparison, as it is conducted by
RowSimilarityJob, has quadratic complexity and is therefore not scalable.

If we have sparse data such as text or ratings, however, we can exploit
the fact that we only need to compare pairs which share at least one
non-zero value in a dimension. This is the basic idea RowSimilarityJob
uses to avoid an all-pairs comparison.
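
The idea can be sketched on a single machine with an inverted index from
columns to rows. This is only an illustration under that assumption; it
is not how the MapReduce job is actually structured, and the function
name and toy data are made up.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(rows):
    """Return only the row pairs that share at least one non-zero column.

    rows maps a row id to the set of column ids where the row is non-zero.
    """
    # Invert the data: for every column, list the rows that use it.
    index = defaultdict(list)
    for row_id, cols in rows.items():
        for col in cols:
            index[col].append(row_id)

    # Rows that appear together under some column are candidates.
    pairs = set()
    for row_ids in index.values():
        for a, b in combinations(sorted(row_ids), 2):
            pairs.add((a, b))
    return pairs

# Hypothetical sparse rows: 4 rows give 6 possible pairs in total.
rows = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"d"}, 3: {"c", "d"}}
pairs = candidate_pairs(rows)
```

Here only three of the six possible pairs are ever compared; the other
three share no column and have similarity 0 by definition.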

In some real-world use cases you will furthermore encounter a lot of
pairs with near-zero similarities that are of little value to you. To
be able to avoid computing these, RowSimilarityJob provides the option
to specify a minimum threshold so that it ignores pairs with a
similarity value below this threshold. This threshold is data-dependent
and you have to find it experimentally.
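
One way to explore a threshold is to count how many pairs survive at a
few candidate values. The sketch below is plain Python with made-up
similarity values; treating the boundary as inclusive (>=) is an
assumption, not something stated in this thread.

```python
def surviving_pairs(similarities, threshold):
    """Keep only the pairs whose similarity reaches the threshold."""
    return {pair: s for pair, s in similarities.items() if s >= threshold}

# Hypothetical pairwise similarities, e.g. from a run without a threshold.
similarities = {
    (0, 1): 0.02,
    (0, 2): 0.35,
    (1, 2): 0.08,
    (1, 3): 0.60,
    (2, 3): 0.01,
}

# Sweep a few candidate thresholds and record how many pairs remain.
counts = {t: len(surviving_pairs(similarities, t)) for t in (0.0, 0.05, 0.1, 0.5)}
```

Looking at where the count drops sharply is one simple way to pick a
starting value for a given data set.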

--sebastian




Re: RowSimilarity

Posted by Pat Ferrel <pa...@farfetchers.com>.
To paraphrase:

There is some internal threshold to be considered 'similar'. This is the 
one supplied with the 'threshold' option mentioned below and I need to 
do a special build to get this option activated? I assume it is not 
active because it has not been tested well?

So currently how is the threshold calculated? How can I determine its 
value? Can I vote that this be activated as an optional parameter in the 
future?

I ask this in part because I want to use RowSimilarity in an experiment 
to do something like a non-partitioning hierarchical clustering where 
I'll need to find close centroids in clusters calculated with different 
levels of specificity.


Re: RowSimilarity

Posted by Sebastian Schelter <ss...@apache.org>.
This could be simply due to the fact that there are fewer similar docs
than the number specified in 'maxSimilaritiesPerRow'.

consider() is only invoked if a threshold was specified.

Best,
Sebastian




Re: RowSimilarity

Posted by Suneel Marthi <su...@yahoo.com>.
Pat's question was that he was seeing fewer documents than specified by 'maxSimilaritiesPerRow'. This could be happening due to the 'consider' functionality of the applied similarity measure.




Re: RowSimilarity

Posted by Sebastian Schelter <ss...@apache.org>.
The option 'maxSimilaritiesPerRow' determines the maximum number of
similar docs/items/rows per row. It depends on your data whether there
are enough similar rows per row, so you can't always get 20 similar docs.

The option 'threshold' determines the minimum similarity value for a
pair of docs (otherwise it will be dropped). This option is not
activated by default, however.
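
How the two options interact can be sketched as follows. This is plain
Python, not the job's code: the function and parameter names are
hypothetical stand-ins for 'maxSimilaritiesPerRow' and 'threshold', and
the similarity values are invented.

```python
import heapq

def most_similar(sims_for_row, max_similarities_per_row, threshold=None):
    """Return up to max_similarities_per_row (row, similarity) pairs.

    Pairs with similarity 0 are never candidates; if a threshold is given,
    pairs below it are dropped as well.
    """
    kept = [
        (s, other)
        for other, s in sims_for_row.items()
        if s > 0.0 and (threshold is None or s >= threshold)
    ]
    top = heapq.nlargest(max_similarities_per_row, kept)
    return [(other, s) for s, other in top]

# Hypothetical similarities of one doc to five others; two share no terms.
sims = {"d1": 0.4, "d2": 0.0, "d3": 0.1, "d4": 0.0, "d5": 0.25}

asked_for_ten = most_similar(sims, 10)
with_threshold = most_similar(sims, 10, threshold=0.2)
```

Asking for 10 yields only three results here because only three docs have
a non-zero similarity, and a threshold can shrink the list further.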

Best,
Sebastian
