Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/05/08 19:06:57 UTC

rowsimilarity not creating requested number of similar docs

Using the data set below, I ran rowsimilarity asking for 20 similar docs 
but got anywhere from 1 to 20 per doc. Is this the expected behavior? It 
would be nice to get all 20 so I can see where the similarity starts to 
drop off.

   mahout rowid \
       -i b2/bixo-vectors/tfidf-vectors/part-r-00000 \
       -o b2/bixo-matrix

   mahout rowsimilarity \
       -i b2/bixo-matrix/matrix \
       -o b2/bixo-similarity \
       -r 5250 \
       --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
       -m 20 \
       -ess true

Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 


This uses the same setup as the kmeans example quoted below.

I could file bugs but I'm not sure if this is a bug or not.

On 5/8/12 9:19 AM, Pat Ferrel wrote:
> BTW it seems odd that I get large numbers for distance from centroid 
> using clustering. Shouldn't I expect small numbers for the closest 
> docs? I have assumed the real distance is 1-reported distance but the 
> distances reported by rowsimilarity are very small as I'd expect. I 
> was using tanimoto in both cases as the distance measure but also 
> tried cosine with similar results.
>
> On 5/8/12 9:12 AM, Pat Ferrel wrote:
>> Here is a sample data set. In this case I asked for 30 and got 28 but 
>> in other cases the discrepancy has been greater like ask for 200 and 
>> get 38 but that was for a much larger data set.
>>
>> Running on my mac laptop in a single node pseudo cluster hadoop 
>> 0.20.205, mahout 0.6
>>
>> command line:
>>
>> mahout kmeans \
>>     -i b2/bixo-vectors/tfidf-vectors/ \
>>     -c b2/bixo-kmeans-centroids \
>>     -cl \
>>     -o b2/bixo-kmeans-clusters \
>>     -k 30 \
>>     -ow \
>>     -cd 0.01 \
>>     -x 20 \
>>     -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>
>> Find the data here:
>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 
>>
>>
>> BTW when I run rowsimilarity asking for 20 similar docs I get a max 
>> of 20 but sometimes far fewer. Shouldn't this always return the 
>> requested number? I'll post this question again to get the attention 
>> of the right person.
>>
>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>> I looked at the 0.6 version's code but was not able to find any reason.
>>> If possible, can you share the data you are trying to cluster along 
>>> with the execution parameters?
>>>
>>> You can also open a Jira for this and provide the info there.
>>>
>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>> 0.6
>>>>
>>>> I take it this is not expected behavior? I could be doing something 
>>>> stupid. I only look in the "final" directory. Looking in the others 
>>>> with clusterdump shows the same number of clusters and I assumed 
>>>> they were iterations.
>>>>
>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>> Which version are you using? 0.6 or the current 0.7-snapshot?
>>>>>
>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>> What would cause kmeans to not return k clusters? As I tweak 
>>>>>> parameters I get different numbers of clusters but it's usually 
>>>>>> less than the k I pass in. Since I am not using canopies at 
>>>>>> present I would expect k to always be honored but the quality of 
>>>>>> the clusters would depend on the convergence amount and number of 
>>>>>> iterations allowed. No?

Re: rowsimilarity not creating requested number of similar docs

Posted by Suneel Marthi <su...@yahoo.com>.
This is not a bug; the similarity measure does cut off the results that are returned, so -m acts as an upper limit rather than a guaranteed count.
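
To make the cut-off concrete, here is a minimal Python sketch. It is not Mahout's implementation. It assumes that the -m value is only an upper bound per row, and that a pair of rows sharing no terms scores zero and is dropped; the function names, the threshold parameter, and the toy rows are all hypothetical. It also prints 1 - similarity next to each score, on the assumption that a Tanimoto distance is one minus the Tanimoto coefficient.

```python
from typing import Dict, List, Tuple

# Hypothetical sparse row: column index -> weight (e.g. a TF-IDF value).
SparseRow = Dict[int, float]


def tanimoto(a: SparseRow, b: SparseRow) -> float:
    """Tanimoto coefficient: dot(a, b) / (|a|^2 + |b|^2 - dot(a, b))."""
    dot = sum(w * b[c] for c, w in a.items() if c in b)
    denom = sum(w * w for w in a.values()) + sum(w * w for w in b.values()) - dot
    return dot / denom if denom else 0.0


def top_similar(rows: Dict[int, SparseRow], max_per_row: int,
                threshold: float = 0.0) -> Dict[int, List[Tuple[int, float]]]:
    """Keep at most max_per_row neighbours per row, best score first."""
    result: Dict[int, List[Tuple[int, float]]] = {}
    for i, a in rows.items():
        scored = []
        for j, b in rows.items():
            if i == j:
                continue  # assumption: self-similarity is excluded
            sim = tanimoto(a, b)
            if sim > threshold:  # rows with no shared columns score 0 and drop out
                scored.append((j, sim))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[i] = scored[:max_per_row]  # a cap, not a guaranteed count
    return result


# Toy data: row 3 shares no column with any other row.
rows = {
    0: {0: 1.0, 1: 1.0},
    1: {0: 1.0, 1: 1.0, 2: 1.0},
    2: {2: 1.0, 3: 1.0},
    3: {4: 1.0},
}
similar = top_similar(rows, max_per_row=2)
for row_id, neighbours in similar.items():
    # Each entry: (neighbour id, similarity, 1 - similarity as a distance)
    print(row_id, [(j, round(s, 3), round(1.0 - s, 3)) for j, s in neighbours])
```

With max_per_row=2, only row 1 reaches the cap. Rows 0 and 2 each get a single neighbour, and row 3 gets none, because nothing else overlaps with it. Under these assumptions, asking for 20 and receiving between 1 and 20 is what you would expect on a sparse TF-IDF matrix.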


