Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/05/08 19:06:57 UTC
rowsimilarity not creating requested number of similar docs
Using the data set below, I ran rowsimilarity asking for 20 similar docs
but got anywhere from 1 to 20 per row. Is this the expected behavior? It would
be nice to get all 20 so I can see where the similarity starts to drop off.
mahout rowid \
-i b2/bixo-vectors/tfidf-vectors/part-r-00000 \
-o b2/bixo-matrix
mahout rowsimilarity \
-i b2/bixo-matrix/matrix \
-o b2/bixo-similarity \
-r 5250 \
--similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
-m 20 \
-ess true
Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
Using the same config as the kmeans example below.
I could file bugs but I'm not sure if this is a bug or not.
On 5/8/12 9:19 AM, Pat Ferrel wrote:
> BTW it seems odd that I get large numbers for distance from centroid
> using clustering. Shouldn't I expect small numbers for the closest
> docs? I have assumed the real distance is 1-reported distance but the
> distances reported by rowsimilarity are very small as I'd expect. I
> was using tanimoto in both cases as the distance measure but also
> tried cosine with similar results.
>
> On 5/8/12 9:12 AM, Pat Ferrel wrote:
>> Here is a sample data set. In this case I asked for 30 and got 28 but
>> in other cases the discrepancy has been greater like ask for 200 and
>> get 38 but that was for a much larger data set.
>>
>> Running on my mac laptop in a single node pseudo cluster hadoop
>> 0.20.205, mahout 0.6
>>
>> command line:
>>
>> mahout kmeans \
>> -i b2/bixo-vectors/tfidf-vectors/ \
>> -c b2/bixo-kmeans-centroids \
>> -cl \
>> -o b2/bixo-kmeans-clusters \
>> -k 30 \
>> -ow \
>> -cd 0.01 \
>> -x 20 \
>> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>
>> Find the data here:
>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
>>
>>
>> BTW when I run rowsimilarity asking for 20 similar docs I get a max
>> of 20 but sometimes far fewer. Shouldn't this always return the
>> requested number? I'll post this question again to get the attention
>> of the right person.
>>
>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>> I looked at the 0.6 version's code but was not able to find any reason.
>>> If possible, can you share the data you are trying to cluster along
>>> with the execution parameters?
>>>
>>> You can also open a Jira for this and provide the info there.
>>>
>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>> 0.6
>>>>
>>>> I take it this is not expected behavior? I could be doing something
>>>> stupid. I only look in the "final" directory. Looking in the others
>>>> with clusterdump shows the same number of clusters and I assumed
>>>> they were iterations.
>>>>
>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>> Which version are you using? 0.6 or the current 0.7-snapshot?
>>>>>
>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>> What would cause kmeans to not return k clusters? As I tweak
>>>>>> parameters I get different numbers of clusters but it's usually
>>>>>> less than the k I pass in. Since I am not using canopies at
>>>>>> present I would expect k to always be honored but the quality of
>>>>>> the clusters would depend on the convergence amount and number of
>>>>>> iterations allowed. No?
Re: rowsimilarity not creating requested number of similar docs
Posted by Suneel Marthi <su...@yahoo.com>.
This is not a bug; the similarity measure does cut off the results that are returned.
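
To illustrate how a similarity cut-off can yield fewer rows than requested, here is a minimal sketch in Python. It is not Mahout's code. It assumes (this is not confirmed anywhere in the thread) that a pair of rows is only kept when its Tanimoto coefficient is above a threshold, so rows that share no terms with the query row are never candidates. The function names, the threshold parameter and the toy documents are all made up for the example.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sparse weighted vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = sum(w * w for w in a.values())
    norm_b = sum(w * w for w in b.values())
    denom = norm_a + norm_b - dot
    return dot / denom if denom else 0.0


def top_similar(rows, row_id, max_similar, threshold=0.0):
    """Return up to max_similar (row, score) pairs, best first.

    Assumption for this sketch: pairs scoring at or below `threshold`
    are dropped before the top-N cut, so the list can be shorter
    than max_similar.
    """
    scores = []
    for other_id, vec in rows.items():
        if other_id == row_id:
            continue
        score = tanimoto(rows[row_id], vec)
        if score > threshold:
            scores.append((other_id, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:max_similar]


# Hypothetical toy "documents" with made-up term weights.
rows = {
    "doc0": {"hadoop": 1.0, "mahout": 2.0},
    "doc1": {"mahout": 2.0, "kmeans": 1.0},
    "doc2": {"hadoop": 1.0},
    "doc3": {"cooking": 3.0},
    "doc4": {"travel": 1.0, "cooking": 1.0},
}

# Ask for 20 similar docs; only rows sharing a term with doc0 can come back.
for other_id, score in top_similar(rows, "doc0", 20):
    print(other_id, round(score, 3))
```

Under that assumption, the number requested acts as an upper bound: doc0 gets two neighbours and doc3 gets one, even though 20 were requested in both cases.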