Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/05/06 22:49:52 UTC

kmeans not returning k clusters

What would cause kmeans to not return k clusters? As I tweak parameters 
I get different numbers of clusters but it's usually less than the k I 
pass in. Since I am not using canopies at present I would expect k to 
always be honored but the quality of the clusters would depend on the 
convergence amount and number of iterations allowed. No?

Re: kmeans not returning k clusters

Posted by Ted Dunning <te...@gmail.com>.
On Mon, May 7, 2012 at 12:01 AM, Dawid Weiss
<da...@cs.put.poznan.pl>wrote:

> > - it doesn't have the final pass of in-memory clustering so it really
> just
> > gives you an indifferent quality clustering with a huge number of
> weighted
> > clusters.  With the final pass, it will give you a high quality
> clustering
> > with your specified number of clusters.
>
> I think the "huge number of weighted clusters" can be actually
> beneficial in certain applications. Are you going to leave this in as
> an option when integrating with Mahout, Ted? Still didn't have time to
> look at the code yet ;(
>

Yes.  This is the current primary use case as part of a k-nn modeling
framework.

Re: kmeans not returning k clusters

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> - it doesn't have the final pass of in-memory clustering so it really just
> gives you an indifferent quality clustering with a huge number of weighted
> clusters.  With the final pass, it will give you a high quality clustering
> with your specified number of clusters.

I think the "huge number of weighted clusters" can actually be
beneficial in certain applications. Are you going to leave this in as
an option when integrating with Mahout, Ted? I still haven't had time to
look at the code yet ;(

Dawid

Re: kmeans not returning k clusters

Posted by Ted Dunning <te...@gmail.com>.
Pat,

You may be interested in the code at https://github.com/tdunning/knn

This includes some high speed clustering code that could help you with your
issues.  To wit,

- there aren't as many knobs to tweak on the algorithm (you still have data
scaling tricks to do)

- the speed should be 10-100x current Mahout implementations

- it will go into Mahout before too long

The big downsides right now are

- no history yet

- not compatible with Mahout clustering API's yet

- it doesn't have the final pass of in-memory clustering so it really just
gives you an indifferent quality clustering with a huge number of weighted
clusters.  With the final pass, it will give you a high quality clustering
with your specified number of clusters.
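
The final pass is conceptually simple. Roughly (a toy sketch in plain Java,
not the actual knn code): run an ordinary weighted k-means over the many
sketch centroids, treating each centroid's weight as a point count, to boil
them down to the k you asked for.

import java.util.Random;

// Toy final pass: reduce many weighted sketch centroids to k final centers.
public class FinalPassSketch {
  static double[][] finalPass(double[][] sketch, double[] weight, int k, int iterations) {
    Random rnd = new Random(42);
    int dim = sketch[0].length;
    double[][] centers = new double[k][];
    for (int i = 0; i < k; i++) {
      centers[i] = sketch[rnd.nextInt(sketch.length)].clone();  // seed from the sketch itself
    }
    for (int iter = 0; iter < iterations; iter++) {
      double[][] sum = new double[k][dim];
      double[] total = new double[k];
      for (int p = 0; p < sketch.length; p++) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double d = squaredDistance(sketch[p], centers[c]);
          if (d < bestDist) { bestDist = d; best = c; }
        }
        for (int j = 0; j < dim; j++) {
          sum[best][j] += weight[p] * sketch[p][j];   // the weight acts like a point count
        }
        total[best] += weight[p];
      }
      for (int c = 0; c < k; c++) {
        if (total[c] > 0) {
          for (int j = 0; j < dim; j++) {
            centers[c][j] = sum[c][j] / total[c];
          }
        }
      }
    }
    return centers;
  }

  static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return s;
  }
}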


On Sun, May 6, 2012 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> What would cause kmeans to not return k clusters? As I tweak parameters I
> get different numbers of clusters but it's usually less than the k I pass
> in. Since I am not using canopies at present I would expect k to always be
> honored but the quality of the clusters would depend on the convergence
> amount and number of iterations allowed. No?
>

Re: kmeans not returning k clusters

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I have not checked with canopy, since with canopy you don't really tell
kmeans how many clusters to create; it's a little hidden. That's why I
said I don't care about the number, just that I'm not losing
real/important clusters.

The size of the vectors in the data set is something like 3000, I
think. Very little pruning, just a bixo + boilerpipe crawl of a few
sites at a minimum depth. Here is the seq2sparse command I ran:

mahout seq2sparse \
     -i b2/bixo-seqfiles/ \
     -o b2/bixo-vectors/ \
     -ow -chunk 2000 \
     -x 90 \
     -seq \
     -n 2 \
     -nv
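
For reference, my understanding of the -n 2 flag is that it applies an L2
norm to each TF-IDF vector, roughly like this (a plain Java sketch, not the
seq2sparse code):

class L2NormSketch {
  // Toy sketch of the L2 (p=2) normalization requested by "-n 2".
  static double[] l2Normalize(double[] tfidf) {
    double sumSq = 0.0;
    for (double w : tfidf) {
      sumSq += w * w;
    }
    double norm = Math.sqrt(sumSq);
    if (norm == 0.0) {
      return tfidf.clone();           // an all-zero vector stays all-zero
    }
    double[] out = new double[tfidf.length];
    for (int i = 0; i < tfidf.length; i++) {
      out[i] = tfidf[i] / norm;       // each weight divided by the vector's Euclidean length
    }
    return out;
  }
}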

I agree if no one else is seeing this, it may be weirdness of my own 
creation.

On 5/9/12 12:24 PM, Jeff Eastman wrote:
> Does this cluster reduction happen when you prime k-means with canopy? 
> Can you first adjust T1==T2 to get about 200 canopies and feed that to 
> k-means? How wide are your term vectors? Have you tried other distance 
> measures?
>
> If anybody else out there is experiencing similar problems, please 
> chime in.
>
> Jeff
>
> On 5/9/12 1:07 PM, Pat Ferrel wrote:
>> That's what I'm doing now. Random seeds is not really the best way to 
>> do kmeans. However my results are repeatable as far as I've gone. And 
>> canopy wants to generate a much larger set of clusters, with a wide 
>> range of T1 and T2 for this data set so the theory that it does not 
>> support 30 clusters seems unlikely although the may be a fair 
>> distance apart.
>>
>> Since I've tried several times with several random seed so the "seeds 
>> are too close" theory doesn't seem likely.
>> Given canopy wants to generate more clusters, the "doesn't support k 
>> = 30" theory doesn't seem likely.
>>
>> I'm not saying that there is a real problem here but when I noticed 
>> it I had 16,000 documents and was asking for 200 clusters and got 38. 
>> If there is some good reason for this it would be nice to find it and 
>> report it to the user. The "good reason" might be very helpful in the 
>> analysis. Or it could be a bug.
>>
>> At least it's out there in case others are seeing lost clusters.
>>
>> On 5/9/12 7:49 AM, Jeff Eastman wrote:
>>> Paritosh is correct in his analysis. K-means can work itself into a 
>>> situation where there are some empty clusters if the initial cluster 
>>> centers are too closely spaced or if the data really doesn't support 
>>> k clusters. This is because it assigns each vector to the most 
>>> likely (closest) cluster. If two prior clusters are very close 
>>> together this can cause one of them to become empty.
>>>
>>> Have you tried priming k-means with canopy instead of the random 
>>> sampler?
>>>
>>> On 5/9/12 10:35 AM, Pat Ferrel wrote:
>>>> I suspect you are right Paritosh. I ran the random seed with kmean 
>>>> several times on the supplied data set and always got 28 rather 
>>>> than 30 clusters. I don't care so much about the number but it 
>>>> might mean that some clusters are thrown out and without looking 
>>>> you couldn't tell if they were important ones or not. Just upping k 
>>>> to 32 doesn't really work if you still get some thrown out.
>>>>
>>>> At least i think the issue is repeatable with this data.
>>>>
>>>> On 5/9/12 1:14 AM, Paritosh Ranjan wrote:
>>>>> Printouts of Mahout vectors prints only the non-zero elements.
>>>>> So, the centers are not empty, rather they are zero.
>>>>>
>>>>> Prima facie, I suspect that you are getting lot of empty clusters. 
>>>>> This might be occurring due to the combination of distance 
>>>>> measure, convergence threshold and distances between vectors.
>>>>> Can you try to analyze and change/play around with these parameters?
>>>>>
>>>>> I will try to look into how the Random Cluster Initialization is 
>>>>> working. I will log a jira if I find some issue. However, I think 
>>>>> that there will be no problem in cluster initialization part.
>>>>>
>>>>> On 09-05-2012 03:21, Danfeng Li wrote:
>>>>>> I got the same issue. What I found is that the initial centers 
>>>>>> have many empty ones, the final number of clusters are decided by 
>>>>>> the number of nonempty centers.
>>>>>>
>>>>>> Here are some example of my cases:
>>>>>>
>>>>>> ...
>>>>>> CL-34358205{n=0 c=[] r=[]}
>>>>>> CL-34358207{n=0 c=[] r=[]}
>>>>>> CL-34358209{n=0 c=[] r=[]}
>>>>>> CL-34358213{n=0 c=[0:1.000] r=[]}
>>>>>> CL-34358215{n=0 c=[] r=[]}
>>>>>> CL-34358216{n=0 c=[] r=[]}
>>>>>> CL-34358217{n=0 c=[] r=[]}
>>>>>> CL-34358220{n=0 c=[] r=[]}
>>>>>> CL-34358221{n=0 c=[] r=[]}
>>>>>> CL-34358222{n=0 c=[] r=[]}
>>>>>> CL-34358223{n=0 c=[] r=[]}
>>>>>> CL-34358224{n=0 c=[] r=[]}
>>>>>> CL-34358227{n=0 c=[0:1.000] r=[]}
>>>>>> CL-34358228{n=0 c=[] r=[]}
>>>>>> CL-34358229{n=0 c=[] r=[]}
>>>>>> ...
>>>>>>
>>>>>> Is it the case there is a bug in initialization?
>>>>>>
>>>>>> Thanks.
>>>>>> Dan
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Pat Ferrel [mailto:pat@occamsmachete.com]
>>>>>> Sent: Tuesday, May 08, 2012 9:13 AM
>>>>>> To: user@mahout.apache.org
>>>>>> Subject: Re: kmeans not returning k clusters
>>>>>>
>>>>>> Here is a sample data set. In this case I asked for 30 and got 28 
>>>>>> but in other cases the discrepancy has been greater like ask for 
>>>>>> 200 and get 38 but that was for a much larger data set.
>>>>>>
>>>>>> Running on my mac laptop in a single node pseudo cluster hadoop 
>>>>>> 0.20.205, mahout 0.6
>>>>>>
>>>>>> command line:
>>>>>>
>>>>>> mahout kmeans \
>>>>>>       -i b2/bixo-vectors/tfidf-vectors/ \
>>>>>>       -c b2/bixo-kmeans-centroids \
>>>>>>       -cl \
>>>>>>       -o b2/bixo-kmeans-clusters \
>>>>>>       -k 30 \
>>>>>>       -ow \
>>>>>>       -cd 0.01 \
>>>>>>       -x 20 \
>>>>>>       -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>>>>>
>>>>>> Find the data here:
>>>>>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 
>>>>>>
>>>>>>
>>>>>> BTW when I run rowsimilarity asking for 20 similar docs I get a 
>>>>>> max of
>>>>>> 20 but sometimes many less. Shouldn't this always return the 
>>>>>> requested number? I'll post this question again to the the 
>>>>>> attention of the right person.
>>>>>>
>>>>>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>>>>>> I looked at the 0.6 version's code but was not able to find any 
>>>>>>> reason.
>>>>>>> If possible, can you share the data you are trying to cluster along
>>>>>>> with the execution parameters?
>>>>>>>
>>>>>>> You can also open a Jira for this and provide the info there.
>>>>>>>
>>>>>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>>>>>> 0.6
>>>>>>>>
>>>>>>>> I take it this is not expected behavior? I could be doing 
>>>>>>>> something
>>>>>>>> stupid. I only look in the "final" directory. Looking in the 
>>>>>>>> others
>>>>>>>> with clusterdump shows the same number of clusters and I 
>>>>>>>> assumed they
>>>>>>>> were iterations.
>>>>>>>>
>>>>>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>>>>>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>>>>>>>>
>>>>>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>>>>>> What would cause kmeans to not return k clusters? As I tweak
>>>>>>>>>> parameters I get different numbers of clusters but it's usually
>>>>>>>>>> less than the k I pass in. Since I am not using canopies at 
>>>>>>>>>> present
>>>>>>>>>> I would expect k to always be honored but the quality of the
>>>>>>>>>> clusters would depend on the convergence amount and number of
>>>>>>>>>> iterations allowed. No?
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>

Re: kmeans not returning k clusters

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Does this cluster reduction happen when you prime k-means with canopy? 
Can you first adjust T1==T2 to get about 200 canopies and feed that to 
k-means? How wide are your term vectors? Have you tried other distance 
measures?
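
The effect of T1==T2 is easy to see in a toy sketch (plain Java, not the
Mahout implementation): any point farther than T2 from every existing canopy
center starts a new canopy, so shrinking T2 raises the canopy count until you
land near 200.

import java.util.ArrayList;
import java.util.List;

// Toy canopy generation with T1 == T2: each point either falls inside an
// existing canopy (within T2 of its center) or becomes a new canopy center.
class CanopySketch {
  static List<double[]> canopyCenters(List<double[]> points, double t2) {
    List<double[]> centers = new ArrayList<>();
    for (double[] p : points) {
      boolean covered = false;
      for (double[] c : centers) {
        if (euclidean(p, c) < t2) {   // close enough: absorbed by an existing canopy
          covered = true;
          break;
        }
      }
      if (!covered) {
        centers.add(p);               // too far from everything: start a new canopy
      }
    }
    return centers;                   // these become the initial centers fed to k-means
  }

  static double euclidean(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return Math.sqrt(s);
  }
}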

If anybody else out there is experiencing similar problems, please chime in.

Jeff

On 5/9/12 1:07 PM, Pat Ferrel wrote:
> That's what I'm doing now. Random seeds is not really the best way to 
> do kmeans. However my results are repeatable as far as I've gone. And 
> canopy wants to generate a much larger set of clusters, with a wide 
> range of T1 and T2 for this data set so the theory that it does not 
> support 30 clusters seems unlikely although the may be a fair distance 
> apart.
>
> Since I've tried several times with several random seed so the "seeds 
> are too close" theory doesn't seem likely.
> Given canopy wants to generate more clusters, the "doesn't support k = 
> 30" theory doesn't seem likely.
>
> I'm not saying that there is a real problem here but when I noticed it 
> I had 16,000 documents and was asking for 200 clusters and got 38. If 
> there is some good reason for this it would be nice to find it and 
> report it to the user. The "good reason" might be very helpful in the 
> analysis. Or it could be a bug.
>
> At least it's out there in case others are seeing lost clusters.
>
> On 5/9/12 7:49 AM, Jeff Eastman wrote:
>> Paritosh is correct in his analysis. K-means can work itself into a 
>> situation where there are some empty clusters if the initial cluster 
>> centers are too closely spaced or if the data really doesn't support 
>> k clusters. This is because it assigns each vector to the most likely 
>> (closest) cluster. If two prior clusters are very close together this 
>> can cause one of them to become empty.
>>
>> Have you tried priming k-means with canopy instead of the random 
>> sampler?
>>
>> On 5/9/12 10:35 AM, Pat Ferrel wrote:
>>> I suspect you are right Paritosh. I ran the random seed with kmean 
>>> several times on the supplied data set and always got 28 rather than 
>>> 30 clusters. I don't care so much about the number but it might mean 
>>> that some clusters are thrown out and without looking you couldn't 
>>> tell if they were important ones or not. Just upping k to 32 doesn't 
>>> really work if you still get some thrown out.
>>>
>>> At least i think the issue is repeatable with this data.
>>>
>>> On 5/9/12 1:14 AM, Paritosh Ranjan wrote:
>>>> Printouts of Mahout vectors prints only the non-zero elements.
>>>> So, the centers are not empty, rather they are zero.
>>>>
>>>> Prima facie, I suspect that you are getting lot of empty clusters. 
>>>> This might be occurring due to the combination of distance measure, 
>>>> convergence threshold and distances between vectors.
>>>> Can you try to analyze and change/play around with these parameters?
>>>>
>>>> I will try to look into how the Random Cluster Initialization is 
>>>> working. I will log a jira if I find some issue. However, I think 
>>>> that there will be no problem in cluster initialization part.
>>>>
>>>> On 09-05-2012 03:21, Danfeng Li wrote:
>>>>> I got the same issue. What I found is that the initial centers 
>>>>> have many empty ones, the final number of clusters are decided by 
>>>>> the number of nonempty centers.
>>>>>
>>>>> Here are some example of my cases:
>>>>>
>>>>> ...
>>>>> CL-34358205{n=0 c=[] r=[]}
>>>>> CL-34358207{n=0 c=[] r=[]}
>>>>> CL-34358209{n=0 c=[] r=[]}
>>>>> CL-34358213{n=0 c=[0:1.000] r=[]}
>>>>> CL-34358215{n=0 c=[] r=[]}
>>>>> CL-34358216{n=0 c=[] r=[]}
>>>>> CL-34358217{n=0 c=[] r=[]}
>>>>> CL-34358220{n=0 c=[] r=[]}
>>>>> CL-34358221{n=0 c=[] r=[]}
>>>>> CL-34358222{n=0 c=[] r=[]}
>>>>> CL-34358223{n=0 c=[] r=[]}
>>>>> CL-34358224{n=0 c=[] r=[]}
>>>>> CL-34358227{n=0 c=[0:1.000] r=[]}
>>>>> CL-34358228{n=0 c=[] r=[]}
>>>>> CL-34358229{n=0 c=[] r=[]}
>>>>> ...
>>>>>
>>>>> Is it the case there is a bug in initialization?
>>>>>
>>>>> Thanks.
>>>>> Dan
>>>>>
>>>>> -----Original Message-----
>>>>> From: Pat Ferrel [mailto:pat@occamsmachete.com]
>>>>> Sent: Tuesday, May 08, 2012 9:13 AM
>>>>> To: user@mahout.apache.org
>>>>> Subject: Re: kmeans not returning k clusters
>>>>>
>>>>> Here is a sample data set. In this case I asked for 30 and got 28 
>>>>> but in other cases the discrepancy has been greater like ask for 
>>>>> 200 and get 38 but that was for a much larger data set.
>>>>>
>>>>> Running on my mac laptop in a single node pseudo cluster hadoop 
>>>>> 0.20.205, mahout 0.6
>>>>>
>>>>> command line:
>>>>>
>>>>> mahout kmeans \
>>>>>       -i b2/bixo-vectors/tfidf-vectors/ \
>>>>>       -c b2/bixo-kmeans-centroids \
>>>>>       -cl \
>>>>>       -o b2/bixo-kmeans-clusters \
>>>>>       -k 30 \
>>>>>       -ow \
>>>>>       -cd 0.01 \
>>>>>       -x 20 \
>>>>>       -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>>>>
>>>>> Find the data here:
>>>>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 
>>>>>
>>>>>
>>>>> BTW when I run rowsimilarity asking for 20 similar docs I get a 
>>>>> max of
>>>>> 20 but sometimes many less. Shouldn't this always return the 
>>>>> requested number? I'll post this question again to the the 
>>>>> attention of the right person.
>>>>>
>>>>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>>>>> I looked at the 0.6 version's code but was not able to find any 
>>>>>> reason.
>>>>>> If possible, can you share the data you are trying to cluster along
>>>>>> with the execution parameters?
>>>>>>
>>>>>> You can also open a Jira for this and provide the info there.
>>>>>>
>>>>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>>>>> 0.6
>>>>>>>
>>>>>>> I take it this is not expected behavior? I could be doing something
>>>>>>> stupid. I only look in the "final" directory. Looking in the others
>>>>>>> with clusterdump shows the same number of clusters and I assumed 
>>>>>>> they
>>>>>>> were iterations.
>>>>>>>
>>>>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>>>>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>>>>>>>
>>>>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>>>>> What would cause kmeans to not return k clusters? As I tweak
>>>>>>>>> parameters I get different numbers of clusters but it's usually
>>>>>>>>> less than the k I pass in. Since I am not using canopies at 
>>>>>>>>> present
>>>>>>>>> I would expect k to always be honored but the quality of the
>>>>>>>>> clusters would depend on the convergence amount and number of
>>>>>>>>> iterations allowed. No?
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>


Re: kmeans not returning k clusters

Posted by Pat Ferrel <pa...@occamsmachete.com>.
That's what I'm doing now. Random seeds are not really the best way to do 
kmeans. However, my results are repeatable as far as I've gone. And 
canopy wants to generate a much larger set of clusters, over a wide 
range of T1 and T2, for this data set, so the theory that it does not 
support 30 clusters seems unlikely, although they may be a fair distance 
apart.

Since I've tried several times with several random seeds, the "seeds 
are too close" theory doesn't seem likely.
Given that canopy wants to generate more clusters, the "doesn't support 
k = 30" theory doesn't seem likely either.

I'm not saying that there is a real problem here but when I noticed it I 
had 16,000 documents and was asking for 200 clusters and got 38. If 
there is some good reason for this it would be nice to find it and 
report it to the user. The "good reason" might be very helpful in the 
analysis. Or it could be a bug.

At least it's out there in case others are seeing lost clusters.

On 5/9/12 7:49 AM, Jeff Eastman wrote:
> Paritosh is correct in his analysis. K-means can work itself into a 
> situation where there are some empty clusters if the initial cluster 
> centers are too closely spaced or if the data really doesn't support k 
> clusters. This is because it assigns each vector to the most likely 
> (closest) cluster. If two prior clusters are very close together this 
> can cause one of them to become empty.
>
> Have you tried priming k-means with canopy instead of the random sampler?
>
> On 5/9/12 10:35 AM, Pat Ferrel wrote:
>> I suspect you are right Paritosh. I ran the random seed with kmean 
>> several times on the supplied data set and always got 28 rather than 
>> 30 clusters. I don't care so much about the number but it might mean 
>> that some clusters are thrown out and without looking you couldn't 
>> tell if they were important ones or not. Just upping k to 32 doesn't 
>> really work if you still get some thrown out.
>>
>> At least i think the issue is repeatable with this data.
>>
>> On 5/9/12 1:14 AM, Paritosh Ranjan wrote:
>>> Printouts of Mahout vectors prints only the non-zero elements.
>>> So, the centers are not empty, rather they are zero.
>>>
>>> Prima facie, I suspect that you are getting lot of empty clusters. 
>>> This might be occurring due to the combination of distance measure, 
>>> convergence threshold and distances between vectors.
>>> Can you try to analyze and change/play around with these parameters?
>>>
>>> I will try to look into how the Random Cluster Initialization is 
>>> working. I will log a jira if I find some issue. However, I think 
>>> that there will be no problem in cluster initialization part.
>>>
>>> On 09-05-2012 03:21, Danfeng Li wrote:
>>>> I got the same issue. What I found is that the initial centers have 
>>>> many empty ones, the final number of clusters are decided by the 
>>>> number of nonempty centers.
>>>>
>>>> Here are some example of my cases:
>>>>
>>>> ...
>>>> CL-34358205{n=0 c=[] r=[]}
>>>> CL-34358207{n=0 c=[] r=[]}
>>>> CL-34358209{n=0 c=[] r=[]}
>>>> CL-34358213{n=0 c=[0:1.000] r=[]}
>>>> CL-34358215{n=0 c=[] r=[]}
>>>> CL-34358216{n=0 c=[] r=[]}
>>>> CL-34358217{n=0 c=[] r=[]}
>>>> CL-34358220{n=0 c=[] r=[]}
>>>> CL-34358221{n=0 c=[] r=[]}
>>>> CL-34358222{n=0 c=[] r=[]}
>>>> CL-34358223{n=0 c=[] r=[]}
>>>> CL-34358224{n=0 c=[] r=[]}
>>>> CL-34358227{n=0 c=[0:1.000] r=[]}
>>>> CL-34358228{n=0 c=[] r=[]}
>>>> CL-34358229{n=0 c=[] r=[]}
>>>> ...
>>>>
>>>> Is it the case there is a bug in initialization?
>>>>
>>>> Thanks.
>>>> Dan
>>>>
>>>> -----Original Message-----
>>>> From: Pat Ferrel [mailto:pat@occamsmachete.com]
>>>> Sent: Tuesday, May 08, 2012 9:13 AM
>>>> To: user@mahout.apache.org
>>>> Subject: Re: kmeans not returning k clusters
>>>>
>>>> Here is a sample data set. In this case I asked for 30 and got 28 
>>>> but in other cases the discrepancy has been greater like ask for 
>>>> 200 and get 38 but that was for a much larger data set.
>>>>
>>>> Running on my mac laptop in a single node pseudo cluster hadoop 
>>>> 0.20.205, mahout 0.6
>>>>
>>>> command line:
>>>>
>>>> mahout kmeans \
>>>>       -i b2/bixo-vectors/tfidf-vectors/ \
>>>>       -c b2/bixo-kmeans-centroids \
>>>>       -cl \
>>>>       -o b2/bixo-kmeans-clusters \
>>>>       -k 30 \
>>>>       -ow \
>>>>       -cd 0.01 \
>>>>       -x 20 \
>>>>       -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>>>
>>>> Find the data here:
>>>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 
>>>>
>>>>
>>>> BTW when I run rowsimilarity asking for 20 similar docs I get a max of
>>>> 20 but sometimes many less. Shouldn't this always return the 
>>>> requested number? I'll post this question again to the the 
>>>> attention of the right person.
>>>>
>>>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>>>> I looked at the 0.6 version's code but was not able to find any 
>>>>> reason.
>>>>> If possible, can you share the data you are trying to cluster along
>>>>> with the execution parameters?
>>>>>
>>>>> You can also open a Jira for this and provide the info there.
>>>>>
>>>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>>>> 0.6
>>>>>>
>>>>>> I take it this is not expected behavior? I could be doing something
>>>>>> stupid. I only look in the "final" directory. Looking in the others
>>>>>> with clusterdump shows the same number of clusters and I assumed 
>>>>>> they
>>>>>> were iterations.
>>>>>>
>>>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>>>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>>>>>>
>>>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>>>> What would cause kmeans to not return k clusters? As I tweak
>>>>>>>> parameters I get different numbers of clusters but it's usually
>>>>>>>> less than the k I pass in. Since I am not using canopies at 
>>>>>>>> present
>>>>>>>> I would expect k to always be honored but the quality of the
>>>>>>>> clusters would depend on the convergence amount and number of
>>>>>>>> iterations allowed. No?
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>>
>>
>

Re: kmeans not returning k clusters

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Paritosh is correct in his analysis. K-means can work itself into a 
situation where there are some empty clusters if the initial cluster 
centers are too closely spaced or if the data really doesn't support k 
clusters. This is because it assigns each vector to the most likely 
(closest) cluster. If two prior clusters are very close together this 
can cause one of them to become empty.
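
Here is a tiny demonstration (plain Java, not the Mahout code) of how that
happens: with two seeds that are nearly identical, the slightly worse one can
win no points at all in the first assignment pass, and its cluster comes out
empty.

// Toy assignment pass: seeds 0 and 1 almost coincide, so seed 1 wins nothing.
public class EmptyClusterDemo {
  public static void main(String[] args) {
    double[] points = {0.0, 0.05, 0.10, 5.0, 5.1, 5.2};
    double[] seeds = {0.10, 0.12, 5.10};           // two seeds only 0.02 apart
    int[] counts = new int[seeds.length];
    for (double p : points) {
      int best = 0;
      for (int c = 1; c < seeds.length; c++) {
        if (Math.abs(p - seeds[c]) < Math.abs(p - seeds[best])) {
          best = c;                                // the strictly closer seed takes the point
        }
      }
      counts[best]++;
    }
    // Prints [3, 0, 3]: seed 1 never wins a point, so its cluster is empty and gets dropped.
    System.out.println(java.util.Arrays.toString(counts));
  }
}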

Have you tried priming k-means with canopy instead of the random sampler?

On 5/9/12 10:35 AM, Pat Ferrel wrote:
> I suspect you are right Paritosh. I ran the random seed with kmean 
> several times on the supplied data set and always got 28 rather than 
> 30 clusters. I don't care so much about the number but it might mean 
> that some clusters are thrown out and without looking you couldn't 
> tell if they were important ones or not. Just upping k to 32 doesn't 
> really work if you still get some thrown out.
>
> At least i think the issue is repeatable with this data.
>
> On 5/9/12 1:14 AM, Paritosh Ranjan wrote:
>> Printouts of Mahout vectors prints only the non-zero elements.
>> So, the centers are not empty, rather they are zero.
>>
>> Prima facie, I suspect that you are getting lot of empty clusters. 
>> This might be occurring due to the combination of distance measure, 
>> convergence threshold and distances between vectors.
>> Can you try to analyze and change/play around with these parameters?
>>
>> I will try to look into how the Random Cluster Initialization is 
>> working. I will log a jira if I find some issue. However, I think 
>> that there will be no problem in cluster initialization part.
>>
>> On 09-05-2012 03:21, Danfeng Li wrote:
>>> I got the same issue. What I found is that the initial centers have 
>>> many empty ones, the final number of clusters are decided by the 
>>> number of nonempty centers.
>>>
>>> Here are some example of my cases:
>>>
>>> ...
>>> CL-34358205{n=0 c=[] r=[]}
>>> CL-34358207{n=0 c=[] r=[]}
>>> CL-34358209{n=0 c=[] r=[]}
>>> CL-34358213{n=0 c=[0:1.000] r=[]}
>>> CL-34358215{n=0 c=[] r=[]}
>>> CL-34358216{n=0 c=[] r=[]}
>>> CL-34358217{n=0 c=[] r=[]}
>>> CL-34358220{n=0 c=[] r=[]}
>>> CL-34358221{n=0 c=[] r=[]}
>>> CL-34358222{n=0 c=[] r=[]}
>>> CL-34358223{n=0 c=[] r=[]}
>>> CL-34358224{n=0 c=[] r=[]}
>>> CL-34358227{n=0 c=[0:1.000] r=[]}
>>> CL-34358228{n=0 c=[] r=[]}
>>> CL-34358229{n=0 c=[] r=[]}
>>> ...
>>>
>>> Is it the case there is a bug in initialization?
>>>
>>> Thanks.
>>> Dan
>>>
>>> -----Original Message-----
>>> From: Pat Ferrel [mailto:pat@occamsmachete.com]
>>> Sent: Tuesday, May 08, 2012 9:13 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: kmeans not returning k clusters
>>>
>>> Here is a sample data set. In this case I asked for 30 and got 28 
>>> but in other cases the discrepancy has been greater like ask for 200 
>>> and get 38 but that was for a much larger data set.
>>>
>>> Running on my mac laptop in a single node pseudo cluster hadoop 
>>> 0.20.205, mahout 0.6
>>>
>>> command line:
>>>
>>> mahout kmeans \
>>>       -i b2/bixo-vectors/tfidf-vectors/ \
>>>       -c b2/bixo-kmeans-centroids \
>>>       -cl \
>>>       -o b2/bixo-kmeans-clusters \
>>>       -k 30 \
>>>       -ow \
>>>       -cd 0.01 \
>>>       -x 20 \
>>>       -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>>
>>> Find the data here:
>>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 
>>>
>>>
>>> BTW when I run rowsimilarity asking for 20 similar docs I get a max of
>>> 20 but sometimes many less. Shouldn't this always return the 
>>> requested number? I'll post this question again to the the attention 
>>> of the right person.
>>>
>>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>>> I looked at the 0.6 version's code but was not able to find any 
>>>> reason.
>>>> If possible, can you share the data you are trying to cluster along
>>>> with the execution parameters?
>>>>
>>>> You can also open a Jira for this and provide the info there.
>>>>
>>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>>> 0.6
>>>>>
>>>>> I take it this is not expected behavior? I could be doing something
>>>>> stupid. I only look in the "final" directory. Looking in the others
>>>>> with clusterdump shows the same number of clusters and I assumed they
>>>>> were iterations.
>>>>>
>>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>>>>>
>>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>>> What would cause kmeans to not return k clusters? As I tweak
>>>>>>> parameters I get different numbers of clusters but it's usually
>>>>>>> less than the k I pass in. Since I am not using canopies at present
>>>>>>> I would expect k to always be honored but the quality of the
>>>>>>> clusters would depend on the convergence amount and number of
>>>>>>> iterations allowed. No?
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
>>
>
>


Re: kmeans not returning k clusters

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I suspect you are right, Paritosh. I ran the random seed with kmeans 
several times on the supplied data set and always got 28 rather than 30 
clusters. I don't care so much about the number but it might mean that 
some clusters are thrown out and without looking you couldn't tell if 
they were important ones or not. Just upping k to 32 doesn't really work 
if you still get some thrown out.

At least I think the issue is repeatable with this data.

On 5/9/12 1:14 AM, Paritosh Ranjan wrote:
> Printouts of Mahout vectors prints only the non-zero elements.
> So, the centers are not empty, rather they are zero.
>
> Prima facie, I suspect that you are getting lot of empty clusters. 
> This might be occurring due to the combination of distance measure, 
> convergence threshold and distances between vectors.
> Can you try to analyze and change/play around with these parameters?
>
> I will try to look into how the Random Cluster Initialization is 
> working. I will log a jira if I find some issue. However, I think that 
> there will be no problem in cluster initialization part.
>
> On 09-05-2012 03:21, Danfeng Li wrote:
>> I got the same issue. What I found is that the initial centers have 
>> many empty ones, the final number of clusters are decided by the 
>> number of nonempty centers.
>>
>> Here are some example of my cases:
>>
>> ...
>> CL-34358205{n=0 c=[] r=[]}
>> CL-34358207{n=0 c=[] r=[]}
>> CL-34358209{n=0 c=[] r=[]}
>> CL-34358213{n=0 c=[0:1.000] r=[]}
>> CL-34358215{n=0 c=[] r=[]}
>> CL-34358216{n=0 c=[] r=[]}
>> CL-34358217{n=0 c=[] r=[]}
>> CL-34358220{n=0 c=[] r=[]}
>> CL-34358221{n=0 c=[] r=[]}
>> CL-34358222{n=0 c=[] r=[]}
>> CL-34358223{n=0 c=[] r=[]}
>> CL-34358224{n=0 c=[] r=[]}
>> CL-34358227{n=0 c=[0:1.000] r=[]}
>> CL-34358228{n=0 c=[] r=[]}
>> CL-34358229{n=0 c=[] r=[]}
>> ...
>>
>> Is it the case there is a bug in initialization?
>>
>> Thanks.
>> Dan
>>
>> -----Original Message-----
>> From: Pat Ferrel [mailto:pat@occamsmachete.com]
>> Sent: Tuesday, May 08, 2012 9:13 AM
>> To: user@mahout.apache.org
>> Subject: Re: kmeans not returning k clusters
>>
>> Here is a sample data set. In this case I asked for 30 and got 28 but 
>> in other cases the discrepancy has been greater like ask for 200 and 
>> get 38 but that was for a much larger data set.
>>
>> Running on my mac laptop in a single node pseudo cluster hadoop 
>> 0.20.205, mahout 0.6
>>
>> command line:
>>
>> mahout kmeans \
>>       -i b2/bixo-vectors/tfidf-vectors/ \
>>       -c b2/bixo-kmeans-centroids \
>>       -cl \
>>       -o b2/bixo-kmeans-clusters \
>>       -k 30 \
>>       -ow \
>>       -cd 0.01 \
>>       -x 20 \
>>       -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>
>> Find the data here:
>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 
>>
>>
>> BTW when I run rowsimilarity asking for 20 similar docs I get a max of
>> 20 but sometimes many less. Shouldn't this always return the 
>> requested number? I'll post this question again to the the attention 
>> of the right person.
>>
>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>> I looked at the 0.6 version's code but was not able to find any reason.
>>> If possible, can you share the data you are trying to cluster along
>>> with the execution parameters?
>>>
>>> You can also open a Jira for this and provide the info there.
>>>
>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>> 0.6
>>>>
>>>> I take it this is not expected behavior? I could be doing something
>>>> stupid. I only look in the "final" directory. Looking in the others
>>>> with clusterdump shows the same number of clusters and I assumed they
>>>> were iterations.
>>>>
>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>>>>
>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>> What would cause kmeans to not return k clusters? As I tweak
>>>>>> parameters I get different numbers of clusters but it's usually
>>>>>> less than the k I pass in. Since I am not using canopies at present
>>>>>> I would expect k to always be honored but the quality of the
>>>>>> clusters would depend on the convergence amount and number of
>>>>>> iterations allowed. No?
>>>>>
>>>>>
>>>
>>>
>
>
>

Re: kmeans not returning k clusters

Posted by Paritosh Ranjan <pr...@xebia.com>.
Printouts of Mahout vectors show only the non-zero elements.
So the centers are not empty; rather, they are all zeros.

Prima facie, I suspect that you are getting a lot of empty clusters. This 
might be occurring due to the combination of distance measure, 
convergence threshold, and distances between vectors.
Can you try to analyze and play around with these parameters?

I will try to look into how the Random Cluster Initialization is 
working. I will log a JIRA if I find an issue. However, I think that 
there will be no problem in the cluster initialization part.
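
As a toy illustration of the printing behavior (plain Java, not Mahout's
Vector class): a sparse vector stores only non-zero entries, so a center
whose elements are all zero prints as if it were empty.

import java.util.Map;
import java.util.TreeMap;

// Toy sparse vector: only non-zero entries are stored, so an all-zero vector prints as "[]".
class SparseVectorDemo {
  private final Map<Integer, Double> nonZero = new TreeMap<>();

  void set(int index, double value) {
    if (value != 0.0) {
      nonZero.put(index, value);
    } else {
      nonZero.remove(index);          // storing a zero is the same as storing nothing
    }
  }

  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder("[");
    for (Map.Entry<Integer, Double> e : nonZero.entrySet()) {
      if (sb.length() > 1) sb.append(", ");
      sb.append(e.getKey()).append(':').append(e.getValue());
    }
    return sb.append(']').toString(); // an all-zero center comes out as "[]", not missing
  }
}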

On 09-05-2012 03:21, Danfeng Li wrote:
> I got the same issue. What I found is that the initial centers have many empty ones, the final number of clusters are decided by the number of nonempty centers.
>
> Here are some example of my cases:
>
> ...
> CL-34358205{n=0 c=[] r=[]}
> CL-34358207{n=0 c=[] r=[]}
> CL-34358209{n=0 c=[] r=[]}
> CL-34358213{n=0 c=[0:1.000] r=[]}
> CL-34358215{n=0 c=[] r=[]}
> CL-34358216{n=0 c=[] r=[]}
> CL-34358217{n=0 c=[] r=[]}
> CL-34358220{n=0 c=[] r=[]}
> CL-34358221{n=0 c=[] r=[]}
> CL-34358222{n=0 c=[] r=[]}
> CL-34358223{n=0 c=[] r=[]}
> CL-34358224{n=0 c=[] r=[]}
> CL-34358227{n=0 c=[0:1.000] r=[]}
> CL-34358228{n=0 c=[] r=[]}
> CL-34358229{n=0 c=[] r=[]}
> ...
>
> Is it the case there is a bug in initialization?
>
> Thanks.
> Dan
>
> -----Original Message-----
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Tuesday, May 08, 2012 9:13 AM
> To: user@mahout.apache.org
> Subject: Re: kmeans not returning k clusters
>
> Here is a sample data set. In this case I asked for 30 and got 28 but in other cases the discrepancy has been greater like ask for 200 and get 38 but that was for a much larger data set.
>
> Running on my mac laptop in a single node pseudo cluster hadoop 0.20.205, mahout 0.6
>
> command line:
>
> mahout kmeans \
>       -i b2/bixo-vectors/tfidf-vectors/ \
>       -c b2/bixo-kmeans-centroids \
>       -cl \
>       -o b2/bixo-kmeans-clusters \
>       -k 30 \
>       -ow \
>       -cd 0.01 \
>       -x 20 \
>       -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>
> Find the data here:
> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
>
> BTW when I run rowsimilarity asking for 20 similar docs I get a max of
> 20 but sometimes many less. Shouldn't this always return the requested number? I'll post this question again to the the attention of the right person.
>
> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>> I looked at the 0.6 version's code but was not able to find any reason.
>> If possible, can you share the data you are trying to cluster along
>> with the execution parameters?
>>
>> You can also open a Jira for this and provide the info there.
>>
>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>> 0.6
>>>
>>> I take it this is not expected behavior? I could be doing something
>>> stupid. I only look in the "final" directory. Looking in the others
>>> with clusterdump shows the same number of clusters and I assumed they
>>> were iterations.
>>>
>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>>>
>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>> What would cause kmeans to not return k clusters? As I tweak
>>>>> parameters I get different numbers of clusters but it's usually
>>>>> less than the k I pass in. Since I am not using canopies at present
>>>>> I would expect k to always be honored but the quality of the
>>>>> clusters would depend on the convergence amount and number of
>>>>> iterations allowed. No?
>>>>
>>>>
>>
>>


RE: kmeans not returning k clusters

Posted by Danfeng Li <dl...@operasolutions.com>.
I got the same issue. What I found is that many of the initial centers are empty, and the final number of clusters is decided by the number of nonempty centers.

Here are some examples from my case:

...
CL-34358205{n=0 c=[] r=[]}
CL-34358207{n=0 c=[] r=[]}
CL-34358209{n=0 c=[] r=[]}
CL-34358213{n=0 c=[0:1.000] r=[]}
CL-34358215{n=0 c=[] r=[]}
CL-34358216{n=0 c=[] r=[]}
CL-34358217{n=0 c=[] r=[]}
CL-34358220{n=0 c=[] r=[]}
CL-34358221{n=0 c=[] r=[]}
CL-34358222{n=0 c=[] r=[]}
CL-34358223{n=0 c=[] r=[]}
CL-34358224{n=0 c=[] r=[]}
CL-34358227{n=0 c=[0:1.000] r=[]}
CL-34358228{n=0 c=[] r=[]}
CL-34358229{n=0 c=[] r=[]}
...
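
A rough way to count how many of these centers are actually populated (a
quick sketch keyed to this dump format, nothing official):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Rough count of non-empty centers in a clusterdump-style listing like the one above.
public class CountNonEmptyCenters {
  public static void main(String[] args) throws IOException {
    long nonEmpty = Files.lines(Paths.get(args[0]))   // the dump above, saved to a text file
        .filter(line -> line.startsWith("CL-"))       // one line per cluster
        .filter(line -> !line.contains("c=[]"))       // "c=[]" means the center printed as empty
        .count();
    System.out.println("non-empty centers: " + nonEmpty);
  }
}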

Is it the case that there is a bug in the initialization?

Thanks.
Dan

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com] 
Sent: Tuesday, May 08, 2012 9:13 AM
To: user@mahout.apache.org
Subject: Re: kmeans not returning k clusters

Here is a sample data set. In this case I asked for 30 and got 28 but in other cases the discrepancy has been greater like ask for 200 and get 38 but that was for a much larger data set.

Running on my mac laptop in a single node pseudo cluster hadoop 0.20.205, mahout 0.6

command line:

mahout kmeans \
     -i b2/bixo-vectors/tfidf-vectors/ \
     -c b2/bixo-kmeans-centroids \
     -cl \
     -o b2/bixo-kmeans-clusters \
     -k 30 \
     -ow \
     -cd 0.01 \
     -x 20 \
     -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740

BTW when I run rowsimilarity asking for 20 similar docs I get a max of
20 but sometimes many less. Shouldn't this always return the requested number? I'll post this question again to the the attention of the right person.

On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
> I looked at the 0.6 version's code but was not able to find any reason.
> If possible, can you share the data you are trying to cluster along 
> with the execution parameters?
>
> You can also open a Jira for this and provide the info there.
>
> On 07-05-2012 19:45, Pat Ferrel wrote:
>> 0.6
>>
>> I take it this is not expected behavior? I could be doing something 
>> stupid. I only look in the "final" directory. Looking in the others 
>> with clusterdump shows the same number of clusters and I assumed they 
>> were iterations.
>>
>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>>
>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>> What would cause kmeans to not return k clusters? As I tweak 
>>>> parameters I get different numbers of clusters but it's usually 
>>>> less than the k I pass in. Since I am not using canopies at present 
>>>> I would expect k to always be honored but the quality of the 
>>>> clusters would depend on the convergence amount and number of 
>>>> iterations allowed. No?
>>>
>>>
>>>
>
>
>

Re: rowsimilarity not creating requested number of similar docs

Posted by Suneel Marthi <su...@yahoo.com>.
This is not a bug; the similarity measure does cut off the results that are returned.
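
In effect it behaves like the sketch below (plain Java, just an illustration
of the cutoff, not the RowSimilarityJob code): -m is only a cap on how many
neighbors come back per row, and pairs with no similarity at all are never
emitted, so you can see fewer than you asked for.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustration of the cutoff: at most maxPerRow neighbors, and zero-similarity pairs never appear.
class SimilarityCutoffSketch {
  static List<Map.Entry<Integer, Double>> topNeighbors(Map<Integer, Double> similarities, int maxPerRow) {
    return similarities.entrySet().stream()
        .filter(e -> e.getValue() > 0.0)                          // docs sharing no terms produce no entry
        .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
        .limit(maxPerRow)                                         // "-m 20" is a cap, not a guarantee
        .collect(Collectors.toList());
  }
}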



________________________________
 From: Pat Ferrel <pa...@occamsmachete.com>
To: user@mahout.apache.org 
Sent: Tuesday, May 8, 2012 1:06 PM
Subject: rowsimilarity not creating requested number of similar docs
 
Using the below data set I ran rowsimilarity asking for 20 similar docs but got anywhere from 1 to 20. Is this the expected behavior? It would be nice to get all 20 so I can see where the similarity starts to drop off.

  mahout rowid     -i b2/bixo-vectors/tfidf-vectors/part-r-00000     -o b2/bixo-matrix

  mahout rowsimilarity \
      -i b2/bixo-matrix/matrix \
      -o b2/bixo-similarity \
      -r 5250 \
      --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
      -m 20 \
      -ess true

Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 

Using the same config as below kmeans example.

I could file bugs but I'm not sure if this is a bug or not.

On 5/8/12 9:19 AM, Pat Ferrel wrote:
> BTW it seems odd that I get large numbers for distance from centroid using clustering. Shouldn't I expect small numbers for the closest docs? I have assumed the real distance is 1-reported distance but the distances reported by rowsimilarity are very small as I'd expect. I was using tanimoto in both cases as the distance measure but also tried cosine with similar results.
> 
> On 5/8/12 9:12 AM, Pat Ferrel wrote:
>> Here is a sample data set. In this case I asked for 30 and got 28 but in other cases the discrepancy has been greater like ask for 200 and get 38 but that was for a much larger data set.
>> 
>> Running on my mac laptop in a single node pseudo cluster hadoop 0.20.205, mahout 0.6
>> 
>> command line:
>> 
>> mahout kmeans \
>>     -i b2/bixo-vectors/tfidf-vectors/ \
>>     -c b2/bixo-kmeans-centroids \
>>     -cl \
>>     -o b2/bixo-kmeans-clusters \
>>     -k 30 \
>>     -ow \
>>     -cd 0.01 \
>>     -x 20 \
>>     -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>> 
>> Find the data here:
>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 
>> 
>> BTW when I run rowsimilarity asking for 20 similar docs I get a max of 20 but sometimes many less. Shouldn't this always return the requested number? I'll post this question again to the the attention of the right person.
>> 
>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>> I looked at the 0.6 version's code but was not able to find any reason.
>>> If possible, can you share the data you are trying to cluster along with the execution parameters?
>>> 
>>> You can also open a Jira for this and provide the info there.
>>> 
>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>> 0.6
>>>> 
>>>> I take it this is not expected behavior? I could be doing something stupid. I only look in the "final" directory. Looking in the others with clusterdump shows the same number of clusters and I assumed they were iterations.
>>>> 
>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>>>> 
>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>> What would cause kmeans to not return k clusters? As I tweak parameters I get different numbers of clusters but it's usually less than the k I pass in. Since I am not using canopies at present I would expect k to always be honored but the quality of the clusters would depend on the convergence amount and number of iterations allowed. No?
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> 

rowsimilarity not creating requested number of similar docs

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Using the data set below, I ran rowsimilarity asking for 20 similar docs 
but got anywhere from 1 to 20. Is this the expected behavior? It would 
be nice to get all 20 so I can see where the similarity starts to drop off.

   mahout rowid \
       -i b2/bixo-vectors/tfidf-vectors/part-r-00000 \
       -o b2/bixo-matrix

   mahout rowsimilarity \
       -i b2/bixo-matrix/matrix \
       -o b2/bixo-similarity \
       -r 5250 \
       --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
       -m 20 \
       -ess true

Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 


Using the same config as in the kmeans example below.

I could file a bug, but I'm not sure whether this is really a bug or not.

On 5/8/12 9:19 AM, Pat Ferrel wrote:
> BTW it seems odd that I get large numbers for distance from centroid 
> using clustering. Shouldn't I expect small numbers for the closest 
> docs? I have assumed the real distance is 1-reported distance but the 
> distances reported by rowsimilarity are very small as I'd expect. I 
> was using tanimoto in both cases as the distance measure but also 
> tried cosine with similar results.
>
> On 5/8/12 9:12 AM, Pat Ferrel wrote:
>> Here is a sample data set. In this case I asked for 30 and got 28 but 
>> in other cases the discrepancy has been greater like ask for 200 and 
>> get 38 but that was for a much larger data set.
>>
>> Running on my mac laptop in a single node pseudo cluster hadoop 
>> 0.20.205, mahout 0.6
>>
>> command line:
>>
>> mahout kmeans \
>>     -i b2/bixo-vectors/tfidf-vectors/ \
>>     -c b2/bixo-kmeans-centroids \
>>     -cl \
>>     -o b2/bixo-kmeans-clusters \
>>     -k 30 \
>>     -ow \
>>     -cd 0.01 \
>>     -x 20 \
>>     -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>
>> Find the data here:
>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 
>>
>>
>> BTW when I run rowsimilarity asking for 20 similar docs I get a max 
>> of 20 but sometimes many less. Shouldn't this always return the 
>> requested number? I'll post this question again to the the attention 
>> of the right person.
>>
>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>> I looked at the 0.6 version's code but was not able to find any reason.
>>> If possible, can you share the data you are trying to cluster along 
>>> with the execution parameters?
>>>
>>> You can also open a Jira for this and provide the info there.
>>>
>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>> 0.6
>>>>
>>>> I take it this is not expected behavior? I could be doing something 
>>>> stupid. I only look in the "final" directory. Looking in the others 
>>>> with clusterdump shows the same number of clusters and I assumed 
>>>> they were iterations.
>>>>
>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>>>>
>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>> What would cause kmeans to not return k clusters? As I tweak 
>>>>>> parameters I get different numbers of clusters but it's usually 
>>>>>> less than the k I pass in. Since I am not using canopies at 
>>>>>> present I would expect k to always be honored but the quality of 
>>>>>> the clusters would depend on the convergence amount and number of 
>>>>>> iterations allowed. No?
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>

Re: kmeans not returning k clusters

Posted by Pat Ferrel <pa...@occamsmachete.com>.
BTW, it seems odd that I get large numbers for the distance from the 
centroid when clustering. Shouldn't I expect small numbers for the 
closest docs? I have assumed the real distance is 1 minus the reported 
distance, but the distances reported by rowsimilarity are very small, as 
I'd expect. I was using Tanimoto as the distance measure in both cases 
but also tried cosine with similar results.
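
For my own sanity I wrote down the usual Tanimoto definitions (a plain Java
sketch, not the Mahout classes): similarity = dot(a,b) / (|a|^2 + |b|^2 - dot(a,b)),
and the distance measure is 1 minus that, so near-duplicate docs should give
distances near 0 and unrelated docs distances near 1.

// Sketch of the usual Tanimoto definitions: similarity in [0,1], distance = 1 - similarity.
class TanimotoSketch {
  static double similarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    double denom = normA + normB - dot;
    return denom == 0 ? 0 : dot / denom;       // identical non-zero vectors give 1.0
  }

  static double distance(double[] a, double[] b) {
    return 1.0 - similarity(a, b);             // close docs -> small distance, unrelated docs -> near 1
  }
}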

On 5/8/12 9:12 AM, Pat Ferrel wrote:
> Here is a sample data set. In this case I asked for 30 and got 28 but 
> in other cases the discrepancy has been greater like ask for 200 and 
> get 38 but that was for a much larger data set.
>
> Running on my mac laptop in a single node pseudo cluster hadoop 
> 0.20.205, mahout 0.6
>
> command line:
>
> mahout kmeans \
>     -i b2/bixo-vectors/tfidf-vectors/ \
>     -c b2/bixo-kmeans-centroids \
>     -cl \
>     -o b2/bixo-kmeans-clusters \
>     -k 30 \
>     -ow \
>     -cd 0.01 \
>     -x 20 \
>     -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>
> Find the data here:
> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740 
>
>
> BTW when I run rowsimilarity asking for 20 similar docs I get a max of 
> 20 but sometimes many less. Shouldn't this always return the requested 
> number? I'll post this question again to the the attention of the 
> right person.
>
> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>> I looked at the 0.6 version's code but was not able to find any reason.
>> If possible, can you share the data you are trying to cluster along 
>> with the execution parameters?
>>
>> You can also open a Jira for this and provide the info there.
>>
>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>> 0.6
>>>
>>> I take it this is not expected behavior? I could be doing something 
>>> stupid. I only look in the "final" directory. Looking in the others 
>>> with clusterdump shows the same number of clusters and I assumed 
>>> they were iterations.
>>>
>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>>>
>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>> What would cause kmeans to not return k clusters? As I tweak 
>>>>> parameters I get different numbers of clusters but it's usually 
>>>>> less than the k I pass in. Since I am not using canopies at 
>>>>> present I would expect k to always be honored but the quality of 
>>>>> the clusters would depend on the convergence amount and number of 
>>>>> iterations allowed. No?
>>>>
>>>>
>>>>
>>
>>
>>

Re: kmeans not returning k clusters

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Here is a sample data set. In this case I asked for 30 clusters and got 
28, but in other cases the discrepancy has been greater, like asking for 
200 and getting 38, though that was for a much larger data set.

Running on my Mac laptop in a single-node pseudo-distributed cluster with 
Hadoop 0.20.205 and Mahout 0.6.

command line:

mahout kmeans \
     -i b2/bixo-vectors/tfidf-vectors/ \
     -c b2/bixo-kmeans-centroids \
     -cl \
     -o b2/bixo-kmeans-clusters \
     -k 30 \
     -ow \
     -cd 0.01 \
     -x 20 \
     -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
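
My reading of -cd 0.01 is that a cluster counts as converged once its center
moves less than that distance between iterations (under the configured
distance measure; Euclidean below just for the sketch, which is not the
Mahout code):

// Sketch of the convergence test I believe -cd controls: has this center stopped moving?
class ConvergenceSketch {
  static boolean converged(double[] oldCenter, double[] newCenter, double convergenceDelta) {
    double sumSq = 0.0;
    for (int i = 0; i < oldCenter.length; i++) {
      double d = oldCenter[i] - newCenter[i];
      sumSq += d * d;
    }
    // When every cluster passes this test, iterating stops early (before the -x 20 limit).
    return Math.sqrt(sumSq) <= convergenceDelta;
  }
}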

Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740

BTW, when I run rowsimilarity asking for 20 similar docs I get a max of 
20 but sometimes many fewer. Shouldn't this always return the requested 
number? I'll post this question again for the attention of the right 
person.

On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
> I looked at the 0.6 version's code but was not able to find any reason.
> If possible, can you share the data you are trying to cluster along 
> with the execution parameters?
>
> You can also open a Jira for this and provide the info there.
>
> On 07-05-2012 19:45, Pat Ferrel wrote:
>> 0.6
>>
>> I take it this is not expected behavior? I could be doing something 
>> stupid. I only look in the "final" directory. Looking in the others 
>> with clusterdump shows the same number of clusters and I assumed they 
>> were iterations.
>>
>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>>
>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>> What would cause kmeans to not return k clusters? As I tweak 
>>>> parameters I get different numbers of clusters but it's usually 
>>>> less than the k I pass in. Since I am not using canopies at present 
>>>> I would expect k to always be honored but the quality of the 
>>>> clusters would depend on the convergence amount and number of 
>>>> iterations allowed. No?
>>>
>>>
>>>
>
>
>

Re: kmeans not returning k clusters

Posted by Paritosh Ranjan <pr...@xebia.com>.
I looked at the 0.6 version's code but was not able to find any reason.
If possible, can you share the data you are trying to cluster along with 
the execution parameters?

You can also open a Jira for this and provide the info there.

On 07-05-2012 19:45, Pat Ferrel wrote:
> 0.6
>
> I take it this is not expected behavior? I could be doing something 
> stupid. I only look in the "final" directory. Looking in the others 
> with clusterdump shows the same number of clusters and I assumed they 
> were iterations.
>
> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>> Which version are you using ? 0.6 or the current 0.7-snapshot?
>>
>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>> What would cause kmeans to not return k clusters? As I tweak 
>>> parameters I get different numbers of clusters but it's usually less 
>>> than the k I pass in. Since I am not using canopies at present I 
>>> would expect k to always be honored but the quality of the clusters 
>>> would depend on the convergence amount and number of iterations 
>>> allowed. No?
>>
>>
>>


Re: kmeans not returning k clusters

Posted by Pat Ferrel <pa...@farfetchers.com>.
0.6

I take it this is not expected behavior? I could be doing something 
stupid. I only look in the "final" directory. Looking in the others with 
clusterdump shows the same number of clusters and I assumed they were 
iterations.

On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
> Which version are you using ? 0.6 or the current 0.7-snapshot?
>
> On 07-05-2012 02:19, Pat Ferrel wrote:
>> What would cause kmeans to not return k clusters? As I tweak 
>> parameters I get different numbers of clusters but it's usually less 
>> than the k I pass in. Since I am not using canopies at present I 
>> would expect k to always be honored but the quality of the clusters 
>> would depend on the convergence amount and number of iterations 
>> allowed. No?
>
>
>

Re: kmeans not returning k clusters

Posted by Paritosh Ranjan <pr...@xebia.com>.
Which version are you using? 0.6 or the current 0.7-snapshot?

On 07-05-2012 02:19, Pat Ferrel wrote:
> What would cause kmeans to not return k clusters? As I tweak 
> parameters I get different numbers of clusters but it's usually less 
> than the k I pass in. Since I am not using canopies at present I would 
> expect k to always be honored but the quality of the clusters would 
> depend on the convergence amount and number of iterations allowed. No?