Posted to user@mahout.apache.org by Chris Harrington <ch...@heystaks.com> on 2013/02/20 18:07:22 UTC

kmeans clustering - how to leave some docs unclustered

Hi all,

I'm running kmeans to cluster some text docs, and some docs that seem unrelated to any cluster (i.e. noise) are getting clustered anyway. I'd like to leave them unclustered.

I thought the clusterClassificationThreshold variable would do this for me.

From the javadoc:

clusterClassificationThreshold
   *          Is a clustering strictness / outlier removal parameter. Its value should be between 0 and 1. Vectors
   *          having pdf below this value will not be clustered.

But whenever I change this value, no clustered points get written, and there doesn't seem to be any change in the clusters no matter what value I set (I tried 0.00001 and 0.99999).

Did I misunderstand what this variable does, or am I missing something here?
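
A minimal sketch of one way such a threshold could behave, assuming (this thread does not confirm it) that each point gets a score of 1 / (1 + distance) for every cluster, that the scores are rescaled to sum to 1, and that a point is kept only when its largest rescaled score reaches the threshold. The class and method names are made up for illustration and are not Mahout APIs.

```java
import java.util.Arrays;

/** Illustrative only: a toy model of a "pdf below threshold" filter, not Mahout code. */
final class ThresholdSketch {

    // Assumed per-cluster score: closer centroids score higher, in (0, 1].
    static double score(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Largest score after rescaling all per-cluster scores so they sum to 1.
    static double maxNormalizedScore(double[] distancesToCentroids) {
        double sum = 0.0;
        double max = 0.0;
        for (double d : distancesToCentroids) {
            double s = score(d);
            sum += s;
            max = Math.max(max, s);
        }
        return max / sum;
    }

    // A point counts as clustered only if its best rescaled score reaches the threshold.
    static boolean isClustered(double[] distancesToCentroids, double threshold) {
        return maxNormalizedScore(distancesToCentroids) >= threshold;
    }

    public static void main(String[] args) {
        // Hypothetical point: 40 clusters, one centroid slightly closer than the rest.
        double[] distances = new double[40];
        Arrays.fill(distances, 0.9);
        distances[0] = 0.8;

        System.out.println("best rescaled score: " + maxNormalizedScore(distances));
        System.out.println("kept at 0.025: " + isClustered(distances, 0.025));
        System.out.println("kept at 0.03:  " + isClustered(distances, 0.03));
    }
}
```

Under these assumptions the best rescaled score shrinks roughly like 1/k as the number of clusters k grows, so a usable threshold may sit far below the middle of the 0 to 1 range.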

Re: kmeans clustering - how to leave some docs unclustered

Posted by Matt Molek <mp...@gmail.com>.

Sorry for the confusion, I meant the same thing. I'm also looking at the
content of my clusteredPoints/part-m-00000 file.

I'm having trouble filtering outliers from my clusters too. Depending on
the clusterClassificationThreshold value, either all or none of my points
are classified. I think it's just that I haven't found the right value for
the threshold yet.


Re: kmeans clustering - how to leave some docs unclustered

Posted by Chris Harrington <ch...@heystaks.com>.

Clustering worked for me (sorry if I didn't make that part clear); the empty clusteredPoints/part-m-00000 file is the problem I'm having.

With any threshold value greater than 0.025, clusteredPoints/part-m-00000 is empty, and I use that file to map each document to the cluster it ended up in.
If I can't create a mapping between documents and a cluster then I'm not able to figure out if a particular document was clustered or not.
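
A small sketch of the bookkeeping described above, assuming the document-to-cluster pairs have already been read out of clusteredPoints into a Map (that reading step is omitted here because it depends on the Hadoop and Mahout versions in use). The class name, method name, and document IDs are hypothetical. Any document missing from the map is treated as unclustered.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: finds documents that have no cluster assignment. */
final class UnclusteredDocs {

    // Returns the IDs in allDocIds that do not appear as keys in docToCluster,
    // preserving the iteration order of allDocIds.
    static Set<String> unclustered(Set<String> allDocIds, Map<String, Integer> docToCluster) {
        Set<String> result = new LinkedHashSet<>(allDocIds);
        result.removeAll(docToCluster.keySet());
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input: three documents, two of which were assigned to a cluster.
        Set<String> allDocs = new LinkedHashSet<>();
        allDocs.add("doc-a");
        allDocs.add("doc-b");
        allDocs.add("doc-c");

        Map<String, Integer> docToCluster = new HashMap<>();
        docToCluster.put("doc-a", 3);
        docToCluster.put("doc-c", 7);

        System.out.println("unclustered: " + unclustered(allDocs, docToCluster));
    }
}
```

With this approach an empty mapping is still usable: it simply reports every document as unclustered, which is what one would expect if the threshold filtered out every point.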

Re: kmeans clustering - how to leave some docs unclustered

Posted by Matt Molek <mp...@gmail.com>.

I think you have the right idea about the clusterClassificationThreshold,
but something just isn't working right in your case.

I know this answer won't be particularly helpful since I don't have any
suggestions to fix your problem, but I did a test recently where I tried
clusterClassificationThreshold values of 0.0, 0.1, and 0.5. With 0.0 and
0.1, all my points were clustered. With 0.5, none of them were clustered.
So I assume there is some value for my test data between 0.1 and 0.5 where
I would cluster some but not all of my data.
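
One way to narrow down that range, sketched in Java under the assumption that a per-point "best score" can be obtained somehow (the values below are invented and the class and method names are hypothetical): count how many points would survive each candidate threshold and look for the step where the count starts to drop.

```java
/** Illustrative only: counts how many points pass each candidate threshold. */
final class ThresholdSweep {

    // Number of points whose best score is at or above the threshold.
    static long countClustered(double[] bestScores, double threshold) {
        long count = 0;
        for (double s : bestScores) {
            if (s >= threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical best scores for five points; not real data.
        double[] bestScores = {0.12, 0.18, 0.25, 0.31, 0.44};
        double[] candidates = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5};

        for (double t : candidates) {
            System.out.println("threshold " + t + " keeps " + countClustered(bestScores, t)
                    + " of " + bestScores.length + " points");
        }
    }
}
```

The made-up scores were chosen so that 0.0 and 0.1 keep every point and 0.5 keeps none, mirroring the pattern described above; real data would need its own sweep.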
