Posted to user@mahout.apache.org by David Stuart <da...@progressivealliance.co.uk> on 2010/05/15 13:03:12 UTC

Clustering help

Hey all,

I'd like some help with my clustering proof of concept. I have a Solr
index full of job descriptions. I have created vectors from it and run
kmeans with about 10 clusters. Dumping the results seems to yield a
scatter graph with not very distinct clumps. I have read that you
generally have to run the process a number of times to get better
clumps. How do I do this? When I try to use the output from kmeans, I
get an error about it not being the right input format. For text
comparison like this, should I be using fuzzykmeans or another type of
algorithm, or is it a try-it-and-see situation?
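
One common reading of "run the process a number of times" is to restart
k-means from several random initialisations and keep the run with the
lowest within-cluster sum of squared distances. The following is a
minimal pure-Python sketch of that idea under stated assumptions. It is
not Mahout's implementation, and the function names (kmeans,
best_of_restarts), the toy points, and the parameter defaults are all
hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, max_iter=100):
    """One k-means run (Lloyd's algorithm) from random initial centroids."""
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    cost = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return cost, centroids, assignment


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-means several times and keep the lowest-cost result."""
    rng = random.Random(seed)
    runs = (kmeans(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[0])


# Toy data (an assumption for illustration): two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
cost, centroids, assignment = best_of_restarts(points, k=2)
print(cost, assignment)
```

Restarts only help with unlucky initial centroids. They do not make
clumps appear in data that has none.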

Regards

David Stuart

Re: Clustering help

Posted by Jake Mannix <ja...@gmail.com>.

Cool.  The mahout shell script has the pair of hadoop jobs to be run on your
vector set ("mahout svd" and "mahout cleaneigens" are the two command lines
you'll want to use).

  -jake

On Sat, May 15, 2010 at 6:42 AM, David Stuart <
david.stuart@progressivealliance.co.uk> wrote:

> At the moment only about 10,000, but the plan was to get this working and
> then ramp it up. I'll try the SVD route and report back on success.
>
> Thanks
>
> David Stuart

Re: Clustering help

Posted by David Stuart <da...@progressivealliance.co.uk>.

At the moment only about 10,000, but the plan was to get this working and
then ramp it up. I'll try the SVD route and report back on success.

Thanks

David Stuart

On 15 May 2010, at 14:09, Jake Mannix <ja...@gmail.com> wrote:

> Hi David,
>
>  Text is extraordinarily sparse (high dimensional), and clustering the raw
> text will not get you great results.  If you reduce the dimensionality, by
> doing SVD on the text first, *then* doing kmeans on the reduced vectors,
> you'll get better clusters.  Alternately, running LDA on the text can do
> similar things.  How many job descriptions do you have in your Solr index?
>
>  -jake

Re: Clustering help

Posted by Jake Mannix <ja...@gmail.com>.

Hi David,

  Text is extraordinarily sparse (high dimensional), and clustering the raw
text will not get you great results.  If you reduce the dimensionality, by
doing SVD on the text first, *then* doing kmeans on the reduced vectors,
you'll get better clusters.  Alternately, running LDA on the text can do
similar things.  How many job descriptions do you have in your Solr index?
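
To make the "SVD first, then kmeans" suggestion concrete, here is a
small pure-Python sketch that projects document vectors onto their top
right singular vectors. It swaps in plain power iteration with
deflation, which is only practical for tiny matrices and is not the
distributed solver behind the Mahout jobs. The function names, the toy
document-term matrix, and the choice of rank are all assumptions made
for illustration.

```python
import math
import random


def matvec(A, v):
    """A (n x m) times v (length m)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def rmatvec(A, u):
    """Transpose of A (m x n) times u (length n)."""
    return [sum(row[j] * ui for row, ui in zip(A, u)) for j in range(len(A[0]))]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_right_singular_vectors(A, rank, iters=50, seed=0):
    """Approximate the top `rank` right singular vectors of A."""
    rng = random.Random(seed)
    basis = []
    for _ in range(rank):
        v = normalize([rng.random() for _ in A[0]])
        for _ in range(iters):
            v = rmatvec(A, matvec(A, v))  # one step of power iteration on A^T A
            for b in basis:  # deflation: remove directions already found
                proj = sum(x * y for x, y in zip(v, b))
                v = [x - proj * y for x, y in zip(v, b)]
            v = normalize(v)
        basis.append(v)
    return basis


def project(A, basis):
    """Coordinates of each row of A in the reduced space."""
    return [[sum(a * x for a, x in zip(row, b)) for b in basis] for row in A]


# Toy document-term matrix (hypothetical): rows are documents, columns are
# term counts. Documents 0-2 share one vocabulary, documents 3-5 another.
A = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 1, 2],
]
reduced = project(A, top_right_singular_vectors(A, rank=2))
for row in reduced:
    print([round(x, 3) for x in row])
```

The reduced two-dimensional rows, rather than the original six-dimensional
ones, would then be the input to k-means.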

  -jake

On Sat, May 15, 2010 at 4:03 AM, David Stuart <
david.stuart@progressivealliance.co.uk> wrote:

> Hey all,
>
> I'd like some help with my clustering proof of concept. I have a Solr
> index full of job descriptions. I have created vectors from it and run
> kmeans with about 10 clusters. Dumping the results seems to yield a
> scatter graph with not very distinct clumps. I have read that you
> generally have to run the process a number of times to get better
> clumps. How do I do this? When I try to use the output from kmeans, I
> get an error about it not being the right input format. For text
> comparison like this, should I be using fuzzykmeans or another type of
> algorithm, or is it a try-it-and-see situation?
>
> Regards
>
> David Stuart
>

Re: Clustering help

Posted by David Stuart <da...@progressivealliance.co.uk>.

Thanks Ted, I will take that into account. From my really simplistic
point of view, I was trying to achieve some form of extended IDF
functionality, as I had success using similarity results from Solr. I
suppose it's simple in theory but much more complicated at large scale.
I am quite happy that you used the word clumpiness, though. It made
my day ;)

Regards

David Stuart

On 15 May 2010, at 17:40, Ted Dunning <te...@gmail.com> wrote:

> You won't necessarily see any distinct clumps, depending on your data.  With
> some text, you might get such, but with resumes, especially if you don't do
> IDF weighting, you are likely to have a pretty nasty distribution that
> doesn't clump very well at all.  Even with IDF weighting on terms and the
> inclusion of phrases and reduction using SVD, you will quite plausibly not
> see clumps.
>
> Nonetheless, the k-means clusters may still be a good description of your
> data.  Generally, what you want from clustering is some partition of your
> data along sensible boundaries, or at least into some more or less
> homogeneous groups.  Whether you have visually appealing clusters is another
> question entirely.
>
> In fact, *any* distribution of data can be clustered into pretty much any
> number of clusters (if you have enough data).  The result is a rough
> description of that distribution and if whatever characteristic that you are
> interested in can be separated in the original feature space, then the
> distances to the cluster centroids will be good for subsequent modeling.
> This fact does *not* depend on clumpiness in your data.
>
> On Sat, May 15, 2010 at 4:03 AM, David Stuart <
> david.stuart@progressivealliance.co.uk> wrote:
>
>> Dumping the results seems to yield a scatter graph with not very distinct
>> clumps.

Re: Clustering help

Posted by Ted Dunning <te...@gmail.com>.

You won't necessarily see any distinct clumps, depending on your data.  With
some text, you might get such, but with resumes, especially if you don't do
IDF weighting, you are likely to have a pretty nasty distribution that
doesn't clump very well at all.  Even with IDF weighting on terms and the
inclusion of phrases and reduction using SVD, you will quite plausibly not
see clumps.
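
As an illustration of what IDF weighting does to term vectors, here is
a minimal sketch using the plain tf * log(N / df) form. That formula is
an assumption: search and clustering libraries typically use smoothed
variants, so the numbers will not match theirs. The function name and
the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Turn a list of token lists into sparse {term: weight} vectors."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: count * math.log(n / df[t]) for t, count in tf.items()})
    return vectors


# Hypothetical job-description snippets, already tokenised on whitespace.
docs = [
    "java developer with experience in spring".split(),
    "python developer with experience in django".split(),
    "nurse with experience in paediatrics".split(),
]
vectors = tfidf(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in every document ("with", "experience", "in") get
weight zero here, so they stop dominating the distances between
documents, while rarer terms such as "java" carry more weight than
"developer".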

Nonetheless, the k-means clusters may still be a good description of your
data.  Generally, what you want from clustering is some partition of your
data along sensible boundaries, or at least into some more or less
homogeneous groups.  Whether you have visually appealing clusters is another
question entirely.

In fact, *any* distribution of data can be clustered into pretty much any
number of clusters (if you have enough data).  The result is a rough
description of that distribution and if whatever characteristic that you are
interested in can be separated in the original feature space, then the
distances to the cluster centroids will be good for subsequent modeling.
 This fact does *not* depend on clumpiness in your data.
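
A small sketch of using distances to the cluster centroids as features
for later modeling. The function name, centroids, and sample vector are
made up for illustration. In practice the centroids would come from a
finished k-means run.

```python
import math


def centroid_distance_features(vector, centroids):
    """Describe a vector by its Euclidean distance to each centroid."""
    return [
        math.sqrt(sum((x - c) ** 2 for x, c in zip(vector, centroid)))
        for centroid in centroids
    ]


# Hypothetical centroids from an earlier clustering run.
centroids = [(0.0, 0.0), (3.0, 4.0)]
features = centroid_distance_features((3.0, 0.0), centroids)
print(features)  # [3.0, 4.0]
```

Each document then becomes a short dense vector with one entry per
cluster, which a downstream model can use whether or not the data looks
clumpy in a scatter plot.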

On Sat, May 15, 2010 at 4:03 AM, David Stuart <
david.stuart@progressivealliance.co.uk> wrote:

> Dumping the results seems to yield a scatter graph with not very distinct
> clumps.