Posted to user@mahout.apache.org by William Moran <ec...@gmail.com> on 2013/08/12 23:12:12 UTC

Question about clusterdump

Hi,

What exactly are the numbers next to these terms? (this is an example
clusterdump from the Mahout in Action book, but my clusterdumps look
similar).

Top Terms:

Shania Twain => 1.126984126984127
Garth Brooks => 0.746031746031746
Sara Evans => 0.6031746031746031
Lonestar => 0.5238095238095238

Sorry if this is an obvious question but I find it hard to find details on
these specifics.

Many thanks,

Will

Re: Question about clusterdump

Posted by Ritwik Kumar <li...@gmail.com>.
I am not 100% sure how Mahout's implementation of the KMeans algorithm does
this, but in general the cluster center is the centroid of all the points
that belong to that cluster. In the simplest case, it is just the average of
those points. Alternatively, it could be the actual point that is closest to
the centroid.
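
To make the "average of all the points" idea concrete, here is a minimal sketch in Python (not Mahout code). It assumes each point is a sparse vector stored as a {term: weight} dict; the helper names centroid and top_terms and the toy weights are made up for illustration. The "top terms" of a cluster are then simply the largest components of the averaged vector.

```python
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of sparse vectors given as {term: weight} dicts."""
    totals = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    n = len(vectors)
    # A term missing from a vector counts as 0, so divide by the full count.
    return {term: total / n for term, total in totals.items()}


def top_terms(center, k):
    """Return the k (term, weight) pairs with the largest weights."""
    return sorted(center.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical toy cluster of four points (weights are invented).
points = [
    {"Shania Twain": 2.0, "Garth Brooks": 1.0},
    {"Shania Twain": 1.0, "Lonestar": 1.0},
    {"Garth Brooks": 1.0, "Sara Evans": 1.0},
    {"Shania Twain": 1.0},
]

center = centroid(points)
for term, weight in top_terms(center, 3):
    print(f"{term} => {weight}")
```

As an aside, the four values in the quoted dump are each very close to a whole number divided by 63 (for example 71/63 and 47/63), which would be consistent with averaging per-point weights over 63 points, though that is only an inference from the numbers.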


On Thu, Aug 22, 2013 at 6:58 AM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Aug 12, 2013, at 5:12 PM, William Moran <ec...@gmail.com> wrote:
>
> > Hi,
> >
> > What exactly are the numbers next to these terms? (this is an example
> > clusterdump from the Mahout in Action book, but my clusterdumps look
> > similar).
>
> They are the weights assigned to each of the terms.  They are likely the
> TF/IDF values, but I believe they may be other things depending on how your
> dictionary/vectors were created.
>
> >
> > Top Terms:
> >
> > Shania Twain => 1.126984126984127
> > Garth Brooks => 0.746031746031746
> > Sara Evans => 0.6031746031746031
> > Lonestar => 0.5238095238095238
> >
> > Sorry if this is an obvious question but I find it hard to find details on
> > these specifics.
> >
> > Many thanks,
> >
> > Will
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com

Re: Question about clusterdump

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 12, 2013, at 5:12 PM, William Moran <ec...@gmail.com> wrote:

> Hi,
> 
> What exactly are the numbers next to these terms? (this is an example
> clusterdump from the Mahout in Action book, but my clusterdumps look
> similar).

They are the weights assigned to each of the terms.  They are likely the TF/IDF values, but I believe they may be other things depending on how your dictionary/vectors were created.
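
For readers unfamiliar with TF/IDF weights, here is a rough sketch in Python of one common textbook formulation, tf * log(N / df). This is an assumption for illustration only: Mahout's own vectorizer may use a different variant, and the function name tfidf and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Weight each term in each tokenized document by tf * log(N / df).

    docs is a list of token lists. This is one textbook variant, not
    necessarily the formula Mahout applies.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append(
            {term: count * math.log(n / df[term]) for term, count in tf.items()}
        )
    return weighted


# Hypothetical toy corpus of three tokenized documents.
docs = [
    ["country", "guitar", "guitar"],
    ["country", "fiddle"],
    ["country", "guitar"],
]

vectors = tfidf(docs)
for vec in vectors:
    print(vec)
```

A term that appears in every document ("country" here) gets weight 0 under this variant, while rarer terms get larger weights. A cluster center built from such vectors would then carry averaged TF/IDF weights rather than raw counts.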

> 
> Top Terms:
> 
> Shania Twain => 1.126984126984127
> Garth Brooks => 0.746031746031746
> Sara Evans => 0.6031746031746031
> Lonestar => 0.5238095238095238
> 
> Sorry if this is an obvious question but I find it hard to find details on
> these specifics.
> 
> Many thanks,
> 
> Will

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com