Posted to user@mahout.apache.org by Ankit Goel <an...@gmail.com> on 2015/07/21 02:18:57 UTC

Kmeans clusterdump Interpretation

Hi,
I've been messing with Mahout 0.10 and kmeans clustering with a Solr 4.6.1
index. The data is news articles. The --field option for kmeans is set to
"content". The idField is set to "title" (just so I can analyse it faster).
The clusterdump of the kmeans result gives me proper output, but I can't
figure out the id of the vector chosen as the center. There are only 14-15
articles, so I am not hung up on the cluster performance at this time.

I used random seeds for the kmeans commandline.
For reference, this is the commandline cluster dump I am executing

bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final \
  -p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt -b 5

The output I get is of the form

:{"r":

top terms

xxxxx==>xxxxx

Weight : [props - optional]:  Point:

 1.0 : [distance=0.0]: [{"account":0.026}.......other features]

1.0 : [distance=0.3963903651622338]: [....]


So how exactly do I get the centroid id? I have even tried accessing it
with Java

ClusterWritable value.getValue().getCenter() but this just gives me the
features and values of the centroid.

Also, please explain the meaning of "account":0.026 (just to make sure I
understand it correctly). I used TF-IDF weighting.

-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Kmeans clusterdump Interpretation

Posted by Ankit Goel <an...@gmail.com>.

True that. Kmeans is just a first step anyway. It definitely needs tuning.
Thanks, guys.

On Tue, Jul 21, 2015 at 9:46 AM, Ted Dunning <te...@gmail.com> wrote:

> You can always just pick the article closest to the centroid.
>
> But I think that you may find that with simple k-means that clusters are
> going to be about more than one thing.



-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Kmeans clusterdump Interpretation

Posted by Ted Dunning <te...@gmail.com>.

You can always just pick the article closest to the centroid.

But I think you may find that, with simple k-means, clusters are
going to be about more than one thing.



On Mon, Jul 20, 2015 at 8:21 PM, Ankit Goel <an...@gmail.com> wrote:

> Hmm, kmeans algorithmically is supposed to only anoint existing
> vectors (documents) as the centroid for a cluster at every step (or so I
> believe). If Mahout is generating a non-document vector as a centroid, it
> changes a lot of things.
>
> That would also explain the -distanceMeasure option in clusterdump. As
> Andrew mentions, running clusterdump with the default euclidean measure
> should give me the closest document vector to the calculated centroid.
> Please correct me if I'm wrong anywhere.
> Thanks

Re: Kmeans clusterdump Interpretation

Posted by Ankit Goel <an...@gmail.com>.

Hmm, kmeans algorithmically is supposed to only anoint existing
vectors (documents) as the centroid for a cluster at every step (or so I
believe). If Mahout is generating a non-document vector as a centroid, it
changes a lot of things.

That would also explain the -distanceMeasure option in clusterdump. As
Andrew mentions, running clusterdump with the default Euclidean measure
should give me the closest document vector to the calculated centroid.
Please correct me if I'm wrong anywhere.
Thanks

On Tue, Jul 21, 2015 at 7:33 AM, Andrew Musselman <
andrew.musselman@gmail.com> wrote:

> It's possible you could write a post-processing step to find the closest
> point to the centroid based on the "distance" property if I'm recalling it
> correctly.



-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Kmeans clusterdump Interpretation

Posted by Andrew Musselman <an...@gmail.com>.

It's possible you could write a post-processing step to find the closest
point to the centroid based on the "distance" property if I'm recalling it
correctly.
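
A minimal sketch of such a post-processing step, assuming the cluster's point
lines have already been read from the clusterdump text output and that each
one carries a "[distance=...]" property in the form quoted in the original
post. ClosestPointSketch and closestLine are hypothetical names, not part of
Mahout.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout: scans the point lines of one
// cluster from a clusterdump text file and returns the line with the
// smallest "distance" property.
class ClosestPointSketch {
    // Matches a property such as "[distance=0.3963903651622338]".
    private static final Pattern DISTANCE =
        Pattern.compile("\\[distance=([0-9.eE+-]+)\\]");

    static String closestLine(List<String> pointLines) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (String line : pointLines) {
            Matcher m = DISTANCE.matcher(line);
            if (!m.find()) {
                continue; // not a point line (header, top terms, blank)
            }
            double d = Double.parseDouble(m.group(1));
            if (d < bestDistance) {
                bestDistance = d;
                best = line;
            }
        }
        return best; // null if no line carried a distance property
    }
}
```

Applied to the two point lines quoted in the original post, this would return
the one with distance=0.0. Mapping that line back to a document id is left
out, since the quoted output does not show where the id appears.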

On Mon, Jul 20, 2015 at 6:45 PM, Ankit Goel <an...@gmail.com> wrote:

> That kind of puts me in a tough position. I was planning to use kmeans as a
> method for aggregating similar articles from multiple news sources, and
> then getting a representative article from those. Here I mean similar as in
> the articles are from different news sources but are about the exact same
> thing. Intuitively it seems that these articles would get grouped
> together. Any suggestions how I should go about that? So far I'm using
> nutch to crawl, solr to index and now I'm here on mahout.

Re: Kmeans clusterdump Interpretation

Posted by Ankit Goel <an...@gmail.com>.

That kind of puts me in a tough position. I was planning to use kmeans as a
method for aggregating similar articles from multiple news sources, and
then getting a representative article from those. By similar I mean
articles that come from different news sources but are about the exact same
thing. Intuitively it seems that these articles would get grouped
together. Any suggestions on how I should go about that? So far I'm using
Nutch to crawl, Solr to index, and now I'm here on Mahout.

On Tue, Jul 21, 2015 at 7:10 AM, Ted Dunning <te...@gmail.com> wrote:

> The most central point in a cluster is often referred to as a medoid
> (similar to median, but multi-dimensional).
>
> The Mahout code does not compute medoids.  In general, they are difficult
> to compute and implementing a full k-medoid clustering algorithm even more
> so.



-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Kmeans clusterdump Interpretation

Posted by Ted Dunning <te...@gmail.com>.

The most central point in a cluster is often referred to as a medoid
(similar to median, but multi-dimensional).

The Mahout code does not compute medoids.  In general, they are difficult
to compute, and implementing a full k-medoid clustering algorithm is even
more so.
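
To make the definition concrete, here is a brute-force sketch for a single,
already-formed cluster: the medoid is taken as the member whose summed
distance to all other members is smallest. This is plain Java over dense
arrays with Euclidean distance, both of which are assumptions; MedoidSketch
is a hypothetical name, not Mahout code, and this is not a k-medoid
clustering algorithm.

```java
// Hypothetical helper, not part of Mahout: brute-force medoid of one cluster.
class MedoidSketch {
    // Euclidean distance between two vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the point whose summed distance to all other
    // points is smallest, or -1 for an empty cluster.
    // Cost is O(n^2) distance computations.
    static int medoidIndex(double[][] points) {
        int best = -1;
        double bestTotal = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.length; i++) {
            double total = 0.0;
            for (int j = 0; j < points.length; j++) {
                total += euclidean(points[i], points[j]);
            }
            if (total < bestTotal) {
                bestTotal = total;
                best = i;
            }
        }
        return best;
    }
}
```

The quadratic number of distance computations is cheap for a handful of
articles but grows quickly with cluster size.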



On Mon, Jul 20, 2015 at 6:25 PM, Ankit Goel <an...@gmail.com> wrote:

> Oh, I thought kmeans gave me a point vector as a centroid, not a calculated
> point central to a cluster. I guess in this case I would be looking for the
> most central point vector (from the index) that I can use as a
> representative of the cluster.

Re: Kmeans clusterdump Interpretation

Posted by Ankit Goel <an...@gmail.com>.

Oh, I thought kmeans gave me a point vector as a centroid, not a calculated
point central to a cluster. I guess in this case I would be looking for the
most central point vector (from the index) that I can use as a
representative of the cluster.

On Tue, Jul 21, 2015 at 6:41 AM, Andrew Musselman <
andrew.musselman@gmail.com> wrote:

> I'm not sure centroid id is even a defined thing, especially since the
> centroid, in my understanding, is just a point in space, not necessarily a
> point in your data.
>
> Are you trying to find the most-central point in a given cluster?



-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Kmeans clusterdump Interpretation

Posted by Andrew Musselman <an...@gmail.com>.

I'm not sure centroid id is even a defined thing, especially since the
centroid, in my understanding, is just a point in space, not necessarily a
point in your data.

Are you trying to find the most-central point in a given cluster?
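
A small illustration of that point, assuming the usual k-means definition of
a centroid as the component-wise mean of the points assigned to a cluster.
CentroidSketch is a hypothetical name and the code uses plain arrays rather
than Mahout vectors.

```java
import java.util.Arrays;

// Hypothetical illustration, not Mahout code.
class CentroidSketch {
    // Component-wise mean of the points assigned to one cluster.
    // Assumes at least one point and equal dimensions.
    static double[] centroid(double[][] points) {
        double[] mean = new double[points[0].length];
        for (double[] p : points) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += p[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= points.length;
        }
        return mean;
    }

    // True if the centroid coincides exactly with one of the input points.
    static boolean isDataPoint(double[] c, double[][] points) {
        for (double[] p : points) {
            if (Arrays.equals(c, p)) {
                return true;
            }
        }
        return false;
    }
}
```

For the points (0,0), (2,0) and (1,3) the centroid is (1,1), which is none of
the inputs, so there is no document id to attach to it. A cluster holding a
single point is the simple case where the centroid does coincide with a
document.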
