Posted to user@mahout.apache.org by Walter Chang <we...@gmail.com> on 2011/10/03 05:52:58 UTC

question about clustering

Hi,

I have used Mahout to run k-means clustering on my tf-idf vectors. I ran it
from the mahout command line, and the job appears to complete successfully.

$MAHOUT_HOME/bin/mahout kmeans  -i ./tfidf-vectors -c ./initialclusters -o
./kmeans-clusters  -cd 1.0 -k 3 -x 1000

Two cluster directories were generated (clusters-1 and clusters-2). When I
run clusterdump on each of them, the top terms of the clusters look the
same. Any idea why?

Also, how can I see which documents have been assigned to each cluster?
Right now I can see the number of documents assigned, but not the complete
list.

Most importantly, for production I assume it makes sense to always run
k-means on Hadoop to generate the clustering output. But how do I consume
that output at serving time? Ideally, the server would take a doc id or a
query and return the top documents in the same cluster, ranked by score.
How do I do that in code? Are there any good examples?

Thanks a lot,

Weide

Re: question about clustering

Posted by Walter Chang <we...@gmail.com>.

Thanks a lot, Kate. A follow-up question: how do I use the clusters?
Do you know which API I should use to load the generated cluster files and
query them (send a document id and get back its cluster id and the other
documents in that cluster)?

Thanks a lot,

Weide

On Mon, Oct 3, 2011 at 11:10 AM, Kate Ericson <er...@cs.colostate.edu>wrote:

> Hi Weide,
>
> Does this mean you have only 60 data points you are trying to cluster?
> This may be part of why it seems to be running so quickly.
> the k flag tells the program how many points to cluster around, so
> having k=3 means you are trying to group your data into 3 clusters.
> As for the folder names, after every iteration of clustering kmeans
> writes out the final cluster positions.  If you hit the max number of
> iterations, or the cluster centers don't move more than a
> predetermined distance the clustering function is stopped.
> Since you have clusters-1 and clusters-2 folders, this means it ran
> for only 2 iterations.
> It looks like you set the max iterations to 1000 (-x 1000), so it's
> definitely hitting the point where your cluster centers are no longer
> moving more than the minimum amount (-cd 1.0).
> You may want to try with a higher k - maybe 10 and see how many
> iterations it goes though.  Another thing to look at is how the
> initial clusters are chosen.  By default, the starting clusters are
> randomly chosen.  Working with the Canopy Clustering program may let
> you find better initial clusters.
>
> Hope this helps,
>
> -Kate
>

Re: question about clustering

Posted by Kate Ericson <er...@cs.colostate.edu>.

Hi Weide,

Does this mean you have only 60 data points to cluster?
That may be part of why it runs so quickly.
The -k flag tells the program how many points to cluster around, so
k=3 means you are trying to group your data into 3 clusters.
As for the folder names: after every iteration, kmeans writes out the
cluster positions as they stand at the end of that iteration. Clustering
stops when you hit the max number of iterations, or when the cluster
centers move no more than a predetermined distance.
Since you have clusters-1 and clusters-2 folders, it ran for only 2
iterations.
You set the max iterations to 1000 (-x 1000), so it's definitely
stopping because your cluster centers are no longer moving more than
the convergence threshold (-cd 1.0).
You may want to try a higher k, maybe 10, and see how many iterations
it goes through. Another thing to look at is how the initial clusters
are chosen. By default, the starting clusters are chosen randomly.
Running Canopy clustering first may give you better initial clusters.
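
To make the two stopping rules concrete, here is a minimal Python sketch of a k-means loop with a max-iteration cap and a convergence threshold. It is not Mahout's implementation: the function name, the toy 2-D points, and the "no center moved farther than delta" test are assumptions chosen to mirror the -x and -cd options described above.

```python
import math

def kmeans(points, centers, max_iter, delta):
    """Toy k-means on dense points; returns (centers, iterations run)."""
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest current center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group
        # (an empty group keeps its old center).
        new_centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        # Assumed convergence rule: stop once no center moved farther than delta.
        moved = max(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= delta:
            return centers, iteration
    return centers, max_iter

# Hypothetical data: two obvious groups, starting centers picked from the points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
start = [(0.0, 0.0), (10.0, 10.0)]
centers, iterations = kmeans(points, start, max_iter=1000, delta=1.0)
print(iterations, centers)
```

In this toy run each center moves only 0.5 in the first pass, which is already within a threshold of 1.0, so the loop stops long before the cap of 1000. A threshold that is large relative to the scale of the vectors ends the run after very few iterations, whatever the value of k.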

Hope this helps,

-Kate

On Mon, Oct 3, 2011 at 11:38 AM, Walter Chang <we...@gmail.com> wrote:
> Hi Kate,
>
> I have 60 rows data that has text description. I just generated tf-idf using
> my analyzer. and tf-idf vector is passed into the clustering algorithms to
> do the clustering. I use k=3, it generates clusters-1, clusters-2 folder.
> What does each folder mean ?  How does the clustering process generates
> those ?
>
> Weide
>

Re: question about clustering

Posted by Walter Chang <we...@gmail.com>.

Hi Kate,

I have 60 rows of data, each with a text description. I generated tf-idf
vectors using my own analyzer and passed them to the clustering algorithm.
With k=3, it generates clusters-1 and clusters-2 folders. What does each
folder mean? How does the clustering process generate them?

Weide

On Mon, Oct 3, 2011 at 8:04 AM, Kate Ericson <er...@cs.colostate.edu>wrote:

> Hi Welde,
>
> As a disclaimer, I only know enough to try to help you figure out your
> first problem.
> First of all, can you tell us about the dataset you are using?
> How many points are you clustering?
>
> As a guess without knowing either of these things, part of the reason
> why your clusters look the same is that you're only clustering around
> 3 points.  You're only running for 2 iterations, so it looks like its
> just not moving your cluster centers around at all.  Can you try again
> with a larger k?
> This may let it run for more iterations so you should be able to see
> more changes in results.
>
> Good luck!
>
> -Kate
>

Re: question about clustering

Posted by Kate Ericson <er...@cs.colostate.edu>.

Hi Weide,

As a disclaimer, I only know enough to try to help with your first
problem.
First of all, can you tell us about the dataset you are using?
How many points are you clustering?

As a guess without knowing either of these things, part of the reason
your clusters look the same is that you're only clustering around
3 points. You're only running for 2 iterations, so it looks like it's
barely moving your cluster centers at all. Can you try again
with a larger k?
That may let it run for more iterations, so you should see
more change in the results.

Good luck!

-Kate

On Sun, Oct 2, 2011 at 9:52 PM, Walter Chang <we...@gmail.com> wrote:
> Hi ,
>
> i have used mahout to produce kmeans  clustering for my tf-idf result. I use
> the mahout command line to produce the clusters and it seems it successfully
> completes.
>
> $MAHOUT_HOME/bin/mahout kmeans  -i ./tfidf-vectors -c ./initialclusters -o
> ./kmeans-clusters  -cd 1.0 -k 3 -x 1000
>
> It seems there are two clusters directory generated.(cluster-1 and
> cluster-2)  , when i use clusterdump on each of them, it seems to me that
> the clustered top terms are the same. Any idea why ?
>
> Also, how can i see which documents have been assigned to each cluster.
> Right now, i can see the number of documents assigned but not the complete
> list.
>
> Most importantly, for production purposes, i assume it makes sense for
> kmeans always runs on hadoop to generate the clustering file. But how do i
> consume these during serving ? Ideally, serving should have the doc id or
> query passed as a query, and the server should return the top document
> ranked by the score within the same cluster back. How do I do it in code ?
> Any good examples ?
>
> Thanks a lot,
>
> Weide
>

Re: question about clustering

Posted by Grant Ingersoll <gs...@apache.org>.

Have a look at the ClusterDumper (running bin/mahout clusterdump --help should give you an idea of how to run it).

The main output contains the centroids. The clustered points dir contains all of the original points, along with the cluster each one belongs to and its distance. The ClusterDumper can marry the two.
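
As a rough illustration of reading the clustered points, the Python sketch below parses the text printed by the sequence file dump quoted in this thread and groups the points by cluster. It rests on assumptions taken from the answer above rather than from Mahout's source: the key is read as the cluster id, wt as the point's weight, distance as its distance to the centroid, and vec as sparse index:weight pairs. The regular expression and function names are made up for this example, and each record is assumed to sit on a single line (the mail client wrapped them).

```python
import re
from collections import defaultdict

# One record of the dump, e.g.
# "Key: 0: Value: wt: 1.0distance: 4.76  vec: [0:1.000, 6:1.847]"
RECORD = re.compile(
    r"Key: (?P<cluster>\d+): Value: wt: (?P<wt>[\d.]+)\s*"
    r"distance: (?P<dist>[\d.eE+-]+)\s+vec: \[(?P<vec>[^\]]*)\]"
)

def parse_record(line):
    """Return (cluster_id, distance, sparse_vector) or None if the line does not match."""
    m = RECORD.search(line)
    if m is None:
        return None
    vec = {}
    for pair in m.group("vec").split(","):
        if pair.strip():
            index, weight = pair.strip().split(":")
            vec[int(index)] = float(weight)
    return int(m.group("cluster")), float(m.group("dist")), vec

def group_by_cluster(lines):
    """Map cluster id -> list of (distance, vector), closest point first."""
    members = defaultdict(list)
    for line in lines:
        record = parse_record(line)
        if record is not None:
            cluster, dist, vec = record
            members[cluster].append((dist, vec))
    for cluster in members:
        members[cluster].sort(key=lambda item: item[0])
    return dict(members)

# Three records copied from the dump in this thread, rejoined onto single lines.
dump = [
    "Key: 0: Value: wt: 1.0distance: 7.129458704934154  vec: [0:1.000, 10:2.253, 17:1.847]",
    "Key: 0: Value: wt: 1.0distance: 4.766995846807366  vec: [0:1.000, 6:1.847, 17:1.847]",
    "Key: 5: Value: wt: 1.0distance: 0.0  vec: [0:1.000, 3:1.847, 7:1.847, 9:1.847, 13:1.847]",
]
members = group_by_cluster(dump)
for cluster, items in sorted(members.items()):
    print(cluster, len(items), [round(d, 3) for d, _ in items])
```

Note that nothing in these records identifies the original document. Mapping a row back to a doc id would need the vectors to carry a name or id from the vectorization step, which this dump does not show.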

On Oct 8, 2011, at 12:28 PM, Walter Chang wrote:

> Hi Grant,
> 
> I added clustering flag for kmeans. Now i see the an output dir called
> ClusteredPoints. However, when i use sequential file dump, what does each
> column mean ? How do i associate with the original doc  ? The output is not
> straightforward to me.
> 
> Input Path: part-m-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.clustering.WeightedPropertyVectorWritable
> Key: 0: Value: wt: 1.0distance: 7.811960616283556  vec: [0:1.000, 11:1.847,
> 12:2.253, 14:1.847]
> Key: 0: Value: wt: 1.0distance: 10.856925385759745  vec: [0:1.000, 5:2.253,
> 8:2.253, 11:1.847, 14:1.847]
> Key: 0: Value: wt: 1.0distance: 10.174423474410343  vec: [0:1.000, 6:1.847,
> 15:2.253, 16:2.253]
> Key: 0: Value: wt: 1.0distance: 4.766995846807366  vec: [0:1.000, 6:1.847,
> 17:1.847]
> Key: 0: Value: wt: 1.0distance: 7.129458704934154  vec: [0:1.000, 10:2.253,
> 17:1.847]
> Key: 5: Value: wt: 1.0distance: 0.0  vec: [0:1.000, 3:1.847, 7:1.847,
> 9:1.847, 13:1.847]
> Key: 6: Value: wt: 1.0distance: 0.0  vec: [1:2.253, 3:1.847, 7:1.847,
> 9:1.847, 13:1.847]
> 
> Thanks,
> 
> Weide
> 
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

Re: question about clustering

Posted by Walter Chang <we...@gmail.com>.

Hi Grant,

I added the clustering flag for kmeans. Now I see an output dir called
ClusteredPoints. However, when I dump it as a sequence file, what does each
column mean? How do I associate a row with the original doc? The output is
not straightforward to me.

Input Path: part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.WeightedPropertyVectorWritable
Key: 0: Value: wt: 1.0distance: 7.811960616283556  vec: [0:1.000, 11:1.847, 12:2.253, 14:1.847]
Key: 0: Value: wt: 1.0distance: 10.856925385759745  vec: [0:1.000, 5:2.253, 8:2.253, 11:1.847, 14:1.847]
Key: 0: Value: wt: 1.0distance: 10.174423474410343  vec: [0:1.000, 6:1.847, 15:2.253, 16:2.253]
Key: 0: Value: wt: 1.0distance: 4.766995846807366  vec: [0:1.000, 6:1.847, 17:1.847]
Key: 0: Value: wt: 1.0distance: 7.129458704934154  vec: [0:1.000, 10:2.253, 17:1.847]
Key: 5: Value: wt: 1.0distance: 0.0  vec: [0:1.000, 3:1.847, 7:1.847, 9:1.847, 13:1.847]
Key: 6: Value: wt: 1.0distance: 0.0  vec: [1:2.253, 3:1.847, 7:1.847, 9:1.847, 13:1.847]

Thanks,

Weide


On Thu, Oct 6, 2011 at 11:54 AM, Grant Ingersoll <gs...@apache.org>wrote:

>
> On Oct 2, 2011, at 11:52 PM, Walter Chang wrote:
>
> > Hi ,
> >
> > i have used mahout to produce kmeans  clustering for my tf-idf result. I use
> > the mahout command line to produce the clusters and it seems it successfully
> > completes.
> >
> > $MAHOUT_HOME/bin/mahout kmeans  -i ./tfidf-vectors -c ./initialclusters -o
> > ./kmeans-clusters  -cd 1.0 -k 3 -x 1000
> >
> > It seems there are two clusters directory generated.(cluster-1 and
> > cluster-2)  , when i use clusterdump on each of them, it seems to me that
> > the clustered top terms are the same. Any idea why ?
>
> The top terms are exactly that, the top terms.  It is not all of the terms.
> My guess is that things don't change much between the two iterations.
>
> >
> > Also, how can i see which documents have been assigned to each cluster.
> > Right now, i can see the number of documents assigned but not the complete
> > list.
>
> Add the --clustering flag.  By default, K-Means just calculates the
> centroids.  If you want to know membership, the --clustering flag does that.
>
> >
> > Most importantly, for production purposes, i assume it makes sense for
> > kmeans always runs on hadoop to generate the clustering file. But how do i
> > consume these during serving ? Ideally, serving should have the doc id or
> > query passed as a query, and the server should return the top document
> > ranked by the score within the same cluster back. How do I do it in code ?
> > Any good examples ?
>
> Presumably, you have to load up the centroids and/or the results and see
> which cluster the new item belongs to.
>
> >
> > Thanks a lot,
> >
> > Weide
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>

Re: question about clustering

Posted by Grant Ingersoll <gs...@apache.org>.

On Oct 2, 2011, at 11:52 PM, Walter Chang wrote:

> Hi ,
> 
> i have used mahout to produce kmeans  clustering for my tf-idf result. I use
> the mahout command line to produce the clusters and it seems it successfully
> completes.
> 
> $MAHOUT_HOME/bin/mahout kmeans  -i ./tfidf-vectors -c ./initialclusters -o
> ./kmeans-clusters  -cd 1.0 -k 3 -x 1000
> 
> It seems there are two clusters directory generated.(cluster-1 and
> cluster-2)  , when i use clusterdump on each of them, it seems to me that
> the clustered top terms are the same. Any idea why ?

The top terms are exactly that, the top terms.  It is not all of the terms.  My guess is that things don't change much between the two iterations.
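
A small Python sketch of why the top terms can look identical across two iterations even when the centroid has moved. Everything here is hypothetical (the dictionary, the weights, and the helper name); it only assumes that "top terms" means the highest-weighted entries of a centroid.

```python
def top_terms(centroid, dictionary, n):
    """Return the n highest-weighted (term, weight) pairs of a sparse centroid."""
    ranked = sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)
    return [(dictionary[index], weight) for index, weight in ranked[:n]]

# Made-up term dictionary and centroid weights for two consecutive iterations.
dictionary = {0: "apple", 1: "banana", 2: "cherry", 3: "date"}
iteration_1 = {0: 2.1, 1: 1.9, 2: 0.4, 3: 0.1}
iteration_2 = {0: 2.0, 1: 1.8, 2: 0.2, 3: 0.3}

first = [term for term, _ in top_terms(iteration_1, dictionary, 2)]
second = [term for term, _ in top_terms(iteration_2, dictionary, 2)]
print(first, second, iteration_1 == iteration_2)
```

The two centroids differ, yet the two strongest terms are the same in both, so a listing that shows only the head of the ranking would not reveal the change.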

> 
> Also, how can i see which documents have been assigned to each cluster.
> Right now, i can see the number of documents assigned but not the complete
> list.

Add the --clustering flag.  By default, K-Means just calculates the centroids.  If you want to know membership, the --clustering flag does that.

> 
> Most importantly, for production purposes, i assume it makes sense for
> kmeans always runs on hadoop to generate the clustering file. But how do i
> consume these during serving ? Ideally, serving should have the doc id or
> query passed as a query, and the server should return the top document
> ranked by the score within the same cluster back. How do I do it in code ?
> Any good examples ?

Presumably, you have to load up the centroids and/or the results and see which cluster the new item belongs to.
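
One way to picture that at serving time is the Python sketch below: keep the centroids and the cluster memberships in memory, assign the incoming vector to its nearest centroid, and rank that cluster's documents by distance to the query. This is not a Mahout API. The data structures, function names, doc ids, and the choice of Euclidean distance are all assumptions for illustration, and in practice the distance should match the measure used when the clusters were built.

```python
import math

def distance(a, b):
    """Euclidean distance between two sparse vectors stored as {index: weight}."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def nearest_cluster(vector, centroids):
    """Return the id of the centroid closest to the vector."""
    return min(centroids, key=lambda cid: distance(vector, centroids[cid]))

def neighbours(query, centroids, members, limit):
    """Return (cluster id, doc ids from that cluster, closest to the query first)."""
    cid = nearest_cluster(query, centroids)
    ranked = sorted(members[cid], key=lambda doc: distance(query, doc[1]))
    return cid, [doc_id for doc_id, _ in ranked[:limit]]

# Hypothetical centroids and memberships, as they might look once loaded
# from the clustering output into plain dictionaries.
centroids = {0: {0: 1.0, 6: 1.5}, 5: {3: 1.8, 7: 1.8}}
members = {
    0: [("doc-b", {0: 1.0, 10: 2.253}), ("doc-a", {0: 1.0, 6: 1.847})],
    5: [("doc-c", {3: 1.847, 7: 1.847})],
}

query = {0: 1.0, 6: 1.0}  # tf-idf vector of the incoming document or query
cid, docs = neighbours(query, centroids, members, limit=2)
print(cid, docs)
```

For a lookup by doc id instead of by query text, the same structures work: find the document's vector in the membership lists and pass it in as the query.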



> 
> Thanks a lot,
> 
> Weide

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com