Posted to user@mahout.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2010/04/07 00:10:39 UTC

MAHOUT-236 Cluster Evaluation Tools?

Is anybody working on MAHOUT-236? To me it looks like the next logical 
step beyond generalizing the cluster dumper: improving on its summaries.

Jeff Eastman wrote:
> Completing the ClusterDumper jira will allow for visual inspection of 
> the Dirichlet models and extracting some useful information thereof; 
> arguably not too useful with 1793-element vectors but this is also 
> true of kmeans clusters with 1793-element center vectors. With no 
> terminating conditions, selecting the particular iteration to inspect 
> is also an issue unique to Dirichlet. MAHOUT-236 has been around for a 
> while and, as Jake notes below, is really needed.
>

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

That's not what I get from the paper. Certainly, the cluster center is 
the first representative point. But the paper talks about subsequently 
iterating through the clustered points to find the farthest point from 
the previously-selected representative points (RPs) and then adding that 
as another representative point. After a few such iterations, a set of 
RPs is developed for each cluster that defines the extreme points 
observed within the cluster. This is especially useful for non-spherical 
clusters, such as those returned by mean shift and Dirichlet asymmetric 
models. Then, in the final stage, the RPs in each cluster are compared 
and the closest RPs are used to compute CDbw. The final calculation can 
be done in memory since the number of clusters and RPs is well-bounded 
by then.
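
A minimal sketch of the representative-point selection described above, in Python for brevity rather than as Mahout code. It assumes "farthest from the previously-selected RPs" means the point whose distance to its nearest already-chosen RP is largest; the paper should be checked for the exact rule. The function and variable names are illustrative only.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def representative_points(center, points, num_rps, distance=euclidean):
    """Greedy farthest-point selection, seeded with the cluster center."""
    rps = [tuple(center)]
    candidates = [tuple(p) for p in points]
    while len(rps) < num_rps and candidates:
        # Score each candidate by its distance to the nearest RP chosen so far.
        farthest = max(candidates, key=lambda p: min(distance(p, r) for r in rps))
        if min(distance(farthest, r) for r in rps) == 0.0:
            break  # every remaining point coincides with an existing RP
        rps.append(farthest)
        candidates.remove(farthest)
    return rps


cluster_points = [(0.0, 0.0), (4.0, 0.0), (-4.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
rps = representative_points((0.0, 0.0), cluster_points, num_rps=3)
print(rps)
```

On this elongated toy cluster the selection picks the two extremes along the long axis after the center, which is the behaviour that makes RPs useful for non-spherical clusters.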

I get that each RP iteration takes place over all of the clustered 
points and would require a new MR job for each iteration. I imagine 
initializing the mappers and reducers with the set of clusters and their 
RPs. Then each mapper processes a subset of all clustered points, 
finally outputting the farthest it has seen for each cluster. The 
reducer gets this information and selects the RP that is absolutely the 
most distant, outputting it with the clusters+RPs for the next 
iteration. This is a lot like the way Dirichlet works now, outputting 
state to be used for the next iteration over the entire point set. We 
would need to allow a DistanceMeasure to be specified for this phase.
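
One iteration of that job could be simulated in memory roughly as follows. This is a sketch under assumptions, not Hadoop or Mahout code: the splits, the dict-based cluster state, and the names map_split and reduce_candidates are all hypothetical, and the distance function stands in for a configurable DistanceMeasure.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_split(split, cluster_rps, distance=euclidean):
    """Mapper: for each cluster, keep the farthest point seen in this split."""
    best = {}
    for cluster_id, point in split:
        # Distance from the point to its nearest existing RP in its cluster.
        d = min(distance(point, r) for r in cluster_rps[cluster_id])
        if cluster_id not in best or d > best[cluster_id][0]:
            best[cluster_id] = (d, point)
    return best


def reduce_candidates(mapper_outputs, cluster_rps):
    """Reducer: pick the most distant candidate per cluster and append it."""
    winners = {}
    for output in mapper_outputs:
        for cluster_id, (d, point) in output.items():
            if cluster_id not in winners or d > winners[cluster_id][0]:
                winners[cluster_id] = (d, point)
    # Emit clusters + RPs as the state for the next iteration.
    next_state = {cid: list(rps) for cid, rps in cluster_rps.items()}
    for cluster_id, (d, point) in winners.items():
        if d > 0.0:  # skip points that duplicate an existing RP
            next_state[cluster_id].append(point)
    return next_state


state = {"c1": [(0.0, 0.0)], "c2": [(10.0, 10.0)]}
splits = [
    [("c1", (1.0, 0.0)), ("c2", (10.0, 12.0))],
    [("c1", (0.0, 3.0)), ("c2", (11.0, 10.0))],
]
state = reduce_candidates([map_split(s, state) for s in splits], state)
print(state)
```

Calling the two functions again with the returned state would correspond to launching the next MR job.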

Currently, only canopy and kMeans actually produce their clustered 
points. Dirichlet points could be clustered by assigning each point to 
the model with the largest pdf (or even to more than one based upon a 
user-settable pdf threshold). Fuzzy kMeans would need to make similar 
assignments. MeanShift point ids are currently retained in its cluster 
state but there is no step to build clustered points like canopy and 
kMeans do. Some work would be needed here too, as we need a uniform 
representation for clustered points.
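
The Dirichlet assignment rule could be sketched like this. It assumes the per-model pdf values for each point have already been computed and are handed in as rows of numbers; assign_points and its threshold parameter are illustrative names, not existing Mahout options.

```python
def assign_points(pdfs, threshold=None):
    """Map each point to model indices.

    pdfs holds one row per point and one pdf value per model.
    With no threshold, each point goes to the single model with the largest pdf.
    With a threshold, each point goes to every model whose pdf reaches it.
    """
    assignments = []
    for row in pdfs:
        if threshold is None:
            assignments.append([max(range(len(row)), key=row.__getitem__)])
        else:
            assignments.append([k for k, p in enumerate(row) if p >= threshold])
    return assignments


pdfs = [
    [0.70, 0.20, 0.10],
    [0.40, 0.45, 0.15],
    [0.05, 0.05, 0.90],
]
hard = assign_points(pdfs)
soft = assign_points(pdfs, threshold=0.3)
print(hard, soft)
```

Note that with a threshold a point can end up in no cluster at all if none of its pdf values reach it, so a real implementation would need to decide how to treat such points.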

Finally, I'd like to review the output file naming conventions across 
all the clustering algorithms and converge on a single nomenclature that 
is common across all jobs.

Robin Anil wrote:
> The cluster center itself is a representative point. One pass over the data
> will get us the points that are close enough. Or, to be exhaustive, we could
> just add it in the Kmeans Mapper and maybe update a counter?
>
> Robin
>
>

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Robin Anil <ro...@gmail.com>.

The cluster center itself is a representative point. One pass over the data
will get us the points that are close enough. Or, to be exhaustive, we could
just add it in the Kmeans Mapper and maybe update a counter?

Robin

On Fri, Apr 9, 2010 at 4:13 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

> Looking at the paper it doesn't seem to require MR for the final CDbw
> calculation, right? For each cluster we only need to compare one of its
> points with one point in each other cluster. With small numbers of
> representative points per cluster that can be done easily in memory. I'd
> love to see the code you have for computing representative points.
>
> Jeff
>
>
>
> Robin Anil wrote:
>
>> On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>>
>>> Hi Robin,
>>>
>>> Interesting paper. I'm beginning to see how to MR the representative
>>> point selection already. The rest will hopefully become clearer with
>>> more study. Lots of MR jobs are needed to:
>>>
>>> a) get the data into Vectors,
>>
>> We have something for text, missing for other formats
>>
>>> b) iterate (e.g. kmeans) over the data to produce a set of clusters,
>>
>> Done
>>
>>> c) cluster the data,
>>
>> Done
>>
>>> d) iterate over the clustered data to derive representative points for
>>> each cluster, and finally
>>
>> Done ;)
>>
>>> e) produce the CDbw.
>>
>> TODO
>>
>>> And, of course all of this is again iterated with different values for
>>> the clustering algorithm's parameters. Should keep the lights on at PG&E
>>> producing power for the server farms.
>>>
>>> Robin Anil wrote:
>>>
>>>> Hi Jeff,
>>>> This is a good paper with a simple measure of cluster quality based on
>>>> intra-cluster density and inter-cluster separation. It's pretty easy to
>>>> compute. Need to make it a map/reduce job.
>>>>
>>>> http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
>>>> Robin

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Looking at the paper it doesn't seem to require MR for the final CDbw 
calculation, right? For each cluster we only need to compare one of its 
points with one point in each other cluster. With small numbers of 
representative points per cluster that can be done easily in memory. I'd 
love to see the code you have for computing representative points.
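
The in-memory part could look roughly like the sketch below. It only finds, for each pair of clusters, the closest pair of representative points; it stops short of the CDbw formula itself, which should be taken from the paper. The dict layout and the name closest_rp_pairs are assumptions made for illustration.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_rp_pairs(cluster_rps, distance=euclidean):
    """For every pair of clusters, return (distance, rp_a, rp_b) of the closest RPs."""
    pairs = {}
    for (id_a, rps_a), (id_b, rps_b) in combinations(sorted(cluster_rps.items()), 2):
        pairs[(id_a, id_b)] = min(
            ((distance(a, b), a, b) for a in rps_a for b in rps_b),
            key=lambda t: t[0],
        )
    return pairs


cluster_rps = {
    "c1": [(0.0, 0.0), (2.0, 0.0)],
    "c2": [(5.0, 0.0), (9.0, 0.0)],
    "c3": [(0.0, 4.0), (0.0, 7.0)],
}
pairs = closest_rp_pairs(cluster_rps)
print(pairs)
```

With k clusters and r RPs per cluster this is on the order of k^2 * r^2 distance computations, which stays small as long as both numbers are modest.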

Jeff


Robin Anil wrote:
> On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>
>> Hi Robin,
>>
>> Interesting paper. I'm beginning to see how to MR the representative point
>> selection already. The rest will hopefully become clearer with more study.
>> Lots of MR jobs are needed to:
>>
>> a) get the data into Vectors,
>
> We have something for text, missing for other formats
>
>> b) iterate (e.g. kmeans) over the data to produce a set of clusters,
>
> Done
>
>> c) cluster the data,
>
> Done
>
>> d) iterate over the clustered data to derive representative points for each
>> cluster, and finally
>
> Done ;)
>
>> e) produce the CDbw.
>
> TODO
>
>> And, of course all of this is again iterated with different values for the
>> clustering algorithm's parameters. Should keep the lights on at PG&E
>> producing power for the server farms.
>>
>> Robin Anil wrote:
>>
>>> Hi Jeff,
>>> This is a good paper with a simple measure of cluster quality based on
>>> intra-cluster density and inter-cluster separation. It's pretty easy to
>>> compute. Need to make it a map/reduce job.
>>>
>>> http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
>>> Robin

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Robin Anil <ro...@gmail.com>.

On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

> Hi Robin,
>
> Interesting paper. I'm beginning to see how to MR the representative point
> selection already. The rest will hopefully become clearer with more study.
> Lots of MR jobs are needed to:
>
> a) get the data into Vectors,

We have something for text, missing for other formats

> b) iterate (e.g. kmeans) over the data to produce a set of clusters,

Done

> c) cluster the data,

Done

> d) iterate over the clustered data to derive representative points for each
> cluster, and finally

Done ;)

> e) produce the CDbw.

TODO

> And, of course all of this is again iterated with different values for the
> clustering algorithm's parameters. Should keep the lights on at PG&E
> producing power for the server farms.
>
> Robin Anil wrote:
>
>> Hi Jeff,
>> This is a good paper with a simple measure of cluster quality based on
>> intra-cluster density and inter-cluster separation. It's pretty easy to
>> compute. Need to make it a map/reduce job.
>>
>> http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
>> Robin

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Hi Robin,

Interesting paper. I'm beginning to see how to MR the representative 
point selection already. The rest will hopefully become clearer with 
more study. Lots of MR jobs are needed to: a) get the data into Vectors, 
b) iterate (e.g. kmeans) over the data to produce a set of clusters, c) 
cluster the data, d) iterate over the clustered data to derive 
representative points for each cluster, and finally e) produce the CDbw. 
And, of course all of this is again iterated with different values for 
the clustering algorithm's parameters. Should keep the lights on at PG&E 
producing power for the server farms.

Robin Anil wrote:
> Hi Jeff,
> This is a good paper with a simple measure of cluster quality based on
> intra-cluster density and inter-cluster separation. It's pretty easy to
> compute. Need to make it a map/reduce job.
> http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
> Robin
>
>
>

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Robin Anil <ro...@gmail.com>.

Hi Jeff,
This is a good paper with a simple measure of cluster quality based on
intra-cluster density and inter-cluster separation. It's pretty easy to
compute. Need to make it a map/reduce job.
http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
Robin


On Wed, Apr 7, 2010 at 7:03 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

> Hi Robin,
>
> Great! I've got the refactoring changes for consolidating all the various
> cluster types under a Cluster interface (formerly Printable but now with id,
> numPoints and a center added). Dirichlet models still don't yet have
> meaningful ids implemented but they all do (so far anyway) have a notion of
> "numPoints" and a "center". I'm working on tests tomorrow to make sure the
> ClusterDumper actually works with Dirichlet clusters then I will commit
> that. Wednesday or Thursday most likely.
>
> BTW, I changed my mind about foisting off the old Printable interface on
> Vectors (but am still open to the idea if somebody actually working in math
> thinks it is worth doing). All the new Clusters use the vector formatting
> done in ClusterBase.
>
> What I'd really like is feedback from ClusterDumper users on what is
> working and what is needed to address MAHOUT-236. That includes you, right?
>
> Jeff
>
> PS: Ted, you expressed some doubts about the value of consolidating
> Dirichlet clusters with the others. So far it seems to be a reasonable fit
> but I'm doing the engineering on a tiny subset of simple models without
> enough theoretical insight to see any pitfalls ahead. Is there a
> "DistanceMeasure-like" discussion that might provide a firmer underpinning
> for this work?
>
>
>
>
> Robin Anil wrote:
>
>> No one yet. I am willing to help in case you need an extra pair of hands
>> on this one.
>>
>> Robin
>>
>>
>

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Ted Dunning <te...@gmail.com>.

If it fits, then it is great to do.

On Tue, Apr 6, 2010 at 6:33 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

> PS: Ted, you expressed some doubts about the value of consolidating
> Dirichlet clusters with the others. So far it seems to be a reasonable fit
> but I'm doing the engineering on a tiny subset of simple models without
> enough theoretical insight to see any pitfalls ahead. Is there a
> "DistanceMeasure-like" discussion that might provide a firmer underpinning
> for this work?
>

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Hi Robin,

Great! I've got the refactoring changes for consolidating all the 
various cluster types under a Cluster interface (formerly Printable but 
now with id, numPoints and a center added). Dirichlet models still don't 
yet have meaningful ids implemented but they all do (so far anyway) have 
a notion of "numPoints" and a "center". I'm working on tests tomorrow to 
make sure the ClusterDumper actually works with Dirichlet clusters then 
I will commit that. Wednesday or Thursday most likely.
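
As a rough illustration of the idea, and not the actual Mahout Java interface, a shared abstraction exposing an id, numPoints and a center might look like the following Python sketch. All class and method names here are hypothetical; only the three properties come from the description above.

```python
from abc import ABC, abstractmethod


class Cluster(ABC):
    """Minimal common view of a cluster: an id, a point count and a center."""

    @abstractmethod
    def cluster_id(self):
        ...

    @abstractmethod
    def num_points(self):
        ...

    @abstractmethod
    def center(self):
        ...


class SimpleCluster(Cluster):
    """Toy implementation standing in for a kmeans, canopy or Dirichlet cluster."""

    def __init__(self, cid, n, center):
        self._cid, self._n, self._center = cid, n, list(center)

    def cluster_id(self):
        return self._cid

    def num_points(self):
        return self._n

    def center(self):
        return self._center


def summarize(cluster, max_terms=3):
    """Dumper-style one-line summary that works for any Cluster implementation."""
    center = cluster.center()
    shown = ", ".join(f"{v:.2f}" for v in center[:max_terms])
    suffix = ", ..." if len(center) > max_terms else ""
    return f"id={cluster.cluster_id()} n={cluster.num_points()} center=[{shown}{suffix}]"


line = summarize(SimpleCluster(7, 42, [0.5, 1.25, 3.0, 9.0]))
print(line)
```

The point of the common interface is that a dumper like summarize never needs to know which algorithm produced the cluster.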

BTW, I changed my mind about foisting off the old Printable interface on 
Vectors (but am still open to the idea if somebody actually working in 
math thinks it is worth doing). All the new Clusters use the vector 
formatting done in ClusterBase.

What I'd really like is feedback from ClusterDumper users on what is 
working and what is needed to address MAHOUT-236. That includes you, right?

Jeff

PS: Ted, you expressed some doubts about the value of consolidating 
Dirichlet clusters with the others. So far it seems to be a reasonable 
fit but I'm doing the engineering on a tiny subset of simple models 
without enough theoretical insight to see any pitfalls ahead. Is there a 
"DistanceMeasure-like" discussion that might provide a firmer 
underpinning for this work?

Robin Anil wrote:
> No one yet. I am willing to help in case you need an extra pair of hands on
> this one.
>
> Robin
>

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Robin Anil <ro...@gmail.com>.

No one yet. I am willing to help in case you need an extra pair of hands on
this one.

Robin


On Wed, Apr 7, 2010 at 3:40 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

> Is anybody working on MAHOUT-236? To me it looks like the next logical step
> beyond generalizing the cluster dumper: improving on its summaries.
>
> Jeff Eastman wrote:
>
>> Completing the ClusterDumper jira will allow for visual inspection of the
>> Dirichlet models and extracting some useful information thereof; arguably
>> not too useful with 1793-element vectors but this is also true of kmeans
>> clusters with 1793-element center vectors. With no terminating conditions,
>> selecting the particular iteration to inspect is also an issue unique to
>> Dirichlet. MAHOUT-236 has been around for a while and, as Jake notes below,
>> is really needed.
>>
>>
>