Posted to user@mahout.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2010/04/07 00:10:39 UTC
MAHOUT-236 Cluster Evaluation Tools?
Is anybody working on MAHOUT-236? To me it looks like the next logical
step beyond generalizing the cluster dumper: improving on its summaries.
Jeff Eastman wrote:
> Completing the ClusterDumper jira will allow for visual inspection of
> the Dirichlet models and extracting some useful information thereof;
> arguably not too useful with 1793-element vectors but this is also
> true of kmeans clusters with 1793-element center vectors. With no
> terminating conditions, selecting the particular iteration to inspect
> is also an issue unique to Dirichlet. MAHOUT-236 has been around for a
> while and, as Jake notes below, is really needed.
>
Re: MAHOUT-236 Cluster Evaluation Tools?
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
That's not what I get from the paper. Certainly, the cluster center is
the first representative point. But the paper talks about subsequently
iterating through the clustered points to find the farthest point from
the previously-selected representative points (RPs) and then adding that
as another representative point. After a few such iterations, a set of
RPs is developed for each cluster that defines the extreme points
observed within the cluster. This is especially useful for non-spherical
clusters, such as those returned by mean shift and Dirichlet asymmetric
models. Then, in the final stage, the RPs in each cluster are compared
and the closest RPs are used to compute CDbw. The final calculation can
be done in memory since the number of clusters and RPs is well-bounded
by then.
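
[Editor's sketch, not part of the original message.] The selection loop described above can be illustrated in memory as follows. This is a minimal sketch under stated assumptions: "farthest from the previously-selected RPs" is read as "largest distance to the nearest already-selected RP", the distance is Euclidean, and the function name and parameters are hypothetical, not Mahout APIs.

```python
import math

def select_representative_points(center, points, num_rps, distance=math.dist):
    """Greedy farthest-point selection, seeded with the cluster center.

    Assumption: a candidate's distance to the RP set is its distance to
    the nearest RP already selected.
    """
    rps = [tuple(center)]
    candidates = [tuple(p) for p in points]
    while len(rps) < num_rps and candidates:
        # Pick the candidate whose nearest selected RP is farthest away.
        farthest = max(candidates,
                       key=lambda p: min(distance(p, r) for r in rps))
        candidates.remove(farthest)
        rps.append(farthest)
    return rps

# Usage: an elongated (non-spherical) cluster along the x axis.
cluster_points = [(0, 0), (1, 0), (10, 0), (0, 5), (-10, 0)]
rps = select_representative_points((0, 0), cluster_points, num_rps=3)
```

With these inputs the selected points are the center followed by the two ends of the elongated cluster, which is the "extreme points" behavior the message describes.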
I get that each RP iteration takes place over all of the clustered
points and would require a new MR job for each iteration. I imagine
initializing the mappers and reducers with the set of clusters and their
RPs. Then each mapper processes a subset of all clustered points,
finally outputting the farthest it has seen for each cluster. The
reducer gets this information and selects the RP that is absolutely the
most distant, outputting it with the clusters+RPs for the next
iteration. This is a lot like the way Dirichlet works now, outputting
state to be used for the next iteration over the entire point set. We
would need to allow a DistanceMeasure to be specified for this phase.
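
[Editor's sketch, not part of the original message.] One iteration of the job outlined above might look roughly like this when simulated in plain Python. The function names, the dict-based state, and the use of Euclidean distance are all assumptions made for illustration; nothing here is Mahout or Hadoop code.

```python
import math

def map_farthest(clustered_points, rps_by_cluster, distance=math.dist):
    """Mapper: over one split, keep the farthest point seen per cluster."""
    farthest = {}  # cluster id -> (distance to nearest RP, point)
    for cluster_id, point in clustered_points:
        d = min(distance(point, rp) for rp in rps_by_cluster[cluster_id])
        if cluster_id not in farthest or d > farthest[cluster_id][0]:
            farthest[cluster_id] = (d, point)
    return farthest

def reduce_farthest(mapper_outputs, rps_by_cluster):
    """Reducer: pick the most distant candidate overall for each cluster
    and emit clusters+RPs as the state for the next iteration."""
    best = {}
    for output in mapper_outputs:
        for cluster_id, (d, point) in output.items():
            if cluster_id not in best or d > best[cluster_id][0]:
                best[cluster_id] = (d, point)
    next_state = {cid: list(rps) for cid, rps in rps_by_cluster.items()}
    for cluster_id, (d, point) in best.items():
        if d > 0:  # skip points that coincide with an existing RP
            next_state[cluster_id].append(point)
    return next_state

# Usage: two splits of clustered points, each cluster seeded with its center.
state = {"c1": [(0, 0)], "c2": [(5, 5)]}
split_a = [("c1", (1, 0)), ("c1", (3, 0)), ("c2", (5, 6))]
split_b = [("c1", (0, -4)), ("c2", (9, 5))]
outputs = [map_farthest(split_a, state), map_farthest(split_b, state)]
state = reduce_farthest(outputs, state)
```

Passing `distance` as a parameter mirrors the point about letting a DistanceMeasure be specified for this phase.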
Currently, only canopy and kMeans actually produce their clustered
points. Dirichlet points could be clustered by assigning each point to
the model with the largest pdf (or even to more than one based upon a
user-settable pdf threshold). Fuzzy kMeans would need to make similar
assignments. MeanShift point ids are currently retained in its cluster
state but there is no step to build clustered points like canopy and
kMeans do. Some work would be needed here too, as we need a uniform
representation for clustered points.
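
[Editor's sketch, not part of the original message.] The assignment rule suggested above for Dirichlet models (largest pdf, or every model above a user-settable threshold) can be sketched like this. The 1-D normal models and all names are hypothetical stand-ins chosen only to keep the example short.

```python
import math

def normal_pdf(mean, sd):
    """Return the pdf of a 1-D normal distribution (illustrative model)."""
    def pdf(x):
        z = (x - mean) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))
    return pdf

def assign_points(points, models, threshold=None):
    """Assign each point to the model with the largest pdf, or, when a
    threshold is given, to every model whose pdf reaches the threshold."""
    assignments = {model_id: [] for model_id in models}
    for point in points:
        densities = {mid: pdf(point) for mid, pdf in models.items()}
        if threshold is None:
            best = max(densities, key=densities.get)
            assignments[best].append(point)
        else:
            for mid, density in densities.items():
                if density >= threshold:
                    assignments[mid].append(point)
    return assignments

# Usage: two models; the point 1.9 lies between them.
models = {"a": normal_pdf(0.0, 1.0), "b": normal_pdf(4.0, 1.0)}
hard = assign_points([0.5, 3.5, 1.9], models)
soft = assign_points([0.5, 3.5, 1.9], models, threshold=0.04)
```

With the threshold, a point between two models can land in both, which is the "more than one" case mentioned in the message.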
Finally, I'd like to review the output file naming conventions across
all the clustering algorithms and converge on a single nomenclature that
is common across all jobs.
Robin Anil wrote:
> The cluster center itself is a representative point. One pass over the data
> will get us points that are close enough. Or, to be exhaustive, we could
> just add it in the KMeans mapper and update a counter, maybe?
>
> Robin
Re: MAHOUT-236 Cluster Evaluation Tools?
Posted by Robin Anil <ro...@gmail.com>.
The cluster center itself is a representative point. One pass over the data
will get us points that are close enough. Or, to be exhaustive, we could
just add it in the KMeans mapper and update a counter, maybe?
Robin
On Fri, Apr 9, 2010 at 4:13 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
> Looking at the paper it doesn't seem to require MR for the final CDbw
> calculation, right? For each cluster we only need to compare one of its
> points with one point in each other cluster. With small numbers of
> representative points per cluster that can be done easily in memory. I'd
> love to see the code you have for computing representative points.
>
> Jeff
Re: MAHOUT-236 Cluster Evaluation Tools?
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Looking at the paper it doesn't seem to require MR for the final CDbw
calculation, right? For each cluster we only need to compare one of its
points with one point in each other cluster. With small numbers of
representative points per cluster that can be done easily in memory. I'd
love to see the code you have for computing representative points.
Jeff
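
[Editor's sketch, not part of the original message.] The in-memory comparison described above, pairing each cluster with every other cluster through their closest representative points, might look like this. It only finds the closest RP pairs; it does not compute CDbw itself, since the thread does not spell out the formula. Names and the Euclidean distance are assumptions.

```python
import math
from itertools import combinations

def closest_rp_pairs(rps_by_cluster, distance=math.dist):
    """For every pair of clusters, find the closest pair of RPs.

    Returns {(id_a, id_b): (distance, rp_in_a, rp_in_b)}.
    """
    pairs = {}
    for (id_a, rps_a), (id_b, rps_b) in combinations(
            sorted(rps_by_cluster.items()), 2):
        pairs[(id_a, id_b)] = min(
            ((distance(p, q), p, q) for p in rps_a for q in rps_b),
            key=lambda entry: entry[0],
        )
    return pairs

# Usage: two clusters with two representative points each.
rps = {"c1": [(0, 0), (2, 0)], "c2": [(5, 0), (9, 0)]}
pairs = closest_rp_pairs(rps)
```

The cost is quadratic in the number of clusters and RPs, which is small enough to stay in memory when both are well-bounded.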
Re: MAHOUT-236 Cluster Evaluation Tools?
Posted by Robin Anil <ro...@gmail.com>.
On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
> Hi Robin,
>
> Interesting paper. I'm beginning to see how to MR the representative point
> selection already. The rest will hopefully become clearer with more study.
> Lots of MR jobs are needed to:
> a) get the data into Vectors,
We have something for text, missing for other formats.
> b) iterate (e.g. kmeans) over the data to produce a set of clusters,
Done
> c) cluster the data,
Done
> d) iterate over the clustered data to derive representative points for each
> cluster, and finally
Done ;)
> e) produce the CDbw.
TODO
> And, of course all of this is again iterated with different values for the
> clustering algorithm's parameters. Should keep the lights on at PG&E
> producing power for the server farms.
Re: MAHOUT-236 Cluster Evaluation Tools?
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Robin,
Interesting paper. I'm beginning to see how to MR the representative
point selection already. The rest will hopefully become clearer with
more study. Lots of MR jobs are needed to: a) get the data into Vectors,
b) iterate (e.g. kmeans) over the data to produce a set of clusters, c)
cluster the data, d) iterate over the clustered data to derive
representative points for each cluster, and finally e) produce the CDbw.
And, of course all of this is again iterated with different values for
the clustering algorithm's parameters. Should keep the lights on at PG&E
producing power for the server farms.
Re: MAHOUT-236 Cluster Evaluation Tools?
Posted by Robin Anil <ro...@gmail.com>.
Hi Jeff,
This is a good paper with a simple measure of cluster quality based on
intra-cluster density and inter-cluster separation. It's pretty easy to
compute. We need to make it a map/reduce job:
http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
Robin
On Wed, Apr 7, 2010 at 7:03 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
> Hi Robin,
>
> Great! I've got the refactoring changes for consolidating all the various
> cluster types under a Cluster interface (formerly Printable but now with id,
> numPoints and a center added). Dirichlet models still don't yet have
> meaningful ids implemented but they all do (so far anyway) have a notion of
> "numPoints" and a "center". I'm working on tests tomorrow to make sure the
> ClusterDumper actually works with Dirichlet clusters then I will commit
> that. Wednesday or Thursday most likely.
>
> BTW, I changed my mind about foisting off the old Printable interface on
> Vectors (but am still open to the idea if somebody actually working in math
> thinks it is worth doing). All the new Clusters use the vector formatting
> done in ClusterBase.
>
> What I'd really like is feedback from ClusterDumper users on what is
> working and what is needed to address MAHOUT-236. That includes you, right?
>
> Jeff
>
> PS: Ted, you expressed some doubts about the value of consolidating
> Dirichlet clusters with the others. So far it seems to be a reasonable fit
> but I'm doing the engineering on a tiny subset of simple models without
> enough theoretical insight to see any pitfalls ahead. Is there a
> "DistanceMeasure-like" discussion that might provide a firmer underpinning
> for this work?
Re: MAHOUT-236 Cluster Evaluation Tools?
Posted by Ted Dunning <te...@gmail.com>.
If it fits, then it is great to do.
On Tue, Apr 6, 2010 at 6:33 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
> PS: Ted, you expressed some doubts about the value of consolidating
> Dirichlet clusters with the others. So far it seems to be a reasonable fit
> but I'm doing the engineering on a tiny subset of simple models without
> enough theoretical insight to see any pitfalls ahead. Is there a
> "DistanceMeasure-like" discussion that might provide a firmer underpinning
> for this work?
>
Re: MAHOUT-236 Cluster Evaluation Tools?
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Robin,
Great! I've got the refactoring changes for consolidating all the
various cluster types under a Cluster interface (formerly Printable but
now with id, numPoints and a center added). Dirichlet models still don't
yet have meaningful ids implemented but they all do (so far anyway) have
a notion of "numPoints" and a "center". I'm working on tests tomorrow to
make sure the ClusterDumper actually works with Dirichlet clusters then
I will commit that. Wednesday or Thursday most likely.
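
[Editor's sketch, not part of the original message.] The idea of a common interface exposing an id, a point count and a center can be illustrated as below. This is not Mahout's Cluster interface or its signatures; the Python names, the two example classes and the summary format are hypothetical and only show how a dumper could treat different cluster types uniformly.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence

class Cluster(Protocol):
    """Hypothetical common view of a cluster: id, point count, center."""
    id: int
    num_points: int
    center: Sequence[float]

@dataclass
class KMeansCluster:
    id: int
    num_points: int
    center: Sequence[float]

@dataclass
class DirichletModel:
    id: int  # placeholder value: the message notes ids are not meaningful yet
    num_points: int
    center: Sequence[float]

def summarize(cluster: Cluster) -> str:
    """Format any cluster type through the shared attributes only."""
    center = ", ".join(f"{x:.2f}" for x in cluster.center)
    return f"C{cluster.id} n={cluster.num_points} c=[{center}]"

# Usage: the same dumper code handles both kinds of cluster.
lines = [summarize(KMeansCluster(3, 42, [1.0, 2.5])),
         summarize(DirichletModel(0, 7, [0.25]))]
```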
BTW, I changed my mind about foisting off the old Printable interface on
Vectors (but am still open to the idea if somebody actually working in
math thinks it is worth doing). All the new Clusters use the vector
formatting done in ClusterBase.
What I'd really like is feedback from ClusterDumper users on what is
working and what is needed to address MAHOUT-236. That includes you, right?
Jeff
PS: Ted, you expressed some doubts about the value of consolidating
Dirichlet clusters with the others. So far it seems to be a reasonable
fit but I'm doing the engineering on a tiny subset of simple models
without enough theoretical insight to see any pitfalls ahead. Is there a
"DistanceMeasure-like" discussion that might provide a firmer
underpinning for this work?
Re: MAHOUT-236 Cluster Evaluation Tools?
Posted by Robin Anil <ro...@gmail.com>.
No one yet. I am willing to help in case you need an extra pair of hands on
this one.
Robin
On Wed, Apr 7, 2010 at 3:40 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
> Is anybody working on MAHOUT-236? To me it looks like the next logical step
> beyond generalizing the cluster dumper: improving on its summaries