Posted to user@mahout.apache.org by Chris Harrington <ch...@heystaks.com> on 2013/02/25 13:27:16 UTC

Vector distance within a cluster

Hi all,

I want to find all the vectors within a cluster and then compute the distance between each of them and every other vector in that cluster, in the hope that this will give me a good idea of how similar the vectors within a cluster are, as well as identify outlier vectors.

So there are 2 things I want to ask.

1. Is this a sensible approach to evaluating the cluster quality?

2. Is clusteredPoints/part-m-00000 the correct file to get this info from?

Thanks,
Chris



Re: Vector distance within a cluster

Posted by Ted Dunning <te...@gmail.com>.
If you get near a good solution (which is what the streaming k-means tries
to make very likely) then the local convexity should not be spoiled by the
L_1 regularization.  This is behind the guaranteed convergence of SGD and
the reason that stepwise regression so often fails ... stepwise is
equivalent to L_0 regularization, which breaks convexity.  L_1 does nearly
the same amount of good while leaving convexity intact.

On Mon, Mar 4, 2013 at 2:44 PM, Sean Owen <sr...@gmail.com> wrote:

> So this is constraining the centroids themselves to be 0 in many
> dimensions... interesting. I assume this comes into play when you move
> centroids -- you're not minimizing squared distance anymore but that plus
> an additional penalty. I'd be interested to hear your results Chris if you
> try it -- versus simply using the L1 distance alone. I agree this does seem
> to make it all the more important to run it many times and pick the best
> clustering due to local minima.
>
>
> On Mon, Mar 4, 2013 at 7:18 PM, Ted Dunning <te...@gmail.com> wrote:
>
> > Always good to test, but I think that euclidean distance with L_1
> > regularization is probably more interesting.
> >
> > On Mon, Mar 4, 2013 at 12:00 PM, Chris Harrington <chris@heystaks.com
> > >wrote:
> >
> > > So if I'm understanding what you are saying is, simply put, that I
> should
> > > investigate the use L_1 as my distance measure during my measuring of
> > > vector distance within a cluster?
> > >
> > >
> > > On 1 Mar 2013, at 16:24, Ted Dunning wrote:
> > >
> > > > What Sean says is just right, except that I was (telegraphically)
> > getting
> > > > at a slightly different point with L_1:
> > > >
> > > > On Wed, Feb 27, 2013 at 7:23 AM, Chris Harrington <
> chris@heystaks.com
> > > >wrote:
> > > >
> > > >> Is L_1 regularization the same as manhattan distance?
> > > >>
> > > >
> > > > L_1 metric is manhattan distance, yes.
> > > >
> > > > L_1 regularization of k-means refers to something a little bit
> > different.
> > > >
> > > > The idea with regularization is that you add some sort of penalty to
> > the
> > > > function you are optimizing.  This penalty pushes the optimization
> > > toward a
> > > > solution that you would prefer on some other grounds than just the
> > > > optimization alone.  Regularization often helps in solving
> > > underdetermined
> > > > systems where there are an infinite number of solutions and we have
> to
> > > pick
> > > > a preferred solution.
> > > >
> > > > There isn't anything that says that you have to be optimizing the
> same
> > > kind
> > > > of function as the regularization.  Thus k-means, which is inherently
> > > > optimizing squared error can quite reasonably be regularized with L_1
> > > (sum
> > > > of the absolute value of the centroids' coefficients).
> > > >
> > > > I haven't tried this at all seriously yet.  L_1 regularization tends
> to
> > > > help drive toward sparsity, but it is normally used in convex
> problems
> > > > where we can guarantee a findable global optimum.  The k-means
> problem,
> > > > however, is not convex so adding the regularization may screw things
> up
> > > in
> > > > practice.  For text-like data, I have a strong intuition that the
> > > idealized
> > > > effect of L_1 should be very good, but the pragmatic effect may not
> be
> > so
> > > > useful.
> > >
> > >
> >
>

Re: Vector distance within a cluster

Posted by Sean Owen <sr...@gmail.com>.
So this is constraining the centroids themselves to be 0 in many
dimensions... interesting. I assume this comes into play when you move
centroids -- you're not minimizing squared distance anymore but that plus
an additional penalty. I'd be interested to hear your results Chris if you
try it -- versus simply using the L1 distance alone. I agree this does seem
to make it all the more important to run it many times and pick the best
clustering due to local minima.


On Mon, Mar 4, 2013 at 7:18 PM, Ted Dunning <te...@gmail.com> wrote:

> Always good to test, but I think that euclidean distance with L_1
> regularization is probably more interesting.
>
> On Mon, Mar 4, 2013 at 12:00 PM, Chris Harrington <chris@heystaks.com
> >wrote:
>
> > So if I'm understanding what you are saying is, simply put, that I should
> > investigate the use L_1 as my distance measure during my measuring of
> > vector distance within a cluster?
> >
> >
> > On 1 Mar 2013, at 16:24, Ted Dunning wrote:
> >
> > > What Sean says is just right, except that I was (telegraphically)
> getting
> > > at a slightly different point with L_1:
> > >
> > > On Wed, Feb 27, 2013 at 7:23 AM, Chris Harrington <chris@heystaks.com
> > >wrote:
> > >
> > >> Is L_1 regularization the same as manhattan distance?
> > >>
> > >
> > > L_1 metric is manhattan distance, yes.
> > >
> > > L_1 regularization of k-means refers to something a little bit
> different.
> > >
> > > The idea with regularization is that you add some sort of penalty to
> the
> > > function you are optimizing.  This penalty pushes the optimization
> > toward a
> > > solution that you would prefer on some other grounds than just the
> > > optimization alone.  Regularization often helps in solving
> > underdetermined
> > > systems where there are an infinite number of solutions and we have to
> > pick
> > > a preferred solution.
> > >
> > > There isn't anything that says that you have to be optimizing the same
> > kind
> > > of function as the regularization.  Thus k-means, which is inherently
> > > optimizing squared error can quite reasonably be regularized with L_1
> > (sum
> > > of the absolute value of the centroids' coefficients).
> > >
> > > I haven't tried this at all seriously yet.  L_1 regularization tends to
> > > help drive toward sparsity, but it is normally used in convex problems
> > > where we can guarantee a findable global optimum.  The k-means problem,
> > > however, is not convex so adding the regularization may screw things up
> > in
> > > practice.  For text-like data, I have a strong intuition that the
> > idealized
> > > effect of L_1 should be very good, but the pragmatic effect may not be
> so
> > > useful.
> >
> >
>

Re: Vector distance within a cluster

Posted by Ted Dunning <te...@gmail.com>.
Always good to test, but I think that Euclidean distance with L_1
regularization is probably more interesting.

On Mon, Mar 4, 2013 at 12:00 PM, Chris Harrington <ch...@heystaks.com>wrote:

> So if I'm understanding what you are saying is, simply put, that I should
> investigate the use L_1 as my distance measure during my measuring of
> vector distance within a cluster?
>
>
> On 1 Mar 2013, at 16:24, Ted Dunning wrote:
>
> > What Sean says is just right, except that I was (telegraphically) getting
> > at a slightly different point with L_1:
> >
> > On Wed, Feb 27, 2013 at 7:23 AM, Chris Harrington <chris@heystaks.com
> >wrote:
> >
> >> Is L_1 regularization the same as manhattan distance?
> >>
> >
> > L_1 metric is manhattan distance, yes.
> >
> > L_1 regularization of k-means refers to something a little bit different.
> >
> > The idea with regularization is that you add some sort of penalty to the
> > function you are optimizing.  This penalty pushes the optimization
> toward a
> > solution that you would prefer on some other grounds than just the
> > optimization alone.  Regularization often helps in solving
> underdetermined
> > systems where there are an infinite number of solutions and we have to
> pick
> > a preferred solution.
> >
> > There isn't anything that says that you have to be optimizing the same
> kind
> > of function as the regularization.  Thus k-means, which is inherently
> > optimizing squared error can quite reasonably be regularized with L_1
> (sum
> > of the absolute value of the centroids' coefficients).
> >
> > I haven't tried this at all seriously yet.  L_1 regularization tends to
> > help drive toward sparsity, but it is normally used in convex problems
> > where we can guarantee a findable global optimum.  The k-means problem,
> > however, is not convex so adding the regularization may screw things up
> in
> > practice.  For text-like data, I have a strong intuition that the
> idealized
> > effect of L_1 should be very good, but the pragmatic effect may not be so
> > useful.
>
>

Re: Vector distance within a cluster

Posted by Chris Harrington <ch...@heystaks.com>.
So if I'm understanding you correctly, simply put, I should investigate using L_1 as my distance measure when measuring vector distance within a cluster?


On 1 Mar 2013, at 16:24, Ted Dunning wrote:

> What Sean says is just right, except that I was (telegraphically) getting
> at a slightly different point with L_1:
> 
> On Wed, Feb 27, 2013 at 7:23 AM, Chris Harrington <ch...@heystaks.com>wrote:
> 
>> Is L_1 regularization the same as manhattan distance?
>> 
> 
> L_1 metric is manhattan distance, yes.
> 
> L_1 regularization of k-means refers to something a little bit different.
> 
> The idea with regularization is that you add some sort of penalty to the
> function you are optimizing.  This penalty pushes the optimization toward a
> solution that you would prefer on some other grounds than just the
> optimization alone.  Regularization often helps in solving underdetermined
> systems where there are an infinite number of solutions and we have to pick
> a preferred solution.
> 
> There isn't anything that says that you have to be optimizing the same kind
> of function as the regularization.  Thus k-means, which is inherently
> optimizing squared error can quite reasonably be regularized with L_1 (sum
> of the absolute value of the centroids' coefficients).
> 
> I haven't tried this at all seriously yet.  L_1 regularization tends to
> help drive toward sparsity, but it is normally used in convex problems
> where we can guarantee a findable global optimum.  The k-means problem,
> however, is not convex so adding the regularization may screw things up in
> practice.  For text-like data, I have a strong intuition that the idealized
> effect of L_1 should be very good, but the pragmatic effect may not be so
> useful.


Re: Vector distance within a cluster

Posted by Ted Dunning <te...@gmail.com>.
What Sean says is just right, except that I was (telegraphically) getting
at a slightly different point with L_1:

On Wed, Feb 27, 2013 at 7:23 AM, Chris Harrington <ch...@heystaks.com>wrote:

> Is L_1 regularization the same as manhattan distance?
>

L_1 metric is manhattan distance, yes.

L_1 regularization of k-means refers to something a little bit different.

The idea with regularization is that you add some sort of penalty to the
function you are optimizing.  This penalty pushes the optimization toward a
solution that you would prefer on some other grounds than just the
optimization alone.  Regularization often helps in solving underdetermined
systems where there are an infinite number of solutions and we have to pick
a preferred solution.

There isn't anything that says that you have to be optimizing the same kind
of function as the regularization.  Thus k-means, which is inherently
optimizing squared error, can quite reasonably be regularized with L_1 (the sum
of the absolute values of the centroids' coefficients).
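
For illustration, a rough sketch of what such an objective would look like, assuming Mahout's Vector and SquaredEuclideanDistanceMeasure (the class and method names are invented for the example, not anything in Mahout's k-means code): the usual within-cluster squared error plus lambda times the L_1 norm of each centroid, so the penalty pushes centroid coefficients toward zero.

import java.util.List;

import org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;

public class RegularizedKMeansObjective {
  // Sketch of an L_1-regularized k-means objective: squared error of each point
  // to its assigned centroid, plus lambda * sum of |centroid coefficient| over
  // all centroids.  assignments[i] is the index of the centroid point i belongs to.
  public static double value(List<Vector> points, int[] assignments,
                             List<Vector> centroids, double lambda) {
    SquaredEuclideanDistanceMeasure squared = new SquaredEuclideanDistanceMeasure();
    double error = 0.0;
    for (int i = 0; i < points.size(); i++) {
      error += squared.distance(points.get(i), centroids.get(assignments[i]));
    }
    double penalty = 0.0;
    for (Vector centroid : centroids) {
      penalty += centroid.norm(1.0);   // L_1 norm: sum of absolute coefficient values
    }
    return error + lambda * penalty;
  }
}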

I haven't tried this at all seriously yet.  L_1 regularization tends to
help drive toward sparsity, but it is normally used in convex problems
where we can guarantee a findable global optimum.  The k-means problem,
however, is not convex so adding the regularization may screw things up in
practice.  For text-like data, I have a strong intuition that the idealized
effect of L_1 should be very good, but the pragmatic effect may not be so
useful.

Re: Vector distance within a cluster

Posted by Sean Owen <sr...@gmail.com>.
A common measure of cluster coherence is the mean distance or mean squared
difference between the members and the cluster centroid. It sounds like
this is the kind of thing you're measuring with these all-pairs distances.
That could be a measure too; I've usually seen that done by taking the
maximum such intracluster distance, the 'diameter'.
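
As a rough sketch of the two measures mentioned here, assuming Mahout's Vector and EuclideanDistanceMeasure (the class and method names are only illustrative): mean squared distance from members to the centroid, and the all-pairs 'diameter'.

import java.util.List;

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;

public class ClusterCoherence {
  private static final EuclideanDistanceMeasure MEASURE = new EuclideanDistanceMeasure();

  // Mean squared distance from each member to the cluster centroid.
  public static double meanSquaredDistance(Vector centroid, List<Vector> members) {
    double sum = 0.0;
    for (Vector member : members) {
      double d = MEASURE.distance(centroid, member);
      sum += d * d;
    }
    return sum / members.size();
  }

  // 'Diameter': the maximum distance between any two members of the cluster.
  public static double diameter(List<Vector> members) {
    double max = 0.0;
    for (int i = 0; i < members.size(); i++) {
      for (int j = i + 1; j < members.size(); j++) {
        max = Math.max(max, MEASURE.distance(members.get(i), members.get(j)));
      }
    }
    return max;
  }
}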

To answer Ted's question -- you're measuring internal consistency. You're
not trying to find clusters that match some external standard that says
these 100 docs should cluster together, etc.

I'm speaking off the cuff, but I think the idea was that L1/Manhattan
distance may give you clusters that tend to spread out over fewer rather than
more dimensions, and so that may make them more interpretable -- because
they will tend to be nearly identical in the other several dimensions, and
those homogeneous dimensions tell you what they're "about".

The reason is that L1 is "indifferent" across dimensions -- moving a unit
in any dimension makes you a unit further/closer from another point --
while in L2 moving along a dimension where you are already close does
little.
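
A small numeric illustration of that point (plain Java arithmetic, nothing Mahout-specific): a 0.1 step in an already-close dimension costs 0.1 under L1 either way, but barely moves the L2 distance, whereas the same step in the already-far dimension moves L2 by the full 0.1.

public class L1VersusL2 {
  public static void main(String[] args) {
    double[] origin = {0.0, 0.0};
    double[] stepInCloseDim = {1.0, 0.1};  // already far in x, a small step in y
    double[] stepInFarDim   = {1.1, 0.0};  // the same small step taken in x instead

    // L1 charges the extra 0.1 the same no matter which dimension it lands in.
    System.out.println(l1(origin, stepInCloseDim));  // 1.1
    System.out.println(l1(origin, stepInFarDim));    // 1.1

    // L2 barely notices the step in the dimension that was already close...
    System.out.println(l2(origin, stepInCloseDim));  // ~1.005
    // ...but counts it fully in the dimension that was already far.
    System.out.println(l2(origin, stepInFarDim));    // 1.1
  }

  static double l1(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += Math.abs(a[i] - b[i]);
    }
    return sum;
  }

  static double l2(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return Math.sqrt(sum);
  }
}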

On Wed, Feb 27, 2013 at 3:23 PM, Chris Harrington <ch...@heystaks.com>wrote:

> Hmmm, you may have to dumb things down for me here. I have don't have much
> of a background in the area of ML and I'm just piecing things together and
> learning as I go.
> So I don't really understand what you mean by "Coherence against an
> external standard?  Or internal consistency/homogeneity?" or "One thought
> along these lines is to add L_1 regularization to the k-means algorithm."
> Is L_1 regularization the same as manhattan distance?
>
> That aside I'm outputting a file with the top terms and the text of 20
> random documents that ended up in that cluster and eyeballing that, not
> very high-tech or efficient but it was the only way I knew to make a
> relevance judgment on a cluster topic. For example If the majority of the
> samples are sport related and 82.6% of the vector distances in my cluster
> are quite similar I'm happy to call that cluster sport.
>

Re: Vector distance within a cluster

Posted by Chris Harrington <ch...@heystaks.com>.
Hmmm, you may have to dumb things down for me here. I don't have much of a background in ML and I'm just piecing things together and learning as I go.
So I don't really understand what you mean by "Coherence against an external standard?  Or internal consistency/homogeneity?" or "One thought along these lines is to add L_1 regularization to the k-means algorithm."
Is L_1 regularization the same as manhattan distance? 

That aside, I'm outputting a file with the top terms and the text of 20 random documents that ended up in that cluster and eyeballing that; not very high-tech or efficient, but it was the only way I knew to make a relevance judgment on a cluster topic. For example, if the majority of the samples are sport related and 82.6% of the vector distances in my cluster are quite similar, I'm happy to call that cluster sport.

On 26 Feb 2013, at 22:00, Ted Dunning wrote:

> Chris,
> 
> How are you doing your manual judgement step?  Coherence against an
> external standard?  Or internal consistency/homogeneity?
> 
> Except for unusual situations it is to be expected that most clusterings
> are not particularly stable (i.e. will no reproduce the same clusters from
> run to run).  As such, it is also unlikely that they will reproduce
> externally defined clusters any more than they will reproduce their own
> results.
> 
> Likewise, there is no guarantee that the results will be easily
> interpretable.  One thought along these lines is to add L_1 regularization
> to the k-means algorithm.  Another is to look into what the carrot project
> has done where, according to the developers, they have put some effort into
> making clusters that are easily summarizable.  This might be similar in
> effect to the regularization step I just mentioned.
> 
> On Tue, Feb 26, 2013 at 7:02 AM, Chris Harrington <ch...@heystaks.com>wrote:
> 
>> Well, what I'm trying to do is create clusters of topically similar
>> content via kmeans.
>> 
>> Since I'm basing validity on topics there's a manual judgement step.
>> And that manual step is taking a prohibitive amount of time to heck many
>> clustering runs hence the desire for some stats to indicate roughly how
>> good the clusters are.
>> 
>> So I' want some stats that, at a glance, I'll be able to tell which
>> clusters "should" be good and manually check them instead of having to
>> check each and every one.
>> 
>> I was thinking that a file with
>> 
>> 1. the number of clusters,
>> 2. the avg of all points to every other point
>> 3. the avg distance of the points furthest from the center to all other
>> points, (furthest 25% of all points within a cluster)
>> 4. the avg distance of the points closest to the center to all other point
>> (closest 25% of all points within a cluster)
>> 
>> would allow me to quickly see if I should even bother manually checking
>> the clustering output, the logic being that if 4,3 and 2 are similar in
>> value then it's probably a decent cluster and I can manually check it. Also
>> a comparison of 3 vs 2 would indicate if the cluster contains a number of
>> distant outliers and 4 vs 2 would should show roughly how dense a cluster
>> is.
>> 
>> This makes sense right? or am I barking up the wrong tree?
>> 
>> On 25 Feb 2013, at 20:15, Ted Dunning wrote:
>> 
>>> The best way to evaluate a cluster really depends on what your purpose
>> is.
>>> 
>>> My own purpose is typically to use the clustering as a description of the
>>> probability distribution of data.
>>> 
>>> For that purpose, the best evaluation is distance to centroids for
>> held-out
>>> data.  The use of held-out data is critical here since otherwise you
>> could
>>> just put a single cluster at every data point and get zero distance for
>> the
>>> original data.  For held-out data, of course, the story would be
>> different.
>>> 
>>> This view of things is very good from the standpoint of machine learning
>>> and data compression, but might be less useful for certain purposes that
>>> have to do with explanation of data in human readable form.  My
>> experience
>>> is that it is common for a clustering algorithm to be very good as a
>>> probability distribution description but quite bad for human inspection.
>>> 
>>> My own tendency would be to adapt the outline you gave to work on
>> held-out
>>> data instead of the original training data.
>>> 
>>> On Mon, Feb 25, 2013 at 4:27 AM, Chris Harrington <chris@heystaks.com
>>> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> I want to find all the vectors within a cluster and then find the
>> distance
>>>> between them and every other vector within a cluster, in hopes this will
>>>> give me a good idea of how similar each vector within a cluster is as
>> well
>>>> as identify outlier vectors.
>>>> 
>>>> So there are 2 things I want to ask.
>>>> 
>>>> 1. Is this a sensible approach to evaluating the cluster quality?
>>>> 
>>>> 2. Is the correct file to get this info from the
>>>> clusteredPoints/parts-m-00000 file?
>>>> 
>>>> Thanks,
>>>> Chris
>>>> 
>>>> 
>>>> 
>> 
>> 


Re: Vector distance within a cluster

Posted by Ted Dunning <te...@gmail.com>.
Chris,

How are you doing your manual judgement step?  Coherence against an
external standard?  Or internal consistency/homogeneity?

Except for unusual situations it is to be expected that most clusterings
are not particularly stable (i.e. will not reproduce the same clusters from
run to run).  As such, it is also unlikely that they will reproduce
externally defined clusters any more than they will reproduce their own
results.

Likewise, there is no guarantee that the results will be easily
interpretable.  One thought along these lines is to add L_1 regularization
to the k-means algorithm.  Another is to look into what the carrot project
has done where, according to the developers, they have put some effort into
making clusters that are easily summarizable.  This might be similar in
effect to the regularization step I just mentioned.

On Tue, Feb 26, 2013 at 7:02 AM, Chris Harrington <ch...@heystaks.com>wrote:

> Well, what I'm trying to do is create clusters of topically similar
> content via kmeans.
>
> Since I'm basing validity on topics there's a manual judgement step.
> And that manual step is taking a prohibitive amount of time to heck many
> clustering runs hence the desire for some stats to indicate roughly how
> good the clusters are.
>
> So I' want some stats that, at a glance, I'll be able to tell which
> clusters "should" be good and manually check them instead of having to
> check each and every one.
>
> I was thinking that a file with
>
> 1. the number of clusters,
> 2. the avg of all points to every other point
> 3. the avg distance of the points furthest from the center to all other
> points, (furthest 25% of all points within a cluster)
> 4. the avg distance of the points closest to the center to all other point
> (closest 25% of all points within a cluster)
>
> would allow me to quickly see if I should even bother manually checking
> the clustering output, the logic being that if 4,3 and 2 are similar in
> value then it's probably a decent cluster and I can manually check it. Also
> a comparison of 3 vs 2 would indicate if the cluster contains a number of
> distant outliers and 4 vs 2 would should show roughly how dense a cluster
> is.
>
> This makes sense right? or am I barking up the wrong tree?
>
> On 25 Feb 2013, at 20:15, Ted Dunning wrote:
>
> > The best way to evaluate a cluster really depends on what your purpose
> is.
> >
> > My own purpose is typically to use the clustering as a description of the
> > probability distribution of data.
> >
> > For that purpose, the best evaluation is distance to centroids for
> held-out
> > data.  The use of held-out data is critical here since otherwise you
> could
> > just put a single cluster at every data point and get zero distance for
> the
> > original data.  For held-out data, of course, the story would be
> different.
> >
> > This view of things is very good from the standpoint of machine learning
> > and data compression, but might be less useful for certain purposes that
> > have to do with explanation of data in human readable form.  My
> experience
> > is that it is common for a clustering algorithm to be very good as a
> > probability distribution description but quite bad for human inspection.
> >
> > My own tendency would be to adapt the outline you gave to work on
> held-out
> > data instead of the original training data.
> >
> > On Mon, Feb 25, 2013 at 4:27 AM, Chris Harrington <chris@heystaks.com
> >wrote:
> >
> >> Hi all,
> >>
> >> I want to find all the vectors within a cluster and then find the
> distance
> >> between them and every other vector within a cluster, in hopes this will
> >> give me a good idea of how similar each vector within a cluster is as
> well
> >> as identify outlier vectors.
> >>
> >> So there are 2 things I want to ask.
> >>
> >> 1. Is this a sensible approach to evaluating the cluster quality?
> >>
> >> 2. Is the correct file to get this info from the
> >> clusteredPoints/parts-m-00000 file?
> >>
> >> Thanks,
> >> Chris
> >>
> >>
> >>
>
>

Re: Vector distance within a cluster

Posted by Dan Filimon <da...@gmail.com>.
Hi Chris,

I'm running similar cluster quality tests and wrote code that gets
some of the statistics you want.

Have a look at [1], at the summarizeClusterDistances() method. You
give it the centroids and data points you have, and it returns a
list of OnlineSummarizers with the relevant distance statistics (your
classes are probably not the same but this is a good starting point).
I computed the distances from each point to its cluster's center and
summarized the distances in an OnlineSummarizer.
You then get (similar to what you ask for):
1. the number of points/cluster
2. average distances to the center (getMean())
3. first quartile (getQuartile(1))
4. third quartile (getQuartile(3))

I realize that you're talking about other kinds of distances (between
any two points), but information about the distances to the center
also gives you information about the quality of the clustering (an
upper bound for the distance between any 2 points is 2 * max distance
to the center, in the worst case of 2 points on opposite sides of a
diameter).
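
For reference, a minimal sketch of that approach in the spirit of summarizeClusterDistances() (assuming Mahout's OnlineSummarizer and EuclideanDistanceMeasure; the class and method names here are illustrative, not the actual code in [1]): each point contributes the distance to its nearest centroid to that cluster's summarizer, which then yields the count, mean and quartiles listed above.

import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.stats.OnlineSummarizer;

public class ClusterDistanceSummaries {
  // One OnlineSummarizer per cluster, fed with the distance from each point to
  // the centroid of the cluster it is assigned to (its nearest centroid).
  public static List<OnlineSummarizer> summarize(List<Vector> centroids,
                                                 Iterable<Vector> points) {
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    List<OnlineSummarizer> summarizers = new ArrayList<OnlineSummarizer>();
    for (int i = 0; i < centroids.size(); i++) {
      summarizers.add(new OnlineSummarizer());
    }
    for (Vector point : points) {
      int nearest = 0;
      double best = Double.POSITIVE_INFINITY;
      for (int i = 0; i < centroids.size(); i++) {
        double d = measure.distance(centroids.get(i), point);
        if (d < best) {
          best = d;
          nearest = i;
        }
      }
      summarizers.get(nearest).add(best);
    }
    return summarizers;
  }

  // Roughly the statistics listed above: count, mean, first and third quartile.
  public static void report(List<OnlineSummarizer> summarizers) {
    for (int i = 0; i < summarizers.size(); i++) {
      OnlineSummarizer s = summarizers.get(i);
      System.out.printf("cluster %d: n=%d mean=%.4f q1=%.4f q3=%.4f%n",
          i, s.getCount(), s.getMean(), s.getQuartile(1), s.getQuartile(3));
    }
  }
}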

And here [2] is code that I use to make a CSV of these cluster statistics.

[1] https://github.com/dfilimon/mahout/blob/skm/examples/src/main/java/org/apache/mahout/clustering/streaming/utils/ExperimentUtils.java
[2] https://github.com/dfilimon/mahout/blob/skm/examples/src/main/java/org/apache/mahout/clustering/streaming/tools/ClusterQuality20NewsGroups.java#L97

On Tue, Feb 26, 2013 at 5:02 PM, Chris Harrington <ch...@heystaks.com> wrote:
> Well, what I'm trying to do is create clusters of topically similar content via kmeans.
>
> Since I'm basing validity on topics there's a manual judgement step.
> And that manual step is taking a prohibitive amount of time to heck many clustering runs hence the desire for some stats to indicate roughly how good the clusters are.
>
> So I' want some stats that, at a glance, I'll be able to tell which clusters "should" be good and manually check them instead of having to check each and every one.
>
> I was thinking that a file with
>
> 1. the number of clusters,
> 2. the avg of all points to every other point
> 3. the avg distance of the points furthest from the center to all other points, (furthest 25% of all points within a cluster)
> 4. the avg distance of the points closest to the center to all other point (closest 25% of all points within a cluster)
>
> would allow me to quickly see if I should even bother manually checking the clustering output, the logic being that if 4,3 and 2 are similar in value then it's probably a decent cluster and I can manually check it. Also a comparison of 3 vs 2 would indicate if the cluster contains a number of distant outliers and 4 vs 2 would should show roughly how dense a cluster is.
>
> This makes sense right? or am I barking up the wrong tree?
>
> On 25 Feb 2013, at 20:15, Ted Dunning wrote:
>
>> The best way to evaluate a cluster really depends on what your purpose is.
>>
>> My own purpose is typically to use the clustering as a description of the
>> probability distribution of data.
>>
>> For that purpose, the best evaluation is distance to centroids for held-out
>> data.  The use of held-out data is critical here since otherwise you could
>> just put a single cluster at every data point and get zero distance for the
>> original data.  For held-out data, of course, the story would be different.
>>
>> This view of things is very good from the standpoint of machine learning
>> and data compression, but might be less useful for certain purposes that
>> have to do with explanation of data in human readable form.  My experience
>> is that it is common for a clustering algorithm to be very good as a
>> probability distribution description but quite bad for human inspection.
>>
>> My own tendency would be to adapt the outline you gave to work on held-out
>> data instead of the original training data.
>>
>> On Mon, Feb 25, 2013 at 4:27 AM, Chris Harrington <ch...@heystaks.com>wrote:
>>
>>> Hi all,
>>>
>>> I want to find all the vectors within a cluster and then find the distance
>>> between them and every other vector within a cluster, in hopes this will
>>> give me a good idea of how similar each vector within a cluster is as well
>>> as identify outlier vectors.
>>>
>>> So there are 2 things I want to ask.
>>>
>>> 1. Is this a sensible approach to evaluating the cluster quality?
>>>
>>> 2. Is the correct file to get this info from the
>>> clusteredPoints/parts-m-00000 file?
>>>
>>> Thanks,
>>> Chris
>>>
>>>
>>>
>

Re: Vector distance within a cluster

Posted by Chris Harrington <ch...@heystaks.com>.
Well, what I'm trying to do is create clusters of topically similar content via kmeans. 

Since I'm basing validity on topics there's a manual judgement step.
And that manual step is taking a prohibitive amount of time to check many clustering runs, hence the desire for some stats to indicate roughly how good the clusters are.

So I want some stats from which, at a glance, I'll be able to tell which clusters "should" be good and manually check only those instead of having to check each and every one.

I was thinking that a file with

1. the number of clusters, 
2. the avg distance of all points to every other point
3. the avg distance of the points furthest from the center to all other points (the furthest 25% of all points within a cluster)
4. the avg distance of the points closest to the center to all other points (the closest 25% of all points within a cluster)

would allow me to quickly see if I should even bother manually checking the clustering output, the logic being that if 4, 3 and 2 are similar in value then it's probably a decent cluster and I can manually check it. Also, a comparison of 3 vs 2 would indicate if the cluster contains a number of distant outliers, and 4 vs 2 would show roughly how dense a cluster is.

This makes sense right? or am I barking up the wrong tree?
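
For what it's worth, a rough sketch of stats 2-4 from the list above, assuming Mahout's Vector and EuclideanDistanceMeasure (class and method names invented for the example): the all-pairs average, and the average distance from the furthest or closest quartile of points (by distance to the centroid) to all other members.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;

public class IntraClusterStats {
  private static final EuclideanDistanceMeasure MEASURE = new EuclideanDistanceMeasure();

  // Stat 2: average distance over all pairs of points in the cluster.
  public static double averagePairwiseDistance(List<Vector> members) {
    double sum = 0.0;
    int pairs = 0;
    for (int i = 0; i < members.size(); i++) {
      for (int j = i + 1; j < members.size(); j++) {
        sum += MEASURE.distance(members.get(i), members.get(j));
        pairs++;
      }
    }
    return pairs == 0 ? 0.0 : sum / pairs;
  }

  // Stats 3 and 4: average distance from the 25% of members furthest from (or
  // closest to) the centroid to all the other members of the cluster.
  public static double quartileToAllAverage(final Vector centroid, List<Vector> members,
                                            boolean furthest) {
    List<Vector> sorted = new ArrayList<Vector>(members);
    Collections.sort(sorted, new Comparator<Vector>() {
      @Override
      public int compare(Vector a, Vector b) {
        return Double.compare(MEASURE.distance(centroid, a), MEASURE.distance(centroid, b));
      }
    });
    int quarter = Math.max(1, sorted.size() / 4);
    List<Vector> subset = furthest
        ? sorted.subList(sorted.size() - quarter, sorted.size())
        : sorted.subList(0, quarter);
    double sum = 0.0;
    int count = 0;
    for (Vector a : subset) {
      for (Vector b : members) {
        if (a != b) {          // skip the point itself (same object reference)
          sum += MEASURE.distance(a, b);
          count++;
        }
      }
    }
    return count == 0 ? 0.0 : sum / count;
  }
}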

On 25 Feb 2013, at 20:15, Ted Dunning wrote:

> The best way to evaluate a cluster really depends on what your purpose is.
> 
> My own purpose is typically to use the clustering as a description of the
> probability distribution of data.
> 
> For that purpose, the best evaluation is distance to centroids for held-out
> data.  The use of held-out data is critical here since otherwise you could
> just put a single cluster at every data point and get zero distance for the
> original data.  For held-out data, of course, the story would be different.
> 
> This view of things is very good from the standpoint of machine learning
> and data compression, but might be less useful for certain purposes that
> have to do with explanation of data in human readable form.  My experience
> is that it is common for a clustering algorithm to be very good as a
> probability distribution description but quite bad for human inspection.
> 
> My own tendency would be to adapt the outline you gave to work on held-out
> data instead of the original training data.
> 
> On Mon, Feb 25, 2013 at 4:27 AM, Chris Harrington <ch...@heystaks.com>wrote:
> 
>> Hi all,
>> 
>> I want to find all the vectors within a cluster and then find the distance
>> between them and every other vector within a cluster, in hopes this will
>> give me a good idea of how similar each vector within a cluster is as well
>> as identify outlier vectors.
>> 
>> So there are 2 things I want to ask.
>> 
>> 1. Is this a sensible approach to evaluating the cluster quality?
>> 
>> 2. Is the correct file to get this info from the
>> clusteredPoints/parts-m-00000 file?
>> 
>> Thanks,
>> Chris
>> 
>> 
>> 


Re: Vector distance within a cluster

Posted by Ted Dunning <te...@gmail.com>.
The best way to evaluate a cluster really depends on what your purpose is.

My own purpose is typically to use the clustering as a description of the
probability distribution of data.

For that purpose, the best evaluation is distance to centroids for held-out
data.  The use of held-out data is critical here since otherwise you could
just put a single cluster at every data point and get zero distance for the
original data.  For held-out data, of course, the story would be different.

This view of things is very good from the standpoint of machine learning
and data compression, but might be less useful for certain purposes that
have to do with explanation of data in human readable form.  My experience
is that it is common for a clustering algorithm to be very good as a
probability distribution description but quite bad for human inspection.

My own tendency would be to adapt the outline you gave to work on held-out
data instead of the original training data.
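
A rough sketch of that evaluation, again assuming Mahout's Vector and EuclideanDistanceMeasure (names invented for the example): the average distance from each held-out point to its nearest centroid, which can then be compared across clusterings.

import java.util.List;

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;

public class HeldOutEvaluation {
  // Average distance from each held-out point to its nearest centroid.  Lower is
  // better; unlike training error, this cannot be driven to zero just by putting
  // a centroid on top of every training point.
  public static double averageHeldOutDistance(List<Vector> centroids,
                                              List<Vector> heldOut) {
    EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
    double total = 0.0;
    for (Vector point : heldOut) {
      double nearest = Double.POSITIVE_INFINITY;
      for (Vector centroid : centroids) {
        nearest = Math.min(nearest, measure.distance(centroid, point));
      }
      total += nearest;
    }
    return heldOut.isEmpty() ? 0.0 : total / heldOut.size();
  }
}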

On Mon, Feb 25, 2013 at 4:27 AM, Chris Harrington <ch...@heystaks.com>wrote:

> Hi all,
>
> I want to find all the vectors within a cluster and then find the distance
> between them and every other vector within a cluster, in hopes this will
> give me a good idea of how similar each vector within a cluster is as well
> as identify outlier vectors.
>
> So there are 2 things I want to ask.
>
> 1. Is this a sensible approach to evaluating the cluster quality?
>
> 2. Is the correct file to get this info from the
> clusteredPoints/parts-m-00000 file?
>
> Thanks,
> Chris
>
>
>