Posted to user@mahout.apache.org by Dan Filimon <da...@gmail.com> on 2013/02/22 16:33:35 UTC

Plotting cluster quality

As most of the regulars know, I'm working with Ted Dunning on a new
clustering framework for Mahout that should land in 0.8.

Part of my work is comparing the clustering quality of the new code
with the existing Mahout implementation.

I compiled a CSV of the quality data [1]. I ran 5 runs of the
clustering on the 20 newsgroups data set comparing Mahout KMeans (km),
Ball KMeans (bkm), Streaming KMeans (skm) and Streaming KMeans
followed by Ball KMeans (bskm).

I'm now looking at making some appealing plots of the data. For
instance, I think I want to make box plots of individual clustering
runs. Here's an example [2] of what a clustering looks like for one
run of Mahout's standard k-means.

There's a box for each cluster: the mean distance is the thick line,
the box limits are the 1st and 3rd quartiles, and the whiskers are the
min and max distances.
The blue horizontal line is the mean of all average cluster distances.
The green horizontal line is the median of all average cluster distances.
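
To make the plot elements above concrete, here is a minimal sketch of the
numbers behind each box and behind the two horizontal lines. It is not the
code that produced [2]; the input layout (a dict from cluster id to a list of
point-to-centroid distances), the function names, and the sample distances
are all assumptions made for illustration, and the quartile method is simply
Python's default.

```python
import statistics


def box_stats(distances):
    """Summary of one cluster's distances, matching the box described above."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(distances, n=4)
    return {
        "mean": statistics.mean(distances),  # thick line
        "q1": q1,                            # lower box limit
        "q3": q3,                            # upper box limit
        "min": min(distances),               # lower whisker
        "max": max(distances),               # upper whisker
    }


def run_summary(clusters):
    """clusters: hypothetical dict of cluster id -> list of distances for one run."""
    boxes = {cid: box_stats(d) for cid, d in clusters.items()}
    cluster_means = [b["mean"] for b in boxes.values()]
    # Blue line: mean of the per-cluster mean distances.
    # Green line: median of the per-cluster mean distances.
    return boxes, statistics.mean(cluster_means), statistics.median(cluster_means)


# Made-up distances, only to show the shape of the input and output.
clusters = {
    0: [1.0, 2.0, 3.0, 4.0, 5.0],
    1: [2.0, 2.0, 2.0, 2.0],
    2: [0.0, 10.0, 20.0],
}
boxes, mean_of_means, median_of_means = run_summary(clusters)
print(mean_of_means, median_of_means)
```

Both horizontal lines summarize the per-cluster means of a single run, so a
few clusters with large mean distances pull the blue line up while leaving the
green line mostly unchanged.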

I intend to make similar plots for the other runs and then
aggregate the means of the runs into box plots for the different
classes of k-means.
The main result is that streaming k-means + ball k-means (as done
in the MR) gives a high-quality clustering.

How do you feel about this plot? Is it too dense? Too colorful? Should
I not draw the median any more?
What are some other good ways of plotting the quality given the data set?

Thanks!

[1] https://github.com/dfilimon/mahout/blob/skm/examples/src/main/resources/kmeans-comparison-nospace.csv
[2] http://swarm.cs.pub.ro/~dfilimon/skm-mahout/Mahout%20KMeans%20Run%201.pdf

Re: Plotting cluster quality

Posted by Ted Dunning <te...@gmail.com>.

I spoke off-line to Dan and he confirmed your inference.  Color was just
there for visual esthetics.

On Sun, Feb 24, 2013 at 6:18 AM, David Murgatroyd <dm...@gmail.com> wrote:

> >What does color mean here? What about width of the box?
> FWIW, I infer color is solely for visual distinction -- rotating through
> orange, red, yellow, pink from left to right. I infer width is proportional
> to count of items in each cluster, though apparently not linearly.

Re: Plotting cluster quality

Posted by David Murgatroyd <dm...@gmail.com>.

>What does color mean here? What about width of the box?
FWIW, I infer color is solely for visual distinction -- rotating through
orange, red, yellow, pink from left to right. I infer width is proportional
to count of items in each cluster, though apparently not linearly.

I agree that a single plot comparing the algorithms is important since the
purpose of the plot is to compare the algorithms rather than better
understand the data on which they've been run. I haven't thought of a good
way to do that while still having a cluster-by-cluster visual element.

On Fri, Feb 22, 2013 at 12:47 PM, Ted Dunning <te...@gmail.com> wrote:

> What does color mean here?
>
> What about width of the box?
>
> When you say median or mean of all cluster distances, do you mean across
> that single run?

Re: Plotting cluster quality

Posted by Ted Dunning <te...@gmail.com>.

What does color mean here?

What about width of the box?

When you say median or mean of all cluster distances, do you mean across
that single run?

I think that this plot is fine as it is except that it needs a legend that
explains all of these issues.  My general rule of thumb is that most
figures should have what I call a "Kipling caption".  See the caption of
the first image here: http://www.boop.org/jan/justso/butter.htm to see what
I mean by this.  Imagine that there is a very mathematically inclined
4-year-old who is looking at your diagram and quizzing you about every part.
Answer all their questions in the caption and you have a Kipling caption.

For comparing different runs of the clustering or different algorithms, I
think that a cumulative distribution plot (using plot.ecdf) with all of the
different algorithms on one plot would be the best comparison tool.
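
As a rough illustration of the cumulative distribution idea, here is a
minimal Python sketch rather than the R plot.ecdf call mentioned above. The
function names, the dict layout, and the distances are made up for the
example; only the algorithm labels (km, bkm, skm, bskm) come from the thread.
The plotting helper assumes matplotlib is installed and is not called by
default.

```python
from bisect import bisect_right


def ecdf_points(values):
    """Return (x, F(x)) pairs of the empirical CDF, one per observation."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def fraction_at_or_below(values, threshold):
    """Empirical CDF evaluated at a single threshold."""
    xs = sorted(values)
    return bisect_right(xs, threshold) / len(xs)


def plot_ecdfs(distances_by_algorithm, path):
    """Draw one ECDF step curve per algorithm on a shared axis (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, values in distances_by_algorithm.items():
        pts = ecdf_points(values)
        ax.step([x for x, _ in pts], [y for _, y in pts], where="post", label=name)
    ax.set_xlabel("distance to assigned centroid")
    ax.set_ylabel("fraction of points at or below distance")
    ax.legend()
    fig.savefig(path)


# Made-up distances, only to show the input shape; not results from the CSV.
sample = {
    "km": [3.0, 1.0, 2.0],
    "bkm": [2.5, 0.5, 1.5],
    "skm": [3.5, 2.0, 1.0],
    "bskm": [1.0, 0.5, 2.0],
}
for name, values in sample.items():
    print(name, ecdf_points(values))
# plot_ecdfs(sample, "ecdf-comparison.pdf")  # uncomment if matplotlib is available
```

On such a plot, a curve that rises further to the left means more points lie
close to their centroids, so all the algorithms can be compared at a glance
on one set of axes.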
