Posted to user@mahout.apache.org by Matt Molek <mp...@gmail.com> on 2013/02/27 22:38:09 UTC

What do "normal" pdf values look like for points clustered with kmeans?

I made a small modification to the KMeansDriver to call the
ClusterClassificationDriver with an emitMostLikely value of false so that I
could see what the pdf values of my points were for all k of my clusters.

I was expecting the most likely cluster to have a much higher pdf than the
other clusters in most cases, but in my results, all the values are pretty
close to 1/(number of clusters).

For example, when I ran with 50 clusters, most of my points had a pdf value
of 0.02xx for nearly every cluster.

I understand that to mean that for most of my points, none of my clusters
are a good fit. Is that right? Or is it common for the most likely
cluster to deviate only a tiny bit from all the others? (I wouldn't think so.)
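
A minimal sketch of how scores this flat could arise even when one cluster
is clearly the nearest. It assumes, and nothing in this thread confirms it,
that each cluster's score is 1 / (1 + distance to its center) and that the
scores for a point are then rescaled to sum to 1. The function name and the
toy distances are hypothetical.

```python
import math

def normalized_scores(point, centers):
    """Score each center as 1 / (1 + distance), then rescale to sum to 1.

    ASSUMPTION: this scoring rule is a guess at what the "pdf" values
    represent; check it against the clustering code you actually ran.
    """
    raw = [1.0 / (1.0 + math.dist(point, c)) for c in centers]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical toy case with k = 50 clusters: the point sits at the origin,
# its nearest center is at distance 1.0, the other 49 are at distance 2.0.
point = (0.0,)
centers = [(1.0,)] + [(2.0,)] * 49

scores = normalized_scores(point, centers)
print(f"nearest cluster: {scores[0]:.4f}")
print(f"other clusters:  {scores[1]:.4f}")
print(f"uniform 1/k:     {1 / len(centers):.4f}")
```

Under these assumptions, a center that is half as far away as every other
center scores only about 0.030 against 0.020, so values near 1/k would not by
themselves show that no cluster fits.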

Thanks for the advice,
Matt

Re: What do "normal" pdf values look like for points clustered with kmeans?

Posted by Ted Dunning <te...@gmail.com>.
These numbers are hard to interpret without context.

It is easier to interpret the average squared distance within clusters
and between clusters.  Do you have those values?
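
One way those two numbers could be computed is sketched below. It assumes
Euclidean distance, takes "within" to mean point-to-assigned-center, and takes
"between" to mean center-to-center over all pairs; the thread does not pin
down any of these choices. The function names and toy data are hypothetical.

```python
import math
from itertools import combinations

def mean_sq_within(points, labels, centers):
    """Average squared distance from each point to its assigned center."""
    total = sum(math.dist(p, centers[lab]) ** 2
                for p, lab in zip(points, labels))
    return total / len(points)

def mean_sq_between(centers):
    """Average squared distance over all distinct pairs of centers."""
    pairs = list(combinations(centers, 2))
    return sum(math.dist(a, b) ** 2 for a, b in pairs) / len(pairs)

# Hypothetical toy data: two tight clusters that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]                  # index of each point's cluster
centers = [(0.0, 1.0), (10.0, 1.0)]

within = mean_sq_within(points, labels, centers)
between = mean_sq_between(centers)
print(f"within:  {within:.2f}")
print(f"between: {between:.2f}")
```

A within value that is small relative to the between value suggests compact,
well-separated clusters; values of similar size suggest the clusters overlap.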

Re: What do "normal" pdf values look like for points clustered with kmeans?

Posted by Matt Molek <mp...@gmail.com>.
My data is from ~4500 Wikipedia articles. I stripped out the wiki markup,
ran them through seq2sparse, and then reduced to 100 dimensions with ssvd
before running kmeans.

I re-ran my test with some slightly tweaked parameters to see if I could
improve the clustering. My pdf values for the most likely clusters improved
a little bit, but not dramatically.

Taking the most likely cluster's pdf value for each point, I got a minimum
pdf of 0.0215, a maximum pdf of 0.0377, and a mean pdf value of 0.0282.

Looking at all 50 pdf values for each point, I got a minimum pdf of
0.0174 and a mean pdf value of 0.0200.

Do these pdf values say anything about the fit or quality of my cluster
results?
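
A small sketch of how summary figures like these could be produced from the
per-point pdf rows, with the uniform baseline 1/k alongside for comparison.
The function name and toy rows are hypothetical. It also shows one arithmetic
fact: if each row sums to 1, the mean over all values is exactly 1/k, so a
mean of 0.0200 over 50 clusters carries no information about fit.

```python
from statistics import mean

def summarize(pdf_rows):
    """Summarize per-point pdf rows (one value per cluster in each row)."""
    k = len(pdf_rows[0])
    best = [max(row) for row in pdf_rows]          # most likely cluster
    flat = [v for row in pdf_rows for v in row]    # every value, every point
    return {
        "uniform": 1.0 / k,
        "best_min": min(best),
        "best_max": max(best),
        "best_mean": mean(best),
        "all_min": min(flat),
        "all_mean": mean(flat),
    }

# Hypothetical toy rows for k = 4 clusters; each row sums to 1.
rows = [
    [0.40, 0.20, 0.20, 0.20],
    [0.25, 0.25, 0.25, 0.25],
    [0.30, 0.30, 0.20, 0.20],
]

stats = summarize(rows)
for name, value in stats.items():
    print(f"{name:>9}: {value:.4f}")
```

Comparing best_mean with uniform gives a rough sense of how far the most
likely cluster stands out; for the figures quoted above that is 0.0282
against 1/50 = 0.02.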


Re: What do "normal" pdf values look like for points clustered with kmeans?

Posted by Ted Dunning <te...@gmail.com>.
How high is the dimension?

How is your data generated?


