You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@mahout.apache.org by Dan Filimon <da...@gmail.com> on 2012/12/05 23:06:38 UTC

Clustering points in a unit hypercube

Hi,

One of the most basic tests for streaming k-means (and k-means in
general) is whether it works well for points that are multi-normally
distributed around the vertices of a unit cube.

So, for a cube, there'd be 8 vertices in 3d space. Generating
thousands of points should cluster them in those 8 clusters and they
should be relatively close to the means of these multinormal
distributions.

I decided to generalize it to more than 3 dimensions, and see how it
works for hypercubes with n dimensions and 2^n vertices.

Not well it turns out.

The clusters become less balanced as the number of dimensions increases.
I'm not sure if this is to be expected. I understand that in high
dimensional spaces, it becomes more likely for distances to be equal
and vectors to be orthogonal, but I'm seeing issues starting at 5
dimensions and this doesn't seem like a particularly high number of
dimension to me.

Is this normal?
Should the hypercube no longer have all sides equal to 1? The variance
of the multinormals is also 1.

Thanks!

Re: Clustering points in a unit hypercube

Posted by Ted Dunning <te...@gmail.com>.

Ahh... this may also be a problem.

You should get better results with a Brute searcher here, but a
ProjectionSearcher with lots of projections may work well.

On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon <da...@gmail.com>wrote:

> So, yes, it's probably a bug of some kind since I end up with anywhere
> between 400 and 1000 clusters (based on the searcher used) but the
> distances are still wrong.
>

Re: Clustering points in a unit hypercube

Posted by Ted Dunning <te...@gmail.com>.

Yeah... very useful.  Clearly the adaptive limit on the number of surrogate
points is much too restrictive.

On Fri, Dec 7, 2012 at 1:21 AM, Dan Filimon <da...@gmail.com>wrote:

> I took the plunge and rendered a few plots in R with how the
> parameters of streaming-k-means evolve.
> Here's the link [1].
>
> [1] https://github.com/dfilimon/knn/wiki/skm-visualization
>
> On Thu, Dec 6, 2012 at 2:01 AM, Ted Dunning <te...@gmail.com> wrote:
> > Still not that odd if several clusters are getting squashed.  This can
> > happen if the threshold increases too high or if the searcher is unable
> to
> > resolve the cube properly.  By its nature, the cube is hard to reduce to
> a
> > smaller dimension.
> >
> >
> > On Thu, Dec 6, 2012 at 12:36 AM, Dan Filimon <
> dangeorge.filimon@gmail.com>
> > wrote:
> >>
> >> But the weight referred to is the distance between a centroid and the
> >> mean of a distribution (a cube vertice).
> >> This should still be very small (also BallKMeans gets it right).
> >>
> >> On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning <te...@gmail.com>
> wrote:
> >> > IN order to succeed here, SKM will need to have maxClusters set to
> >> > 20,000 or
> >> > so.
> >> >
> >> > The maximum distance between clusters on a 10d hypercube is sqrt(10) =
> >> > 3.1
> >> > or so.  If three clusters get smashed together, then you have a
> >> > threshold of
> >> > 1.4 or so.
> >> >
> >> >
> >> > On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon
> >> > <da...@gmail.com>
> >> > wrote:
> >> >>
> >> >> I wanted there to be 2^d clusters. I was wrong and didn't check: the
> >> >> radius is in fact 0.01.
> >> >>
> >> >> What's happening is that for 10 dimension, I was expecting ~1024
> >> >> clusters (or at least have small distances) but StreamingKMeans fails
> >> >> on both accounts.
> >> >> BallKMeans does in fact get the clusters.
> >> >>
> >> >> So, yes, it's probably a bug of some kind since I end up with
> anywhere
> >> >> between 400 and 1000 clusters (based on the searcher used) but the
> >> >> distances are still wrong.
> >> >>
> >> >> Here's how many clusters I get and the searchers I get them with [1].
> >> >> As you can see, the number of clusters is all over the place.
> >> >>
> >> >> The distance too is also super huge. The assert said that all
> >> >> distances should be less than 0.05.
> >> >> Here is where it fails [2].
> >> >> And here is the corresponding GitHub issue (no info yet) [3].
> >> >>
> >> >> [1] https://gist.github.com/4220406
> >> >> [2]
> >> >>
> >> >>
> https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104
> >> >> [3] https://github.com/dfilimon/knn/issues/1
> >> >>
> >> >> On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning <te...@gmail.com>
> >> >> wrote:
> >> >> > How many clusters are you talking about?
> >> >> >
> >> >> > If you pick a modest number then streaming k-means should work well
> >> >> > if
> >> >> > it
> >> >> > has several times more surrogate points than there are clusters.
> >> >> >
> >> >> > Also, typically a hyper-cube test works with very small cluster
> >> >> > radius.
> >> >> > Try
> >> >> > 0.1 or 0.01.  Otherwise, your clusters overlap and the theoretical
> >> >> > guarantees go out the window.  Without the guarantees, it is hard
> to
> >> >> > interpret test results.  With small radii, and a modest number of
> >> >> > clusters,
> >> >> > what should happen is that the threshold in streaming k-means
> quickly
> >> >> > adapts
> >> >> > but stays << 1 which is the minimum distance between clusters.
>  That
> >> >> > guarantees that we will have at least 1 surrogate in each real
> >> >> > cluster.
> >> >> >
> >> >> > Failure modes I can imagine could include:
> >> >> >
> >> >> > a) threshold gets very big and the number of surrogates drops to 1
> >> >> > due
> >> >> > to a
> >> >> > bug.
> >> >> >
> >> >> > b) unit test has exponentially many clusters (all corners = 2^d).
> >> >> > This
> >> >> > will
> >> >> > cause the threshold to be increased to 1 or larger and will cause
> us
> >> >> > to
> >> >> > try
> >> >> > to cover many clusters with a single surrogate.
> >> >> >
> >> >> > c) something else (always possible)
> >> >> >
> >> >> >
> >> >> > On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon
> >> >> > <da...@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Okay, please disregard the previous e-mail.
> >> >> >> That hypothesis is toast; clustering works just fine with ball
> >> >> >> k-means.
> >> >> >>
> >> >> >> So, the problem lies in streaming k-means somewhere.
> >> >> >>
> >> >> >> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
> >> >> >> <da...@gmail.com> wrote:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > One of the most basic tests for streaming k-means (and k-means
> in
> >> >> >> > general) is whether it works well for points that are
> >> >> >> > multi-normally
> >> >> >> > distributed around the vertices of a unit cube.
> >> >> >> >
> >> >> >> > So, for a cube, there'd be 8 vertices in 3d space. Generating
> >> >> >> > thousands of points should cluster them in those 8 clusters and
> >> >> >> > they
> >> >> >> > should be relatively close to the means of these multinormal
> >> >> >> > distributions.
> >> >> >> >
> >> >> >> > I decided to generalize it to more than 3 dimensions, and see
> how
> >> >> >> > it
> >> >> >> > works for hypercubes with n dimensions and 2^n vertices.
> >> >> >> >
> >> >> >> > Not well it turns out.
> >> >> >> >
> >> >> >> > The clusters become less balanced as the number of dimensions
> >> >> >> > increases.
> >> >> >> > I'm not sure if this is to be expected. I understand that in
> high
> >> >> >> > dimensional spaces, it becomes more likely for distances to be
> >> >> >> > equal
> >> >> >> > and vectors to be orthogonal, but I'm seeing issues starting at
> 5
> >> >> >> > dimensions and this doesn't seem like a particularly high number
> >> >> >> > of
> >> >> >> > dimension to me.
> >> >> >> >
> >> >> >> > Is this normal?
> >> >> >> > Should the hypercube no longer have all sides equal to 1? The
> >> >> >> > variance
> >> >> >> > of the multinormals is also 1.
> >> >> >> >
> >> >> >> > Thanks!
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>

Re: Clustering points in a unit hypercube

Posted by Dan Filimon <da...@gmail.com>.

I took the plunge and rendered a few plots in R with how the
parameters of streaming-k-means evolve.
Here's the link [1].

[1] https://github.com/dfilimon/knn/wiki/skm-visualization

On Thu, Dec 6, 2012 at 2:01 AM, Ted Dunning <te...@gmail.com> wrote:
> Still not that odd if several clusters are getting squashed.  This can
> happen if the threshold increases too high or if the searcher is unable to
> resolve the cube properly.  By its nature, the cube is hard to reduce to a
> smaller dimension.
>
>
> On Thu, Dec 6, 2012 at 12:36 AM, Dan Filimon <da...@gmail.com>
> wrote:
>>
>> But the weight referred to is the distance between a centroid and the
>> mean of a distribution (a cube vertice).
>> This should still be very small (also BallKMeans gets it right).
>>
>> On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning <te...@gmail.com> wrote:
>> > IN order to succeed here, SKM will need to have maxClusters set to
>> > 20,000 or
>> > so.
>> >
>> > The maximum distance between clusters on a 10d hypercube is sqrt(10) =
>> > 3.1
>> > or so.  If three clusters get smashed together, then you have a
>> > threshold of
>> > 1.4 or so.
>> >
>> >
>> > On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon
>> > <da...@gmail.com>
>> > wrote:
>> >>
>> >> I wanted there to be 2^d clusters. I was wrong and didn't check: the
>> >> radius is in fact 0.01.
>> >>
>> >> What's happening is that for 10 dimension, I was expecting ~1024
>> >> clusters (or at least have small distances) but StreamingKMeans fails
>> >> on both accounts.
>> >> BallKMeans does in fact get the clusters.
>> >>
>> >> So, yes, it's probably a bug of some kind since I end up with anywhere
>> >> between 400 and 1000 clusters (based on the searcher used) but the
>> >> distances are still wrong.
>> >>
>> >> Here's how many clusters I get and the searchers I get them with [1].
>> >> As you can see, the number of clusters is all over the place.
>> >>
>> >> The distance too is also super huge. The assert said that all
>> >> distances should be less than 0.05.
>> >> Here is where it fails [2].
>> >> And here is the corresponding GitHub issue (no info yet) [3].
>> >>
>> >> [1] https://gist.github.com/4220406
>> >> [2]
>> >>
>> >> https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104
>> >> [3] https://github.com/dfilimon/knn/issues/1
>> >>
>> >> On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning <te...@gmail.com>
>> >> wrote:
>> >> > How many clusters are you talking about?
>> >> >
>> >> > If you pick a modest number then streaming k-means should work well
>> >> > if
>> >> > it
>> >> > has several times more surrogate points than there are clusters.
>> >> >
>> >> > Also, typically a hyper-cube test works with very small cluster
>> >> > radius.
>> >> > Try
>> >> > 0.1 or 0.01.  Otherwise, your clusters overlap and the theoretical
>> >> > guarantees go out the window.  Without the guarantees, it is hard to
>> >> > interpret test results.  With small radii, and a modest number of
>> >> > clusters,
>> >> > what should happen is that the threshold in streaming k-means quickly
>> >> > adapts
>> >> > but stays << 1 which is the minimum distance between clusters.  That
>> >> > guarantees that we will have at least 1 surrogate in each real
>> >> > cluster.
>> >> >
>> >> > Failure modes I can imagine could include:
>> >> >
>> >> > a) threshold gets very big and the number of surrogates drops to 1
>> >> > due
>> >> > to a
>> >> > bug.
>> >> >
>> >> > b) unit test has exponentially many clusters (all corners = 2^d).
>> >> > This
>> >> > will
>> >> > cause the threshold to be increased to 1 or larger and will cause us
>> >> > to
>> >> > try
>> >> > to cover many clusters with a single surrogate.
>> >> >
>> >> > c) something else (always possible)
>> >> >
>> >> >
>> >> > On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon
>> >> > <da...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Okay, please disregard the previous e-mail.
>> >> >> That hypothesis is toast; clustering works just fine with ball
>> >> >> k-means.
>> >> >>
>> >> >> So, the problem lies in streaming k-means somewhere.
>> >> >>
>> >> >> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
>> >> >> <da...@gmail.com> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > One of the most basic tests for streaming k-means (and k-means in
>> >> >> > general) is whether it works well for points that are
>> >> >> > multi-normally
>> >> >> > distributed around the vertices of a unit cube.
>> >> >> >
>> >> >> > So, for a cube, there'd be 8 vertices in 3d space. Generating
>> >> >> > thousands of points should cluster them in those 8 clusters and
>> >> >> > they
>> >> >> > should be relatively close to the means of these multinormal
>> >> >> > distributions.
>> >> >> >
>> >> >> > I decided to generalize it to more than 3 dimensions, and see how
>> >> >> > it
>> >> >> > works for hypercubes with n dimensions and 2^n vertices.
>> >> >> >
>> >> >> > Not well it turns out.
>> >> >> >
>> >> >> > The clusters become less balanced as the number of dimensions
>> >> >> > increases.
>> >> >> > I'm not sure if this is to be expected. I understand that in high
>> >> >> > dimensional spaces, it becomes more likely for distances to be
>> >> >> > equal
>> >> >> > and vectors to be orthogonal, but I'm seeing issues starting at 5
>> >> >> > dimensions and this doesn't seem like a particularly high number
>> >> >> > of
>> >> >> > dimension to me.
>> >> >> >
>> >> >> > Is this normal?
>> >> >> > Should the hypercube no longer have all sides equal to 1? The
>> >> >> > variance
>> >> >> > of the multinormals is also 1.
>> >> >> >
>> >> >> > Thanks!
>> >> >
>> >> >
>> >
>> >
>
>

Re: Clustering points in a unit hypercube

Posted by Ted Dunning <te...@gmail.com>.

Still not that odd if several clusters are getting squashed.  This can
happen if the threshold increases too high or if the searcher is unable to
resolve the cube properly.  By its nature, the cube is hard to reduce to a
smaller dimension.

On Thu, Dec 6, 2012 at 12:36 AM, Dan Filimon <da...@gmail.com>wrote:

> But the weight referred to is the distance between a centroid and the
> mean of a distribution (a cube vertice).
> This should still be very small (also BallKMeans gets it right).
>
> On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning <te...@gmail.com> wrote:
> > IN order to succeed here, SKM will need to have maxClusters set to
> 20,000 or
> > so.
> >
> > The maximum distance between clusters on a 10d hypercube is sqrt(10) =
> 3.1
> > or so.  If three clusters get smashed together, then you have a
> threshold of
> > 1.4 or so.
> >
> >
> > On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon <
> dangeorge.filimon@gmail.com>
> > wrote:
> >>
> >> I wanted there to be 2^d clusters. I was wrong and didn't check: the
> >> radius is in fact 0.01.
> >>
> >> What's happening is that for 10 dimension, I was expecting ~1024
> >> clusters (or at least have small distances) but StreamingKMeans fails
> >> on both accounts.
> >> BallKMeans does in fact get the clusters.
> >>
> >> So, yes, it's probably a bug of some kind since I end up with anywhere
> >> between 400 and 1000 clusters (based on the searcher used) but the
> >> distances are still wrong.
> >>
> >> Here's how many clusters I get and the searchers I get them with [1].
> >> As you can see, the number of clusters is all over the place.
> >>
> >> The distance too is also super huge. The assert said that all
> >> distances should be less than 0.05.
> >> Here is where it fails [2].
> >> And here is the corresponding GitHub issue (no info yet) [3].
> >>
> >> [1] https://gist.github.com/4220406
> >> [2]
> >>
> https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104
> >> [3] https://github.com/dfilimon/knn/issues/1
> >>
> >> On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning <te...@gmail.com>
> wrote:
> >> > How many clusters are you talking about?
> >> >
> >> > If you pick a modest number then streaming k-means should work well if
> >> > it
> >> > has several times more surrogate points than there are clusters.
> >> >
> >> > Also, typically a hyper-cube test works with very small cluster
> radius.
> >> > Try
> >> > 0.1 or 0.01.  Otherwise, your clusters overlap and the theoretical
> >> > guarantees go out the window.  Without the guarantees, it is hard to
> >> > interpret test results.  With small radii, and a modest number of
> >> > clusters,
> >> > what should happen is that the threshold in streaming k-means quickly
> >> > adapts
> >> > but stays << 1 which is the minimum distance between clusters.  That
> >> > guarantees that we will have at least 1 surrogate in each real
> cluster.
> >> >
> >> > Failure modes I can imagine could include:
> >> >
> >> > a) threshold gets very big and the number of surrogates drops to 1 due
> >> > to a
> >> > bug.
> >> >
> >> > b) unit test has exponentially many clusters (all corners = 2^d).
>  This
> >> > will
> >> > cause the threshold to be increased to 1 or larger and will cause us
> to
> >> > try
> >> > to cover many clusters with a single surrogate.
> >> >
> >> > c) something else (always possible)
> >> >
> >> >
> >> > On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon
> >> > <da...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Okay, please disregard the previous e-mail.
> >> >> That hypothesis is toast; clustering works just fine with ball
> k-means.
> >> >>
> >> >> So, the problem lies in streaming k-means somewhere.
> >> >>
> >> >> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
> >> >> <da...@gmail.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > One of the most basic tests for streaming k-means (and k-means in
> >> >> > general) is whether it works well for points that are
> multi-normally
> >> >> > distributed around the vertices of a unit cube.
> >> >> >
> >> >> > So, for a cube, there'd be 8 vertices in 3d space. Generating
> >> >> > thousands of points should cluster them in those 8 clusters and
> they
> >> >> > should be relatively close to the means of these multinormal
> >> >> > distributions.
> >> >> >
> >> >> > I decided to generalize it to more than 3 dimensions, and see how
> it
> >> >> > works for hypercubes with n dimensions and 2^n vertices.
> >> >> >
> >> >> > Not well it turns out.
> >> >> >
> >> >> > The clusters become less balanced as the number of dimensions
> >> >> > increases.
> >> >> > I'm not sure if this is to be expected. I understand that in high
> >> >> > dimensional spaces, it becomes more likely for distances to be
> equal
> >> >> > and vectors to be orthogonal, but I'm seeing issues starting at 5
> >> >> > dimensions and this doesn't seem like a particularly high number of
> >> >> > dimension to me.
> >> >> >
> >> >> > Is this normal?
> >> >> > Should the hypercube no longer have all sides equal to 1? The
> >> >> > variance
> >> >> > of the multinormals is also 1.
> >> >> >
> >> >> > Thanks!
> >> >
> >> >
> >
> >
>

Re: Clustering points in a unit hypercube

Posted by Dan Filimon <da...@gmail.com>.

But the weight referred to is the distance between a centroid and the
mean of a distribution (a cube vertice).
This should still be very small (also BallKMeans gets it right).

On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning <te...@gmail.com> wrote:
> IN order to succeed here, SKM will need to have maxClusters set to 20,000 or
> so.
>
> The maximum distance between clusters on a 10d hypercube is sqrt(10) = 3.1
> or so.  If three clusters get smashed together, then you have a threshold of
> 1.4 or so.
>
>
> On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon <da...@gmail.com>
> wrote:
>>
>> I wanted there to be 2^d clusters. I was wrong and didn't check: the
>> radius is in fact 0.01.
>>
>> What's happening is that for 10 dimension, I was expecting ~1024
>> clusters (or at least have small distances) but StreamingKMeans fails
>> on both accounts.
>> BallKMeans does in fact get the clusters.
>>
>> So, yes, it's probably a bug of some kind since I end up with anywhere
>> between 400 and 1000 clusters (based on the searcher used) but the
>> distances are still wrong.
>>
>> Here's how many clusters I get and the searchers I get them with [1].
>> As you can see, the number of clusters is all over the place.
>>
>> The distance too is also super huge. The assert said that all
>> distances should be less than 0.05.
>> Here is where it fails [2].
>> And here is the corresponding GitHub issue (no info yet) [3].
>>
>> [1] https://gist.github.com/4220406
>> [2]
>> https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104
>> [3] https://github.com/dfilimon/knn/issues/1
>>
>> On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning <te...@gmail.com> wrote:
>> > How many clusters are you talking about?
>> >
>> > If you pick a modest number then streaming k-means should work well if
>> > it
>> > has several times more surrogate points than there are clusters.
>> >
>> > Also, typically a hyper-cube test works with very small cluster radius.
>> > Try
>> > 0.1 or 0.01.  Otherwise, your clusters overlap and the theoretical
>> > guarantees go out the window.  Without the guarantees, it is hard to
>> > interpret test results.  With small radii, and a modest number of
>> > clusters,
>> > what should happen is that the threshold in streaming k-means quickly
>> > adapts
>> > but stays << 1 which is the minimum distance between clusters.  That
>> > guarantees that we will have at least 1 surrogate in each real cluster.
>> >
>> > Failure modes I can imagine could include:
>> >
>> > a) threshold gets very big and the number of surrogates drops to 1 due
>> > to a
>> > bug.
>> >
>> > b) unit test has exponentially many clusters (all corners = 2^d).  This
>> > will
>> > cause the threshold to be increased to 1 or larger and will cause us to
>> > try
>> > to cover many clusters with a single surrogate.
>> >
>> > c) something else (always possible)
>> >
>> >
>> > On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon
>> > <da...@gmail.com>
>> > wrote:
>> >>
>> >> Okay, please disregard the previous e-mail.
>> >> That hypothesis is toast; clustering works just fine with ball k-means.
>> >>
>> >> So, the problem lies in streaming k-means somewhere.
>> >>
>> >> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
>> >> <da...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > One of the most basic tests for streaming k-means (and k-means in
>> >> > general) is whether it works well for points that are multi-normally
>> >> > distributed around the vertices of a unit cube.
>> >> >
>> >> > So, for a cube, there'd be 8 vertices in 3d space. Generating
>> >> > thousands of points should cluster them in those 8 clusters and they
>> >> > should be relatively close to the means of these multinormal
>> >> > distributions.
>> >> >
>> >> > I decided to generalize it to more than 3 dimensions, and see how it
>> >> > works for hypercubes with n dimensions and 2^n vertices.
>> >> >
>> >> > Not well it turns out.
>> >> >
>> >> > The clusters become less balanced as the number of dimensions
>> >> > increases.
>> >> > I'm not sure if this is to be expected. I understand that in high
>> >> > dimensional spaces, it becomes more likely for distances to be equal
>> >> > and vectors to be orthogonal, but I'm seeing issues starting at 5
>> >> > dimensions and this doesn't seem like a particularly high number of
>> >> > dimension to me.
>> >> >
>> >> > Is this normal?
>> >> > Should the hypercube no longer have all sides equal to 1? The
>> >> > variance
>> >> > of the multinormals is also 1.
>> >> >
>> >> > Thanks!
>> >
>> >
>
>

Re: Clustering points in a unit hypercube

Posted by Ted Dunning <te...@gmail.com>.

IN order to succeed here, SKM will need to have maxClusters set to 20,000
or so.

The maximum distance between clusters on a 10d hypercube is sqrt(10) = 3.1
or so.  If three clusters get smashed together, then you have a threshold
of 1.4 or so.

On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon <da...@gmail.com>wrote:

> I wanted there to be 2^d clusters. I was wrong and didn't check: the
> radius is in fact 0.01.
>
> What's happening is that for 10 dimension, I was expecting ~1024
> clusters (or at least have small distances) but StreamingKMeans fails
> on both accounts.
> BallKMeans does in fact get the clusters.
>
> So, yes, it's probably a bug of some kind since I end up with anywhere
> between 400 and 1000 clusters (based on the searcher used) but the
> distances are still wrong.
>
> Here's how many clusters I get and the searchers I get them with [1].
> As you can see, the number of clusters is all over the place.
>
> The distance too is also super huge. The assert said that all
> distances should be less than 0.05.
> Here is where it fails [2].
> And here is the corresponding GitHub issue (no info yet) [3].
>
> [1] https://gist.github.com/4220406
> [2]
> https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104
> [3] https://github.com/dfilimon/knn/issues/1
>
> On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning <te...@gmail.com> wrote:
> > How many clusters are you talking about?
> >
> > If you pick a modest number then streaming k-means should work well if it
> > has several times more surrogate points than there are clusters.
> >
> > Also, typically a hyper-cube test works with very small cluster radius.
>  Try
> > 0.1 or 0.01.  Otherwise, your clusters overlap and the theoretical
> > guarantees go out the window.  Without the guarantees, it is hard to
> > interpret test results.  With small radii, and a modest number of
> clusters,
> > what should happen is that the threshold in streaming k-means quickly
> adapts
> > but stays << 1 which is the minimum distance between clusters.  That
> > guarantees that we will have at least 1 surrogate in each real cluster.
> >
> > Failure modes I can imagine could include:
> >
> > a) threshold gets very big and the number of surrogates drops to 1 due
> to a
> > bug.
> >
> > b) unit test has exponentially many clusters (all corners = 2^d).  This
> will
> > cause the threshold to be increased to 1 or larger and will cause us to
> try
> > to cover many clusters with a single surrogate.
> >
> > c) something else (always possible)
> >
> >
> > On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon <
> dangeorge.filimon@gmail.com>
> > wrote:
> >>
> >> Okay, please disregard the previous e-mail.
> >> That hypothesis is toast; clustering works just fine with ball k-means.
> >>
> >> So, the problem lies in streaming k-means somewhere.
> >>
> >> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
> >> <da...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > One of the most basic tests for streaming k-means (and k-means in
> >> > general) is whether it works well for points that are multi-normally
> >> > distributed around the vertices of a unit cube.
> >> >
> >> > So, for a cube, there'd be 8 vertices in 3d space. Generating
> >> > thousands of points should cluster them in those 8 clusters and they
> >> > should be relatively close to the means of these multinormal
> >> > distributions.
> >> >
> >> > I decided to generalize it to more than 3 dimensions, and see how it
> >> > works for hypercubes with n dimensions and 2^n vertices.
> >> >
> >> > Not well it turns out.
> >> >
> >> > The clusters become less balanced as the number of dimensions
> increases.
> >> > I'm not sure if this is to be expected. I understand that in high
> >> > dimensional spaces, it becomes more likely for distances to be equal
> >> > and vectors to be orthogonal, but I'm seeing issues starting at 5
> >> > dimensions and this doesn't seem like a particularly high number of
> >> > dimension to me.
> >> >
> >> > Is this normal?
> >> > Should the hypercube no longer have all sides equal to 1? The variance
> >> > of the multinormals is also 1.
> >> >
> >> > Thanks!
> >
> >
>

Re: Clustering points in a unit hypercube

Posted by Dan Filimon <da...@gmail.com>.

I wanted there to be 2^d clusters. I was wrong and didn't check: the
radius is in fact 0.01.

What's happening is that for 10 dimension, I was expecting ~1024
clusters (or at least have small distances) but StreamingKMeans fails
on both accounts.
BallKMeans does in fact get the clusters.

So, yes, it's probably a bug of some kind since I end up with anywhere
between 400 and 1000 clusters (based on the searcher used) but the
distances are still wrong.

Here's how many clusters I get and the searchers I get them with [1].
As you can see, the number of clusters is all over the place.

The distance too is also super huge. The assert said that all
distances should be less than 0.05.
Here is where it fails [2].
And here is the corresponding GitHub issue (no info yet) [3].

[1] https://gist.github.com/4220406
[2] https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104
[3] https://github.com/dfilimon/knn/issues/1

On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning <te...@gmail.com> wrote:
> How many clusters are you talking about?
>
> If you pick a modest number then streaming k-means should work well if it
> has several times more surrogate points than there are clusters.
>
> Also, typically a hyper-cube test works with very small cluster radius.  Try
> 0.1 or 0.01.  Otherwise, your clusters overlap and the theoretical
> guarantees go out the window.  Without the guarantees, it is hard to
> interpret test results.  With small radii, and a modest number of clusters,
> what should happen is that the threshold in streaming k-means quickly adapts
> but stays << 1 which is the minimum distance between clusters.  That
> guarantees that we will have at least 1 surrogate in each real cluster.
>
> Failure modes I can imagine could include:
>
> a) threshold gets very big and the number of surrogates drops to 1 due to a
> bug.
>
> b) unit test has exponentially many clusters (all corners = 2^d).  This will
> cause the threshold to be increased to 1 or larger and will cause us to try
> to cover many clusters with a single surrogate.
>
> c) something else (always possible)
>
>
> On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon <da...@gmail.com>
> wrote:
>>
>> Okay, please disregard the previous e-mail.
>> That hypothesis is toast; clustering works just fine with ball k-means.
>>
>> So, the problem lies in streaming k-means somewhere.
>>
>> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
>> <da...@gmail.com> wrote:
>> > Hi,
>> >
>> > One of the most basic tests for streaming k-means (and k-means in
>> > general) is whether it works well for points that are multi-normally
>> > distributed around the vertices of a unit cube.
>> >
>> > So, for a cube, there'd be 8 vertices in 3d space. Generating
>> > thousands of points should cluster them in those 8 clusters and they
>> > should be relatively close to the means of these multinormal
>> > distributions.
>> >
>> > I decided to generalize it to more than 3 dimensions, and see how it
>> > works for hypercubes with n dimensions and 2^n vertices.
>> >
>> > Not well it turns out.
>> >
>> > The clusters become less balanced as the number of dimensions increases.
>> > I'm not sure if this is to be expected. I understand that in high
>> > dimensional spaces, it becomes more likely for distances to be equal
>> > and vectors to be orthogonal, but I'm seeing issues starting at 5
>> > dimensions and this doesn't seem like a particularly high number of
>> > dimension to me.
>> >
>> > Is this normal?
>> > Should the hypercube no longer have all sides equal to 1? The variance
>> > of the multinormals is also 1.
>> >
>> > Thanks!
>
>

Re: Clustering points in a unit hypercube

Posted by Ted Dunning <te...@gmail.com>.

How many clusters are you talking about?

If you pick a modest number then streaming k-means should work well if it
has several times more surrogate points than there are clusters.

Also, typically a hyper-cube test works with very small cluster radius.
 Try 0.1 or 0.01.  Otherwise, your clusters overlap and the theoretical
guarantees go out the window.  Without the guarantees, it is hard to
interpret test results.  With small radii, and a modest number of clusters,
what should happen is that the threshold in streaming k-means quickly
adapts but stays << 1 which is the minimum distance between clusters.  That
guarantees that we will have at least 1 surrogate in each real cluster.

Failure modes I can imagine could include:

a) threshold gets very big and the number of surrogates drops to 1 due to a
bug.

b) unit test has exponentially many clusters (all corners = 2^d).  This
will cause the threshold to be increased to 1 or larger and will cause us
to try to cover many clusters with a single surrogate.

c) something else (always possible)

On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon <da...@gmail.com>wrote:

> Okay, please disregard the previous e-mail.
> That hypothesis is toast; clustering works just fine with ball k-means.
>
> So, the problem lies in streaming k-means somewhere.
>
> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
> <da...@gmail.com> wrote:
> > Hi,
> >
> > One of the most basic tests for streaming k-means (and k-means in
> > general) is whether it works well for points that are multi-normally
> > distributed around the vertices of a unit cube.
> >
> > So, for a cube, there'd be 8 vertices in 3d space. Generating
> > thousands of points should cluster them in those 8 clusters and they
> > should be relatively close to the means of these multinormal
> > distributions.
> >
> > I decided to generalize it to more than 3 dimensions, and see how it
> > works for hypercubes with n dimensions and 2^n vertices.
> >
> > Not well it turns out.
> >
> > The clusters become less balanced as the number of dimensions increases.
> > I'm not sure if this is to be expected. I understand that in high
> > dimensional spaces, it becomes more likely for distances to be equal
> > and vectors to be orthogonal, but I'm seeing issues starting at 5
> > dimensions and this doesn't seem like a particularly high number of
> > dimension to me.
> >
> > Is this normal?
> > Should the hypercube no longer have all sides equal to 1? The variance
> > of the multinormals is also 1.
> >
> > Thanks!
>

Re: Clustering points in a unit hypercube

Posted by Dan Filimon <da...@gmail.com>.

Okay, please disregard the previous e-mail.
That hypothesis is toast; clustering works just fine with ball k-means.

So, the problem lies in streaming k-means somewhere.

On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
<da...@gmail.com> wrote:
> Hi,
>
> One of the most basic tests for streaming k-means (and k-means in
> general) is whether it works well for points that are multi-normally
> distributed around the vertices of a unit cube.
>
> So, for a cube, there'd be 8 vertices in 3d space. Generating
> thousands of points should cluster them in those 8 clusters and they
> should be relatively close to the means of these multinormal
> distributions.
>
> I decided to generalize it to more than 3 dimensions, and see how it
> works for hypercubes with n dimensions and 2^n vertices.
>
> Not well it turns out.
>
> The clusters become less balanced as the number of dimensions increases.
> I'm not sure if this is to be expected. I understand that in high
> dimensional spaces, it becomes more likely for distances to be equal
> and vectors to be orthogonal, but I'm seeing issues starting at 5
> dimensions and this doesn't seem like a particularly high number of
> dimension to me.
>
> Is this normal?
> Should the hypercube no longer have all sides equal to 1? The variance
> of the multinormals is also 1.
>
> Thanks!