You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by Dan Filimon <da...@gmail.com> on 2013/01/03 00:38:29 UTC

clusterLogFactor in StreamingKMeans

Ted, I've been meaning to ask you about this.
Currently, we have a parameter called clusterLogFactor [1] that we
multiply by the number of points seen so far.

This is (I guess) meant to behave like the k*log(n) recommended value
for the number of clusters in the paper. So, clusterLogFactor should
actually be k (the number of clusters).

What I'm saying here is...

We get a numClusters parameter anyway. Currently I set this to
k*log(N) (where N is the total number of points at the beginning).

I propose that instead of having two confusing parameters:
estimatedNumClusters and clusterLogFactor, to just have one,
numClusters that has the same semantics as in BallKMeans.
It's about time these were properly documented.

Additionally, I'd remove the max at line 232.

How about it?

[1] https://github.com/dfilimon/knn/blob/d6891060b5488e492fd4bcc50343211b8d7da1dd/src/main/java/org/apache/mahout/knn/cluster/StreamingKMeans.java#L47

Re: clusterLogFactor in StreamingKMeans

Posted by Dan Filimon <da...@gmail.com>.

The thing is that somehow I ended up not using clusterLogFactor in
that capacity at all.
Currently it's set to 10 by default (way more that I assume the theta
would imply).

I agree that we don't always know N, but I'd argue that we can
estimate a decent log N, especially since using StreamingKMeans in a
MapReduce implies you're doing processing the points in a batch and
know how many there are.

Currently we're doing clusterLogFactor *
Math.log(numProcessedDataPoints) whereas we should in fact be doing
clusterFudgeFactor * numClusters * Math.log(numProcessedDataPoints) so
we can estimate log N from c * log n.
While it might take a while to get n to approach N, we're using n^c to
estimate N, and after n grows larger than N^(1/c), we're
overestimating the number of clusters.

I guess what I'm asking is why do we even bother to have a dynamic
number of clusters? We're using the sketch to do ball k-means anyway,
aren't we?

On Thu, Jan 3, 2013 at 2:14 AM, Ted Dunning <te...@gmail.com> wrote:
> Well, the point of clusterLogFactor was original to be c in the expression
> c * k * log N.  The idea was that this would normally be in the range from
> 1 to 3 and would be a fudge factor to increase k log N.  This is useful
> where we know k, but not N.
>
> On the other hand, in some cases we can just pick the maximum number of
> sketch clusters and we just want to set it so.  This is very useful in
> testing.  This number is almost always larger than the value of c * k * log
> N so the max is pretty decent as a way of choosing between the two.
>
> I agree that this should be normalized and documented, but I think we
> probably will want two behaviors until we really understand how to pick an
> adaptive limit.  I am happy if you have some way to expressive this well.
>
> On Wed, Jan 2, 2013 at 3:38 PM, Dan Filimon <da...@gmail.com>wrote:
>
>> Ted, I've been meaning to ask you about this.
>> Currently, we have a parameter called clusterLogFactor [1] that we
>> multiply by the number of points seen so far.
>>
>> This is (I guess) meant to behave like the k*log(n) recommended value
>> for the number of clusters in the paper. So, clusterLogFactor should
>> actually be k (the number of clusters).
>>
>> What I'm saying here is...
>>
>> We get a numClusters parameter anyway. Currently I set this to
>> k*log(N) (where N is the total number of points at the beginning).
>>
>> I propose that instead of having two confusing parameters:
>> estimatedNumClusters and clusterLogFactor, to just have one,
>> numClusters that has the same semantics as in BallKMeans.
>> It's about time these were properly documented.
>>
>> Additionally, I'd remove the max at line 232.
>>
>> How about it?
>>
>> [1]
>> https://github.com/dfilimon/knn/blob/d6891060b5488e492fd4bcc50343211b8d7da1dd/src/main/java/org/apache/mahout/knn/cluster/StreamingKMeans.java#L47
>>

Re: clusterLogFactor in StreamingKMeans

Posted by Ted Dunning <te...@gmail.com>.

Well, the point of clusterLogFactor was original to be c in the expression
c * k * log N.  The idea was that this would normally be in the range from
1 to 3 and would be a fudge factor to increase k log N.  This is useful
where we know k, but not N.

On the other hand, in some cases we can just pick the maximum number of
sketch clusters and we just want to set it so.  This is very useful in
testing.  This number is almost always larger than the value of c * k * log
N so the max is pretty decent as a way of choosing between the two.

I agree that this should be normalized and documented, but I think we
probably will want two behaviors until we really understand how to pick an
adaptive limit.  I am happy if you have some way to expressive this well.

On Wed, Jan 2, 2013 at 3:38 PM, Dan Filimon <da...@gmail.com>wrote:

> Ted, I've been meaning to ask you about this.
> Currently, we have a parameter called clusterLogFactor [1] that we
> multiply by the number of points seen so far.
>
> This is (I guess) meant to behave like the k*log(n) recommended value
> for the number of clusters in the paper. So, clusterLogFactor should
> actually be k (the number of clusters).
>
> What I'm saying here is...
>
> We get a numClusters parameter anyway. Currently I set this to
> k*log(N) (where N is the total number of points at the beginning).
>
> I propose that instead of having two confusing parameters:
> estimatedNumClusters and clusterLogFactor, to just have one,
> numClusters that has the same semantics as in BallKMeans.
> It's about time these were properly documented.
>
> Additionally, I'd remove the max at line 232.
>
> How about it?
>
> [1]
> https://github.com/dfilimon/knn/blob/d6891060b5488e492fd4bcc50343211b8d7da1dd/src/main/java/org/apache/mahout/knn/cluster/StreamingKMeans.java#L47
>