Posted to dev@mahout.apache.org by Derek O'Callaghan <de...@ucd.ie> on 2010/09/21 16:39:12 UTC
Possible CDbw clustering evaluation problem
Hi Jeff,
I've been trying out the CDbwDriver today, and I'm having a problem
running it on the clusters I've generated from my data, whereby I get
all 0s printed for the following lines in CDbwDriver.job():
System.out.println("CDbw = " + evaluator.getCDbw());
System.out.println("Intra-cluster density = " +
evaluator.intraClusterDensity());
System.out.println("Inter-cluster density = " +
evaluator.interClusterDensity());
System.out.println("Separation = " + evaluator.separation());
Stepping through this, I found a problem at these lines in
CDbwEvaluator.setStDev():
Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
SquareRootFunction()).divide(s0);
double d = std.zSum() / std.size();
'd' was being set to NaN for one of my clusters, caused by
"s2.times(s0).minus(s1.times(s1))" returning a negative number, and so
the subsequent sqrt failed. Looking at the cluster which had the
problem, I saw that it only contained one point. However, 'repPts' in
setStDev() contained 3 points, in fact 3 copies of the same sole cluster
inhabitant point. This appeared to be causing the std calculation to
fail, I guess from floating point inaccuracies.
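Reduced to scalars, the failure mode can be reproduced in isolation (this is a standalone sketch of the same running-sums formula, not the Mahout code itself):

```java
// Standalone sketch (not the Mahout code) of the running-sums std computation
// in CDbwEvaluator.setStDev(), reduced to scalars. With n copies of one point,
// s2*s0 - s1*s1 is exactly zero in real arithmetic, so any rounding that
// pushes it below zero makes Math.sqrt() return NaN.
public class RunningStdDemo {

    static double runningStd(double[] xs) {
        double s0 = 0.0; // count of points
        double s1 = 0.0; // sum of values
        double s2 = 0.0; // sum of squares
        for (double x : xs) {
            s0 += 1.0;
            s1 += x;
            s2 += x * x;
        }
        // same shape as sqrt(s2*s0 - s1*s1) / s0 in setStDev()
        return Math.sqrt(s2 * s0 - s1 * s1) / s0;
    }

    public static void main(String[] args) {
        double x = 0.1 + 0.2; // a value carrying representation error
        // three copies of the same point, like the duplicated representative
        // point; the result is 0 in exact arithmetic but may come out as NaN
        System.out.println(runningStd(new double[] {x, x, x}));
    }
}
```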
I then started digging back further to see why there were 3 copies of
the same point in 'repPts'. FYI I had specified numIterations = 2 to
CDbwMapper.runJob(). Stepping through the code, I see the following
happening:
- CDbwDriver.writeInitialState() writes out the cluster centroids to
"representatives-0", with this particular point in question being
written out as the representative for its cluster.
- CDbwMapper loads these into 'representativePoints' via
setup()/getRepresentativePoints()
- When CDbwMapper.map() is called with this point, it will be added to
'mostDistantPoints'
- CDbwReducer loads the mapper 'representativePoints' into
'referencePoints' via setup()/CDbwMapper.getRepresentativePoints()
- CDbwReducer writes out the same point twice, once by writing it out as
a most distant point in reduce(), and then again while writing it out as
a reference/representative point in cleanup()
- The process repeats, and an additional copy of the point is written
out by the reducer during each iteration, on top of those from the
previous iteration.
- Later on, the evaluator fails in the std calculation as described above.
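The accumulation in the steps above can be modelled with a toy simulation (an assumed simplification of the behaviour, not the actual mapper/reducer code):

```java
// Toy model (an assumption, not Mahout code) of the accumulation described
// above: a single-point cluster's representative-points list grows by one
// duplicate per iteration, because the reducer re-emits the existing
// reference points in cleanup() plus the "most distant" point in reduce(),
// and for a one-point cluster these are all the same point.
import java.util.ArrayList;
import java.util.List;

public class DuplicationDemo {

    static List<String> runIterations(String soleMember, int numIterations) {
        List<String> repPts = new ArrayList<>();
        repPts.add(soleMember); // writeInitialState(): the centroid is the point itself
        for (int i = 0; i < numIterations; i++) {
            // cleanup(): re-emit the current reference points
            List<String> next = new ArrayList<>(repPts);
            // reduce(): the most distant point is the same sole point
            next.add(soleMember);
            repPts = next;
        }
        return repPts;
    }

    public static void main(String[] args) {
        // with numIterations = 2 this yields 3 copies, matching the observation above
        System.out.println(runIterations("p", 2).size()); // prints 3
    }
}
```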
I'm wondering if the quickest solution would be to change the following
statement in CDbwDriver.writeInitialState():
if (!(cluster instanceof DirichletCluster) || ((DirichletCluster)
cluster).getTotalCount() > 0) {
to ignore clusters which only contain one point? The mapper would then
need to check if there was an entry for the cluster id key in
representative points before doing anything with the point.
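The mapper-side check could look roughly like this (a sketch only; the field and method names are my assumptions, not the actual CDbwMapper API):

```java
// Hypothetical shape of the mapper-side guard suggested above (names are
// assumptions, not the real CDbwMapper code): points whose cluster never made
// it into representatives-0 are simply skipped.
import java.util.HashMap;
import java.util.Map;

public class MapperGuardSketch {

    private final Map<Integer, double[]> representativePoints = new HashMap<>();

    void register(int clusterId, double[] representative) {
        representativePoints.put(clusterId, representative);
    }

    // true when map() should process a point belonging to clusterId
    boolean accept(int clusterId) {
        // single-point clusters were filtered out of the initial state,
        // so they have no entry in the representative points map
        return representativePoints.containsKey(clusterId);
    }
}
```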
Does the issue also point to a separate problem with the std
calculation, in that it's possible that negative numbers are passed to
sqrt()?
Thanks,
Derek
Re: Possible CDbw clustering evaluation problem
Posted by Ted Dunning <te...@gmail.com>.
One general technique that can help with these kinds of problems (std <= 0)
is to do the std calculation assuming a prior distribution on the standard
deviation. In practice, this comes down to assuming that you have some
number of prior observations with non-zero deviation. You can implement
this by starting the sum at epsilon > 0 and then adding that epsilon to the
number of observations that you divide by at the end. If using an on-line
computation, you just start the initial estimate at something slightly
positive and start the count of the number of items at a small positive
number that is << 1. This causes negligible bias once real data is
observed, but prevents the variance from ever being negative.
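In scalar form, that comes down to something like this (class and parameter names are my own, not Mahout API):

```java
// Scalar sketch of the prior-observation trick: prime the running sums with a
// small epsilon so the variance estimate stays strictly positive even when all
// observed points are identical. With epsilon = 0 it reduces to the plain
// running-sums formula.
public class PriorStd {

    static double stdWithPrior(double[] xs, double epsilon) {
        double s0 = epsilon; // epsilon "prior observations" added to the count
        double s1 = 0.0;
        double s2 = epsilon; // the prior observations carry non-zero deviation
        for (double x : xs) {
            s0 += 1.0;
            s1 += x;
            s2 += x * x;
        }
        double variance = (s2 * s0 - s1 * s1) / (s0 * s0);
        // max() is belt and braces; for realistic magnitudes the prior term
        // already dominates any rounding error
        return Math.sqrt(Math.max(variance, 0.0));
    }
}
```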
On Tue, Sep 21, 2010 at 10:55 AM, Jeff Eastman
<jd...@windwardsolutions.com>wrote:
Re: Possible CDbw clustering evaluation problem
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I'm coming to the same conclusion. In situations where the number of
clusteredPoints is smaller than the number of representative points
being requested there will be duplication of some of the points in the
representative points output. Since the cluster center is always the
first representative point, it will be the likely one. I think the
representative point job is doing things correctly. What I see inside
the evaluator, however, is that it has some brittleness in some of these
situations.
I'm writing some tests to try to duplicate these errors, building off of
the TestCDbwEvaluator.testCDbw1() test. I can duplicate your exception
but don't yet have a solution.
On 9/21/10 12:34 PM, Derek O'Callaghan wrote:
Re: Possible CDbw clustering evaluation problem
Posted by Derek O'Callaghan <de...@ucd.ie>.
Hi Jeff,
I made a quick change in CDbwDriver.writeInitialState(), changing:
if (!(cluster instanceof DirichletCluster) || ((DirichletCluster)
cluster).getTotalCount() > 0) {
to:
if ((cluster instanceof DirichletCluster && ((DirichletCluster)
cluster).getTotalCount() > 1) || cluster.getNumPoints() > 1) {
while also adding a null test in the mapper, and I get 4 non-zero values
printed at the end of the evaluator as expected. However, I'm not sure
the if statement change is the correct solution, given that
getTotalCount() and getNumPoints() return the number of points observed
while building the clusters, not the actual number of clustered points
from the set that's passed to the mapper. In this particular case, it so
happens that number observed = number clustered = 1, but I guess this
may not be the case with other data/clusters.
Regarding the std calculation issue, I had a problem running Dirichlet
at the weekend, in that pdf was being calculated as NaN after a number
of iterations. It might be a similar problem, I'll take a look at it
again and let you know if I find anything.
Thanks,
Derek
On 21/09/10 16:50, Jeff Eastman wrote:
Re: Possible CDbw clustering evaluation problem
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Derek,
Thanks for taking the time to look into CDbw. This is the first time
anybody besides me has looked at it afaict and it is still quite
experimental. I agree with your analysis and have seen this occur
myself. It's a pathological case which is not handled well and your
proposed fix may in fact be the best solution.
On the std calculation itself, it is correct for scalar values of s0, s1
and s2 but I'm not as confident that it extrapolates to vectors. It also
has potential overflow, underflow and rounding issues but the running
sums method is convenient and is used throughout the clustering code via
AbstractCluster.computeParameters(). Most clustering doesn't really rely
on the std (Dirichlet does to compute pdf for Gaussian models) and this
is the only situation where I've seen this error.
Finally, checking the computeParameters() implementation, it does not
perform the std computation unless s0 > 1, so ignoring clusters with
zero or one point is probably the right thing to do. Does it fix the
problems you are seeing? I will write up a test today and commit a change if it does.
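In scalar form, the guard amounts to something like this (a sketch only; the real computeParameters() operates on vector-valued sums):

```java
// Scalar sketch of the s0 > 1 guard: skip the sqrt entirely for clusters with
// zero or one point, where the running-sums variance is degenerate and
// rounding can push it below zero.
public class GuardedStd {

    static double safeStd(double s0, double s1, double s2) {
        if (s0 <= 1) {
            return 0.0; // 0 or 1 observations: no spread to measure
        }
        return Math.sqrt(s2 * s0 - s1 * s1) / s0;
    }
}
```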
Jeff
On 9/21/10 10:39 AM, Derek O'Callaghan wrote: