Posted to dev@mahout.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2010/09/28 21:28:13 UTC

Standard Deviation of a Set of Vectors

  Hi Ted,

The clustering code computes this value for cluster radius. Currently, 
it is done with a running sums approach (s^0, s^1, s^2) that computes 
the std of each vector term using:

Vector std = s2.times(s0).minus(s1.times(s1)).assign(new 
SquareRootFunction()).divide(s0);

For CDbw, a single scalar average std value is needed, and this is 
currently computed by averaging the vector terms:

double d = std.zSum() / std.size();

The more I read about it, however, the less confident I am about this 
approach. The paper itself seems to indicate a covariance approach, but 
I am lost in their notation. See page 5, just above Definition 1.

www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
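
For concreteness, here is a minimal plain-Java sketch of the same 
running-sums arithmetic, using double[] instead of the Vector API; the 
class and method names are illustrative only:

public class RunningSumsStd {

  // Per-term std via sqrt(s0*s2 - s1^2) / s0, as in the Vector snippet above.
  static double[] termStd(double s0, double[] s1, double[] s2) {
    double[] std = new double[s1.length];
    for (int i = 0; i < s1.length; i++) {
      std[i] = Math.sqrt(s2[i] * s0 - s1[i] * s1[i]) / s0;
    }
    return std;
  }

  // Scalar "average std" for CDbw: the mean of the per-term stds.
  static double averageStd(double[] std) {
    double sum = 0.0;
    for (double v : std) {
      sum += v;
    }
    return sum / std.length;
  }

  public static void main(String[] args) {
    double[][] points = {{1.0, 2.0}, {1.5, 2.5}, {0.5, 1.5}};
    double s0 = points.length;                  // s^0: count
    double[] s1 = new double[2];
    double[] s2 = new double[2];
    for (double[] p : points) {
      for (int i = 0; i < p.length; i++) {
        s1[i] += p[i];                          // s^1: sum of values
        s2[i] += p[i] * p[i];                   // s^2: sum of squares
      }
    }
    System.out.println(averageStd(termStd(s0, s1, s2)));  // ~0.408 for this data
  }
}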


Re: Standard Deviation of a Set of Vectors

Posted by Derek O'Callaghan <de...@ucd.ie>.
It's late in the day and I'm looking to make a quick fix ;)

On 29/09/10 18:49, Ted Dunning wrote:
> Why ignore them?  They seem to be pretty awesomely clustered!
>
> One reasonable approach is to remove duplicate items before clustering.
>
> On Wed, Sep 29, 2010 at 10:45 AM, Derek O'Callaghan<derek.ocallaghan@ucd.ie
>    
>> wrote:
>>      
>    
>> I'm wondering if the quickest and easiest solution is to simply ignore such
>> clusters [with duplicate members], i.e. those that currently generate a NaN
>> std? I'm not sure if it's the "correct" approach though...
>>
>>      
>    

Re: Standard Deviation of a Set of Vectors

Posted by Ted Dunning <te...@gmail.com>.
Why ignore them?  They seem to be pretty awesomely clustered!

One reasonable approach is to remove duplicate items before clustering.
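
A minimal sketch of that pre-clustering dedup step, using plain double[] 
points (the wrapper below is illustrative only, not Mahout code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupePoints {

  // Keep the first occurrence of each distinct point, preserving input order.
  static List<double[]> withoutDuplicates(List<double[]> points) {
    Set<List<Double>> seen = new LinkedHashSet<List<Double>>();
    List<double[]> unique = new ArrayList<double[]>();
    for (double[] p : points) {
      List<Double> key = new ArrayList<Double>(p.length);
      for (double v : p) {
        key.add(v);                 // box so equals()/hashCode() compare by value
      }
      if (seen.add(key)) {
        unique.add(p);
      }
    }
    return unique;
  }

  public static void main(String[] args) {
    List<double[]> points = Arrays.asList(
        new double[] {1.0, 2.0}, new double[] {1.0, 2.0}, new double[] {3.0, 4.0});
    System.out.println(withoutDuplicates(points).size());  // 2
  }
}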

On Wed, Sep 29, 2010 at 10:45 AM, Derek O'Callaghan <derek.ocallaghan@ucd.ie
> wrote:

> I'm wondering if the quickest and easiest solution is to simply ignore such
> clusters [with duplicate members], i.e. those that currently generate a NaN
> std? I'm not sure if it's the "correct" approach though...
>

Re: Standard Deviation of a Set of Vectors

Posted by Ted Dunning <te...@gmail.com>.
If there isn't much demand here, I would just document the difference rather
than converge them.

On Fri, Oct 1, 2010 at 1:46 AM, Derek O'Callaghan
<de...@ucd.ie>wrote:

> I'd also be inclined to have the B option for consistency, although I get
> the feeling that not too many people are using the sequential version, so
> perhaps just documenting it is enough for now if there are higher priorities
> for 0.4.
>
> Derek
>
> On 30/09/10 18:31, Jeff Eastman wrote:
>
>>  Derek,
>>
>> The Canopy implementation was probably one of the first Mahout commits.
>> Its reference implementation performs a single pass over the data and, in
>> your case, produces 128 canopies. It is the correct, published Canopy
>> algorithm. In order to become scalable, the MR version does this in each
>> mapper, and then again in the reducer to combine the results of the mapper
>> canopies. This approach was taken from a Google presentation, iirc, and it
>> seems to produce good results. At least it has withstood the test of time.
>>
>> When I added the sequential execution mode to canopy, I just used the
>> existing reference implementation. Now you have noticed that the results are
>> quite different when running the MR version beside the sequential version.
>>
>> I'm not sure which knob to turn here: A) try to modify the MR version to
>> perform a single pass; B) add another pass to the sequential version; or C)
>> just document the difference. A is a hard problem (maybe 0.5) and B an easy
>> change (ok for 0.4). Going for the "low hanging fruit", I'm inclined to do B
>> for consistency.
>
>

Re: Standard Deviation of a Set of Vectors

Posted by Derek O'Callaghan <de...@ucd.ie>.
Hi Jeff,

Thanks for the info on Canopy. In my case, given that I'm seeing better 
results with the MR version, I'll stick with that for now. I'd also be 
inclined to have the B option for consistency, although I get the 
feeling that not too many people are using the sequential version, so 
perhaps just documenting it is enough for now if there are higher 
priorities for 0.4.

Derek

On 30/09/10 18:31, Jeff Eastman wrote:
>  Derek,
>
> The Canopy implementation was probably one of the first Mahout 
> commits. Its reference implementation performs a single pass over the 
> data and, in your case, produces 128 canopies. It is the correct, 
> published Canopy algorithm. In order to become scalable, the MR 
> version does this in each mapper, and then again in the reducer to 
> combine the results of the mapper canopies. This approach was taken 
> from a Google presentation, iirc, and it seems to produce good 
> results. At least it has withstood the test of time.
>
> When I added the sequential execution mode to canopy, I just used the 
> existing reference implementation. Now you have noticed that the 
> results are quite different when running the MR version beside the 
> sequential version.
>
> I'm not sure which knob to turn here: A) try to modify the MR version 
> to perform a single pass; B) add another pass to the sequential 
> version; or C) just document the difference. A is a hard problem 
> (maybe 0.5) and B an easy change (ok for 0.4). Going for the "low 
> hanging fruit", I'm inclined to do B for consistency.
>
> Can we get some opinions on this from the other Mahouts?
>
> Jeff
>
> PS: On the usability of ClusterEvaluator.intraClusterDensity() (vs. 
> CDbwEvaluator.intraClusterDensity() I presume), I don't have an 
> opinion. Both are pretty experimental IMHO and I'd rather not use 
> "should" for either. It would be interesting to develop some standard 
> data sets against which to compare them both under all of the 
> clustering algorithms. Perhaps a nice wiki page or technical paper for 
> someone to write. I think both evaluators can give useful insight. 
> Again, pick your poison.
>
> On 9/30/10 12:36 PM, Derek O'Callaghan wrote:
>>
>>>> Thanks for the tip, I had been generating the representative points 
>>>> sequentially but was still using the MR versions of the clustering 
>>>> algorithms, I'll change that now.
>>> :)
>>
>> I just tried this, and there seems to be a difference in behaviour 
>> between the sequential and MR versions of Canopy. With MR:
>>
>>    * Mapper called for each point, which calls
>>      canopyClusterer.addPointToCanopies(point.get(), canopies); - in my
>>      case 128 canopies are created
>>    * Reducer called with the canopy centroid points, which then calls
>>      canopyClusterer.addPointToCanopies(point, canopies); for each of
>>      these centroids - and I end up with 11 canopies.
>>
>> And we end up with canopies of canopy centroids.
>>
>> However, the sequential version doesn't appear to have the equivalent 
>> of the Reducer steps, which means that it contains the original 
>> number of canopies. Should it also compute the "canopies of 
>> canopies"? At the moment, the MR version is working much better for 
>> me with the second canopy generation step, so I'll stick with this 
>> for now. I guess it should be consistent between sequential and MR? I 
>> should probably start a separate thread for this...
>>
>>
>>
>>>
>>> I guess I don't quite understand your question. Can you please 
>>> elaborate?
>>>
>>
>> Sorry, what I wanted to ask was: is it okay to use 
>> ClusterEvaluator.intraClusterDensity()? Or should only 
>> ClusterEvaluator.interClusterDensity() be used?
>>
>> I have to leave for the evening, but if you need me to check anything 
>> further here re: canopy I can take a look tomorrow.
>>
>

Re: Standard Deviation of a Set of Vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Derek,

The Canopy implementation was probably one of the first Mahout commits. 
Its reference implementation performs a single pass over the data and, 
in your case, produces 128 canopies. It is the correct, published Canopy 
algorithm. In order to become scalable, the MR version does this in each 
mapper, and then again in the reducer to combine the results of the 
mapper canopies. This approach was taken from a Google presentation, 
iirc, and it seems to produce good results. At least it has withstood 
the test of time.

When I added the sequential execution mode to canopy, I just used the 
existing reference implementation. Now you have noticed that the results 
are quite different when running the MR version beside the sequential 
version.

I'm not sure which knob to turn here: A) try to modify the MR version to 
perform a single pass; B) add another pass to the sequential version; or 
C) just document the difference. A is a hard problem (maybe 0.5) and B 
an easy change (ok for 0.4). Going for the "low hanging fruit", I'm 
inclined to do B for consistency.

Can we get some opinions on this from the other Mahouts?

Jeff

PS: On the usability of ClusterEvaluator.intraClusterDensity() (vs. 
CDbwEvaluator.intraClusterDensity() I presume), I don't have an opinion. 
Both are pretty experimental IMHO and I'd rather not use "should" for 
either. It would be interesting to develop some standard data sets 
against which to compare them both under all of the clustering 
algorithms. Perhaps a nice wiki page or technical paper for someone to 
write. I think both evaluators can give useful insight. Again, pick your 
poison.

On 9/30/10 12:36 PM, Derek O'Callaghan wrote:
>
>>> Thanks for the tip, I had been generating the representative points 
>>> sequentially but was still using the MR versions of the clustering 
>>> algorithms, I'll change that now.
>> :)
>
> I just tried this, and there seems to be a difference in behaviour 
> between the sequential and MR versions of Canopy. With MR:
>
>    * Mapper called for each point, which calls
>      canopyClusterer.addPointToCanopies(point.get(), canopies); - in my
>      case 128 canopies are created
>    * Reducer called with the canopy centroid points, which then calls
>      canopyClusterer.addPointToCanopies(point, canopies); for each of
>      these centroids - and I end up with 11 canopies.
>
> And we end up with canopies of canopy centroids.
>
> However, the sequential version doesn't appear to have the equivalent 
> of the Reducer steps, which means that it contains the original number 
> of canopies. Should it also compute the "canopies of canopies"? At the 
> moment, the MR version is working much better for me with the second 
> canopy generation step, so I'll stick with this for now. I guess it 
> should be consistent between sequential and MR? I should probably 
> start a separate thread for this...
>
>
>
>>
>> I guess I don't quite understand your question. Can you please 
>> elaborate?
>>
>
> Sorry, what I wanted to ask was: is it okay to use 
> ClusterEvaluator.intraClusterDensity()? Or should only 
> ClusterEvaluator.interClusterDensity() be used?
>
> I have to leave for the evening, but if you need me to check anything 
> further here re: canopy I can take a look tomorrow.
>


Re: Standard Deviation of a Set of Vectors

Posted by Derek O'Callaghan <de...@ucd.ie>.
>> Thanks for the tip, I had been generating the representative points 
>> sequentially but was still using the MR versions of the clustering 
>> algorithms, I'll change that now.
> :)

I just tried this, and there seems to be a difference in behaviour 
between the sequential and MR versions of Canopy. With MR:

    * Mapper called for each point, which calls
      canopyClusterer.addPointToCanopies(point.get(), canopies); - in my
      case 128 canopies are created
    * Reducer called with the canopy centroid points, which then calls
      canopyClusterer.addPointToCanopies(point, canopies); for each of
      these centroids - and I end up with 11 canopies.

And we end up with canopies of canopy centroids.

However, the sequential version doesn't appear to have the equivalent of 
the Reducer step, which means that it retains the original number of 
canopies. Should it also compute the "canopies of canopies"? At the 
moment, the MR version is working much better for me with the second 
canopy generation step, so I'll stick with this for now. I guess it 
should be consistent between sequential and MR? I should probably start 
a separate thread for this...
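
For illustration, a rough plain-Java sketch of what that second pass 
amounts to: run the canopy assignment once over the raw points, then 
again over the resulting canopy centers. The Canopy class and 
addPointToCanopies() below are simplified stand-ins modeled on the calls 
quoted above, not the actual CanopyClusterer code, and the second pass 
uses canopy centers where the MR reducer uses centroids:

import java.util.ArrayList;
import java.util.List;

public class TwoPassCanopySketch {

  static class Canopy {
    final double[] center;
    final List<double[]> points = new ArrayList<double[]>();
    Canopy(double[] center) {
      this.center = center;
      this.points.add(center);
    }
  }

  // Simplified stand-in for CanopyClusterer.addPointToCanopies(point, canopies):
  // a point joins every canopy within t1 of its center, and seeds a new canopy
  // only if it is not within t2 of any existing center.
  static void addPointToCanopies(double[] point, List<Canopy> canopies,
                                 double t1, double t2) {
    boolean stronglyBound = false;
    for (Canopy c : canopies) {
      double d = distance(point, c.center);
      if (d < t1) {
        c.points.add(point);
      }
      if (d < t2) {
        stronglyBound = true;
      }
    }
    if (!stronglyBound) {
      canopies.add(new Canopy(point));
    }
  }

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return Math.sqrt(sum);
  }

  // First pass over the raw points, second pass over the resulting canopy
  // centers, mirroring what the mapper + reducer combination does in the MR job.
  static List<Canopy> twoPass(List<double[]> points, double t1, double t2) {
    List<Canopy> firstPass = new ArrayList<Canopy>();
    for (double[] p : points) {
      addPointToCanopies(p, firstPass, t1, t2);
    }
    List<Canopy> secondPass = new ArrayList<Canopy>();
    for (Canopy c : firstPass) {
      addPointToCanopies(c.center, secondPass, t1, t2);
    }
    return secondPass;
  }
}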



>
> I guess I don't quite understand your question. Can you please elaborate?
>

Sorry, what I wanted to ask was: is it okay to use 
ClusterEvaluator.intraClusterDensity()? Or should only 
ClusterEvaluator.interClusterDensity() be used?

I have to leave for the evening, but if you need me to check anything 
further here re: canopy I can take a look tomorrow.

Re: Standard Deviation of a Set of Vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  On 9/30/10 11:38 AM, Derek O'Callaghan wrote:
> Thanks for the tip, I had been generating the representative points 
> sequentially but was still using the MR versions of the clustering 
> algorithms, I'll change that now.
:)
>
> Regarding ClusterEvaluator, it seems to rely on 
> RepresentativePointsDriver having been run already, as it loads these 
> in the ClusterEvaluator(Configuration conf, Path clustersIn) 
> constructor? I see another constructor ClusterEvaluator(Map<Integer, 
> List<VectorWritable>> representativePoints, List<Cluster> clusters, 
> DistanceMeasure measure) where you can specify these points, but this 
> is marked as "test only". Is it okay to use this, passing in the 
> cluster centres, or will it ultimately be removed?
Not planning to remove this as the unit tests require it and I won't 
remove them. If it is useful for you, go ahead. I will change the 
comment to "useful for testing"
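
For what it's worth, a hedged sketch of using that constructor with just 
the cluster centers as representative points. The constructor signature 
is the one quoted above; the imports and the Cluster accessors (getId(), 
getCenter()) are assumptions that may differ between Mahout versions:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.evaluation.ClusterEvaluator; // package may differ by version
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.VectorWritable;

public class CenterOnlyEvaluation {

  // Seed the representative-points map with each cluster's center only,
  // then hand everything to the "test only" constructor discussed above.
  public static ClusterEvaluator evaluatorFromCenters(List<Cluster> clusters) {
    Map<Integer, List<VectorWritable>> representativePoints =
        new HashMap<Integer, List<VectorWritable>>();
    for (Cluster cluster : clusters) {
      List<VectorWritable> reps = new ArrayList<VectorWritable>();
      reps.add(new VectorWritable(cluster.getCenter()));
      representativePoints.put(cluster.getId(), reps);
    }
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    return new ClusterEvaluator(representativePoints, clusters, measure);
  }
}
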
>
> I guess the question is, can ClusterEvaluator.intraClusterDensity() be 
> used, given that it relies on a set of points, and not just the centre 
> which is all that's required in interClusterDensity()? FYI I had to 
> modify my local copy to ignore my "identical points" cluster as it was 
> generating a NaN density.
>
I guess I don't quite understand your question. Can you please elaborate?


Re: Standard Deviation of a Set of Vectors

Posted by Derek O'Callaghan <de...@ucd.ie>.
Thanks for the tip, I had been generating the representative points 
sequentially but was still using the MR versions of the clustering 
algorithms, I'll change that now.

Regarding ClusterEvaluator, it seems to rely on 
RepresentativePointsDriver having been run already, as it loads these in 
the ClusterEvaluator(Configuration conf, Path clustersIn) constructor? I 
see another constructor ClusterEvaluator(Map<Integer, 
List<VectorWritable>> representativePoints, List<Cluster> clusters, 
DistanceMeasure measure) where you can specify these points, but this is 
marked as "test only". Is it okay to use this, passing in the cluster 
centres, or will it ultimately be removed?

I guess the question is, can ClusterEvaluator.intraClusterDensity() be 
used, given that it relies on a set of points, and not just the centre 
which is all that's required in interClusterDensity()? FYI I had to 
modify my local copy to ignore my "identical points" cluster as it was 
generating a NaN density.
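
The local workaround amounts to something like the following sketch 
(illustrative only, not the actual ClusterEvaluator code): skip any 
per-cluster density that came out NaN when averaging, so one degenerate 
cluster can't poison the overall intra-cluster density:

public class DensityAverages {

  // Average the per-cluster densities while skipping NaN entries, so a single
  // degenerate (e.g. all-identical-points) cluster can't turn the result into NaN.
  static double averageIgnoringNaN(double[] perClusterDensity) {
    double sum = 0.0;
    int count = 0;
    for (double d : perClusterDensity) {
      if (!Double.isNaN(d)) {
        sum += d;
        count++;
      }
    }
    return count == 0 ? Double.NaN : sum / count;
  }
}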


On 30/09/10 14:38, Jeff Eastman wrote:
>  The ClusterEvaluator compares the cluster centers directly to compute 
> the inter-cluster density whereas the CDbwEvaluator computes it by 
> counting the representative points which are within 2 stds (really 
> stdI + stdJ) from the center of a line segment between the closest 
> representative points of their respective clusters. These are very 
> different approaches and should be expected to yield different values.
>
> The ClusterEvaluator uses the algorithm from "Mahout in Action". I 
> don't know which is "more accurate", the CDbw approach is certainly 
> more sophisticated as it is measuring the density of representative 
> points in the inter-cluster region. Pick your poison.
>
> You know, you can run all the clustering algorithms in sequential mode 
> (-xm sequential) to improve performance if the data you are munching 
> is not too huge. The same is true for the representative points 
> calculations. See TestClusterEvaluator.testRepresentativePoints() for 
> an example of this.
>
> On 9/30/10 8:08 AM, Derek O'Callaghan wrote:
>> Sorry, I forgot to mention that I was also running ClusterEvaluator 
>> along with CDbwEvaluator yesterday. I was getting inter < intra with 
>> the former, and the reverse with the latter, and even allowing for 
>> the fact that they're different algorithms I was a bit suspicious of 
>> this.
>>
>> However, I'd also turned off KMeans to speed up the process while 
>> debugging, and was running CDbw directly on the generated Canopies. 
>> I've run it today with Canopy + KMeans and 10 representative points 
>> per cluster, and the results look better that lastnight, e.g.
>>
>> CDbw = 178.90082114134046
>> Intra-cluster density = 0.26970259013839315
>> Inter-cluster density = 0.2517752232288567
>> Separation = 663.326299719779
>>
>> I reran with Canopy only (to double-check) and this generated the 
>> following (I assume overlapping canopies are causing inter to be 
>> greater than intra):
>>
>> CDbw = 97.38825072133858
>> Intra-cluster density = 0.23594531216994152
>> Inter-cluster density = 0.9859430197400473
>> Separation = 412.7577268888219
>>
>> To confirm: it's running much better now than before, as the 
>> successful KMeans evaluation above includes the cluster with the 
>> identical points which are marginally different from the centre. 
>> Thanks for all the help on this.
>>
>> @Jeff: is ClusterEvaluator useful too, or should CDbwEvaluator be 
>> considered more accurate?
>>
>> On 29/09/10 20:51, Jeff Eastman wrote:
>>>  The CDbw intra-cluster density calculation uses the average std of 
>>> all the clusters in its normalization and produces an average of the 
>>> per-cluster intra-cluster densities so it's likely the other, looser 
>>> clusters are degrading that result.
>>>
>>> CDbw is not a panacea either. The Dirichlet unit test, for example, 
>>> produces a very low metric despite clustering over the same data set 
>>> that the DisplayDirichlet uses. That clustering finds almost exactly 
>>> the parameters of the input data set (try it), but the clusters 
>>> overlap and that messes with CDbw density calculations which reward 
>>> mostly non-overlapping clusters.
>>>
>>> I'm happy that the Online accumulator works better. I will plug it 
>>> in in my next commit and tweak the unit tests to accept its values.
>>>
>>> On 9/29/10 2:59 PM, Ted Dunning wrote:
>>>> That slight difference case is exactly where the running sums approach
>>>> fails.  I think that you found the problem.
>>>>
>>>> Your suspicion about inter being larger than intra confuses me.  
>>>> Isn't that
>>>> just another way of saying that clusters are very tight?
>>>>
>>>> On Wed, Sep 29, 2010 at 11:19 AM, Derek 
>>>> O'Callaghan<derek.ocallaghan@ucd.ie
>>>>> wrote:
>>>>> I just stepped through invalidCluster(), and it seems that there's 
>>>>> a slight
>>>>> difference between the centre and the other points, so it returns 
>>>>> false. I
>>>>> was positive that there was no difference when I stepped through 
>>>>> it last, I
>>>>> must have overlooked something, sorry about that.
>>>>>
>>>>> I just tried the OnlineGaussianAccumulator and it does run better, 
>>>>> in that
>>>>> I get values for the 4 metrics. One thing I need to check is why the
>>>>> inter-density is so much bigger than the intra-, I'm getting the 
>>>>> following
>>>>> values:
>>>>>
>>>>> CDbw = 68.51761788802385
>>>>> Intra-cluster density = 0.3734803797950363
>>>>> Inter-cluster density = 3.474415557178534
>>>>> Separation = 183.4570745741071
>>>>>
>>>>> When using RunningSums and ignoring the identical points cluster, 
>>>>> I get a
>>>>> similar issue in that inter = ~1.5, with intra = ~0.15. I have to 
>>>>> leave for
>>>>> the evening, I'll look into it tomorrow to see if I can determine 
>>>>> if it's
>>>>> correct.
>>>>>
>>>>> Thanks again.
>>>>>
>>>>> On 29/09/10 18:55, Jeff Eastman wrote:
>>>>>
>>>>>>   If all of the representative points for that cluster are 
>>>>>> identical then
>>>>>> they are also identical to the cluster center (the first 
>>>>>> representative
>>>>>> point) and should be pruned. I'm wondering why this was not 
>>>>>> detected in
>>>>>> invalidCluster, can you investigate that? You may also want to 
>>>>>> plug in an
>>>>>> instance of the new OnlineGaussianAccumulator to see if it does 
>>>>>> any better.
>>>>>> It is likely to me much more stable than the RunningSums...
>>>>>>
>>>>>> On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
>>>>>>
>>>>>>> Thanks for that Jeff. I tried the changes and get the same 
>>>>>>> result as
>>>>>>> expected. FYI I've investigated further and it seems that all of 
>>>>>>> the points
>>>>>>> in the affected cluster are identical, so it ends up as more or 
>>>>>>> less the
>>>>>>> same problem we had last week with clusters with total points<  #
>>>>>>> representative points, in that there are duplicate 
>>>>>>> representative points. In
>>>>>>> this case total>  # representative, but the end result is the same.
>>>>>>>
>>>>>>> I'm wondering if the quickest and easiest solution is to simply 
>>>>>>> ignore
>>>>>>> such clusters, i.e. those that currently generate a NaN std? I'm 
>>>>>>> not sure if
>>>>>>> it's the "correct" approach though...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 29/09/10 17:37, Jeff Eastman wrote:
>>>>>>>
>>>>>>>>   Hi Derek,
>>>>>>>>
>>>>>>>> I've committed some changes which will hopefully help in fixing 
>>>>>>>> this
>>>>>>>> problem but which do not yet accomplish that. As you can see 
>>>>>>>> from the new
>>>>>>>> CDbw test (testAlmostSameValueCluster) I tried creating a test 
>>>>>>>> cluster with
>>>>>>>> points identical to the cluster center but with one which 
>>>>>>>> differed from it
>>>>>>>> by Double.MIN_NORMAL in one element. That test failed to 
>>>>>>>> duplicate your
>>>>>>>> issue.
>>>>>>>>
>>>>>>>> The patch also factors out the std calculation into an 
>>>>>>>> implementor of
>>>>>>>> GaussianAccumulator. I factored the current std calculations 
>>>>>>>> out of
>>>>>>>> CDbwEvaluator into RunningSumsGaussianAccumulator and all the 
>>>>>>>> tests produced
>>>>>>>> the same results as before. With the new 
>>>>>>>> OnlineGaussianAccumulator plugged
>>>>>>>> in, the tests all return slightly different results but still 
>>>>>>>> no NaNs.
>>>>>>>>
>>>>>>>> I still have not added priors and I'm not entirely sure where 
>>>>>>>> to do
>>>>>>>> that. I've committed the changes so you can see my quandary.
>>>>>>>> OnlineGaussianAccumulator is still a work in progress but, 
>>>>>>>> since it is never
>>>>>>>> used it is in the commit for your viewing.
>>>>>>>>
>>>>>>>> Jeff
>>>>>>>>
>>>>>>>> On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
>>>>>>>>
>>>>>>>>> Thanks Jeff, I'll try out the changes when they're committed. 
>>>>>>>>> I tried a
>>>>>>>>> couple of things locally (removing the clusters/setting a 
>>>>>>>>> small prior), but
>>>>>>>>> I ended up with inter-density>  intra-density, so I suspect 
>>>>>>>>> I've slipped up
>>>>>>>>> somewhere. I'll hold off on it for now.
>>>>>>>>>
>>>>>>>>> On 29/09/10 13:48, Jeff Eastman wrote:
>>>>>>>>>
>>>>>>>>>>   Hi Derek,
>>>>>>>>>>
>>>>>>>>>> That makes sense. With the very, very tight cluster that your
>>>>>>>>>> clustering produced you've uncovered an instability in that 
>>>>>>>>>> std calculation.
>>>>>>>>>> I'm going to rework that method today to use a better 
>>>>>>>>>> algorithm and will add
>>>>>>>>>> a small prior in the process. I'm also going to add a unit 
>>>>>>>>>> test to reproduce
>>>>>>>>>> this problem first. Look for a commit in a couple of hours.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Jeff,
>>>>>>>>>>>
>>>>>>>>>>> FYI I checked the problem I was having in CDbwEvaluator with 
>>>>>>>>>>> the same
>>>>>>>>>>> dataset from the ClusterEvaluator thread, the problem is 
>>>>>>>>>>> occurring in the
>>>>>>>>>>> std calculation in CDbwEvaluator.computeStd(), in that
>>>>>>>>>>> s2.times(s0).minus(s1.times(s1)) generates negative values 
>>>>>>>>>>> which then
>>>>>>>>>>> produce NaN with the subsequent SquareRootFunction(). This 
>>>>>>>>>>> then sets the
>>>>>>>>>>> average std to NaN later on in intraClusterDensity(). It's 
>>>>>>>>>>> happening for the
>>>>>>>>>>> cluster I have with the almost-identical points.
>>>>>>>>>>>
>>>>>>>>>>> It's the same symptom as the problem last week, where this was
>>>>>>>>>>> happening when s0 was 1. Is the solution to ignore these 
>>>>>>>>>>> clusters, like the
>>>>>>>>>>> s0 = 1 clusters? Or to add a small prior std as was done for 
>>>>>>>>>>> the similar
>>>>>>>>>>> issue in NormalModel.pdf()?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Derek
>>>>>>>>>>>
>>>>>>>>>>> On 28/09/10 20:28, Jeff Eastman wrote:
>>>>>>>>>>>
>>>>>>>>>>>>   Hi Ted,
>>>>>>>>>>>>
>>>>>>>>>>>> The clustering code computes this value for cluster radius.
>>>>>>>>>>>> Currently, it is done with a running sums approach (s^0, 
>>>>>>>>>>>> s^1, s^2) that
>>>>>>>>>>>> computes the std of each vector term using:
>>>>>>>>>>>>
>>>>>>>>>>>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
>>>>>>>>>>>> SquareRootFunction()).divide(s0);
>>>>>>>>>>>>
>>>>>>>>>>>> For CDbw, they need a scalar, average std value, and this is
>>>>>>>>>>>> currently computed by averaging the vector terms:
>>>>>>>>>>>>
>>>>>>>>>>>> double d = std.zSum() / std.size();
>>>>>>>>>>>>
>>>>>>>>>>>> The more I read about it; however, the less confident I am 
>>>>>>>>>>>> about
>>>>>>>>>>>> this approach. The paper itself seems to indicate a 
>>>>>>>>>>>> covariance approach, but
>>>>>>>>>>>> I am lost in their notation. See page 5, just above 
>>>>>>>>>>>> Definition 1.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf 
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>
>>
>

Re: Standard Deviation of a Set of Vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  The ClusterEvaluator compares the cluster centers directly to compute 
the inter-cluster density whereas the CDbwEvaluator computes it by 
counting the representative points which are within 2 stds (really stdI 
+ stdJ) from the center of a line segment between the closest 
representative points of their respective clusters. These are very 
different approaches and should be expected to yield different values.
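
As a plain-Java illustration of that inter-cluster density idea (not the 
actual CDbwEvaluator code): find the closest pair of representative 
points between two clusters, take the midpoint of the segment joining 
them, and count the representative points within stdI + stdJ of that 
midpoint:

public class MidpointDensitySketch {

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return Math.sqrt(sum);
  }

  static int countNearMidpoint(double[][] repsI, double[][] repsJ,
                               double stdI, double stdJ) {
    // 1. Closest pair of representative points across the two clusters.
    double best = Double.POSITIVE_INFINITY;
    double[] closestI = null;
    double[] closestJ = null;
    for (double[] ri : repsI) {
      for (double[] rj : repsJ) {
        double d = distance(ri, rj);
        if (d < best) {
          best = d;
          closestI = ri;
          closestJ = rj;
        }
      }
    }
    if (closestI == null || closestJ == null) {
      return 0; // one of the clusters has no representative points
    }
    // 2. Midpoint of the segment between the closest pair.
    double[] mid = new double[closestI.length];
    for (int i = 0; i < mid.length; i++) {
      mid[i] = (closestI[i] + closestJ[i]) / 2.0;
    }
    // 3. Count representative points of both clusters within stdI + stdJ of it.
    double radius = stdI + stdJ;
    int count = 0;
    for (double[] r : repsI) {
      if (distance(r, mid) <= radius) {
        count++;
      }
    }
    for (double[] r : repsJ) {
      if (distance(r, mid) <= radius) {
        count++;
      }
    }
    return count;
  }
}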

The ClusterEvaluator uses the algorithm from "Mahout in Action". I don't 
know which is "more accurate", the CDbw approach is certainly more 
sophisticated as it is measuring the density of representative points in 
the inter-cluster region. Pick your poison.

You know, you can run all the clustering algorithms in sequential mode 
(-xm sequential) to improve performance if the data you are munching is 
not too huge. The same is true for the representative points 
calculations. See TestClusterEvaluator.testRepresentativePoints() for an 
example of this.

On 9/30/10 8:08 AM, Derek O'Callaghan wrote:
> Sorry, I forgot to mention that I was also running ClusterEvaluator 
> along with CDbwEvaluator yesterday. I was getting inter < intra with 
> the former, and the reverse with the latter, and even allowing for the 
> fact that they're different algorithms I was a bit suspicious of this.
>
> However, I'd also turned off KMeans to speed up the process while 
> debugging, and was running CDbw directly on the generated Canopies. 
> I've run it today with Canopy + KMeans and 10 representative points 
> per cluster, and the results look better that lastnight, e.g.
>
> CDbw = 178.90082114134046
> Intra-cluster density = 0.26970259013839315
> Inter-cluster density = 0.2517752232288567
> Separation = 663.326299719779
>
> I reran with Canopy only (to double-check) and this generated the 
> following (I assume overlapping canopies are causing inter to be 
> greater than intra):
>
> CDbw = 97.38825072133858
> Intra-cluster density = 0.23594531216994152
> Inter-cluster density = 0.9859430197400473
> Separation = 412.7577268888219
>
> To confirm: it's running much better now than before, as the 
> successful KMeans evaluation above includes the cluster with the 
> identical points which are marginally different from the centre. 
> Thanks for all the help on this.
>
> @Jeff: is ClusterEvaluator useful too, or should CDbwEvaluator be 
> considered more accurate?
>
> On 29/09/10 20:51, Jeff Eastman wrote:
>>  The CDbw intra-cluster density calculation uses the average std of 
>> all the clusters in its normalization and produces an average of the 
>> per-cluster intra-cluster densities so it's likely the other, looser 
>> clusters are degrading that result.
>>
>> CDbw is not a panacea either. The Dirichlet unit test, for example, 
>> produces a very low metric despite clustering over the same data set 
>> that the DisplayDirichlet uses. That clustering finds almost exactly 
>> the parameters of the input data set (try it), but the clusters 
>> overlap and that messes with CDbw density calculations which reward 
>> mostly non-overlapping clusters.
>>
>> I'm happy that the Online accumulator works better. I will plug it in 
>> in my next commit and tweak the unit tests to accept its values.
>>
>> On 9/29/10 2:59 PM, Ted Dunning wrote:
>>> That slight difference case is exactly where the running sums approach
>>> fails.  I think that you found the problem.
>>>
>>> Your suspicion about inter being larger than intra confuses me.  
>>> Isn't that
>>> just another way of saying that clusters are very tight?
>>>
>>> On Wed, Sep 29, 2010 at 11:19 AM, Derek 
>>> O'Callaghan<derek.ocallaghan@ucd.ie
>>>> wrote:
>>>> I just stepped through invalidCluster(), and it seems that there's 
>>>> a slight
>>>> difference between the centre and the other points, so it returns 
>>>> false. I
>>>> was positive that there was no difference when I stepped through it 
>>>> last, I
>>>> must have overlooked something, sorry about that.
>>>>
>>>> I just tried the OnlineGaussianAccumulator and it does run better, 
>>>> in that
>>>> I get values for the 4 metrics. One thing I need to check is why the
>>>> inter-density is so much bigger than the intra-, I'm getting the 
>>>> following
>>>> values:
>>>>
>>>> CDbw = 68.51761788802385
>>>> Intra-cluster density = 0.3734803797950363
>>>> Inter-cluster density = 3.474415557178534
>>>> Separation = 183.4570745741071
>>>>
>>>> When using RunningSums and ignoring the identical points cluster, I 
>>>> get a
>>>> similar issue in that inter = ~1.5, with intra = ~0.15. I have to 
>>>> leave for
>>>> the evening, I'll look into it tomorrow to see if I can determine 
>>>> if it's
>>>> correct.
>>>>
>>>> Thanks again.
>>>>
>>>> On 29/09/10 18:55, Jeff Eastman wrote:
>>>>
>>>>>   If all of the representative points for that cluster are 
>>>>> identical then
>>>>> they are also identical to the cluster center (the first 
>>>>> representative
>>>>> point) and should be pruned. I'm wondering why this was not 
>>>>> detected in
>>>>> invalidCluster, can you investigate that? You may also want to 
>>>>> plug in an
>>>>> instance of the new OnlineGaussianAccumulator to see if it does 
>>>>> any better.
>>>>> It is likely to me much more stable than the RunningSums...
>>>>>
>>>>> On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
>>>>>
>>>>>> Thanks for that Jeff. I tried the changes and get the same result as
>>>>>> expected. FYI I've investigated further and it seems that all of 
>>>>>> the points
>>>>>> in the affected cluster are identical, so it ends up as more or 
>>>>>> less the
>>>>>> same problem we had last week with clusters with total points<  #
>>>>>> representative points, in that there are duplicate representative 
>>>>>> points. In
>>>>>> this case total>  # representative, but the end result is the same.
>>>>>>
>>>>>> I'm wondering if the quickest and easiest solution is to simply 
>>>>>> ignore
>>>>>> such clusters, i.e. those that currently generate a NaN std? I'm 
>>>>>> not sure if
>>>>>> it's the "correct" approach though...
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 29/09/10 17:37, Jeff Eastman wrote:
>>>>>>
>>>>>>>   Hi Derek,
>>>>>>>
>>>>>>> I've committed some changes which will hopefully help in fixing 
>>>>>>> this
>>>>>>> problem but which do not yet accomplish that. As you can see 
>>>>>>> from the new
>>>>>>> CDbw test (testAlmostSameValueCluster) I tried creating a test 
>>>>>>> cluster with
>>>>>>> points identical to the cluster center but with one which 
>>>>>>> differed from it
>>>>>>> by Double.MIN_NORMAL in one element. That test failed to 
>>>>>>> duplicate your
>>>>>>> issue.
>>>>>>>
>>>>>>> The patch also factors out the std calculation into an 
>>>>>>> implementor of
>>>>>>> GaussianAccumulator. I factored the current std calculations out of
>>>>>>> CDbwEvaluator into RunningSumsGaussianAccumulator and all the 
>>>>>>> tests produced
>>>>>>> the same results as before. With the new 
>>>>>>> OnlineGaussianAccumulator plugged
>>>>>>> in, the tests all return slightly different results but still no 
>>>>>>> NaNs.
>>>>>>>
>>>>>>> I still have not added priors and I'm not entirely sure where to do
>>>>>>> that. I've committed the changes so you can see my quandary.
>>>>>>> OnlineGaussianAccumulator is still a work in progress but, since 
>>>>>>> it is never
>>>>>>> used it is in the commit for your viewing.
>>>>>>>
>>>>>>> Jeff
>>>>>>>
>>>>>>> On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
>>>>>>>
>>>>>>>> Thanks Jeff, I'll try out the changes when they're committed. I 
>>>>>>>> tried a
>>>>>>>> couple of things locally (removing the clusters/setting a small 
>>>>>>>> prior), but
>>>>>>>> I ended up with inter-density>  intra-density, so I suspect 
>>>>>>>> I've slipped up
>>>>>>>> somewhere. I'll hold off on it for now.
>>>>>>>>
>>>>>>>> On 29/09/10 13:48, Jeff Eastman wrote:
>>>>>>>>
>>>>>>>>>   Hi Derek,
>>>>>>>>>
>>>>>>>>> That makes sense. With the very, very tight cluster that your
>>>>>>>>> clustering produced you've uncovered an instability in that 
>>>>>>>>> std calculation.
>>>>>>>>> I'm going to rework that method today to use a better 
>>>>>>>>> algorithm and will add
>>>>>>>>> a small prior in the process. I'm also going to add a unit 
>>>>>>>>> test to reproduce
>>>>>>>>> this problem first. Look for a commit in a couple of hours.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
>>>>>>>>>
>>>>>>>>>> Hi Jeff,
>>>>>>>>>>
>>>>>>>>>> FYI I checked the problem I was having in CDbwEvaluator with 
>>>>>>>>>> the same
>>>>>>>>>> dataset from the ClusterEvaluator thread, the problem is 
>>>>>>>>>> occurring in the
>>>>>>>>>> std calculation in CDbwEvaluator.computeStd(), in that
>>>>>>>>>> s2.times(s0).minus(s1.times(s1)) generates negative values 
>>>>>>>>>> which then
>>>>>>>>>> produce NaN with the subsequent SquareRootFunction(). This 
>>>>>>>>>> then sets the
>>>>>>>>>> average std to NaN later on in intraClusterDensity(). It's 
>>>>>>>>>> happening for the
>>>>>>>>>> cluster I have with the almost-identical points.
>>>>>>>>>>
>>>>>>>>>> It's the same symptom as the problem last week, where this was
>>>>>>>>>> happening when s0 was 1. Is the solution to ignore these 
>>>>>>>>>> clusters, like the
>>>>>>>>>> s0 = 1 clusters? Or to add a small prior std as was done for 
>>>>>>>>>> the similar
>>>>>>>>>> issue in NormalModel.pdf()?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Derek
>>>>>>>>>>
>>>>>>>>>> On 28/09/10 20:28, Jeff Eastman wrote:
>>>>>>>>>>
>>>>>>>>>>>   Hi Ted,
>>>>>>>>>>>
>>>>>>>>>>> The clustering code computes this value for cluster radius.
>>>>>>>>>>> Currently, it is done with a running sums approach (s^0, 
>>>>>>>>>>> s^1, s^2) that
>>>>>>>>>>> computes the std of each vector term using:
>>>>>>>>>>>
>>>>>>>>>>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
>>>>>>>>>>> SquareRootFunction()).divide(s0);
>>>>>>>>>>>
>>>>>>>>>>> For CDbw, they need a scalar, average std value, and this is
>>>>>>>>>>> currently computed by averaging the vector terms:
>>>>>>>>>>>
>>>>>>>>>>> double d = std.zSum() / std.size();
>>>>>>>>>>>
>>>>>>>>>>> The more I read about it; however, the less confident I am 
>>>>>>>>>>> about
>>>>>>>>>>> this approach. The paper itself seems to indicate a 
>>>>>>>>>>> covariance approach, but
>>>>>>>>>>> I am lost in their notation. See page 5, just above 
>>>>>>>>>>> Definition 1.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf 
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>
>


Re: Standard Deviation of a Set of Vectors

Posted by Derek O'Callaghan <de...@ucd.ie>.
Sorry, I forgot to mention that I was also running ClusterEvaluator 
along with CDbwEvaluator yesterday. I was getting inter < intra with the 
former, and the reverse with the latter, and even allowing for the fact 
that they're different algorithms I was a bit suspicious of this.

However, I'd also turned off KMeans to speed up the process while 
debugging, and was running CDbw directly on the generated Canopies. I've 
run it today with Canopy + KMeans and 10 representative points per 
cluster, and the results look better than last night, e.g.

CDbw = 178.90082114134046
Intra-cluster density = 0.26970259013839315
Inter-cluster density = 0.2517752232288567
Separation = 663.326299719779

I reran with Canopy only (to double-check) and this generated the 
following (I assume overlapping canopies are causing inter to be greater 
than intra):

CDbw = 97.38825072133858
Intra-cluster density = 0.23594531216994152
Inter-cluster density = 0.9859430197400473
Separation = 412.7577268888219

To confirm: it's running much better now than before, as the successful 
KMeans evaluation above includes the cluster with the identical points 
which are marginally different from the centre. Thanks for all the help 
on this.

@Jeff: is ClusterEvaluator useful too, or should CDbwEvaluator be 
considered more accurate?

On 29/09/10 20:51, Jeff Eastman wrote:
>  The CDbw intra-cluster density calculation uses the average std of 
> all the clusters in its normalization and produces an average of the 
> per-cluster intra-cluster densities so it's likely the other, looser 
> clusters are degrading that result.
>
> CDbw is not a panacea either. The Dirichlet unit test, for example, 
> produces a very low metric despite clustering over the same data set 
> that the DisplayDirichlet uses. That clustering finds almost exactly 
> the parameters of the input data set (try it), but the clusters 
> overlap and that messes with CDbw density calculations which reward 
> mostly non-overlapping clusters.
>
> I'm happy that the Online accumulator works better. I will plug it in 
> in my next commit and tweak the unit tests to accept its values.
>
> On 9/29/10 2:59 PM, Ted Dunning wrote:
>> That slight difference case is exactly where the running sums approach
>> fails.  I think that you found the problem.
>>
>> Your suspicion about inter being larger than intra confuses me.  
>> Isn't that
>> just another way of saying that clusters are very tight?
>>
>> On Wed, Sep 29, 2010 at 11:19 AM, Derek 
>> O'Callaghan<derek.ocallaghan@ucd.ie
>>> wrote:
>>> I just stepped through invalidCluster(), and it seems that there's a 
>>> slight
>>> difference between the centre and the other points, so it returns 
>>> false. I
>>> was positive that there was no difference when I stepped through it 
>>> last, I
>>> must have overlooked something, sorry about that.
>>>
>>> I just tried the OnlineGaussianAccumulator and it does run better, 
>>> in that
>>> I get values for the 4 metrics. One thing I need to check is why the
>>> inter-density is so much bigger than the intra-, I'm getting the 
>>> following
>>> values:
>>>
>>> CDbw = 68.51761788802385
>>> Intra-cluster density = 0.3734803797950363
>>> Inter-cluster density = 3.474415557178534
>>> Separation = 183.4570745741071
>>>
>>> When using RunningSums and ignoring the identical points cluster, I 
>>> get a
>>> similar issue in that inter = ~1.5, with intra = ~0.15. I have to 
>>> leave for
>>> the evening, I'll look into it tomorrow to see if I can determine if 
>>> it's
>>> correct.
>>>
>>> Thanks again.
>>>
>>> On 29/09/10 18:55, Jeff Eastman wrote:
>>>
>>>>   If all of the representative points for that cluster are 
>>>> identical then
>>>> they are also identical to the cluster center (the first 
>>>> representative
>>>> point) and should be pruned. I'm wondering why this was not 
>>>> detected in
>>>> invalidCluster, can you investigate that? You may also want to plug 
>>>> in an
>>>> instance of the new OnlineGaussianAccumulator to see if it does any 
>>>> better.
>>>> It is likely to me much more stable than the RunningSums...
>>>>
>>>> On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
>>>>
>>>>> Thanks for that Jeff. I tried the changes and get the same result as
>>>>> expected. FYI I've investigated further and it seems that all of 
>>>>> the points
>>>>> in the affected cluster are identical, so it ends up as more or 
>>>>> less the
>>>>> same problem we had last week with clusters with total points<  #
>>>>> representative points, in that there are duplicate representative 
>>>>> points. In
>>>>> this case total>  # representative, but the end result is the same.
>>>>>
>>>>> I'm wondering if the quickest and easiest solution is to simply 
>>>>> ignore
>>>>> such clusters, i.e. those that currently generate a NaN std? I'm 
>>>>> not sure if
>>>>> it's the "correct" approach though...
>>>>>
>>>>>
>>>>>
>>>>> On 29/09/10 17:37, Jeff Eastman wrote:
>>>>>
>>>>>>   Hi Derek,
>>>>>>
>>>>>> I've committed some changes which will hopefully help in fixing this
>>>>>> problem but which do not yet accomplish that. As you can see from 
>>>>>> the new
>>>>>> CDbw test (testAlmostSameValueCluster) I tried creating a test 
>>>>>> cluster with
>>>>>> points identical to the cluster center but with one which 
>>>>>> differed from it
>>>>>> by Double.MIN_NORMAL in one element. That test failed to 
>>>>>> duplicate your
>>>>>> issue.
>>>>>>
>>>>>> The patch also factors out the std calculation into an 
>>>>>> implementor of
>>>>>> GaussianAccumulator. I factored the current std calculations out of
>>>>>> CDbwEvaluator into RunningSumsGaussianAccumulator and all the 
>>>>>> tests produced
>>>>>> the same results as before. With the new 
>>>>>> OnlineGaussianAccumulator plugged
>>>>>> in, the tests all return slightly different results but still no 
>>>>>> NaNs.
>>>>>>
>>>>>> I still have not added priors and I'm not entirely sure where to do
>>>>>> that. I've committed the changes so you can see my quandary.
>>>>>> OnlineGaussianAccumulator is still a work in progress but, since 
>>>>>> it is never
>>>>>> used it is in the commit for your viewing.
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
>>>>>>
>>>>>>> Thanks Jeff, I'll try out the changes when they're committed. I 
>>>>>>> tried a
>>>>>>> couple of things locally (removing the clusters/setting a small 
>>>>>>> prior), but
>>>>>>> I ended up with inter-density>  intra-density, so I suspect I've 
>>>>>>> slipped up
>>>>>>> somewhere. I'll hold off on it for now.
>>>>>>>
>>>>>>> On 29/09/10 13:48, Jeff Eastman wrote:
>>>>>>>
>>>>>>>>   Hi Derek,
>>>>>>>>
>>>>>>>> That makes sense. With the very, very tight cluster that your
>>>>>>>> clustering produced you've uncovered an instability in that std 
>>>>>>>> calculation.
>>>>>>>> I'm going to rework that method today to use a better algorithm 
>>>>>>>> and will add
>>>>>>>> a small prior in the process. I'm also going to add a unit test 
>>>>>>>> to reproduce
>>>>>>>> this problem first. Look for a commit in a couple of hours.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
>>>>>>>>
>>>>>>>>> Hi Jeff,
>>>>>>>>>
>>>>>>>>> FYI I checked the problem I was having in CDbwEvaluator with 
>>>>>>>>> the same
>>>>>>>>> dataset from the ClusterEvaluator thread, the problem is 
>>>>>>>>> occurring in the
>>>>>>>>> std calculation in CDbwEvaluator.computeStd(), in that
>>>>>>>>> s2.times(s0).minus(s1.times(s1)) generates negative values 
>>>>>>>>> which then
>>>>>>>>> produce NaN with the subsequent SquareRootFunction(). This 
>>>>>>>>> then sets the
>>>>>>>>> average std to NaN later on in intraClusterDensity(). It's 
>>>>>>>>> happening for the
>>>>>>>>> cluster I have with the almost-identical points.
>>>>>>>>>
>>>>>>>>> It's the same symptom as the problem last week, where this was
>>>>>>>>> happening when s0 was 1. Is the solution to ignore these 
>>>>>>>>> clusters, like the
>>>>>>>>> s0 = 1 clusters? Or to add a small prior std as was done for 
>>>>>>>>> the similar
>>>>>>>>> issue in NormalModel.pdf()?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Derek
>>>>>>>>>
>>>>>>>>> On 28/09/10 20:28, Jeff Eastman wrote:
>>>>>>>>>
>>>>>>>>>>   Hi Ted,
>>>>>>>>>>
>>>>>>>>>> The clustering code computes this value for cluster radius.
>>>>>>>>>> Currently, it is done with a running sums approach (s^0, s^1, 
>>>>>>>>>> s^2) that
>>>>>>>>>> computes the std of each vector term using:
>>>>>>>>>>
>>>>>>>>>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
>>>>>>>>>> SquareRootFunction()).divide(s0);
>>>>>>>>>>
>>>>>>>>>> For CDbw, they need a scalar, average std value, and this is
>>>>>>>>>> currently computed by averaging the vector terms:
>>>>>>>>>>
>>>>>>>>>> double d = std.zSum() / std.size();
>>>>>>>>>>
>>>>>>>>>> The more I read about it; however, the less confident I am about
>>>>>>>>>> this approach. The paper itself seems to indicate a 
>>>>>>>>>> covariance approach, but
>>>>>>>>>> I am lost in their notation. See page 5, just above 
>>>>>>>>>> Definition 1.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>

Re: Standard Deviation of a Set of Vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  The CDbw intra-cluster density calculation uses the average std of all 
the clusters in its normalization and produces an average of the 
per-cluster intra-cluster densities so it's likely the other, looser 
clusters are degrading that result.

CDbw is not a panacea either. The Dirichlet unit test, for example, 
produces a very low metric despite clustering over the same data set 
that the DisplayDirichlet uses. That clustering finds almost exactly the 
parameters of the input data set (try it), but the clusters overlap and 
that messes with CDbw density calculations which reward mostly 
non-overlapping clusters.

I'm happy that the Online accumulator works better. I will plug it in in 
my next commit and tweak the unit tests to accept its values.
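
For reference, the kind of numerically stable, Welford-style update an 
online accumulator can use instead of raw running sums looks roughly like 
this (an illustration of the idea, not the actual OnlineGaussianAccumulator):

public class WelfordAccumulator {

  private long n;
  private double[] mean;
  private double[] m2; // per-term sum of squared deviations from the running mean

  public void observe(double[] x) {
    if (mean == null) {
      mean = new double[x.length];
      m2 = new double[x.length];
    }
    n++;
    for (int i = 0; i < x.length; i++) {
      double delta = x[i] - mean[i];
      mean[i] += delta / n;
      m2[i] += delta * (x[i] - mean[i]); // uses the updated mean
    }
  }

  // Per-term population std; the radicand is never negative, so no NaN.
  public double[] std() {
    double[] std = new double[mean.length];
    for (int i = 0; i < mean.length; i++) {
      std[i] = Math.sqrt(m2[i] / n);
    }
    return std;
  }

  // Scalar "average std" as used by the CDbw code in this thread.
  public double averageStd() {
    double sum = 0.0;
    for (double v : std()) {
      sum += v;
    }
    return sum / mean.length;
  }
}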

On 9/29/10 2:59 PM, Ted Dunning wrote:
> That slight difference case is exactly where the running sums approach
> fails.  I think that you found the problem.
>
> Your suspicion about inter being larger than intra confuses me.  Isn't that
> just another way of saying that clusters are very tight?
>
> On Wed, Sep 29, 2010 at 11:19 AM, Derek O'Callaghan<derek.ocallaghan@ucd.ie
>> wrote:
>> I just stepped through invalidCluster(), and it seems that there's a slight
>> difference between the centre and the other points, so it returns false. I
>> was positive that there was no difference when I stepped through it last, I
>> must have overlooked something, sorry about that.
>>
>> I just tried the OnlineGaussianAccumulator and it does run better, in that
>> I get values for the 4 metrics. One thing I need to check is why the
>> inter-density is so much bigger than the intra-, I'm getting the following
>> values:
>>
>> CDbw = 68.51761788802385
>> Intra-cluster density = 0.3734803797950363
>> Inter-cluster density = 3.474415557178534
>> Separation = 183.4570745741071
>>
>> When using RunningSums and ignoring the identical points cluster, I get a
>> similar issue in that inter = ~1.5, with intra = ~0.15. I have to leave for
>> the evening, I'll look into it tomorrow to see if I can determine if it's
>> correct.
>>
>> Thanks again.
>>
>> On 29/09/10 18:55, Jeff Eastman wrote:
>>
>>>   If all of the representative points for that cluster are identical then
>>> they are also identical to the cluster center (the first representative
>>> point) and should be pruned. I'm wondering why this was not detected in
>>> invalidCluster, can you investigate that? You may also want to plug in an
>>> instance of the new OnlineGaussianAccumulator to see if it does any better.
>>> It is likely to me much more stable than the RunningSums...
>>>
>>> On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
>>>
>>>> Thanks for that Jeff. I tried the changes and get the same result as
>>>> expected. FYI I've investigated further and it seems that all of the points
>>>> in the affected cluster are identical, so it ends up as more or less the
>>>> same problem we had last week with clusters with total points<  #
>>>> representative points, in that there are duplicate representative points. In
>>>> this case total>  # representative, but the end result is the same.
>>>>
>>>> I'm wondering if the quickest and easiest solution is to simply ignore
>>>> such clusters, i.e. those that currently generate a NaN std? I'm not sure if
>>>> it's the "correct" approach though...
>>>>
>>>>
>>>>
>>>> On 29/09/10 17:37, Jeff Eastman wrote:
>>>>
>>>>>   Hi Derek,
>>>>>
>>>>> I've committed some changes which will hopefully help in fixing this
>>>>> problem but which do not yet accomplish that. As you can see from the new
>>>>> CDbw test (testAlmostSameValueCluster) I tried creating a test cluster with
>>>>> points identical to the cluster center but with one which differed from it
>>>>> by Double.MIN_NORMAL in one element. That test failed to duplicate your
>>>>> issue.
>>>>>
>>>>> The patch also factors out the std calculation into an implementor of
>>>>> GaussianAccumulator. I factored the current std calculations out of
>>>>> CDbwEvaluator into RunningSumsGaussianAccumulator and all the tests produced
>>>>> the same results as before. With the new OnlineGaussianAccumulator plugged
>>>>> in, the tests all return slightly different results but still no NaNs.
>>>>>
>>>>> I still have not added priors and I'm not entirely sure where to do
>>>>> that. I've committed the changes so you can see my quandary.
>>>>> OnlineGaussianAccumulator is still a work in progress but, since it is never
>>>>> used it is in the commit for your viewing.
>>>>>
>>>>> Jeff
>>>>>
>>>>> On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
>>>>>
>>>>>> Thanks Jeff, I'll try out the changes when they're committed. I tried a
>>>>>> couple of things locally (removing the clusters/setting a small prior), but
>>>>>> I ended up with inter-density>  intra-density, so I suspect I've slipped up
>>>>>> somewhere. I'll hold off on it for now.
>>>>>>
>>>>>> On 29/09/10 13:48, Jeff Eastman wrote:
>>>>>>
>>>>>>>   Hi Derek,
>>>>>>>
>>>>>>> That makes sense. With the very, very tight cluster that your
>>>>>>> clustering produced you've uncovered an instability in that std calculation.
>>>>>>> I'm going to rework that method today to use a better algorithm and will add
>>>>>>> a small prior in the process. I'm also going to add a unit test to reproduce
>>>>>>> this problem first. Look for a commit in a couple of hours.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
>>>>>>>
>>>>>>>> Hi Jeff,
>>>>>>>>
>>>>>>>> FYI I checked the problem I was having in CDbwEvaluator with the same
>>>>>>>> dataset from the ClusterEvaluator thread, the problem is occurring in the
>>>>>>>> std calculation in CDbwEvaluator.computeStd(), in that
>>>>>>>> s2.times(s0).minus(s1.times(s1)) generates negative values which then
>>>>>>>> produce NaN with the subsequent SquareRootFunction(). This then sets the
>>>>>>>> average std to NaN later on in intraClusterDensity(). It's happening for the
>>>>>>>> cluster I have with the almost-identical points.
>>>>>>>>
>>>>>>>> It's the same symptom as the problem last week, where this was
>>>>>>>> happening when s0 was 1. Is the solution to ignore these clusters, like the
>>>>>>>> s0 = 1 clusters? Or to add a small prior std as was done for the similar
>>>>>>>> issue in NormalModel.pdf()?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Derek
>>>>>>>>
>>>>>>>> On 28/09/10 20:28, Jeff Eastman wrote:
>>>>>>>>
>>>>>>>>>   Hi Ted,
>>>>>>>>>
>>>>>>>>> The clustering code computes this value for cluster radius.
>>>>>>>>> Currently, it is done with a running sums approach (s^0, s^1, s^2) that
>>>>>>>>> computes the std of each vector term using:
>>>>>>>>>
>>>>>>>>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
>>>>>>>>> SquareRootFunction()).divide(s0);
>>>>>>>>>
>>>>>>>>> For CDbw, they need a scalar, average std value, and this is
>>>>>>>>> currently computed by averaging the vector terms:
>>>>>>>>>
>>>>>>>>> double d = std.zSum() / std.size();
>>>>>>>>>
>>>>>>>>> The more I read about it; however, the less confident I am about
>>>>>>>>> this approach. The paper itself seems to indicate a covariance approach, but
>>>>>>>>> I am lost in their notation. See page 5, just above Definition 1.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
>>>>>>>>>
>>>>>>>>>


Re: Standard Deviation of a Set of Vectors

Posted by Ted Dunning <te...@gmail.com>.
That slight difference case is exactly where the running sums approach
fails.  I think that you found the problem.

Your suspicion about inter being larger than intra confuses me.  Isn't that
just another way of saying that clusters are very tight?
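
A tiny demonstration of that failure mode: with near-identical values, 
n*s2 - s1^2 subtracts two huge, nearly equal numbers, so the true (tiny) 
variance is swamped by rounding error and can come out negative, which is 
where the NaN appears after the square root. The values below are 
illustrative only:

public class CancellationDemo {

  public static void main(String[] args) {
    double[] x = {1.0e8, 1.0e8 + 1.0e-7, 1.0e8, 1.0e8};

    // Running-sums variance: (n*s2 - s1^2) / n^2
    double s1 = 0.0;
    double s2 = 0.0;
    for (double v : x) {
      s1 += v;
      s2 += v * v;
    }
    double n = x.length;
    double naive = (n * s2 - s1 * s1) / (n * n);

    // Two-pass variance: compute the mean first, then squared deviations from it.
    double mean = s1 / n;
    double ss = 0.0;
    for (double v : x) {
      ss += (v - mean) * (v - mean);
    }
    double twoPass = ss / n;

    // The naive value is dominated by rounding error (and may be negative);
    // the two-pass value stays tiny and non-negative.
    System.out.println("running sums: " + naive);
    System.out.println("two pass:     " + twoPass);
    System.out.println("sqrt(naive):  " + Math.sqrt(naive)); // NaN if naive < 0
  }
}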

On Wed, Sep 29, 2010 at 11:19 AM, Derek O'Callaghan <derek.ocallaghan@ucd.ie
> wrote:

> I just stepped through invalidCluster(), and it seems that there's a slight
> difference between the centre and the other points, so it returns false. I
> was positive that there was no difference when I stepped through it last, I
> must have overlooked something, sorry about that.
>
> I just tried the OnlineGaussianAccumulator and it does run better, in that
> I get values for the 4 metrics. One thing I need to check is why the
> inter-density is so much bigger than the intra-, I'm getting the following
> values:
>
> CDbw = 68.51761788802385
> Intra-cluster density = 0.3734803797950363
> Inter-cluster density = 3.474415557178534
> Separation = 183.4570745741071
>
> When using RunningSums and ignoring the identical points cluster, I get a
> similar issue in that inter = ~1.5, with intra = ~0.15. I have to leave for
> the evening, I'll look into it tomorrow to see if I can determine if it's
> correct.
>
> Thanks again.
>
> On 29/09/10 18:55, Jeff Eastman wrote:
>
>>  If all of the representative points for that cluster are identical then
>> they are also identical to the cluster center (the first representative
>> point) and should be pruned. I'm wondering why this was not detected in
>> invalidCluster, can you investigate that? You may also want to plug in an
>> instance of the new OnlineGaussianAccumulator to see if it does any better.
>> It is likely to me much more stable than the RunningSums...
>>
>> On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
>>
>>> Thanks for that Jeff. I tried the changes and get the same result as
>>> expected. FYI I've investigated further and it seems that all of the points
>>> in the affected cluster are identical, so it ends up as more or less the
>>> same problem we had last week with clusters with total points < #
>>> representative points, in that there are duplicate representative points. In
>>> this case total > # representative, but the end result is the same.
>>>
>>> I'm wondering if the quickest and easiest solution is to simply ignore
>>> such clusters, i.e. those that currently generate a NaN std? I'm not sure if
>>> it's the "correct" approach though...
>>>
>>>
>>>
>>> On 29/09/10 17:37, Jeff Eastman wrote:
>>>
>>>>  Hi Derek,
>>>>
>>>> I've committed some changes which will hopefully help in fixing this
>>>> problem but which do not yet accomplish that. As you can see from the new
>>>> CDbw test (testAlmostSameValueCluster) I tried creating a test cluster with
>>>> points identical to the cluster center but with one which differed from it
>>>> by Double.MIN_NORMAL in one element. That test failed to duplicate your
>>>> issue.
>>>>
>>>> The patch also factors out the std calculation into an implementor of
>>>> GaussianAccumulator. I factored the current std calculations out of
>>>> CDbwEvaluator into RunningSumsGaussianAccumulator and all the tests produced
>>>> the same results as before. With the new OnlineGaussianAccumulator plugged
>>>> in, the tests all return slightly different results but still no NaNs.
>>>>
>>>> I still have not added priors and I'm not entirely sure where to do
>>>> that. I've committed the changes so you can see my quandary.
>>>> OnlineGaussianAccumulator is still a work in progress but, since it is never
>>>> used it is in the commit for your viewing.
>>>>
>>>> Jeff
>>>>
>>>> On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
>>>>
>>>>> Thanks Jeff, I'll try out the changes when they're committed. I tried a
>>>>> couple of things locally (removing the clusters/setting a small prior), but
>>>>> I ended up with inter-density > intra-density, so I suspect I've slipped up
>>>>> somewhere. I'll hold off on it for now.
>>>>>
>>>>> On 29/09/10 13:48, Jeff Eastman wrote:
>>>>>
>>>>>>  Hi Derek,
>>>>>>
>>>>>> That makes sense. With the very, very tight cluster that your
>>>>>> clustering produced you've uncovered an instability in that std calculation.
>>>>>> I'm going to rework that method today to use a better algorithm and will add
>>>>>> a small prior in the process. I'm also going to add a unit test to reproduce
>>>>>> this problem first. Look for a commit in a couple of hours.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
>>>>>>
>>>>>>> Hi Jeff,
>>>>>>>
>>>>>>> FYI I checked the problem I was having in CDbwEvaluator with the same
>>>>>>> dataset from the ClusterEvaluator thread, the problem is occurring in the
>>>>>>> std calculation in CDbwEvaluator.computeStd(), in that
>>>>>>> s2.times(s0).minus(s1.times(s1)) generates negative values which then
>>>>>>> produce NaN with the subsequent SquareRootFunction(). This then sets the
>>>>>>> average std to NaN later on in intraClusterDensity(). It's happening for the
>>>>>>> cluster I have with the almost-identical points.
>>>>>>>
>>>>>>> It's the same symptom as the problem last week, where this was
>>>>>>> happening when s0 was 1. Is the solution to ignore these clusters, like the
>>>>>>> s0 = 1 clusters? Or to add a small prior std as was done for the similar
>>>>>>> issue in NormalModel.pdf()?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Derek
>>>>>>>
>>>>>>> On 28/09/10 20:28, Jeff Eastman wrote:
>>>>>>>
>>>>>>>>  Hi Ted,
>>>>>>>>
>>>>>>>> The clustering code computes this value for cluster radius.
>>>>>>>> Currently, it is done with a running sums approach (s^0, s^1, s^2) that
>>>>>>>> computes the std of each vector term using:
>>>>>>>>
>>>>>>>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
>>>>>>>> SquareRootFunction()).divide(s0);
>>>>>>>>
>>>>>>>> For CDbw, they need a scalar, average std value, and this is
>>>>>>>> currently computed by averaging the vector terms:
>>>>>>>>
>>>>>>>> double d = std.zSum() / std.size();
>>>>>>>>
>>>>>>>> The more I read about it; however, the less confident I am about
>>>>>>>> this approach. The paper itself seems to indicate a covariance approach, but
>>>>>>>> I am lost in their notation. See page 5, just above Definition 1.
>>>>>>>>
>>>>>>>>
>>>>>>>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>

Re: Standard Deviation of a Set of Vectors

Posted by Derek O'Callaghan <de...@ucd.ie>.
I just stepped through invalidCluster(), and it seems that there's a 
slight difference between the centre and the other points, so it returns 
false. I was positive that there was no difference when I stepped 
through it last; I must have overlooked something, sorry about that.

I just tried the OnlineGaussianAccumulator and it does run better, in 
that I get values for the 4 metrics. One thing I need to check is why 
the inter-density is so much bigger than the intra-; I'm getting the 
following values:

CDbw = 68.51761788802385
Intra-cluster density = 0.3734803797950363
Inter-cluster density = 3.474415557178534
Separation = 183.4570745741071

When using RunningSums and ignoring the identical points cluster, I get 
a similar issue in that inter = ~1.5, with intra = ~0.15. I have to 
leave for the evening; I'll look into it tomorrow to see if I can 
determine if it's correct.

Thanks again.

On 29/09/10 18:55, Jeff Eastman wrote:
>  If all of the representative points for that cluster are identical 
> then they are also identical to the cluster center (the first 
> representative point) and should be pruned. I'm wondering why this was 
> not detected in invalidCluster, can you investigate that? You may also 
> want to plug in an instance of the new OnlineGaussianAccumulator to 
> see if it does any better. It is likely to me much more stable than 
> the RunningSums...
>
> On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
>> Thanks for that Jeff. I tried the changes and get the same result as 
>> expected. FYI I've investigated further and it seems that all of the 
>> points in the affected cluster are identical, so it ends up as more 
>> or less the same problem we had last week with clusters with total 
>> points < # representative points, in that there are duplicate 
>> representative points. In this case total > # representative, but the 
>> end result is the same.
>>
>> I'm wondering if the quickest and easiest solution is to simply 
>> ignore such clusters, i.e. those that currently generate a NaN std? 
>> I'm not sure if it's the "correct" approach though...
>>
>>
>>
>> On 29/09/10 17:37, Jeff Eastman wrote:
>>>  Hi Derek,
>>>
>>> I've committed some changes which will hopefully help in fixing this 
>>> problem but which do not yet accomplish that. As you can see from 
>>> the new CDbw test (testAlmostSameValueCluster) I tried creating a 
>>> test cluster with points identical to the cluster center but with 
>>> one which differed from it by Double.MIN_NORMAL in one element. That 
>>> test failed to duplicate your issue.
>>>
>>> The patch also factors out the std calculation into an implementor 
>>> of GaussianAccumulator. I factored the current std calculations out 
>>> of CDbwEvaluator into RunningSumsGaussianAccumulator and all the 
>>> tests produced the same results as before. With the new 
>>> OnlineGaussianAccumulator plugged in, the tests all return slightly 
>>> different results but still no NaNs.
>>>
>>> I still have not added priors and I'm not entirely sure where to do 
>>> that. I've committed the changes so you can see my quandary. 
>>> OnlineGaussianAccumulator is still a work in progress but, since it 
>>> is never used it is in the commit for your viewing.
>>>
>>> Jeff
>>>
>>> On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
>>>> Thanks Jeff, I'll try out the changes when they're committed. I 
>>>> tried a couple of things locally (removing the clusters/setting a 
>>>> small prior), but I ended up with inter-density > intra-density, so 
>>>> I suspect I've slipped up somewhere. I'll hold off on it for now.
>>>>
>>>> On 29/09/10 13:48, Jeff Eastman wrote:
>>>>>  Hi Derek,
>>>>>
>>>>> That makes sense. With the very, very tight cluster that your 
>>>>> clustering produced you've uncovered an instability in that std 
>>>>> calculation. I'm going to rework that method today to use a better 
>>>>> algorithm and will add a small prior in the process. I'm also 
>>>>> going to add a unit test to reproduce this problem first. Look for 
>>>>> a commit in a couple of hours.
>>>>>
>>>>>
>>>>>
>>>>> On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
>>>>>> Hi Jeff,
>>>>>>
>>>>>> FYI I checked the problem I was having in CDbwEvaluator with the 
>>>>>> same dataset from the ClusterEvaluator thread, the problem is 
>>>>>> occurring in the std calculation in CDbwEvaluator.computeStd(), 
>>>>>> in that s2.times(s0).minus(s1.times(s1)) generates negative 
>>>>>> values which then produce NaN with the subsequent 
>>>>>> SquareRootFunction(). This then sets the average std to NaN later 
>>>>>> on in intraClusterDensity(). It's happening for the cluster I 
>>>>>> have with the almost-identical points.
>>>>>>
>>>>>> It's the same symptom as the problem last week, where this was 
>>>>>> happening when s0 was 1. Is the solution to ignore these 
>>>>>> clusters, like the s0 = 1 clusters? Or to add a small prior std 
>>>>>> as was done for the similar issue in NormalModel.pdf()?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Derek
>>>>>>
>>>>>> On 28/09/10 20:28, Jeff Eastman wrote:
>>>>>>>  Hi Ted,
>>>>>>>
>>>>>>> The clustering code computes this value for cluster radius. 
>>>>>>> Currently, it is done with a running sums approach (s^0, s^1, 
>>>>>>> s^2) that computes the std of each vector term using:
>>>>>>>
>>>>>>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new 
>>>>>>> SquareRootFunction()).divide(s0);
>>>>>>>
>>>>>>> For CDbw, they need a scalar, average std value, and this is 
>>>>>>> currently computed by averaging the vector terms:
>>>>>>>
>>>>>>> double d = std.zSum() / std.size();
>>>>>>>
>>>>>>> The more I read about it; however, the less confident I am about 
>>>>>>> this approach. The paper itself seems to indicate a covariance 
>>>>>>> approach, but I am lost in their notation. See page 5, just 
>>>>>>> above Definition 1.
>>>>>>>
>>>>>>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf 
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Standard Deviation of a Set of Vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  If all of the representative points for that cluster are identical 
then they are also identical to the cluster center (the first 
representative point) and should be pruned. I'm wondering why this was 
not detected in invalidCluster; can you investigate that? You may also 
want to plug in an instance of the new OnlineGaussianAccumulator to see 
if it does any better. It is likely to be much more stable than the 
RunningSums...
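
For what it's worth, a check along those lines could be as simple as the
following standalone sketch (plain arrays and a hypothetical helper, not the
actual invalidCluster code); a small tolerance also catches the
"almost identical" case that exact equality would miss:

public final class RepresentativePoints {

  private RepresentativePoints() {}

  /**
   * Returns true when every representative point is within epsilon of the
   * first one (the cluster center) in every element. Such a cluster has
   * essentially zero spread and should be pruned before any std-based
   * density calculation.
   */
  public static boolean allNearlyIdentical(double[][] representativePoints, double epsilon) {
    double[] center = representativePoints[0];
    for (double[] point : representativePoints) {
      for (int i = 0; i < center.length; i++) {
        if (Math.abs(point[i] - center[i]) > epsilon) {
          return false;
        }
      }
    }
    return true;
  }
}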

On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
> Thanks for that Jeff. I tried the changes and get the same result as 
> expected. FYI I've investigated further and it seems that all of the 
> points in the affected cluster are identical, so it ends up as more or 
> less the same problem we had last week with clusters with total points 
> < # representative points, in that there are duplicate representative 
> points. In this case total > # representative, but the end result is 
> the same.
>
> I'm wondering if the quickest and easiest solution is to simply ignore 
> such clusters, i.e. those that currently generate a NaN std? I'm not 
> sure if it's the "correct" approach though...
>
>
>
> On 29/09/10 17:37, Jeff Eastman wrote:
>>  Hi Derek,
>>
>> I've committed some changes which will hopefully help in fixing this 
>> problem but which do not yet accomplish that. As you can see from the 
>> new CDbw test (testAlmostSameValueCluster) I tried creating a test 
>> cluster with points identical to the cluster center but with one 
>> which differed from it by Double.MIN_NORMAL in one element. That test 
>> failed to duplicate your issue.
>>
>> The patch also factors out the std calculation into an implementor of 
>> GaussianAccumulator. I factored the current std calculations out of 
>> CDbwEvaluator into RunningSumsGaussianAccumulator and all the tests 
>> produced the same results as before. With the new 
>> OnlineGaussianAccumulator plugged in, the tests all return slightly 
>> different results but still no NaNs.
>>
>> I still have not added priors and I'm not entirely sure where to do 
>> that. I've committed the changes so you can see my quandary. 
>> OnlineGaussianAccumulator is still a work in progress but, since it 
>> is never used it is in the commit for your viewing.
>>
>> Jeff
>>
>> On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
>>> Thanks Jeff, I'll try out the changes when they're committed. I 
>>> tried a couple of things locally (removing the clusters/setting a 
>>> small prior), but I ended up with inter-density > intra-density, so 
>>> I suspect I've slipped up somewhere. I'll hold off on it for now.
>>>
>>> On 29/09/10 13:48, Jeff Eastman wrote:
>>>>  Hi Derek,
>>>>
>>>> That makes sense. With the very, very tight cluster that your 
>>>> clustering produced you've uncovered an instability in that std 
>>>> calculation. I'm going to rework that method today to use a better 
>>>> algorithm and will add a small prior in the process. I'm also going 
>>>> to add a unit test to reproduce this problem first. Look for a 
>>>> commit in a couple of hours.
>>>>
>>>>
>>>>
>>>> On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
>>>>> Hi Jeff,
>>>>>
>>>>> FYI I checked the problem I was having in CDbwEvaluator with the 
>>>>> same dataset from the ClusterEvaluator thread, the problem is 
>>>>> occurring in the std calculation in CDbwEvaluator.computeStd(), in 
>>>>> that s2.times(s0).minus(s1.times(s1)) generates negative values 
>>>>> which then produce NaN with the subsequent SquareRootFunction(). 
>>>>> This then sets the average std to NaN later on in 
>>>>> intraClusterDensity(). It's happening for the cluster I have with 
>>>>> the almost-identical points.
>>>>>
>>>>> It's the same symptom as the problem last week, where this was 
>>>>> happening when s0 was 1. Is the solution to ignore these clusters, 
>>>>> like the s0 = 1 clusters? Or to add a small prior std as was done 
>>>>> for the similar issue in NormalModel.pdf()?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Derek
>>>>>
>>>>> On 28/09/10 20:28, Jeff Eastman wrote:
>>>>>>  Hi Ted,
>>>>>>
>>>>>> The clustering code computes this value for cluster radius. 
>>>>>> Currently, it is done with a running sums approach (s^0, s^1, 
>>>>>> s^2) that computes the std of each vector term using:
>>>>>>
>>>>>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new 
>>>>>> SquareRootFunction()).divide(s0);
>>>>>>
>>>>>> For CDbw, they need a scalar, average std value, and this is 
>>>>>> currently computed by averaging the vector terms:
>>>>>>
>>>>>> double d = std.zSum() / std.size();
>>>>>>
>>>>>> The more I read about it; however, the less confident I am about 
>>>>>> this approach. The paper itself seems to indicate a covariance 
>>>>>> approach, but I am lost in their notation. See page 5, just above 
>>>>>> Definition 1.
>>>>>>
>>>>>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf 
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Standard Deviation of a Set of Vectors

Posted by Derek O'Callaghan <de...@ucd.ie>.
Thanks for that Jeff. I tried the changes and get the same result as 
expected. FYI I've investigated further and it seems that all of the 
points in the affected cluster are identical, so it ends up as more or 
less the same problem we had last week with clusters with total points < 
# representative points, in that there are duplicate representative 
points. In this case total > # representative, but the end result is the 
same.

I'm wondering if the quickest and easiest solution is to simply ignore 
such clusters, i.e. those that currently generate a NaN std? I'm not 
sure if it's the "correct" approach though...



On 29/09/10 17:37, Jeff Eastman wrote:
>  Hi Derek,
>
> I've committed some changes which will hopefully help in fixing this 
> problem but which do not yet accomplish that. As you can see from the 
> new CDbw test (testAlmostSameValueCluster) I tried creating a test 
> cluster with points identical to the cluster center but with one which 
> differed from it by Double.MIN_NORMAL in one element. That test failed 
> to duplicate your issue.
>
> The patch also factors out the std calculation into an implementor of 
> GaussianAccumulator. I factored the current std calculations out of 
> CDbwEvaluator into RunningSumsGaussianAccumulator and all the tests 
> produced the same results as before. With the new 
> OnlineGaussianAccumulator plugged in, the tests all return slightly 
> different results but still no NaNs.
>
> I still have not added priors and I'm not entirely sure where to do 
> that. I've committed the changes so you can see my quandary. 
> OnlineGaussianAccumulator is still a work in progress but, since it is 
> never used it is in the commit for your viewing.
>
> Jeff
>
> On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
>> Thanks Jeff, I'll try out the changes when they're committed. I tried 
>> a couple of things locally (removing the clusters/setting a small 
>> prior), but I ended up with inter-density > intra-density, so I 
>> suspect I've slipped up somewhere. I'll hold off on it for now.
>>
>> On 29/09/10 13:48, Jeff Eastman wrote:
>>>  Hi Derek,
>>>
>>> That makes sense. With the very, very tight cluster that your 
>>> clustering produced you've uncovered an instability in that std 
>>> calculation. I'm going to rework that method today to use a better 
>>> algorithm and will add a small prior in the process. I'm also going 
>>> to add a unit test to reproduce this problem first. Look for a 
>>> commit in a couple of hours.
>>>
>>>
>>>
>>> On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
>>>> Hi Jeff,
>>>>
>>>> FYI I checked the problem I was having in CDbwEvaluator with the 
>>>> same dataset from the ClusterEvaluator thread, the problem is 
>>>> occurring in the std calculation in CDbwEvaluator.computeStd(), in 
>>>> that s2.times(s0).minus(s1.times(s1)) generates negative values 
>>>> which then produce NaN with the subsequent SquareRootFunction(). 
>>>> This then sets the average std to NaN later on in 
>>>> intraClusterDensity(). It's happening for the cluster I have with 
>>>> the almost-identical points.
>>>>
>>>> It's the same symptom as the problem last week, where this was 
>>>> happening when s0 was 1. Is the solution to ignore these clusters, 
>>>> like the s0 = 1 clusters? Or to add a small prior std as was done 
>>>> for the similar issue in NormalModel.pdf()?
>>>>
>>>> Thanks,
>>>>
>>>> Derek
>>>>
>>>> On 28/09/10 20:28, Jeff Eastman wrote:
>>>>>  Hi Ted,
>>>>>
>>>>> The clustering code computes this value for cluster radius. 
>>>>> Currently, it is done with a running sums approach (s^0, s^1, s^2) 
>>>>> that computes the std of each vector term using:
>>>>>
>>>>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new 
>>>>> SquareRootFunction()).divide(s0);
>>>>>
>>>>> For CDbw, they need a scalar, average std value, and this is 
>>>>> currently computed by averaging the vector terms:
>>>>>
>>>>> double d = std.zSum() / std.size();
>>>>>
>>>>> The more I read about it; however, the less confident I am about 
>>>>> this approach. The paper itself seems to indicate a covariance 
>>>>> approach, but I am lost in their notation. See page 5, just above 
>>>>> Definition 1.
>>>>>
>>>>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf 
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Standard Deviation of a Set of Vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Hi Derek,

I've committed some changes which will hopefully help in fixing this 
problem but which do not yet accomplish that. As you can see from the 
new CDbw test (testAlmostSameValueCluster) I tried creating a test 
cluster with points identical to the cluster center but with one which 
differed from it by Double.MIN_NORMAL in one element. That test failed 
to duplicate your issue.

The patch also factors out the std calculation into an implementor of 
GaussianAccumulator. I factored the current std calculations out of 
CDbwEvaluator into RunningSumsGaussianAccumulator and all the tests 
produced the same results as before. With the new 
OnlineGaussianAccumulator plugged in, the tests all return slightly 
different results but still no NaNs.

I still have not added priors and I'm not entirely sure where to do 
that. I've committed the changes so you can see my quandary. 
OnlineGaussianAccumulator is still a work in progress, but since it is 
never used, it is in the commit for your viewing.
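
For illustration only, one possible shape for such an accumulator, with the
prior folded into the variance rather than the mean, is sketched below (a
scalar version with hypothetical names, not the actual GaussianAccumulator
interface):

/**
 * Scalar sketch of an online mean/std accumulator with a small prior.
 * The prior is expressed as priorCount pseudo-observations of priorVariance
 * and is deliberately kept out of the mean update. Assumes at least one
 * observation or a non-zero prior count before the variance is read.
 */
public class OnlineScalarAccumulator {

  private final double priorCount;
  private final double priorVariance;

  private double count;
  private double mean;
  private double m2;  // sum of squared deviations from the running mean (Welford)

  public OnlineScalarAccumulator(double priorCount, double priorVariance) {
    this.priorCount = priorCount;
    this.priorVariance = priorVariance;
  }

  public void observe(double x) {
    count++;
    double oldMean = mean;
    mean += (x - oldMean) / count;
    m2 += (x - oldMean) * (x - mean);
  }

  public double getMean() {
    return mean;
  }

  public double getVariance() {
    // blend the observed spread with the prior; a cluster of identical points
    // then yields a small positive variance instead of 0 (and no NaN downstream)
    return (m2 + priorCount * priorVariance) / (count + priorCount);
  }

  public double getStd() {
    return Math.sqrt(getVariance());
  }
}

With a prior count below 1 and a small prior variance, ordinary clusters are
barely affected, while a degenerate cluster of identical points reports a
small positive std instead of NaN.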

Jeff

On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
> Thanks Jeff, I'll try out the changes when they're committed. I tried 
> a couple of things locally (removing the clusters/setting a small 
> prior), but I ended up with inter-density > intra-density, so I 
> suspect I've slipped up somewhere. I'll hold off on it for now.
>
> On 29/09/10 13:48, Jeff Eastman wrote:
>>  Hi Derek,
>>
>> That makes sense. With the very, very tight cluster that your 
>> clustering produced you've uncovered an instability in that std 
>> calculation. I'm going to rework that method today to use a better 
>> algorithm and will add a small prior in the process. I'm also going 
>> to add a unit test to reproduce this problem first. Look for a commit 
>> in a couple of hours.
>>
>>
>>
>> On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
>>> Hi Jeff,
>>>
>>> FYI I checked the problem I was having in CDbwEvaluator with the 
>>> same dataset from the ClusterEvaluator thread, the problem is 
>>> occurring in the std calculation in CDbwEvaluator.computeStd(), in 
>>> that s2.times(s0).minus(s1.times(s1)) generates negative values 
>>> which then produce NaN with the subsequent SquareRootFunction(). 
>>> This then sets the average std to NaN later on in 
>>> intraClusterDensity(). It's happening for the cluster I have with 
>>> the almost-identical points.
>>>
>>> It's the same symptom as the problem last week, where this was 
>>> happening when s0 was 1. Is the solution to ignore these clusters, 
>>> like the s0 = 1 clusters? Or to add a small prior std as was done 
>>> for the similar issue in NormalModel.pdf()?
>>>
>>> Thanks,
>>>
>>> Derek
>>>
>>> On 28/09/10 20:28, Jeff Eastman wrote:
>>>>  Hi Ted,
>>>>
>>>> The clustering code computes this value for cluster radius. 
>>>> Currently, it is done with a running sums approach (s^0, s^1, s^2) 
>>>> that computes the std of each vector term using:
>>>>
>>>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new 
>>>> SquareRootFunction()).divide(s0);
>>>>
>>>> For CDbw, they need a scalar, average std value, and this is 
>>>> currently computed by averaging the vector terms:
>>>>
>>>> double d = std.zSum() / std.size();
>>>>
>>>> The more I read about it; however, the less confident I am about 
>>>> this approach. The paper itself seems to indicate a covariance 
>>>> approach, but I am lost in their notation. See page 5, just above 
>>>> Definition 1.
>>>>
>>>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf 
>>>>
>>>>
>>>
>>
>


Re: Standard Deviation of a Set of Vectors

Posted by Derek O'Callaghan <de...@ucd.ie>.
Thanks Jeff, I'll try out the changes when they're committed. I tried a 
couple of things locally (removing the clusters/setting a small prior), 
but I ended up with inter-density > intra-density, so I suspect I've 
slipped up somewhere. I'll hold off on it for now.

On 29/09/10 13:48, Jeff Eastman wrote:
>  Hi Derek,
>
> That makes sense. With the very, very tight cluster that your 
> clustering produced you've uncovered an instability in that std 
> calculation. I'm going to rework that method today to use a better 
> algorithm and will add a small prior in the process. I'm also going to 
> add a unit test to reproduce this problem first. Look for a commit in 
> a couple of hours.
>
>
>
> On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
>> Hi Jeff,
>>
>> FYI I checked the problem I was having in CDbwEvaluator with the same 
>> dataset from the ClusterEvaluator thread, the problem is occurring in 
>> the std calculation in CDbwEvaluator.computeStd(), in that 
>> s2.times(s0).minus(s1.times(s1)) generates negative values which then 
>> produce NaN with the subsequent SquareRootFunction(). This then sets 
>> the average std to NaN later on in intraClusterDensity(). It's 
>> happening for the cluster I have with the almost-identical points.
>>
>> It's the same symptom as the problem last week, where this was 
>> happening when s0 was 1. Is the solution to ignore these clusters, 
>> like the s0 = 1 clusters? Or to add a small prior std as was done for 
>> the similar issue in NormalModel.pdf()?
>>
>> Thanks,
>>
>> Derek
>>
>> On 28/09/10 20:28, Jeff Eastman wrote:
>>>  Hi Ted,
>>>
>>> The clustering code computes this value for cluster radius. 
>>> Currently, it is done with a running sums approach (s^0, s^1, s^2) 
>>> that computes the std of each vector term using:
>>>
>>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new 
>>> SquareRootFunction()).divide(s0);
>>>
>>> For CDbw, they need a scalar, average std value, and this is 
>>> currently computed by averaging the vector terms:
>>>
>>> double d = std.zSum() / std.size();
>>>
>>> The more I read about it; however, the less confident I am about 
>>> this approach. The paper itself seems to indicate a covariance 
>>> approach, but I am lost in their notation. See page 5, just above 
>>> Definition 1.
>>>
>>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf 
>>>
>>>
>>
>

Re: Standard Deviation of a Set of Vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Hi Derek,

That makes sense. With the very, very tight cluster that your clustering 
produced you've uncovered an instability in that std calculation. I'm 
going to rework that method today to use a better algorithm and will add 
a small prior in the process. I'm also going to add a unit test to 
reproduce this problem first. Look for a commit in a couple of hours.



On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
> Hi Jeff,
>
> FYI I checked the problem I was having in CDbwEvaluator with the same 
> dataset from the ClusterEvaluator thread, the problem is occurring in 
> the std calculation in CDbwEvaluator.computeStd(), in that 
> s2.times(s0).minus(s1.times(s1)) generates negative values which then 
> produce NaN with the subsequent SquareRootFunction(). This then sets 
> the average std to NaN later on in intraClusterDensity(). It's 
> happening for the cluster I have with the almost-identical points.
>
> It's the same symptom as the problem last week, where this was 
> happening when s0 was 1. Is the solution to ignore these clusters, 
> like the s0 = 1 clusters? Or to add a small prior std as was done for 
> the similar issue in NormalModel.pdf()?
>
> Thanks,
>
> Derek
>
> On 28/09/10 20:28, Jeff Eastman wrote:
>>  Hi Ted,
>>
>> The clustering code computes this value for cluster radius. 
>> Currently, it is done with a running sums approach (s^0, s^1, s^2) 
>> that computes the std of each vector term using:
>>
>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new 
>> SquareRootFunction()).divide(s0);
>>
>> For CDbw, they need a scalar, average std value, and this is 
>> currently computed by averaging the vector terms:
>>
>> double d = std.zSum() / std.size();
>>
>> The more I read about it; however, the less confident I am about this 
>> approach. The paper itself seems to indicate a covariance approach, 
>> but I am lost in their notation. See page 5, just above Definition 1.
>>
>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf 
>>
>>
>


Re: Standard Deviation of a Set of Vectors

Posted by Derek O'Callaghan <de...@ucd.ie>.
Hi Jeff,

FYI I checked the problem I was having in CDbwEvaluator with the same 
dataset from the ClusterEvaluator thread; the problem is occurring in 
the std calculation in CDbwEvaluator.computeStd(), in that 
s2.times(s0).minus(s1.times(s1)) generates negative values which then 
produce NaN with the subsequent SquareRootFunction(). This then sets the 
average std to NaN later on in intraClusterDensity(). It's happening for 
the cluster I have with the almost-identical points.

It's the same symptom as the problem last week, where this was happening 
when s0 was 1. Is the solution to ignore these clusters, like the s0 = 1 
clusters? Or to add a small prior std as was done for the similar issue 
in NormalModel.pdf()?

Thanks,

Derek

On 28/09/10 20:28, Jeff Eastman wrote:
>  Hi Ted,
>
> The clustering code computes this value for cluster radius. Currently, 
> it is done with a running sums approach (s^0, s^1, s^2) that computes 
> the std of each vector term using:
>
> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new 
> SquareRootFunction()).divide(s0);
>
> For CDbw, they need a scalar, average std value, and this is currently 
> computed by averaging the vector terms:
>
> double d = std.zSum() / std.size();
>
> The more I read about it; however, the less confident I am about this 
> approach. The paper itself seems to indicate a covariance approach, 
> but I am lost in their notation. See page 5, just above Definition 1.
>
> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf 
>
>

Re: Standard Deviation of a Set of Vectors

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Sep 28, 2010 at 12:28 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:

>  Hi Ted,
>
> The clustering code computes this value for cluster radius. Currently, it
> is done with a running sums approach (s^0, s^1, s^2) that computes the std
> of each vector term using:
>
> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
> SquareRootFunction()).divide(s0);
>

This is algebraically correct, but can be subject to bad round-off error.

I would recommend Welford's method, which keeps a running count (s^0), the
current mean and the current variance.  Then, to update the count, mean and
variance for cluster i when it sees a new distance r:

    // bump the count for cluster i
    double n = counts.getQuick(i) + 1;
    counts.setQuick(i, n);

    // incremental update of the mean
    double oldMean = means.getQuick(i);
    double newMean = oldMean + (r - oldMean) / n;
    means.setQuick(i, newMean);

    // incremental update of the population variance
    double var = variance.getQuick(i);
    variance.setQuick(i, var + ((r - oldMean) * (r - newMean) - var) / n);

The variance is the square of the std, of course.  If you initialize the
variance with a small number and set the count to something non-zero, you have
effectively added a prior on the std (and on the mean, unless you keep that
non-zero weight separate).  If the initial count is < 1, then the prior has
very little weight or effect, except in the nasty cases.

See
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
.
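
A quick standalone check of that recursion against a direct two-pass
computation (made-up distances, plain JDK) looks like this:

public class WelfordCheck {
  public static void main(String[] args) {
    // the same kind of tight, far-from-origin values that break the running-sums form
    double[] distances = {1.0e8, 1.0e8, 1.0e8 + 1.0e-3, 1.0e8 - 1.0e-3};

    double n = 0.0;
    double mean = 0.0;
    double var = 0.0;
    for (double r : distances) {
      n++;
      double oldMean = mean;
      mean = oldMean + (r - oldMean) / n;
      var = var + ((r - oldMean) * (r - mean) - var) / n;
    }

    // direct two-pass population variance for comparison
    double m = 0.0;
    for (double r : distances) {
      m += r;
    }
    m /= distances.length;
    double direct = 0.0;
    for (double r : distances) {
      direct += (r - m) * (r - m);
    }
    direct /= distances.length;

    System.out.println("online variance   = " + var);     // ~5.0e-7
    System.out.println("two-pass variance = " + direct);  // ~5.0e-7
    System.out.println("online std        = " + Math.sqrt(var));
  }
}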


> For CDbw, they need a scalar, average std value, and this is currently
> computed by averaging the vector terms:
>
> double d = std.zSum() / std.size();
>
> The more I read about it; however, the less confident I am about this
> approach.


That makes reasonable sense.


> The paper itself seems to indicate a covariance approach, but I am lost in
> their notation. See page 5, just above Definition 1.
>
>
> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
>

I will take a quick glance.