Posted to dev@commons.apache.org by Al Chou <ho...@yahoo.com> on 2003/06/14 07:57:23 UTC

Re: [math] more improvement to storage free mean, variance computation

>Date: Wed, 04 Jun 2003 21:05:14 -0700
>From: Phil Steitz <ph...@steitz.com>
>Subject: [math] more improvement to storage free mean, variance computation
>
>Check out procedure sum.2 and var.2 in
>
>http://www.stanford.edu/~glynn/PDF/0208.pdf
>
>The first looks like Brent's suggestion for a corrected mean 
>computation, with no memory required.  The additional computational cost 
>that I complained about is documented to be 3x the flops cost of the 
>direct computation, but the computation is claimed to be more stable.  So 
>the question is: do we pay the flops cost to get the numerical 
>stability?  The example in the paper is compelling; but it uses small 
>words (err, numbers I mean -- sorry, slipped into my native Fortran for 
>a moment there ;-)).  So how do we go about deciding whether the 
>stability in the mean computation is worth the increased computational 
>effort?  I would prefer not to answer "let the user decide".  To make 
>the decision harder, we should note that it is actually worse than 3x, 
>since in the no storage version, the user may request the mean only 
>rarely (if at all) and the 3x comparison is against computing the mean 
>for each value added.
>
>The variance formula looks better than what we have now, still requiring 
>no memory.  Should we implement this for the no storage case?

After implementing var.2 from the Stanford paper in UnivariateImpl and
scratching my head for some time over why the variance calculation failed its
JUnit test case, I realized there's a flaw in var.2 that I can't understand no
one talks about.  To update the variance (called S in the paper), the formula
calculates

z = y / i
S = S + (i-1) * y * z

where i is the number of data values (including the value just being added to
the collection).  It doesn't really matter how y is defined, because you will
notice that

S = S + (i-1) * y * y / i
  = S + (i-1) * y**2 / i

which means that S can never decrease in magnitude (for real data, which is
what we're talking about).  But for the simple case of three data values {1, 2,
2} in the JUnit test case, the variance decreases between the addition of the
second and third data values.
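
The numbers bear this out.  As a quick check (a hypothetical sketch, not part of the original message; the class name is invented), the following runs the var.2 update on {1, 2, 2} and prints both S and the sample variance S / (i - 1):

```java
// Hypothetical check of var.2 on {1, 2, 2}: S itself never decreases,
// but the sample variance S / (i - 1) drops at the third value.
public class Var2Check {

    // Returns S after each update, starting from the second value,
    // following the var.2 pseudocode (m seeded with the first value).
    static double[] sValues(double[] vals) {
        double m = vals[0];  // running mean
        double s = 0.0;      // running sum of squared deviations, S
        double[] out = new double[vals.length - 1];
        for (int i = 2; i <= vals.length; i++) {
            double y = vals[i - 1] - m;
            double z = y / i;
            m += z;
            s += (i - 1) * y * z;  // nonnegative increment: S only grows
            out[i - 2] = s;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] s = sValues(new double[] { 1.0, 2.0, 2.0 });
        for (int j = 0; j < s.length; j++) {
            // divisor j + 1 is (i - 1) for the i-th value
            System.out.println("S=" + s[j] + "  variance=" + (s[j] / (j + 1)));
        }
    }
}
```

With {1, 2, 2}, S goes from 0.5 up to 2/3, while the variance goes from 0.5 down to 1/3, matching the sample variances of {1, 2} and {1, 2, 2}.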

Can anyone point out what I'm missing here?



Al

=====
Albert Davidson Chou

    Get answers to Mac questions at http://www.Mac-Mgrs.org/ .

__________________________________
Do you Yahoo!?
SBC Yahoo! DSL - Now only $29.95 per month!
http://sbc.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] more improvement to storage free mean, variance computation

Posted by Al Chou <ho...@yahoo.com>.
--- Phil Steitz <ph...@steitz.com> wrote:
> Al Chou wrote:
> > --- Phil Steitz <ph...@steitz.com> wrote:
> >>I think that is OK, since if you look at the definition of S earlier in 
> >>the paper, S is not the variance, it is the sum of the squared 
> >>deviations from the mean.  This should be always increasing.
> > 
> > 
> > Where is that definition?  I'm looking at equations 3 and 4, which define
> > S_{1,q} (in LaTeX notation), and the return statement in algorithm
> Procedure
> > var.2, which says S_{1,q} = S.
> 
> Equation 3 defines S to be the sum of squared differences between L_i 
> and L-bar, which are defined on p. 209 to be the observed values and 
> their mean.

Damn, you're right, sorry for being dense.  I wonder why they talk about
variance both before equation 3 and equation 5 and then don't bother to do the
final division that calculates it?  In any case, once I see whether my laptop
will boot stably, I'll put in the division by (i - 1) that turns S into the
variance and see how it goes.
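
A sketch of that fix (class and method names are my own invention, not the actual UnivariateImpl code): keep the sum.2/var.2 recurrences as-is, and do the division by (n - 1) only when the variance is requested:

```java
// Hypothetical storage-free accumulator based on sum.2/var.2; the
// final division by (n - 1) turns S into the sample variance.
public class RunningVariance {
    private double m = 0.0; // running mean
    private double s = 0.0; // running sum of squared deviations, S
    private int n = 0;      // count of values added

    public void addValue(double x) {
        n++;
        double y = x - m;     // deviation from the previous mean
        double z = y / n;
        m += z;               // sum.2 mean update
        s += (n - 1) * y * z; // var.2 update of S
    }

    public double getMean() {
        return n == 0 ? Double.NaN : m;
    }

    public double getVariance() {
        // the division Procedure var.2 leaves to the caller
        return n < 2 ? Double.NaN : s / (n - 1);
    }

    public static void main(String[] args) {
        RunningVariance v = new RunningVariance();
        for (double x : new double[] { 1.0, 2.0, 2.0 }) {
            v.addValue(x);
        }
        System.out.println("mean=" + v.getMean() + " variance=" + v.getVariance());
    }
}
```

This also softens the "pay the flops on every add" concern from the original message: the O(1) update runs per value, but the division happens only when getVariance() is actually called.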


Al



Re: [math] more improvement to storage free mean, variance computation

Posted by Phil Steitz <ph...@steitz.com>.
Al Chou wrote:
> --- Phil Steitz <ph...@steitz.com> wrote:
> 
>>Al Chou wrote:
>>
>>>
>>>
>>>After implementing var.2 from the Stanford paper in UnivariateImpl and
>>>scratching my head for some time over why the variance calculation failed
>>>its JUnit test case, I realized there's a flaw in var.2 that I can't
>>>understand no one talks about.  To update the variance (called S in the
>>>paper), the formula calculates
>>>
>>>z = y / i
>>>S = S + (i-1) * y * z
>>>
>>>where i is the number of data values (including the value just being added
>>>to the collection).  It doesn't really matter how y is defined, because
>>>you will notice that
>>>
>>>S = S + (i-1) * y * y / i
>>>  = S + (i-1) * y**2 / i
>>>
>>>which means that S can never decrease in magnitude (for real data, which
>>>is what we're talking about).  But for the simple case of three data
>>>values {1, 2, 2} in the JUnit test case, the variance decreases between
>>>the addition of the second and third data values.
>>>
>>>Can anyone point out what I'm missing here?
>>>
>>>
>>I think that is OK, since if you look at the definition of S earlier in 
>>the paper, S is not the variance, it is the sum of the squared 
>>deviations from the mean.  This should be always increasing.
> 
> 
> Where is that definition?  I'm looking at equations 3 and 4, which define
> S_{1,q} (in LaTeX notation), and the return statement in algorithm Procedure
> var.2, which says S_{1,q} = S.

Equation 3 defines S to be the sum of squared differences between L_i 
and L-bar, which are defined on p. 209 to be the observed values and 
their mean.
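
In other words (hedging slightly on the paper's exact symbols), the quantity the procedure returns is

```latex
S_{1,q} = \sum_{i=1}^{q} \left( L_i - \bar{L} \right)^2 ,
\qquad
\text{sample variance} = \frac{S_{1,q}}{q - 1} ,
```

so Procedure var.2 hands back the sum of squared deviations and leaves the division by q - 1 to the caller.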

> 
> Anyway, I think the resolution is contained in messages to follow shortly.
> 
> 
> Al
> 


Re: [math] more improvement to storage free mean, variance computation

Posted by Al Chou <ho...@yahoo.com>.
--- Phil Steitz <ph...@steitz.com> wrote:
> Al Chou wrote:
> > 
> > 
> > After implementing var.2 from the Stanford paper in UnivariateImpl and
> > scratching my head for some time over why the variance calculation failed
> > its JUnit test case, I realized there's a flaw in var.2 that I can't
> > understand no one talks about.  To update the variance (called S in the
> > paper), the formula calculates
> > 
> > z = y / i
> > S = S + (i-1) * y * z
> > 
> > where i is the number of data values (including the value just being added
> > to the collection).  It doesn't really matter how y is defined, because
> > you will notice that
> > 
> > S = S + (i-1) * y * y / i
> >   = S + (i-1) * y**2 / i
> > 
> > which means that S can never decrease in magnitude (for real data, which
> > is what we're talking about).  But for the simple case of three data
> > values {1, 2, 2} in the JUnit test case, the variance decreases between
> > the addition of the second and third data values.
> > 
> > Can anyone point out what I'm missing here?
> > 
> > 
> > 
> > 
> I think that is OK, since if you look at the definition of S earlier in 
> the paper, S is not the variance, it is the sum of the squared 
> deviations from the mean.  This should be always increasing.

Where is that definition?  I'm looking at equations 3 and 4, which define
S_{1,q} (in LaTeX notation), and the return statement in algorithm Procedure
var.2, which says S_{1,q} = S.

Anyway, I think the resolution is contained in messages to follow shortly.


Al



Re: [math] more improvement to storage free mean, variance computation

Posted by Phil Steitz <ph...@steitz.com>.
Al Chou wrote:
> 
> 
> After implementing var.2 from the Stanford paper in UnivariateImpl and
> scratching my head for some time over why the variance calculation failed its
> JUnit test case, I realized there's a flaw in var.2 that I can't understand no
> one talks about.  To update the variance (called S in the paper), the formula
> calculates
> 
> z = y / i
> S = S + (i-1) * y * z
> 
> where i is the number of data values (including the value just being added to
> the collection).  It doesn't really matter how y is defined, because you will
> notice that
> 
> S = S + (i-1) * y * y / i
>   = S + (i-1) * y**2 / i
> 
> which means that S can never decrease in magnitude (for real data, which is
> what we're talking about).  But for the simple case of three data values {1, 2,
> 2} in the JUnit test case, the variance decreases between the addition of the
> second and third data values.
> 
> Can anyone point out what I'm missing here?
> 
> 
I think that is OK, since if you look at the definition of S earlier in 
the paper, S is not the variance, it is the sum of the squared 
deviations from the mean.  This should be always increasing.

> 
> Al
> 
> =====
> Albert Davidson Chou
> 
>     Get answers to Mac questions at http://www.Mac-Mgrs.org/ .
> 
> __________________________________
> Do you Yahoo!?
> SBC Yahoo! DSL - Now only $29.95 per month!
> http://sbc.yahoo.com
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
> 




---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] more improvement to storage free mean, variance computation

Posted by Al Chou <ho...@yahoo.com>.
--- "Mark R. Diggory" <md...@latte.harvard.edu> wrote:
> Al Chou wrote:
> > After implementing var.2 from the Stanford paper in UnivariateImpl and
> > scratching my head for some time over why the variance calculation failed
> > its JUnit test case, I realized there's a flaw in var.2 that I can't
> > understand no one talks about.  To update the variance (called S in the
> > paper), the formula calculates
> > 
> > z = y / i
> > S = S + (i-1) * y * z
> > 
> > where i is the number of data values (including the value just being added
> > to the collection).  It doesn't really matter how y is defined, because
> > you will notice that
> > 
> > S = S + (i-1) * y * y / i
> >   = S + (i-1) * y**2 / i
> > 
> > which means that S can never decrease in magnitude (for real data, which
> > is what we're talking about).  But for the simple case of three data
> > values {1, 2, 2} in the JUnit test case, the variance decreases between
> > the addition of the second and third data values.
> > 
> > Can anyone point out what I'm missing here?
> 
> Al, I see what you're saying.  I wrote a little example case to implement 
> the pseudo code they have in the paper:
> 
> public class SmallTest {
> 
>      public static void main(String[] args) {
>          double[] vals = new double[] { 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0 };
> 
>          double m = vals[0];
>          double s = 0.0;
> 
>          System.out.println("m=" + m);
>          System.out.println("s=" + s);
>          System.out.println("");
> 
>          for (int i = 2; i <= vals.length; i++) {
> 
>              double y = vals[i-1] - m;
>              double z = y / i;
>              m += z;
>              s += (i - 1) * y * z;
> 
>              System.out.println("y=" + y);
>              System.out.println("z=" + z);
>              System.out.println("m=" + m);
>              System.out.println("s=" + s);
>              System.out.println("");
>          }
>      }
> }
> 
> s does seem to increase even though the variance of the calculation 
> should be going down.
> 
> I want us to review this paper further and go back to the research of
> 
> Hanson, R. J. 1975. Stably updating mean and standard deviation of
> data. Communications of the ACM 18:57-58.
> 
> Let's verify if there's a typo in the equation or something.  Maybe these 
> guys even misinterpreted his work.

Thanks for trying it out, Mark.  Your code reads substantially the same as
mine, except that I was working inside of UnivariateImpl.

Google can't find the original paper online, but it does find Richard J.
Hanson's personal Web site, containing a bibliography of his publications and
two email addresses for him.  Anyone have the courage to email him without
having first read the original paper?  I wish I could derive the (or at least
an) updating variance formula myself; maybe I should try again.
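
For what it's worth, the updating formula does fall out in a couple of lines (standard algebra, not from the thread): with \bar{x}_n the mean of the first n values, S_n = \sum_{k=1}^{n} (x_k - \bar{x}_n)^2, y = x_n - \bar{x}_{n-1}, and z = y/n (so that \bar{x}_n = \bar{x}_{n-1} + z),

```latex
S_n = S_{n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)
    = S_{n-1} + y\,(y - z)
    = S_{n-1} + (n - 1)\,y\,z ,
```

which is exactly the var.2 update; the identity in the first line is the standard one behind Welford-style algorithms.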


Al



Re: [math] more improvement to storage free mean, variance computation

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Al Chou wrote:
> 
> 
> After implementing var.2 from the Stanford paper in UnivariateImpl and
> scratching my head for some time over why the variance calculation failed its
> JUnit test case, I realized there's a flaw in var.2 that I can't understand no
> one talks about.  To update the variance (called S in the paper), the formula
> calculates
> 
> z = y / i
> S = S + (i-1) * y * z
> 
> where i is the number of data values (including the value just being added to
> the collection).  It doesn't really matter how y is defined, because you will
> notice that
> 
> S = S + (i-1) * y * y / i
>   = S + (i-1) * y**2 / i
> 
> which means that S can never decrease in magnitude (for real data, which is
> what we're talking about).  But for the simple case of three data values {1, 2,
> 2} in the JUnit test case, the variance decreases between the addition of the
> second and third data values.
> 
> Can anyone point out what I'm missing here?

Al, I see what you're saying.  I wrote a little example case to implement 
the pseudo code they have in the paper:

public class SmallTest {

     public static void main(String[] args) {
         double[] vals = new double[] { 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0 };

         double m = vals[0]; // running mean, seeded with the first value
         double s = 0.0;     // running S from var.2

         System.out.println("m=" + m);
         System.out.println("s=" + s);
         System.out.println("");

         for (int i = 2; i <= vals.length; i++) {

             double y = vals[i-1] - m;  // deviation from the old mean
             double z = y / i;
             m += z;                    // sum.2 mean update
             s += (i - 1) * y * z;      // var.2 update of S

             System.out.println("y=" + y);
             System.out.println("z=" + z);
             System.out.println("m=" + m);
             System.out.println("s=" + s);
             System.out.println("");
         }
     }
}

s does seem to increase even though the variance of the calculation 
should be going down.

I want us to review this paper further and go back to the research of

Hanson, R. J. 1975. Stably updating mean and standard deviation of
data. Communications of the ACM 18:57-58.

Let's verify if there's a typo in the equation or something.  Maybe these 
guys even misinterpreted his work.

-M.

