You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by md...@apache.org on 2003/06/18 00:55:39 UTC

cvs commit: jakarta-commons-sandbox/math/src/java/org/apache/commons/math/stat StatUtils.java

mdiggory    2003/06/17 15:55:39

  Added:       math/src/java/org/apache/commons/math/stat StatUtils.java
  Log:
  Initial Addition of StatUtils. Some methods need review, implementation and possibly debugging
  
  Revision  Changes    Path
  1.1                  jakarta-commons-sandbox/math/src/java/org/apache/commons/math/stat/StatUtils.java
  
  Index: StatUtils.java
  ===================================================================
  /* ====================================================================
   * The Apache Software License, Version 1.1
   *
   * Copyright (c) 2003 The Apache Software Foundation.  All rights
   * reserved.
   *
   * Redistribution and use in source and binary forms, with or without
   * modification, are permitted provided that the following conditions
   * are met:
   *
   * 1. Redistributions of source code must retain the above copyright
   *    notice, this list of conditions and the following disclaimer.
   *
   * 2. Redistributions in binary form must reproduce the above copyright
   *    notice, this list of conditions and the following disclaimer in
   *    the documentation and/or other materials provided with the
   *    distribution.
   *
   * 3. The end-user documentation included with the redistribution, if
   *    any, must include the following acknowlegement:
   *       "This product includes software developed by the
   *        Apache Software Foundation (http://www.apache.org/)."
   *    Alternately, this acknowlegement may appear in the software itself,
   *    if and wherever such third-party acknowlegements normally appear.
   *
   * 4. The names "The Jakarta Project", "Commons", and "Apache Software
   *    Foundation" must not be used to endorse or promote products derived
   *    from this software without prior written permission. For written
   *    permission, please contact apache@apache.org.
   *
   * 5. Products derived from this software may not be called "Apache"
   *    nor may "Apache" appear in their names without prior written
   *    permission of the Apache Software Foundation.
   *
   * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
   * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
   * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
   * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
   * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
   * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
   * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
   * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
   * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
   * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
   * SUCH DAMAGE.
   * ====================================================================
   *
   * This software consists of voluntary contributions made by many
   * individuals on behalf of the Apache Software Foundation.  For more
   * information on the Apache Software Foundation, please see
   * <http://www.apache.org/>.
   */
  package org.apache.commons.math.stat;
  
  /**
   * StatUtils provides easy static implementations of common double[] based
   * statistical methods. These return a single result value or in some cases, as
   * identified in the javadoc for each method, Double.NaN.
   *
   * @author <a href="mailto:mdiggory@apache.org">Mark Diggory</a>
   */
  public class StatUtils {
  
  	/**
       * The sum of the values that have been added to Univariate.
       * @param values Is a double[] containing the values
       * @return the sum of the values or Double.NaN if the array is empty
  	 */
  	public static double sum(double[] values) {
  		double accum = 0.0;
  		for (int i = 0; i < values.length; i++) {
  			accum += values[i];
  		}
  		return accum;
  	}
  
  	/**
       * Returns the sum of the squares of the available values.
       * @param values Is a double[] containing the values
       * @return the sum of the squared values or Double.NaN if the array is empty
  	 */
  	public static double sumSq(double[] values) {
  		double accum = 0.0;
  		for (int i = 0; i < values.length; i++) {
  			accum += Math.pow(values[i], 2.0);
  		}
  		return accum;
  	}
  
  	/**
       * Returns the <a href=http://www.xycoon.com/arithmetic_mean.htm>
       * arithmetic mean </a> of the available values 
       * @param values Is a double[] containing the values
       * @return the mean of the values or Double.NaN if the array is empty
  	 */
  	public static double mean(double[] values) {
  		return sum(values) / values.length;
  	}
  
      /**
       *      
       * @param values Is a double[] containing the values
       * @return the result, Double.NaN if no values for an empty array 
       * or 0.0 for a single value set.  
       */
      public static double standardDeviation(double[] values) {
          double stdDev = Double.NaN;
          if (values.length != 0) {
              stdDev = Math.sqrt(variance(values));
          }
          return (stdDev);
      }
      
  	/**
       * Returns the variance of the available values.
       * @param values Is a double[] containing the values
       * @return the result, Double.NaN if no values for an empty array 
       * or 0.0 for a single value set.  
  	 */
  	public static double variance(double[] values) {
  		double variance = Double.NaN;
          
  		if (values.length == 1) {
  			variance = 0;
  		} else if (values.length > 1) {
  			double mean = mean(values);
  			double accum = 0.0;
  			for (int i = 0; i < values.length; i++) {
  				accum += Math.pow((values[i] - mean), 2.0);
  			}
  			variance = accum / (values.length - 1);
  		}
  		return variance;
  	}
  
      /**
       * Returns the skewness of a collection of values.  Skewness is a 
       * measure of the assymetry of a given distribution. 
       * @param values Is a double[] containing the values
       * @return the skewness of the values or Double.NaN if the array is empty
       */
      public double skewness(double[] values) {
          // Initialize the skewness
          double skewness = Double.NaN;
  
          // Get the mean and the standard deviation
          double mean = mean(values);
          double stdDev = standardDeviation(values);
  
          // Sum the cubes of the distance from the mean divided by the 
          // standard deviation
          double accum = 0.0;
          for (int i = 0; i < values.length; i++) {
              accum += Math.pow((values[i] - mean) / stdDev, 3.0);
          }
  
          // Get N
          double n = values.length;
  
          // Calculate skewness
          skewness = (n / ((n - 1) * (n - 2))) * accum;
  
          return skewness;
      }
  
      /**
       * Returns the kurtosis for this collection of values. Kurtosis is a 
       * measure of the "peakedness" of a distribution.
       * @param values Is a double[] containing the values
       * @return the kurtosis of the values or Double.NaN if the array is empty
       */
      public double kurtosis(double[] values) {
          // Initialize the kurtosis
          double kurtosis = Double.NaN;
  
          // Get the mean and the standard deviation
          double mean = mean(values);
          double stdDev = standardDeviation(values);
  
          // Sum the ^4 of the distance from the mean divided by the 
          // standard deviation
          double accum = 0.0;
          for (int i = 0; i < values.length; i++) {
              accum += Math.pow((values[i] - mean) / stdDev, 4.0);
          }
  
          // Get N
          double n = values.length;
  
          double coefficientOne = (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3));
          double termTwo = ((3 * Math.pow(n - 1, 2.0)) 
                             / ((n - 2) * (n - 3))); 
          // Calculate kurtosis
          kurtosis = (coefficientOne * accum) - termTwo;
  
          return kurtosis;
      }
      
  	/**
       * Returns the maximum of the available values
       * @param values Is a double[] containing the values
  	 * @return the maximum of the values or Double.NaN if the array is empty
  	 */
  	public static double max(double[] values) {
  		double max = Double.NaN;
  		for (int i = 0; i < values.length; i++) {
  			if (i == 0) {
  				max = values[i];
  			} else {
  				max = Math.max(max, values[i]);
  			}
  		}
  		return max;
  	}
  
  	/**
       * Returns the minimum of the available values
       * @param values Is a double[] containing the values
  	 * @return the minimum of the values or Double.NaN if the array is empty
  	 */
  	public static double min(double[] values) {
  		double min = Double.NaN;
  		for (int i = 0; i < values.length; i++) {
  			if (i == 0) {
  				min = values[i];
  			} else {
  				min = Math.min(min, values[i]);
  			}
  		}
  		return min;
  	}
      
      /** 
       * Returns the mode of the values that have been added.  The mode is
       * the element which occurs with the most frequency
       * @return the mode
       */
      public static double mode(){
          // Mode depends on a refactor Freq class
          String msg = "mode() is not yet implemented";
          throw new UnsupportedOperationException(msg);
      }
      
      /** 
       * Returns the mode of the values that have been added.  The mode is
       * the element which occurs with the most frequency
       * @return the mode
       */
      public static double median(double[] values){
          // Mode depends on a refactor Freq class
          String msg = "median() is not yet implemented";
          throw new UnsupportedOperationException(msg);
      }
  }
  
  

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] Math.pow usage was: Re: cvs commit: ...

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
J.Pietschmann wrote:

> Mark R. Diggory wrote:
>
>>>> J.Pietschmann wrote:
>>>>
>>>>> No. If you cast the base into a double there is not much risk of
>>>>> overflow: double x = n;  y=x*x; or y=((double)n)*((double)n);
>>>>> or even y=n*(double)n; (but avoid y=(double)n*n).
>>>>> Double mantissa has IIRC 52 bits, this should be good for integers
>>>>> up to 2^26=67108864 without loss of precision.
>>>>
>>
>> Is this correct, I've been reading (again, if I'm getting this 
>> correctly) the doubles mantissa is capable of supporting 15 -17 
>> decimal places, 2^26 is only 8.
>
>
> I meant: you can square integers up to 2^26 without loss of precision
> (due to truncation).
>
> J.Pietschmann 

Yes I do agree with you on that one.


-- 
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu



---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] Math.pow usage was: Re: cvs commit: ...

Posted by "J.Pietschmann" <j3...@yahoo.de>.
Mark R. Diggory wrote:
>>> J.Pietschmann wrote:
>>>
>>>> No. If you cast the base into a double there is not much risk of
>>>> overflow: double x = n;  y=x*x; or y=((double)n)*((double)n);
>>>> or even y=n*(double)n; (but avoid y=(double)n*n).
>>>> Double mantissa has IIRC 52 bits, this should be good for integers
>>>> up to 2^26=67108864 without loss of precision.
> 
> Is this correct, I've been reading (again, if I'm getting this 
> correctly) the doubles mantissa is capable of supporting 15 -17 decimal 
> places, 2^26 is only 8.

I meant: you can square integers up to 2^26 without loss of precision
(due to truncation).

J.Pietschmann



---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] Math.pow usage was: Re: cvs commit: ...

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Phil Steitz wrote:
> --- "Mark R. Diggory" <md...@latte.harvard.edu> wrote:
> 
>>Thanks for entertaining my sometimes naive questioning,
>>
>>J.Pietschmann wrote:
>>>No. If you cast the base into a double there is not much risk of
>>>overflow: double x = n;  y=x*x; or y=((double)n)*((double)n);
>>>or even y=n*(double)n; (but avoid y=(double)n*n).
>>>Double mantissa has IIRC 52 bits, this should be good for integers
>>>up to 2^26=67108864 without loss of precision.

Is this correct, I've been reading (again, if I'm getting this 
correctly) the doubles mantissa is capable of supporting 15 -17 decimal 
places, 2^26 is only 8.

http://www.particle.kth.se/~lindsey/JavaCourse/Book/Tech/Chapter02/floatingPt.html


>>
>>Wow, thats a great clarification, I understand your defense now. It 
>>would be best to cast n to double asap and always have * operating on 
>>doubles, so doing things like consolidating the casting of n like this
>>
>>((double) (n*(n - 1)))
> 
> 
>>is a poor choice, if I understand correctly, its wiser to do at least
>>
>>((double) n) * (n - 1)
> 
> 
> For avoiding overflow maybe, but not for precision, since if n is a long, 
> n*(n-1) will be computed exactly in the first case and then cast to a double.
> 

Just to clarify, and if I understand correctly, we are only talking 
about n in terms of "values.length". n is currently any integer >= 0. 
The concern is that calculating n*n for any n > Integer.MAX_VALUE^(1/2) 
and for calculating n*n*n for any n > Integer.MAX_VALUE^(1/3) there is 
significant overflow that can easily be rectified through pre-casting to 
doubles.

I think, that its only when the case that such a method begins to loose 
precision above the threshold of the double's mantissa of IIRC 52 bits 
which I think is ~1.0 x 10^17 significant digits, not quite as good as 
~1.0 x 10^19 max for long integers, but the behavior is more stable for 
larger doubles than for longs, resulting initially in loss of precision 
for numbers > ~1.0 x 10^17 instead of a hard overflow.

 From this I now draw that its often better to cast the int "n" to 
double prior to any operations preformed on it.

-Mark


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] Math.pow usage was: Re: cvs commit: ...

Posted by "J.Pietschmann" <j3...@yahoo.de>.
Phil Steitz wrote:
> I am also a little confused by J's analysis. Do we know definitively that using
> J2SE, there is precision loss in Math.pow(x,n) vs x*x*...*x (n terms) for small
> integer n?

Actually no. I didn't know they use a specific library routine
which is stunningly good. Some tests showed d*d equals pow(d,2)
uniformly for the whole range from 1E-10 to 1E11, and for cubes
there is one bit difference, which might well be a roundoff error
in d*d*d rather than precision loss in pow.

Mind you, testing exp(2*log(d)) showed expected behaviour, roughly
one bit difference for numbers in 1..10, logarithmically growing
with abs(exponent) to the expected 4 bits at 1E7.

The other issue is performance. I used loops which store i*i in
an integer array, i*(double)i and pow(i,2) in a double array. If
the first was 100%, the second used 120% and the third 1500% (yes,
more than an order of magnitude slower). For the first two results,
time is still swamped with loop overhead, array range check and
other stuff, using cubes instead of squares increased the first
loop time to 110%, second perhaps to 125%, while pow(d,3) nearly
tripled to ~4500%. The inlined HW-supported code I was familar
was not nearly as bad, in particular performance was basically
independent of the exponent.

A few attempts at eliminating the loop and other overhead from
measurement failed badly. Summing up the result instead of storing
it in an integer resulted in similar relative times, except that i*i
and i*(double)i had the same performance. I suspect that storing
32000 doubles (~256kB) hit harder on memory bandwidth than storing
the same number of integers.

>  If the answer is yes, we should establish the guideline that for
> integer n < 4, we use explicit products instead of Math.pow(-,n).

Well, there is no precision loss, and a few performance hogs here
and there are hardly noticed in the overall performance of an algorithm.
The latter is especially true for Java, where achiving the maximal
possible numerical performance for the hardware or even getting in the
same ballpark is illusionary anyway.
Nevertheless, I find x*x easier to read than Math.pow(x,2), in particular
if "x" is really a short-named variable.

J.Pietschmann


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] Math.pow usage was: Re: cvs commit: ...

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
J.Pietschmann wrote:

> Mark R. Diggory wrote:
>
>> (1) It is very important to also use ((double)x)*x instead of 
>> (double)(x*x), as the loss of precision starts to occur at far 
>> greater values than overflow occurs if one were doing integer arithmetic
>
>
> IIRC Java shares also the C behaviour in that n*n becomes
> negative instead of signalling an overflow. If this is
> embedded in a complicated expression you only notice strange
> results, or not even that. This can be quite hard to debug.
>
> I'm too lazy to run a test to confirm this right now, but I'm
> sure someone else will have done it when I wake up tomorrow :-)
>
> J.Pietschmann
>
Yes I wrote my own little test, it goes negative initially, eventually 
it seems to go to 1 if the magnitude of the overflow gets excessive

-- 
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu



---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] Math.pow usage was: Re: cvs commit: ...

Posted by "J.Pietschmann" <j3...@yahoo.de>.
Mark R. Diggory wrote:
> (1) It is very important to also use ((double)x)*x instead of 
> (double)(x*x), as the loss of precision starts to occur at far greater 
> values than overflow occurs if one were doing integer arithmetic

IIRC Java shares also the C behaviour in that n*n becomes
negative instead of signalling an overflow. If this is
embedded in a complicated expression you only notice strange
results, or not even that. This can be quite hard to debug.

I'm too lazy to run a test to confirm this right now, but I'm
sure someone else will have done it when I wake up tomorrow :-)

J.Pietschmann



---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] Math.pow usage was: Re: cvs commit: ...

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Phil Steitz wrote:

>--- Brent Worden <br...@worden.org> wrote:
>  
>
>>>I am also a little confused by J's analysis. Do we know
>>>definitively that using
>>>J2SE, there is precision loss in Math.pow(x,n) vs x*x*...*x (n
>>>terms) for small
>>>integer n?  If the answer is yes, we should establish the
>>>guideline that for
>>>integer n < 4, we use explicit products instead of Math.pow(-,n).
>>>
>>>Phil
>>>      
>>>
>>According to the J2SE javadoc, the math methods use fdlibm
>>(http://www.netlib.org/fdlibm/) for their implementations.  Here's the pow
>>implementation http://www.netlib.org/fdlibm/e_pow.c.  According to its
>>comments, its only perfectly accurate for the two integer argument case.
>>Otherwise it's "nearly rounded."
>>
>>    
>>
>
>Thanks for doing the research, Brent.  Based on this, I propose that we
>establish the guideline above -- i.e., avoid Math.pow(x,n) for (constant)
>integer n < 4 and floating point x.  I will submit a patch to developer.xml
>including this.
>
>Phil
>
Just to elaborate slightly,

its only perfectly accurate for the two integer argument case.
Otherwise it's "nearly rounded."

This does not mean that Math.pow is using Integer Arithmetic in 
accomplishing x^n. Java still does implicit casting to doubles prior to 
Math.pow, it has to as the only method signature for Math.pow is

public static native double pow(double a, double b);

this means it is doing double precision floating point arithmetic when 
calculating Math.pow( x, n). Because of this, for x values where x^n > 
Integer.MAX_VALUE, Math.pow does not overflow during the calculation. It 
is also the case that ((double)x) * x  carries the same precision as 
Math.pow(x,n) upto 2^52 bits if significant digits (the mantissa of a 
double).

So in terms of computational efficiency, ((double)x)*x is perfectly 
accurate for results upto 2^52 or 4.5036E+15 significant digits. With 
this in mind,

(1) It is very important to also use ((double)x)*x instead of 
(double)(x*x), as the loss of precision starts to occur at far greater 
values than overflow occurs if one were doing integer arithmetic

(2) Its appropriate to also use ((double)x)*x  in place of Math.pow(x,2) 
when x = values.length (0 <= x < Integer.MAX_VALUE), because both will 
begin loosing precision at about the same time and ((double)x)*x is a 
more efficient approach than Math.pow.

-Mark



---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


RE: [math] Math.pow usage was: Re: cvs commit: ...

Posted by Phil Steitz <st...@yahoo.com>.
--- Brent Worden <br...@worden.org> wrote:
> >
> > I am also a little confused by J's analysis. Do we know
> > definitively that using
> > J2SE, there is precision loss in Math.pow(x,n) vs x*x*...*x (n
> > terms) for small
> > integer n?  If the answer is yes, we should establish the
> > guideline that for
> > integer n < 4, we use explicit products instead of Math.pow(-,n).
> >
> > Phil
> 
> According to the J2SE javadoc, the math methods use fdlibm
> (http://www.netlib.org/fdlibm/) for their implementations.  Here's the pow
> implementation http://www.netlib.org/fdlibm/e_pow.c.  According to its
> comments, its only perfectly accurate for the two integer argument case.
> Otherwise it's "nearly rounded."
> 

Thanks for doing the research, Brent.  Based on this, I propose that we
establish the guideline above -- i.e., avoid Math.pow(x,n) for (constant)
integer n < 4 and floating point x.  I will submit a patch to developer.xml
including this.

Phil

> Brent Worden
> http://www.brent.worden.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
> 


__________________________________
Do you Yahoo!?
SBC Yahoo! DSL - Now only $29.95 per month!
http://sbc.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


RE: [math] Math.pow usage was: Re: cvs commit: ...

Posted by Brent Worden <br...@worden.org>.
>
> I am also a little confused by J's analysis. Do we know
> definitively that using
> J2SE, there is precision loss in Math.pow(x,n) vs x*x*...*x (n
> terms) for small
> integer n?  If the answer is yes, we should establish the
> guideline that for
> integer n < 4, we use explicit products instead of Math.pow(-,n).
>
> Phil

According to the J2SE javadoc, the math methods use fdlibm
(http://www.netlib.org/fdlibm/) for their implementations.  Here's the pow
implementation http://www.netlib.org/fdlibm/e_pow.c.  According to its
comments, its only perfectly accurate for the two integer argument case.
Otherwise it's "nearly rounded."

Brent Worden
http://www.brent.worden.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] Math.pow usage was: Re: cvs commit: ...

Posted by Phil Steitz <st...@yahoo.com>.
--- "Mark R. Diggory" <md...@latte.harvard.edu> wrote:
> Thanks for entertaining my sometimes naive questioning,
> 
> J.Pietschmann wrote:
> 
> > Mark R. Diggory wrote:
> >
> >> (1) Does it seem logical that when working with "n" (or 
> >> values.length) to use Math.pow(n, x), as positive integers, the risk 
> >> is actually 'integer overflow' when the array representing the number 
> >> of cases gets very large, for which the log implementation of 
> >> Math.pow would help retain greater numerical accuracy?
> >
> >
> > No. If you cast the base into a double there is not much risk of
> > overflow: double x = n;  y=x*x; or y=((double)n)*((double)n);
> > or even y=n*(double)n; (but avoid y=(double)n*n).
> > Double mantissa has IIRC 52 bits, this should be good for integers
> > up to 2^26=67108864 without loss of precision. 
> 
> Wow, thats a great clarification, I understand your defense now. It 
> would be best to cast n to double asap and always have * operating on 
> doubles, so doing things like consolidating the casting of n like this
> 
> ((double) (n*(n - 1)))

> 
> is a poor choice, if I understand correctly, its wiser to do at least
> 
> ((double) n) * (n - 1)

For avoiding overflow maybe, but not for precision, since if n is a long, 
n*(n-1) will be computed exactly in the first case and then cast to a double.

 > >
> > If you are dealing with floating point numbers, your concern is
> > loss of precision, not overflow. Apart from this, I don't understand
> > in what sense the log based Math.pow(values[i], 2.0) should be
> > favorable. If ther's precision loss for x*x, there will be at least
> > the same presicion loss for Math.pow(values[i], 2.0), because at least
> > the same number of bits will be missing from the mantissa. 
> 
> Again, its starting be obvious that I had some bad assumptions about 
> floating point arith. and any benefits from "e*log(2)+log(m)" 
> calculations in Math.pow. Your discussion has convinced me that its 
> usage isn't that great a benefit in terms of any numerical stability in 
> terms of working with (doubles).
> 
> But, now, I'm a little more confused, on a different subject I(we) took 
> on this exp*log strategy for calculating the geometric mean as:
> 
>     /**
>      * Returns the sum of the natural logs for this collection of values
>      * @param values Is a double[] containing the values
>      * @return the sumLog value or Double.NaN if the array is empty
>      */
>     public static double sumLog(double[] values) {
>         double sumLog = Double.NaN;
>         if (values.length > 0) {
>             sumLog = 0.0;
>             for (int i = 0; i < values.length; i++) {
>                 sumLog += Math.log(values[i]);
>             }
>         }
>         return sumLog;
>     }
> 
>     /**
>      * Returns the geometric mean for this collection of values
>      * @param values Is a double[] containing the values
>      * @return the geometric mean or Double.NaN if the array is empty or
>      * any of the values are &lt;= 0.
>      */
>     public static double geometricMean(double[] values) {
>         return Math.exp(sumLog(values) / (double) values.length);
>     }
> 
> I'm not sure, but isn't this applying a similar exp*log approach found 
> in math.pow.  How would you interpret this solution over the what we had 
> before? We approached the above because of concerns that (in the below 
> example) as the product in "product *= values[i]" gets very large, that 
> Math.pow(product(values),(1.0/values.length)); would then introduce a 
> loss of precision as "1.0/values.length" gets smaller and smaller. But, 
> doesn't the above algorithm introduce a  loss of precision when 
> values[i] are very small values as well (similar to what you are 
> describing in Math.pow(...))?
> 
>     /**
>      * Returns the product for this collection of values
>      * @param values Is a double[] containing the values
>      * @return the product values or Double.NaN if the array is empty
>      */
>     public static double product(double[] values) {
>         double product = Double.NaN;
>         if( values.length > 0 ) {
>             product = 1.0;
>             for( int i = 0; i < values.length; i++) {
>                 product *= values[i];
>             }
>         }
>         return product;
>     }
>    
>     /**
>      * Returns the geometric mean for this collection of values
>      * @param values Is a double[] containing the values
>      * @return the geometric mean or Double.NaN if the array is empty or
>      * any of the values are &lt;= 0.
>      */
>     public static double geometricMean(double[] values) {
>         return Math.pow(product(values),(1.0/values.length));
>     }
> 
I would like to hear J's opinion on this, but from my perspective, the
difference is in the bound on the number of terms in the product.  In the case
of the geometric mean, there is no bound and there is a real possiblity of
premature overflow.  Also, there is exponentiation applied in the computation
in any case to compute the statistic and this computation "might" be more
stable working with logs (need definitive confirmation of this).  In any case,
I have seen this computational strategy applied elsewhere for geometric means. 
Here again, some research should be done, or we should drop the statistic.

I am also a little confused by J's analysis. Do we know definitively that using
J2SE, there is precision loss in Math.pow(x,n) vs x*x*...*x (n terms) for small
integer n?  If the answer is yes, we should establish the guideline that for
integer n < 4, we use explicit products instead of Math.pow(-,n).

Phil


p.s. Let's all try to remember the [math] in the subject lines.
> 
> -Mark
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
> 


__________________________________
Do you Yahoo!?
SBC Yahoo! DSL - Now only $29.95 per month!
http://sbc.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: Math.pow usage was: Re: cvs commit: ...

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Thanks for entertaining my sometimes naive questioning,

J.Pietschmann wrote:

> Mark R. Diggory wrote:
>
>> (1) Does it seem logical that when working with "n" (or 
>> values.length) to use Math.pow(n, x), as positive integers, the risk 
>> is actually 'integer overflow' when the array representing the number 
>> of cases gets very large, for which the log implementation of 
>> Math.pow would help retain greater numerical accuracy?
>
>
> No. If you cast the base into a double there is not much risk of
> overflow: double x = n;  y=x*x; or y=((double)n)*((double)n);
> or even y=n*(double)n; (but avoid y=(double)n*n).
> Double mantissa has IIRC 52 bits, this should be good for integers
> up to 2^26=67108864 without loss of precision. 

Wow, thats a great clarification, I understand your defense now. It 
would be best to cast n to double asap and always have * operating on 
doubles, so doing things like consolidating the casting of n like this

((double) (n*(n - 1)))

is a poor choice, if I understand correctly, its wiser to do at least

((double) n) * (n - 1)

but unnecessary to go the extreme of:

((double)n) * (((double)n) - 1.0)

> OTOH the log is calculated
> for x=m*2^e with 0.5<=m<1 as e*log(2)+log(m), where log(m) can be
> calculated to full precision by a process called pseudodivision,
> see http://www.math.uni-bremen.de/~thielema/Research/arithmetics.pdf,
> however, for 2^26 you'll loose log(26*log(2))/log(2)~4 bits due to
> the first summand (the same effect happens for very small bases,
> like 1E-16). The exponentiation amplifies the error, although I'm not
> sure by how much.
> Well, with some luck the processor uses its guarding bits to cover
> the precision loss mentioned above (on i386, FP registers have a 64 bit
> denormalized mantissa in contrast to the 52 bit normalized mantissa for
> double). Strict math and IEEE 854 conformant hardware IIRC will discard
> them though.
>
>> (2) In the opposite case from below, where values[i] are very large, 
>> doesn't the log based Math.pow(values[i], 2.0) again work in our 
>> favor to reduce overflow? Seems a catch22.
>
>
> If you are dealing with floating point numbers, your concern is
> loss of precision, not overflow. Apart from this, I don't understand
> in what sense the log based Math.pow(values[i], 2.0) should be
> favorable. If ther's precision loss for x*x, there will be at least
> the same presicion loss for Math.pow(values[i], 2.0), because at least
> the same number of bits will be missing from the mantissa. 

Again, its starting be obvious that I had some bad assumptions about 
floating point arith. and any benefits from "e*log(2)+log(m)" 
calculations in Math.pow. Your discussion has convinced me that its 
usage isn't that great a benefit in terms of any numerical stability in 
terms of working with (doubles).

But, now, I'm a little more confused, on a different subject I(we) took 
on this exp*log strategy for calculating the geometric mean as:

    /**
     * Returns the sum of the natural logs for this collection of values
     * @param values Is a double[] containing the values
     * @return the sumLog value or Double.NaN if the array is empty
     */
    public static double sumLog(double[] values) {
        double sumLog = Double.NaN;
        if (values.length > 0) {
            sumLog = 0.0;
            for (int i = 0; i < values.length; i++) {
                sumLog += Math.log(values[i]);
            }
        }
        return sumLog;
    }

    /**
     * Returns the geometric mean for this collection of values
     * @param values Is a double[] containing the values
     * @return the geometric mean or Double.NaN if the array is empty or
     * any of the values are &lt;= 0.
     */
    public static double geometricMean(double[] values) {
        return Math.exp(sumLog(values) / (double) values.length);
    }

I'm not sure, but isn't this applying a similar exp*log approach found 
in math.pow.  How would you interpret this solution over the what we had 
before? We approached the above because of concerns that (in the below 
example) as the product in "product *= values[i]" gets very large, that 
Math.pow(product(values),(1.0/values.length)); would then introduce a 
loss of precision as "1.0/values.length" gets smaller and smaller. But, 
doesn't the above algorithm introduce a  loss of precision when 
values[i] are very small values as well (similar to what you are 
describing in Math.pow(...))?

    /**
     * Returns the product for this collection of values
     * @param values Is a double[] containing the values
     * @return the product values or Double.NaN if the array is empty
     */
    public static double product(double[] values) {
        double product = Double.NaN;
        if( values.length > 0 ) {
            product = 1.0;
            for( int i = 0; i < values.length; i++) {
                product *= values[i];
            }
        }
        return product;
    }
   
    /**
     * Returns the geometric mean for this collection of values
     * @param values Is a double[] containing the values
     * @return the geometric mean or Double.NaN if the array is empty or
     * any of the values are &lt;= 0.
     */
    public static double geometricMean(double[] values) {
        return Math.pow(product(values),(1.0/values.length));
    }


-Mark



---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: Math.pow usage was: Re: cvs commit: ...

Posted by "J.Pietschmann" <j3...@yahoo.de>.
Mark R. Diggory wrote:
> (1) Does it seem logical that when working with "n" (or values.length) 
> to use Math.pow(n, x), as positive integers, the risk is actually 
> 'integer overflow' when the array representing the number of cases gets 
> very large, for which the log implementation of Math.pow would help 
> retain greater numerical accuracy?

No. If you cast the base into a double there is not much risk of
overflow: double x = n;  y=x*x; or y=((double)n)*((double)n);
or even y=n*(double)n; (but avoid y=(double)n*n).
Double mantissa has IIRC 52 bits, this should be good for integers
up to 2^26=67108864 without loss of precision. OTOH the log is calculated
for x=m*2^e with 0.5<=m<1 as e*log(2)+log(m), where log(m) can be
calculated to full precision by a process called pseudodivision,
see http://www.math.uni-bremen.de/~thielema/Research/arithmetics.pdf,
however, for 2^26 you'll loose log(26*log(2))/log(2)~4 bits due to
the first summand (the same effect happens for very small bases,
like 1E-16). The exponentiation amplifies the error, although I'm not
sure by how much.
Well, with some luck the processor uses its guarding bits to cover
the precision loss mentioned above (on i386, FP registers have a 64 bit
denormalized mantissa in contrast to the 52 bit normalized mantissa for
double). Strict math and IEEE 854 conformant hardware IIRC will discard
them though.

> (2) In the opposite case from below, where values[i] are very large, 
> doesn't the log based Math.pow(values[i], 2.0) again work in our favor 
> to reduce overflow? Seems a catch22.

If you are dealing with floating point numbers, your concern is
loss of precision, not overflow. Apart from this, I don't understand
in what sense the log based Math.pow(values[i], 2.0) should be
favorable. If ther's precision loss for x*x, there will be at least
the same presicion loss for Math.pow(values[i], 2.0), because at least
the same number of bits will be missing from the mantissa.

> More than anyone really ever wanted to know about Math.pow's 
> implementation:
> http://www.netlib.org/fdlibm/e_pow.c

This is a pure software solution. With modern processors, hardware
supported solutions are the rule, which are much much much more
performant and less prone to precision loss without carrying expensive
guarding bits.

J.Pietschmann


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: Math.pow usage was: Re: cvs commit: ...

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Al Chou wrote:

>--- "Mark R. Diggory" <md...@latte.harvard.edu> wrote:
>  
>
>>(1) Does it seem logical that when working with "n" (or values.length) 
>>to use Math.pow(n, x), as positive integers, the risk is actually 
>>'integer overflow' when the array representing the number of cases gets 
>>very large, for which the log implementation of Math.pow would help 
>>retain greater numerical accuracy?
>>    
>>
>
>values.length would have to be > sqrt( Integer.MAX_VALUE ) ~ 46340 in order to
>run into that overflow, but I can imagine some users blithely using a
>64k-element array.
>
Or in UnivariateImpl's memoryless Inifinite window case, based on the 
number of times addValue(...) has been called.

>>(2) In the opposite case from below, where values[i] are very large, 
>>doesn't the log based Math.pow(values[i], 2.0) again work in our favor 
>>to reduce overflow? Seems a catch22.
>>
>>More than anyone really ever wanted to know about Math.pow's implementation:
>>http://www.netlib.org/fdlibm/e_pow.c
>>    
>>
>
>I notice that java.lang.StrictMath explicitly mentions fdlibm.  Does
>java.lang.Math also use fdlibm's routines?
>  
>
Yes, Math deligates to StrictMath in many cases, which then uses native 
methods to call fdlibm. Math.pow is one of these cases.

>Maybe we should just trust Math.pow (or StrictMath.pow?) by default until we
>create some test cases that give us cause to doubt it....
>
>Al
>



---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: Math.pow usage was: Re: cvs commit: ...

Posted by Al Chou <ho...@yahoo.com>.
--- "Mark R. Diggory" <md...@latte.harvard.edu> wrote:
> (1) Does it seem logical that when working with "n" (or values.length) 
> to use Math.pow(n, x), as positive integers, the risk is actually 
> 'integer overflow' when the array representing the number of cases gets 
> very large, for which the log implementation of Math.pow would help 
> retain greater numerical accuracy?

values.length would have to be > sqrt( Integer.MAX_VALUE ) ~ 46340 in order to
run into that overflow, but I can imagine some users blithely using a
64k-element array.


> (2) In the opposite case from below, where values[i] are very large, 
> doesn't the log based Math.pow(values[i], 2.0) again work in our favor 
> to reduce overflow? Seems a catch22.
> 
> More than anyone really ever wanted to know about Math.pow's implementation:
> http://www.netlib.org/fdlibm/e_pow.c

I notice that java.lang.StrictMath explicitly mentions fdlibm.  Does
java.lang.Math also use fdlibm's routines?


Maybe we should just trust Math.pow (or StrictMath.pow?) by default until we
create some test cases that give us cause to doubt it....



Al

=====
Albert Davidson Chou

    Get answers to Mac questions at http://www.Mac-Mgrs.org/ .

__________________________________
Do you Yahoo!?
SBC Yahoo! DSL - Now only $29.95 per month!
http://sbc.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: Math.pow usage was: Re: cvs commit: ...

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
(1) Does it seem logical that when working with "n" (or values.length) 
to use Math.pow(n, x), as positive integers, the risk is actually 
'integer overflow' when the array representing the number of cases gets 
very large, for which the log implementation of Math.pow would help 
retain greater numerical accuracy?
.
(2) In the opposite case from below, where values[i] are very large, 
doesn't the log based Math.pow(values[i], 2.0) again work in our favor 
to reduce overflow? Seems a catch22.

More than anyone really ever wanted to know about Math.pow's implementation:
http://www.netlib.org/fdlibm/e_pow.c

-Mark

,J.Pietschmann wrote:

>> accum += Math.pow(values[i], 2.0);
>
>
> I think Math.pow should not be used for small integer exponents,
> in particular 2,3 and perhaps 5. You could do this in FORTRAN or
> other languages with a built-in power operator, because the compiler
> will optimize it, but languages without that usually rewrite
> pos(x,y) as exp(y*log(x)). This is bad for precision, in particular
> if x is very small, and somewhat bad for performance (taking time of
> one multiplication and two divisions, typically costing 3 to 20 times
> as much as a multiplication each).
> I'm not sure how Java handles it, but the loss of precision may
> be responsible for the unexpected negative values.
>
> I'd recommend to replace pow(x,2) by x*x, pow(x,3) by x*x*x and
> pow(x,4) by a=x*x, y=a*a. It doesn't matter much whether x is an
> expression because the compiler will take care of it and eliminate
> redundant computation, however code may be more redable if a
> temporary variable is used. If x is an integer, an explicit cast
> to double may be necessary to avoid integer overflow.
>
> Math.pow should only be used if the exponent is larger, not an
> integer, or not a constant.
>
> J.Pietschmann
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>



---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Math.pow usage was: Re: cvs commit: ...

Posted by "J.Pietschmann" <j3...@yahoo.de>.
> accum += Math.pow(values[i], 2.0);

I think Math.pow should not be used for small integer exponents,
in particular 2,3 and perhaps 5. You could do this in FORTRAN or
other languages with a built-in power operator, because the compiler
will optimize it, but languages without that usually rewrite
pos(x,y) as exp(y*log(x)). This is bad for precision, in particular
if x is very small, and somewhat bad for performance (taking time of
one multiplication and two divisions, typically costing 3 to 20 times
as much as a multiplication each).
I'm not sure how Java handles it, but the loss of precision may
be responsible for the unexpected negative values.

I'd recommend to replace pow(x,2) by x*x, pow(x,3) by x*x*x and
pow(x,4) by a=x*x, y=a*a. It doesn't matter much whether x is an
expression because the compiler will take care of it and eliminate
redundant computation, however code may be more redable if a
temporary variable is used. If x is an integer, an explicit cast
to double may be necessary to avoid integer overflow.

Math.pow should only be used if the exponent is larger, not an
integer, or not a constant.

J.Pietschmann



---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org