You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@commons.apache.org by Jeff Drew <je...@gmail.com> on 2006/11/07 23:51:15 UTC
Commons Math vs. Excel stats?
I'm having a weird problem when using the commons math package. When I run
statistics using the Commons math, then compare the results to Excel, I get
different standard deviation and median, but min, max, and count are the
same. I'd appreciate any ideas on how Commons Math and Excel differ in
these calculations.
MEDIAN: Excel: 468,231 CommonsMath: 485,711
STD: Excel: 11,861 CommonsMath: 10,678
The data set is 18,000 integers so I won't include those. They are mostly 6
digit numbers. Here's the code:
import org.apache.commons.math.stat.descriptive.moment.StandardDeviation;
import org.apache.commons.math.stat.descriptive.rank.Max;
import org.apache.commons.math.stat.descriptive.rank.Median;
import org.apache.commons.math.stat.descriptive.rank.Min;
import gnu.trove.TDoubleHashSet;
public class ExampleForMailingList {
StandardDeviation std = new StandardDeviation( );
Min min = new Min( );
Max max = new Max( );
Median medianInstance = new Median();
private double minimum = 0;
private double maximum = 0;
private double standardDev = 0;
private double median = 0;
private boolean isCalcDone = false;
private double count = 0;
/**
* <code>data</code> If the length is zero, then only 0 measurements
were added.
*/
TDoubleHashSet data = new TDoubleHashSet( );
/**
* If the <code>measurement</code> is greater than 0, then add it to the
data.
*
* @param measurement
*/
public void addMeasurement( int measurement ) {
data.add( measurement );
count++;
}
/**
* Must be called before using the getters. This method calculates the
statistics.
*/
public void calculate() {
try {
double[] dataArray = data.toArray( );
minimum = min.evaluate( dataArray );
maximum = max.evaluate( dataArray );
standardDev = std.evaluate( dataArray );
median = medianInstance.evaluate(dataArray);
isCalcDone = true;
} catch ( RuntimeException e ) {
// TODO Auto-generated catch block
e.printStackTrace( );
}
} // calculate
public double getMinimum() throws CalcNotDoneException {
return minimum;
} // get minimum
public double getMaximum() throws CalcNotDoneException {
return maximum;
} // get maximum
public double getStd() throws CalcNotDoneException {
return standardDev;
} // get std
public double getMedian() throws CalcNotDoneException {
return median;
} // get median
/**
* Converts a result set into a set of statistics which a table model
consumes. Calculates: <br>
* 1. min <br>
* 2. average <br>
* 3. max <br>
* 4. median<br>
* 5. percent threshold violations <br>
* @param resultSetArg
* Results of an order table query
*/
public void processResults( ResultSet results,String column ) {
int value = Integer.MAX_VALUE;
try {
while ( results.next( ) ) {
value = ( int ) results.getLong( column );
if ( value > -1 ) {
addMeasurement( value );
}
}
} catch ( SQLException e ) {
// TODO Auto-generated catch block
e.printStackTrace();
} // while
} // processResults
public static void main( String[] args ) {
ExampleForMailingList example = new ExampleForMailingList();
example.processResults(ResultSet set,"columnA");
example.calculate( );
System.out.println("std: "+ example.getStd( ));
System.out.println("std: "+ example.getMedian( ));
}
}
Thanks!
Re: Commons Math vs. Excel stats?
Posted by Remko Popma <re...@daiwasmbc.co.jp>.
Median is defined as the middle value of a sorted list, or if the list
has an even number of elements, the average of the two middle values of
the sorted list.
The code below uses this definition to calculate the median. If this
code gives the same answer as commons math for your large dataset, then
perhaps Excel is wrong.
double median(double[] values) {
if (values.length == 0) {
return Double.NaN; // act same as commons.Math
}
double[] copy = new double[values.length];
System.arraycopy(values, 0, copy, 0, values.length);
java.util.Arrays.sort(copy);
int midpos = copy.length / 2;
if (copy.length % 2 == 1) { // odd number of elements
return copy[midpos]; // return middle element
}
// even number of elements: return avg of two middle elements
return (copy[midpos - 1] + copy[midpos]) / 2;
}
I tested with some small datasets, and this method gave the same answer
as commons.Math...
Hope this helps,
--Remko Popma
Jeff Drew wrote:
> I'm having a weird problem when using the commons math package. When I run
> statistics using the Commons math, then compare the results to Excel, I get
> different standard deviation and median, but min, max, and count are the
> same. I'd appreciate any ideas on how Commons Math and Excel differ in
> these calculations.
>
> MEDIAN: Excel: 468,231 CommonsMath: 485,711
> STD: Excel: 11,861 CommonsMath: 10,678
>
> The data set is 18,000 integers so I won't include those. They are
> mostly 6
> digit numbers. Here's the code:
>
> import org.apache.commons.math.stat.descriptive.moment.StandardDeviation;
> import org.apache.commons.math.stat.descriptive.rank.Max;
> import org.apache.commons.math.stat.descriptive.rank.Median;
> import org.apache.commons.math.stat.descriptive.rank.Min;
> import gnu.trove.TDoubleHashSet;
>
> public class ExampleForMailingList {
>
> StandardDeviation std = new StandardDeviation( );
>
> Min min = new Min( );
>
> Max max = new Max( );
>
> Median medianInstance = new Median();
>
> private double minimum = 0;
>
> private double maximum = 0;
>
> private double standardDev = 0;
>
> private double median = 0;
>
> private boolean isCalcDone = false;
>
> private double count = 0;
>
> /**
> * <code>data</code> If the length is zero, then only 0 measurements
> were added.
> */
> TDoubleHashSet data = new TDoubleHashSet( );
>
> /**
> * If the <code>measurement</code> is greater than 0, then add it to the
> data.
> *
> * @param measurement
> */
> public void addMeasurement( int measurement ) {
>
> data.add( measurement );
>
> count++;
> }
>
> /**
> * Must be called before using the getters. This method calculates the
> statistics.
> */
> public void calculate() {
>
> try {
> double[] dataArray = data.toArray( );
>
> minimum = min.evaluate( dataArray );
>
> maximum = max.evaluate( dataArray );
>
> standardDev = std.evaluate( dataArray );
>
> median = medianInstance.evaluate(dataArray);
>
> isCalcDone = true;
>
> } catch ( RuntimeException e ) {
> // TODO Auto-generated catch block
> e.printStackTrace( );
> }
> } // calculate
>
> public double getMinimum() throws CalcNotDoneException {
> return minimum;
> } // get minimum
>
> public double getMaximum() throws CalcNotDoneException {
> return maximum;
> } // get maximum
>
> public double getStd() throws CalcNotDoneException {
> return standardDev;
> } // get std
>
> public double getMedian() throws CalcNotDoneException {
> return median;
> } // get median
>
> /**
> * Converts a result set into a set of statistics which a table model
> consumes. Calculates: <br>
> * 1. min <br>
> * 2. average <br>
> * 3. max <br>
> * 4. median<br>
> * 5. percent threshold violations <br>
>
> * @param resultSetArg
> * Results of an order table query
> */
> public void processResults( ResultSet results,String column ) {
>
> int value = Integer.MAX_VALUE;
>
> try {
> while ( results.next( ) ) {
>
> value = ( int ) results.getLong( column );
>
> if ( value > -1 ) {
> addMeasurement( value );
> }
> }
> } catch ( SQLException e ) {
> // TODO Auto-generated catch block
> e.printStackTrace();
> } // while
> } // processResults
>
> public static void main( String[] args ) {
> ExampleForMailingList example = new ExampleForMailingList();
> example.processResults(ResultSet set,"columnA");
> example.calculate( );
>
> System.out.println("std: "+ example.getStd( ));
> System.out.println("std: "+ example.getMedian( ));
> }
> }
>
> Thanks!
>
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-user-help@jakarta.apache.org
Re: Commons Math vs. Excel stats?
Posted by Rory Winston <rw...@eircom.net>.
This example was really bugging me, as I had a hard time believing that
[math] and Excel could be so far out of whack. I did a simple experiment
and posted the results on my blog:
http://www.researchkitchen.co.uk/blog/archives/75
Jeff, this may show that the issue is likely somewhere else in your
code. If you're still having issues, you may actually want to post the
actual data used, if possible, so somebody else can verify it with your
data.
Cheers
Rory
Jeff Drew wrote:
> I'm having a weird problem when using the commons math package. When
> I run
> statistics using the Commons math, then compare the results to Excel,
> I get
> different standard deviation and median, but min, max, and count are the
> same. I'd appreciate any ideas on how Commons Math and Excel differ in
> these calculations.
>
> MEDIAN: Excel: 468,231 CommonsMath: 485,711
> STD: Excel: 11,861 CommonsMath: 10,678
>
> The data set is 18,000 integers so I won't include those. They are
> mostly 6
> digit numbers. Here's the code:
>
> import org.apache.commons.math.stat.descriptive.moment.StandardDeviation;
> import org.apache.commons.math.stat.descriptive.rank.Max;
> import org.apache.commons.math.stat.descriptive.rank.Median;
> import org.apache.commons.math.stat.descriptive.rank.Min;
> import gnu.trove.TDoubleHashSet;
>
> public class ExampleForMailingList {
>
> StandardDeviation std = new StandardDeviation( );
>
> Min min = new Min( );
>
> Max max = new Max( );
>
> Median medianInstance = new Median();
>
> private double minimum = 0;
>
> private double maximum = 0;
>
> private double standardDev = 0;
>
> private double median = 0;
>
> private boolean isCalcDone = false;
>
> private double count = 0;
>
> /**
> * <code>data</code> If the length is zero, then only 0 measurements
> were added.
> */
> TDoubleHashSet data = new TDoubleHashSet( );
>
> /**
> * If the <code>measurement</code> is greater than 0, then add it
> to the
> data.
> *
> * @param measurement
> */
> public void addMeasurement( int measurement ) {
>
> data.add( measurement );
>
> count++;
> }
>
> /**
> * Must be called before using the getters. This method calculates
> the
> statistics.
> */
> public void calculate() {
>
> try {
> double[] dataArray = data.toArray( );
>
> minimum = min.evaluate( dataArray );
>
> maximum = max.evaluate( dataArray );
>
> standardDev = std.evaluate( dataArray );
>
> median = medianInstance.evaluate(dataArray);
>
> isCalcDone = true;
>
> } catch ( RuntimeException e ) {
> // TODO Auto-generated catch block
> e.printStackTrace( );
> }
> } // calculate
>
> public double getMinimum() throws CalcNotDoneException {
> return minimum;
> } // get minimum
>
> public double getMaximum() throws CalcNotDoneException {
> return maximum;
> } // get maximum
>
> public double getStd() throws CalcNotDoneException {
> return standardDev;
> } // get std
>
> public double getMedian() throws CalcNotDoneException {
> return median;
> } // get median
>
> /**
> * Converts a result set into a set of statistics which a table model
> consumes. Calculates: <br>
> * 1. min <br>
> * 2. average <br>
> * 3. max <br>
> * 4. median<br>
> * 5. percent threshold violations <br>
>
> * @param resultSetArg
> * Results of an order table query
> */
> public void processResults( ResultSet results,String column ) {
>
> int value = Integer.MAX_VALUE;
>
> try {
> while ( results.next( ) ) {
>
> value = ( int ) results.getLong( column );
>
> if ( value > -1 ) {
> addMeasurement( value );
> }
> }
> } catch ( SQLException e ) {
> // TODO Auto-generated catch block
> e.printStackTrace();
> } // while
> } // processResults
>
> public static void main( String[] args ) {
> ExampleForMailingList example = new ExampleForMailingList();
> example.processResults(ResultSet set,"columnA");
> example.calculate( );
>
> System.out.println("std: "+ example.getStd( ));
> System.out.println("std: "+ example.getMedian( ));
> }
> }
>
> Thanks!
>
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-user-help@jakarta.apache.org
Re: Commons Math vs. Excel stats?
Posted by Henri Yandell <fl...@gmail.com>.
I remember having similar things a while back (between custom Java and
Excel rather than Commons Math). I seem to recall Excel having some
overflow oddities/bugs that led to the incorrect results.
It was a couple of years ago at a previous company, so my memory isn't
too hot :)
Hen
On 11/7/06, Jeff Drew <je...@gmail.com> wrote:
> I'm having a weird problem when using the commons math package. When I run
> statistics using the Commons math, then compare the results to Excel, I get
> different standard deviation and median, but min, max, and count are the
> same. I'd appreciate any ideas on how Commons Math and Excel differ in
> these calculations.
>
> MEDIAN: Excel: 468,231 CommonsMath: 485,711
> STD: Excel: 11,861 CommonsMath: 10,678
>
> The data set is 18,000 integers so I won't include those. They are mostly 6
> digit numbers. Here's the code:
>
> import org.apache.commons.math.stat.descriptive.moment.StandardDeviation;
> import org.apache.commons.math.stat.descriptive.rank.Max;
> import org.apache.commons.math.stat.descriptive.rank.Median;
> import org.apache.commons.math.stat.descriptive.rank.Min;
> import gnu.trove.TDoubleHashSet;
>
> public class ExampleForMailingList {
>
> StandardDeviation std = new StandardDeviation( );
>
> Min min = new Min( );
>
> Max max = new Max( );
>
> Median medianInstance = new Median();
>
> private double minimum = 0;
>
> private double maximum = 0;
>
> private double standardDev = 0;
>
> private double median = 0;
>
> private boolean isCalcDone = false;
>
> private double count = 0;
>
> /**
> * <code>data</code> If the length is zero, then only 0 measurements
> were added.
> */
> TDoubleHashSet data = new TDoubleHashSet( );
>
> /**
> * If the <code>measurement</code> is greater than 0, then add it to the
> data.
> *
> * @param measurement
> */
> public void addMeasurement( int measurement ) {
>
> data.add( measurement );
>
> count++;
> }
>
> /**
> * Must be called before using the getters. This method calculates the
> statistics.
> */
> public void calculate() {
>
> try {
> double[] dataArray = data.toArray( );
>
> minimum = min.evaluate( dataArray );
>
> maximum = max.evaluate( dataArray );
>
> standardDev = std.evaluate( dataArray );
>
> median = medianInstance.evaluate(dataArray);
>
> isCalcDone = true;
>
> } catch ( RuntimeException e ) {
> // TODO Auto-generated catch block
> e.printStackTrace( );
> }
> } // calculate
>
> public double getMinimum() throws CalcNotDoneException {
> return minimum;
> } // get minimum
>
> public double getMaximum() throws CalcNotDoneException {
> return maximum;
> } // get maximum
>
> public double getStd() throws CalcNotDoneException {
> return standardDev;
> } // get std
>
> public double getMedian() throws CalcNotDoneException {
> return median;
> } // get median
>
> /**
> * Converts a result set into a set of statistics which a table model
> consumes. Calculates: <br>
> * 1. min <br>
> * 2. average <br>
> * 3. max <br>
> * 4. median<br>
> * 5. percent threshold violations <br>
>
> * @param resultSetArg
> * Results of an order table query
> */
> public void processResults( ResultSet results,String column ) {
>
> int value = Integer.MAX_VALUE;
>
> try {
> while ( results.next( ) ) {
>
> value = ( int ) results.getLong( column );
>
> if ( value > -1 ) {
> addMeasurement( value );
> }
> }
> } catch ( SQLException e ) {
> // TODO Auto-generated catch block
> e.printStackTrace();
> } // while
> } // processResults
>
> public static void main( String[] args ) {
> ExampleForMailingList example = new ExampleForMailingList();
> example.processResults(ResultSet set,"columnA");
> example.calculate( );
>
> System.out.println("std: "+ example.getStd( ));
> System.out.println("std: "+ example.getMedian( ));
> }
> }
>
> Thanks!
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-user-help@jakarta.apache.org
Re: Commons Math vs. Excel stats?
Posted by lu...@free.fr.
Selon Jeff Drew <je...@gmail.com>:
> I'm having a weird problem when using the commons math package. When I run
> statistics using the Commons math, then compare the results to Excel, I get
> different standard deviation and median, but min, max, and count are the
> same. I'd appreciate any ideas on how Commons Math and Excel differ in
> these calculations.
>
> MEDIAN: Excel: 468,231 CommonsMath: 485,711
> STD: Excel: 11,861 CommonsMath: 10,678
>
> The data set is 18,000 integers so I won't include those. They are mostly 6
> digit numbers. Here's the code:
This is weird ...
For the median, one way to check what happens is to sort your data in ascending
order set and look at the data at the middle index. If you have an even number
of samples 2k and 468231 is at index k (counting from 1) and 485711 is at index
k+1 then it is a matter of interpretation. If you have an even number of samples
2k+1, then the result MUST BE the value at index k+1 (counting from 1). Could
you check this in both Excel and CommonsMath ?
For the standard deviation, one way to check is to split your data set in two
parts, compute the various moments and combine them afterwards to check.
Unfortunately, I am currently replying to you from a public area and cannot
provide you with the equations for the combination. These equations are based
on the linearity of the expectancy (is this the right english term ?) and the
definition of the variance from the expectancy. If you prefer to wait until
Monday, I can provide this equations for you.
Luc
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-user-help@jakarta.apache.org
RE: Commons Math vs. Excel stats?
Posted by Andy Turner <A....@leeds.ac.uk>.
Hi,
Not sure of the details of either Commons Math or Excel. But could this
be a precission issue. Are something like BigInteger and BigDecimal used
in the calculations? If not precision will probably be lost for large
calculations.
Best wishes,
Andy
A.G.D.Turner@leeds.ac.uk
http://www.geog.leeds.ac.uk/people/a.turner
-----Original Message-----
From: Jeff Drew [mailto:jeffrdrew@gmail.com]
Sent: 07 November 2006 22:51
To: commons-user@jakarta.apache.org
Subject: Commons Math vs. Excel stats?
I'm having a weird problem when using the commons math package. When I
run statistics using the Commons math, then compare the results to
Excel, I get different standard deviation and median, but min, max, and
count are the same. I'd appreciate any ideas on how Commons Math and
Excel differ in these calculations.
MEDIAN: Excel: 468,231 CommonsMath: 485,711
STD: Excel: 11,861 CommonsMath: 10,678
The data set is 18,000 integers so I won't include those. They are
mostly 6 digit numbers. Here's the code:
import
org.apache.commons.math.stat.descriptive.moment.StandardDeviation;
import org.apache.commons.math.stat.descriptive.rank.Max;
import org.apache.commons.math.stat.descriptive.rank.Median;
import org.apache.commons.math.stat.descriptive.rank.Min;
import gnu.trove.TDoubleHashSet;
public class ExampleForMailingList {
StandardDeviation std = new StandardDeviation( );
Min min = new Min( );
Max max = new Max( );
Median medianInstance = new Median();
private double minimum = 0;
private double maximum = 0;
private double standardDev = 0;
private double median = 0;
private boolean isCalcDone = false;
private double count = 0;
/**
* <code>data</code> If the length is zero, then only 0 measurements
were added.
*/
TDoubleHashSet data = new TDoubleHashSet( );
/**
* If the <code>measurement</code> is greater than 0, then add it to
the data.
*
* @param measurement
*/
public void addMeasurement( int measurement ) {
data.add( measurement );
count++;
}
/**
* Must be called before using the getters. This method calculates
the statistics.
*/
public void calculate() {
try {
double[] dataArray = data.toArray( );
minimum = min.evaluate( dataArray );
maximum = max.evaluate( dataArray );
standardDev = std.evaluate( dataArray );
median = medianInstance.evaluate(dataArray);
isCalcDone = true;
} catch ( RuntimeException e ) {
// TODO Auto-generated catch block
e.printStackTrace( );
}
} // calculate
public double getMinimum() throws CalcNotDoneException {
return minimum;
} // get minimum
public double getMaximum() throws CalcNotDoneException {
return maximum;
} // get maximum
public double getStd() throws CalcNotDoneException {
return standardDev;
} // get std
public double getMedian() throws CalcNotDoneException { return
median;
} // get median
/**
* Converts a result set into a set of statistics which a table
model consumes. Calculates: <br>
* 1. min <br>
* 2. average <br>
* 3. max <br>
* 4. median<br>
* 5. percent threshold violations <br>
* @param resultSetArg
* Results of an order table query
*/
public void processResults( ResultSet results,String column ) {
int value = Integer.MAX_VALUE;
try {
while ( results.next( ) ) {
value = ( int ) results.getLong( column );
if ( value > -1 ) {
addMeasurement( value );
}
}
} catch ( SQLException e ) {
// TODO Auto-generated catch block
e.printStackTrace();
} // while
} // processResults
public static void main( String[] args ) {
ExampleForMailingList example = new ExampleForMailingList();
example.processResults(ResultSet set,"columnA");
example.calculate( );
System.out.println("std: "+ example.getStd( ));
System.out.println("std: "+ example.getMedian( ));
}
}
Thanks!
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-user-help@jakarta.apache.org