You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@commons.apache.org by Alex Herbert <al...@gmail.com> on 2021/09/14 12:09:18 UTC

Re: [commons-math] branch master updated: Upgrade dependency.

Hi,

You have beaten me to this. I have updated the entire distribution
code and will push an update when I have rebased from this change.

Alex

On Tue, 14 Sept 2021 at 12:35, <er...@apache.org> wrote:
>
> This is an automated email from the ASF dual-hosted git repository.
>
> erans pushed a commit to branch master
> in repository https://gitbox.apache.org/repos/asf/commons-math.git
>
>
> The following commit(s) were added to refs/heads/master by this push:
>      new acfc270  Upgrade dependency.
> acfc270 is described below
>
> commit acfc27083412f57a71a497e500070a08a956300d
> Author: Gilles Sadowski <gi...@gmail.com>
> AuthorDate: Tue Sep 14 13:34:44 2021 +0200
>
>     Upgrade dependency.
> ---
>  pom.xml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/pom.xml b/pom.xml
> index 96c5848..a80f258 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -63,7 +63,7 @@
>      <math.checkstyle.dep.version>8.29</math.checkstyle.dep.version>
>      <math.mathjax.version>2.7.2</math.mathjax.version>
>      <math.commons.numbers.version>1.0</math.commons.numbers.version>
> -    <math.commons.rng.version>1.4-SNAPSHOT</math.commons.rng.version>
> +    <math.commons.rng.version>1.4</math.commons.rng.version>
>      <math.commons.geometry.version>1.0</math.commons.geometry.version>
>      <math.commons.statistics.version>1.0-SNAPSHOT</math.commons.statistics.version>
>      <math.commons.math3.version>3.6.1</math.commons.math3.version>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [statistics] Release 1.0

Posted by Alex Herbert <al...@gmail.com>.

On Wed, 6 Oct 2021 at 23:20, Gilles Sadowski <gi...@gmail.com> wrote:
>
>
> Are there other issues with "BrentSolver"?
> I'd suggest that we release v1.1 (or v1.0.1) of Commons Numbers
> (with the fix for NUMBERS-167 and the changes related to Erfc).

I don't think so. The bug I found was in other brent solver
implementations I compared it to. It is a standard way to look for a
sign change:

double x, y;

if (x * y < 0) {
    // sign change
}

But this does not work if x*y underflows to -0.0. The sign change is
not detected.

I stopped working on the brent solver used in the inverse CDF as I
moved on to other things. I just need a bit more time to look at it.
It may just require some updates to the tolerances that are being
used.

Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [statistics] Release 1.0

Posted by Gilles Sadowski <gi...@gmail.com>.

Le mer. 6 oct. 2021 à 23:16, Alex Herbert <al...@gmail.com> a écrit :
>
> I have finished working on a revised test suite for [statistics]. It
> is ready to merge into master. I will do this when it has passed the
> CI checks [1].

+1

>
> [...]
> I fixed a lot of issues [...]

Thanks!

>
> I will create tickets for the following unresolved issues that may
> need to be addressed before a release: [...]
>
>
> 2. Many distributions only compute the inverse CDF to 1e-9 relative
> error. Some fail to compute when the input p value maps to a very
> small x value (e.g. 1e-50). This is related to the defaults for the
> brent solver used in the default implementation. I tried to adjust the
> defaults but found a bug in the brent solver, which I have fixed [6].
> The brent solver requires a release to be used with this fix so work
> can use a local copy until all issues with it have been resolved, and
> the changes ported back to numbers.

Are there other issues with "BrentSolver"?
I'd suggest that we release v1.1 (or v1.0.1) of Commons Numbers
(with the fix for NUMBERS-167 and the changes related to Erfc).

Best,
Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [statistics] Release 1.0

Posted by Alex Herbert <al...@gmail.com>.

I have finished working on a revised test suite for [statistics]. It
is ready to merge into master. I will do this when it has passed the
CI checks [1].

The test suite uses properties files to contain reference values for a
parameterisation of a distribution. Ideally all distributions should
be tested using these files as it ensures that the distribution is run
through all the standard tests. In the case of some old legacy tests
for issues discovered in Commons Math the tests have been preserved.

I updated most test data to higher accuracy and now most of the test
suite passes using a *relative* tolerance of 1e-12. Using a relative
tolerance is helpful for assessing the inverse CDF as the output value
x may not lie in the interval [0, 1]. For testing probabilities an
absolute accuracy is OK but if the probabilities have a large range of
magnitudes then a high tolerance needed for some values will not
identify problems in small values. Absolute accuracy is supported by
JUnit assertions, but not relative accuracy. For relative accuracy I
created a wrapper interface (DoubleTolerance) that uses Commons
Numbers Precision to compute equivalence. This can use absolute,
relative or ULP distance tolerances; and combinations of them with And
and Or operations.

I fixed a lot of issues that were missed due to sparsity of the
previous test suite. I found additional cases for distributions which
had bugs in functions that were not tested. I found domain issues with
overflow of integers to negative. There are so many fixes that it is
better to merge this into master and work on specific outstanding
issues.

I will create tickets for the following unresolved issues that may
need to be addressed before a release:

1. The PoissonDistribution can be parameterised with any mean and yet
the PoissonSampler in RNG excludes a mean above 2^30 to avoid
truncation. This change was made to avoid distribution truncation and
because the algorithm uses (int) floor(mean) (see [5]). Commons RNG
now has a LongSampler interface and it may be possible to create a
PossionSampler that outputs a long with the same algorithm and avoid
truncation to int values. The final PoissonDistribution is still
bounded by int values but the sampler can be less restrictive if it
uses long. At large means the Poisson sampler mainly uses a Gaussian
sampler and runtime performance should not be impacted.

2. Many distributions only compute the inverse CDF to 1e-9 relative
error. Some fail to compute when the input p value maps to a very
small x value (e.g. 1e-50). This is related to the defaults for the
brent solver used in the default implementation. I tried to adjust the
defaults but found a bug in the brent solver, which I have fixed [6].
The brent solver requires a release to be used with this fix so work
can use a local copy until all issues with it have been resolved, and
the changes ported back to numbers.

3. The Normal distribution uses InverseErfc for the inverse CDF. This
is a wrapper for InverseErf and as such it is limited in precision as
p -> 0 since the method computes using (2p - 1). When p is < 1e-16
then the inverse CDF returns -infinity. The smallest z is around -8.2
as computed by matlab:

norminv(1e-16)
-8.222082216130437

scipy, R, matlab, octave, Mathematica all have methods to invert small
p. Here is output from matlab for norminv(5e-324):
-38.472346342768788. It is accurate down to the minimum value for a
64-bit float.

There are two free libraries that seem to be used for this: Boost and
Cephes (function ntdri). The Boost licence is compatible with Apache
and there are some Boost derived works already in Commons Numbers.
Incorporating the Boost Erf functions into numbers would be useful.

The inability to invert the Erfc for all p affects the Normal,
Truncated normal and Levy distributions.

4. The Poisson distribution currently allows user configured
tolerances for computing the Gamma functions for the CDF. These may
not be necessary and it should be investigated. The version of the
Gamma function used by the Poisson distribution may not be the same as
those that required these tolerances in Commons Math 2 (use of the
tolerances go as far back as the commit history in the git repo).

5. The Gamma distribution density is not accurate when shape < 1.
There are extensive tests related to MATH-753 [7] for accuracy of the
Gamma distribution. These test shape >= 1. The work requires extension
to smaller shapes where the density -> infinite as X -> 0. When shape
>= 1 then density -> 0 as X -> 0 (see [4]). This results in inaccuracy
of the Chi-squared distribution which uses Gamma(df/2, 2) when the
degrees of freedom (df) are < 2. I've read the code and the
documentation for the Boost functions which the code is based on and I
think the switch to the alternative computation to avoid overflow is
currently not working for all cases.

Regards,

Alex

[1] https://github.com/apache/commons-statistics/pull/31
[2] Boost error function inverses:
https://www.boost.org/doc/libs/1_77_0/libs/math/doc/html/math_toolkit/sf_erf/error_inv.html
[3] Cephes (ntdri in the probability functions) : https://www.netlib.org/cephes/
[4] https://en.wikipedia.org/wiki/Gamma_distribution
[5] https://issues.apache.org/jira/browse/RNG-52
[6] https://issues.apache.org/jira/browse/NUMBERS-168
[7] https://issues.apache.org/jira/browse/MATH-753

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [statistics] Release 1.0

Posted by Gilles Sadowski <gi...@gmail.com>.

Hi.

Le lun. 20 sept. 2021 à 00:59, Alex Herbert <al...@gmail.com> a écrit :
>
> I have created an initial version of the statistics distributions
> application. It can be used with:
>
> cd commons-statistics-examples/examples-distribution
> mvn package -Pexamples-distribution
> java -jar target/examples-distribution.jar
>
> There are 26 distributions each with a command. Each distribution
> command has sub-commands to evaluate functions: pdf, cdf, survival,
> icdf, logpdf
>
> I have created this entirely using picocli with sub-classes for each
> command and to hold default values. The default parameterisations and
> ranges for the distributions were copied from Wikipedia if available.
> The evaluated points and the distribution parameters can be changed:
>
> To create a sample of the PDF for the exponential distribution:
>
>     java -jar target/examples-distribution.jar exp pdf
>     java -jar target/examples-distribution.jar exp pdf --mean 3.45
>     java -jar target/examples-distribution.jar exp pdf \
>         --mean 3.45 --min 3 --max 5 --steps 50
>
> To create a sample of the CDF for the Poisson distribution:
>
>     java -jar target/examples-distribution.jar poisson cdf
>     java -jar target/examples-distribution.jar poisson cdf --mean 3.45
>     java -jar target/examples-distribution.jar poisson cdf \
>         --mean 12.34 --min 0 --max 50 --increment 2
>
> To create a sample of the PDF for the Gamma distribution with
> different shape and
> a fixed scale:
>
>     java -jar target/examples-distribution.jar gamma pdf \
>         --shape 0.5,1,2 --scale 1
>
> As a quick verification I added a hidden command named 'check'. It was
> added to verify the survival function is complementary to the CDF (so
> I did not have to create plots of this function and visually check
> it). This currently tests:
>
> cdf + survival = 1.0
> icdf(0) = lower bound
> icdf(1) = upper bound
> pdf = exp(logpdf)
> x = icdf(cdf(x))
>
>
> Any issues are output otherwise the command appears to do nothing.
>
> I checked all the output for PDF and CDF against the figures on
> wikipedia. This has revealed some issues.
>
> The following distributions are OK:
>
> Binomial
> Gumbel
> Hypergeometric
> Laplace
> Logistic
> LogNormal
> Normal
> Pascal
> T
> Triangular
> TruncatedNormal
> UniformContinuous
> UniformDiscrete
> Zipf
>
>
> The following have issues:

Great that your work has readily revealed them.

>
> Beta
> ---
> This is the only distribution parameterised using a reference to
> Wolfram Mathworld rather than wikipedia. The parameterisation matches
> that on wikipedia so could be updated.

+1 for using a single "primary" reference (a.o. for naming the
arguments).

>
> Major:
> This distribution can throw an exception:
>
> p(x=0) throws an exception when alpha<1.
> p(x=1) throws an exception when beta<1.
> This is not the case in matlab which returns the natural limit as +inf, e.g.
>
> betapdf(0, 0.5, 0.5) -> +inf
>
> The CDF is OK.
> x=0 when alpha<1: returns 0.
> x=1 when beta<1: returns 1.
> Thus you can evaluate the CDF but not the PDF which rejects the input point.
>
> Cauchy
> ---
> The median parameter is referred to as the location on Wikipedia and
> in R. I suggest updating parameters from median and scale to location
> and scale.
>
> ChiSquared
> ---
> The support should be (0, +inf) if k=1 otherwise [0, +inf). This is
> not reflected in the lower bound return. This is minor.
>
> The PDF is not correct at x=0.
> pdf(x=0) returns 0 when k<2. This should return higher according to
> matlab which uses the limit at x -> 0:
>
> chi2pdf(0, 1) -> +inf
> chi2pdf(0, 1.99) -> +inf
> chi2pdf(0, 2) -> 0.5
> chi2pdf(0, 2.01) -> 0
> chi2pdf(0, 3) -> 0
>
> Exp
> ---
> The parameter is named the mean. Wikipedia uses either the scale or
> 1/scale = rate. R uses the rate parameterisation. Matlab uses the mean
> (mu). Possible change mean to scale (or label it as a scale
> parameter).
>
> F
> ---
> pdf(x=0) is always zero but it should go to infinity when df=1,df=2.
> Using a very small x shows the direction:
>
> java -jar target/examples-distribution.jar f pdf -x 0,1e-6
> x df1=1.0,df2=1.0 df1=2.0,df2=1.0 df1=5.0,df2=2.0 df1=10.0,df2=1.0
> df1=100.0,df2=100.0
> 0.00000 0.0 0.0 0.0 0.0 0.0
> 1.00000e-06 318.30956787422247 0.9999970000074999 2.470507804995697E-8
> 1.2304010764181617E-19 2.522031398015255E-264
>
> Matlab:
>
> fpdf(0,1,1) -> +inf
> fpdf(0,1.9999,3) ->+inf
> fpdf(0,2,3) -> 1
> fpdf(0,2.0001,1) -> 0
>
> Gamma
> ---
>
> pdf(x=0) is zero when shape <= 1. This should be +inf
>
> gampdf(0, [0.5,0.9999,1,1.001], 1)
>
> ans =
>
>    Inf   Inf     1     0
>
> java -jar target/examples-distribution.jar gamma pdf -x 0,1e-10 -k
> 0.5,0.9999,1,1.001 -t 1 --format %.2f
> x shape=0.5,scale=1.0 shape=0.9999,scale=1.0 shape=1.0,scale=1.0
> shape=1.001,scale=1.0
> 0.00000 0.00 0.00 0.00 0.00
> 1.00000e-10 56418.95 1.00 1.00 0.98
>
> Geometric
> ---
>
> Does not handle the special case p=1:
> - pmf and logpmf return NaN for x=1
> - icdf(1.0) returns int max value (should be 0)
> - upper bound returns int max value
>
> Some occurrences where x != icdf(cdf(x)):
>
> java -jar target/examples-distribution.jar geo check
>
> I do not think this is a problem. It is the only distribution where
> the round trip fails. It is due to rounding errors during direct
> inversion of the computation for the CDF.
>
> Levy
> ---
>
> support is [mu, +inf)
> The distribution outputs NaN for the PDF at mu. It should output p=0
> at x=mu. This is due to division by zero.
> This function is not in matlab or R. I checked the formula from
> wikipedia and it is implemented correctly.
>
>
> Nakagami
> ---
>
> Uses omega scale parameter.
> Wikipedia uses omega spread parameter.
> Matlab uses omega scale parameter.
> Check other implementations for scale vs spread.
>
> Pareto
> ---
>
> When alpha (shape) is large then pdf(x=1) == large.
> When alpha is infinite this becomes nan (all other x are nan as well).
>
> java -jar target/examples-distribution.jar pareto pdf --alpha
> 1e10,1e100,Infinity -x 0,1,1.0000000001,2,3,4 --xformat %s --format
> %.3g
> x xm=1.0,alpha=1.0E10 xm=1.0,alpha=1.0E100 xm=1.0,alpha=Infinity
> 0.0 0.00 0.00 0.00
> 1.0 1.00e+10 1.00e+100 NaN
> 1.0000000001 3.68e+09 0.00 NaN
> 2.0 0.00 0.00 NaN
> 3.0 0.00 0.00 NaN
> 4.0 0.00 0.00 NaN
>
> Poisson
> ---
>
> This is the only distribution using the RegularizedGamma function that
> has the option to set the epsilon and iterations (default 1e-12,
> 10000000). The others use the default in the RegularizedGamma (1e-15,
> Max int).
>
> Perhaps this should be simplified by removing the epsilon option.
> Switching to the default in the RegularizedGamma should be
> investigated.
>
> Weibull
> ---
>
> The p(x=0) == 0 when shape <= 1. It should go to infinity if shape < 1
> and be 1 when shape = 1.
>
> java -jar target/examples-distribution.jar wbl pdf  -x 0,1e-6 --format %.3g
> x alpha=0.5,beta=1.0 alpha=1.0,beta=1.0 alpha=1.5,beta=1.0 alpha=5.0,beta=1.0
> 0.00000 0.00 0.00 0.00 0.00
> 1.00000e-06 500 1.00 0.00150 5.00e-24
>
> Matlab will use the natural limit as x -> 0.
>
> wblpdf(0, 1, 0.5) -> inf
> wblpdf(0, 1, 1) -> 1
>
>
> General changes:
> ---
>
> Should we change from the use of constructors to factory methods?

+1

> This
> would allow instance specialisations:
>
> geometric : probability of success = 1
> normal : mean=0, standard deviation = 1
> pareto : shape=inf
>
> Conclusion
> ---
>
> Many of the issues identified are trivial. However some are due to
> different behaviour of the distributions at x=0 depending on the
> parameterisation. In certain cases the pdf should go to infinity or
> return a finite non-zero number. These would have been found by
> expanding the test suite to use more parameterisations. Further
> investigation of the behaviour of different implementations at x=0 is
> required. Matlab will evaluate the function. Other implementations may
> return NaN or throw an exception.
>
> I am considering an update of the test suite to use file resources.

+1
I've used "@CsvFileSource" in [Math] "SimplexOptimizerTest".

> Each distribution will have resources containing expected results for
> the pdf, cdf, icdf, survival function, etc which are dynamically
> discovered. Adding a new test would require putting a new file into
> the resources. This is only an idea and will require more
> investigation as not all functions may require testing for each
> parameterisation. For example some tests currently target the high
> precision computations of the survival function. However what is
> apparent is that the current test suite is not simple to add tests for
> more than one parameterisation of the distribution (see
> TruncatedNormalDistributionTest). The Gamma and Beta distributions
> also use more extensive tests of different parameterisations in a
> different pattern. The Gamma distribution has file resources computed
> using a Maxima script for the distribution. A consolidation of the
> test suite to make testing variations simpler would be helpful.

Sure.  Still quite some work before a first release... :-}

>
> I will work on fixing the identified issues. Please try the new
> application and determine if any more issues are present.
>

Regards,
Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [statistics] Release 1.0

Posted by Alex Herbert <al...@gmail.com>.

I have created an initial version of the statistics distributions
application. It can be used with:

cd commons-statistics-examples/examples-distribution
mvn package -Pexamples-distribution
java -jar target/examples-distribution.jar

There are 26 distributions each with a command. Each distribution
command has sub-commands to evaluate functions: pdf, cdf, survival,
icdf, logpdf

I have created this entirely using picocli with sub-classes for each
command and to hold default values. The default parameterisations and
ranges for the distributions were copied from Wikipedia if available.
The evaluated points and the distribution parameters can be changed:

To create a sample of the PDF for the exponential distribution:

    java -jar target/examples-distribution.jar exp pdf
    java -jar target/examples-distribution.jar exp pdf --mean 3.45
    java -jar target/examples-distribution.jar exp pdf \
        --mean 3.45 --min 3 --max 5 --steps 50

To create a sample of the CDF for the Poisson distribution:

    java -jar target/examples-distribution.jar poisson cdf
    java -jar target/examples-distribution.jar poisson cdf --mean 3.45
    java -jar target/examples-distribution.jar poisson cdf \
        --mean 12.34 --min 0 --max 50 --increment 2

To create a sample of the PDF for the Gamma distribution with
different shape and
a fixed scale:

    java -jar target/examples-distribution.jar gamma pdf \
        --shape 0.5,1,2 --scale 1

As a quick verification I added a hidden command named 'check'. It was
added to verify the survival function is complementary to the CDF (so
I did not have to create plots of this function and visually check
it). This currently tests:

cdf + survival = 1.0
icdf(0) = lower bound
icdf(1) = upper bound
pdf = exp(logpdf)
x = icdf(cdf(x))


Any issues are output otherwise the command appears to do nothing.

I checked all the output for PDF and CDF against the figures on
wikipedia. This has revealed some issues.

The following distributions are OK:

Binomial
Gumbel
Hypergeometric
Laplace
Logistic
LogNormal
Normal
Pascal
T
Triangular
TruncatedNormal
UniformContinuous
UniformDiscrete
Zipf


The following have issues:

Beta
---
This is the only distribution parameterised using a reference to
Wolfram Mathworld rather than wikipedia. The parameterisation matches
that on wikipedia so could be updated.

Major:
This distribution can throw an exception:

p(x=0) throws an exception when alpha<1.
p(x=1) throws an exception when beta<1.
This is not the case in matlab which returns the natural limit as +inf, e.g.

betapdf(0, 0.5, 0.5) -> +inf

The CDF is OK.
x=0 when alpha<1: returns 0.
x=1 when beta<1: returns 1.
Thus you can evaluate the CDF but not the PDF which rejects the input point.

Cauchy
---
The median parameter is referred to as the location on Wikipedia and
in R. I suggest updating parameters from median and scale to location
and scale.

ChiSquared
---
The support should be (0, +inf) if k=1 otherwise [0, +inf). This is
not reflected in the lower bound return. This is minor.

The PDF is not correct at x=0.
pdf(x=0) returns 0 when k<2. This should return higher according to
matlab which uses the limit at x -> 0:

chi2pdf(0, 1) -> +inf
chi2pdf(0, 1.99) -> +inf
chi2pdf(0, 2) -> 0.5
chi2pdf(0, 2.01) -> 0
chi2pdf(0, 3) -> 0

Exp
---
The parameter is named the mean. Wikipedia uses either the scale or
1/scale = rate. R uses the rate parameterisation. Matlab uses the mean
(mu). Possible change mean to scale (or label it as a scale
parameter).

F
---
pdf(x=0) is always zero but it should go to infinity when df=1,df=2.
Using a very small x shows the direction:

java -jar target/examples-distribution.jar f pdf -x 0,1e-6
x df1=1.0,df2=1.0 df1=2.0,df2=1.0 df1=5.0,df2=2.0 df1=10.0,df2=1.0
df1=100.0,df2=100.0
0.00000 0.0 0.0 0.0 0.0 0.0
1.00000e-06 318.30956787422247 0.9999970000074999 2.470507804995697E-8
1.2304010764181617E-19 2.522031398015255E-264

Matlab:

fpdf(0,1,1) -> +inf
fpdf(0,1.9999,3) ->+inf
fpdf(0,2,3) -> 1
fpdf(0,2.0001,1) -> 0

Gamma
---

pdf(x=0) is zero when shape <= 1. This should be +inf

gampdf(0, [0.5,0.9999,1,1.001], 1)

ans =

   Inf   Inf     1     0

java -jar target/examples-distribution.jar gamma pdf -x 0,1e-10 -k
0.5,0.9999,1,1.001 -t 1 --format %.2f
x shape=0.5,scale=1.0 shape=0.9999,scale=1.0 shape=1.0,scale=1.0
shape=1.001,scale=1.0
0.00000 0.00 0.00 0.00 0.00
1.00000e-10 56418.95 1.00 1.00 0.98

Geometric
---

Does not handle the special case p=1:
- pmf and logpmf return NaN for x=1
- icdf(1.0) returns int max value (should be 0)
- upper bound returns int max value

Some occurrences where x != icdf(cdf(x)):

java -jar target/examples-distribution.jar geo check

I do not think this is a problem. It is the only distribution where
the round trip fails. It is due to rounding errors during direct
inversion of the computation for the CDF.

Levy
---

support is [mu, +inf)
The distribution outputs NaN for the PDF at mu. It should output p=0
at x=mu. This is due to division by zero.
This function is not in matlab or R. I checked the formula from
wikipedia and it is implemented correctly.


Nakagami
---

Uses omega scale parameter.
Wikipedia uses omega spread parameter.
Matlab uses omega scale parameter.
Check other implementations for scale vs spread.

Pareto
---

When alpha (shape) is large then pdf(x=1) == large.
When alpha is infinite this becomes nan (all other x are nan as well).

java -jar target/examples-distribution.jar pareto pdf --alpha
1e10,1e100,Infinity -x 0,1,1.0000000001,2,3,4 --xformat %s --format
%.3g
x xm=1.0,alpha=1.0E10 xm=1.0,alpha=1.0E100 xm=1.0,alpha=Infinity
0.0 0.00 0.00 0.00
1.0 1.00e+10 1.00e+100 NaN
1.0000000001 3.68e+09 0.00 NaN
2.0 0.00 0.00 NaN
3.0 0.00 0.00 NaN
4.0 0.00 0.00 NaN

Poisson
---

This is the only distribution using the RegularizedGamma function that
has the option to set the epsilon and iterations (default 1e-12,
10000000). The others use the default in the RegularizedGamma (1e-15,
Max int).

Perhaps this should be simplified by removing the epsilon option.
Switching to the default in the RegularizedGamma should be
investigated.

Weibull
---

The p(x=0) == 0 when shape <= 1. It should go to infinity if shape < 1
and be 1 when shape = 1.

java -jar target/examples-distribution.jar wbl pdf  -x 0,1e-6 --format %.3g
x alpha=0.5,beta=1.0 alpha=1.0,beta=1.0 alpha=1.5,beta=1.0 alpha=5.0,beta=1.0
0.00000 0.00 0.00 0.00 0.00
1.00000e-06 500 1.00 0.00150 5.00e-24

Matlab will use the natural limit as x -> 0.

wblpdf(0, 1, 0.5) -> inf
wblpdf(0, 1, 1) -> 1


General changes:
---

Should we change from the use of constructors to factory methods? This
would allow instance specialisations:

geometric : probability of success = 1
normal : mean=0, standard deviation = 1
pareto : shape=inf

Conclusion
---

Many of the issues identified are trivial. However some are due to
different behaviour of the distributions at x=0 depending on the
parameterisation. In certain cases the pdf should go to infinity or
return a finite non-zero number. These would have been found by
expanding the test suite to use more parameterisations. Further
investigation of the behaviour of different implementations at x=0 is
required. Matlab will evaluate the function. Other implementations may
return NaN or throw an exception.

I am considering an update of the test suite to use file resources.
Each distribution will have resources containing expected results for
the pdf, cdf, icdf, survival function, etc which are dynamically
discovered. Adding a new test would require putting a new file into
the resources. This is only an idea and will require more
investigation as not all functions may require testing for each
parameterisation. For example some tests currently target the high
precision computations of the survival function. However what is
apparent is that the current test suite is not simple to add tests for
more than one parameterisation of the distribution (see
TruncatedNormalDistributionTest). The Gamma and Beta distributions
also use more extensive tests of different parameterisations in a
different pattern. The Gamma distribution has file resources computed
using a Maxima script for the distribution. A consolidation of the
test suite to make testing variations simpler would be helpful.

I will work on fixing the identified issues. Please try the new
application and determine if any more issues are present.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [statistics] Release 1.0

Posted by Gilles Sadowski <gi...@gmail.com>.

Hi.

Le jeu. 16 sept. 2021 à 15:00, Alex Herbert <al...@gmail.com> a écrit :
>
> On Thu, 16 Sept 2021 at 12:06, Gilles Sadowski <gi...@gmail.com> wrote:
> <!-- SNIP -->
> >
> > I was seeing this as a "visual" validation test, so making it as
> > simple as possible (perhaps even with "interesting" values being
> > hard-coded).
>
> I had thought of this as simple is best. However it should not be too
> hard to create an application that outputs the PDF, CDF, etc for any
> parameterisation. My idea would be to generate output over a range of
> parameterisations and then check them against a reference. For example
> the PDF and CDF plots for the parameterisations shown on wikipedia:
>
> https://en.wikipedia.org/wiki/Gamma_distribution
> https://en.wikipedia.org/wiki/L%C3%A9vy_distribution
> https://en.wikipedia.org/wiki/Poisson_distribution
>
> etc.
>
> Using just 1 set of hard coded values may not spot a problem.

I did not mean that; but rather take (all) the values used in the
plots from the above references in order to reproduce them.

> Suitable defaults can be coded into the program for each distribution.
> These can then be varied by command line arguments.
>
> Note that you cannot output the inverse CDF with the same input x
> range as the PDF, CDF and Survival functions since it is bounded to
> [0, 1]. This was one reason for parameterising the values.
>
> I am looking into making the application at the moment. I am not going
> to waste time over engineering this but I would like to have something
> that can test the distributions with a range of values so we can
> visually check them before the release.

That was the intention.

Thanks,
Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [statistics] Release 1.0

Posted by Alex Herbert <al...@gmail.com>.

On Thu, 16 Sept 2021 at 12:06, Gilles Sadowski <gi...@gmail.com> wrote:
<!-- SNIP -->
>
> I was seeing this as a "visual" validation test, so making it as
> simple as possible (perhaps even with "interesting" values being
> hard-coded).

I had thought of this as simple is best. However it should not be too
hard to create an application that outputs the PDF, CDF, etc for any
parameterisation. My idea would be to generate output over a range of
parameterisations and then check them against a reference. For example
the PDF and CDF plots for the parameterisations shown on wikipedia:

https://en.wikipedia.org/wiki/Gamma_distribution
https://en.wikipedia.org/wiki/L%C3%A9vy_distribution
https://en.wikipedia.org/wiki/Poisson_distribution

etc.

Using just 1 set of hard coded values may not spot a problem.

Suitable defaults can be coded into the program for each distribution.
These can then be varied by command line arguments.

Note that you cannot output the inverse CDF with the same input x
range as the PDF, CDF and Survival functions since it is bounded to
[0, 1]. This was one reason for parameterising the values.

I am looking into making the application at the moment. I am not going
to waste time over engineering this but I would like to have something
that can test the distributions with a range of values so we can
visually check them before the release.

Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [statistics] Release 1.0

Posted by Gilles Sadowski <gi...@gmail.com>.

Hello.

Le jeu. 16 sept. 2021 à 00:46, Alex Herbert <al...@gmail.com> a écrit :
>
> On Wed, 15 Sept 2021 at 17:10, Gilles Sadowski <gi...@gmail.com> wrote:
>
> > Hello.
> >
> > Le mar. 14 sept. 2021 à 17:13, Alex Herbert <al...@gmail.com> a
> > écrit :
> > >
> > > The statistics component is a candidate for a release.
> > >
> > > The statistics distributions module contains mature functionality
> > > ported from CM. The dependency on [numbers] is now satisfied as that
> > > has had an official release. There is nothing outstanding in the
> > > project Jira. Thus a first release of this component can be performed.
> > >
> > > Items before a release:
> > >
> > > - Remaining Jira tickets should be checked and resolved
> >
> > Can we set to "resolved" the following two:
> >   https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-3
>
>
> Since the JDK Math is supposed to be within 1 ULP of the correct value for
> exp, log1p and pow (the functions used in place of CM AccurateMath) then
> the accuracy issues are due to rounding of intermediates. Looking at the
> tests the tolerances have increased from 1.5 to 2 for the mean ULP for one
> test. This seems like an acceptably low ULP from the exact result. The
> other test the tolerance increased from 220 to 230 ULP for the standard
> deviation of the ULP (which must have a mean below 160). Although high in
> ULPs the increase is less than 5% in the tolerance which appears rather
> arbitrary to begin with (and was probably chosen just to make the test
> pass). I would say this ticket is not a problem.

Thanks; resolved as such.

> >
> >   https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-25
>
>
> Here the implemented fix is computing results almost as well as python and
> R (which compute up to 1e10 to 1e20 degrees of freedom) versus the
> threshold of 2.99e6 used in statistics. I would say it is not resolved but
> is not a blocker.

OK.

> >
> > ?
> >
> > It would be nice to have
> >   https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-9
> > in an "examples" module.
> >
>
> OK. The examples module can be based on code in RNG which uses picocli to
> build programs. I suggest a program that accepts the name of the
> distribution as the command. Each command would have inputs for the
> distribution parameters, the range of points to evaluate and the number of
> steps in the range. It would output a csv format:
>
> x,pdf(x)
>
> For example for the exponential:
>
> java -jar statistics.jar exp --mean 3.45 --min 0 --max 20 --steps 200
> --function cdf
>
> This should not be too complicated.

I was seeing this as a "visual" validation test, so making it as
simple as possible (perhaps even with "interesting" values being
hard-coded).  [Also, I don't think that, versatile or not, such an
application won't be much used beyond generating all the plots
that could help spot a problem (as you did with Ziggurat).]

So the output should perhaps be fixed:

x, pdf, cdf, inv cdf, survival

> If the desire is to generate data for figures with multiple
> parameterisations then the parameters can be comma delimited:
>
> java -jar statistics.jar gamma --shape 1,2,3 --scale 2,2,2 --min 0 --max 20
> --steps 200 --function pdf

Given that useful "min", "max" and "steps" could be dependent on
the parameters, I don't think that this feature would be very useful.

>
> Output would be multiple columns:
>
> x,gamma(shape=1;scale=2),gamma(shape=2;scale=2),gamma(shape=3;scale=2),
>
> An alternative is to have the input points determined by a file:
>
> java -jar statistics.jar gamma --shape 1,2,3 --scale 2,2,2 --points
> input.txt --function pdf

Given the intended usage, I don't see the advantage (and the
drawback would be having to maintain many input list files).

Regards,
Gilles

>
> Functions to support are: pdf, cdf, inverse cdf and survival probability.
>
> Thoughts on this?
>
> Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [statistics] Release 1.0

Posted by Alex Herbert <al...@gmail.com>.

On Wed, 15 Sept 2021 at 17:10, Gilles Sadowski <gi...@gmail.com> wrote:

> Hello.
>
> Le mar. 14 sept. 2021 à 17:13, Alex Herbert <al...@gmail.com> a
> écrit :
> >
> > The statistics component is a candidate for a release.
> >
> > The statistics distributions module contains mature functionality
> > ported from CM. The dependency on [numbers] is now satisfied as that
> > has had an official release. There is nothing outstanding in the
> > project Jira. Thus a first release of this component can be performed.
> >
> > Items before a release:
> >
> > - Remaining Jira tickets should be checked and resolved
>
> Can we set to "resolved" the following two:
>   https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-3

Since the JDK Math is supposed to be within 1 ULP of the correct value for
exp, log1p and pow (the functions used in place of CM AccurateMath) then
the accuracy issues are due to rounding of intermediates. Looking at the
tests the tolerances have increased from 1.5 to 2 for the mean ULP for one
test. This seems like an acceptably low ULP from the exact result. The
other test the tolerance increased from 220 to 230 ULP for the standard
deviation of the ULP (which must have a mean below 160). Although high in
ULPs the increase is less than 5% in the tolerance which appears rather
arbitrary to begin with (and was probably chosen just to make the test
pass). I would say this ticket is not a problem.

>
>   https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-25

Here the implemented fix is computing results almost as well as python and
R (which compute up to 1e10 to 1e20 degrees of freedom) versus the
threshold of 2.99e6 used in statistics. I would say it is not resolved but
is not a blocker.

>
> ?
>
> It would be nice to have
>   https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-9
> in an "examples" module.
>

OK. The examples module can be based on code in RNG which uses picocli to
build programs. I suggest a program that accepts the name of the
distribution as the command. Each command would have inputs for the
distribution parameters, the range of points to evaluate and the number of
steps in the range. It would output a csv format:

x,pdf(x)

For example for the exponential:

java -jar statistics.jar exp --mean 3.45 --min 0 --max 20 --steps 200
--function cdf

This should not be too complicated.

If the desire is to generate data for figures with multiple
parameterisations then the parameters can be comma delimited:

java -jar statistics.jar gamma --shape 1,2,3 --scale 2,2,2 --min 0 --max 20
--steps 200 --function pdf

Output would be multiple columns:

x,gamma(shape=1;scale=2),gamma(shape=2;scale=2),gamma(shape=3;scale=2),

An alternative is to have the input points determined by a file:

java -jar statistics.jar gamma --shape 1,2,3 --scale 2,2,2 --points
input.txt --function pdf

Functions to support are: pdf, cdf, inverse cdf and survival probability.

Thoughts on this?

Alex

Re: [statistics] Release 1.0

Posted by Gilles Sadowski <gi...@gmail.com>.

Hello.

Le mar. 14 sept. 2021 à 17:13, Alex Herbert <al...@gmail.com> a écrit :
>
> The statistics component is a candidate for a release.
>
> The statistics distributions module contains mature functionality
> ported from CM. The dependency on [numbers] is now satisfied as that
> has had an official release. There is nothing outstanding in the
> project Jira. Thus a first release of this component can be performed.
>
> Items before a release:
>
> - Remaining Jira tickets should be checked and resolved

Can we set to "resolved" the following two:
  https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-3
  https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-25
?

It would be nice to have
  https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-9
in an "examples" module.

The rest are mostly task descriptions that were not completed in the
2019 GSoC.

> - Should a release go out as an alpha, beta or simply a 1.0?

Given the amount of coverage[1], an "alpha" is not warranted. ;-)
A "beta" would have been useful if we could hope for beta-testers, but I
doubt that there will be more feedback than there was for either [Numbers]
or [Geometry].
One thing to check (if not done already) is the API consistency (a.o. the
ordering and naming of the distributions' parameters).

Regards,
Gilles

[1] https://coveralls.io/github/apache/commons-statistics

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [statistics] Release 1.0

Posted by Matt Juntunen <ma...@gmail.com>.

If the functionality is mature, I would vote for a 1.0 release. YOLO :-)
-Matt J

On Tue, Sep 14, 2021 at 11:13 AM Alex Herbert <al...@gmail.com> wrote:
>
> The statistics component is a candidate for a release.
>
> The statistics distributions module contains mature functionality
> ported from CM. The dependency on [numbers] is now satisfied as that
> has had an official release. There is nothing outstanding in the
> project Jira. Thus a first release of this component can be performed.
>
> Items before a release:
>
> - Remaining Jira tickets should be checked and resolved
> - Should a release go out as an alpha, beta or simply a 1.0?
>
> Alex
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[statistics] Release 1.0

Posted by Alex Herbert <al...@gmail.com>.

The statistics component is a candidate for a release.

The statistics distributions module contains mature functionality
ported from CM. The dependency on [numbers] is now satisfied as that
has had an official release. There is nothing outstanding in the
project Jira. Thus a first release of this component can be performed.

Items before a release:

- Remaining Jira tickets should be checked and resolved
- Should a release go out as an alpha, beta or simply a 1.0?

Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org