Posted to issues@commons.apache.org by "Alex Herbert (Jira)" <ji...@apache.org> on 2022/11/24 12:16:00 UTC

[jira] [Commented] (STATISTICS-59) Correct Pareto distribution sampling with extreme shape parameter

    [ https://issues.apache.org/jira/browse/STATISTICS-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638267#comment-17638267 ] 

Alex Herbert commented on STATISTICS-59:
----------------------------------------

Pareto distribution CDF for scale (xm) 1 and various shape (alpha) parameters:

!pareto.png!

Data generated using the distribution examples application and plotted in gnuplot:
{noformat}
java -jar target/examples-distribution.jar pareto cdf --shape 10,5,1,0.5,0.1 --out target/1.txt --min 1 --max 5
{noformat}
As the shape becomes large, the distribution is pushed towards the scale parameter.

As the shape becomes tiny, the distribution is pushed towards infinity.

Inverse transform sampling with a tiny shape can generate a NaN when p=0. Sampling can avoid this by drawing p from (0, 1], which concentrates samples at the upper end of the range. If 1 / shape is infinite then all possible samples are infinity.

Using the inverse CDF with an infinite shape will generate an infinite sample when p=1. Sampling can avoid this by drawing p from [0, 1), which concentrates samples at the lower end of the range. If shape is infinite then all possible samples are equal to the scale.
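
The two edge cases can be reproduced with a small sketch of the sampling formula from the issue description, x = scale / Math.pow(1 - p, 1 / shape). The class and method names are made up for illustration and are not part of Commons Statistics:

```java
/** Hypothetical helper (not library code) evaluating the sampling formula from the issue. */
final class ParetoEdgeCases {
    /** x = scale / (1 - p)^(1 / shape), exactly as written in the issue description. */
    static double sample(double scale, double shape, double p) {
        return scale / Math.pow(1 - p, 1 / shape);
    }

    public static void main(String[] args) {
        final double tiny = Double.MIN_VALUE;        // 4.9E-324, so 1 / shape is infinite
        final double inf = Double.POSITIVE_INFINITY; // so 1 / shape is zero
        final double pMax = 1 - 0x1.0p-53;           // largest dyadic rational below 1

        // Tiny shape: Math.pow(1, Infinity) is NaN, so p=0 must be excluded.
        System.out.println(sample(1, tiny, 0.0));    // NaN
        System.out.println(sample(1, tiny, 0x1.0p-53)); // Infinity
        System.out.println(sample(1, tiny, pMax));   // Infinity

        // Infinite shape: Math.pow(x, 0) is 1 for any x, so every sample is the scale.
        System.out.println(sample(1, inf, 0.0));     // 1.0
        System.out.println(sample(1, inf, 1.0));     // 1.0
    }
}
```

The only spurious value is the NaN at p=0 for the tiny shape; every other p gives the value expected for that extreme of the distribution.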

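I suggest updating the sampling to effectively use p in (0, 1] when shape is small, and p in [0, 1) when shape is large. Equivalently, the value 1 - p passed to Math.pow is in [0, 1) when shape is small and in (0, 1] when shape is large. This will avoid spurious sample values for extreme parameters and p-values.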
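
A minimal sketch of how this could look, assuming a uniform source that returns values in [0, 1). The class name, the DoubleSupplier wiring and the switch-over at shape >= 1 are assumptions for illustration only; where to place the threshold is not decided here:

```java
import java.util.SplittableRandom;
import java.util.function.DoubleSupplier;

/** Hypothetical sampler sketch; not the Commons Statistics implementation. */
final class ParetoSamplerSketch {
    private final double scale;
    private final double oneOverShape;
    /** Supplies q = 1 - p, the value raised to the power 1 / shape. */
    private final DoubleSupplier q;

    /**
     * @param scale scale parameter (xm)
     * @param shape shape parameter (alpha)
     * @param uniform01 source of uniform deviates in [0, 1)
     */
    ParetoSamplerSketch(double scale, double shape, DoubleSupplier uniform01) {
        this.scale = scale;
        this.oneOverShape = 1 / shape;
        if (shape >= 1) { // assumed threshold between "large" and "small"
            // Large shape: p in [0, 1), so q = 1 - p is in (0, 1].
            // q is never 0, so Math.pow cannot return 0 and the sample is never infinite.
            this.q = () -> 1 - uniform01.getAsDouble();
        } else {
            // Small shape: p in (0, 1], so q = 1 - p is in [0, 1).
            // q is never 1, so Math.pow(1, Infinity) == NaN cannot occur.
            this.q = uniform01;
        }
    }

    double sample() {
        return scale / Math.pow(q.getAsDouble(), oneOverShape);
    }

    public static void main(String[] args) {
        final SplittableRandom rng = new SplittableRandom();
        final ParetoSamplerSketch s = new ParetoSamplerSketch(1, Double.MIN_VALUE, rng::nextDouble);
        System.out.println(s.sample()); // Infinity for every draw; never NaN
    }
}
```

Both branches still make a single call to Math.pow per sample, so the cost is unchanged from the current sampling.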

 

> Correct Pareto distribution sampling with extreme shape parameter
> -----------------------------------------------------------------
>
>                 Key: STATISTICS-59
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-59
>             Project: Commons Statistics
>          Issue Type: Improvement
>          Components: distribution
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Priority: Minor
>             Fix For: 1.0
>
>         Attachments: pareto.png
>
>
> The Pareto distribution has CDF:
> {noformat}
>              ( scale )^shape
> CDF(x) = 1 - ( ----- )
>              (   x   ){noformat}
> This is inverted using high precision Math functions to support very small p values:
> {noformat}
> x = scale / exp(log(1 - p) / shape)
>   = scale / Math.exp(Math.log1p(-p) / shape);{noformat}
> This is sampled using inverse transform sampling as:
> {noformat}
> x = scale / (1 - p)^(1 / shape)
>   = scale / Math.pow(1 - p, 1 / shape){noformat}
> This is fast as it requires a single call to Math.pow. It must only handle p-values down to 2^-53 as sampling generates p as one of the 2^53 dyadic rationals in [0, 1).
> However it has some issues when the shape parameter is extreme: either shape is infinite or 1 / shape is infinite.
> Here is a table of the inverse CDF and the sample value for scale = 1 and an extreme shape. p has been set using the most extreme values from the dyadic rationals (0, 2^-53, 1 - 2^-53, 1):
> ||Shape||p||icdf(p)||sample||
> |Infinity|0.0|1.0|1.0|
> |Infinity|1.1102230246251565E-16|1.0|1.0|
> |Infinity|0.9999999999999999|1.0|1.0|
> |Infinity|1.0|Infinity|1.0|
> |4.9E-324|0.0|1.0|NaN|
> |4.9E-324|1.1102230246251565E-16|Infinity|Infinity|
> |4.9E-324|0.9999999999999999|Infinity|Infinity|
> |4.9E-324|1.0|Infinity|Infinity|
> When 1 / shape is infinite the NaN occurs because Math.pow(1, Infinity) returns NaN. In this case the sampling inversion is in error.
> When shape is infinite the mismatch occurs because Math.pow(0, 0) returns 1, so the scale is returned rather than the distribution upper bound. The inverse CDF returns the upper bound because it detects the edge case when the input p=1. In this case pure inversion of the CDF creates an outlier and the sampling inversion is more suitable.
> The sampling should be updated to avoid the possibility of NaN generation and ensure samples are returned without outliers from the main region of the CDF.
>  


