You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Alex Herbert (Jira)" <ji...@apache.org> on 2022/11/24 14:57:00 UTC

[jira] [Resolved] (STATISTICS-59) Correct Pareto distribution sampling with extreme shape parameter

     [ https://issues.apache.org/jira/browse/STATISTICS-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Herbert resolved STATISTICS-59.
------------------------------------
    Resolution: Implemented

Updated in commit:

1e5c695164a886d4654418cf6c64e26ea791b8c2

The sampler in Commons RNG 1.5 directly uses nextDouble.

The Pareto distribution sampler wraps the input RNG to use it as a source of long bits. These are then output for nextDouble as [0, 1) or (0, 1] as appropriate based on the shape being above or below 1.

Commons RNG should be updated to perform the same change. This would allow the Commons RNG sampler to be used directly without modifying the input RNG.

> Correct Pareto distribution sampling with extreme shape parameter
> -----------------------------------------------------------------
>
>                 Key: STATISTICS-59
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-59
>             Project: Commons Statistics
>          Issue Type: Improvement
>          Components: distribution
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Priority: Minor
>             Fix For: 1.0
>
>         Attachments: pareto.png
>
>
> The Pareto distribution has CDF:
> {noformat}
>              ( scale )^shape
> CDF(x) = 1 - ( ----- )
>              (   x   ){noformat}
> This is inverted using high precision Math functions to support very small p values:
> {noformat}
> x = scale / exp(log(1 - p) / shape)
>   = scale / Math.exp(Math.log1p(-p) / shape);{noformat}
> This is sampled using inverse transform sampling as:
> {noformat}
> x = scale / (1 - p)^(1 / shape)
>   = scale / Math.pow(1 - p, 1 / shape){noformat}
> This is fast as it requires a single call to Math.pow. It must only handle p-values down to 2^-53 as sampling generates p as one of the 2^53 dyadic rationals in [0, 1).
> However it has some issues when the shape parameter is extreme: either shape is infinite or 1 / shape is infinite.
> Here is a table of the inverse CDF and the sample value for scale = 1 and an extreme shape. p has been set using the most extreme values from the dyadic rationals (0, 2^-53, 1 - 2^-53, 1):
> ||Shape||p||icdf(p)||sample||
> |Infinity|0.0|1.0|1.0|
> |Infinity|1.1102230246251565E-16|1.0|1.0|
> |Infinity|0.9999999999999999|1.0|1.0|
> |Infinity|1.0|Infinity|1.0|
> |4.9E-324|0.0|1.0|NaN|
> |4.9E-324|1.1102230246251565E-16|Infinity|Infinity|
> |4.9E-324|0.9999999999999999|Infinity|Infinity|
> |4.9E-324|1.0|Infinity|Infinity|
> When 1 / shape is infinite the NaN occurs when Math.pow(1, Infinity) == NaN. In this case sampling inversion is an error.
> When shape is infinite the mismatch occurs when Math.pow(0, 0) == 1 and the shape is returned rather than the distribution upper bound. This is because the inverse CDF detects this edge case when the input p=1. In this case pure inversion of the CDF is creating an outlier and the sampling inversion is more suitable.
> The sampling should be updated to avoid the possibility of NaN generation and ensure samples are returned without outliers from the main region of the CDF.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)