You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Otmar Ertl (JIRA)" <ji...@apache.org> on 2015/04/27 17:11:38 UTC

[jira] [Comment Edited] (MATH-1220) More efficient sample() method for ZipfDistribution

    [ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514257#comment-14514257 ] 

Otmar Ertl edited comment on MATH-1220 at 4/27/15 3:10 PM:
-----------------------------------------------------------

Do you mean using a pre-calculated CDF table and sampling using binary search? This would have O(N) initialization costs, O(N) space costs, and sampling would require O(log(N)) time. All that is constant for the method I have proposed. If initialization and memory costs are not an issue, sampling using such a pre-calculated table for the CDF is probably faster for smaller N. 

The proposed rejection sampling algorithm divides the support into two parts. The head which currently only consists of the first point (equal to 1) and the tail which consists of all remaining points. Rejection sampling is only used to sample points from the tail. Instead, it would be possible to take the first X (<=N) points as the head and to use such a pre-calculated CDF table in combination with binary serach for sampling points from the head. For the tail still the rejection method can be applied. When developing that algorithm I had that in mind, but finally decided to set X = 1 to minimize the initialization costs. The optimal value for X is likely to be use case dependent and difficult to determine. I do not think that a pure CDF table based approach (X=N) is feasible for all N, because initialization and memory costs could get very large.


was (Author: otmar ertl):
Do you mean using a pre-calculated CDF tanle and sampling using binary search? This would have O(N) initialization costs, O(N) space costs, and sampling would require O(log(N)) time. All that is constant for the method I have proposed. If initialization and memory costs are not an issue, sampling using such a pre-calculated table for the CDF is probably faster for smaller N. 

The proposed rejection sampling algorithm divides the support into two parts. The head which currently only consists of the first point (equal to 1) and the tail which consists of all remaining points. Rejection sampling is only used to sample points from the tail. Instead, it would be possible to take the first X (<=N) points as the head and to use such a pre-calculated CDF table in combination with binary serach for sampling points from the head. For the tail still the rejection method can be applied. When developing that algorithm I had that in mind, but finally decided to set X = 1 to minimize the initialization costs. The optimal value for X is likely to be use case dependent and difficult to determine. I do not think that a pure CDF table based approach (X=N) is feasible for all N, because initialization and memory costs could get very large.

> More efficient sample() method for ZipfDistribution
> ---------------------------------------------------
>
>                 Key: MATH-1220
>                 URL: https://issues.apache.org/jira/browse/MATH-1220
>             Project: Commons Math
>          Issue Type: Improvement
>            Reporter: Otmar Ertl
>         Attachments: patch_v1
>
>
> Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)