You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Alex Herbert (Jira)" <ji...@apache.org> on 2021/09/28 13:30:00 UTC
[jira] [Commented] (RNG-160) Performance of modified Ziggurat samplers

    [ https://issues.apache.org/jira/browse/RNG-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421390#comment-17421390 ] 

Alex Herbert commented on RNG-160:
----------------------------------

The modified ziggurat sampler mainly chooses samples from inside the ziggurat layers. This occurs approximately 98.4% of the time for the exponential and 98.8% for the Gaussian. Differences in overhang sampling are difficult to measure. I updated the benchmark to force the sampler to sample from the overhang. This requires setting the lowest 8 bits of the first random long used for the sample. The rest of the RNG output is unchanged. These 8-bits are used to pick the ziggurat layer and then discarded by the sampler. Thus functionality is unchanged for the overhang sampling. This allows differences in overhang sampling to be observed. The benchmark has been run on JDK 8, 11 and 17 using the same 3 CPUs and OS platforms as before. This adds JDK 8u301 for the AMD 3960X processor and the recently released JDK 17 for all CPUs.

Since this benchmark is not applicable to the Marsaglia ziggurat sampler (it cannot easily be forced to do overhang sampling) these have been omitted. Extra methods have been added.
||Method||Description||
|ModExponential|McFarland's modified exponential ziggurat sampler in the main sampling module|
|ModExponential2|A copy of McFarland's modified exponential ziggurat sampler from the main sampling module with modifications to the private members so they can be over-ridden|
|ModExponential512|ModExponential2 modified to use a 512 entry look-up table instead of 256|
|ModExponentialEmax2|ModExponential2 modified to use two values for the maximum epsilon value for the overhangs. This exploits the fact that the exponential curve is very well approximated by the straight line of the triangle hypotenuse in the majority of cases; only close to the tail does the maximum epsilon value differ more than 0.01.|
|ModExponentialEmaxTable|ModExponential2 modified to use a table for the maximum epsilon value for each overhang|
|ModExponentialInlining|ModExponential2 modified to fit the main 'sample' method within 35 bytes when compiled to Java byte code (the default JVM size for inlining a method)|
|ModExponentialIntMap|ModExponential2 modified to use an int[] to store the alias map instead of a byte[].|
|ModExponentialLoop|ModExponential2 modified to use a loop for tail sampling in place of recursive method calls|
|ModExponentialLoop2|ModExponential2 modified to use a loop for tail sampling in place of recursive method calls. This also avoids recursion in overhang sampling and recycles the unused bits in the first uniform deviate.|
|ModExponentialRecursion|ModExponential2 modified to use a dedicated sample method when recursion is required. This fits the main sample method within 35 bytes when compiled to byte code|
|ModExponentialSimpleOverhangs|ModExponential2 modified to use simple overhangs|
|ModExponentialTernary|ModExponential2 modified to use two ternary statements to sort the two unsigned integers in place of a branch statement. This eliminates a 50:50 branch.|
|ModExponentialTernarySubtract|ModExponential2 modified to use a single ternary statement and then a subtraction to sort the two unsigned integers in place of a branch statement. This eliminates a 50:50 branch.|
|ModGaussian|McFarland's modified Gaussian ziggurat sampler in the main sampling module|
|ModGaussian2|A copy of McFarland's modified Gaussian ziggurat sampler from the main sampling module with modifications to the private members so they can be over-ridden|
|ModGaussian512|ModGaussian2 modified to use a 512 entry look-up table instead of 256|
|ModGaussianEmax2|ModGaussian2 modified to use two values for the maximum epsilon value for the overhangs. This exploits the fact that the Gaussian curve is very well approximated by the straight line of the triangle hypotenuse in the majority of cases; only close to the tail or at x=0 does the maximum epsilon value differ more than 0.01.|
|ModGaussianEmaxTable|ModGaussian2 modified to use a table for the maximum epsilon value for each overhang|
|ModGaussianInlining|ModGaussian2 modified to fit the main 'sample' method within 35 bytes when compiled to Java byte code (the default JVM size for inlining a method)|
|ModGaussianInliningShift|ModGaussian2 modified to fit the main 'sample' method within 35 bytes when compiled to Java byte code (the default JVM size for inlining a method) with unsigned longs created using bit shifts instead of masking|
|ModGaussianInliningSimpleOverhangs|ModGaussian2 modified to fit the main 'sample' method within 35 bytes when compiled to Java byte code and modified to use simple overhangs|
|ModGaussianIntMap|ModGaussian2 modified to use an int[] to store the alias map instead of a byte[].|
|ModGaussianSimpleOverhangs|ModGaussian2 modified to use simple overhangs, all code within one method|
|ModGaussianTernary|ModGaussian2 modified to use two ternary statements to sort the two unsigned integers in place of a branch statement. This eliminates a 50:50 branch but only in the concave overhangs which are sampled 21% of the time.|
h2. Exponential

!exp_overhang.jpg!
 * The simple overhangs method is much slower. It is off the scale and the values range from approximately 16 to 33 verses the best score of around 7.
 * The 512 table method is approximately the same speed. The larger tables have no detrimental effect on overhang sampling.
 * The versions using variable thresholds for the maximum epsilon difference between the curve and the triangle hypotenuse are faster. Using a table for the max epsilon has a big effect on JDK 8 but less so on other JDKs.
 * The Loop2 method is fast. This eliminates recursive method calls and recycles the initial random bits when the overhang region has been selected. This is the method currently used in the main sampling package. Differences between this method and the ModGaussian method are likely measurement noise.
 * The Ternary method has a big difference.

These results suggest updating the sorting of the two random deviates to use a ternary statement:
{code:java}
            long u1 = randomInt63();
            long uDistance = randomInt63() - u1;
            if (uDistance < 0) {
                uDistance = -uDistance;
                u1 -= uDistance;
            }
            // u2 = u1 + uDistance
{code}
Becomes:
{code:java}
            final long ua = randomInt63();
            final long ub = randomInt63();
            final long u1 = ua < ub ? ua : ub;
            final long u2 = ua < ub ? ub : ua;
            // uDistance = u2 - u1
{code}
This is a simple change where the use of a ternary creates a conditional load operation that can be efficiently pipelined on the CPU. Use of an unpredictable 50:50 branch statement can be avoided.

The same benchmark was performed without forcing overhang sampling. In this case the results for the methods of interest are:

!exp_normal.jpg!
 * The 512 table is faster. This reduces the overhang sampling by fitting more of the ziggurat inside the distribution curve.
 * Speed-up from any other method is not consistent.
 * The Loop2 method is typically faster except for JDK 8 on the 'sample' benchmark.
 * The versions using variable thresholds for the maximum epsilon difference between the curve and the triangle hypotenuse can be faster. They are not observed to be slower.
 * The Ternary version is marginally faster in most benchmarks. It is much faster in the 'sequentialSample' benchmark on JDK 11.

h2. Conclusion

These results show that the ternary operator can have a speed-up on the sampler. This change is recommended.

The use of table size of 512 improves speed. It will increase the memory footprint of the sampler class by approximately (2*24+1) / 25 = 1.96 fold. This is static data and will not affect instance size. This change alters the sampling layers inside the ziggurat:
||Size||Inside||
|256|98.4|
|512|99.2|
|1024|99.6|
|2048|99.8|

A change to 512 size tables gains 0.8% on the fast path. Further layers have smaller gains. A change to size 512 tables would gain most of the performance benefit. However the effect of memory footprint may require further investigation as the benchmark only uses the sampler. Typically a sampler will be used in a larger application and memory used by the sampler will reduce the memory cache available for the rest of the program at the smallest and fastest cache level.

The use of tables to store the epsilon values also improves speed. It will increase the memory footprint of the sampler class by approximately (25+8) / 25 = 1.32 fold. Performance gains here are far less than using a larger ziggurat table size and add complexity to the class. This change does not seem justified; memory should be used to switch to 512 size tables.
h2. Gaussian

!gauss_overhang.jpg!

Results are similar to the exponential sampler with some differences.
 * The simple overhangs method is much slower. It is off the scale and the values range from approximately 12 to 21 verses the best score of around 8.
 * The 512 table method is a little slower. The larger tables have a small detrimental effect on overhang sampling.
 * The versions using variable thresholds for the maximum epsilon difference between the curve and the triangle hypotenuse are faster. This is consistent across all JDKs.
 * The Ternary method has a small performance gain or no effect (JDK 17 'sequentialSample' benchmark)

The same benchmark was performed without forcing overhang sampling. In this case the results for the methods of interest are:

!gauss_normal.jpg!
 * The current code for ModGaussian is faster than all the variants of overhang sampling. This uses method inlining to extract the fast path to a single method and overhang sampling to a separate method.
 * The 512 table is faster. This reduces the overhang sampling by fitting more of the ziggurat inside the distribution curve.
 * Speed-up from any other method is not consistent.
 * The versions using variable thresholds for the maximum epsilon difference between the curve and the triangle hypotenuse can be faster. However they can also be slower. These differences may be only marginal and the variations could be due to measurement noise.
 * The Ternary version is marginally faster on JDK 11 and 17 benchmarks but slower on JDK 8. This slowdown is only observed on the Macbook laptop. The workstation CPUs are marginally faster on JDK 8 too.

h2. Conclusion

These results show that the ternary operator can have a speed-up on the sampler. This change is recommended despite the performance impact on JDK 8 on the Macbook for the Gaussian sampler. Changing this code will match the code change for the Exponential sampler.

The use of table size of 512 improves speed. It will increase the memory footprint of the sampler class by approximately (2*24+1) / 25 = 1.96 fold. This is static data and will not affect instance size. This change alters the sampling layers inside the ziggurat:
||Size||Inside||
|256|98.8|
|512|99.4|
|1024|99.7|
|2048|99.85|

A change to 512 size tables gains 0.6% on the fast path. Further layers have smaller gains. A change to size 512 tables would gain most of the performance benefit. However the effect of memory footprint may require further investigation as the benchmark only uses the sampler. Typically a sampler will be used in a larger application and memory used by the sampler will reduce the memory cache available for the rest of the program at the smallest and fastest cache level.

The use of tables to store the epsilon values also improves speed. It will increase the memory footprint of the sampler class by approximately (25+8) / 25 = 1.32 fold. Performance gains here are far less than using a larger ziggurat table size and add complexity to the class. This change does not seem justified; memory should be used to switch to 512 size tables.
h2. Overall

The current Loop2 implementation in the exponential sampler is the fastest overhang sampling method.

Method inlining in the Gaussian sampler has a greater performance impact than any overhang sampling variants.

Use of the ternary operator has a performance gain on overhang sampling and is recommended due to the simplicity of the code change.

Use of look-up tables for the maximum epsilon values has a marginal gain. Memory would be better used to switch to a larger ziggurat table. With more sampling on the fast path any speed-up of overhang sampling will be reduced and may become negligible.

Use of a larger ziggurat table will have a performance gain. The effect of increased memory size should be investigated before switching the implementation.

> Performance of modified Ziggurat samplers
> -----------------------------------------
>
>                 Key: RNG-160
>                 URL: https://issues.apache.org/jira/browse/RNG-160
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: sampling
>    Affects Versions: 1.4
>            Reporter: Alex Herbert
>            Assignee: Alex Herbert
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: Exp.jpg, Gaussian.jpg, exp_normal.jpg, exp_overhang.jpg, gauss_normal.jpg, gauss_overhang.jpg
>
>
> The modified ziggurat algorithm implemented in RNG-151 has a variation for sampling from the overhang regions at the ziggurat edges, region A below:
> {noformat}
>             \
>   ----------+\
>             | \
>      B      |A \
>   -------------+\
>                | \
> {noformat}
> Note: Region B is the ziggurat layer which is entirely within the distribution PDF.
> These overhangs can be convex, concave or an inflection (change from convex to concave). Samples from the region A must be below the distribution density curve (PDF). When sampling from region A, the sampler will create a random point (x,y) uniformly within the rectangle. This is tested to determine if it is below the curve. Convex and concave curves can use a fast method that knows the largest possible triangle that fits above or below the curve. If a sampled point (x,y) is within these triangles then the actual curve position (pdf( x)) for the point x does not need to be computed. This can save time if the pdf is computationally expensive. In the case of exponential (exp(-x)) or Gaussian (exp(-0.5 * x * x)) this involves a call to Math.exp.
> The current sampler implements the fast look-up method. The alternative is to always compute pdf( x) and determine if y is below the curve. This is known as 'simple overhangs'.
> Investigate the use of simple overhangs on the performance of the sampler.
> Note: The Marsaglia version of the ziggurat sampler uses the 'simple overhangs' method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)