You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Alex Herbert (Jira)" <ji...@apache.org> on 2021/08/12 10:41:00 UTC
[jira] [Commented] (RNG-160) Performance of modified Ziggurat samplers

    [ https://issues.apache.org/jira/browse/RNG-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397982#comment-17397982 ] 

Alex Herbert commented on RNG-160:
----------------------------------

A JMH performance test has been add to the examples module to test variations of McFarland's ziggurat method. The methods tested are:
||Method||Description||
|ModExponential|McFarland's modified exponential ziggurat sampler in the main sampling module|
|ModExponential2|A copy of McFarland's modified exponential ziggurat sampler from the main sampling module with modifications to the private members so they can be over-ridden|
|Exponential|An implementation of Marsaglia's original ziggurat exponential sampler using a table size of 256|
|ModExponentialInlining|ModExponential2 modified to fit the main 'sample' method within 35 bytes when compiled to Java byte code (the default JVM size for inlining a method)|
|ModExponentialLoop|ModExponential2 modified to use a loop for tail sampling in place of recursive method calls|
|ModExponentialRecursion|ModExponential2 modified to use a dedicated sample method when recursion is required. This fits the main sample method within 35 bytes when compiled to byte code|
|ModExponentialSimpleOverhangs|ModExponential2 modified to use simple overhangs|
|ModGaussian|McFarland's modified Gaussian ziggurat sampler in the main sampling module|
|ModGaussian2|A copy of McFarland's modified Gaussian ziggurat sampler from the main sampling module with modifications to the private members so they can be over-ridden|
|ModGaussianInlining|ModGaussian2 modified to fit the main 'sample' method within 35 bytes when compiled to Java byte code (the default JVM size for inlining a method)|
|ModGaussianInliningSimpleOverhangs|ModGaussian2 modified to fit the main 'sample' method within 35 bytes when compiled to Java byte code and modified to use simple overhangs|
|ModGaussianSimpleOverhangs|ModGaussian2 modified to use simple overhangs, all code within one method|

Two benchmarks have been performed:
||Method||Description||
|sample|Create a single sample from the sampler. Uses RNGs: JDK, MWC_256, XO_RO_SHI_RO_128_PP which are slow, medium and fast generators.|
|sequentialSample|Create a multiple samples sequentially from the sampler. This is done inline when <=5 or in a loop for larger sizes. Uses XO_RO_SHI_RO_128_PP and sizes 2, 8, 32.|

Test Environments
||JDK||OS||CPU||
|OpenJDK 64-Bit Server VM 11.0.9.1-internal|"linux", version: "5.4.0-77-generic"|AMD Ryzen Threadripper 3960X 24-Core Processor (Q4 2019)|
|Java HotSpot(TM) 64-Bit Server VM 1.8.0_241|"mac os x", version: "11.5"|Intel(R) Core(TM) i7-6820HQ CPU (Q4 2015)|
|OpenJDK 64-Bit Server VM 11.0.12|"mac os x", version: "11.5"|Intel(R) Core(TM) i7-6820HQ CPU (Q4 2015)|
|Java HotSpot(TM) 64-Bit Server VM 1.8.0_241|"linux", version: "4.15.0-153-generic"|Intel(R) Xeon(R) CPU E5-1680 v3 (Q1 2015)|
|OpenJDK 64-Bit Server VM 11.0.11|"linux", version: "4.15.0-153-generic"|Intel(R) Xeon(R) CPU E5-1680 v3 (Q1 2015)|

This is a MacBook Pro and two workstations.

Running all the tests creates a large amount of results and looking at individual tests is subject to some variation in the scores produced by JMH. I have used a summary analysis to identify if there are clear outliers in performance. All methods were adjusted to be relative to the reference. Since there are two versions of McFarland's method I have used the minimum of the two as the reference, i.e. min(ModGaussian, ModGaussian2). These methods should have the same speed (they are identical) but there are variations showing that the results are not completely reproducible. The variations are not minor where using an average would be a suitable reference. The variations can be large indicating that the JVM may have compiled the code differently. This also emphasises that methods should be judged to be different if they have a big difference, small differences may be noise.

The following results show the sum of the relative scores grouped by JDK and method.
h2. Exponential

!Exp.jpg!
 * Marsaglia's Exponential method is slower
 * SimpleOverhangs is slower
 * There is some difference between ModExponential and ModExponential2 which use the same method. This is especially true on JDK 11 where ModExponential is faster for the single sample but slower for the sequentialSample. This sets the magnitude for variation in the test results.
 * Variations of McFarland's method are not consistently ranked. All variations appear to be around the same performance. The results may be due to noise in the performance testing.

Note that Marsaglia's method uses simple overhangs. An exponential curve is always convex. Thus McFarland's method for this curve is able to use the same fast look-up for all regions at the edge of ziggurat and it is clearly superior on all tested JVMS and CPUs.

The results do not provide an argument for changing from the current implementation.
h2. Gaussian 
 !Gaussian.jpg! 
  
 * Marsaglia's method is much slower on JDK 8
 * The Inlining variation of ModGaussian is faster on JDK 8
 * Marsaglia's method is comparably fast to the best methods on JDK 11, and faster than the current ModGaussian implementation
 * Inlining or SimpleOverhangs are faster on JDK 11

Note that Marsaglia's method uses simple overhangs. A Gaussian curve can be convex, concave or have an inflection. This adds complexity to McFarland's method. When this complexity is moved to a separate method the main sample method is small enough to be inlined and this change creates a performance improvement. The main sample path is used 98.8% of the time. The JVM does not know this when it compiles the method and so separating the complex code from the code used 98.8% of the time is a useful hint to the compiler.

On JDK 11 the use of simple overhangs runs at an equal speed to the inlining version. This is without moving any code to a separate method to provide compiler hints (as is done in the InliningSimpleOverhangs version). It may be that JDK 11 has better optimisation for inlining code and the simple overhangs version allows the same optimisation as the inlining version, i.e the code path that is used 98.8% of the time where a value is returned from the main ziggurat can be moved inline.

The results suggest changing to the inline variation from the current implementation.
h2. Conclusion

No changes to the exponential sampler have a noticeable affect on performance.

Updating the Gaussian sampler to separate the sample method shows a performance improvement on JDK 8 and 11. Using simple overhangs has a similar level of performance improvement on JDK 11 but no benefit on JDK 8.

This suggests updating to allow inlining the main sample method. This changes the code layout of the sampler but does not alter the method which still uses McFarland's fast triangle look-up method for sampling convex or concave curves.

> Performance of modified Ziggurat samplers
> -----------------------------------------
>
>                 Key: RNG-160
>                 URL: https://issues.apache.org/jira/browse/RNG-160
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: sampling
>    Affects Versions: 1.4
>            Reporter: Alex Herbert
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: Exp.jpg, Gaussian.jpg
>
>
> The modified ziggurat algorithm implemented in RNG-151 has a variation for sampling from the overhang regions at the ziggurat edges, region A below:
> {noformat}
>             \
>   ----------+\
>             | \
>      B      |A \
>   -------------+\
>                | \
> {noformat}
> Note: Region B is the ziggurat layer which is entirely within the distribution PDF.
> These overhangs can be convex, concave or an inflection (change from convex to concave). Samples from the region A must be below the distribution density curve (PDF). When sampling from region A, the sampler will create a random point (x,y) uniformly within the rectangle. This is tested to determine if it is below the curve. Convex and concave curves can use a fast method that knows the largest possible triangle that fits above or below the curve. If a sampled point (x,y) is within these triangles then the actual curve position (pdf( x)) for the point x does not need to be computed. This can save time if the pdf is computationally expensive. In the case of exponential (exp(-x)) or Gaussian (exp(-0.5 * x * x)) this involves a call to Math.exp.
> The current sampler implements the fast look-up method. The alternative is to always compute pdf( x) and determine if y is below the curve. This is known as 'simple overhangs'.
> Investigate the use of simple overhangs on the performance of the sampler.
> Note: The Marsaglia version of the ziggurat sampler uses the 'simple overhangs' method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)