You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Alex D Herbert (Jira)" <ji...@apache.org> on 2019/09/08 21:16:00 UTC

[jira] [Commented] (RNG-114) Improve ListSampler shuffle algorithm to detect instances of RandomAccess

    [ https://issues.apache.org/jira/browse/RNG-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925254#comment-16925254 ] 

Alex D Herbert commented on RNG-114:
------------------------------------

I've built a benchmark to investigate the shuffle of a List.

The benchmarking is done using an ArrayList and LinkedList. In the JDK the ArrayList is marked with an interface {{java.util.RandomAccess}} that states that the list supports fast (generally constant time) random access. This allows the JDK to handle lists using appropriate algorithms.

Here is the existing code from ListSampler against {{java.util.Collections.shuffle}}:
||Size||Method||Array||Linked||Relative Array||Relative Linked||
|10|Collections|41.43|92.26| | |
|10|PermutationSampler|67.97|104.04|1.64|1.13|
|100|Collections|437.58|762.51| | |
|100|PermutationSampler|476.92|2152.72|1.09|2.82|
|1000|Collections|4280.79|8275.67| | |
|1000|PermutationSampler|4631.31|331530.26|1.08|40.06|
|10000|Collections|42823.80|112289.16| | |
|10000|PermutationSampler|52537.36|38748538.37|1.23|345.08|

The existing code is a bit slower for ArrayLists and, as expected, the existing code does not handle LinkedList.

The current code in ListSampler creates an array of indices, shuffles them using the PermutationSampler, then uses those indices with a copy of the array to rewrite the original array. In contrast Collections shuffles a RandomAccess list in place, or extracts an array, shuffle that in-place and then rewrites to the original array using a list iterator.

An update of the algorithm has two options:
 # PermutationSamplerRandomAccess: Continue to delegate the shuffle to the PermuationSampler. However detect if the list is RandomAccess and then rewrite to the original array using either direct set of each index or via the list iterator.
 # DirectRandomAccess: Copy the method of Collections and shuffle direct if possible

The method to delegate to PermutationSampler currently duplicates the input List and also creates a list of integers of the same size for a shuffle. In contrast the method of Collections has no memory overhead for DirectAccess lists and only the overhead of a list duplication for non DirectAccess lists.
||Size||Method||Array||Linked||Relative Array||Relative Linked||
|10|Collections|41.43|92.26| | |
|10|DirectRandomAccess|61.99|90.88|1.50|0.99|
|10|PermutationSamplerRandomAccess|67.14|101.13|1.62|1.10|
|100|Collections|437.58|762.51| | |
|100|DirectRandomAccess|406.57|722.86|0.93|0.95|
|100|PermutationSamplerRandomAccess|459.10|703.42|1.05|0.92|
|1000|Collections|4280.79|8275.67| | |
|1000|DirectRandomAccess|4309.85|8322.40|1.01|1.01|
|1000|PermutationSamplerRandomAccess|4849.94|8929.17|1.13|1.08|
|10000|Collections|42823.80|112289.16| | |
|10000|DirectRandomAccess|42645.86|109335.42|1.00|0.97|
|10000|PermutationSamplerRandomAccess|50631.56|113643.49|1.18|1.01|

Here the modification of the original PermutationSampler has removed the quadratic time behaviour for LinkedLists. This method remains a bit slower for ArrayLists.

However these results show that for a full list shuffle using the method from Collections is faster (and is also more memory efficient).
h2. Extension to directional shuffle

The ListSampler also supports a partial shuffle of a list from a start point towards the head or tail of the list. This is where it is appropriate to delegate to the PermutationSampler as the shuffle handles this. The results can be copied back to the original list from the appropriate range that was shuffled.

An update of the algorithm has three options:
 # PermutationSamplerRandomAccess: Continue to delegate the shuffle to the PermuationSampler. However detect if the list is RandomAccess and then rewrite to the original array using either direct set of each index or via the list iterator. The update of the list should detect the range to target in the update (the present method does not do this an updates the entire list).
 # DirectRandomAccessDirectional: Copy the method of Collections and shuffle direct if possible. Modify the shuffle to do a directional shuffle and then update only the target range in the original list.
 # DirectRandomAccessSublist: Exploit the {{List.subList(int, int)}} method to extract a sublist from the input list and then shuffle using the direct method adapted from Collections.

Note for the use of a sublist:
 - ArrayLists return a list that is marked with DirectAccess.
 - LinkedLists that are converted to arrays will only require extra memory for the part of the list that is shuffled.

Here are the results for a shuffle of an array twice. Once from the middle to the head and once from the middle to the tail:
||Size||Method||Array||Linked||Relative Array||Relative Linked||
|10|DirectRandomAccessSublist|46.16|186.48|1.00|1.00|
|10|DirectRandomAccessDirectional|41.99|146.28|0.91|0.78|
|10|PermutationSamplerRandomAccess|105.35|193.64|2.28|1.04|
|100|DirectRandomAccessSublist|500.83|1021.95|1.00|1.00|
|100|DirectRandomAccessDirectional|480.11|1125.24|0.96|1.10|
|100|PermutationSamplerRandomAccess|589.58|1111.30|1.18|1.09|
|1000|DirectRandomAccessSublist|4734.42|10101.32|1.00|1.00|
|1000|DirectRandomAccessDirectional|4676.64|11379.03|0.99|1.13|
|1000|PermutationSamplerRandomAccess|6038.58|13147.48|1.28|1.30|
|10000|DirectRandomAccessSublist|47020.68|155079.67|1.00|1.00|
|10000|DirectRandomAccessDirectional|46537.58|179918.52|0.99|1.16|
|10000|PermutationSamplerRandomAccess|68735.10|171949.83|1.46|1.11|

The relative speed is shown against the sublist method. This overall is the fastest method across the range of array sizes and types. The direct partial shuffle of an ArrayList is faster for small lists. This may be due to the removal of the list iterator which adds an offset to all get and set accessor methods. When the list is larger this effect is less noticeable as speed may be dominated by memory access overhead to the larger array.

These results suggest altering the ListSampler to shuffle using a method similar to the method in Collections. Support for the partial shuffle of lists can be provided using a sublist.
h2. Optimisation

Most of the algorithms in Collections detect DirectAccess lists and switch algorithms. When the non DirectAccess list is small the algorithm for the direct access list is faster than the alternative. For shuffle the LinkedList can use the direct method when the size is less than 5. The Collections source code state that these sizes should be regularly revisited. I have tested lists of size 3-8. The method to shuffle an array in place is faster for lists under size 4:

!LinkedListShufflePerformance.png!

I have uploaded the benchmark and the modified ListSampler to use a shuffle based on Collections in [PR 66|https://github.com/apache/commons-rng/pull/66].

> Improve ListSampler shuffle algorithm to detect instances of RandomAccess
> -------------------------------------------------------------------------
>
>                 Key: RNG-114
>                 URL: https://issues.apache.org/jira/browse/RNG-114
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: sampling
>    Affects Versions: 1.2
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Minor
>         Attachments: LinkedListShufflePerformance.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The ListSampler shuffle algorithm uses the input list {{set(index)}} method. If this runs in constant time then the algorithm is fast. If the list is not an instance of {{java.util.RandomAccess}} such as a LinkedList the algorithm will suffer from performance slowdown for large lists.
> This is handled in the JDK Collections.shuffle method by detecting if the list is not an instance of RandomAccess. If so the list {{set(index)}} method is avoided and the list is traversed for update using its iterator.
> Tasks:
>  * Benchmark the ListSampler method verses Collections.shuffle with ArrayList and LinkedList
>  * Update the ListSampler to have comparable performance to Collections.shuffle



--
This message was sent by Atlassian Jira
(v8.3.2#803003)