You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "GaoLun (JIRA)" <ji...@apache.org> on 2015/09/02 09:02:46 UTC
[jira] [Commented] (FLINK-2535) Fixed size sample algorithm
optimization
[ https://issues.apache.org/jira/browse/FLINK-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726870#comment-14726870 ]
GaoLun commented on FLINK-2535:
-------------------------------
Hi.I have replaced the sampling algorithm with scalable simple random sampling based on the paper. And I have done some test to compare the performance between the two sampling method.Here are some statistical data of rejected items' number : (source_size = 10000000)
SampleSize SRS SSRS
100 9998727 9998488
100 9998755 9998493
500 9994502 9994081
500 9994624 9994029
1000 9989781 9989061
1000 9989657 9989132
5000 9956780 9956061
5000 9956812 9956018
10000 9921601 9919174
10000 9920866 9919396
50000 9685085 9682182
50000 9685203 9681611
100000 9439345 9435887
100000 9440469 9435521
500000 8001535 7998046
500000 8000370 7996807
1000000 6699752 6690612
1000000 6696175 6692111
5000000 1534814 1530409
5000000 1534088 1529784
With the statistical data, we can see the number of items SRS rejected is more than SSRS but isn't obvious.
> Fixed size sample algorithm optimization
> ----------------------------------------
>
> Key: FLINK-2535
> URL: https://issues.apache.org/jira/browse/FLINK-2535
> Project: Flink
> Issue Type: Improvement
> Components: Core
> Reporter: Chengxiang Li
> Priority: Minor
>
> Fixed size sample algorithm is known to be less efficient than sample algorithms with fraction, but sometime it's necessary. Some optimization could significantly reduce the storage size and computation cost, such as the algorithm described in [this paper|http://machinelearning.wustl.edu/mlpapers/papers/icml2013_meng13a].
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)