You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@flink.apache.org by "GaoLun (JIRA)" <ji...@apache.org> on 2015/09/02 09:02:46 UTC

[jira] [Commented] (FLINK-2535) Fixed size sample algorithm optimization

    [ https://issues.apache.org/jira/browse/FLINK-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726870#comment-14726870 ] 

GaoLun commented on FLINK-2535:
-------------------------------

Hi.I have replaced the sampling algorithm with scalable simple random sampling based on the paper. And I have done some test to compare the performance between the two sampling method.Here are some statistical data of rejected items' number : (source_size = 10000000)

SampleSize              	SRS 	SSRS
 100	                             9998727     9998488
 100	                             9998755	9998493
 500	                             9994502	9994081
 500	                             9994624	9994029
 1000	                     9989781	9989061
 1000	                     9989657	9989132
 5000	                     9956780	9956061
 5000	                     9956812	9956018
 10000	                     9921601	9919174
 10000	                     9920866	9919396
 50000	                     9685085	9682182
 50000	                     9685203	9681611
 100000	                     9439345	9435887
 100000	                     9440469	9435521
 500000	                     8001535	7998046
 500000	                     8000370	7996807
 1000000	                     6699752	6690612
 1000000	                     6696175	6692111
 5000000	                     1534814	1530409
 5000000	                     1534088	1529784

With the statistical data, we can see the number of items SRS rejected is more than SSRS but isn't obvious.

> Fixed size sample algorithm optimization
> ----------------------------------------
>
>                 Key: FLINK-2535
>                 URL: https://issues.apache.org/jira/browse/FLINK-2535
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chengxiang Li
>            Priority: Minor
>
> Fixed size sample algorithm is known to be less efficient than sample algorithms with fraction, but sometime it's necessary. Some optimization could significantly reduce the storage size and computation cost, such as the algorithm described in [this paper|http://machinelearning.wustl.edu/mlpapers/papers/icml2013_meng13a].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)