You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/04/21 23:46:13 UTC

[jira] [Commented] (FLINK-3802) Add Very Fast Reservoir Sampling

    [ https://issues.apache.org/jira/browse/FLINK-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252808#comment-15252808 ] 

ASF GitHub Bot commented on FLINK-3802:
---------------------------------------

GitHub user gaoyike opened a pull request:

    https://github.com/apache/flink/pull/1924

    [FLINK-3802] Add Very Fast Reservoir Sampling

    Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration.
    If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html).
    In addition to going through the list, please provide a meaningful description of your changes.
    
    - [x] General
      - The pull request references the related JIRA issue
      - The pull request addresses only one issue
      - Each commit in the PR has a meaningful commit message
    
    - [x] Documentation
      - Documentation has been added for new functionality
      - Old documentation affected by the pull request has been updated
      - JavaDoc for public methods has been added
    
    - [ ] Tests & Build
      - Functionality added by the pull request is covered by tests
      - `mvn clean verify` has been executed successfully locally or a Travis build has passed
    
    
    
    A in memory implementation of Very Fast Reservoir Sampling, the algorithm works well then the size of streaming data is much larger than size of reservoir.
    
      The algorithm runs in random sampling with P(R/j) where in R is the size of sampling and j is the current index of streaming data.
      The algorithm consists of two part:
      	(1) Before the size of streaming data reaches threshold, it uses regular reservoir sampling
      	(2) After the size of streaming data reaches threshold, it uses geometric distribution to generate the approximation gap
      		to skip data, and size of gap is determined by  geometric distribution with probability p = R/j
    
       Thanks to Erik Erlandson who is the author of this algorithm and help me with implementation.
    
    Reference: http://erikerlandson.github.io/blog/2015/11/20/very-fast-reservoir-sampling/

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gaoyike/flink master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1924.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1924
    
----
commit 81e0622b20d8bc969dec1555bd55d4230d9b38de
Author: 晨光 何 <ga...@gmail.com>
Date:   2016-04-21T21:42:26Z

     A in memory implementation of Very Fast Reservoir Sampling. The algorithm works well then the size of streaming data is much larger than size of reservoir.
      The algorithm runs in random sampling with P(R/j) where in R is the size of sampling and j is the current index of streaming data.
      The algorithm consists of two part:
      	(1) Before the size of streaming data reaches threshold, it uses regular reservoir sampling
      	(2) After the size of streaming data reaches threshold, it uses geometric distribution to generate the approximation gap
      		to skip data, and size of gap is determined by  geometric distribution with probability p = R/j
    
       Thanks to Erik Erlandson who is the author of this algorithm and help me with implementation.

----


> Add Very Fast Reservoir Sampling
> --------------------------------
>
>                 Key: FLINK-3802
>                 URL: https://issues.apache.org/jira/browse/FLINK-3802
>             Project: Flink
>          Issue Type: Improvement
>          Components: Java API
>            Reporter: Chenguang He
>            Assignee: Chenguang He
>              Labels: Sampling
>
> Adding Very Fast Reservoir Sampling (http://erikerlandson.github.io/blog/2015/11/20/very-fast-reservoir-sampling/)
> An improved version of Reservoir Sampling, it's used to deal with small sampling in large dataset, where the size of dataset is much larger than the size of sampling.
> It is a random sampling proved in the link. The average possibility is P(R/J), where R is size of sampling and J is index of streaming data 
> Thanks Erik Erlandson who is the author of this algorithm help me with implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)