You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@metron.apache.org by cestella <gi...@git.apache.org> on 2017/12/13 21:07:17 UTC

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

GitHub user cestella opened a pull request:

    https://github.com/apache/metron/pull/867

    METRON-1350: Add reservoir sampling functions to Stellar

    ## Contributor Comments
    Sampling capabilities would fit very well with the profiler and enable algorithms that do not necessarily support our existing probabilistic sketches. We should add a reservoir sampler and utilities to merge and resample.
    
    You can play with `SAMPLE_INIT`, `SAMPLE_ADD`, `SAMPLE_MERGE` and `SAMPLE_GET` via the REPL:
    ```
    [Stellar]>>> ?SAMPLE_ADD
    SAMPLE_ADD
    Description: Add to a sample
    
    Arguments:
    	sampler - Sampler to use.  If null, then a default Uniform sampler is created
    	o - The value to add.  If o is an Iterable, then each item is added.
    
    Returns:
    [Stellar]>>> s_10 := SAMPLE_INIT(10)
    [Stellar]>>> sample := REDUCE( [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ], (s, x) -> SAMPLE_ADD(s, x), SAMPLE_INIT(5))
    [Stellar]>>> SAMPLE_GET(sample)
    [6, 8, 11, 4, 5]
    [Stellar]>>> SAMPLE_ADD(s_10, [5, 2, 5, 7, 10 ])
    org.apache.metron.statistics.sampling.UniformSampler@3d8d06c0
    [Stellar]>>> SAMPLE_GET(SAMPLE_ADD(s_10, [5, 2, 5, 7, 10 ]))
    [5, 2, 5, 7, 10, 5, 2, 5, 7, 10]
    ```
    ## Pull Request Checklist
    
    Thank you for submitting a contribution to Apache Metron.  
    Please refer to our [Development Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235) for the complete guide to follow for contributions.  
    Please refer also to our [Build Verification Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview) for complete smoke testing guides.  
    
    
    In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:
    
    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel). 
    - [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    - [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
    
    
    ### For code changes:
    - [x] Have you included steps to reproduce the behavior or problem that is being changed or addressed?
    - [x] Have you included steps or a guide to how the change may be verified and tested manually?
    - [x] Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
      ```
      mvn -q clean integration-test install && build_utils/verify_licenses.sh 
      ```
    
    - [x] Have you written or updated unit tests and or integration tests to verify your changes?
    - [x] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? 
    - [ ] Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?
    
    ### For documentation related changes:
    - [x] Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via `site-book/target/site/index.html`:
    
      ```
      cd site-book
      mvn site
      ```
    
    #### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
    It is also recommended that [travis-ci](https://travis-ci.org) is set up for your personal repository such that your branches are built there before submitting a pull request.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron sampling

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/867.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #867
    
----
commit 7e1a19e29f86a23140aa46291f0083409ddac40d
Author: cstella <ce...@gmail.com>
Date:   2017-12-13T20:59:40Z

    METRON-1350: Add reservoir sampling functions to Stellar

----


---

[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar

Posted by justinleet <gi...@git.apache.org>.
Github user justinleet commented on the issue:

    https://github.com/apache/metron/pull/867
  
    @ottobackwards Left a comment on the ticket you made: https://issues.apache.org/jira/browse/METRON-1361?focusedCommentId=16291159&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16291159


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156800428
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
       * bounds - A list of value bounds (excluding min and max) in sorted order.
     * Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N).  No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
     
    +### Sampling Functions
    +
    +#### `SAMPLE_ADD`
    +* Description: Add a value or collection of values to a sampler.
    +* Input:
    --- End diff --
    
    Actually, more than likely it'd be a separate init since each type are going to have different types of parameters depending on the algorithm.  So a biased sampler would be `SAMPLE_INIT_BIASED(size, ...)`


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156792461
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
       * bounds - A list of value bounds (excluding min and max) in sorted order.
     * Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N).  No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
     
    +### Sampling Functions
    +
    +#### `SAMPLE_ADD`
    +* Description: Add a value or collection of values to a sampler.
    +* Input:
    --- End diff --
    
    Couldn't this be simplified to
    
    ```java
     if(ret == null ) {
          if(obj != null) {
             throw new IllegalStateException(argName + "argument(" + obj
                                            + " is expected to be an " + expectedClazz.getName()
                                            + ", but was " + obj
                                            );
           }
         }
    return Optional.ofNullable(ret);
    
    
    ```


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156799950
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/UniformSampler.java ---
    @@ -0,0 +1,91 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.metron.statistics.sampling;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +import java.util.Random;
    +
    +public class UniformSampler implements Sampler {
    +  private List<Object> reservoir;
    +  private int seen = 0;
    +  private int size;
    +  private Random rng = new Random(0);
    +
    +  public UniformSampler() {
    +    this(DEFAULT_SIZE);
    +  }
    +
    +  public UniformSampler(int size) {
    +    this.size = size;
    +    reservoir = new ArrayList<>(size);
    +  }
    +
    +  @Override
    +  public Iterable<Object> get() {
    +    return reservoir;
    +  }
    +
    +  /**
    +   * Add an object to the reservoir
    +   * @param o
    +   */
    +  public void add(Object o) {
    +    if(o == null) {
    +      return;
    +    }
    +    if (reservoir.size() < size) {
    +      reservoir.add(o);
    +    } else {
    +      int rIndex = rng.nextInt(seen + 1);
    --- End diff --
    
    This makes me think that we need "namespace" scoped documentation


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156792855
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
       * bounds - A list of value bounds (excluding min and max) in sorted order.
     * Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N).  No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
     
    +### Sampling Functions
    +
    +#### `SAMPLE_ADD`
    +* Description: Add a value or collection of values to a sampler.
    +* Input:
    --- End diff --
    
    Ok, It seemed like the Uniform implementation was leaking


---

[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:

    https://github.com/apache/metron/pull/867
  
    any chance I can see @justinleet 's ideas/prototypes?


---

[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/867
  
    Sorry, I am not sure I understand, this is random replacement when after the size limit.  Am I mistaking your question?


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156959616
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/UniformSampler.java ---
    @@ -0,0 +1,91 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.metron.statistics.sampling;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +import java.util.Random;
    +
    +public class UniformSampler implements Sampler {
    +  private List<Object> reservoir;
    +  private int seen = 0;
    +  private int size;
    +  private Random rng = new Random(0);
    +
    +  public UniformSampler() {
    +    this(DEFAULT_SIZE);
    +  }
    +
    +  public UniformSampler(int size) {
    +    this.size = size;
    +    reservoir = new ArrayList<>(size);
    +  }
    +
    +  @Override
    +  public Iterable<Object> get() {
    +    return reservoir;
    +  }
    +
    +  /**
    +   * Add an object to the reservoir
    +   * @param o
    +   */
    +  public void add(Object o) {
    +    if(o == null) {
    +      return;
    +    }
    +    if (reservoir.size() < size) {
    +      reservoir.add(o);
    +    } else {
    +      int rIndex = rng.nextInt(seen + 1);
    --- End diff --
    
    I modified the docs to have a link to the Reservoir sampling wikipedia article and made things a bit more clear regarding Uniform.


---

[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:

    https://github.com/apache/metron/pull/867
  
    @cestella  comments?
    https://issues.apache.org/jira/browse/METRON-1361


---

[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/867
  
    I'll keep that +1 in my pocket...I want to spin it up in full-dev and try it in conjunction with a PR that I'm not quite done with yet.  Let's hold off merging until I can validate that I haven't botched the API in some horrible way for sampling ;)


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/metron/pull/867


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156799854
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/UniformSampler.java ---
    @@ -0,0 +1,91 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.metron.statistics.sampling;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +import java.util.Random;
    +
    +public class UniformSampler implements Sampler {
    +  private List<Object> reservoir;
    +  private int seen = 0;
    +  private int size;
    +  private Random rng = new Random(0);
    +
    +  public UniformSampler() {
    +    this(DEFAULT_SIZE);
    +  }
    +
    +  public UniformSampler(int size) {
    +    this.size = size;
    +    reservoir = new ArrayList<>(size);
    +  }
    +
    +  @Override
    +  public Iterable<Object> get() {
    +    return reservoir;
    +  }
    +
    +  /**
    +   * Add an object to the reservoir
    +   * @param o
    +   */
    +  public void add(Object o) {
    +    if(o == null) {
    +      return;
    +    }
    +    if (reservoir.size() < size) {
    +      reservoir.add(o);
    +    } else {
    +      int rIndex = rng.nextInt(seen + 1);
    --- End diff --
    
    Shouldn't we reference Reservoir Sampling in the documentation?  Then the use of Universal and other terms would be more in context.


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156790945
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/UniformSampler.java ---
    @@ -0,0 +1,91 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.metron.statistics.sampling;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +import java.util.Random;
    +
    +public class UniformSampler implements Sampler {
    +  private List<Object> reservoir;
    +  private int seen = 0;
    +  private int size;
    +  private Random rng = new Random(0);
    +
    +  public UniformSampler() {
    +    this(DEFAULT_SIZE);
    +  }
    +
    +  public UniformSampler(int size) {
    +    this.size = size;
    +    reservoir = new ArrayList<>(size);
    +  }
    +
    +  @Override
    +  public Iterable<Object> get() {
    +    return reservoir;
    +  }
    +
    +  /**
    +   * Add an object to the reservoir
    +   * @param o
    +   */
    +  public void add(Object o) {
    +    if(o == null) {
    +      return;
    +    }
    +    if (reservoir.size() < size) {
    +      reservoir.add(o);
    +    } else {
    +      int rIndex = rng.nextInt(seen + 1);
    --- End diff --
    
    Just so I'm clear, up to the reservoir size, we add to the reservoir.  When we're past the reservoir, we do a random replacement as per https://en.wikipedia.org/wiki/Reservoir_sampling


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by simonellistonball <gi...@git.apache.org>.
Github user simonellistonball commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156791019
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/UniformSampler.java ---
    @@ -0,0 +1,91 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.metron.statistics.sampling;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +import java.util.Random;
    +
    +public class UniformSampler implements Sampler {
    +  private List<Object> reservoir;
    +  private int seen = 0;
    +  private int size;
    +  private Random rng = new Random(0);
    +
    +  public UniformSampler() {
    +    this(DEFAULT_SIZE);
    +  }
    +
    +  public UniformSampler(int size) {
    +    this.size = size;
    +    reservoir = new ArrayList<>(size);
    +  }
    +
    +  @Override
    +  public Iterable<Object> get() {
    +    return reservoir;
    +  }
    +
    +  /**
    +   * Add an object to the reservoir
    +   * @param o
    +   */
    +  public void add(Object o) {
    +    if(o == null) {
    +      return;
    +    }
    +    if (reservoir.size() < size) {
    +      reservoir.add(o);
    +    } else {
    +      int rIndex = rng.nextInt(seen + 1);
    --- End diff --
    
    you are 100% right, that's what I get for skim reading.


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156799548
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
       * bounds - A list of value bounds (excluding min and max) in sorted order.
     * Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N).  No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
     
    +### Sampling Functions
    +
    +#### `SAMPLE_ADD`
    +* Description: Add a value or collection of values to a sampler.
    +* Input:
    --- End diff --
    
    Then we'll have a get sample types method, like we do with other things like this right?


---

[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:

    https://github.com/apache/metron/pull/867
  
    I'm +1 by inspection, this looks great!


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156788055
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
       * bounds - A list of value bounds (excluding min and max) in sorted order.
     * Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N).  No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
     
    +### Sampling Functions
    +
    +#### `SAMPLE_ADD`
    +* Description: Add a value or collection of values to a sampler.
    +* Input:
    --- End diff --
    
    This makes it seem like Uniform sampler is a 'known' thing.  But it is not, either by explanation or reference to where it is explained ( as we have done referring to algorithms before ).
    Is there another type of sampler?
    
    Somewhere ( I'm not sure where ) we should say:
    "A sampler is a xxxxx that is | does | acts as  xxxxx for the sample functions. The default has these properties, but you can override that in init"
    
    Why even mention the Universal?
    
    



---

[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/867
  
    I commented on the ticket, but will repeat a portion of it here at the risk of this being quite off-topic.  @justinleet has some great ideas/prototypes about automatic documentation generation from stellar.  I'd support a new annotation for documenting namespaces.


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156794655
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
       * bounds - A list of value bounds (excluding min and max) in sorted order.
     * Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N).  No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
     
    +### Sampling Functions
    +
    +#### `SAMPLE_ADD`
    +* Description: Add a value or collection of values to a sampler.
    +* Input:
    --- End diff --
    
    There are definitely other types of reservoir samplers which we will probably want.  Most specifically a sampler that is biased toward recency (so non-uniform in that case).


---

[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar

Posted by simonellistonball <gi...@git.apache.org>.
Github user simonellistonball commented on the issue:

    https://github.com/apache/metron/pull/867
  
    Should the size limit on the sample really be a cut off? In a likely usage scenario a users would sample over a window in a profile. Limiting the size is likely to skew to time at the beginning of the window rather than being genuinely uniform. Would a random replacement strategy make more sense when over the limit? This could be a lot heavier in terms of performance, but may be more mathematically sound.


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by simonellistonball <gi...@git.apache.org>.
Github user simonellistonball commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156794990
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
       * bounds - A list of value bounds (excluding min and max) in sorted order.
     * Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N).  No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
     
    +### Sampling Functions
    +
    +#### `SAMPLE_ADD`
    +* Description: Add a value or collection of values to a sampler.
    +* Input:
    --- End diff --
    
    Recency would surely be more relevant for merged resampling in a profile context? 


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156790508
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/SamplingInitFunctions.java ---
    @@ -0,0 +1,89 @@
    +/*
    + *
    + *  Licensed to the Apache Software Foundation (ASF) under one
    + *  or more contributor license agreements.  See the NOTICE file
    + *  distributed with this work for additional information
    + *  regarding copyright ownership.  The ASF licenses this file
    + *  to you under the Apache License, Version 2.0 (the
    + *  "License"); you may not use this file except in compliance
    + *  with the License.  You may obtain a copy of the License at
    + *
    + *      http://www.apache.org/licenses/LICENSE-2.0
    + *
    + *  Unless required by applicable law or agreed to in writing, software
    + *  distributed under the License is distributed on an "AS IS" BASIS,
    + *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + *  See the License for the specific language governing permissions and
    + *  limitations under the License.
    + *
    + */
    +package org.apache.metron.statistics.sampling;
    +
    +import org.apache.metron.stellar.common.utils.ConversionUtils;
    +import org.apache.metron.stellar.dsl.Context;
    +import org.apache.metron.stellar.dsl.ParseException;
    +import org.apache.metron.stellar.dsl.Stellar;
    +import org.apache.metron.stellar.dsl.StellarFunction;
    +
    +import java.util.List;
    +import java.util.Optional;
    +import java.util.function.Supplier;
    +
    +public class SamplingInitFunctions {
    +
    +  @Stellar(namespace="SAMPLE"
    +          ,name="INIT"
    +          ,description="Create a uniform reservoir sampler of a specific size or, if unspecified, size " + Sampler.DEFAULT_SIZE
    +          ,params = {
    +            "size? - The size of the reservoir sampler.  If unspecified, the size is " + Sampler.DEFAULT_SIZE
    +          }
    +          ,returns="The sampler object."
    +  )
    +
    +  public static class UniformSamplerInit implements StellarFunction {
    +    @Override
    +    public Object apply(List<Object> args, Context context) throws ParseException {
    +      if(args.size() == 0) {
    +        return new UniformSampler();
    +      }
    +      else {
    +        Optional<Integer> sizeArg = get(args, 0, "Size", Integer.class);
    +        if(sizeArg.isPresent() && sizeArg.get() <= 0) {
    +          throw new IllegalStateException("Size must be a positive integer");
    +        }
    +        else {
    +          return new UniformSampler(sizeArg.orElse(Sampler.DEFAULT_SIZE));
    +        }
    +      }
    +    }
    +
    +    @Override
    +    public void initialize(Context context) {
    +    }
    +
    +    @Override
    +    public boolean isInitialized() {
    +      return true;
    +    }
    +  }
    +
    +
    +  public static <T> Optional<T> get(List<Object> args, int offset, String argName, Class<T> expectedClazz) {
    +    Object obj = args.get(offset);
    +    T ret = ConversionUtils.convert(obj, expectedClazz);
    --- End diff --
    
    Couldn't this be simplified to :
    
    ```java 
    if(ret == null ) {
          if(obj != null) {
             throw new IllegalStateException(argName + "argument(" + obj
                                            + " is expected to be an " + expectedClazz.getName()
                                            + ", but was " + obj
                                            );
           }
        }
        return Optional.ofNullable(ret);
     }
    ```


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156796690
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
       * bounds - A list of value bounds (excluding min and max) in sorted order.
     * Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N).  No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
     
    +### Sampling Functions
    +
    +#### `SAMPLE_ADD`
    +* Description: Add a value or collection of values to a sampler.
    +* Input:
    --- End diff --
    
    They're both needed.  Some use-cases would be fine without bias and some would be better with bias.  As a follow-on, I was planning on adding a biased sampler, but this is a big enough PR without it.  It'd look something like:
    ```
    samples := SAMPLE_MERGE(PROFILE_GET('samples', ...))
    biased_sample := SAMPLE_GET_BIASED(samples, 0.015)
    ```


---

[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/867#discussion_r156792227
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
       * bounds - A list of value bounds (excluding min and max) in sorted order.
     * Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N).  No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
     
    +### Sampling Functions
    +
    +#### `SAMPLE_ADD`
    +* Description: Add a value or collection of values to a sampler.
    +* Input:
    --- End diff --
    
    Sorry, `uniform` here is intended to mean that there's each element has equal probability of being in the sample (e.g. the probability is pulled from a [uniform probability distribution](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous))).  I can probably do a better job documenting. 


---