You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@metron.apache.org by cestella <gi...@git.apache.org> on 2017/12/13 21:07:17 UTC
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
GitHub user cestella opened a pull request:
https://github.com/apache/metron/pull/867
METRON-1350: Add reservoir sampling functions to Stellar
## Contributor Comments
Sampling capabilities would fit very well with the profiler and enable algorithms that do not necessarily support our existing probabilistic sketches. We should add a reservoir sampler and utilities to merge and resample.
You can play with `SAMPLE_INIT`, `SAMPLE_ADD`, `SAMPLE_MERGE` and `SAMPLE_GET` via the REPL:
```
[Stellar]>>> ?SAMPLE_ADD
SAMPLE_ADD
Description: Add to a sample
Arguments:
sampler - Sampler to use. If null, then a default Uniform sampler is created
o - The value to add. If o is an Iterable, then each item is added.
Returns:
[Stellar]>>> s_10 := SAMPLE_INIT(10)
[Stellar]>>> sample := REDUCE( [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ], (s, x) -> SAMPLE_ADD(s, x), SAMPLE_INIT(5))
[Stellar]>>> SAMPLE_GET(sample)
[6, 8, 11, 4, 5]
[Stellar]>>> SAMPLE_ADD(s_10, [5, 2, 5, 7, 10 ])
org.apache.metron.statistics.sampling.UniformSampler@3d8d06c0
[Stellar]>>> SAMPLE_GET(SAMPLE_ADD(s_10, [5, 2, 5, 7, 10 ]))
[5, 2, 5, 7, 10, 5, 2, 5, 7, 10]
```
## Pull Request Checklist
Thank you for submitting a contribution to Apache Metron.
Please refer to our [Development Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235) for the complete guide to follow for contributions.
Please refer also to our [Build Verification Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview) for complete smoke testing guides.
In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:
### For all changes:
- [x] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
### For code changes:
- [x] Have you included steps to reproduce the behavior or problem that is being changed or addressed?
- [x] Have you included steps or a guide to how the change may be verified and tested manually?
- [x] Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
```
mvn -q clean integration-test install && build_utils/verify_licenses.sh
```
- [x] Have you written or updated unit tests and or integration tests to verify your changes?
- [x] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?
### For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via `site-book/target/site/index.html`:
```
cd site-book
mvn site
```
#### Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that [travis-ci](https://travis-ci.org) is set up for your personal repository such that your branches are built there before submitting a pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cestella/incubator-metron sampling
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/metron/pull/867.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #867
----
commit 7e1a19e29f86a23140aa46291f0083409ddac40d
Author: cstella <ce...@gmail.com>
Date: 2017-12-13T20:59:40Z
METRON-1350: Add reservoir sampling functions to Stellar
----
---
[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar
Posted by justinleet <gi...@git.apache.org>.
Github user justinleet commented on the issue:
https://github.com/apache/metron/pull/867
@ottobackwards Left a comment on the ticket you made: https://issues.apache.org/jira/browse/METRON-1361?focusedCommentId=16291159&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16291159
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156800428
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
* bounds - A list of value bounds (excluding min and max) in sorted order.
* Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N). No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
+### Sampling Functions
+
+#### `SAMPLE_ADD`
+* Description: Add a value or collection of values to a sampler.
+* Input:
--- End diff --
Actually, more than likely it'd be a separate init since each type are going to have different types of parameters depending on the algorithm. So a biased sampler would be `SAMPLE_INIT_BIASED(size, ...)`
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156792461
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
* bounds - A list of value bounds (excluding min and max) in sorted order.
* Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N). No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
+### Sampling Functions
+
+#### `SAMPLE_ADD`
+* Description: Add a value or collection of values to a sampler.
+* Input:
--- End diff --
Couldn't this be simplified to
```java
if(ret == null ) {
if(obj != null) {
throw new IllegalStateException(argName + "argument(" + obj
+ " is expected to be an " + expectedClazz.getName()
+ ", but was " + obj
);
}
}
return Optional.ofNullable(ret);
```
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156799950
--- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/UniformSampler.java ---
@@ -0,0 +1,91 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.statistics.sampling;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+
+public class UniformSampler implements Sampler {
+ private List<Object> reservoir;
+ private int seen = 0;
+ private int size;
+ private Random rng = new Random(0);
+
+ public UniformSampler() {
+ this(DEFAULT_SIZE);
+ }
+
+ public UniformSampler(int size) {
+ this.size = size;
+ reservoir = new ArrayList<>(size);
+ }
+
+ @Override
+ public Iterable<Object> get() {
+ return reservoir;
+ }
+
+ /**
+ * Add an object to the reservoir
+ * @param o
+ */
+ public void add(Object o) {
+ if(o == null) {
+ return;
+ }
+ if (reservoir.size() < size) {
+ reservoir.add(o);
+ } else {
+ int rIndex = rng.nextInt(seen + 1);
--- End diff --
This makes me think that we need "namespace" scoped documentation
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156792855
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
* bounds - A list of value bounds (excluding min and max) in sorted order.
* Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N). No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
+### Sampling Functions
+
+#### `SAMPLE_ADD`
+* Description: Add a value or collection of values to a sampler.
+* Input:
--- End diff --
Ok, It seemed like the Uniform implementation was leaking
---
[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar
Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:
https://github.com/apache/metron/pull/867
any chance I can see @justinleet 's ideas/prototypes?
---
[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar
Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:
https://github.com/apache/metron/pull/867
Sorry, I am not sure I understand, this is random replacement when after the size limit. Am I mistaking your question?
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156959616
--- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/UniformSampler.java ---
@@ -0,0 +1,91 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.statistics.sampling;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+
+public class UniformSampler implements Sampler {
+ private List<Object> reservoir;
+ private int seen = 0;
+ private int size;
+ private Random rng = new Random(0);
+
+ public UniformSampler() {
+ this(DEFAULT_SIZE);
+ }
+
+ public UniformSampler(int size) {
+ this.size = size;
+ reservoir = new ArrayList<>(size);
+ }
+
+ @Override
+ public Iterable<Object> get() {
+ return reservoir;
+ }
+
+ /**
+ * Add an object to the reservoir
+ * @param o
+ */
+ public void add(Object o) {
+ if(o == null) {
+ return;
+ }
+ if (reservoir.size() < size) {
+ reservoir.add(o);
+ } else {
+ int rIndex = rng.nextInt(seen + 1);
--- End diff --
I modified the docs to have a link to the Reservoir sampling wikipedia article and made things a bit more clear regarding Uniform.
---
[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar
Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:
https://github.com/apache/metron/pull/867
@cestella comments?
https://issues.apache.org/jira/browse/METRON-1361
---
[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar
Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:
https://github.com/apache/metron/pull/867
I'll keep that +1 in my pocket...I want to spin it up in full-dev and try it in conjunction with a PR that I'm not quite done with yet. Let's hold off merging until I can validate that I haven't botched the API in some horrible way for sampling ;)
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/metron/pull/867
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156799854
--- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/UniformSampler.java ---
@@ -0,0 +1,91 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.statistics.sampling;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+
+public class UniformSampler implements Sampler {
+ private List<Object> reservoir;
+ private int seen = 0;
+ private int size;
+ private Random rng = new Random(0);
+
+ public UniformSampler() {
+ this(DEFAULT_SIZE);
+ }
+
+ public UniformSampler(int size) {
+ this.size = size;
+ reservoir = new ArrayList<>(size);
+ }
+
+ @Override
+ public Iterable<Object> get() {
+ return reservoir;
+ }
+
+ /**
+ * Add an object to the reservoir
+ * @param o
+ */
+ public void add(Object o) {
+ if(o == null) {
+ return;
+ }
+ if (reservoir.size() < size) {
+ reservoir.add(o);
+ } else {
+ int rIndex = rng.nextInt(seen + 1);
--- End diff --
Shouldn't we reference Reservoir Sampling in the documentation? Then the use of Universal and other terms would be more in context.
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156790945
--- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/UniformSampler.java ---
@@ -0,0 +1,91 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.statistics.sampling;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+
+public class UniformSampler implements Sampler {
+ private List<Object> reservoir;
+ private int seen = 0;
+ private int size;
+ private Random rng = new Random(0);
+
+ public UniformSampler() {
+ this(DEFAULT_SIZE);
+ }
+
+ public UniformSampler(int size) {
+ this.size = size;
+ reservoir = new ArrayList<>(size);
+ }
+
+ @Override
+ public Iterable<Object> get() {
+ return reservoir;
+ }
+
+ /**
+ * Add an object to the reservoir
+ * @param o
+ */
+ public void add(Object o) {
+ if(o == null) {
+ return;
+ }
+ if (reservoir.size() < size) {
+ reservoir.add(o);
+ } else {
+ int rIndex = rng.nextInt(seen + 1);
--- End diff --
Just so I'm clear, up to the reservoir size, we add to the reservoir. When we're past the reservoir, we do a random replacement as per https://en.wikipedia.org/wiki/Reservoir_sampling
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by simonellistonball <gi...@git.apache.org>.
Github user simonellistonball commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156791019
--- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/UniformSampler.java ---
@@ -0,0 +1,91 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.statistics.sampling;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+
+public class UniformSampler implements Sampler {
+ private List<Object> reservoir;
+ private int seen = 0;
+ private int size;
+ private Random rng = new Random(0);
+
+ public UniformSampler() {
+ this(DEFAULT_SIZE);
+ }
+
+ public UniformSampler(int size) {
+ this.size = size;
+ reservoir = new ArrayList<>(size);
+ }
+
+ @Override
+ public Iterable<Object> get() {
+ return reservoir;
+ }
+
+ /**
+ * Add an object to the reservoir
+ * @param o
+ */
+ public void add(Object o) {
+ if(o == null) {
+ return;
+ }
+ if (reservoir.size() < size) {
+ reservoir.add(o);
+ } else {
+ int rIndex = rng.nextInt(seen + 1);
--- End diff --
you are 100% right, that's what I get for skim reading.
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156799548
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
* bounds - A list of value bounds (excluding min and max) in sorted order.
* Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N). No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
+### Sampling Functions
+
+#### `SAMPLE_ADD`
+* Description: Add a value or collection of values to a sampler.
+* Input:
--- End diff --
Then we'll have a get sample types method, like we do with other things like this right?
---
[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar
Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:
https://github.com/apache/metron/pull/867
I'm +1 by inspection, this looks great!
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156788055
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
* bounds - A list of value bounds (excluding min and max) in sorted order.
* Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N). No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
+### Sampling Functions
+
+#### `SAMPLE_ADD`
+* Description: Add a value or collection of values to a sampler.
+* Input:
--- End diff --
This makes it seem like Uniform sampler is a 'known' thing. But it is not, either by explanation or reference to where it is explained ( as we have done referring to algorithms before ).
Is there another type of sampler?
Somewhere ( I'm not sure where ) we should say:
"A sampler is a xxxxx that is | does | acts as xxxxx for the sample functions. The default has these properties, but you can override that in init"
Why even mention the Universal?
---
[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar
Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:
https://github.com/apache/metron/pull/867
I commented on the ticket, but will repeat a portion of it here at the risk of this being quite off-topic. @justinleet has some great ideas/prototypes about automatic documentation generation from stellar. I'd support a new annotation for documenting namespaces.
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156794655
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
* bounds - A list of value bounds (excluding min and max) in sorted order.
* Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N). No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
+### Sampling Functions
+
+#### `SAMPLE_ADD`
+* Description: Add a value or collection of values to a sampler.
+* Input:
--- End diff --
There are definitely other types of reservoir samplers which we will probably want. Most specifically a sampler that is biased toward recency (so non-uniform in that case).
---
[GitHub] metron issue #867: METRON-1350: Add reservoir sampling functions to Stellar
Posted by simonellistonball <gi...@git.apache.org>.
Github user simonellistonball commented on the issue:
https://github.com/apache/metron/pull/867
Should the size limit on the sample really be a cut off? In a likely usage scenario a users would sample over a window in a profile. Limiting the size is likely to skew to time at the beginning of the window rather than being genuinely uniform. Would a random replacement strategy make more sense when over the limit? This could be a lot heavier in terms of performance, but may be more mathematically sound.
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by simonellistonball <gi...@git.apache.org>.
Github user simonellistonball commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156794990
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
* bounds - A list of value bounds (excluding min and max) in sorted order.
* Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N). No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
+### Sampling Functions
+
+#### `SAMPLE_ADD`
+* Description: Add a value or collection of values to a sampler.
+* Input:
--- End diff --
Recency would surely be more relevant for merged resampling in a profile context?
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156790508
--- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/SamplingInitFunctions.java ---
@@ -0,0 +1,89 @@
+/*
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ */
+package org.apache.metron.statistics.sampling;
+
+import org.apache.metron.stellar.common.utils.ConversionUtils;
+import org.apache.metron.stellar.dsl.Context;
+import org.apache.metron.stellar.dsl.ParseException;
+import org.apache.metron.stellar.dsl.Stellar;
+import org.apache.metron.stellar.dsl.StellarFunction;
+
+import java.util.List;
+import java.util.Optional;
+import java.util.function.Supplier;
+
+public class SamplingInitFunctions {
+
+ @Stellar(namespace="SAMPLE"
+ ,name="INIT"
+ ,description="Create a uniform reservoir sampler of a specific size or, if unspecified, size " + Sampler.DEFAULT_SIZE
+ ,params = {
+ "size? - The size of the reservoir sampler. If unspecified, the size is " + Sampler.DEFAULT_SIZE
+ }
+ ,returns="The sampler object."
+ )
+
+ public static class UniformSamplerInit implements StellarFunction {
+ @Override
+ public Object apply(List<Object> args, Context context) throws ParseException {
+ if(args.size() == 0) {
+ return new UniformSampler();
+ }
+ else {
+ Optional<Integer> sizeArg = get(args, 0, "Size", Integer.class);
+ if(sizeArg.isPresent() && sizeArg.get() <= 0) {
+ throw new IllegalStateException("Size must be a positive integer");
+ }
+ else {
+ return new UniformSampler(sizeArg.orElse(Sampler.DEFAULT_SIZE));
+ }
+ }
+ }
+
+ @Override
+ public void initialize(Context context) {
+ }
+
+ @Override
+ public boolean isInitialized() {
+ return true;
+ }
+ }
+
+
+ public static <T> Optional<T> get(List<Object> args, int offset, String argName, Class<T> expectedClazz) {
+ Object obj = args.get(offset);
+ T ret = ConversionUtils.convert(obj, expectedClazz);
--- End diff --
Couldn't this be simplified to :
```java
if(ret == null ) {
if(obj != null) {
throw new IllegalStateException(argName + "argument(" + obj
+ " is expected to be an " + expectedClazz.getName()
+ ", but was " + obj
);
}
}
return Optional.ofNullable(ret);
}
```
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156796690
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
* bounds - A list of value bounds (excluding min and max) in sorted order.
* Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N). No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
+### Sampling Functions
+
+#### `SAMPLE_ADD`
+* Description: Add a value or collection of values to a sampler.
+* Input:
--- End diff --
They're both needed. Some use-cases would be fine without bias and some would be better with bias. As a follow-on, I was planning on adding a biased sampler, but this is a big enough PR without it. It'd look something like:
```
samples := SAMPLE_MERGE(PROFILE_GET('samples', ...))
biased_sample := SAMPLE_GET_BIASED(samples, 0.015)
```
---
[GitHub] metron pull request #867: METRON-1350: Add reservoir sampling functions to S...
Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:
https://github.com/apache/metron/pull/867#discussion_r156792227
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -53,6 +53,32 @@ functions can be used from everywhere where Stellar is used.
* bounds - A list of value bounds (excluding min and max) in sorted order.
* Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N). No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
+### Sampling Functions
+
+#### `SAMPLE_ADD`
+* Description: Add a value or collection of values to a sampler.
+* Input:
--- End diff --
Sorry, `uniform` here is intended to mean that there's each element has equal probability of being in the sample (e.g. the probability is pulled from a [uniform probability distribution](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous))). I can probably do a better job documenting.
---