You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@metron.apache.org by cestella <gi...@git.apache.org> on 2016/09/13 18:18:47 UTC

[GitHub] incubator-metron pull request #250: METRON-416: Provide the ability to store...

GitHub user cestella opened a pull request:

    https://github.com/apache/incubator-metron/pull/250

    METRON-416: Provide the ability to store mergeable data structures for summarizing data on-line

    With the currently worked on profiler, we it will be advantageous for us to be able to store more than just numeric data as a profile. In particular, adding some common data structures that can be merged will allow us to store data at a fixed tick-rate, but merge the results from hbase across multiple ticks.
    
    The following structures should be supported:
    * An implementation of the `STATS` stellar functions which is backed by an online and mergeable class using distributional sketches for querying distribution.
    * An implementation of bloom filters so that probabalistic existence queries can be made
    
    This should facilitate simple statistical outlier analysis.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron probabalistic_functions

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-metron/pull/250.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #250
    
----
commit 1fa3950097668602f2f6e98f523b37c995139ea9
Author: cstella <ce...@gmail.com>
Date:   2016-09-12T22:21:27Z

    Refactoring to allow online statistical calculations

commit ce8e175979b9f7df023b47534b7b29970d737e1f
Author: cstella <ce...@gmail.com>
Date:   2016-09-13T18:03:50Z

    fixed kurtosis and skewness to be unbiased.

commit 17615e96f488710a7c1dc2269ab0cea8356a744b
Author: cstella <ce...@gmail.com>
Date:   2016-09-13T18:12:00Z

    Licensing.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron pull request #250: METRON-416: Provide the ability to store...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-metron/pull/250


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron issue #250: METRON-416: Provide the ability to store mergea...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/250
  
    Agree, both of them are irritating and now fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron issue #250: METRON-416: Provide the ability to store mergea...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on the issue:

    https://github.com/apache/incubator-metron/pull/250
  
    I don't see any unit tests that check for correct handling of overflow or underflow.  Am I missing them?
    
    I modified the tests to see what happens and when you actually hit an overflow condition.  The code correctly detects it from what I can tell, but there is no indication that it occurred.  Similar to METRON-404, but not during function initialization.  The exception gets swallowed and all you see is "ParseException: Unable to pop an empty stack".  Could be addressed as a separate PR, but worthy of a fixin.  
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron issue #250: METRON-416: Provide the ability to store mergea...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/250
  
    Done, thanks for the comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron pull request #250: METRON-416: Provide the ability to store...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/250#discussion_r78633816
  
    --- Diff: metron-platform/metron-common/src/test/java/org/apache/metron/common/stellar/StellarStatisticsFunctionsTest.java ---
    @@ -215,10 +261,11 @@ public void testSumSquares() throws Exception {
         assertEquals(stats.getSumsq(), (Double) actual, 0.1);
       }
     
    -  @Test(expected = ParseException.class)
    +  @Test
       public void testKurtosisNoWindow() throws Exception {
         statsInit(0);
    -    run("STATS_KURTOSIS(stats)", variables);
    +    Object actual = run("STATS_KURTOSIS(stats)", variables);
    +    assertEquals(stats.getKurtosis(), (Double) actual, 0.1);
       }
    --- End diff --
    
    I think you can get rid of all the "test*[No]Window" tests (for skewness, percentile, and kurtosis) and replace them with a single test just like the others (max, min, count, etc) that test both scenarios.  Those were created only for tests where there was a behavioral difference between the 'windowed' versus 'non-windowed' implementation. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron issue #250: METRON-416: Provide the ability to store mergea...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on the issue:

    https://github.com/apache/incubator-metron/pull/250
  
    Nice +1 pending CI build


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron issue #250: METRON-416: Provide the ability to store mergea...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/250
  
    Validated on Vagrant.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron pull request #250: METRON-416: Provide the ability to store...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/250#discussion_r78631385
  
    --- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/math/stats/OnlineStatisticsProvider.java ---
    @@ -0,0 +1,243 @@
    +/*
    + *
    + *  Licensed to the Apache Software Foundation (ASF) under one
    + *  or more contributor license agreements.  See the NOTICE file
    + *  distributed with this work for additional information
    + *  regarding copyright ownership.  The ASF licenses this file
    + *  to you under the Apache License, Version 2.0 (the
    + *  "License"); you may not use this file except in compliance
    + *  with the License.  You may obtain a copy of the License at
    + *
    + *      http://www.apache.org/licenses/LICENSE-2.0
    + *
    + *  Unless required by applicable law or agreed to in writing, software
    + *  distributed under the License is distributed on an "AS IS" BASIS,
    + *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + *  See the License for the specific language governing permissions and
    + *  limitations under the License.
    + *
    + */
    +
    +package org.apache.metron.common.math.stats;
    +
    +import com.tdunning.math.stats.TDigest;
    +import org.apache.commons.math3.util.FastMath;
    +
    +/**
    + * A (near) constant memory implementation of a statistics provider.
    + * For first order statistics, simple terms are stored and composed
    + * to return the statistics results.  This is intended to provide a
    + * mergeable implementation for a statistics provider.
    + */
    +public class OnlineStatisticsProvider implements StatisticsProvider {
    +  /**
    +   * A sensible default for compression to use in the T-Digest.
    +   * As per https://github.com/tdunning/t-digest/blob/master/src/main/java/com/tdunning/math/stats/TDigest.java#L86
    +   * 100 is a sensible default and the number of centroids retained (to construct the sketch)
    +   * is usually a smallish (usually < 10) multiple of the compression.
    +   */
    +  public static final int COMPRESSION = 100;
    +
    +
    +  /**
    +   * A distributional sketch that uses a variant of 1-D k-means to construct a tree of ranges
    +   * that sketches the distribution.  See https://github.com/tdunning/t-digest#t-digest for
    +   * more detail.
    +   */
    +  private TDigest digest;
    +
    +  private long n = 0;
    +  private double sum = 0;
    +  private double sumOfSquares = 0;
    +  private double sumOfLogs = 0;
    +  private Double min = null;
    +  private Double max = null;
    +
    +  //\mu_1, E[X]
    +  private double M1 = 0;
    +  //\mu_2: E[(X - \mu)^2]
    +  private double M2 = 0;
    +  //\mu_3: E[(X - \mu)^3]
    +  private double M3 = 0;
    +  //\mu_4: E[(X - \mu)^4]
    +  private double M4 = 0;
    +
    +  public OnlineStatisticsProvider() {
    +    digest = TDigest.createAvlTreeDigest(COMPRESSION);
    +  }
    +
    +  /**
    +   * Add a value.
    +   * NOTE: This does not store the point, but only updates internal state.
    +   * NOTE: This is NOT threadsafe.
    +   * @param value
    +   */
    +  @Override
    +  public void addValue(double value) {
    +    long n1 = n;
    +    min = min == null?value:Math.min(min, value);
    +    max = max == null?value:Math.max(max, value);
    +    sum += value;
    +    sumOfLogs += Math.log(value);
    +    sumOfSquares += value*value;
    --- End diff --
    
    Do we need to check for overflow?  Does it make sense to accumulate doubles in a double?  Shouldn't we use something bigger than a double, like a BigDecimal?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron pull request #250: METRON-416: Provide the ability to store...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/250#discussion_r78632699
  
    --- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/math/stats/OnlineStatisticsProvider.java ---
    @@ -0,0 +1,243 @@
    +/*
    + *
    + *  Licensed to the Apache Software Foundation (ASF) under one
    + *  or more contributor license agreements.  See the NOTICE file
    + *  distributed with this work for additional information
    + *  regarding copyright ownership.  The ASF licenses this file
    + *  to you under the Apache License, Version 2.0 (the
    + *  "License"); you may not use this file except in compliance
    + *  with the License.  You may obtain a copy of the License at
    + *
    + *      http://www.apache.org/licenses/LICENSE-2.0
    + *
    + *  Unless required by applicable law or agreed to in writing, software
    + *  distributed under the License is distributed on an "AS IS" BASIS,
    + *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + *  See the License for the specific language governing permissions and
    + *  limitations under the License.
    + *
    + */
    +
    +package org.apache.metron.common.math.stats;
    +
    +import com.tdunning.math.stats.TDigest;
    +import org.apache.commons.math3.util.FastMath;
    +
    +/**
    + * A (near) constant memory implementation of a statistics provider.
    + * For first order statistics, simple terms are stored and composed
    + * to return the statistics results.  This is intended to provide a
    + * mergeable implementation for a statistics provider.
    + */
    +public class OnlineStatisticsProvider implements StatisticsProvider {
    +  /**
    +   * A sensible default for compression to use in the T-Digest.
    +   * As per https://github.com/tdunning/t-digest/blob/master/src/main/java/com/tdunning/math/stats/TDigest.java#L86
    +   * 100 is a sensible default and the number of centroids retained (to construct the sketch)
    +   * is usually a smallish (usually < 10) multiple of the compression.
    +   */
    +  public static final int COMPRESSION = 100;
    +
    +
    +  /**
    +   * A distributional sketch that uses a variant of 1-D k-means to construct a tree of ranges
    +   * that sketches the distribution.  See https://github.com/tdunning/t-digest#t-digest for
    +   * more detail.
    +   */
    +  private TDigest digest;
    +
    +  private long n = 0;
    +  private double sum = 0;
    +  private double sumOfSquares = 0;
    +  private double sumOfLogs = 0;
    +  private Double min = null;
    +  private Double max = null;
    +
    +  //\mu_1, E[X]
    +  private double M1 = 0;
    +  //\mu_2: E[(X - \mu)^2]
    +  private double M2 = 0;
    +  //\mu_3: E[(X - \mu)^3]
    +  private double M3 = 0;
    +  //\mu_4: E[(X - \mu)^4]
    +  private double M4 = 0;
    +
    +  public OnlineStatisticsProvider() {
    +    digest = TDigest.createAvlTreeDigest(COMPRESSION);
    +  }
    +
    +  /**
    +   * Add a value.
    +   * NOTE: This does not store the point, but only updates internal state.
    +   * NOTE: This is NOT threadsafe.
    +   * @param value
    +   */
    +  @Override
    +  public void addValue(double value) {
    +    long n1 = n;
    +    min = min == null?value:Math.min(min, value);
    +    max = max == null?value:Math.max(max, value);
    +    sum += value;
    +    sumOfLogs += Math.log(value);
    +    sumOfSquares += value*value;
    --- End diff --
    
    Yeah, I'd prefer to avoid `BigDecimal` if at all possible due to performance concerns.  Also, I will point out that Commons Math also [uses](https://github.com/apache/commons-math/blob/050dfa6f0850174fdf958ce6751d2f06900201f8/src/main/java/org/apache/commons/math4/stat/descriptive/summary/SumOfSquares.java) `double` for its accumulation, so I think we're pretty safe for some large amount of headroom.  That being said, I will do the math ops via `ExactMath` so it throws an exception on overflow.  Good catch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---