Posted to dev@metron.apache.org by cestella <gi...@git.apache.org> on 2016/12/21 20:03:46 UTC

[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

GitHub user cestella opened a pull request:

    https://github.com/apache/incubator-metron/pull/401

    METRON-637: Add a STATS_BIN function to Stellar.

    When passing parameters to models, it's often useful to pass the binned representation of a variable, based on an empirical statistical distribution, rather than the raw value. This function should accept a statistical sketch, a value, and a set of percentile bins. It should return the index of the bin into which the value's percentile falls.
    
    For instance, consider the value 17, which falls at the 27th percentile. If we use 25, 75, and 95 to define our bins, this function would return 1, because the percentile, 27, is between 25 and 75.
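
The worked example in the description can be sketched in a few lines of Java. This is only an illustration, not the code in the pull request: the class and method names are hypothetical, and the value's percentile (27) is passed in directly instead of being read from a statistics object.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the binning rule described above; not the Metron implementation.
class PercentileBinSketch {

  // Returns the index of the first split that is >= the given percentile,
  // or splits.size() when the percentile is above every split.
  static int binOfPercentile(double percentile, List<Double> splits) {
    for (int bin = 0; bin < splits.size(); ++bin) {
      if (percentile <= splits.get(bin)) {
        return bin;
      }
    }
    return splits.size();
  }

  public static void main(String[] args) {
    List<Double> splits = Arrays.asList(25.0, 75.0, 95.0);
    // The 27th percentile lies between the 25 and 75 splits, so this prints 1.
    System.out.println(binOfPercentile(27.0, splits));
  }
}
```

With three splits there are four possible bins, 0 through 3; anything above the 95th percentile lands in bin 3.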

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron METRON-637

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-metron/pull/401.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #401
    
----
commit 6cc98a22af1efe0fc548c59e4a1fa379a41e245b
Author: cstella <ce...@gmail.com>
Date:   2016-12-21T17:47:51Z

    Added STATS_BIN function.

commit 81921691ef413b407117764b0597c471f7eebf30
Author: cstella <ce...@gmail.com>
Date:   2016-12-21T19:22:43Z

    Documentation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mmiklavc <gi...@git.apache.org>.
Github user mmiklavc commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r95370657
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,49 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    --- End diff --
    
    It's not immediately obvious to me what's being tested here - is this checking that default enums' list of splits is syntactically sound? What would make this test fail?



[GitHub] incubator-metron issue #401: METRON-637: Add a STATS_BIN function to Stellar...

Posted by mmiklavc <gi...@git.apache.org>.
Github user mmiklavc commented on the issue:

    https://github.com/apache/incubator-metron/pull/401
  
    +1



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93530871
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    +    statsInit(windowSize);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, "'QUARTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, "'QUINTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, "'DECILE'");
    +    statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 75.0]");
    +  }
    +
    +  public void statsBinRunner(List<Double> splits) throws Exception {
    +    statsBinRunner(splits, null);
    +  }
    +
    +  public void statsBinRunner(List<Double> splits, String splitsName) throws Exception {
    +    int bin = 0;
    +    for(Double d : stats.getSortedValues()) {
    +      StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
    +      if(bin < splits.size()) {
    +        double percentileOfBin = provider.getPercentile(splits.get(bin));
    +        if (d > percentileOfBin) {
    --- End diff --
    
    Sorry if I'm wrong here, but I couldn't find the definition of stats.getSortedValues().
    Isn't line 380 comparing a raw value `d` to a percentile value `percentileOfBin`?



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93536290
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    +    statsInit(windowSize);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, "'QUARTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, "'QUINTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, "'DECILE'");
    +    statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 75.0]");
    +  }
    +
    +  public void statsBinRunner(List<Double> splits) throws Exception {
    +    statsBinRunner(splits, null);
    +  }
    +
    +  public void statsBinRunner(List<Double> splits, String splitsName) throws Exception {
    +    int bin = 0;
    +    for(Double d : stats.getSortedValues()) {
    +      StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
    +      if(bin < splits.size()) {
    +        double percentileOfBin = provider.getPercentile(splits.get(bin));
    +        if (d > percentileOfBin) {
    --- End diff --
    
    I see, so provider.getPercentile() actually takes a percentile value (as a number between 0 and 99.99) and returns a corresponding raw value?  That makes sense if so.  Please confirm.
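
The direction being asked about here (percentile in, raw value out) can be illustrated with a nearest-rank percentile over a sorted sample. This is a sketch under an assumption: the thread does not say how `StatisticsProvider.getPercentile()` estimates its result, so the nearest-rank rule and the names below are stand-ins chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration of "percentile in, raw value out"; not the Metron StatisticsProvider.
class NearestRankSketch {

  // Nearest-rank percentile: the smallest sample such that at least p percent
  // of all samples are less than or equal to it. Requires 0 < p <= 100.
  static double percentileToValue(double[] samples, double p) {
    if (samples.length == 0 || p <= 0.0 || p > 100.0) {
      throw new IllegalArgumentException("need a non-empty sample and 0 < p <= 100");
    }
    double[] sorted = samples.clone();
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(p / 100.0 * sorted.length);
    return sorted[rank - 1];
  }

  public static void main(String[] args) {
    double[] samples = {5, 1, 4, 2, 3, 10, 9, 8, 7, 6};
    // The 25th percentile of 1..10 under nearest-rank is the 3rd smallest value: prints 3.0.
    System.out.println(percentileToValue(samples, 25.0));
  }
}
```

Under this reading, comparing a raw value `d` against the returned number is a like-for-like comparison, since both are in the units of the data.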



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93523865
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java ---
    @@ -425,4 +428,74 @@ public Object apply(List<Object> args) {
           return result;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(namespace = "STATS", name = "BIN"
    +          , description = "Computes the bin that the value is in based on the statistical distribution."
    +          , params = {
    +          "stats - The Stellar statistics object"
    +          , "value - The value to bin"
    +          , "range? - A list of percentile bin ranges (excluding min and max) or a string representing a known and common set of bins.  " +
    +          "For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg." +
    +          " If this argument is omitted, then we assume a Quartile bin split."
    --- End diff --
    
    Yep, I'll fix that.  Good clarification.



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93573955
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/MathFunctions.java ---
    @@ -59,4 +60,49 @@ public boolean isInitialized() {
           return true;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(name = "BIN"
    +          , description = "Computes the bin that the value is in given a set of bounds."
    +          , params = {
    +           "value - The value to bin"
    +          , "bounds - A list of value bounds (excluding min and max) in sorted order."
    +                    }
    +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    +          "and values greater than the last bound go in the M'th bin."
    +  )
    +  public static class Bin extends BaseStellarFunction {
    +
    +    public static int getBin(double value, List<Double> bins, Function<Integer, Double> boundFunc) {
    +      for(int bin = 0; bin < bins.size();++bin) {
    +        double bound = boundFunc.apply(bin);
    +        if(value <= bound) {
    +          return bin;
    +        }
    +      }
    +      return bins.size();
    +    }
    +
    +    @Override
    +    public Object apply(List<Object> args) {
    +      Double value = convert(args.get(0), Double.class);
    +      List<Double> bins = new ArrayList<>();
    +      if (args.size() > 1) {
    +        List<Number> objList = convert(args.get(1), List.class);
    +        if(objList == null) {
    +          return null;
    +        }
    +        for(Number n : objList) {
    +          bins.add(n.doubleValue());
    --- End diff --
    
    During this step we need to validate that the bounds list is strictly increasing.



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93702246
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/MathFunctions.java ---
    @@ -59,4 +60,45 @@ public boolean isInitialized() {
           return true;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(name = "BIN"
    +          , description = "Computes the bin that the value is in given a set of bounds."
    +          , params = {
    +           "value - The value to bin"
    +          , "bounds - A list of value bounds (excluding min and max) in sorted order."
    +                    }
    +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    +          "and values greater than the last bound go in the M'th bin."
    +  )
    +  public static class Bin extends BaseStellarFunction {
    +
    +    public static int getBin(double value, int numBins, Function<Integer, Double> boundFunc) {
    +      double lastBound = Long.MIN_VALUE;
    +      for(int bin = 0; bin < numBins;++bin) {
    +        double bound = boundFunc.apply(bin);
    +        if(bound < lastBound) {
    +          throw new IllegalStateException("Your bins must be monotonically increasing");
    --- End diff --
    
    You're right, strictly increasing is correct.
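
The check in the diff above uses `bound < lastBound`, which rejects decreasing bounds but lets two equal neighbours through. A strict version might look like the following sketch. The class and method names are hypothetical, and it validates the whole list up front instead of checking inside the binning loop as the diff does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a strictly-increasing check; not the code merged in the pull request.
class BoundsCheckSketch {

  // Throws if any bound is not strictly greater than the one before it.
  // Writing the test as !(bound > last) also rejects NaN, which compares false both ways.
  static void requireStrictlyIncreasing(List<Double> bounds) {
    double last = Double.NEGATIVE_INFINITY;
    for (double bound : bounds) {
      if (!(bound > last)) {
        throw new IllegalStateException(
            "Bounds must be strictly increasing, but " + bound + " follows " + last);
      }
      last = bound;
    }
  }

  public static void main(String[] args) {
    requireStrictlyIncreasing(Arrays.asList(25.0, 50.0, 75.0)); // passes silently
    try {
      requireStrictlyIncreasing(Arrays.asList(25.0, 25.0, 75.0)); // duplicate bound
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Starting from `Double.NEGATIVE_INFINITY` instead of `Long.MIN_VALUE` keeps the first comparison meaningful for any finite double.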



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93523895
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java ---
    @@ -425,4 +428,74 @@ public Object apply(List<Object> args) {
           return result;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(namespace = "STATS", name = "BIN"
    +          , description = "Computes the bin that the value is in based on the statistical distribution."
    +          , params = {
    +          "stats - The Stellar statistics object"
    +          , "value - The value to bin"
    +          , "range? - A list of percentile bin ranges (excluding min and max) or a string representing a known and common set of bins.  " +
    +          "For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg." +
    +          " If this argument is omitted, then we assume a Quartile bin split."
    +  }
    +          , returns = "Which bin the value falls in such that bin < value < bin + 1"
    --- End diff --
    
    Yup, will do; great suggestion.



[GitHub] incubator-metron issue #401: METRON-637: Add a STATS_BIN function to Stellar...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:

    https://github.com/apache/incubator-metron/pull/401
  
    Do these methods *have* to run on every tuple?  Can you set some kind of sample rate parameter to give a way to play with performance?



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93533831
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    +    statsInit(windowSize);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, "'QUARTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, "'QUINTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, "'DECILE'");
    +    statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 75.0]");
    +  }
    +
    +  public void statsBinRunner(List<Double> splits) throws Exception {
    +    statsBinRunner(splits, null);
    +  }
    +
    +  public void statsBinRunner(List<Double> splits, String splitsName) throws Exception {
    +    int bin = 0;
    +    for(Double d : stats.getSortedValues()) {
    +      StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
    --- End diff --
    
    That can be done, for sure.



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93524020
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java ---
    @@ -425,4 +428,74 @@ public Object apply(List<Object> args) {
           return result;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(namespace = "STATS", name = "BIN"
    +          , description = "Computes the bin that the value is in based on the statistical distribution."
    +          , params = {
    +          "stats - The Stellar statistics object"
    +          , "value - The value to bin"
    +          , "range? - A list of percentile bin ranges (excluding min and max) or a string representing a known and common set of bins.  " +
    +          "For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg." +
    +          " If this argument is omitted, then we assume a Quartile bin split."
    +  }
    +          , returns = "Which bin the value falls in such that bin < value < bin + 1"
    +  )
    +  public static class Bin extends BaseStellarFunction {
    +    public enum BinSplits {
    +      QUARTILE(ImmutableList.of(25.0, 50.0, 75.0)),
    +      QUINTILE(ImmutableList.of(20.0, 40.0, 60.0, 80.0)),
    +      DECILE(ImmutableList.of(10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0))
    +      ;
    +      public final List<Double> split;
    +      BinSplits(List<Double> split) {
    +        this.split = split;
    +      }
    +
    +      public static List<Double> getSplit(Object o) {
    +        if(o instanceof String) {
    +          return BinSplits.valueOf((String)o).split;
    +        }
    +        else if(o instanceof List) {
    +          List<Double> ret = new ArrayList<>();
    +          for(Object valO : (List<Object>)o) {
    +            ret.add(ConversionUtils.convert(valO, Double.class));
    +          }
    +          return ret;
    +        }
    +        throw new IllegalStateException("The split you tried to pass is not a valid split: " + o.toString());
    +      }
    +    }
    +
    +
    +    @Override
    +    public Object apply(List<Object> args) {
    +      StatisticsProvider stats = convert(args.get(0), StatisticsProvider.class);
    +      Double value = convert(args.get(1), Double.class);
    +      List<Double> bins = BinSplits.QUARTILE.split;
    +      if (args.size() > 2) {
    +        bins = BinSplits.getSplit(args.get(2));
    +      }
    +      if (stats == null || value == null || bins.size() == 0) {
    +        return -1;
    +      }
    +
    +      double prevPctile = stats.getPercentile(bins.get(0));
    +
    +      if(value <= prevPctile) {
    +        return 0;
    +      }
    +      for(int bin = 1; bin < bins.size();++bin) {
    +        double pctile = stats.getPercentile(bins.get(bin));
    +        if(value > prevPctile && value <= pctile) {
    --- End diff --
    
    Haha, yes, you can.  I'll correct it.



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93526574
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    +    statsInit(windowSize);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, "'QUARTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, "'QUINTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, "'DECILE'");
    +    statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 75.0]");
    +  }
    +
    +  public void statsBinRunner(List<Double> splits) throws Exception {
    +    statsBinRunner(splits, null);
    +  }
    +
    +  public void statsBinRunner(List<Double> splits, String splitsName) throws Exception {
    +    int bin = 0;
    +    for(Double d : stats.getSortedValues()) {
    +      StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
    --- End diff --
    
    Put `provider` assignment outside the loop?



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-metron/pull/401



[GitHub] incubator-metron issue #401: METRON-637: Add a STATS_BIN function to Stellar...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/401
  
    Hmm, good point.  How about keeping a Guava timed cache in the STATS_BIN function that maps the stats object and raw bounds list to the computed percentile bounds list?  That should do the trick, no?
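
The caching idea floated here could look roughly like the sketch below. It is a rough illustration under several assumptions: it uses a plain `HashMap` with expiry timestamps as a stand-in for a Guava timed cache, it keys only on the splits list (so one cache instance per stats object is assumed), and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a timed cache of computed bounds; names are illustrative only.
class TimedBoundsCacheSketch {

  // A cached list of raw-value bounds plus the time at which it stops being valid.
  private static final class Entry {
    final List<Double> bounds;
    final long expiresAtMillis;

    Entry(List<Double> bounds, long expiresAtMillis) {
      this.bounds = bounds;
      this.expiresAtMillis = expiresAtMillis;
    }
  }

  private final Map<List<Double>, Entry> entries = new HashMap<>();
  private final long ttlMillis;
  private final LongSupplier clock;

  TimedBoundsCacheSketch(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  // Returns the raw-value bounds for the given percentile splits, calling
  // percentileToValue once per split only when the entry is missing or expired.
  synchronized List<Double> boundsFor(List<Double> splits,
                                      Function<Double, Double> percentileToValue) {
    long now = clock.getAsLong();
    Entry entry = entries.get(splits);
    if (entry == null || now >= entry.expiresAtMillis) {
      List<Double> computed = new ArrayList<>();
      for (Double split : splits) {
        computed.add(percentileToValue.apply(split));
      }
      entry = new Entry(computed, now + ttlMillis);
      // Copy the key so later changes to the caller's list cannot corrupt the map.
      entries.put(new ArrayList<>(splits), entry);
    }
    return entry.bounds;
  }

  public static void main(String[] args) {
    TimedBoundsCacheSketch cache =
        new TimedBoundsCacheSketch(60_000L, System::currentTimeMillis);
    List<Double> quartiles = Arrays.asList(25.0, 50.0, 75.0);
    // Placeholder lookup; a real caller would pass a percentile lookup on the stats object.
    System.out.println(cache.boundsFor(quartiles, p -> p * 2));
  }
}
```

The trade-off is staleness: within the expiry window the bounds do not reflect values added to the statistics object, which is what buys the per-tuple savings raised earlier in the thread.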



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93575237
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -373,15 +373,16 @@ public void statsBinRunner(List<Double> splits) throws Exception {
     
       public void statsBinRunner(List<Double> splits, String splitsName) throws Exception {
         int bin = 0;
    +    StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
         for(Double d : stats.getSortedValues()) {
    -      StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
           if(bin < splits.size()) {
             double percentileOfBin = provider.getPercentile(splits.get(bin));
             if (d > percentileOfBin) {
               //we aren't the right bin, so let's find the right one.
               // Keep in mind that this value could be more than one bin away from the last good bin.
    -          for(;bin < splits.size() && d > provider.getPercentile(splits.get(bin));bin++) {
    -
    +          while ( bin < splits.size()  &&  d > provider.getPercentile(splits.get(bin)) ) {
    +            //increment the bin number until it includes the target value, or we run out of bins
    +            bin++;
    --- End diff --
    
    This whole block:
    ```
          if(bin < splits.size()) {
            double percentileOfBin = provider.getPercentile(splits.get(bin));
            if (d > percentileOfBin) {
              //we aren't the right bin, so let's find the right one.
              // Keep in mind that this value could be more than one bin away from the last good bin.
              while ( bin < splits.size()  &&  d > provider.getPercentile(splits.get(bin)) ) {
                //increment the bin number until it includes the target value, or we run out of bins
                bin++;
              }
            }
          }
    ``` 
    can be replaced by:
    ```
          while ( bin < splits.size()  &&  d > provider.getPercentile(splits.get(bin)) ) {
            //increment the bin number until it includes the target value, or we run out of bins
            bin++;
          }
    ```




[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93513473
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -112,6 +112,13 @@ functions can be used from everywhere where Stellar is used.
       * Input:
         * stats - The Stellar statistics object
       * Returns: The variance of the values in the window or NaN if the statistics object is null.
    +* `STATS_BIN`
    +  * Description: Computes the bin that the value is in based on the statistical distribution. 
    +  * Input:
    +    * stats - The Stellar statistics object
    +    * value - The value to bin
    +    * range? - A list of percentile bin ranges (excluding min and max) or a string representing a known and common set of bins.  For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg. If this argument is omitted, then we assume a Quartile bin split.
    --- End diff --
    
    Would it make sense, as an option, to also allow binning by raw value instead of percentile?  Same format, again excluding min and max.  Could then apply to any Comparable field.



[GitHub] incubator-metron issue #401: METRON-637: Add a STATS_BIN function to Stellar...

Posted by mmiklavc <gi...@git.apache.org>.
Github user mmiklavc commented on the issue:

    https://github.com/apache/incubator-metron/pull/401
  
    This looks really good! Other than a minor question on one of the stats bin unit tests, it looks good to me. +1 pending comment/clarification on that item.



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93520478
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java ---
    @@ -425,4 +428,74 @@ public Object apply(List<Object> args) {
           return result;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(namespace = "STATS", name = "BIN"
    +          , description = "Computes the bin that the value is in based on the statistical distribution."
    +          , params = {
    +          "stats - The Stellar statistics object"
    +          , "value - The value to bin"
    +          , "range? - A list of percentile bin ranges (excluding min and max) or a string representing a known and common set of bins.  " +
    +          "For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg." +
    +          " If this argument is omitted, then we assume a Quartile bin split."
    +  }
    +          , returns = "Which bin the value falls in such that bin < value < bin + 1"
    --- End diff --
    
    One of the "<" has to be "<=", please.  From code, it's the upper one.
    
    Suggest phrasing as:
    returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    "and values greater than the last bound go in the M'th bin."

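The half-open semantics proposed above, bound(N-1) < value <= bound(N), can be illustrated with a minimal sketch. `BinSketch` and its `getBin` are hypothetical names chosen for this illustration and are not the code in the PR; the bounds are assumed to be sorted in ascending order.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: BinSketch is a hypothetical class, not part of Metron.
final class BinSketch {
  // Returns the smallest N such that value <= bounds.get(N), or bounds.size()
  // when the value is greater than every bound. Assumes ascending bounds.
  static int getBin(double value, List<Double> bounds) {
    int bin = 0;
    while (bin < bounds.size() && value > bounds.get(bin)) {
      bin++;
    }
    return bin;
  }

  public static void main(String[] args) {
    List<Double> bounds = Arrays.asList(25.0, 75.0, 95.0);
    System.out.println(getBin(27.0, bounds)); // 1, because 25 < 27 <= 75
    System.out.println(getBin(25.0, bounds)); // 0, because the upper bound is inclusive
    System.out.println(getBin(99.0, bounds)); // 3, past the last bound
  }
}
```

With three bounds there are four bins, numbered 0 through 3, so the last bin index equals the number of bounds.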


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93636499
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -373,15 +373,16 @@ public void statsBinRunner(List<Double> splits) throws Exception {
     
       public void statsBinRunner(List<Double> splits, String splitsName) throws Exception {
         int bin = 0;
    +    StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
         for(Double d : stats.getSortedValues()) {
    -      StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
           if(bin < splits.size()) {
             double percentileOfBin = provider.getPercentile(splits.get(bin));
             if (d > percentileOfBin) {
               //we aren't the right bin, so let's find the right one.
               // Keep in mind that this value could be more than one bin away from the last good bin.
    -          for(;bin < splits.size() && d > provider.getPercentile(splits.get(bin));bin++) {
    -
    +          while ( bin < splits.size()  &&  d > provider.getPercentile(splits.get(bin)) ) {
    +            //increment the bin number until it includes the target value, or we run out of bins
    +            bin++;
    --- End diff --
    
    Done



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93524927
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -112,6 +112,13 @@ functions can be used from everywhere where Stellar is used.
       * Input:
         * stats - The Stellar statistics object
       * Returns: The variance of the values in the window or NaN if the statistics object is null.
    +* `STATS_BIN`
    +  * Description: Computes the bin that the value is in based on the statistical distribution. 
    +  * Input:
    +    * stats - The Stellar statistics object
    +    * value - The value to bin
    +    * range? - A list of percentile bin ranges (excluding min and max) or a string representing a known and common set of bins.  For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg. If this argument is omitted, then we assume a Quartile bin split.
    --- End diff --
    
    It seems like it would be barely more effort than writing another StellarFunction wrapper, but maybe I'm being optimistic again.  I would support including both here, but if you prefer to separate them I'm fine with that.



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r95372290
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,49 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    --- End diff --
    
    That would make sense




[GitHub] incubator-metron issue #401: METRON-637: Add a STATS_BIN function to Stellar...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/401
  
    I made the corrections suggested and also, as a compromise for performance, I changed things up a bit so that the list munging is optimized a bit:
    * A new list of bins is not created per call in either `BIN` or `STATS_BIN`
    * We use the list as passed, converting the `Number` to `Double` lazily.
    * I do the monotonic increasing check as needed rather than prior to the function.
    
    All that being said, caching would increase the performance, but I think we're in a decent spot at the moment.  I rejiggered the performance driver to give us a distribution of performance characteristics.  Current run with this change is at:
    `Min/25th/50th/75th/Max Milliseconds: 2687.0 / 2700.5 / 2716.0 / 2733.5 / 3730.0`
    
    Thoughts?
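The three optimizations listed above might look roughly like the following sketch. It is an assumption about the shape of the change, not the committed code: `LazyBin` is a hypothetical name, and `Number.doubleValue()` stands in for whatever conversion the PR actually uses.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: bin a value against the caller's list without copying it.
final class LazyBin {
  static int getBin(double value, List<? extends Number> bounds) {
    double prev = Double.NEGATIVE_INFINITY;
    for (int bin = 0; bin < bounds.size(); bin++) {
      // Convert each bound only when the walk actually reaches it.
      double bound = bounds.get(bin).doubleValue();
      // Check ordering as we go instead of in a separate pass.
      if (bound <= prev) {
        throw new IllegalStateException("Bounds must be strictly increasing at index " + bin);
      }
      if (value <= bound) {
        return bin;
      }
      prev = bound;
    }
    return bounds.size();
  }

  public static void main(String[] args) {
    System.out.println(getBin(60.0, Arrays.asList(25, 50.0, 75))); // 2
  }
}
```

One consequence of checking as needed: an out-of-order bound that lies beyond the bin where the value lands is never inspected, so a malformed list is only reported for values that walk far enough to reach the bad entry.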



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93538271
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    +    statsInit(windowSize);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, "'QUARTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, "'QUINTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, "'DECILE'");
    +    statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 75.0]");
    +  }
    +
    +  public void statsBinRunner(List<Double> splits) throws Exception {
    +    statsBinRunner(splits, null);
    +  }
    +
    +  public void statsBinRunner(List<Double> splits, String splitsName) throws Exception {
    +    int bin = 0;
    +    for(Double d : stats.getSortedValues()) {
    +      StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
    +      if(bin < splits.size()) {
    +        double percentileOfBin = provider.getPercentile(splits.get(bin));
    +        if (d > percentileOfBin) {
    --- End diff --
    
    Thanks.



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r95373092
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,49 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    --- End diff --
    
    You got it :)



[GitHub] incubator-metron issue #401: METRON-637: Add a STATS_BIN function to Stellar...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on the issue:

    https://github.com/apache/incubator-metron/pull/401
  
    +1 lgtm



Re: [GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by Matt Foley <ma...@apache.org>.
For some reason my overall comment on this PR failed to be copied to the list:

This is a really nice new feature. It works and is clean. It rates a +1. But I think we are likely to be processing thousands or perhaps millions of data points at any given time, and so the constant re-parsing of the bounds list is troublesome. Also for the STATS_BIN function, the percentile bounds list should be transformed once into a value bounds list, rather than applying the boundFunc many times to (on average) half the bounds in the list, per input value.

Unlike the Profiler, there is no config caching between calls to the Stellar Function, because this "configuration" is done every time, in-line, rather than in ZK. But we don't want all the complexity of ZK just for this little binning function. And we want multiple different binning functions to be in use at the same time, without needing complex scope management.

One solution would be to treat it like Python does with regex, and provide a compiler function. What if we have
COMPILE_BIN(bounds) and COMPILE_STATS_BIN(stats, bounds)
Each would return an opaque key (or an integer) that references a cached pre-parsed setup; it can be thread-safe as it would be read-only. Then we would invoke with
BIN(key, value) and STATS_BIN(key, value)

Normally I would say this is an optimization and we should leave it for later. But then we would be stuck with the inefficient, non-compiled form of the BIN and STATS_BIN functions.

Your call, @cestella . I don't want to get in the way of progress, I just feel obligated to bring it up.
--Matt
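
To make the compile-then-invoke proposal concrete, here is a minimal sketch of how a compiled form could be cached behind an opaque integer key. Everything in it is hypothetical: `CompiledBins`, `compile`, and `bin` are names invented for this illustration, and no such Stellar functions exist in the PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of COMPILE_BIN(bounds) followed by BIN(key, value).
final class CompiledBins {
  private static final Map<Integer, double[]> CACHE = new ConcurrentHashMap<>();
  private static final AtomicInteger NEXT_KEY = new AtomicInteger();

  // Parse the bounds once and hand back an opaque key for later calls.
  static int compile(List<? extends Number> bounds) {
    double[] parsed = new double[bounds.size()];
    for (int i = 0; i < parsed.length; i++) {
      parsed[i] = bounds.get(i).doubleValue();
    }
    int key = NEXT_KEY.getAndIncrement();
    CACHE.put(key, parsed); // the array is never modified after this point
    return key;
  }

  // Per-value work is only the walk over the pre-parsed bounds.
  static int bin(int key, double value) {
    double[] bounds = CACHE.get(key);
    if (bounds == null) {
      throw new IllegalArgumentException("Unknown key: " + key);
    }
    int bin = 0;
    while (bin < bounds.length && value > bounds[bin]) {
      bin++;
    }
    return bin;
  }

  public static void main(String[] args) {
    int key = compile(Arrays.asList(25.0, 50.0, 75.0));
    System.out.println(bin(key, 60.0)); // 2
  }
}
```

A compiled STATS_BIN would follow the same pattern, with the percentile bounds converted to value bounds inside `compile`. This sketch never evicts entries, so a real version would also need a policy for discarding stale keys.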



On 12/21/16, 11:50 PM, "mattf-horton" <gi...@git.apache.org> wrote:

    Github user mattf-horton commented on a diff in the pull request:
    
        https://github.com/apache/incubator-metron/pull/401#discussion_r93574334
      
        --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java ---
        @@ -425,4 +428,61 @@ public Object apply(List<Object> args) {
               return result;
             }
           }
        +
        +  /**
        +   * Calculates the statistical bin that a value falls in.
        +   */
        +  @Stellar(namespace = "STATS", name = "BIN"
        +          , description = "Computes the bin that the value is in based on the statistical distribution."
        +          , params = {
        +          "stats - The Stellar statistics object"
        +          , "value - The value to bin"
        +          , "bounds? - A list of percentile bin bounds (excluding min and max) or a string representing a known and common set of bins.  " +
        +          "For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg." +
        +          " If this argument is omitted, then we assume a Quartile bin split."
        +                    }
        +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
        +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
        +          "and values greater than the last bound go in the M'th bin."
        +  )
        +  public static class StatsBin extends BaseStellarFunction {
        +    public enum BinSplits {
        +      QUARTILE(ImmutableList.of(25.0, 50.0, 75.0)),
        +      QUINTILE(ImmutableList.of(20.0, 40.0, 60.0, 80.0)),
        +      DECILE(ImmutableList.of(10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0))
        +      ;
        +      public final List<Double> split;
        +      BinSplits(List<Double> split) {
        +        this.split = split;
        +      }
        +
        +      public static List<Double> getSplit(Object o) {
        +        if(o instanceof String) {
        +          return BinSplits.valueOf((String)o).split;
        +        }
        +        else if(o instanceof List) {
        +          List<Double> ret = new ArrayList<>();
        +          for(Object valO : (List<Object>)o) {
        +            ret.add(ConversionUtils.convert(valO, Double.class));
        --- End diff --
        
        During this step we need to validate that the bounds list is strictly increasing.
    
    
    
    



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93574334
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java ---
    @@ -425,4 +428,61 @@ public Object apply(List<Object> args) {
           return result;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(namespace = "STATS", name = "BIN"
    +          , description = "Computes the bin that the value is in based on the statistical distribution."
    +          , params = {
    +          "stats - The Stellar statistics object"
    +          , "value - The value to bin"
    +          , "bounds? - A list of percentile bin bounds (excluding min and max) or a string representing a known and common set of bins.  " +
    +          "For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg." +
    +          " If this argument is omitted, then we assume a Quartile bin split."
    +                    }
    +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    +          "and values greater than the last bound go in the M'th bin."
    +  )
    +  public static class StatsBin extends BaseStellarFunction {
    +    public enum BinSplits {
    +      QUARTILE(ImmutableList.of(25.0, 50.0, 75.0)),
    +      QUINTILE(ImmutableList.of(20.0, 40.0, 60.0, 80.0)),
    +      DECILE(ImmutableList.of(10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0))
    +      ;
    +      public final List<Double> split;
    +      BinSplits(List<Double> split) {
    +        this.split = split;
    +      }
    +
    +      public static List<Double> getSplit(Object o) {
    +        if(o instanceof String) {
    +          return BinSplits.valueOf((String)o).split;
    +        }
    +        else if(o instanceof List) {
    +          List<Double> ret = new ArrayList<>();
    +          for(Object valO : (List<Object>)o) {
    +            ret.add(ConversionUtils.convert(valO, Double.class));
    --- End diff --
    
    During this step we need to validate that the bounds list is strictly increasing.
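
A minimal sketch of the validation requested above, performed while the list is being converted. `BoundsValidation` is a hypothetical name, and `Number.doubleValue()` is used here as a stand-in for the `ConversionUtils.convert` call in the diff so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: convert raw bounds to doubles and reject unordered input.
final class BoundsValidation {
  static List<Double> toStrictlyIncreasing(List<?> raw) {
    List<Double> ret = new ArrayList<>();
    for (Object o : raw) {
      if (!(o instanceof Number)) {
        throw new IllegalArgumentException("Bound is not numeric: " + o);
      }
      double d = ((Number) o).doubleValue();
      // NaN compares false against everything, so it needs its own check.
      if (Double.isNaN(d)) {
        throw new IllegalArgumentException("Bound is NaN");
      }
      // Equal neighbours are rejected too: strictly increasing, not just sorted.
      if (!ret.isEmpty() && d <= ret.get(ret.size() - 1)) {
        throw new IllegalArgumentException("Bounds must be strictly increasing: " + raw);
      }
      ret.add(d);
    }
    return ret;
  }

  public static void main(String[] args) {
    System.out.println(toStrictlyIncreasing(Arrays.asList(25, 50.0, 75))); // [25.0, 50.0, 75.0]
  }
}
```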



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93537846
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    +    statsInit(windowSize);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, "'QUARTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, "'QUINTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, "'DECILE'");
    +    statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 75.0]");
    +  }
    +
    +  public void statsBinRunner(List<Double> splits) throws Exception {
    +    statsBinRunner(splits, null);
    +  }
    +
    +  public void statsBinRunner(List<Double> splits, String splitsName) throws Exception {
    +    int bin = 0;
    +    for(Double d : stats.getSortedValues()) {
    +      StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
    +      if(bin < splits.size()) {
    +        double percentileOfBin = provider.getPercentile(splits.get(bin));
    +        if (d > percentileOfBin) {
    +          //we aren't the right bin, so let's find the right one.
    +          // Keep in mind that this value could be more than one bin away from the last good bin.
    +          for(;bin < splits.size() && d > provider.getPercentile(splits.get(bin));bin++) {
    +
    --- End diff --
    
    Yup, thanks.  Then the block (lines 378-387) could be better stated as
    ```
    while ( bin < splits.size()  &&  d > provider.getPercentile(splits.get(bin)) ) {
           //increment the bin number until it includes the target value, or we run out of bins
           bin++; 
    }
    ```



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93636658
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/MathFunctions.java ---
    @@ -59,4 +60,49 @@ public boolean isInitialized() {
           return true;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(name = "BIN"
    +          , description = "Computes the bin that the value is in given a set of bounds."
    +          , params = {
    +           "value - The value to bin"
    +          , "bounds - A list of value bounds (excluding min and max) in sorted order."
    +                    }
    +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    +          "and values greater than the last bound go in the M'th bin."
    +  )
    +  public static class Bin extends BaseStellarFunction {
    +
    +    public static int getBin(double value, List<Double> bins, Function<Integer, Double> boundFunc) {
    --- End diff --
    
    Thanks! :)



[GitHub] incubator-metron issue #401: METRON-637: Add a STATS_BIN function to Stellar...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/401
  
    Ok, so a couple of things.  I ran a quick perf test on this function as it stands.  On my macbook from 4 years ago, I ran the `STATS_BIN` function a million times with random values and it took ~5.5s.  Even at a throughput of 1M messages per second, if we assume that the messages are spread across the cluster, I think this keeps up.
    
    Now, *EVEN GIVEN THIS*, normally I would go to the effort of adding a caching layer to save us the computation of the percentile, but it's actually quite difficult to figure out when two `StatisticsProvider` objects are equivalent without resorting to actually computing percentiles.  I could compare non-computed attributes (number of data points, sum of data points, average), but the former Math grad student in me was uncomfortable with that.  It's very hard to be absolutely sure that two different distributions couldn't share all of those attributes.
    
    Given these things together, I'm going to recommend that we cross the caching bridge when we come to it.  Now, it won't take much to convince me to go the other direction, so if you (or anyone else, really ;) feel strongly, @mattf-horton, I'll go ahead and tackle that dude as best I can.
    
    I'm going to correct the rest of your comments and check in the performance test I ran, so you can see that I have nothing up my sleeves and so it can be run periodically.
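
The concern about using cheap summary attributes as a cache key can be shown with a small example: two data sets that agree on count, sum, and mean but disagree on a percentile. This is only an illustration. `SummaryCollision` is a hypothetical name, and the nearest-rank `percentile` below is a simple stand-in for the sketch-based estimate that `StatisticsProvider` would return.

```java
import java.util.Arrays;

// Hypothetical illustration: identical summaries, different distributions.
final class SummaryCollision {
  // Nearest-rank percentile on a sorted copy of the data.
  static double percentile(double[] data, double p) {
    double[] sorted = data.clone();
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(p / 100.0 * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  static double sum(double[] data) {
    return Arrays.stream(data).sum();
  }

  public static void main(String[] args) {
    double[] a = {1, 2, 3, 4, 5};
    double[] b = {3, 3, 3, 3, 3};
    System.out.println(a.length == b.length);  // true: same count
    System.out.println(sum(a) == sum(b));      // true: same sum, hence same mean
    System.out.println(percentile(a, 25.0));   // 2.0
    System.out.println(percentile(b, 25.0));   // 3.0
  }
}
```

A cache keyed only on count, sum, and mean would treat `a` and `b` as the same provider and return the wrong bin boundaries for one of them.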



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93535574
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    +    statsInit(windowSize);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, "'QUARTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, "'QUINTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, "'DECILE'");
    +    statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 75.0]");
    +  }
    +
    +  public void statsBinRunner(List<Double> splits) throws Exception {
    +    statsBinRunner(splits, null);
    +  }
    +
    +  public void statsBinRunner(List<Double> splits, String splitsName) throws Exception {
    +    int bin = 0;
    +    for(Double d : stats.getSortedValues()) {
    +      StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
    +      if(bin < splits.size()) {
    +        double percentileOfBin = provider.getPercentile(splits.get(bin));
    +        if (d > percentileOfBin) {
    +          //we aren't the right bin, so let's find the right one.
    +          // Keep in mind that this value could be more than one bin away from the last good bin.
    +          for(;bin < splits.size() && d > provider.getPercentile(splits.get(bin));bin++) {
    +
    --- End diff --
    
    Hopefully the explanation above makes things clearer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron issue #401: METRON-637: Add a STATS_BIN function to Stellar...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/401
  
    Thanks for the feedback, @mattf-horton!  I went ahead and incorporated your changes.  For posterity and so this gets replicated to the JIRA, I added a `BIN` function that just takes a list of bounds, not a list of percentiles to compute.  `STATS_BIN` could be encoded with `BIN` if we had a `MAP` function (i.e. `STATS_BIN` == `BIN( value, MAP( &STATS_GET_PERCENTILE(stats, _ ), [ 25.0, 50.0, 75.0]))` where `MAP` takes a function pointer and applies it to a collection).
    
    We do not have this capability yet in Stellar, but it may be worth considering to enable these kinds of use-cases.  Just a thought.



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r95371355
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,49 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    --- End diff --
    
    This test checks that the `STATS_BIN` function operates correctly by taking a sorted list of numbers, walking down it, and ensuring that `STATS_BIN` yields the correct bin for each number.  This is a reasonable test because we are not recomputing the bin the way `STATS_BIN` does.  Instead, we rely on the fact that, since the numbers are sorted, the bin can only increase at the percentile boundaries, which gives us the expected bin without recreating the computation in the `STATS_BIN` function.
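
The testing strategy described above can be sketched outside of the Metron test harness. All names here are hypothetical: `statsBin` is a binary-search stand-in for the function under test, and the nearest-rank `percentile` replaces the real `StatisticsProvider`. The point is that the oracle only carries a running bin forward over sorted input and never restarts the search.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the sorted-walk test oracle.
final class SortedWalkCheck {
  // Nearest-rank percentile; the input must already be sorted.
  static double percentile(double[] sorted, double p) {
    int rank = (int) Math.ceil(p / 100.0 * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  // Stand-in for the function under test: first bound index with value <= bound.
  static int statsBin(double[] sorted, double value, List<Double> pcts) {
    int lo = 0;
    int hi = pcts.size();
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (value > percentile(sorted, pcts.get(mid))) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Oracle: over sorted values the expected bin never decreases, so it is
  // only ever advanced, possibly by more than one bin at a time.
  static boolean walk(double[] sorted, List<Double> pcts) {
    int expected = 0;
    for (double d : sorted) {
      while (expected < pcts.size() && d > percentile(sorted, pcts.get(expected))) {
        expected++;
      }
      if (statsBin(sorted, d, pcts) != expected) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    System.out.println(walk(data, Arrays.asList(25.0, 50.0, 75.0))); // true
  }
}
```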



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93573772
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/MathFunctions.java ---
    @@ -59,4 +60,49 @@ public boolean isInitialized() {
           return true;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(name = "BIN"
    +          , description = "Computes the bin that the value is in given a set of bounds."
    +          , params = {
    +           "value - The value to bin"
    +          , "bounds - A list of value bounds (excluding min and max) in sorted order."
    +                    }
    +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    +          "and values greater than the last bound go in the M'th bin."
    +  )
    +  public static class Bin extends BaseStellarFunction {
    +
    +    public static int getBin(double value, List<Double> bins, Function<Integer, Double> boundFunc) {
    --- End diff --
    
    Nice use of lambdas to expand this into, essentially, the "mapped" function you were talking about.
    Results in a very parsimonious implementation of the STATS_BIN.
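
The body of `getBin` is not shown in the diff context above, so the following is only a guess at how a `boundFunc` parameter lets one routine serve both `BIN` and `STATS_BIN`. The signature is copied from the diff; `BoundFuncSketch` and the nearest-rank `percentile` helper are hypothetical, and the sketch assumes `boundFunc` maps a bin index to the bound value to compare against.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: one binning loop, two ways of producing bounds.
final class BoundFuncSketch {
  static int getBin(double value, List<Double> bins, Function<Integer, Double> boundFunc) {
    int bin = 0;
    while (bin < bins.size() && value > boundFunc.apply(bin)) {
      bin++;
    }
    return bin;
  }

  // Nearest-rank percentile over sorted data, standing in for the statistics object.
  static double percentile(double[] sorted, double p) {
    int rank = (int) Math.ceil(p / 100.0 * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  public static void main(String[] args) {
    // BIN: the list already holds value bounds, so look them up directly.
    List<Double> bounds = Arrays.asList(2.0, 4.0, 8.0);
    System.out.println(getBin(6.0, bounds, bounds::get)); // 2

    // STATS_BIN: the list holds percentiles, mapped to values on demand.
    double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    List<Double> pcts = Arrays.asList(25.0, 50.0, 75.0);
    System.out.println(getBin(6.0, pcts, i -> percentile(data, pcts.get(i)))); // 2
  }
}
```

Because the percentile lookup runs inside the loop, it is repeated for every value that is binned, which is the cost raised in the discussion about precomputing value bounds.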


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93533345
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    +    statsInit(windowSize);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, "'QUARTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, "'QUINTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, "'DECILE'");
    +    statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 75.0]");
    +  }
    +
    +  public void statsBinRunner(List<Double> splits) throws Exception {
    +    statsBinRunner(splits, null);
    +  }
    +
    +  public void statsBinRunner(List<Double> splits, String splitsName) throws Exception {
    +    int bin = 0;
    +    for(Double d : stats.getSortedValues()) {
    +      StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
    +      if(bin < splits.size()) {
    +        double percentileOfBin = provider.getPercentile(splits.get(bin));
    +        if (d > percentileOfBin) {
    +          //we aren't the right bin, so let's find the right one.
    +          // Keep in mind that this value could be more than one bin away from the last good bin.
    +          for(;bin < splits.size() && d > provider.getPercentile(splits.get(bin));bin++) {
    +
    --- End diff --
    
    Right, I'm not getting this.  The bound is being fed into provider.getPercentile, which returns digest.quantile(p/100.0).  Isn't this what should be done to "value" or "d", not the bound?
    And the bound should just be divided by 100?  Maybe I stayed up too late last night.




[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93537026
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    +    statsInit(windowSize);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, "'QUARTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, "'QUINTILE'");
    +    statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, "'DECILE'");
    +    statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 75.0]");
    +  }
    +
    +  public void statsBinRunner(List<Double> splits) throws Exception {
    +    statsBinRunner(splits, null);
    +  }
    +
    +  public void statsBinRunner(List<Double> splits, String splitsName) throws Exception {
    +    int bin = 0;
    +    for(Double d : stats.getSortedValues()) {
    +      StatisticsProvider provider = (StatisticsProvider)variables.get("stats");
    +      if(bin < splits.size()) {
    +        double percentileOfBin = provider.getPercentile(splits.get(bin));
    +        if (d > percentileOfBin) {
    --- End diff --
    
    Yes, `provider.getPercentile(x)` returns the value at percentile `x`.  So, to get the median, you'd call `provider.getPercentile(50.0)`. 
    
    If you're thinking of the inverse (give me the percentile that a value falls at), we don't have that dude yet, but it's called the empirical cumulative distribution function (`ecdf` in R parlance).
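
To make the two directions concrete, here is a minimal sketch in plain Java. It does not use Metron's `StatisticsProvider` or any sketch data structure; `valueAtPercentile` and `percentileOfValue` are hypothetical helpers over a small sorted array, and the nearest-rank rule is an assumption chosen only for simplicity.

```java
import java.util.Arrays;

// Illustration only: exact nearest-rank statistics over a small sample,
// not the sketch-based implementation behind StatisticsProvider.
final class QuantileVsEcdf {

  // Quantile direction: percentile in (0, 100] -> value, using the nearest-rank rule.
  static double valueAtPercentile(double[] sorted, double pct) {
    int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  // ECDF direction: value -> percentage of samples less than or equal to it.
  static double percentileOfValue(double[] sorted, double value) {
    int count = 0;
    for (double d : sorted) {
      if (d <= value) {
        count++;
      }
    }
    return 100.0 * count / sorted.length;
  }

  public static void main(String[] args) {
    double[] sample = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    Arrays.sort(sample);
    System.out.println(valueAtPercentile(sample, 50.0));  // value at the median
    System.out.println(percentileOfValue(sample, 5.0));   // percentile of the value 5
  }
}
```

The test in the diff only needs the first direction: it maps each percentile bound to a value and compares `d` against that value, so no ECDF is required.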



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r95371725
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java ---
    @@ -356,6 +357,49 @@ public void testSkewness() throws Exception {
         assertEquals(stats.getSkewness(), (Double) actual, 0.1);
       }
     
    +  @Test
    +  public void testStatsBin() throws Exception {
    --- End diff --
    
    Given that it's not entirely obvious what this is doing and why it is a valid test, do you think it merits a comment with the above in the test?



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93522344
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java ---
    @@ -425,4 +428,74 @@ public Object apply(List<Object> args) {
           return result;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(namespace = "STATS", name = "BIN"
    +          , description = "Computes the bin that the value is in based on the statistical distribution."
    +          , params = {
    +          "stats - The Stellar statistics object"
    +          , "value - The value to bin"
    +          , "range? - A list of percentile bin ranges (excluding min and max) or a string representing a known and common set of bins.  " +
    +          "For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg." +
    +          " If this argument is omitted, then we assume a Quartile bin split."
    +  }
    +          , returns = "Which bin the value falls in such that bin < value < bin + 1"
    +  )
    +  public static class Bin extends BaseStellarFunction {
    +    public enum BinSplits {
    +      QUARTILE(ImmutableList.of(25.0, 50.0, 75.0)),
    +      QUINTILE(ImmutableList.of(20.0, 40.0, 60.0, 80.0)),
    +      DECILE(ImmutableList.of(10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0))
    +      ;
    +      public final List<Double> split;
    +      BinSplits(List<Double> split) {
    +        this.split = split;
    +      }
    +
    +      public static List<Double> getSplit(Object o) {
    +        if(o instanceof String) {
    +          return BinSplits.valueOf((String)o).split;
    +        }
    +        else if(o instanceof List) {
    +          List<Double> ret = new ArrayList<>();
    +          for(Object valO : (List<Object>)o) {
    +            ret.add(ConversionUtils.convert(valO, Double.class));
    +          }
    +          return ret;
    +        }
    +        throw new IllegalStateException("The split you tried to pass is not a valid split: " + o.toString());
    +      }
    +    }
    +
    +
    +    @Override
    +    public Object apply(List<Object> args) {
    +      StatisticsProvider stats = convert(args.get(0), StatisticsProvider.class);
    +      Double value = convert(args.get(1), Double.class);
    +      List<Double> bins = BinSplits.QUARTILE.split;
    +      if (args.size() > 2) {
    +        bins = BinSplits.getSplit(args.get(2));
    +      }
    +      if (stats == null || value == null || bins.size() == 0) {
    +        return -1;
    +      }
    +
    +      double prevPctile = stats.getPercentile(bins.get(0));
    +
    +      if(value <= prevPctile) {
    +        return 0;
    +      }
    +      for(int bin = 1; bin < bins.size();++bin) {
    +        double pctile = stats.getPercentile(bins.get(bin));
    +        if(value > prevPctile && value <= pctile) {
    --- End diff --
    
    Really no need to check the lower bound here, is there? :-)
    Thus, you can dump "prevPctile" and start at bin = 0.
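
A rough sketch of why the lower-bound check is redundant when the bounds are sorted. The quoted diff is cut off mid-loop, so `binTwoSided` below completes it under the assumption that the previous bound is updated on every iteration and that values above the last bound return the number of bounds. Both method names are hypothetical, and plain value bounds stand in for `stats.getPercentile(...)`.

```java
import java.util.Arrays;
import java.util.List;

final class LowerBoundCheck {

  // Two-sided check, following the shape of the loop under review
  // (the tail of the loop is an assumption, see the note above).
  static int binTwoSided(double value, List<Double> bounds) {
    double prev = bounds.get(0);
    if (value <= prev) {
      return 0;
    }
    for (int bin = 1; bin < bounds.size(); ++bin) {
      double bound = bounds.get(bin);
      if (value > prev && value <= bound) {
        return bin;
      }
      prev = bound;
    }
    return bounds.size();
  }

  // One-sided check: reaching bin N already implies value > bound(N-1),
  // so only the upper bound needs testing and the loop can start at 0.
  static int binOneSided(double value, List<Double> bounds) {
    for (int bin = 0; bin < bounds.size(); ++bin) {
      if (value <= bounds.get(bin)) {
        return bin;
      }
    }
    return bounds.size();
  }

  public static void main(String[] args) {
    List<Double> bounds = Arrays.asList(10.0, 20.0, 30.0);
    for (double v : new double[] {5, 10, 15, 20, 25, 30, 35}) {
      System.out.println(v + " -> " + binTwoSided(v, bounds) + " / " + binOneSided(v, bounds));
    }
  }
}
```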



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93528090
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -112,6 +112,13 @@ functions can be used from everywhere where Stellar is used.
       * Input:
         * stats - The Stellar statistics object
       * Returns: The variance of the values in the window or NaN if the statistics object is null.
    +* `STATS_BIN`
    +  * Description: Computes the bin that the value is in based on the statistical distribution. 
    +  * Input:
    +    * stats - The Stellar statistics object
    +    * value - The value to bin
    +    * range? - A list of percentile bin ranges (excluding min and max) or a string representing a known and common set of bins.  For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg. If this argument is omitted, then we assume a Quartile bin split.
    --- End diff --
    
    Sure!



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mmiklavc <gi...@git.apache.org>.
Github user mmiklavc commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r95368025
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java ---
    @@ -425,4 +428,57 @@ public Object apply(List<Object> args) {
           return result;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(namespace = "STATS", name = "BIN"
    +          , description = "Computes the bin that the value is in based on the statistical distribution."
    +          , params = {
    +          "stats - The Stellar statistics object"
    +          , "value - The value to bin"
    +          , "bounds? - A list of percentile bin bounds (excluding min and max) or a string representing a known and common set of bins.  " +
    +          "For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg." +
    +          " If this argument is omitted, then we assume a Quartile bin split."
    +                    }
    +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    +          "and values greater than the last bound go in the M'th bin."
    +  )
    +  public static class StatsBin extends BaseStellarFunction {
    +    public enum BinSplits {
    +      QUARTILE(ImmutableList.of(25.0, 50.0, 75.0)),
    +      QUINTILE(ImmutableList.of(20.0, 40.0, 60.0, 80.0)),
    +      DECILE(ImmutableList.of(10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0))
    +      ;
    +      public final List<Number> split;
    +      BinSplits(List<Number> split) {
    +        this.split = split;
    +      }
    +
    +      public static List<Number> getSplit(Object o) {
    +        if(o instanceof String) {
    +          return BinSplits.valueOf((String)o).split;
    +        }
    +        else if(o instanceof List) {
    +          return ConversionUtils.convert(o, List.class);
    +        }
    +        throw new IllegalStateException("The split you tried to pass is not a valid split: " + o.toString());
    +      }
    +    }
    +
    +
    +    @Override
    +    public Object apply(List<Object> args) {
    +      StatisticsProvider stats = convert(args.get(0), StatisticsProvider.class);
    +      Double value = convert(args.get(1), Double.class);
    +      final List<Number> bins = args.size() > 2?BinSplits.getSplit(args.get(2)):BinSplits.QUARTILE.split;
    +
    +      if (stats == null || value == null || bins.size() == 0) {
    +        return -1;
    +      }
    +      return MathFunctions.Bin.getBin(value, bins.size(), bin -> stats.getPercentile(bins.get(bin).doubleValue()));
    --- End diff --
    
    Nice suggestion by Matt. And I like the math bin code reuse and ability to plug in a stats function provider.



[GitHub] incubator-metron issue #401: METRON-637: Add a STATS_BIN function to Stellar...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/401
  
    One last bug here.  I was actually right originally to insist on monotonically increasing, rather than strictly increasing, bounds for the bins.  Consider the situation where I create a distribution of a single value `d`.  Every percentile will be `d`.  In that case I would want the function to return `0` for `x` where `x <= d` and the number of bounds (that is, the index of the last bin) for `x` where `x > d`.
    
    I created a test to validate this behavior.
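
The single-value case can be sketched as follows. This is not the test added to the PR; it is a standalone illustration that copies the shape of the `getBin` helper from `MathFunctions` (with `Double.NEGATIVE_INFINITY` as the starting bound), and the lambda is a hypothetical stand-in for `stats.getPercentile(...)` on a one-value distribution.

```java
import java.util.function.Function;

final class DegenerateBins {

  // Same shape as the getBin helper in MathFunctions, with a non-strict
  // ordering check so that equal consecutive bounds are accepted.
  static int getBin(double value, int numBins, Function<Integer, Double> boundFunc) {
    double lastBound = Double.NEGATIVE_INFINITY;
    for (int bin = 0; bin < numBins; ++bin) {
      double bound = boundFunc.apply(bin);
      if (bound < lastBound) {
        throw new IllegalStateException("Your bins must be monotonically increasing");
      }
      if (value <= bound) {
        return bin;
      }
      lastBound = bound;
    }
    return numBins;
  }

  public static void main(String[] args) {
    double d = 42.0;  // hypothetical single observed value
    // Every percentile of a one-value distribution is d.
    Function<Integer, Double> everyPercentileIsD = bin -> d;
    System.out.println(getBin(41.0, 3, everyPercentileIsD));
    System.out.println(getBin(42.0, 3, everyPercentileIsD));
    System.out.println(getBin(43.0, 3, everyPercentileIsD));
  }
}
```

With a strict check (`bound <= lastBound`), the third call would throw on the second bound instead of returning the last bin.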



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93523813
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -112,6 +112,13 @@ functions can be used from everywhere where Stellar is used.
       * Input:
         * stats - The Stellar statistics object
       * Returns: The variance of the values in the window or NaN if the statistics object is null.
    +* `STATS_BIN`
    +  * Description: Computes the bin that the value is in based on the statistical distribution. 
    +  * Input:
    +    * stats - The Stellar statistics object
    +    * value - The value to bin
    +    * range? - A list of percentile bin ranges (excluding min and max) or a string representing a known and common set of bins.  For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg. If this argument is omitted, then we assume a Quartile bin split.
    --- End diff --
    
    It would, for sure, but probably not called `STATS_BIN`.  I was thinking of adding a proper `BIN` function as a follow-on and refactoring this one to use it.  Or do you think we should just bite the bullet and create it all here?



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93683645
  
    --- Diff: metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StatisticalBinningPerformanceDriver.java ---
    @@ -0,0 +1,76 @@
    +/*
    + *
    + *  Licensed to the Apache Software Foundation (ASF) under one
    + *  or more contributor license agreements.  See the NOTICE file
    + *  distributed with this work for additional information
    + *  regarding copyright ownership.  The ASF licenses this file
    + *  to you under the Apache License, Version 2.0 (the
    + *  "License"); you may not use this file except in compliance
    + *  with the License.  You may obtain a copy of the License at
    + *
    + *      http://www.apache.org/licenses/LICENSE-2.0
    + *
    + *  Unless required by applicable law or agreed to in writing, software
    + *  distributed under the License is distributed on an "AS IS" BASIS,
    + *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + *  See the License for the specific language governing permissions and
    + *  limitations under the License.
    + *
    + */
    +package org.apache.metron.statistics;
    +
    +import com.google.common.collect.ImmutableList;
    +import org.apache.commons.math3.random.GaussianRandomGenerator;
    +import org.apache.commons.math3.random.MersenneTwister;
    +import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +import java.util.Random;
    +
    +/**
    + * This is a driver to drive evaluation of the performance characteristics of the STATS_BIN stellar function.
    + * It gets the distribution of the time it takes to calculate the bin of a million random numbers against the quintile bins
    + * of a statistical distribution of 10000 normally distributed reals between [-1000, 1000].
    + *
    + * On my 4 year old macbook pro, the values came out to be
    + *
    + * Min/25th/50th/75th/Max Milliseconds: 2687.0 / 2700.5 / 2716.0 / 2733.5 / 3730.0
    + */
    --- End diff --
    
    Great to have this.



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93679946
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/MathFunctions.java ---
    @@ -59,4 +60,45 @@ public boolean isInitialized() {
           return true;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(name = "BIN"
    +          , description = "Computes the bin that the value is in given a set of bounds."
    +          , params = {
    +           "value - The value to bin"
    +          , "bounds - A list of value bounds (excluding min and max) in sorted order."
    +                    }
    +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    +          "and values greater than the last bound go in the M'th bin."
    +  )
    +  public static class Bin extends BaseStellarFunction {
    +
    +    public static int getBin(double value, int numBins, Function<Integer, Double> boundFunc) {
    +      double lastBound = Long.MIN_VALUE;
    --- End diff --
    
    Double.NEGATIVE_INFINITY or -Double.MAX_VALUE would be better than Long.MIN_VALUE
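
A small sketch of the failure mode, assuming a first bound that lies below `Long.MIN_VALUE` (about -9.22e18 once widened to `double`). The helper name `rejectedAsOutOfOrder` is hypothetical; it only reproduces the `bound < lastBound` comparison from the loop.

```java
final class LowestBound {

  // True when a first bound would fail the "bound < lastBound" ordering
  // check against the given starting sentinel.
  static boolean rejectedAsOutOfOrder(double sentinel, double firstBound) {
    return firstBound < sentinel;
  }

  public static void main(String[] args) {
    double firstBound = -1.0e19;  // hypothetical bound below Long.MIN_VALUE
    System.out.println(rejectedAsOutOfOrder(Long.MIN_VALUE, firstBound));
    System.out.println(rejectedAsOutOfOrder(-Double.MAX_VALUE, firstBound));
    System.out.println(rejectedAsOutOfOrder(Double.NEGATIVE_INFINITY, firstBound));
  }
}
```

Only the `Long.MIN_VALUE` sentinel rejects this perfectly valid first bound; the two `double` sentinels accept any finite value.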



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93636573
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java ---
    @@ -425,4 +428,61 @@ public Object apply(List<Object> args) {
           return result;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(namespace = "STATS", name = "BIN"
    +          , description = "Computes the bin that the value is in based on the statistical distribution."
    +          , params = {
    +          "stats - The Stellar statistics object"
    +          , "value - The value to bin"
    +          , "bounds? - A list of percentile bin bounds (excluding min and max) or a string representing a known and common set of bins.  " +
    +          "For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg." +
    +          " If this argument is omitted, then we assume a Quartile bin split."
    +                    }
    +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    +          "and values greater than the last bound go in the M'th bin."
    +  )
    +  public static class StatsBin extends BaseStellarFunction {
    +    public enum BinSplits {
    +      QUARTILE(ImmutableList.of(25.0, 50.0, 75.0)),
    +      QUINTILE(ImmutableList.of(20.0, 40.0, 60.0, 80.0)),
    +      DECILE(ImmutableList.of(10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0))
    +      ;
    +      public final List<Double> split;
    +      BinSplits(List<Double> split) {
    +        this.split = split;
    +      }
    +
    +      public static List<Double> getSplit(Object o) {
    +        if(o instanceof String) {
    +          return BinSplits.valueOf((String)o).split;
    +        }
    +        else if(o instanceof List) {
    +          List<Double> ret = new ArrayList<>();
    +          for(Object valO : (List<Object>)o) {
    +            ret.add(ConversionUtils.convert(valO, Double.class));
    --- End diff --
    
    Done, but not exactly in that spot.



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93520475
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java ---
    @@ -425,4 +428,74 @@ public Object apply(List<Object> args) {
           return result;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(namespace = "STATS", name = "BIN"
    +          , description = "Computes the bin that the value is in based on the statistical distribution."
    +          , params = {
    +          "stats - The Stellar statistics object"
    +          , "value - The value to bin"
    +          , "range? - A list of percentile bin ranges (excluding min and max) or a string representing a known and common set of bins.  " +
    +          "For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg." +
    +          " If this argument is omitted, then we assume a Quartile bin split."
    --- End diff --
    
    This is a list of bounds rather than ranges.  More precisely, the N'th bound specifies the closed upper bound of the N'th bin, where N is zero-indexed; and if there is a list of M bounds then there are M+1 bins, numbered 0 through M. (Bin M is the bin for values greater than the last bound in the list.)
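
The numbering can be spelled out with a short sketch. `describeBins` is a hypothetical helper, not part of the PR; it lists the interval each bin covers, in whatever units the bounds are expressed in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BinLayout {

  // Lists the interval covered by each bin for M sorted bounds:
  // M + 1 bins, numbered 0..M, each closed on its upper bound except the last.
  static List<String> describeBins(List<Double> bounds) {
    List<String> bins = new ArrayList<>();
    String lower = "-inf";
    for (double bound : bounds) {
      bins.add("(" + lower + ", " + bound + "]");
      lower = String.valueOf(bound);
    }
    bins.add("(" + lower + ", +inf)");
    return bins;
  }

  public static void main(String[] args) {
    List<String> bins = describeBins(Arrays.asList(25.0, 50.0, 75.0));
    for (int n = 0; n < bins.size(); ++n) {
      System.out.println("bin " + n + ": " + bins.get(n));
    }
  }
}
```

For the three quartile bounds this lists four bins, with bin 3 holding everything above 75.0.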



[GitHub] incubator-metron issue #401: METRON-637: Add a STATS_BIN function to Stellar...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/401
  
    Hmm, well, the idea is to get the statistical bin for a given value.  The sampling portion is happening in the profiler as you construct the statistical distribution.  This is just querying the already constructed distribution.  This is intended to be an input into a model and to give a sense of where a value falls in the distribution of values that have preceded it, so if you want a message to be used by your model, then it'll need to be computed and passed.
    
    I'll give you an example that may help clarify.  Let's say I'm building a really, really naive model over DNS requests to determine whether a DNS request is being made for a synthetic domain created by a botnet.  One of the features I might be interested in is the length of the domain.  However, I may also want to get a sense of whether this domain's length is outside of the norm.  To do that, I'd create a profile that captures the statistical distribution of the lengths of the domains that have been requested in the past.  Then, when I'm calling the model (which is deployed via Model as a Service), I can use this function to pass the statistical bin that the length falls into (e.g. min to 25th percentile, 25th to 50th, 50th to 75th, 75th to 95th, or 95th to max).  So, every DNS request really needs to be scored in this scenario.
    
    I hope that makes sense.
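
As a rough sketch of that scenario in plain Java: everything here is hypothetical, including the made-up history of domain lengths, the helper names, and the exact nearest-rank percentile that stands in for the profiler's statistical sketch.

```java
import java.util.Arrays;

final class DomainLengthBin {

  // Nearest-rank value at a percentile; an exact stand-in for a statistical sketch.
  static double valueAt(int[] sortedLengths, double pct) {
    int rank = (int) Math.ceil(pct / 100.0 * sortedLengths.length);
    return sortedLengths[Math.max(rank, 1) - 1];
  }

  // Bin index of a domain's length against the percentile bounds of past lengths.
  static int lengthBin(String domain, int[] pastLengths, double[] pctBounds) {
    int[] sorted = pastLengths.clone();
    Arrays.sort(sorted);
    for (int bin = 0; bin < pctBounds.length; ++bin) {
      if (domain.length() <= valueAt(sorted, pctBounds[bin])) {
        return bin;
      }
    }
    return pctBounds.length;
  }

  public static void main(String[] args) {
    int[] pastLengths = {8, 9, 10, 10, 11, 11, 12, 12, 13, 14};  // made-up history
    double[] bounds = {25.0, 50.0, 75.0, 95.0};
    System.out.println(lengthBin("example.com", pastLengths, bounds));
    System.out.println(lengthBin("xkq7v9z2lq0p8w3r5t1y.com", pastLengths, bounds));
  }
}
```

The bin index, rather than the raw length, is what would be handed to the model as a feature.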



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93683449
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/MathFunctions.java ---
    @@ -59,4 +60,45 @@ public boolean isInitialized() {
           return true;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(name = "BIN"
    +          , description = "Computes the bin that the value is in given a set of bounds."
    +          , params = {
    +           "value - The value to bin"
    +          , "bounds - A list of value bounds (excluding min and max) in sorted order."
    +                    }
    +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    +          "and values greater than the last bound go in the M'th bin."
    +  )
    +  public static class Bin extends BaseStellarFunction {
    +
    +    public static int getBin(double value, int numBins, Function<Integer, Double> boundFunc) {
    +      double lastBound = Long.MIN_VALUE;
    +      for(int bin = 0; bin < numBins;++bin) {
    +        double bound = boundFunc.apply(bin);
    +        if(bound < lastBound) {
    +          throw new IllegalStateException("Your bins must be monotonically increasing");
    +        }
    +        if(value <= bound) {
    +          return bin;
    +        }
    +        lastBound = bound;
    +      }
    +      return numBins;
    +    }
    --- End diff --
    
    Beautiful.  Definitely pared down to the minimum.



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93703407
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/MathFunctions.java ---
    @@ -59,4 +60,45 @@ public boolean isInitialized() {
           return true;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(name = "BIN"
    +          , description = "Computes the bin that the value is in given a set of bounds."
    +          , params = {
    +           "value - The value to bin"
    +          , "bounds - A list of value bounds (excluding min and max) in sorted order."
    +                    }
    +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    +          "and values greater than the last bound go in the M'th bin."
    +  )
    +  public static class Bin extends BaseStellarFunction {
    +
    +    public static int getBin(double value, int numBins, Function<Integer, Double> boundFunc) {
    +      double lastBound = Long.MIN_VALUE;
    --- End diff --
    
    done



[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

Posted by mattf-horton <gi...@git.apache.org>.
Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/401#discussion_r93681129
  
    --- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/MathFunctions.java ---
    @@ -59,4 +60,45 @@ public boolean isInitialized() {
           return true;
         }
       }
    +
    +  /**
    +   * Calculates the statistical bin that a value falls in.
    +   */
    +  @Stellar(name = "BIN"
    +          , description = "Computes the bin that the value is in given a set of bounds."
    +          , params = {
    +           "value - The value to bin"
    +          , "bounds - A list of value bounds (excluding min and max) in sorted order."
    +                    }
    +          ,returns = "Which bin N the value falls in such that bound(N-1) < value <= bound(N). " +
    +          "No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, " +
    +          "and values greater than the last bound go in the M'th bin."
    +  )
    +  public static class Bin extends BaseStellarFunction {
    +
    +    public static int getBin(double value, int numBins, Function<Integer, Double> boundFunc) {
    +      double lastBound = Long.MIN_VALUE;
    +      for(int bin = 0; bin < numBins;++bin) {
    +        double bound = boundFunc.apply(bin);
    +        if(bound < lastBound) {
    +          throw new IllegalStateException("Your bins must be monotonically increasing");
    --- End diff --
    
    If the stats model is sparse, can two bounds have the same value, and it's okay?  Otherwise,
    although it's pedantic, we really should test for (bound <= lastBound) and say
    "Your bins must be strictly increasing" rather than monotonically increasing.
    But I don't fully understand how the stats models work, so maybe it's better this way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---