Posted to issues@metron.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/08/25 12:54:20 UTC

[jira] [Commented] (METRON-392) Allow User to Define Custom 'Group By' for a Profile

    [ https://issues.apache.org/jira/browse/METRON-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436782#comment-15436782 ] 

ASF GitHub Bot commented on METRON-392:
---------------------------------------

GitHub user nickwallen opened a pull request:

    https://github.com/apache/incubator-metron/pull/230

    METRON-392 Allow User to Define Custom 'Group By' for a Profile

    ### [METRON-392](https://issues.apache.org/jira/browse/METRON-392)
    
    Allows a user to optionally define a custom set of 'groupBy' expressions that controls how the data is persisted.  This is intended to allow for contiguous scans when training on subsets of the data. 
    
    The 'groupBy' expressions can refer to any field within a `ProfileMeasurement`.  This includes the following fields: 
      * `profileName`: The name of the profile.
      * `entity`: The name of the entity being profiled.
      * `start`: The window start time in milliseconds from the epoch.
      * `end`: The window end time in milliseconds from the epoch.
      * `value`: The summary value calculated over the window period.
      * `groupBy`: The set of 'groupBy' expressions; not the result of those expressions.
    
    A common use case would be grouping the data by day of week.  This would allow a contiguous scan to access all profile data for Mondays only.  The Stellar expression `DAY_OF_WEEK(start)` would achieve this. 
    
    *NOTE*: A series of date functions will be added to Stellar in a follow-on PR to enhance the types of groups that can be created.
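
Because the date functions are not yet part of Stellar, the following is only a minimal Python sketch of what a day-of-week grouping over the window `start` could compute. The function name `day_of_week`, the use of UTC, and the ISO numbering (Monday=1 through Sunday=7) are all assumptions made for illustration; the eventual Stellar function may number days differently or use another time zone.

```python
from datetime import datetime, timezone


def day_of_week(epoch_millis: int) -> int:
    """Return the ISO day of week (Monday=1 ... Sunday=7), evaluated in UTC.

    Hypothetical stand-in for a Stellar DAY_OF_WEEK function; the numbering
    and time zone that Stellar uses may differ.
    """
    moment = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return moment.isoweekday()


# Window start for 2016-08-22T00:00:00Z, which fell on a Monday.
start = 1471824000000
print(day_of_week(start))
```

Every window that starts on the same weekday yields the same group value, which is what lets the stored measurements for that weekday sit next to each other.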
    
    ### Example
    ```
    {
      "inputTopic": "indexing",
      "profiles": [
        {
          "profile": "example3",
          "foreach": "ip_src_addr",
          "onlyif": "protocol == 'HTTP'",
          "groupBy": "DAY_OF_WEEK(start)",
          "update": { "s": "STATS_ADD(s, length)" },
          "result": "STATS_MEAN(s)"
        }
      ]
    }
    ```
    
    ### Testing
    To test this change, do the following:
    * Create a profile and do not define a 'groupBy' expression.  Prior to this change, the row key included the day of week, week of month, etc., which altered how the data was sorted on disk.  After this change, these fields are no longer included in the row key.
    * Create a profile and define a 'groupBy' expression.  The result of this expression will be embedded in the row key.
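
To see why embedding the group result in the row key helps, here is a rough Python sketch under stated assumptions. The key layout (`profile|entity|groups|period`), the `|` separator, and the helper names `row_key` and `prefix_scan` are hypothetical and are not the Profiler's actual row key format; a sorted Python list stands in for HBase's lexicographic ordering of row keys.

```python
from bisect import bisect_left


def row_key(profile, entity, groups, period):
    """Build a sortable key: profile, entity, group values, then the period.

    Hypothetical layout for illustration only; the real row key format
    is not described in this message.
    """
    parts = [profile, entity, *map(str, groups), f"{period:012d}"]
    return "|".join(parts)


def prefix_scan(sorted_keys, prefix):
    """Emulate a contiguous scan: seek to the prefix, read until it stops matching."""
    i = bisect_left(sorted_keys, prefix)
    matched = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matched.append(sorted_keys[i])
        i += 1
    return matched


# Fourteen daily periods; the group value stands in for a day of week (1..7).
keys = sorted(
    row_key("example3", "10.0.0.1", [period % 7 + 1], period)
    for period in range(14)
)

# All rows for group 1 are adjacent, so a single seek-and-read returns them.
group_one = prefix_scan(keys, "example3|10.0.0.1|1|")
print(group_one)
```

With an empty `groups` list the key reduces to profile, entity, and period, so rows sort purely by time and the rows for one weekday are no longer adjacent.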


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nickwallen/incubator-metron METRON-392

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-metron/pull/230.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #230
    
----
commit f256fe7461a0a8273b0a9f9e2d10a01c7c53c473
Author: Nick Allen <ni...@nickallen.org>
Date:   2016-08-23T17:03:15Z

    METRON-372 Enhance Statistical Operations Available for Use with the Profiler

commit fc38cb8faf8970d6c8563e43cdf48158cc03cbda
Author: Nick Allen <ni...@nickallen.org>
Date:   2016-08-23T17:18:26Z

    METRON-377 Enable Profiles that Use Non-Single Pass Summary Functions

commit 9ee905ea7b03d13ac512a557a270d12be332a4b8
Author: Nick Allen <ni...@nickallen.org>
Date:   2016-08-22T14:52:34Z

    METRON-392 Allow User to Define Custom 'Group By' for a Profile

commit bfc01e17820894541803255a617a4b7a7804e04e
Author: Nick Allen <ni...@nickallen.org>
Date:   2016-08-25T11:47:49Z

    METRON-392 Merged with master

----


> Allow User to Define Custom 'Group By' for a Profile
> ----------------------------------------------------
>
>                 Key: METRON-392
>                 URL: https://issues.apache.org/jira/browse/METRON-392
>             Project: Metron
>          Issue Type: Improvement
>            Reporter: Nick Allen
>            Assignee: Nick Allen
>              Labels: profiler
>
> Models built from Profile data are most often trained and scored on subsets or segments of the data rather than on all of it.  For example, Mondays often look very different from Sundays, so a model that scores a Monday is trained only on data from previous Mondays.
> The current Profiler implementation embeds the day of week, week of month, month, and year in the row key before storing the data in HBase.  This is intended to sort the data to allow for a contiguous scan when training on subsets of the data.  For example, a read might need to pull in data from Mondays only.
> The problem with this approach is that properly segmenting the data for the specific problem at hand is as important to building an effective model as feature selection.  Segmenting on day of week, week of month, etc. will not be applicable for many models built by a user.
> In addition, no single way of segmenting the data applies to all Profiles.  Each Profile is likely to need its data segmented differently.
> Users will also need to segment the data by elements that only make sense in their specific environment.  For example, a company will have its own holiday calendar or specific 'end-of-month' processing days that need to be taken into account.  A user needs to be able to apply these custom elements to how the data is segmented.
> This change will allow a user to customize, as part of a Profile definition, how the data is grouped when stored in HBase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)