Posted to dev@metron.apache.org by Nick Allen <ni...@nickallen.org> on 2018/09/19 15:14:46 UTC

[DISCUSS] Batch Profiler Feature Branch

I would like to open a discussion to get the Batch Profiler feature branch
merged into master as part of METRON-1699 [1] Create Batch Profiler. All
of the work that I had in mind for our first draft of the Batch Profiler
has been completed. Please take a look through what I have and let me know
if there are other features that you think are required *before* we merge.

Previous list discussions on this topic include [2] and [3].

(Q) What can I do with the feature branch?

* With the Batch Profiler, you can backfill/seed profiles using archived
telemetry. This enables the following types of use cases.

1. As a Security Data Scientist, I want to understand the historical
behaviors and trends of a profile that I have created so that I can
determine if I have created a feature set that has predictive value for
model building.

2. As a Security Data Scientist, I want to understand the historical
behaviors and trends of a profile that I have created so that I can
determine if I have defined the profile correctly and created a feature set
that matches reality.

3. As a Security Platform Engineer, I want to generate a profile
using archived telemetry when I deploy a new model to production so that
models depending on that profile can function on day 1.

* METRON-1699 [1] includes a more detailed description of the feature.

(Q) What work was completed?

* The Batch Profiler runs on Spark and was implemented in Java to remain
consistent with our current Java-heavy code base.

* The Batch Profiler is executed from the command line. It can be
launched using a script or, for advanced users, by calling `spark-submit`
directly.

* Input telemetry can be consumed from multiple sources, for example HDFS
or the local file system.

* Input telemetry can be consumed in multiple formats, for example JSON
or ORC.

* The 'output' profile measurements are persisted in HBase and are
consistent with those written by the Storm Profiler.

* It can be run on any underlying engine supported by Spark. I have
tested it both in 'local' mode and on a YARN cluster.

* It is installed automatically by the Metron MPack.

* A README was added that documents usage instructions.

* The existing Profiler code was refactored so that as much code as
possible is shared between the 3 Profiler ports: Storm, the Stellar REPL,
and Spark. For example, the logic which determines the timestamp of a
message was refactored so that it could be reused by all ports.

    * metron-profiler-common: The common Profiler code shared among
      all ports.
    * metron-profiler-storm: Profiler on Storm
    * metron-profiler-spark: Profiler on Spark
    * metron-profiler-repl: Profiler on the Stellar REPL
    * metron-profiler-client: The client code for retrieving profile
      data, for example PROFILE_GET.

* There are 3 separate RPM and DEB packages now created for the Profiler.

* metron-profiler-storm-*.rpm
* metron-profiler-spark-*.rpm
* metron-profiler-repl-*.rpm

* The Profiler integration tests were enhanced to leverage the Profiler
Client logic to validate the results.

* Review METRON-1699 [1] for a complete break-down of the tasks that have
been completed on the feature branch.
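
As a rough illustration of the `spark-submit` launch path mentioned above,
the sketch below assembles the command for 'local' mode and for YARN. Only
`spark-submit` and its `--master` and `--class` options are standard Spark.
`PROFILER_JAR`, `PROFILER_MAIN_CLASS`, and the `DRY_RUN` switch are
placeholders invented for this example, so take the real jar and class names
from the README on the feature branch.

```shell
#!/bin/sh
# Sketch only: assemble a spark-submit command line for the Batch Profiler.
# PROFILER_JAR and PROFILER_MAIN_CLASS are placeholders, not real names.
SPARK_HOME="${SPARK_HOME:-/usr/hdp/current/spark2-client}"
PROFILER_JAR="${PROFILER_JAR:-/path/to/batch-profiler.jar}"
PROFILER_MAIN_CLASS="${PROFILER_MAIN_CLASS:-example.BatchProfilerMain}"
DRY_RUN="${DRY_RUN:-1}"

# $1 is the Spark master; any remaining arguments are passed through to the
# application untouched.
launch_batch_profiler() {
  master="$1"
  shift
  set -- "$SPARK_HOME/bin/spark-submit" --master "$master" \
         --class "$PROFILER_MAIN_CLASS" "$PROFILER_JAR" "$@"
  if [ "$DRY_RUN" = "1" ]; then
    # Dry run: show the command instead of executing it.
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

launch_batch_profiler 'local[*]'   # single-machine run ('local' mode)
launch_batch_profiler yarn         # run on a YARN cluster
```

With `DRY_RUN=1` (the default here) the two commands are only printed, which
makes it easy to check the assembled command line before running it for real.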

(Q) What limitations exist?

* You must manually install Spark to use the Batch Profiler. The Metron
MPack does not treat Spark as a Metron dependency and so does not install
it automatically.

* You do not configure the Batch Profiler in Ambari. It is configured
and executed completely from the command-line.

* To run the Batch Profiler in 'Full Dev', you have to take the following
manual steps. Some of these are arguably limitations with how Ambari
installs Spark 2 in the version of HDP that we run.

1. Install Spark 2 using Ambari.

2. Tell Spark how to talk with HBase.

    SPARK_HOME=/usr/hdp/current/spark2-client
    cp /usr/hdp/current/hbase-client/conf/hbase-site.xml "$SPARK_HOME/conf/"

3. Create the Spark History directory in HDFS.

    export HADOOP_USER_NAME=hdfs
    hdfs dfs -mkdir /spark2-history

4. Change the default input path to `hdfs://localhost:8020/...` to
match the port defined by HDP, instead of port 9000.
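
Steps 2 and 3 above could be wrapped in a small script along these lines.
This is only a sketch: the `run` helper, the `HBASE_SITE` variable, and the
`DRY_RUN` switch are additions for this example (commands are printed, not
executed, unless `DRY_RUN=0`), and `-p` is added to `hdfs dfs -mkdir` so that
a second run does not fail when the directory already exists. Step 4 is a
configuration change and is not covered here.

```shell
#!/bin/sh
# Sketch only: automate steps 2 and 3 of the 'Full Dev' setup.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
SPARK_HOME="${SPARK_HOME:-/usr/hdp/current/spark2-client}"
HBASE_SITE="${HBASE_SITE:-/usr/hdp/current/hbase-client/conf/hbase-site.xml}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

prepare_full_dev() {
  # Step 2: give Spark the HBase client configuration.
  run cp "$HBASE_SITE" "$SPARK_HOME/conf/"
  # Step 3: create the Spark History directory as the 'hdfs' user.
  run env HADOOP_USER_NAME=hdfs hdfs dfs -mkdir -p /spark2-history
}

prepare_full_dev
```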

[1] https://issues.apache.org/jira/browse/METRON-1699
[2]
https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
[3]
https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Nick Allen <ni...@nickallen.org>.

> It's just cleaner from a usage/management perspective to say "I want to put
> a profile in prod, just use streaming profiler and the batch profiler with
> the same setup and they're good to go."

Agreed.  I can add it.  It would be a simple addition.

On Thu, Sep 20, 2018 at 12:49 PM Justin Leet <ju...@gmail.com> wrote:

> I think the main difference between this and the flatfile loader is that we
> actively maintain our profiles in ZK for streaming.  Doing this from files
> is likely going to be the main usage, particularly for speculative usage.
>
> For me, the main use case for ZK is definitely use case 3.
>
> I can definitely be persuaded that this isn't a blocker for right now, but
> I think there will be problems in practice from not having the
> functionality. E.g. "We want to refresh everything because of mistake X,
> and nobody refreshed the file/ZK and they've diverged".  While nobody likes
> to refresh prod data (or some subset), I have seen it happen in literally
> every single project I've worked on.  On dev/integration environments this
> is even more likely.  Most people probably aren't going to store these
> files in their version control (even though they probably should) and these
> sorts of divergences will happen.
>
> It's just cleaner from a usage/management perspective to say "I want to
> put a profile in prod, just use streaming profiler and the batch profiler
> with the same setup and they're good to go."

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Justin Leet <ju...@gmail.com>.

I think the main difference between this and the flatfile loader is that we
actively maintain our profiles in ZK for streaming.  Doing this from files
is likely going to be the main usage, particularly for speculative usage.

For me, the main use case for ZK is definitely use case 3.

I can definitely be persuaded that this isn't a blocker for right now, but
I think there will be problems in practice from not having the
functionality. E.g. "We want to refresh everything because of mistake X,
and nobody refreshed the file/ZK and they've diverged".  While nobody likes
to refresh prod data (or some subset), I have seen it happen in literally
every single project I've worked on.  On dev/integration environments this
is even more likely.  Most people probably aren't going to store these
files in their version control (even though they probably should) and these
sorts of divergences will happen.

It's just cleaner from a usage/management perspective to say "I want to
put a profile in prod, just use streaming profiler and the batch profiler
with the same setup and they're good to go."
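
One low-tech way to catch the file/ZK divergence described above is to
compare the two copies before a batch run. The sketch below assumes the
definition held in ZooKeeper has already been exported to a local file by
some other means. `check_profiles_in_sync` and the file names are made up
for this example, and the comparison is byte-for-byte, so a formatting-only
difference is also reported as divergence.

```shell
#!/bin/sh
# Sketch only: report whether two copies of a profile definition differ.
# Both arguments are ordinary files. Exporting the ZooKeeper copy to a file
# is deployment-specific and is not shown here.
check_profiles_in_sync() {
  if cmp -s "$1" "$2"; then
    echo "in sync"
  else
    echo "diverged"
    return 1
  fi
}

# Demo with two throwaway files holding the same illustrative content.
tmp="$(mktemp -d)"
printf '%s\n' '{"profiles": []}' > "$tmp/from-file.json"
printf '%s\n' '{"profiles": []}' > "$tmp/from-zk.json"
check_profiles_in_sync "$tmp/from-file.json" "$tmp/from-zk.json"   # prints "in sync"
rm -rf "$tmp"
```

The non-zero exit status on divergence lets a wrapper script refuse to start
the batch job until the two copies have been reconciled.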

On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <michael.miklavcic@gmail.com> wrote:

> Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling
> at this thread just a bit more...
>
>    1. I have an existing system that's been up a while, and I have added k
>    profiles - assume these are the first profiles I've created.
>       1. I would have t0 - tm (where m is the time when the profiles were
>       first installed) worth of data that has not been profiled yet.
>       2. The batch profiler process would be to take that exact profile
>       definition from ZK and run the batch loader with that from the CLI.
>       3. Profiles are now up to date from t0 - tCurrent
>    2. I've already done #1 above. Time goes by and now I want to add a new
>    profile.
>       1. Same first step above
>       2. I would run the batch loader with *only* that new profile
>       definition to seed?
>
> Forgive me if I missed this in PR's and discussion in the FB, but how do we
> establish "tm" from 1.1 above? Any concerns about overlap or gaps after the
> seeding is performed?
>
> On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <ni...@nickallen.org> wrote:
>
> > I think more often than not, you would want to load your profile
> > definition from a file.  This is why I considered the 'load from Zk'
> > more of a nice-to-have.
> >
> >    - In use case 1 and 2, this would definitely be the case.  The
> >    profiles I am working with are speculative and I am using the batch
> >    profiler to determine if they are worth keeping.  In this case, my
> >    speculative profiles would not be in Zk (yet).
> >    - In use case 3, I could see it go either way.  It might be useful to
> >    load from Zk, but it certainly isn't a blocker.
> >
> >
> > > So if the config does not correctly match the profiler config held in
> > > ZK and the user runs the batch seeding job, what happens?
> >
> > You would just get a profile that is slightly different over the entire
> > time span.  This is not a new risk.  If the user changes their Profile
> > definitions in Zk, the same thing would happen.
> >
> >
> > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <michael.miklavcic@gmail.com> wrote:
> >
> > > I think I'm torn on this, specifically because it's batch and would
> > > generally be run as-needed. Justin, can you elaborate on your concerns
> > > there? This feels functionally very similar to our flat file loaders,
> > > which all have inputs for config from the CLI only. On the other hand,
> > > our flat file loaders are not typically seeding an existing structure.
> > > My concern of a local file profiler config stems from this stated goal:
> > > > The goal would be to enable “profile seeding” which allows profiles
> > > > to be populated from a time before the profile was created.
> > > So if the config does not correctly match the profiler config held in
> > > ZK and the user runs the batch seeding job, what happens?
> > >
> > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <ju...@gmail.com> wrote:
> > >
> > > > The profile not being able to read from ZK feels like a fairly
> > > > substantial, if subtle, set of potential problems.  I'd like to see
> > > > that in either before merging or at least pretty soon after merging.
> > > > Is it a lot of work to add that functionality based on where things
> > > > are right now?
> > > >
> > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <ni...@nickallen.org> wrote:
> > > >
> > > > > Here is another limitation that I just thought of. It can only read
> > > > > a profile definition from a file.  It probably also makes sense to
> > > > > add an option that allows it to read the current Profiler
> > > > > configuration from Zookeeper.
> > > > >
> > > > >
> > > > > > Is it worth setting up a default config that pulls from the main
> > > > > > indexing output?
> > > > >
> > > > > Yes, I think that makes sense.  We want the Batch Profiler to point
> > > > > to the right HDFS URL, no matter where/how Metron is deployed.  When
> > > > > Metron gets spun-up on a cluster, I should be able to just run the
> > > > > Batch Profiler without having to fuss with the input path.
> > > > >
> > > > >
> > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <justinjleet@gmail.com> wrote:
> > > > >
> > > > > > Re:
> > > > > >
> > > > > > >  * You do not configure the Batch Profiler in Ambari.  It is
> > > > > > > configured and executed completely from the command-line.
> > > > > > >
> > > > > >
> > > > > > Is it worth setting up a default config that pulls from the main
> > > > > > indexing output?  I'm a little on the fence about it, but it seems
> > > > > > like making the most common case more or less built-in would be
> > > > > > nice.
> > > > > >
> > > > > > Having said that, I do not consider that a requirement for merging
> > > > > > the feature branch.
> > > > > >
> > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <jsirota@apache.org> wrote:
> > > > > >
> > > > > > > I think what you have outlined above is a good initial stab at
> > > > > > > the feature.  Manual install of spark is not a big deal.
> > > > > > > Configuring via command line while we mature this feature is ok
> > > > > > > as well.  Doesn't look like configuration steps are too hard.  I
> > > > > > > think you should merge.
> > > > > > >
> > > > > > > James
> > > > > > >
> > > > > > > 19.09.2018, 08:15, "Nick Allen" <ni...@nickallen.org>:
> > > > > > > > I would like to open a discussion to get the Batch Profiler
> > > feature
> > > > > > > branch
> > > > > > > > merged into master as part of METRON-1699 [1] Create Batch
> > > > Profiler.
> > > > > > All
> > > > > > > > of the work that I had in mind for our first draft of the
> Batch
> > > > > > Profiler
> > > > > > > > has been completed. Please take a look through what I have
> and
> > > let
> > > > me
> > > > > > > know
> > > > > > > > if there are other features that you think are required
> > *before*
> > > we
> > > > > > > merge.
> > > > > > > >
> > > > > > > > Previous list discussions on this topic include [2] and [3].
> > > > > > > >
> > > > > > > > (Q) What can I do with the feature branch?
> > > > > > > >
> > > > > > > >   * With the Batch Profiler, you can backfill/seed profiles
> > using
> > > > > > > archived
> > > > > > > > telemetry. This enables the following types of use cases.
> > > > > > > >
> > > > > > > >       1. As a Security Data Scientist, I want to understand
> the
> > > > > > > historical
> > > > > > > > behaviors and trends of a profile that I have created so
> that I
> > > can
> > > > > > > > determine if I have created a feature set that has predictive
> > > value
> > > > > for
> > > > > > > > model building.
> > > > > > > >
> > > > > > > >       2. As a Security Data Scientist, I want to understand
> the
> > > > > > > historical
> > > > > > > > behaviors and trends of a profile that I have created so
> that I
> > > can
> > > > > > > > determine if I have defined the profile correctly and
> created a
> > > > > feature
> > > > > > > set
> > > > > > > > that matches reality.
> > > > > > > >
> > > > > > > >       3. As a Security Platform Engineer, I want to generate
> a
> > > > > profile
> > > > > > > > using archived telemetry when I deploy a new model to
> > production
> > > so
> > > > > > that
> > > > > > > > models depending on that profile can function on day 1.
> > > > > > > >
> > > > > > > >   * METRON-1699 [1] includes a more detailed description of
> the
> > > > > > feature.
> > > > > > > >
> > > > > > > > (Q) What work was completed?
> > > > > > > >
> > > > > > > >   * The Batch Profiler runs on Spark and was implemented in
> > Java
> > > to
> > > > > > > remain
> > > > > > > > consistent with our current Java-heavy code base.
> > > > > > > >
> > > > > > > >   * The Batch Profiler is executed from the command-line. It
> > can
> > > be
> > > > > > > > launched using a script or by calling `spark-submit`, which
> may
> > > be
> > > > > > useful
> > > > > > > > for advanced users.
> > > > > > > >
> > > > > > > >   * Input telemetry can be consumed from multiple sources;
> for
> > > > > example
> > > > > > > HDFS
> > > > > > > > or the local file system.
> > > > > > > >
> > > > > > > >   * Input telemetry can be consumed in multiple formats; for
> > > > example
> > > > > > JSON
> > > > > > > > or ORC.
> > > > > > > >
> > > > > > > >   * The 'output' profile measurements are persisted in HBase
> > and
> > > is
> > > > > > > > consistent with the Storm Profiler.
> > > > > > > >
> > > > > > > >   * It can be run on any underlying engine supported by
> Spark.
> > I
> > > > have
> > > > > > > > tested it both in 'local' mode and on a YARN cluster.
> > > > > > > >
> > > > > > > >   * It is installed automatically by the Metron MPack.
> > > > > > > >
> > > > > > > >   * A README was added that documents usage instructions.
> > > > > > > >
> > > > > > > >   * The existing Profiler code was refactored so that as much
> > > code
> > > > as
> > > > > > > > possible is shared between the 3 Profiler ports; Storm, the
> > > Stellar
> > > > > > REPL,
> > > > > > > > and Spark. For example, the logic which determines the
> > timestamp
> > > > of a
> > > > > > > > message was refactored so that it could be reused by all
> ports.
> > > > > > > >
> > > > > > > >       * metron-profiler-common: The common Profiler code
> shared
> > > > > amongst
> > > > > > > > each port.
> > > > > > > >       * metron-profiler-storm: Profiler on Storm
> > > > > > > >       * metron-profiler-spark: Profiler on Spark
> > > > > > > >       * metron-profiler-repl: Profiler on the Stellar REPL
> > > > > > > >       * metron-profiler-client: The client code for
> retrieving
> > > > > profile
> > > > > > > > data; for example PROFILE_GET.
> > > > > > > >
> > > > > > > >   * There are 3 separate RPM and DEB packages now created for
> > the
> > > > > > > Profiler.
> > > > > > > >
> > > > > > > >       * metron-profiler-storm-*.rpm
> > > > > > > >       * metron-profiler-spark-*.rpm
> > > > > > > >       * metron-profiler-repl-*.rpm
> > > > > > > >
> > > > > > > >   * The Profiler integration tests were enhanced to leverage
> > the
> > > > > > Profiler
> > > > > > > > Client logic to validate the results.
> > > > > > > >
> > > > > > > >   * Review METRON-1699 [1] for a complete break-down of the
> > tasks
> > > > > that
> > > > > > > have
> > > > > > > > been completed on the feature branch.
> > > > > > > >
> > > > > > > > (Q) What limitations exist?
> > > > > > > >
> > > > > > > >   * You must manually install Spark to use the Batch Profiler. The Metron
> > > > > > > > MPack does not treat Spark as a Metron dependency and so does not install
> > > > > > > > it automatically.
> > > > > > > >
> > > > > > > >   * You do not configure the Batch Profiler in Ambari. It is configured
> > > > > > > > and executed completely from the command-line.
> > > > > > > >
> > > > > > > >   * To run the Batch Profiler in 'Full Dev', you have to take the following
> > > > > > > > manual steps. Some of these are arguably limitations with how Ambari
> > > > > > > > installs Spark 2 in the version of HDP that we run.
> > > > > > > >
> > > > > > > >       1. Install Spark 2 using Ambari.
> > > > > > > >
> > > > > > > >       2. Tell Spark how to talk with HBase.
> > > > > > > >
> > > > > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
> > > > > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
> > > > > > > >
> > > > > > > >       3. Create the Spark History directory in HDFS.
> > > > > > > >
> > > > > > > >         export HADOOP_USER_NAME=hdfs
> > > > > > > >         hdfs dfs -mkdir /spark2-history
> > > > > > > >
> > > > > > > >       4. Change the default input path to `hdfs://localhost:8020/...` to
> > > > > > > > match the port defined by HDP, instead of port 9000.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> > > > > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> > > > > > >
> > > > > > > -------------------
> > > > > > > Thank you,
> > > > > > >
> > > > > > > James Sirota
> > > > > > > PMC- Apache Metron
> > > > > > > jsirota AT apache DOT org
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
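
The manual 'Full Dev' steps 2 and 3 quoted above can be collected into a
small script. This is only a sketch under stated assumptions: the paths and
commands are the ones quoted in the thread, while the DRY_RUN switch and the
`run` and `setup_batch_profiler_full_dev` names are hypothetical helpers and
not part of Metron. Steps 1 (installing Spark 2 through Ambari) and 4
(changing the default input path) are left out because the thread gives no
commands for them.

```shell
#!/bin/sh
# Sketch of 'Full Dev' steps 2 and 3 from the quoted message.
# DRY_RUN, run() and setup_batch_profiler_full_dev() are illustrative
# assumptions; only the paths and the cp/hdfs commands come from the thread.
set -eu

SPARK_HOME="${SPARK_HOME:-/usr/hdp/current/spark2-client}"
HBASE_SITE="${HBASE_SITE:-/usr/hdp/current/hbase-client/conf/hbase-site.xml}"
DRY_RUN="${DRY_RUN:-1}"   # default: only print what would be executed

run() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

setup_batch_profiler_full_dev() {
  # Step 2: tell Spark how to talk with HBase.
  run cp "$HBASE_SITE" "$SPARK_HOME/conf/"
  # Step 3: create the Spark History directory in HDFS as the hdfs user.
  # 'env' scopes HADOOP_USER_NAME to this one command instead of exporting it.
  run env HADOOP_USER_NAME=hdfs hdfs dfs -mkdir /spark2-history
}

setup_batch_profiler_full_dev
```

Run as-is, the script only prints the two commands. Setting DRY_RUN=0
executes them, which assumes the HDP client paths above exist and that the
`hdfs` command is on the PATH.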

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Nick Allen <ni...@nickallen.org>.

Thanks for all the reviews and support.  I have merged the feature branch
into master.

On Thu, Sep 27, 2018 at 2:41 PM James Sirota <js...@apache.org> wrote:

> +1 from me as well. great work
>
> 27.09.2018, 11:15, "Ryan Merriman" <me...@gmail.com>:
> > +1 from me. Great work.
> >
> > On Thu, Sep 27, 2018 at 12:41 PM Justin Leet <ju...@gmail.com> wrote:
> >
> >>  I'm +1 on merging the feature branch into master. There's a lot of good
> >>  work here, and it's definitely been nice to see the couple remaining
> >>  improvements make it in.
> >>
> >>  Thanks a lot for the contribution, this is great stuff!
> >>
> >>  On Wed, Sep 26, 2018 at 6:26 PM Nick Allen <ni...@nickallen.org> wrote:
> >>
> >>  > Or support to be offered for merging this feature branch into master?
> >>  >
> >>  > On Wed, Sep 26, 2018 at 6:20 PM Nick Allen <ni...@nickallen.org> wrote:
> >>  >
> >>  > > Thanks for the review. With https://github.com/apache/metron/pull/1209
> >>  > > complete, I think the feature branch is ready to be merged. Sounds like I
> >>  > > have Mike's support. Anyone else have comments, concerns, questions?
> >>  > >
> >>  > > On Tue, Sep 25, 2018 at 10:33 PM Michael Miklavcic <
> >>  > > michael.miklavcic@gmail.com> wrote:
> >>  > >
> >>  > >> I just made a couple minor comments on that PR, and I am in agreement
> >>  > >> about the readiness for merging with master. Good stuff Nick.
> >>  > >>
> >>  > >> On Fri, Sep 21, 2018 at 12:37 PM Nick Allen <ni...@nickallen.org> wrote:
> >>  > >>
> >>  > >> > Here is a PR that adds the input time constraints to the Batch Profiler
> >>  > >> > (METRON-1787); https://github.com/apache/metron/pull/1209.
> >>  > >> >
> >>  > >> > It seems that the consensus is that this is probably the last feature we
> >>  > >> > need before merging the FB into master. The other two can wait until after
> >>  > >> > the feature branch has been merged. Let me know if you disagree.
> >>  > >> >
> >>  > >> > Thanks
> >>  > >> >
> >>  > >> >
> >>  > >> > On Thu, Sep 20, 2018 at 1:55 PM Nick Allen <ni...@nickallen.org> wrote:
> >>  > >> >
> >>  > >> > > Yeah, agreed. Per use case 3, when deploying to production there really
> >>  > >> > > wouldn't be a huge overlap like 3 months of already profiled data. It's
> >>  > >> > > day 1, the profile was just deployed around the same time as you are
> >>  > >> > > running the Batch Profiler, so the overlap is in minutes, maybe hours.
> >>  > >> > > But I can definitely see the usefulness of the feature for re-runs, etc.
> >>  > >> > > as you have described.
> >>  > >> > >
> >>  > >> > > Based on this discussion, I created a few JIRAs. Thanks all for the great
> >>  > >> > > feedback and keep it coming.
> >>  > >> > >
> >>  > >> > > [1] METRON-1787 - Input Time Constraints for Batch Profiler
> >>  > >> > > [2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
> >>  > >> > > [3] METRON-1789 - MPack Should Define Default Input Path for Batch Profiler
> >>  > >> > >
> >>  > >> > >
> >>  > >> > > --
> >>  > >> > > [1] https://issues.apache.org/jira/browse/METRON-1787
> >>  > >> > > [2] https://issues.apache.org/jira/browse/METRON-1788
> >>  > >> > > [3] https://issues.apache.org/jira/browse/METRON-1789
> >>  > >> > >
> >>  > >> > >
> >>  > >> > >
> >>  > >> > >
> >>  > >> > >
> >>  > >> > >
> >>  > >> > > On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <
> >>  > >> > > michael.miklavcic@gmail.com> wrote:
> >>  > >> > >
> >>  > >> > >> I think we might want to allow the flexibility to choose the date range
> >>  > >> > >> then. I don't yet feel like I have a good enough understanding of all the
> >>  > >> > >> ways in which users would want to seed to force them to run the batch job
> >>  > >> > >> over all the data. It might also make it easier to deal with remediation,
> >>  > >> > >> i.e. an error doesn't force you to re-run over the entire history. Same
> >>  > >> > >> goes for testing out the profile seeding batch job in the first place.
> >>  > >> > >>
> >>  > >> > >> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <nick@nickallen.org> wrote:
> >>  > >> > >>
> >>  > >> > >> > Assuming you have 9 months of data archived, yes.
> >>  > >> > >> >
> >>  > >> > >> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <
> >>  > >> > >> > michael.miklavcic@gmail.com> wrote:
> >>  > >> > >> >
> >>  > >> > >> > > So in the case of 3 - if you had 6 months of data that hadn't been
> >>  > >> > >> > > profiled and another 3 that had been profiled (9 months total data), in
> >>  > >> > >> > > its current form the batch job runs over all 9 months?
> >>  > >> > >> > >
> >>  > >> > >> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <nick@nickallen.org> wrote:
> >>  > >> > >> > >
> >>  > >> > >> > > > > How do we establish "tm" from 1.1 above? Any concerns about overlap or
> >>  > >> > >> > > > > gaps after the seeding is performed?
> >>  > >> > >> > > >
> >>  > >> > >> > > > Good point. Right now, if the Streaming and Batch Profiler overlap, the
> >>  > >> > >> > > > last write wins. And presumably the output of the Streaming and Batch
> >>  > >> > >> > > > Profiler are the same, so no worries, right? :)
> >>  > >> > >> > > >
> >>  > >> > >> > > > So it kind of works, but it is definitely not ideal for use case 3. I
> >>  > >> > >> > > > could add --begin and --end args to constrain the time frame over which
> >>  > >> > >> > > > the Batch Profiler runs. I do not have that in the feature branch. It
> >>  > >> > >> > > > would be easy enough to add though.
> >>  > >> > >> > > >
> >>  > >> > >> > > >
> >>  > >> > >> > > >
> >>  > >> > >> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <
> >>  > >> > >> > > > michael.miklavcic@gmail.com> wrote:
> >>  > >> > >> > > >
> >>  > >> > >> > > > > Ok, makes sense. That's sort of what I was thinking as well, Nick.
> >>  > >> > >> > > > > Pulling at this thread just a bit more...
> >>  > >> > >> > > > >
> >>  > >> > >> > > > > 1. I have an existing system that's been up a while, and I have added k
> >>  > >> > >> > > > >    profiles - assume these are the first profiles I've created.
> >>  > >> > >> > > > >    1. I would have t0 - tm (where m is the time when the profiles were
> >>  > >> > >> > > > >       first installed) worth of data that has not been profiled yet.
> >>  > >> > >> > > > >    2. The batch profiler process would be to take that exact profile
> >>  > >> > >> > > > >       definition from ZK and run the batch loader with that from the CLI.
> >>  > >> > >> > > > >    3. Profiles are now up to date from t0 - tCurrent
> >>  > >> > >> > > > > 2. I've already done #1 above. Time goes by and now I want to add a new
> >>  > >> > >> > > > >    profile.
> >>  > >> > >> > > > >    1. Same first step above
> >>  > >> > >> > > > >    2. I would run the batch loader with *only* that new profile
> >>  > >> > >> > > > >       definition to seed?
> >>  > >> > >> > > > >
> >>  > >> > >> > > > > Forgive me if I missed this in PR's and discussion in the FB, but how do
> >>  > >> > >> > > > > we establish "tm" from 1.1 above? Any concerns about overlap or gaps
> >>  > >> > >> > > > > after the seeding is performed?
> >>  > >> > >> > > > >
> >>  > >> > >> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <nick@nickallen.org> wrote:
> >>  > >> > >> > > > >
> >>  > >> > >> > > > > > I think more often than not, you would want to load your profile
> >>  > >> > >> > > > > > definition from a file. This is why I considered the 'load from Zk'
> >>  > >> > >> > > > > > more of a nice-to-have.
> >>  > >> > >> > > > > >
> >>  > >> > >> > > > > > - In use case 1 and 2, this would definitely be the case. The profiles
> >>  > >> > >> > > > > >   I am working with are speculative and I am using the batch profiler
> >>  > >> > >> > > > > >   to determine if they are worth keeping. In this case, my speculative
> >>  > >> > >> > > > > >   profiles would not be in Zk (yet).
> >>  > >> > >> > > > > > - In use case 3, I could see it go either way. It might be useful to
> >>  > >> > >> > > > > >   load from Zk, but it certainly isn't a blocker.
> >>  > >> > >> > > > > >
> >>  > >> > >> > > > > >
> >>  > >> > >> > > > > > > So if the config does not correctly match the profiler config held
> >>  > >> > >> > > > > > > in ZK and the user runs the batch seeding job, what happens?
> >>  > >> > >> > > > > >
> >>  > >> > >> > > > > > You would just get a profile that is slightly different over the
> >>  > >> > >> > > > > > entire time span. This is not a new risk. If the user changes their
> >>  > >> > >> > > > > > Profile definitions in Zk, the same thing would happen.
> >>  > >> > >> > > > > >
> >>  > >> > >> > > > > >
> >>  > >> > >> > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <michael.miklavcic@gmail.com> wrote:
> >>  > >> > >> > > > > >
> >>  > >> > >> > > > > > > I think I'm torn on this, specifically because it's batch and would
> >>  > >> > >> > > > > > > generally be run as-needed. Justin, can you elaborate on your
> >>  > >> > >> > > > > > > concerns there? This feels functionally very similar to our flat
> >>  > >> > >> > > > > > > file loaders, which all have inputs for config from the CLI only. On
> >>  > >> > >> > > > > > > the other hand, our flat file loaders are not typically seeding an
> >>  > >> > >> > > > > > > existing structure. My concern of a local file profiler config stems
> >>  > >> > >> > > > > > > from this stated goal:
> >>  > >> > >> > > > > > > > The goal would be to enable “profile seeding” which allows
> >>  > >> > >> > > > > > > > profiles to be populated from a time before the profile was
> >>  > >> > >> > > > > > > > created.
> >>  > >> > >> > > > > > > So if the config does not correctly match the profiler config held
> >>  > >> > >> > > > > > > in ZK and the user runs the batch seeding job, what happens?
> >>  > >> > >> > > > > > >
> >>  > >> > >> > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <justinjleet@gmail.com> wrote:
> >>  > >> > >> > > > > > >
> >>  > >> > >> > > > > > > > The profile not being able to read from ZK feels like a fairly
> >>  > >> > >> > > > > > > > substantial, if subtle, set of potential problems. I'd like to see
> >>  > >> > >> > > > > > > > that in either before merging or at least pretty soon after
> >>  > >> > >> > > > > > > > merging. Is it a lot of work to add that functionality based on
> >>  > >> > >> > > > > > > > where things are right now?
> >>  > >> > >> > > > > > > >
> >>  > >> > >> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <nick@nickallen.org> wrote:
> >>  > >> > >> > > > > > > >
> >>  > >> > >> > > > > > > > > Here is another limitation that I just thought of. It can only
> >>  > >> > >> > > > > > > > > read a profile definition from a file. It probably also makes
> >>  > >> > >> > > > > > > > > sense to add an option that allows it to read the current
> >>  > >> > >> > > > > > > > > Profiler configuration from Zookeeper.
> >>  > >> > >> > > > > > > > >
> >>  > >> > >> > > > > > > > >
> >>  > >> > >> > > > > > > > > > Is it worth setting up a default config that pulls from the
> >>  > >> > >> > > > > > > > > > main indexing output?
> >>  > >> > >> > > > > > > > >
> >>  > >> > >> > > > > > > > > Yes, I think that makes sense. We want the Batch Profiler to
> >>  > >> > >> > > > > > > > > point to the right HDFS URL, no matter where/how Metron is
> >>  > >> > >> > > > > > > > > deployed. When Metron gets spun-up on a cluster, I should be able
> >>  > >> > >> > > > > > > > > to just run the Batch Profiler without having to fuss with the
> >>  > >> > >> > > > > > > > > input path.
> >>  > >> > >> > > > > > > > >
> >>  > >> > >> > > > > > > > >
> >>  > >> > >> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <justinjleet@gmail.com> wrote:
> >>  > >> > >> > > > > > > > >
> >>  > >> > >> > > > > > > > > > Re:
> >>  > >> > >> > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > * You do not configure the Batch Profiler in Ambari. It is
> >>  > >> > >> > > > > > > > > > > configured and executed completely from the command-line.
> >>  > >> > >> > > > > > > > > > >
> >>  > >> > >> > > > > > > > > >
> >>  > >> > >> > > > > > > > > > Is it worth setting up a default config that pulls from the
> >>  > >> > >> > > > > > > > > > main indexing output? I'm a little on the fence about it, but
> >>  > >> > >> > > > > > > > > > it seems like making the most common case more or less built-in
> >>  > >> > >> > > > > > > > > > would be nice.
> >>  > >> > >> > > > > > > > > >
> >>  > >> > >> > > > > > > > > > Having said that, I do not consider that a requirement for
> >>  > >> > >> > > > > > > > > > merging the feature branch.
> >>  > >> > >> > > > > > > > > >
> >>  > >> > >> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <jsirota@apache.org> wrote:
> >>  > >> > >> > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > I think what you have outlined above is a good initial stab
> >>  > >> > >> > > > > > > > > > > at the feature. Manual install of spark is not a big deal.
> >>  > >> > >> > > > > > > > > > > Configuring via command line while we mature this feature is
> >>  > >> > >> > > > > > > > > > > ok as well. Doesn't look like configuration steps are too
> >>  > >> > >> > > > > > > > > > > hard. I think you should merge.
> >>  > >> > >> > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > James
> >>  > >> > >> > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <nick@nickallen.org>:
> >>  > >> > >> > > > > > > > > > > > I would like to open a discussion to get the Batch Profiler
> >>  > >> > >> > > > > > > > > > > > feature branch merged into master as part of METRON-1699 [1]
> >>  > >> > >> > > > > > > > > > > > Create Batch Profiler. All of the work that I had in mind
> >>  > >> > >> > > > > > > > > > > > for our first draft of the Batch Profiler has been
> >>  > >> > >> > > > > > > > > > > > completed. Please take a look through what I have and let
> >>  > >> > >> > > > > > > > > > > > me know if there are other features that you think are
> >>  > >> > >> > > > > > > > > > > > required *before* we merge.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > Previous list discussions on this topic include [2] and [3].
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > (Q) What can I do with the feature branch?
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * With the Batch Profiler, you can backfill/seed profiles
> >>  > >> > >> > > > > > > > > > > > using archived telemetry. This enables the following types
> >>  > >> > >> > > > > > > > > > > > of use cases.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > 1. As a Security Data Scientist, I want to understand the
> >>  > >> > >> > > > > > > > > > > > historical behaviors and trends of a profile that I have
> >>  > >> > >> > > > > > > > > > > > created so that I can determine if I have created a feature
> >>  > >> > >> > > > > > > > > > > > set that has predictive value for model building.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > 2. As a Security Data Scientist, I want to understand the
> >>  > >> > >> > > > > > > > > > > > historical behaviors and trends of a profile that I have
> >>  > >> > >> > > > > > > > > > > > created so that I can determine if I have defined the
> >>  > >> > >> > > > > > > > > > > > profile correctly and created a feature set that matches
> >>  > >> > >> > > > > > > > > > > > reality.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > 3. As a Security Platform Engineer, I want to generate a
> >>  > >> > >> > > > > > > > > > > > profile using archived telemetry when I deploy a new model
> >>  > >> > >> > > > > > > > > > > > to production so that models depending on that profile can
> >>  > >> > >> > > > > > > > > > > > function on day 1.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * METRON-1699 [1] includes a more detailed description of
> >>  > >> > >> > > > > > > > > > > > the feature.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > (Q) What work was completed?
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * The Batch Profiler runs on Spark and was implemented in
> >>  > >> > >> > > > > > > > > > > > Java to remain consistent with our current Java-heavy code
> >>  > >> > >> > > > > > > > > > > > base.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * The Batch Profiler is executed from the command-line. It
> >>  > >> > >> > > > > > > > > > > > can be launched using a script or by calling `spark-submit`,
> >>  > >> > >> > > > > > > > > > > > which may be useful for advanced users.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * Input telemetry can be consumed from multiple sources; for
> >>  > >> > >> > > > > > > > > > > > example HDFS or the local file system.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * Input telemetry can be consumed in multiple formats; for
> >>  > >> > >> > > > > > > > > > > > example JSON or ORC.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * The 'output' profile measurements are persisted in HBase
> >>  > >> > >> > > > > > > > > > > > and are consistent with the Storm Profiler.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * It can be run on any underlying engine supported by Spark.
> >>  > >> > >> > > > > > > > > > > > I have tested it both in 'local' mode and on a YARN cluster.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * It is installed automatically by the Metron MPack.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * A README was added that documents usage instructions.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * The existing Profiler code was refactored so that as much
> >>  > >> > >> > > > > > > > > > > > code as possible is shared between the 3 Profiler ports;
> >>  > >> > >> > > > > > > > > > > > Storm, the Stellar REPL, and Spark. For example, the logic
> >>  > >> > >> > > > > > > > > > > > which determines the timestamp of a message was refactored
> >>  > >> > >> > > > > > > > > > > > so that it could be reused by all ports.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > >     * metron-profiler-common: The common Profiler code
> >>  > >> > >> > > > > > > > > > > >       shared amongst each port.
> >>  > >> > >> > > > > > > > > > > >     * metron-profiler-storm: Profiler on Storm
> >>  > >> > >> > > > > > > > > > > >     * metron-profiler-spark: Profiler on Spark
> >>  > >> > >> > > > > > > > > > > >     * metron-profiler-repl: Profiler on the Stellar REPL
> >>  > >> > >> > > > > > > > > > > >     * metron-profiler-client: The client code for
> >>  > >> > >> > > > > > > > > > > >       retrieving profile data; for example PROFILE_GET.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * There are 3 separate RPM and DEB packages now created for
> >>  > >> > >> > > > > > > > > > > > the Profiler.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > >     * metron-profiler-storm-*.rpm
> >>  > >> > >> > > > > > > > > > > >     * metron-profiler-spark-*.rpm
> >>  > >> > >> > > > > > > > > > > >     * metron-profiler-repl-*.rpm
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * The Profiler integration tests were enhanced to leverage
> >>  > >> > >> > > > > > > > > > > > the Profiler Client logic to validate the results.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * Review METRON-1699 [1] for a complete break-down of the
> >>  > >> > >> > > > > > > > > > > > tasks that have been completed on the feature branch.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > (Q) What limitations exist?
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * You must manually install Spark to use the Batch Profiler.
> >>  > >> > >> > > > > > > > > > > > The Metron MPack does not treat Spark as a Metron dependency
> >>  > >> > >> > > > > > > > > > > > and so does not install it automatically.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * You do not configure the Batch Profiler in Ambari. It is
> >>  > >> > >> > > > > > > > > > > > configured and executed completely from the command-line.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > * To run the Batch Profiler in 'Full Dev', you have to take
> >>  > >> > >> > > > > > > > > > > > the following manual steps. Some of these are arguably
> >>  > >> > >> > > > > > > > > > > > limitations with how Ambari installs Spark 2 in the version
> >>  > >> > >> > > > > > > > > > > > of HDP that we run.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > >     1. Install Spark 2 using Ambari.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > >     2. Tell Spark how to talk with HBase.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > >       SPARK_HOME=/usr/hdp/current/spark2-client
> >>  > >> > >> > > > > > > > > > > >       cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > >     3. Create the Spark History directory in HDFS.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > >       export HADOOP_USER_NAME=hdfs
> >>  > >> > >> > > > > > > > > > > >       hdfs dfs -mkdir /spark2-history
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > >     4. Change the default input path to
> >>  > >> > >> > > > > > > > > > > >     `hdfs://localhost:8020/...` to match the port defined by
> >>  > >> > >> > > > > > > > > > > >     HDP, instead of port 9000.
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> >>  > >> > >> > > > > > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> >>  > >> > >> > > > > > > > > > > > [3]
> >>  > >> > >> > > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > >
> >>  > >> > >> > > > > > > > > >
> >>  > >> > >> > > > > > > > >
> >>  > >> > >> > > > > > > >
> >>  > >> > >> > > > > > >
> >>  > >> > >> > > > > >
> >>  > >> > >> > > > >
> >>  > >> > >> > > >
> >>  > >> > >> > >
> >>  > >> > >> >
> >>  > >> > >>
> >>  > >> >
> >>  > >>
> >>  >
> >>
> https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> >>  > >> > >> > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > -------------------
> >>  > >> > >> > > > > > > > > > > Thank you,
> >>  > >> > >> > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > > James Sirota
> >>  > >> > >> > > > > > > > > > > PMC- Apache Metron
> >>  > >> > >> > > > > > > > > > > jsirota AT apache DOT org
> >>  > >> > >> > > > > > > > > > >
> >>  > >> > >> > > > > > > > > > >
>
> -------------------
> Thank you,
>
> James Sirota
> PMC- Apache Metron
> jsirota AT apache DOT org
>
>
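
Step 4 of the manual steps quoted above swaps port 9000 in the default input path for port 8020, the port that HDP defines. A minimal Python sketch of that rewrite follows. The function name `with_hdp_port` and the example path are hypothetical and not part of Metron; the real default path is elided in the thread and is not reproduced here.

```python
from urllib.parse import urlsplit

HDP_NAMENODE_PORT = 8020  # the port named in step 4 of the thread


def with_hdp_port(url: str, port: int = HDP_NAMENODE_PORT) -> str:
    """Return `url` with its port replaced by `port`; host and path are kept."""
    parts = urlsplit(url)
    if parts.scheme != "hdfs" or parts.hostname is None:
        raise ValueError(f"not an hdfs:// URL with a host: {url!r}")
    # Rebuild only the network location; everything else is left untouched.
    return parts._replace(netloc=f"{parts.hostname}:{port}").geturl()


# Hypothetical path, for illustration only.
print(with_hdp_port("hdfs://localhost:9000/example/telemetry"))
```

Rewriting the parsed URL rather than doing a string replace avoids touching a "9000" that happens to appear in the path.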

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by James Sirota <js...@apache.org>.

+1 from me as well. Great work.

27.09.2018, 11:15, "Ryan Merriman" <me...@gmail.com>:
> +1 from me. Great work.
>
> On Thu, Sep 27, 2018 at 12:41 PM Justin Leet <ju...@gmail.com> wrote:
>
>>  I'm +1 on merging the feature branch into master. There's a lot of good
>>  work here, and it's definitely been nice to see the couple remaining
>>  improvements make it in.
>>
>>  Thanks a lot for the contribution, this is great stuff!
>>
>>  On Wed, Sep 26, 2018 at 6:26 PM Nick Allen <ni...@nickallen.org> wrote:
>>
>>  > Or support to be offered for merging this feature branch into master?
>>  >
>>  > On Wed, Sep 26, 2018 at 6:20 PM Nick Allen <ni...@nickallen.org> wrote:
>>  >
>>  > > Thanks for the review. With https://github.com/apache/metron/pull/1209
>>  > > complete, I think the feature branch is ready to be merged. Sounds like I
>>  > > have Mike's support. Anyone else have comments, concerns, questions?
>>  > >
>>  > > On Tue, Sep 25, 2018 at 10:33 PM Michael Miklavcic <
>>  > > michael.miklavcic@gmail.com> wrote:
>>  > >
>>  > >> I just made a couple minor comments on that PR, and I am in agreement
>>  > >> about the readiness for merging with master. Good stuff Nick.
>>  > >>
>>  > >> On Fri, Sep 21, 2018 at 12:37 PM Nick Allen <ni...@nickallen.org> wrote:
>>  > >>
>>  > >> > Here is a PR that adds the input time constraints to the Batch Profiler
>>  > >> > (METRON-1787); https://github.com/apache/metron/pull/1209.
>>  > >> >
>>  > >> > It seems that the consensus is that this is probably the last feature we
>>  > >> > need before merging the FB into master. The other two can wait until after
>>  > >> > the feature branch has been merged. Let me know if you disagree.
>>  > >> >
>>  > >> > Thanks
>>  > >> >
>>  > >> >
>>  > >> > On Thu, Sep 20, 2018 at 1:55 PM Nick Allen <ni...@nickallen.org> wrote:
>>  > >> >
>>  > >> > > Yeah, agreed. Per use case 3, when deploying to production there really
>>  > >> > > wouldn't be a huge overlap like 3 months of already profiled data. It's
>>  > >> > > day 1, the profile was just deployed around the same time as you are
>>  > >> > > running the Batch Profiler, so the overlap is in minutes, maybe hours. But
>>  > >> > > I can definitely see the usefulness of the feature for re-runs, etc. as you
>>  > >> > > have described.
>>  > >> > >
>>  > >> > > Based on this discussion, I created a few JIRAs. Thanks all for the great
>>  > >> > > feedback and keep it coming.
>>  > >> > >
>>  > >> > > [1] METRON-1787 - Input Time Constraints for Batch Profiler
>>  > >> > > [2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
>>  > >> > > [3] METRON-1789 - MPack Should Define Default Input Path for Batch Profiler
>>  > >> > >
>>  > >> > > --
>>  > >> > > [1] https://issues.apache.org/jira/browse/METRON-1787
>>  > >> > > [2] https://issues.apache.org/jira/browse/METRON-1788
>>  > >> > > [3] https://issues.apache.org/jira/browse/METRON-1789
>>  > >> > >
>>  > >> > > On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <
>>  > >> > > michael.miklavcic@gmail.com> wrote:
>>  > >> > >
>>  > >> > >> I think we might want to allow the flexibility to choose the date range
>>  > >> > >> then. I don't yet feel like I have a good enough understanding of all the
>>  > >> > >> ways in which users would want to seed to force them to run the batch job
>>  > >> > >> over all the data. It might also make it easier to deal with remediation,
>>  > >> > >> i.e. an error doesn't force you to re-run over the entire history. Same goes
>>  > >> > >> for testing out the profile seeding batch job in the first place.
>>  > >> > >>
>>  > >> > >> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <ni...@nickallen.org> wrote:
>>  > >> > >>
>>  > >> > >> > Assuming you have 9 months of data archived, yes.
>>  > >> > >> >
>>  > >> > >> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <
>>  > >> > >> > michael.miklavcic@gmail.com> wrote:
>>  > >> > >> >
>>  > >> > >> > > So in the case of 3 - if you had 6 months of data that hadn't been
>>  > >> > >> > > profiled and another 3 that had been profiled (9 months total data), in
>>  > >> > >> > > its current form the batch job runs over all 9 months?
>>  > >> > >> > >
>>  > >> > >> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <nick@nickallen.org> wrote:
>>  > >> > >> > >
>>  > >> > >> > > > > How do we establish "tm" from 1.1 above? Any concerns about overlap
>>  > >> > >> > > > > or gaps after the seeding is performed?
>>  > >> > >> > > >
>>  > >> > >> > > > Good point. Right now, if the Streaming and Batch Profiler overlap the
>>  > >> > >> > > > last write wins. And presumably the output of the Streaming and Batch
>>  > >> > >> > > > Profiler are the same, so no worries, right? :)
>>  > >> > >> > > >
>>  > >> > >> > > > So it kind of works, but it is definitely not ideal for use case 3. I
>>  > >> > >> > > > could add --begin and --end args to constrain the time frame over which
>>  > >> > >> > > > the Batch Profiler runs. I do not have that in the feature branch. It
>>  > >> > >> > > > would be easy enough to add though.
>>  > >> > >> > > >
>>  > >> > >> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <
>>  > >> > >> > > > michael.miklavcic@gmail.com> wrote:
>>  > >> > >> > > >
>>  > >> > >> > > > > Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling
>>  > >> > >> > > > > at this thread just a bit more...
>>  > >> > >> > > > >
>>  > >> > >> > > > > 1. I have an existing system that's been up a while, and I have added k
>>  > >> > >> > > > >    profiles - assume these are the first profiles I've created.
>>  > >> > >> > > > >    1. I would have t0 - tm (where m is the time when the profiles were
>>  > >> > >> > > > >       first installed) worth of data that has not been profiled yet.
>>  > >> > >> > > > >    2. The batch profiler process would be to take that exact profile
>>  > >> > >> > > > >       definition from ZK and run the batch loader with that from the CLI.
>>  > >> > >> > > > >    3. Profiles are now up to date from t0 - tCurrent
>>  > >> > >> > > > > 2. I've already done #1 above. Time goes by and now I want to add a new
>>  > >> > >> > > > >    profile.
>>  > >> > >> > > > >    1. Same first step above
>>  > >> > >> > > > >    2. I would run the batch loader with *only* that new profile
>>  > >> > >> > > > >       definition to seed?
>>  > >> > >> > > > >
>>  > >> > >> > > > > Forgive me if I missed this in PR's and discussion in the FB, but how do we
>>  > >> > >> > > > > establish "tm" from 1.1 above? Any concerns about overlap or gaps after the
>>  > >> > >> > > > > seeding is performed?
>>  > >> > >> > > > >
>>  > >> > >> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <nick@nickallen.org> wrote:
>>  > >> > >> > > > >
>>  > >> > >> > > > > > I think more often than not, you would want to load your profile definition
>>  > >> > >> > > > > > from a file. This is why I considered the 'load from Zk' more of a
>>  > >> > >> > > > > > nice-to-have.
>>  > >> > >> > > > > >
>>  > >> > >> > > > > > - In use case 1 and 2, this would definitely be the case. The profiles I am
>>  > >> > >> > > > > >   working with are speculative and I am using the batch profiler to
>>  > >> > >> > > > > >   determine if they are worth keeping. In this case, my speculative
>>  > >> > >> > > > > >   profiles would not be in Zk (yet).
>>  > >> > >> > > > > > - In use case 3, I could see it go either way. It might be useful to load
>>  > >> > >> > > > > >   from Zk, but it certainly isn't a blocker.
>>  > >> > >> > > > > >
>>  > >> > >> > > > > > > So if the config does not correctly match the profiler config held in ZK
>>  > >> > >> > > > > > > and the user runs the batch seeding job, what happens?
>>  > >> > >> > > > > >
>>  > >> > >> > > > > > You would just get a profile that is slightly different over the entire
>>  > >> > >> > > > > > time span. This is not a new risk. If the user changes their Profile
>>  > >> > >> > > > > > definitions in Zk, the same thing would happen.
>>  > >> > >> > > > > >
>>  > >> > >> > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <
>>  > >> > >> > > > > > michael.miklavcic@gmail.com> wrote:
>>  > >> > >> > > > > >
>>  > >> > >> > > > > > > I think I'm torn on this, specifically because it's batch and would
>>  > >> > >> > > > > > > generally be run as-needed. Justin, can you elaborate on your concerns
>>  > >> > >> > > > > > > there? This feels functionally very similar to our flat file loaders, which
>>  > >> > >> > > > > > > all have inputs for config from the CLI only. On the other hand, our flat
>>  > >> > >> > > > > > > file loaders are not typically seeding an existing structure. My concern of
>>  > >> > >> > > > > > > a local file profiler config stems from this stated goal:
>>  > >> > >> > > > > > > > The goal would be to enable “profile seeding” which allows profiles to be
>>  > >> > >> > > > > > > > populated from a time before the profile was created.
>>  > >> > >> > > > > > > So if the config does not correctly match the profiler config held in ZK
>>  > >> > >> > > > > > > and the user runs the batch seeding job, what happens?
>>  > >> > >> > > > > > >
>>  > >> > >> > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <justinjleet@gmail.com> wrote:
>>  > >> > >> > > > > > >
>>  > >> > >> > > > > > > > The profile not being able to read from ZK feels like a fairly substantial,
>>  > >> > >> > > > > > > > if subtle, set of potential problems. I'd like to see that in either
>>  > >> > >> > > > > > > > before merging or at least pretty soon after merging. Is it a lot of work
>>  > >> > >> > > > > > > > to add that functionality based on where things are right now?
>>  > >> > >> > > > > > > >
>>  > >> > >> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <nick@nickallen.org> wrote:
>>  > >> > >> > > > > > > >
>>  > >> > >> > > > > > > > > Here is another limitation that I just thought of. It can only read a
>>  > >> > >> > > > > > > > > profile definition from a file. It probably also makes sense to add an
>>  > >> > >> > > > > > > > > option that allows it to read the current Profiler configuration from
>>  > >> > >> > > > > > > > > Zookeeper.
>>  > >> > >> > > > > > > > >
>>  > >> > >> > > > > > > > > > Is it worth setting up a default config that pulls from the main indexing
>>  > >> > >> > > > > > > > > > output?
>>  > >> > >> > > > > > > > >
>>  > >> > >> > > > > > > > > Yes, I think that makes sense. We want the Batch Profiler to point to the
>>  > >> > >> > > > > > > > > right HDFS URL, no matter where/how Metron is deployed. When Metron gets
>>  > >> > >> > > > > > > > > spun-up on a cluster, I should be able to just run the Batch Profiler
>>  > >> > >> > > > > > > > > without having to fuss with the input path.
>>  > >> > >> > > > > > > > >
>>  > >> > >> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <justinjleet@gmail.com> wrote:
>>  > >> > >> > > > > > > > >
>>  > >> > >> > > > > > > > > > Re:
>>  > >> > >> > > > > > > > > >
>>  > >> > >> > > > > > > > > > > * You do not configure the Batch Profiler in Ambari. It is configured
>>  > >> > >> > > > > > > > > > > and executed completely from the command-line.
>>  > >> > >> > > > > > > > > >
>>  > >> > >> > > > > > > > > > Is it worth setting up a default config that pulls from the main indexing
>>  > >> > >> > > > > > > > > > output? I'm a little on the fence about it, but it seems like making the
>>  > >> > >> > > > > > > > > > most common case more or less built-in would be nice.
>>  > >> > >> > > > > > > > > >
>>  > >> > >> > > > > > > > > > Having said that, I do not consider that a requirement for merging the
>>  > >> > >> > > > > > > > > > feature branch.
>>  > >> > >> > > > > > > > > >
>>  > >> > >> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <jsirota@apache.org> wrote:
>>  > >> > >> > > > > > > > > >
>>  > >> > >> > > > > > > > > > > I think what you have outlined above is a good initial stab at the
>>  > >> > >> > > > > > > > > > > feature. Manual install of spark is not a big deal. Configuring via
>>  > >> > >> > > > > > > > > > > command line while we mature this feature is ok as well. Doesn't look like
>>  > >> > >> > > > > > > > > > > configuration steps are too hard. I think you should merge.
>>  > >> > >> > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > James
>>  > >> > >> > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <nick@nickallen.org>:
>>  > >> > >> > > > > > > > > > > > I would like to open a discussion to get the Batch Profiler feature branch
>>  > >> > >> > > > > > > > > > > > merged into master as part of METRON-1699 [1] Create Batch Profiler. All
>>  > >> > >> > > > > > > > > > > > of the work that I had in mind for our first draft of the Batch Profiler
>>  > >> > >> > > > > > > > > > > > has been completed. Please take a look through what I have and let me know
>>  > >> > >> > > > > > > > > > > > if there are other features that you think are required *before* we merge.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > Previous list discussions on this topic include [2] and [3].
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > (Q) What can I do with the feature branch?
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * With the Batch Profiler, you can backfill/seed profiles using archived
>>  > >> > >> > > > > > > > > > > > telemetry. This enables the following types of use cases.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > 1. As a Security Data Scientist, I want to understand the historical
>>  > >> > >> > > > > > > > > > > > behaviors and trends of a profile that I have created so that I can
>>  > >> > >> > > > > > > > > > > > determine if I have created a feature set that has predictive value for
>>  > >> > >> > > > > > > > > > > > model building.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > 2. As a Security Data Scientist, I want to understand the historical
>>  > >> > >> > > > > > > > > > > > behaviors and trends of a profile that I have created so that I can
>>  > >> > >> > > > > > > > > > > > determine if I have defined the profile correctly and created a feature set
>>  > >> > >> > > > > > > > > > > > that matches reality.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > 3. As a Security Platform Engineer, I want to generate a profile
>>  > >> > >> > > > > > > > > > > > using archived telemetry when I deploy a new model to production so that
>>  > >> > >> > > > > > > > > > > > models depending on that profile can function on day 1.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * METRON-1699 [1] includes a more detailed description of the feature.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > (Q) What work was completed?
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * The Batch Profiler runs on Spark and was implemented in Java to remain
>>  > >> > >> > > > > > > > > > > > consistent with our current Java-heavy code base.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * The Batch Profiler is executed from the command-line. It can be
>>  > >> > >> > > > > > > > > > > > launched using a script or by calling `spark-submit`, which may be useful
>>  > >> > >> > > > > > > > > > > > for advanced users.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * Input telemetry can be consumed from multiple sources; for example HDFS
>>  > >> > >> > > > > > > > > > > > or the local file system.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * Input telemetry can be consumed in multiple formats; for example JSON
>>  > >> > >> > > > > > > > > > > > or ORC.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * The 'output' profile measurements are persisted in HBase and is
>>  > >> > >> > > > > > > > > > > > consistent with the Storm Profiler.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * It can be run on any underlying engine supported by Spark. I have
>>  > >> > >> > > > > > > > > > > > tested it both in 'local' mode and on a YARN cluster.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * It is installed automatically by the Metron MPack.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * A README was added that documents usage instructions.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * The existing Profiler code was refactored so that as much code as
>>  > >> > >> > > > > > > > > > > > possible is shared between the 3 Profiler ports; Storm, the Stellar REPL,
>>  > >> > >> > > > > > > > > > > > and Spark. For example, the logic which determines the timestamp of a
>>  > >> > >> > > > > > > > > > > > message was refactored so that it could be reused by all ports.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * metron-profiler-common: The common Profiler code shared amongst
>>  > >> > >> > > > > > > > > > > > each port.
>>  > >> > >> > > > > > > > > > > > * metron-profiler-storm: Profiler on Storm
>>  > >> > >> > > > > > > > > > > > * metron-profiler-spark: Profiler on Spark
>>  > >> > >> > > > > > > > > > > > * metron-profiler-repl: Profiler on the Stellar REPL
>>  > >> > >> > > > > > > > > > > > * metron-profiler-client: The client code for retrieving profile
>>  > >> > >> > > > > > > > > > > > data; for example PROFILE_GET.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * There are 3 separate RPM and DEB packages now created for the
>>  > >> > >> > > > > > > > > > > > Profiler.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * metron-profiler-storm-*.rpm
>>  > >> > >> > > > > > > > > > > > * metron-profiler-spark-*.rpm
>>  > >> > >> > > > > > > > > > > > * metron-profiler-repl-*.rpm
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * The Profiler integration tests were enhanced to leverage the Profiler
>>  > >> > >> > > > > > > > > > > > Client logic to validate the results.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * Review METRON-1699 [1] for a complete break-down of the tasks that have
>>  > >> > >> > > > > > > > > > > > been completed on the feature branch.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > (Q) What limitations exist?
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * You must manually install Spark to use the Batch Profiler. The Metron
>>  > >> > >> > > > > > > > > > > > MPack does not treat Spark as a Metron dependency and so does not install
>>  > >> > >> > > > > > > > > > > > it automatically.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * You do not configure the Batch Profiler in Ambari. It is configured
>>  > >> > >> > > > > > > > > > > > and executed completely from the command-line.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * To run the Batch Profiler in 'Full Dev', you have to take the following
>>  > >> > >> > > > > > > > > > > > manual steps. Some of these are arguably limitations with how Ambari
>>  > >> > >> > > > > > > > > > > > installs Spark 2 in the version of HDP that we run.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > 1. Install Spark 2 using Ambari.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > 2. Tell Spark how to talk with HBase.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > SPARK_HOME=/usr/hdp/current/spark2-client
>>  > >> > >> > > > > > > > > > > > cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > 3. Create the Spark History directory in HDFS.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > export HADOOP_USER_NAME=hdfs
>>  > >> > >> > > > > > > > > > > > hdfs dfs -mkdir /spark2-history
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > 4. Change the default input path to `hdfs://localhost:8020/...` to
>>  > >> > >> > > > > > > > > > > > match the port defined by HDP, instead of port 9000.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
>>  > >> > >> > > > > > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
>>  > >> > >> > > > > > > > > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
>>  > >> > >> > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > -------------------
>>  > >> > >> > > > > > > > > > > Thank you,
>>  > >> > >> > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > James Sirota
>>  > >> > >> > > > > > > > > > > PMC- Apache Metron
>>  > >> > >> > > > > > > > > > > jsirota AT apache DOT org
>>  > >> > >> > > > > > > > > > >
>>  > >> > >> > > > > > > > > > >

------------------- 
Thank you,

James Sirota
PMC- Apache Metron
jsirota AT apache DOT org
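
Two points from the quoted discussion above lend themselves to a small sketch: constraining the Batch Profiler to a time range (the proposed --begin and --end arguments, later tracked as METRON-1787), and the "last write wins" behaviour when batch and streaming output overlap. The Python below is only an illustration under assumed names. `in_window`, `select_telemetry`, `write_all`, the half-open window, the "timestamp" field, and the (profile, entity, period) key are all hypothetical; they do not describe how Metron implements either feature or how it lays out rows in HBase.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Key = Tuple[str, str, int]  # (profile, entity, period): a hypothetical row key


def in_window(ts: int, begin: Optional[int], end: Optional[int]) -> bool:
    """True when begin <= ts < end; a bound of None is treated as open."""
    return (begin is None or ts >= begin) and (end is None or ts < end)


def select_telemetry(messages: Iterable[dict],
                     begin: Optional[int] = None,
                     end: Optional[int] = None) -> List[dict]:
    """Keep only messages whose 'timestamp' falls inside the window."""
    return [m for m in messages if in_window(m["timestamp"], begin, end)]


def write_all(store: Dict[Key, float],
              writes: Iterable[Tuple[Key, float]]) -> None:
    """Apply writes in order; a later write to the same key replaces the earlier one."""
    for key, value in writes:
        store[key] = value


# Telemetry archived since t0 = 0; the profile was installed at tm = 30.
telemetry = [{"timestamp": t} for t in (0, 10, 20, 30, 40)]
backfill = select_telemetry(telemetry, begin=0, end=30)
print([m["timestamp"] for m in backfill])  # [0, 10, 20]

# Overlap: the streaming job wrote period 3 first, then the batch job wrote it again.
store: Dict[Key, float] = {}
write_all(store, [(("example", "10.0.0.1", 3), 7.0)])   # streaming
write_all(store, [(("example", "10.0.0.1", 3), 7.0),    # batch, same period
                  (("example", "10.0.0.1", 2), 4.0)])   # batch, backfilled period
print(len(store))  # 2
```

With a window of [t0, tm) the batch run touches only the periods that were never profiled, so the overlap that "last write wins" has to absorb shrinks to whatever falls near tm.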

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Ryan Merriman <me...@gmail.com>.

+1 from me.  Great work.

On Thu, Sep 27, 2018 at 12:41 PM Justin Leet <ju...@gmail.com> wrote:

> I'm +1 on merging the feature branch into master. There's a lot of good
> work here, and it's definitely been nice to see the couple remaining
> improvements make it in.
>
> Thanks a lot for the contribution, this is great stuff!

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Justin Leet <ju...@gmail.com>.

I'm +1 on merging the feature branch into master. There's a lot of good
work here, and it's definitely been nice to see the couple remaining
improvements make it in.

Thanks a lot for the contribution, this is great stuff!

On Wed, Sep 26, 2018 at 6:26 PM Nick Allen <ni...@nickallen.org> wrote:

> Or support to be offered for merging this feature branch into master?
>
> On Wed, Sep 26, 2018 at 6:20 PM Nick Allen <ni...@nickallen.org> wrote:
>
> > Thanks for the review.  With https://github.com/apache/metron/pull/1209
> > complete, I think the feature branch is ready to be merged.  Sounds like
> > I have Mike's support.  Anyone else have comments, concerns, questions?
> >
> > On Tue, Sep 25, 2018 at 10:33 PM Michael Miklavcic <
> > michael.miklavcic@gmail.com> wrote:
> >
> >> I just made a couple minor comments on that PR, and I am in agreement
> >> about the readiness for merging with master. Good stuff Nick.
> >>
> >> On Fri, Sep 21, 2018 at 12:37 PM Nick Allen <ni...@nickallen.org> wrote:
> >>
> >> > Here is a PR that adds the input time constraints to the Batch
> >> > Profiler (METRON-1787);  https://github.com/apache/metron/pull/1209.
> >> >
> >> > It seems that the consensus is that this is probably the last feature
> >> > we need before merging the FB into master.  The other two can wait
> >> > until after the feature branch has been merged.  Let me know if you
> >> > disagree.
> >> >
> >> > Thanks
> >> >
> >> >
> >> > On Thu, Sep 20, 2018 at 1:55 PM Nick Allen <ni...@nickallen.org> wrote:
> >> >
> >> > > Yeah, agreed.  Per use case 3, when deploying to production there
> >> > > really wouldn't be a huge overlap like 3 months of already profiled
> >> > > data.  It's day 1, the profile was just deployed around the same time
> >> > > as you are running the Batch Profiler, so the overlap is in minutes,
> >> > > maybe hours.  But I can definitely see the usefulness of the feature
> >> > > for re-runs, etc. as you have described.
> >> > >
> >> > > Based on this discussion, I created a few JIRAs.  Thanks all for the
> >> > > great feedback and keep it coming.
> >> > >
> >> > > [1] METRON-1787 - Input Time Constraints for Batch Profiler
> >> > > [2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
> >> > > [3] METRON-1789 - MPack Should Define Default Input Path for Batch
> >> > > Profiler
> >> > >
> >> > >
> >> > > --
> >> > > [1] https://issues.apache.org/jira/browse/METRON-1787
> >> > > [2] https://issues.apache.org/jira/browse/METRON-1788
> >> > > [3] https://issues.apache.org/jira/browse/METRON-1789
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <
> >> > > michael.miklavcic@gmail.com> wrote:
> >> > >
> >> > >> I think we might want to allow the flexibility to choose the date
> >> > >> range then. I don't yet feel like I have a good enough understanding
> >> > >> of all the ways in which users would want to seed to force them to
> >> > >> run the batch job over all the data. It might also make it easier to
> >> > >> deal with remediation, i.e. an error doesn't force you to re-run over
> >> > >> the entire history. Same goes for testing out the profile seeding
> >> > >> batch job in the first place.
> >> > >>
> >> > >> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <ni...@nickallen.org> wrote:
> >> > >>
> >> > >> > Assuming you have 9 months of data archived, yes.
> >> > >> >
> >> > >> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <
> >> > >> > michael.miklavcic@gmail.com> wrote:
> >> > >> >
> >> > >> > > So in the case of 3 - if you had 6 months of data that hadn't
> >> > >> > > been profiled and another 3 that had been profiled (9 months
> >> > >> > > total data), in its current form the batch job runs over all 9
> >> > >> > > months?
> >> > >> > >
> >> > >> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <nick@nickallen.org> wrote:
> >> > >> > >
> >> > >> > > > > How do we establish "tm" from 1.1 above? Any concerns about
> >> > >> > > > > overlap or gaps after the seeding is performed?
> >> > >> > > >
> >> > >> > > > Good point.  Right now, if the Streaming and Batch Profiler
> >> > >> > > > overlap, the last write wins.  And presumably the output of the
> >> > >> > > > Streaming and Batch Profiler are the same, so no worries,
> >> > >> > > > right? :)
> >> > >> > > >
> >> > >> > > > So it kind of works, but it is definitely not ideal for use
> >> > >> > > > case 3.  I could add --begin and --end args to constrain the
> >> > >> > > > time frame over which the Batch Profiler runs.  I do not have
> >> > >> > > > that in the feature branch.  It would be easy enough to add
> >> > >> > > > though.
> >> > >> > > >
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <
> >> > >> > > > michael.miklavcic@gmail.com> wrote:
> >> > >> > > >
> >> > >> > > > > Ok, makes sense. That's sort of what I was thinking as
> well,
> >> > Nick.
> >> > >> > > > Pulling
> >> > >> > > > > at this thread just a bit more...
> >> > >> > > > >
> >> > >> > > > >    1. I have an existing system that's been up a while,
> and I
> >> > have
> >> > >> > > added
> >> > >> > > > k
> >> > >> > > > >    profiles - assume these are the first profiles I've
> >> created.
> >> > >> > > > >       1. I would have t0 - tm (where m is the time when the
> >> > >> profiles
> >> > >> > > were
> >> > >> > > > >       first installed) worth of data that has not been
> >> profiled
> >> > >> yet.
> >> > >> > > > >       2. The batch profiler process would be to take that
> >> exact
> >> > >> > profile
> >> > >> > > > >       definition from ZK and run the batch loader with that
> >> from
> >> > >> the
> >> > >> > > CLI.
> >> > >> > > > >       3. Profiles are now up to date from t0 - tCurrent
> >> > >> > > > >    2. I've already done #1 above. Time goes by and now I
> >> want to
> >> > >> add
> >> > >> > a
> >> > >> > > > new
> >> > >> > > > >    profile.
> >> > >> > > > >       1. Same first step above
> >> > >> > > > >       2. I would run the batch loader with *only* that new
> >> > profile
> >> > >> > > > >       definition to seed?
> >> > >> > > > >
> >> > >> > > > > Forgive me if I missed this in PR's and discussion in the
> FB,
> >> > but
> >> > >> how
> >> > >> > > do
> >> > >> > > > we
> >> > >> > > > > establish "tm" from 1.1 above? Any concerns about overlap
> or
> >> > gaps
> >> > >> > after
> >> > >> > > > the
> >> > >> > > > > seeding is performed?
> >> > >> > > > >
> >> > >> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <
> >> nick@nickallen.org
> >> > >
> >> > >> > > wrote:
> >> > >> > > > >
> >> > >> > > > > > I think more often than not, you would want to load your
> >> > profile
> >> > >> > > > > definition
> >> > >> > > > > > from a file.  This is why I considered the 'load from Zk'
> >> more
> >> > >> of a
> >> > >> > > > > > nice-to-have.
> >> > >> > > > > >
> >> > >> > > > > >    - In use case 1 and 2, this would definitely be the
> >> case.
> >> > >> The
> >> > >> > > > > profiles
> >> > >> > > > > >    I am working with are speculative and I am using the
> >> batch
> >> > >> > > profiler
> >> > >> > > > to
> >> > >> > > > > >    determine if they are worth keeping.  In this case, my
> >> > >> > speculative
> >> > >> > > > > > profiles
> >> > >> > > > > >    would not be in Zk (yet).
> >> > >> > > > > >    - In use case 3, I could see it go either way.  It
> >> might be
> >> > >> > useful
> >> > >> > > > to
> >> > >> > > > > >    load from Zk, but it certainly isn't a blocker.
> >> > >> > > > > >
> >> > >> > > > > >
> >> > >> > > > > > > So if the config does not correctly match the profiler
> >> > config
> >> > >> > held
> >> > >> > > in
> >> > >> > > > > ZK
> >> > >> > > > > > and
> >> > >> > > > > > the user runs the batch seeding job, what happens?
> >> > >> > > > > >
> >> > >> > > > > > You would just get a profile that is slightly different
> >> over
> >> > the
> >> > >> > > entire
> >> > >> > > > > > time span.  This is not a new risk.  If the user changes
> >> their
> >> > >> > > Profile
> >> > >> > > > > > definitions in Zk, the same thing would happen.
> >> > >> > > > > >
> >> > >> > > > > >
> >> > >> > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <
> >> > >> > > > > > michael.miklavcic@gmail.com> wrote:
> >> > >> > > > > >
> >> > >> > > > > > > I think I'm torn on this, specifically because it's
> batch
> >> > and
> >> > >> > would
> >> > >> > > > > > > generally be run as-needed. Justin, can you elaborate
> on
> >> > your
> >> > >> > > > concerns
> >> > >> > > > > > > there? This feels functionally very similar to our flat
> >> file
> >> > >> > > loaders,
> >> > >> > > > > > which
> >> > >> > > > > > > all have inputs for config from the CLI only. On the
> >> other
> >> > >> hand,
> >> > >> > > our
> >> > >> > > > > flat
> >> > >> > > > > > > file loaders are not typically seeding an existing
> >> > structure.
> >> > >> My
> >> > >> > > > > concern
> >> > >> > > > > > of
> >> > >> > > > > > > a local file profiler config stems from this stated
> goal:
> >> > >> > > > > > > > The goal would be to enable “profile seeding” which
> >> allows
> >> > >> > > profiles
> >> > >> > > > > to
> >> > >> > > > > > be
> >> > >> > > > > > > populated from a time before the profile was created.
> >> > >> > > > > > > So if the config does not correctly match the profiler
> >> > config
> >> > >> > held
> >> > >> > > in
> >> > >> > > > > ZK
> >> > >> > > > > > > and the user runs the batch seeding job, what happens?
> >> > >> > > > > > >
> >> > >> > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <
> >> > >> > > justinjleet@gmail.com>
> >> > >> > > > > > > wrote:
> >> > >> > > > > > >
> >> > >> > > > > > > > The profile not being able to read from ZK feels
> like a
> >> > >> fairly
> >> > >> > > > > > > substantial,
> >> > >> > > > > > > > if subtle, set of potential problems.  I'd like to
> see
> >> > that
> >> > >> in
> >> > >> > > > either
> >> > >> > > > > > > > before merging or at least pretty soon after merging.
> >> Is
> >> > >> it a
> >> > >> > > lot
> >> > >> > > > of
> >> > >> > > > > > > work
> >> > >> > > > > > > > to add that functionality based on where things are
> >> right
> >> > >> now?
> >> > >> > > > > > > >
> >> > >> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <
> >> > >> nick@nickallen.org
> >> > >> > >
> >> > >> > > > > wrote:
> >> > >> > > > > > > >
> >> > >> > > > > > > > > Here is another limitation that I just thought of. It can only read a profile
> >> > >> > > > > > > > > definition from a file.  It probably also makes sense to add an option that
> >> > >> > > > > > > > > allows it to read the current Profiler configuration from Zookeeper.
> >> > >> > > > > > > > >
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > > Is it worth setting up a default config that
> pulls
> >> > from
> >> > >> the
> >> > >> > > > main
> >> > >> > > > > > > > indexing
> >> > >> > > > > > > > > output?
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > Yes, I think that makes sense.  We want the Batch
> >> > >> Profiler to
> >> > >> > > > point
> >> > >> > > > > > to
> >> > >> > > > > > > > the
> >> > >> > > > > > > > > right HDFS URL, no matter where/how Metron is
> >> deployed.
> >> > >> When
> >> > >> > > > > Metron
> >> > >> > > > > > > gets
> >> > >> > > > > > > > > spun-up on a cluster, I should be able to just run
> >> the
> >> > >> Batch
> >> > >> > > > > Profiler
> >> > >> > > > > > > > > without having to fuss with the input path.
> >> > >> > > > > > > > >
> >> > >> > > > > > > > >
> >> > >> > > > > > > > >
> >> > >> > > > > > > > >
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <
> >> > >> > > > justinjleet@gmail.com
> >> > >> > > > > >
> >> > >> > > > > > > > wrote:
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > > Re:
> >> > >> > > > > > > > > >
> >> > >> > > > > > > > > > >  * You do not configure the Batch Profiler in
> >> > >> Ambari.  It
> >> > >> > > is
> >> > >> > > > > > > > configured
> >> > >> > > > > > > > > > > and executed completely from the command-line.
> >> > >> > > > > > > > > > >
> >> > >> > > > > > > > > >
> >> > >> > > > > > > > > > Is it worth setting up a default config that
> pulls
> >> > from
> >> > >> the
> >> > >> > > > main
> >> > >> > > > > > > > indexing
> >> > >> > > > > > > > > > output?  I'm a little on the fence about it, but
> it
> >> > >> seems
> >> > >> > > like
> >> > >> > > > > > making
> >> > >> > > > > > > > the
> >> > >> > > > > > > > > > most common case more or less built-in would be
> >> nice.
> >> > >> > > > > > > > > >
> >> > >> > > > > > > > > > Having said that, I do not consider that a
> >> requirement
> >> > >> for
> >> > >> > > > > merging
> >> > >> > > > > > > the
> >> > >> > > > > > > > > > feature branch.
> >> > >> > > > > > > > > >
> >> > >> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <
> >> > >> > > > > jsirota@apache.org>
> >> > >> > > > > > > > > wrote:
> >> > >> > > > > > > > > >
> >> > >> > > > > > > > > > > I think what you have outlined above is a good
> >> > initial
> >> > >> > stab
> >> > >> > > > at
> >> > >> > > > > > the
> >> > >> > > > > > > > > > > feature.  Manual install of spark is not a big
> >> deal.
> >> > >> > > > > Configuring
> >> > >> > > > > > > via
> >> > >> > > > > > > > > > > command line while we mature this feature is ok
> >> as
> >> > >> well.
> >> > >> > > > > Doesn't
> >> > >> > > > > > > > look
> >> > >> > > > > > > > > > like
> >> > >> > > > > > > > > > > configuration steps are too hard.  I think you
> >> > should
> >> > >> > > merge.
> >> > >> > > > > > > > > > >
> >> > >> > > > > > > > > > > James
> >> > >> > > > > > > > > > >
> >> > >> > > > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <
> >> nick@nickallen.org
> >> > >:
> >> > >> > > > > > > > > > > > I would like to open a discussion to get the
> >> Batch
> >> > >> > > Profiler
> >> > >> > > > > > > feature
> >> > >> > > > > > > > > > > branch
> >> > >> > > > > > > > > > > > merged into master as part of METRON-1699 [1]
> >> > Create
> >> > >> > > Batch
> >> > >> > > > > > > > Profiler.
> >> > >> > > > > > > > > > All
> >> > >> > > > > > > > > > > > of the work that I had in mind for our first
> >> draft
> >> > >> of
> >> > >> > the
> >> > >> > > > > Batch
> >> > >> > > > > > > > > > Profiler
> >> > >> > > > > > > > > > > > has been completed. Please take a look
> through
> >> > what
> >> > >> I
> >> > >> > > have
> >> > >> > > > > and
> >> > >> > > > > > > let
> >> > >> > > > > > > > me
> >> > >> > > > > > > > > > > know
> >> > >> > > > > > > > > > > > if there are other features that you think
> are
> >> > >> required
> >> > >> > > > > > *before*
> >> > >> > > > > > > we
> >> > >> > > > > > > > > > > merge.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > > Previous list discussions on this topic
> include
> >> > [2]
> >> > >> and
> >> > >> > > > [3].
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > > (Q) What can I do with the feature branch?
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * With the Batch Profiler, you can
> >> backfill/seed
> >> > >> > > profiles
> >> > >> > > > > > using
> >> > >> > > > > > > > > > > archived
> >> > >> > > > > > > > > > > > telemetry. This enables the following types
> of
> >> use
> >> > >> > cases.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >       1. As a Security Data Scientist, I want
> >> to
> >> > >> > > understand
> >> > >> > > > > the
> >> > >> > > > > > > > > > > historical
> >> > >> > > > > > > > > > > > behaviors and trends of a profile that I have
> >> > >> created
> >> > >> > so
> >> > >> > > > > that I
> >> > >> > > > > > > can
> >> > >> > > > > > > > > > > > determine if I have created a feature set
> that
> >> has
> >> > >> > > > predictive
> >> > >> > > > > > > value
> >> > >> > > > > > > > > for
> >> > >> > > > > > > > > > > > model building.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >       2. As a Security Data Scientist, I want
> >> to
> >> > >> > > understand
> >> > >> > > > > the
> >> > >> > > > > > > > > > > historical
> >> > >> > > > > > > > > > > > behaviors and trends of a profile that I have
> >> > >> created
> >> > >> > so
> >> > >> > > > > that I
> >> > >> > > > > > > can
> >> > >> > > > > > > > > > > > determine if I have defined the profile
> >> correctly
> >> > >> and
> >> > >> > > > > created a
> >> > >> > > > > > > > > feature
> >> > >> > > > > > > > > > > set
> >> > >> > > > > > > > > > > > that matches reality.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >       3. As a Security Platform Engineer, I
> >> want
> >> > to
> >> > >> > > > generate
> >> > >> > > > > a
> >> > >> > > > > > > > > profile
> >> > >> > > > > > > > > > > > using archived telemetry when I deploy a new
> >> model
> >> > >> to
> >> > >> > > > > > production
> >> > >> > > > > > > so
> >> > >> > > > > > > > > > that
> >> > >> > > > > > > > > > > > models depending on that profile can function
> >> on
> >> > >> day 1.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * METRON-1699 [1] includes a more detailed
> >> > >> > description
> >> > >> > > of
> >> > >> > > > > the
> >> > >> > > > > > > > > > feature.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > > (Q) What work was completed?
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * The Batch Profiler runs on Spark and was
> >> > >> > implemented
> >> > >> > > in
> >> > >> > > > > > Java
> >> > >> > > > > > > to
> >> > >> > > > > > > > > > > remain
> >> > >> > > > > > > > > > > > consistent with our current Java-heavy code
> >> base.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * The Batch Profiler is executed from the
> >> > >> > command-line.
> >> > >> > > > It
> >> > >> > > > > > can
> >> > >> > > > > > > be
> >> > >> > > > > > > > > > > > launched using a script or by calling
> >> > >> `spark-submit`,
> >> > >> > > which
> >> > >> > > > > may
> >> > >> > > > > > > be
> >> > >> > > > > > > > > > useful
> >> > >> > > > > > > > > > > > for advanced users.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * Input telemetry can be consumed from
> >> multiple
> >> > >> > > sources;
> >> > >> > > > > for
> >> > >> > > > > > > > > example
> >> > >> > > > > > > > > > > HDFS
> >> > >> > > > > > > > > > > > or the local file system.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * Input telemetry can be consumed in
> multiple
> >> > >> > formats;
> >> > >> > > > for
> >> > >> > > > > > > > example
> >> > >> > > > > > > > > > JSON
> >> > >> > > > > > > > > > > > or ORC.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * The 'output' profile measurements are persisted in HBase and are
> >> > >> > > > > > > > > > > > consistent with the Storm Profiler.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * It can be run on any underlying engine
> >> > >> supported by
> >> > >> > > > > Spark.
> >> > >> > > > > > I
> >> > >> > > > > > > > have
> >> > >> > > > > > > > > > > > tested it both in 'local' mode and on a YARN
> >> > >> cluster.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * It is installed automatically by the
> Metron
> >> > >> MPack.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * A README was added that documents usage
> >> > >> > instructions.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * The existing Profiler code was refactored
> >> so
> >> > >> that
> >> > >> > as
> >> > >> > > > much
> >> > >> > > > > > > code
> >> > >> > > > > > > > as
> >> > >> > > > > > > > > > > > possible is shared between the 3 Profiler
> >> ports;
> >> > >> Storm,
> >> > >> > > the
> >> > >> > > > > > > Stellar
> >> > >> > > > > > > > > > REPL,
> >> > >> > > > > > > > > > > > and Spark. For example, the logic which
> >> determines
> >> > >> the
> >> > >> > > > > > timestamp
> >> > >> > > > > > > > of a
> >> > >> > > > > > > > > > > > message was refactored so that it could be
> >> reused
> >> > by
> >> > >> > all
> >> > >> > > > > ports.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >       * metron-profiler-common: The common
> >> > Profiler
> >> > >> > code
> >> > >> > > > > shared
> >> > >> > > > > > > > > amongst
> >> > >> > > > > > > > > > > > each port.
> >> > >> > > > > > > > > > > >       * metron-profiler-storm: Profiler on
> >> Storm
> >> > >> > > > > > > > > > > >       * metron-profiler-spark: Profiler on
> >> Spark
> >> > >> > > > > > > > > > > >       * metron-profiler-repl: Profiler on the
> >> > >> Stellar
> >> > >> > > REPL
> >> > >> > > > > > > > > > > >       * metron-profiler-client: The client
> code
> >> > for
> >> > >> > > > > retrieving
> >> > >> > > > > > > > > profile
> >> > >> > > > > > > > > > > > data; for example PROFILE_GET.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * There are 3 separate RPM and DEB packages
> >> now
> >> > >> > created
> >> > >> > > > for
> >> > >> > > > > > the
> >> > >> > > > > > > > > > > Profiler.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >       * metron-profiler-storm-*.rpm
> >> > >> > > > > > > > > > > >       * metron-profiler-spark-*.rpm
> >> > >> > > > > > > > > > > >       * metron-profiler-repl-*.rpm
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * The Profiler integration tests were
> >> enhanced
> >> > to
> >> > >> > > > leverage
> >> > >> > > > > > the
> >> > >> > > > > > > > > > Profiler
> >> > >> > > > > > > > > > > > Client logic to validate the results.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * Review METRON-1699 [1] for a complete
> >> > >> break-down of
> >> > >> > > the
> >> > >> > > > > > tasks
> >> > >> > > > > > > > > that
> >> > >> > > > > > > > > > > have
> >> > >> > > > > > > > > > > > been completed on the feature branch.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > > (Q) What limitations exist?
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * You must manually install Spark to use
> the
> >> > Batch
> >> > >> > > > > Profiler.
> >> > >> > > > > > > The
> >> > >> > > > > > > > > > Metron
> >> > >> > > > > > > > > > > > MPack does not treat Spark as a Metron
> >> dependency
> >> > >> and
> >> > >> > so
> >> > >> > > > does
> >> > >> > > > > > not
> >> > >> > > > > > > > > > install
> >> > >> > > > > > > > > > > > it automatically.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * You do not configure the Batch Profiler
> in
> >> > >> Ambari.
> >> > >> > It
> >> > >> > > > is
> >> > >> > > > > > > > > configured
> >> > >> > > > > > > > > > > > and executed completely from the
> command-line.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >   * To run the Batch Profiler in 'Full Dev',
> >> you
> >> > >> have
> >> > >> > to
> >> > >> > > > take
> >> > >> > > > > > the
> >> > >> > > > > > > > > > > following
> >> > >> > > > > > > > > > > > manual steps. Some of these are arguably
> >> > limitations
> >> > >> > with
> >> > >> > > > how
> >> > >> > > > > > > > Ambari
> >> > >> > > > > > > > > > > > installs Spark 2 in the version of HDP that
> we
> >> > run.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >       1. Install Spark 2 using Ambari.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >       2. Tell Spark how to talk with HBase.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
> >> > >> > > > > > > > > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >       3. Create the Spark History directory in HDFS.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >         export HADOOP_USER_NAME=hdfs
> >> > >> > > > > > > > > > > >         hdfs dfs -mkdir /spark2-history
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > >       4. Change the default input path to `hdfs://localhost:8020/...` to
> >> > >> > > > > > > > > > > > match the port defined by HDP, instead of port 9000.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> >> > >> > > > > > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> >> > >> > > > > > > > > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> >> > >> > > > > > > > > > >
> >> > >> > > > > > > > > > > -------------------
> >> > >> > > > > > > > > > > Thank you,
> >> > >> > > > > > > > > > >
> >> > >> > > > > > > > > > > James Sirota
> >> > >> > > > > > > > > > > PMC- Apache Metron
> >> > >> > > > > > > > > > > jsirota AT apache DOT org
> >> > >> > > > > > > > > > >
>
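
The time constraint discussed in this thread (the proposed --begin and --end
arguments, tracked as METRON-1787) amounts to dropping any telemetry whose
timestamp falls outside a window before it is profiled. The Java below is only
a minimal sketch of that idea and is not taken from the feature branch or from
the pull request; the class name TimeWindow and its methods are hypothetical,
timestamps are assumed to be epoch milliseconds, and both bounds are assumed
to be inclusive and optional.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper (not Metron code): an optional, inclusive time window. */
final class TimeWindow {
    private final Instant begin; // inclusive lower bound; null means unbounded
    private final Instant end;   // inclusive upper bound; null means unbounded

    TimeWindow(Instant begin, Instant end) {
        if (begin != null && end != null && end.isBefore(begin)) {
            throw new IllegalArgumentException("end must not precede begin");
        }
        this.begin = begin;
        this.end = end;
    }

    /** True if the epoch-millisecond timestamp lies inside the window. */
    boolean contains(long epochMillis) {
        Instant ts = Instant.ofEpochMilli(epochMillis);
        return (begin == null || !ts.isBefore(begin))
            && (end == null || !ts.isAfter(end));
    }

    /** Keeps only the timestamps that a windowed batch run would profile. */
    List<Long> filter(List<Long> timestamps) {
        return timestamps.stream()
            .filter(this::contains)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed example window: only September 2018 telemetry is seeded.
        TimeWindow window = new TimeWindow(
            Instant.parse("2018-09-01T00:00:00Z"),
            Instant.parse("2018-09-20T00:00:00Z"));

        long inside = Instant.parse("2018-09-10T12:00:00Z").toEpochMilli();
        long outside = Instant.parse("2018-06-01T00:00:00Z").toEpochMilli();

        // Only the first timestamp survives the filter.
        System.out.println(window.filter(Arrays.asList(inside, outside)).size());
    }
}
```

Restricting a seeding run to the period that has not been profiled yet (t0 to
tm in the thread) would keep the overlap with the Streaming Profiler small,
and a failed run could be repeated over just the affected range rather than
the entire archive.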

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Nick Allen <ni...@nickallen.org>.

Or support to be offered for merging this feature branch into master?

On Wed, Sep 26, 2018 at 6:20 PM Nick Allen <ni...@nickallen.org> wrote:

> Thanks for the review.  With https://github.com/apache/metron/pull/1209 complete,
> I think the feature branch is ready to be merged.  Sounds like I have
> Mike's support.  Anyone else have comments, concerns, questions?
>
> On Tue, Sep 25, 2018 at 10:33 PM Michael Miklavcic <
> michael.miklavcic@gmail.com> wrote:
>
>> I just made a couple minor comments on that PR, and I am in agreement about
>> the readiness for merging with master. Good stuff Nick.
>>
>> On Fri, Sep 21, 2018 at 12:37 PM Nick Allen <ni...@nickallen.org> wrote:
>>
>> > Here is a PR that adds the input time constraints to the Batch Profiler
>> > (METRON-1787): https://github.com/apache/metron/pull/1209.
>> >
>> > It seems that the consensus is that this is probably the last feature we
>> > need before merging the FB into master.  The other two can wait until after
>> > the feature branch has been merged.  Let me know if you disagree.
>> >
>> > Thanks
>> >
>> >
>> > On Thu, Sep 20, 2018 at 1:55 PM Nick Allen <ni...@nickallen.org> wrote:
>> >
>> > > Yeah, agreed.  Per use case 3, when deploying to production there really
>> > > wouldn't be a huge overlap like 3 months of already profiled data.  It's day
>> > > 1, the profile was just deployed around the same time as you are running
>> > > the Batch Profiler, so the overlap is in minutes, maybe hours.  But I can
>> > > definitely see the usefulness of the feature for re-runs, etc. as you have
>> > > described.
>> > >
>> > > Based on this discussion, I created a few JIRAs.  Thanks all for the great
>> > > feedback and keep it coming.
>> > >
>> > > [1] METRON-1787 - Input Time Constraints for Batch Profiler
>> > > [2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
>> > > [3] METRON-1789 - MPack Should Define Default Input Path for Batch
>> > > Profiler
>> > >
>> > >
>> > > --
>> > > [1] https://issues.apache.org/jira/browse/METRON-1787
>> > > [2] https://issues.apache.org/jira/browse/METRON-1788
>> > > [3] https://issues.apache.org/jira/browse/METRON-1789
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <
>> > > michael.miklavcic@gmail.com> wrote:
>> > >
>> > >> I think we might want to allow the flexibility to choose the date range
>> > >> then. I don't yet feel like I have a good enough understanding of all the
>> > >> ways in which users would want to seed to force them to run the batch job
>> > >> over all the data. It might also make it easier to deal with remediation,
>> > >> i.e. an error doesn't force you to re-run over the entire history. Same goes
>> > >> for testing out the profile seeding batch job in the first place.
>> > >>
>> > >> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <ni...@nickallen.org>
>> wrote:
>> > >>
>> > >> > Assuming you have 9 months of data archived, yes.
>> > >> >
>> > >> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <
>> > >> > michael.miklavcic@gmail.com> wrote:
>> > >> >
>> > >> > > So in the case of 3 - if you had 6 months of data that hadn't been profiled
>> > >> > > and another 3 that had been profiled (9 months total data), in its current
>> > >> > > form the batch job runs over all 9 months?
>> > >> > >
>> > >> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <ni...@nickallen.org>
>> > >> wrote:
>> > >> > >
>> > >> > > > > How do we establish "tm" from 1.1 above? Any concerns about overlap or
>> > >> > > > > gaps after the seeding is performed?
>> > >> > > >
>> > >> > > > Good point.  Right now, if the Streaming and Batch Profiler overlap the
>> > >> > > > last write wins.  And presumably the output of the Streaming and Batch
>> > >> > > > Profiler are the same, so no worries, right? :)
>> > >> > > >
>> > >> > > > So it kind of works, but it is definitely not ideal for use case 3.  I
>> > >> > > > could add --begin and --end args to constrain the time frame over which the
>> > >> > > > Batch Profiler runs.  I do not have that in the feature branch.  It would
>> > >> > > > be easy enough to add though.
>> > >> > > >
>> > >> > > >
>> > >> > > >
>> > >> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <
>> > >> > > > michael.miklavcic@gmail.com> wrote:
>> > >> > > >
>> > >> > > > > Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling
>> > >> > > > > at this thread just a bit more...
>> > >> > > > >
>> > >> > > > >    1. I have an existing system that's been up a while, and I have added k
>> > >> > > > >    profiles - assume these are the first profiles I've created.
>> > >> > > > >       1. I would have t0 - tm (where m is the time when the profiles were
>> > >> > > > >       first installed) worth of data that has not been profiled yet.
>> > >> > > > >       2. The batch profiler process would be to take that exact profile
>> > >> > > > >       definition from ZK and run the batch loader with that from the CLI.
>> > >> > > > >       3. Profiles are now up to date from t0 - tCurrent
>> > >> > > > >    2. I've already done #1 above. Time goes by and now I want to add a new
>> > >> > > > >    profile.
>> > >> > > > >       1. Same first step above
>> > >> > > > >       2. I would run the batch loader with *only* that new profile
>> > >> > > > >       definition to seed?
>> > >> > > > >
>> > >> > > > > Forgive me if I missed this in PR's and discussion in the FB, but how do we
>> > >> > > > > establish "tm" from 1.1 above? Any concerns about overlap or gaps after the
>> > >> > > > > seeding is performed?
>> > >> > > > >
>> > >> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <
>> nick@nickallen.org
>> > >
>> > >> > > wrote:
>> > >> > > > >
>> > >> > > > > > I think more often than not, you would want to load your profile definition
>> > >> > > > > > from a file.  This is why I considered the 'load from Zk' more of a
>> > >> > > > > > nice-to-have.
>> > >> > > > > >
>> > >> > > > > >    - In use case 1 and 2, this would definitely be the case.  The profiles
>> > >> > > > > >    I am working with are speculative and I am using the batch profiler to
>> > >> > > > > >    determine if they are worth keeping.  In this case, my speculative profiles
>> > >> > > > > >    would not be in Zk (yet).
>> > >> > > > > >    - In use case 3, I could see it go either way.  It might be useful to
>> > >> > > > > >    load from Zk, but it certainly isn't a blocker.
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > > So if the config does not correctly match the profiler config held in ZK and
>> > >> > > > > > > the user runs the batch seeding job, what happens?
>> > >> > > > > >
>> > >> > > > > > You would just get a profile that is slightly different over the entire
>> > >> > > > > > time span.  This is not a new risk.  If the user changes their Profile
>> > >> > > > > > definitions in Zk, the same thing would happen.
>> > >> > > > > >
>> > >> > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <
>> > >> > > > > > michael.miklavcic@gmail.com> wrote:
>> > >> > > > > >
>> > >> > > > > > > I think I'm torn on this, specifically because it's batch and would
>> > >> > > > > > > generally be run as-needed. Justin, can you elaborate on your concerns
>> > >> > > > > > > there? This feels functionally very similar to our flat file loaders, which
>> > >> > > > > > > all have inputs for config from the CLI only. On the other hand, our flat
>> > >> > > > > > > file loaders are not typically seeding an existing structure. My concern of
>> > >> > > > > > > a local file profiler config stems from this stated goal:
>> > >> > > > > > > > The goal would be to enable “profile seeding” which allows profiles to be
>> > >> > > > > > > > populated from a time before the profile was created.
>> > >> > > > > > > So if the config does not correctly match the profiler config held in ZK
>> > >> > > > > > > and the user runs the batch seeding job, what happens?
>> > >> > > > > > >
>> > >> > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <
>> > >> > > justinjleet@gmail.com>
>> > >> > > > > > > wrote:
>> > >> > > > > > >
>> > >> > > > > > > > The profile not being able to read from ZK feels like a fairly substantial,
>> > >> > > > > > > > if subtle, set of potential problems.  I'd like to see that in either
>> > >> > > > > > > > before merging or at least pretty soon after merging.  Is it a lot of work
>> > >> > > > > > > > to add that functionality based on where things are right now?
>> > >> > > > > > > >
>> > >> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <
>> > >> nick@nickallen.org
>> > >> > >
>> > >> > > > > wrote:
>> > >> > > > > > > >
>> > >> > > > > > > > > Here is another limitation that I just thought of. It can only read a profile
>> > >> > > > > > > > > definition from a file.  It probably also makes sense to add an option that
>> > >> > > > > > > > > allows it to read the current Profiler configuration from Zookeeper.
>> > >> > > > > > > > >
>> > >> > > > > > > > >
>> > >> > > > > > > > > > Is it worth setting up a default config that pulls
>> > from
>> > >> the
>> > >> > > > main
>> > >> > > > > > > > indexing
>> > >> > > > > > > > > output?
>> > >> > > > > > > > >
>> > >> > > > > > > > > Yes, I think that makes sense.  We want the Batch
>> > >> Profiler to
>> > >> > > > point
>> > >> > > > > > to
>> > >> > > > > > > > the
>> > >> > > > > > > > > right HDFS URL, no matter where/how Metron is
>> deployed.
>> > >> When
>> > >> > > > > Metron
>> > >> > > > > > > gets
>> > >> > > > > > > > > spun-up on a cluster, I should be able to just run
>> the
>> > >> Batch
>> > >> > > > > Profiler
>> > >> > > > > > > > > without having to fuss with the input path.
>> > >> > > > > > > > >
>> > >> > > > > > > > >
>> > >> > > > > > > > >
>> > >> > > > > > > > >
>> > >> > > > > > > > >
>> > >> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <
>> > >> > > > justinjleet@gmail.com
>> > >> > > > > >
>> > >> > > > > > > > wrote:
>> > >> > > > > > > > >
>> > >> > > > > > > > > > Re:
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > >  * You do not configure the Batch Profiler in
>> > >> Ambari.  It
>> > >> > > is
>> > >> > > > > > > > configured
>> > >> > > > > > > > > > > and executed completely from the command-line.
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > Is it worth setting up a default config that pulls
>> > from
>> > >> the
>> > >> > > > main
>> > >> > > > > > > > indexing
>> > >> > > > > > > > > > output?  I'm a little on the fence about it, but it
>> > >> seems
>> > >> > > like
>> > >> > > > > > making
>> > >> > > > > > > > the
>> > >> > > > > > > > > > most common case more or less built-in would be
>> nice.
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > Having said that, I do not consider that a
>> requirement
>> > >> for
>> > >> > > > > merging
>> > >> > > > > > > the
>> > >> > > > > > > > > > feature branch.
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <
>> > >> > > > > jsirota@apache.org>
>> > >> > > > > > > > > wrote:
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > > I think what you have outlined above is a good
>> > initial
>> > >> > stab
>> > >> > > > at
>> > >> > > > > > the
>> > >> > > > > > > > > > > feature.  Manual install of spark is not a big
>> deal.
>> > >> > > > > Configuring
>> > >> > > > > > > via
>> > >> > > > > > > > > > > command line while we mature this feature is ok
>> as
>> > >> well.
>> > >> > > > > Doesn't
>> > >> > > > > > > > look
>> > >> > > > > > > > > > like
>> > >> > > > > > > > > > > configuration steps are too hard.  I think you
>> > should
>> > >> > > merge.
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > James
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <
>> nick@nickallen.org
>> > >:
>> > >> > > > > > > > > > > > I would like to open a discussion to get the
>> Batch
>> > >> > > Profiler
>> > >> > > > > > > feature
>> > >> > > > > > > > > > > branch
>> > >> > > > > > > > > > > > merged into master as part of METRON-1699 [1]
>> > Create
>> > >> > > Batch
>> > >> > > > > > > > Profiler.
>> > >> > > > > > > > > > All
>> > >> > > > > > > > > > > > of the work that I had in mind for our first
>> draft
>> > >> of
>> > >> > the
>> > >> > > > > Batch
>> > >> > > > > > > > > > Profiler
>> > >> > > > > > > > > > > > has been completed. Please take a look through
>> > what
>> > >> I
>> > >> > > have
>> > >> > > > > and
>> > >> > > > > > > let
>> > >> > > > > > > > me
>> > >> > > > > > > > > > > know
>> > >> > > > > > > > > > > > if there are other features that you think are
>> > >> required
>> > >> > > > > > *before*
>> > >> > > > > > > we
>> > >> > > > > > > > > > > merge.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > Previous list discussions on this topic include
>> > [2]
>> > >> and
>> > >> > > > [3].
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > (Q) What can I do with the feature branch?
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * With the Batch Profiler, you can
>> backfill/seed
>> > >> > > profiles
>> > >> > > > > > using
>> > >> > > > > > > > > > > archived
>> > >> > > > > > > > > > > > telemetry. This enables the following types of
>> use
>> > >> > cases.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >       1. As a Security Data Scientist, I want
>> to
>> > >> > > understand
>> > >> > > > > the
>> > >> > > > > > > > > > > historical
>> > >> > > > > > > > > > > > behaviors and trends of a profile that I have
>> > >> created
>> > >> > so
>> > >> > > > > that I
>> > >> > > > > > > can
>> > >> > > > > > > > > > > > determine if I have created a feature set that
>> has
>> > >> > > > predictive
>> > >> > > > > > > value
>> > >> > > > > > > > > for
>> > >> > > > > > > > > > > > model building.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >       2. As a Security Data Scientist, I want
>> to
>> > >> > > understand
>> > >> > > > > the
>> > >> > > > > > > > > > > historical
>> > >> > > > > > > > > > > > behaviors and trends of a profile that I have
>> > >> created
>> > >> > so
>> > >> > > > > that I
>> > >> > > > > > > can
>> > >> > > > > > > > > > > > determine if I have defined the profile
>> correctly
>> > >> and
>> > >> > > > > created a
>> > >> > > > > > > > > feature
>> > >> > > > > > > > > > > set
>> > >> > > > > > > > > > > > that matches reality.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >       3. As a Security Platform Engineer, I
>> want
>> > to
>> > >> > > > generate
>> > >> > > > > a
>> > >> > > > > > > > > profile
>> > >> > > > > > > > > > > > using archived telemetry when I deploy a new
>> model
>> > >> to
>> > >> > > > > > production
>> > >> > > > > > > so
>> > >> > > > > > > > > > that
>> > >> > > > > > > > > > > > models depending on that profile can function
>> on
>> > >> day 1.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * METRON-1699 [1] includes a more detailed
>> > >> > description
>> > >> > > of
>> > >> > > > > the
>> > >> > > > > > > > > > feature.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > (Q) What work was completed?
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * The Batch Profiler runs on Spark and was
>> > >> > implemented
>> > >> > > in
>> > >> > > > > > Java
>> > >> > > > > > > to
>> > >> > > > > > > > > > > remain
>> > >> > > > > > > > > > > > consistent with our current Java-heavy code
>> base.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * The Batch Profiler is executed from the
>> > >> > command-line.
>> > >> > > > It
>> > >> > > > > > can
>> > >> > > > > > > be
>> > >> > > > > > > > > > > > launched using a script or by calling
>> > >> `spark-submit`,
>> > >> > > which
>> > >> > > > > may
>> > >> > > > > > > be
>> > >> > > > > > > > > > useful
>> > >> > > > > > > > > > > > for advanced users.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * Input telemetry can be consumed from
>> multiple
>> > >> > > sources;
>> > >> > > > > for
>> > >> > > > > > > > > example
>> > >> > > > > > > > > > > HDFS
>> > >> > > > > > > > > > > > or the local file system.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * Input telemetry can be consumed in multiple
>> > >> > formats;
>> > >> > > > for
>> > >> > > > > > > > example
>> > >> > > > > > > > > > JSON
>> > >> > > > > > > > > > > > or ORC.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * The 'output' profile measurements are
>> > persisted
>> > >> in
>> > >> > > > HBase
>> > >> > > > > > and
>> > >> > > > > > > is
>> > >> > > > > > > > > > > > consistent with the Storm Profiler.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * It can be run on any underlying engine
>> > >> supported by
>> > >> > > > > Spark.
>> > >> > > > > > I
>> > >> > > > > > > > have
>> > >> > > > > > > > > > > > tested it both in 'local' mode and on a YARN
>> > >> cluster.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * It is installed automatically by the Metron
>> > >> MPack.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * A README was added that documents usage
>> > >> > instructions.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * The existing Profiler code was refactored
>> so
>> > >> that
>> > >> > as
>> > >> > > > much
>> > >> > > > > > > code
>> > >> > > > > > > > as
>> > >> > > > > > > > > > > > possible is shared between the 3 Profiler
>> ports;
>> > >> Storm,
>> > >> > > the
>> > >> > > > > > > Stellar
>> > >> > > > > > > > > > REPL,
>> > >> > > > > > > > > > > > and Spark. For example, the logic which
>> determines
>> > >> the
>> > >> > > > > > timestamp
>> > >> > > > > > > > of a
>> > >> > > > > > > > > > > > message was refactored so that it could be
>> reused
>> > by
>> > >> > all
>> > >> > > > > ports.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >       * metron-profiler-common: The common
>> > Profiler
>> > >> > code
>> > >> > > > > shared
>> > >> > > > > > > > > amongst
>> > >> > > > > > > > > > > > each port.
>> > >> > > > > > > > > > > >       * metron-profiler-storm: Profiler on
>> Storm
>> > >> > > > > > > > > > > >       * metron-profiler-spark: Profiler on
>> Spark
>> > >> > > > > > > > > > > >       * metron-profiler-repl: Profiler on the
>> > >> Stellar
>> > >> > > REPL
>> > >> > > > > > > > > > > >       * metron-profiler-client: The client code
>> > for
>> > >> > > > > retrieving
>> > >> > > > > > > > > profile
>> > >> > > > > > > > > > > > data; for example PROFILE_GET.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * There are 3 separate RPM and DEB packages
>> now
>> > >> > created
>> > >> > > > for
>> > >> > > > > > the
>> > >> > > > > > > > > > > Profiler.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >       * metron-profiler-storm-*.rpm
>> > >> > > > > > > > > > > >       * metron-profiler-spark-*.rpm
>> > >> > > > > > > > > > > >       * metron-profiler-repl-*.rpm
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * The Profiler integration tests were
>> enhanced
>> > to
>> > >> > > > leverage
>> > >> > > > > > the
>> > >> > > > > > > > > > Profiler
>> > >> > > > > > > > > > > > Client logic to validate the results.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * Review METRON-1699 [1] for a complete
>> > >> break-down of
>> > >> > > the
>> > >> > > > > > tasks
>> > >> > > > > > > > > that
>> > >> > > > > > > > > > > have
>> > >> > > > > > > > > > > > been completed on the feature branch.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > (Q) What limitations exist?
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * You must manually install Spark to use the
>> > Batch
>> > >> > > > > Profiler.
>> > >> > > > > > > The
>> > >> > > > > > > > > > Metron
>> > >> > > > > > > > > > > > MPack does not treat Spark as a Metron
>> dependency
>> > >> and
>> > >> > so
>> > >> > > > does
>> > >> > > > > > not
>> > >> > > > > > > > > > install
>> > >> > > > > > > > > > > > it automatically.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * You do not configure the Batch Profiler in
>> > >> Ambari.
>> > >> > It
>> > >> > > > is
>> > >> > > > > > > > > configured
>> > >> > > > > > > > > > > > and executed completely from the command-line.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >   * To run the Batch Profiler in 'Full Dev',
>> you
>> > >> have
>> > >> > to
>> > >> > > > take
>> > >> > > > > > the
>> > >> > > > > > > > > > > following
>> > >> > > > > > > > > > > > manual steps. Some of these are arguably
>> > limitations
>> > >> > with
>> > >> > > > how
>> > >> > > > > > > > Ambari
>> > >> > > > > > > > > > > > installs Spark 2 in the version of HDP that we
>> > run.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >       1. Install Spark 2 using Ambari.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >       2. Tell Spark how to talk with HBase.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
>> > >> > > > > > > > > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >       3. Create the Spark History directory in
>> > HDFS.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >         export HADOOP_USER_NAME=hdfs
>> > >> > > > > > > > > > > >         hdfs dfs -mkdir /spark2-history
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >       4. Change the default input path to `hdfs://localhost:8020/...`
>> > >> > > > > > > > > > > >          to match the port defined by HDP, instead of port 9000.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
>> > >> > > > > > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
>> > >> > > > > > > > > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > -------------------
>> > >> > > > > > > > > > > Thank you,
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > James Sirota
>> > >> > > > > > > > > > > PMC- Apache Metron
>> > >> > > > > > > > > > > jsirota AT apache DOT org
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > >
>> > >> > > > > > > > >
>> > >> > > > > > > >
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> >
>>
>

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Nick Allen <ni...@nickallen.org>.

Thanks for the review.  With https://github.com/apache/metron/pull/1209
complete, I think the feature branch is ready to be merged.  Sounds like I
have Mike's support.  Does anyone else have comments, concerns, or questions?
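
For anyone following along, here is a minimal sketch of the idea behind the
input time constraints (METRON-1787): only telemetry whose timestamp falls
inside a begin/end window gets profiled.  This is an illustration only and is
not the code in the PR; the class name `TimeWindowFilter`, its methods, and
the use of inclusive epoch-millisecond bounds are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper, not taken from the Metron code base: keeps only the
 * message timestamps that fall inside a [begin, end] window.
 */
public class TimeWindowFilter {

    private final long beginMillis; // inclusive lower bound, epoch milliseconds
    private final long endMillis;   // inclusive upper bound, epoch milliseconds

    public TimeWindowFilter(long beginMillis, long endMillis) {
        if (beginMillis > endMillis) {
            throw new IllegalArgumentException("begin must not be after end");
        }
        this.beginMillis = beginMillis;
        this.endMillis = endMillis;
    }

    /** Returns true when the timestamp lies inside the window. */
    public boolean accept(long timestampMillis) {
        return timestampMillis >= beginMillis && timestampMillis <= endMillis;
    }

    /** Returns the accepted timestamps, preserving input order. */
    public List<Long> filter(List<Long> timestamps) {
        List<Long> kept = new ArrayList<>();
        for (long ts : timestamps) {
            if (accept(ts)) {
                kept.add(ts);
            }
        }
        return kept;
    }
}
```

Constraining the window this way would let a seeding run stop at roughly the
time the streaming profile was deployed, which keeps the overlap discussed
earlier in the thread small.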

On Tue, Sep 25, 2018 at 10:33 PM Michael Miklavcic <
michael.miklavcic@gmail.com> wrote:

> I just made a couple minor comments on that PR, and I am in agreement about
> the readiness for merging with master. Good stuff Nick.
>
> On Fri, Sep 21, 2018 at 12:37 PM Nick Allen <ni...@nickallen.org> wrote:
>
> > Here is a PR that adds the input time constraints to the Batch Profiler
> > (METRON-1787);  https://github.com/apache/metron/pull/1209.
> >
> > It seems that the consensus is that this is probably the last feature we
> > need before merging the feature branch into master.  The other two
> > (METRON-1788 and METRON-1789) can wait until after the feature branch has
> > been merged.  Let me know if you disagree.
> >
> > Thanks
> >
> >
> > On Thu, Sep 20, 2018 at 1:55 PM Nick Allen <ni...@nickallen.org> wrote:
> >
> > > Yeah, agreed.  Per use case 3, when deploying to production there really
> > > wouldn't be a huge overlap like 3 months of already profiled data.  It's
> > > day 1, the profile was just deployed around the same time as you are
> > > running the Batch Profiler, so the overlap is in minutes, maybe hours.
> > > But I can definitely see the usefulness of the feature for re-runs, etc.,
> > > as you have described.
> > >
> > > Based on this discussion, I created a few JIRAs.  Thanks all for the great
> > > feedback and keep it coming.
> > >
> > > [1] METRON-1787 - Input Time Constraints for Batch Profiler
> > > [2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
> > > [3] METRON-1789 - MPack Should Define Default Input Path for Batch Profiler
> > >
> > >
> > > --
> > > [1] https://issues.apache.org/jira/browse/METRON-1787
> > > [2] https://issues.apache.org/jira/browse/METRON-1788
> > > [3] https://issues.apache.org/jira/browse/METRON-1789
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <
> > > michael.miklavcic@gmail.com> wrote:
> > >
> > >> I think we might want to allow the flexibility to choose the date range
> > >> then. I don't yet feel like I have a good enough understanding of all the
> > >> ways in which users would want to seed to force them to run the batch job
> > >> over all the data. It might also make it easier to deal with remediation,
> > >> i.e., an error doesn't force you to re-run over the entire history. Same
> > >> goes for testing out the profile seeding batch job in the first place.
> > >>
> > >> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <ni...@nickallen.org> wrote:
> > >>
> > >> > Assuming you have 9 months of data archived, yes.
> > >> >
> > >> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <
> > >> > michael.miklavcic@gmail.com> wrote:
> > >> >
> > >> > > So in the case of 3 - if you had 6 months of data that hadn't been
> > >> > > profiled and another 3 that had been profiled (9 months total data),
> > >> > > in its current form the batch job runs over all 9 months?
> > >> > >
> > >> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <ni...@nickallen.org> wrote:
> > >> > >
> > >> > > > > How do we establish "tm" from 1.1 above? Any concerns about
> > >> > > > > overlap or gaps after the seeding is performed?
> > >> > > >
> > >> > > > Good point.  Right now, if the Streaming and Batch Profiler overlap,
> > >> > > > the last write wins.  And presumably the outputs of the Streaming
> > >> > > > and Batch Profiler are the same, so no worries, right? :)
> > >> > > >
> > >> > > > So it kind of works, but it is definitely not ideal for use case 3.
> > >> > > > I could add --begin and --end args to constrain the time frame over
> > >> > > > which the Batch Profiler runs.  I do not have that in the feature
> > >> > > > branch.  It would be easy enough to add though.
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <
> > >> > > > michael.miklavcic@gmail.com> wrote:
> > >> > > >
> > >> > > > > Ok, makes sense. That's sort of what I was thinking as well, Nick.
> > >> > > > > Pulling at this thread just a bit more...
> > >> > > > >
> > >> > > > >    1. I have an existing system that's been up a while, and I have
> > >> > > > >    added k profiles - assume these are the first profiles I've
> > >> > > > >    created.
> > >> > > > >       1. I would have t0 - tm (where m is the time when the
> > >> > > > >       profiles were first installed) worth of data that has not
> > >> > > > >       been profiled yet.
> > >> > > > >       2. The batch profiler process would be to take that exact
> > >> > > > >       profile definition from ZK and run the batch loader with
> > >> > > > >       that from the CLI.
> > >> > > > >       3. Profiles are now up to date from t0 - tCurrent
> > >> > > > >    2. I've already done #1 above. Time goes by and now I want to
> > >> > > > >    add a new profile.
> > >> > > > >       1. Same first step above
> > >> > > > >       2. I would run the batch loader with *only* that new profile
> > >> > > > >       definition to seed?
> > >> > > > >
> > >> > > > > Forgive me if I missed this in PRs and discussion in the FB, but
> > >> > > > > how do we establish "tm" from 1.1 above? Any concerns about
> > >> > > > > overlap or gaps after the seeding is performed?
> > >> > > > >
> > >> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <nick@nickallen.org> wrote:
> > >> > > > >
> > >> > > > > > I think more often than not, you would want to load your
> > >> > > > > > profile definition from a file.  This is why I considered the
> > >> > > > > > 'load from Zk' more of a nice-to-have.
> > >> > > > > >
> > >> > > > > >    - In use case 1 and 2, this would definitely be the case.
> > >> > > > > >    The profiles I am working with are speculative and I am using
> > >> > > > > >    the batch profiler to determine if they are worth keeping.
> > >> > > > > >    In this case, my speculative profiles would not be in Zk
> > >> > > > > >    (yet).
> > >> > > > > >    - In use case 3, I could see it go either way.  It might be
> > >> > > > > >    useful to load from Zk, but it certainly isn't a blocker.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > > So if the config does not correctly match the profiler
> > >> > > > > > > config held in ZK and the user runs the batch seeding job,
> > >> > > > > > > what happens?
> > >> > > > > >
> > >> > > > > > You would just get a profile that is slightly different over
> > >> > > > > > the entire time span.  This is not a new risk.  If the user
> > >> > > > > > changes their Profile definitions in Zk, the same thing would
> > >> > > > > > happen.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <
> > >> > > > > > michael.miklavcic@gmail.com> wrote:
> > >> > > > > >
> > >> > > > > > > I think I'm torn on this, specifically because it's batch
> > >> > > > > > > and would generally be run as-needed. Justin, can you
> > >> > > > > > > elaborate on your concerns there? This feels functionally
> > >> > > > > > > very similar to our flat file loaders, which all have inputs
> > >> > > > > > > for config from the CLI only. On the other hand, our flat
> > >> > > > > > > file loaders are not typically seeding an existing
> > >> > > > > > > structure. My concern with a local file profiler config
> > >> > > > > > > stems from this stated goal:
> > >> > > > > > > > The goal would be to enable “profile seeding” which allows
> > >> > > > > > > > profiles to be populated from a time before the profile
> > >> > > > > > > > was created.
> > >> > > > > > > So if the config does not correctly match the profiler
> > >> > > > > > > config held in ZK and the user runs the batch seeding job,
> > >> > > > > > > what happens?
> > >> > > > > > >
> > >> > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <
> > >> > > justinjleet@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > The profile not being able to read from ZK feels like a
> > >> > > > > > > > fairly substantial, if subtle, set of potential problems.
> > >> > > > > > > > I'd like to see that in either before merging or at least
> > >> > > > > > > > pretty soon after merging.  Is it a lot of work to add
> > >> > > > > > > > that functionality based on where things are right now?
> > >> > > > > > > >
> > >> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <
> > >> nick@nickallen.org
> > >> > >
> > >> > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Here is another limitation that I just thought of.  It
> > >> > > > > > > > > can only read a profile definition from a file.  It
> > >> > > > > > > > > probably also makes sense to add an option that allows
> > >> > > > > > > > > it to read the current Profiler configuration from
> > >> > > > > > > > > Zookeeper.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > > Is it worth setting up a default config that pulls
> > >> > > > > > > > > > from the main indexing output?
> > >> > > > > > > > >
> > >> > > > > > > > > Yes, I think that makes sense.  We want the Batch
> > >> > > > > > > > > Profiler to point to the right HDFS URL, no matter
> > >> > > > > > > > > where/how Metron is deployed.  When Metron gets spun-up
> > >> > > > > > > > > on a cluster, I should be able to just run the Batch
> > >> > > > > > > > > Profiler without having to fuss with the input path.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <
> > >> > > > justinjleet@gmail.com
> > >> > > > > >
> > >> > > > > > > > wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Re:
> > >> > > > > > > > > >
> > >> > > > > > > > > > >  * You do not configure the Batch Profiler in
> > >> Ambari.  It
> > >> > > is
> > >> > > > > > > > configured
> > >> > > > > > > > > > > and executed completely from the command-line.
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > > Is it worth setting up a default config that pulls
> > from
> > >> the
> > >> > > > main
> > >> > > > > > > > indexing
> > >> > > > > > > > > > output?  I'm a little on the fence about it, but it
> > >> seems
> > >> > > like
> > >> > > > > > making
> > >> > > > > > > > the
> > >> > > > > > > > > > most common case more or less built-in would be
> nice.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Having said that, I do not consider that a
> requirement
> > >> for
> > >> > > > > merging
> > >> > > > > > > the
> > >> > > > > > > > > > feature branch.
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <
> > >> > > > > jsirota@apache.org>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > I think what you have outlined above is a good
> > initial
> > >> > stab
> > >> > > > at
> > >> > > > > > the
> > >> > > > > > > > > > > feature.  Manual install of spark is not a big
> deal.
> > >> > > > > Configuring
> > >> > > > > > > via
> > >> > > > > > > > > > > command line while we mature this feature is ok as
> > >> well.
> > >> > > > > Doesn't
> > >> > > > > > > > look
> > >> > > > > > > > > > like
> > >> > > > > > > > > > > configuration steps are too hard.  I think you
> > should
> > >> > > merge.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > James
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <
> nick@nickallen.org
> > >:
> > >> > > > > > > > > > > > I would like to open a discussion to get the
> Batch
> > >> > > Profiler
> > >> > > > > > > feature
> > >> > > > > > > > > > > branch
> > >> > > > > > > > > > > > merged into master as part of METRON-1699 [1]
> > Create
> > >> > > Batch
> > >> > > > > > > > Profiler.
> > >> > > > > > > > > > All
> > >> > > > > > > > > > > > of the work that I had in mind for our first
> draft
> > >> of
> > >> > the
> > >> > > > > Batch
> > >> > > > > > > > > > Profiler
> > >> > > > > > > > > > > > has been completed. Please take a look through
> > what
> > >> I
> > >> > > have
> > >> > > > > and
> > >> > > > > > > let
> > >> > > > > > > > me
> > >> > > > > > > > > > > know
> > >> > > > > > > > > > > > if there are other features that you think are
> > >> required
> > >> > > > > > *before*
> > >> > > > > > > we
> > >> > > > > > > > > > > merge.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Previous list discussions on this topic include
> > [2]
> > >> and
> > >> > > > [3].
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > (Q) What can I do with the feature branch?
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * With the Batch Profiler, you can
> backfill/seed
> > >> > > profiles
> > >> > > > > > using
> > >> > > > > > > > > > > archived
> > >> > > > > > > > > > > > telemetry. This enables the following types of
> use
> > >> > cases.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >       1. As a Security Data Scientist, I want to
> > >> > > understand
> > >> > > > > the
> > >> > > > > > > > > > > historical
> > >> > > > > > > > > > > > behaviors and trends of a profile that I have
> > >> created
> > >> > so
> > >> > > > > that I
> > >> > > > > > > can
> > >> > > > > > > > > > > > determine if I have created a feature set that
> has
> > >> > > > predictive
> > >> > > > > > > value
> > >> > > > > > > > > for
> > >> > > > > > > > > > > > model building.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >       2. As a Security Data Scientist, I want to
> > >> > > understand
> > >> > > > > the
> > >> > > > > > > > > > > historical
> > >> > > > > > > > > > > > behaviors and trends of a profile that I have
> > >> created
> > >> > so
> > >> > > > > that I
> > >> > > > > > > can
> > >> > > > > > > > > > > > determine if I have defined the profile
> correctly
> > >> and
> > >> > > > > created a
> > >> > > > > > > > > feature
> > >> > > > > > > > > > > set
> > >> > > > > > > > > > > > that matches reality.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >       3. As a Security Platform Engineer, I want
> > to
> > >> > > > generate
> > >> > > > > a
> > >> > > > > > > > > profile
> > >> > > > > > > > > > > > using archived telemetry when I deploy a new
> model
> > >> to
> > >> > > > > > production
> > >> > > > > > > so
> > >> > > > > > > > > > that
> > >> > > > > > > > > > > > models depending on that profile can function on
> > >> day 1.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * METRON-1699 [1] includes a more detailed
> > >> > description
> > >> > > of
> > >> > > > > the
> > >> > > > > > > > > > feature.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > (Q) What work was completed?
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * The Batch Profiler runs on Spark and was
> > >> > implemented
> > >> > > in
> > >> > > > > > Java
> > >> > > > > > > to
> > >> > > > > > > > > > > remain
> > >> > > > > > > > > > > > consistent with our current Java-heavy code
> base.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * The Batch Profiler is executed from the
> > >> > command-line.
> > >> > > > It
> > >> > > > > > can
> > >> > > > > > > be
> > >> > > > > > > > > > > > launched using a script or by calling
> > >> `spark-submit`,
> > >> > > which
> > >> > > > > may
> > >> > > > > > > be
> > >> > > > > > > > > > useful
> > >> > > > > > > > > > > > for advanced users.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * Input telemetry can be consumed from
> multiple
> > >> > > sources;
> > >> > > > > for
> > >> > > > > > > > > example
> > >> > > > > > > > > > > HDFS
> > >> > > > > > > > > > > > or the local file system.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * Input telemetry can be consumed in multiple
> > >> > formats;
> > >> > > > for
> > >> > > > > > > > example
> > >> > > > > > > > > > JSON
> > >> > > > > > > > > > > > or ORC.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * The 'output' profile measurements are
> > persisted
> > >> in
> > >> > > > HBase
> > >> > > > > > and
> > >> > > > > > > is
> > >> > > > > > > > > > > > consistent with the Storm Profiler.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * It can be run on any underlying engine
> > >> supported by
> > >> > > > > Spark.
> > >> > > > > > I
> > >> > > > > > > > have
> > >> > > > > > > > > > > > tested it both in 'local' mode and on a YARN
> > >> cluster.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * It is installed automatically by the Metron
> > >> MPack.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * A README was added that documents usage
> > >> > instructions.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * The existing Profiler code was refactored so
> > >> that
> > >> > as
> > >> > > > much
> > >> > > > > > > code
> > >> > > > > > > > as
> > >> > > > > > > > > > > > possible is shared between the 3 Profiler ports;
> > >> Storm,
> > >> > > the
> > >> > > > > > > Stellar
> > >> > > > > > > > > > REPL,
> > >> > > > > > > > > > > > and Spark. For example, the logic which
> determines
> > >> the
> > >> > > > > > timestamp
> > >> > > > > > > > of a
> > >> > > > > > > > > > > > message was refactored so that it could be
> reused
> > by
> > >> > all
> > >> > > > > ports.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >       * metron-profiler-common: The common
> > Profiler
> > >> > code
> > >> > > > > shared
> > >> > > > > > > > > amongst
> > >> > > > > > > > > > > > each port.
> > >> > > > > > > > > > > >       * metron-profiler-storm: Profiler on Storm
> > >> > > > > > > > > > > >       * metron-profiler-spark: Profiler on Spark
> > >> > > > > > > > > > > >       * metron-profiler-repl: Profiler on the
> > >> Stellar
> > >> > > REPL
> > >> > > > > > > > > > > >       * metron-profiler-client: The client code
> > for
> > >> > > > > retrieving
> > >> > > > > > > > > profile
> > >> > > > > > > > > > > > data; for example PROFILE_GET.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * There are 3 separate RPM and DEB packages
> now
> > >> > created
> > >> > > > for
> > >> > > > > > the
> > >> > > > > > > > > > > Profiler.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >       * metron-profiler-storm-*.rpm
> > >> > > > > > > > > > > >       * metron-profiler-spark-*.rpm
> > >> > > > > > > > > > > >       * metron-profiler-repl-*.rpm
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * The Profiler integration tests were enhanced
> > to
> > >> > > > leverage
> > >> > > > > > the
> > >> > > > > > > > > > Profiler
> > >> > > > > > > > > > > > Client logic to validate the results.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * Review METRON-1699 [1] for a complete
> > >> break-down of
> > >> > > the
> > >> > > > > > tasks
> > >> > > > > > > > > that
> > >> > > > > > > > > > > have
> > >> > > > > > > > > > > > been completed on the feature branch.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > (Q) What limitations exist?
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * You must manually install Spark to use the
> > Batch
> > >> > > > > Profiler.
> > >> > > > > > > The
> > >> > > > > > > > > > Metron
> > >> > > > > > > > > > > > MPack does not treat Spark as a Metron
> dependency
> > >> and
> > >> > so
> > >> > > > does
> > >> > > > > > not
> > >> > > > > > > > > > install
> > >> > > > > > > > > > > > it automatically.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * You do not configure the Batch Profiler in
> > >> Ambari.
> > >> > It
> > >> > > > is
> > >> > > > > > > > > configured
> > >> > > > > > > > > > > > and executed completely from the command-line.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >   * To run the Batch Profiler in 'Full Dev', you
> > >> have
> > >> > to
> > >> > > > take
> > >> > > > > > the
> > >> > > > > > > > > > > following
> > >> > > > > > > > > > > > manual steps. Some of these are arguably
> > limitations
> > >> > with
> > >> > > > how
> > >> > > > > > > > Ambari
> > >> > > > > > > > > > > > installs Spark 2 in the version of HDP that we
> > run.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >       1. Install Spark 2 using Ambari.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >       2. Tell Spark how to talk with HBase.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
>  SPARK_HOME=/usr/hdp/current/spark2-client
> > >> > > > > > > > > > > >         cp
> > >> > > > /usr/hdp/current/hbase-client/conf/hbase-site.xml
> > >> > > > > > > > > > > > $SPARK_HOME/conf/
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >       3. Create the Spark History directory in
> > HDFS.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >         export HADOOP_USER_NAME=hdfs
> > >> > > > > > > > > > > >         hdfs dfs -mkdir /spark2-history
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >       4. Change the default input path to
> > >> > > > > > > > `hdfs://localhost:8020/...`
> > >> > > > > > > > > > to
> > >> > > > > > > > > > > > match the port defined by HDP, instead of port
> > 9000.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > [1]
> > >> https://issues.apache.org/jira/browse/METRON-1699
> > >> > > > > > > > > > > > [2]
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> > >> > > > > > > > > > > > [3]
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > -------------------
> > >> > > > > > > > > > > Thank you,
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > James Sirota
> > >> > > > > > > > > > > PMC- Apache Metron
> > >> > > > > > > > > > > jsirota AT apache DOT org
> > >> > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Michael Miklavcic <mi...@gmail.com>.

I just made a couple of minor comments on that PR, and I agree that it is
ready to merge into master. Good stuff, Nick.

On Fri, Sep 21, 2018 at 12:37 PM Nick Allen <ni...@nickallen.org> wrote:

> Here is a PR that adds the input time constraints to the Batch Profiler
> (METRON-1787): https://github.com/apache/metron/pull/1209.
>
> It seems that the consensus is that this is probably the last feature we
> need before merging the FB into master.  The other two can wait until after
> the feature branch has been merged.  Let me know if you disagree.
>
> Thanks
>
>
> On Thu, Sep 20, 2018 at 1:55 PM Nick Allen <ni...@nickallen.org> wrote:
>
> > Yeah, agreed.  Per use case 3, when deploying to production there really
> > wouldn't be a huge overlap like 3 months of already profiled data.  It's
> > day 1, the profile was just deployed around the same time as you are running
> > the Batch Profiler, so the overlap is in minutes, maybe hours.  But I can
> > definitely see the usefulness of the feature for re-runs, etc. as you have
> > described.
> >
> > Based on this discussion, I created a few JIRAs.  Thanks all for the
> great
> > feedback and keep it coming.
> >
> > [1] METRON-1787 - Input Time Constraints for Batch Profiler
> > [2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
> > [3] METRON-1789 - MPack Should Define Default Input Path for Batch
> > Profiler
> >
> >
> > --
> > [1] https://issues.apache.org/jira/browse/METRON-1787
> > [2] https://issues.apache.org/jira/browse/METRON-1788
> > [3] https://issues.apache.org/jira/browse/METRON-1789
> >
> >
> >
> >
> >
> >
> > On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <
> > michael.miklavcic@gmail.com> wrote:
> >
> >> I think we might want to allow the flexibility to choose the date range
> >> then. I don't yet feel like I have a good enough understanding of all the
> >> ways in which users would want to seed to force them to run the batch job
> >> over all the data. It might also make it easier to deal with remediation,
> >> i.e., an error doesn't force you to re-run over the entire history. Same
> >> goes for testing out the profile seeding batch job in the first place.
> >>
> >> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <ni...@nickallen.org> wrote:
> >>
> >> > Assuming you have 9 months of data archived, yes.
> >> >
> >> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <
> >> > michael.miklavcic@gmail.com> wrote:
> >> >
> >> > > So in the case of 3 - if you had 6 months of data that hadn't been
> >> > profiled
> >> > > and another 3 that had been profiled (9 months total data), in its
> >> > current
> >> > > form the batch job runs over all 9 months?
> >> > >
> >> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <ni...@nickallen.org>
> >> wrote:
> >> > >
> >> > > > > How do we establish "tm" from 1.1 above? Any concerns about
> >> overlap
> >> > or
> >> > > > gaps after the seeding is performed?
> >> > > >
> >> > > > Good point.  Right now, if the Streaming and Batch Profiler
> >> > > > overlap, the last write wins.  And presumably the outputs of the
> >> > > > Streaming and Batch Profilers are the same, so no worries, right? :)
> >> > > >
> >> > > > So it kind of works, but it is definitely not ideal for use case
> >> 3.  I
> >> > > > could add --begin and --end args to constrain the time frame over
> >> which
> >> > > the
> >> > > > Batch Profiler runs.  I do not have that in the feature branch.
> It
> >> > would
> >> > > > be easy enough to add though.
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <
> >> > > > michael.miklavcic@gmail.com> wrote:
> >> > > >
> >> > > > > Ok, makes sense. That's sort of what I was thinking as well,
> Nick.
> >> > > > Pulling
> >> > > > > at this thread just a bit more...
> >> > > > >
> >> > > > >    1. I have an existing system that's been up a while, and I
> have
> >> > > added
> >> > > > k
> >> > > > >    profiles - assume these are the first profiles I've created.
> >> > > > >       1. I would have t0 - tm (where m is the time when the
> >> profiles
> >> > > were
> >> > > > >       first installed) worth of data that has not been profiled
> >> yet.
> >> > > > >       2. The batch profiler process would be to take that exact
> >> > profile
> >> > > > >       definition from ZK and run the batch loader with that from
> >> the
> >> > > CLI.
> >> > > > >       3. Profiles are now up to date from t0 - tCurrent
> >> > > > >    2. I've already done #1 above. Time goes by and now I want to
> >> add
> >> > a
> >> > > > new
> >> > > > >    profile.
> >> > > > >       1. Same first step above
> >> > > > >       2. I would run the batch loader with *only* that new
> profile
> >> > > > >       definition to seed?
> >> > > > >
> >> > > > > Forgive me if I missed this in PR's and discussion in the FB,
> but
> >> how
> >> > > do
> >> > > > we
> >> > > > > establish "tm" from 1.1 above? Any concerns about overlap or
> gaps
> >> > after
> >> > > > the
> >> > > > > seeding is performed?
> >> > > > >
> >> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <nick@nickallen.org
> >
> >> > > wrote:
> >> > > > >
> >> > > > > > I think more often than not, you would want to load your
> profile
> >> > > > > definition
> >> > > > > > from a file.  This is why I considered the 'load from Zk' more
> >> of a
> >> > > > > > nice-to-have.
> >> > > > > >
> >> > > > > >    - In use case 1 and 2, this would definitely be the case.
> >> The
> >> > > > > profiles
> >> > > > > >    I am working with are speculative and I am using the batch
> >> > > profiler
> >> > > > to
> >> > > > > >    determine if they are worth keeping.  In this case, my
> >> > speculative
> >> > > > > > profiles
> >> > > > > >    would not be in Zk (yet).
> >> > > > > >    - In use case 3, I could see it go either way.  It might be
> >> > useful
> >> > > > to
> >> > > > > >    load from Zk, but it certainly isn't a blocker.
> >> > > > > >
> >> > > > > >
> >> > > > > > > So if the config does not correctly match the profiler
> config
> >> > held
> >> > > in
> >> > > > > ZK
> >> > > > > > and
> >> > > > > > the user runs the batch seeding job, what happens?
> >> > > > > >
> >> > > > > > You would just get a profile that is slightly different over
> the
> >> > > entire
> >> > > > > > time span.  This is not a new risk.  If the user changes their
> >> > > Profile
> >> > > > > > definitions in Zk, the same thing would happen.
> >> > > > > >
> >> > > > > >
> >> > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <
> >> > > > > > michael.miklavcic@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > I think I'm torn on this, specifically because it's batch
> and
> >> > would
> >> > > > > > > generally be run as-needed. Justin, can you elaborate on
> your
> >> > > > concerns
> >> > > > > > > there? This feels functionally very similar to our flat file
> >> > > loaders,
> >> > > > > > which
> >> > > > > > > all have inputs for config from the CLI only. On the other
> >> hand,
> >> > > our
> >> > > > > flat
> >> > > > > > > file loaders are not typically seeding an existing
> structure.
> >> My
> >> > > > > concern
> >> > > > > > of
> >> > > > > > > a local file profiler config stems from this stated goal:
> >> > > > > > > > The goal would be to enable “profile seeding” which allows
> >> > > profiles
> >> > > > > to
> >> > > > > > be
> >> > > > > > > populated from a time before the profile was created.
> >> > > > > > > So if the config does not correctly match the profiler
> config
> >> > held
> >> > > in
> >> > > > > ZK
> >> > > > > > > and the user runs the batch seeding job, what happens?
> >> > > > > > >
> >> > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <
> >> > > justinjleet@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > The profile not being able to read from ZK feels like a
> >> fairly
> >> > > > > > > substantial,
> >> > > > > > > > if subtle, set of potential problems.  I'd like to see
> that
> >> in
> >> > > > either
> >> > > > > > > > before merging or at least pretty soon after merging.  Is
> >> it a
> >> > > lot
> >> > > > of
> >> > > > > > > work
> >> > > > > > > > to add that functionality based on where things are right
> >> now?
> >> > > > > > > >
> >> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <
> >> nick@nickallen.org
> >> > >
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Here is another limitation that I just thought of. It can
> >> > > > > > > > > only read a profile definition from a file.  It probably
> >> > > > > > > > > also makes sense to add an option that allows it to read
> >> > > > > > > > > the current Profiler configuration from Zookeeper.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > > Is it worth setting up a default config that pulls
> from
> >> the
> >> > > > main
> >> > > > > > > > indexing
> >> > > > > > > > > output?
> >> > > > > > > > >
> >> > > > > > > > > Yes, I think that makes sense.  We want the Batch
> >> Profiler to
> >> > > > point
> >> > > > > > to
> >> > > > > > > > the
> >> > > > > > > > > right HDFS URL, no matter where/how Metron is deployed.
> >> When
> >> > > > > Metron
> >> > > > > > > gets
> >> > > > > > > > > spun-up on a cluster, I should be able to just run the
> >> Batch
> >> > > > > Profiler
> >> > > > > > > > > without having to fuss with the input path.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <
> >> > > > justinjleet@gmail.com
> >> > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > Re:
> >> > > > > > > > > >
> >> > > > > > > > > > >  * You do not configure the Batch Profiler in
> >> Ambari.  It
> >> > > is
> >> > > > > > > > configured
> >> > > > > > > > > > > and executed completely from the command-line.
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > Is it worth setting up a default config that pulls
> from
> >> the
> >> > > > main
> >> > > > > > > > indexing
> >> > > > > > > > > > output?  I'm a little on the fence about it, but it
> >> seems
> >> > > like
> >> > > > > > making
> >> > > > > > > > the
> >> > > > > > > > > > most common case more or less built-in would be nice.
> >> > > > > > > > > >
> >> > > > > > > > > > Having said that, I do not consider that a requirement
> >> for
> >> > > > > merging
> >> > > > > > > the
> >> > > > > > > > > > feature branch.
> >> > > > > > > > > >
> >> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <
> >> > > > > jsirota@apache.org>
> >> > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > I think what you have outlined above is a good
> initial
> >> > stab
> >> > > > at
> >> > > > > > the
> >> > > > > > > > > > > feature.  Manual install of spark is not a big deal.
> >> > > > > Configuring
> >> > > > > > > via
> >> > > > > > > > > > > command line while we mature this feature is ok as
> >> well.
> >> > > > > Doesn't
> >> > > > > > > > look
> >> > > > > > > > > > like
> >> > > > > > > > > > > configuration steps are too hard.  I think you
> should
> >> > > merge.
> >> > > > > > > > > > >
> >> > > > > > > > > > > James
> >> > > > > > > > > > >
> >> > > > > > > > > > > -------------------
> >> > > > > > > > > > > Thank you,
> >> > > > > > > > > > >
> >> > > > > > > > > > > James Sirota
> >> > > > > > > > > > > PMC- Apache Metron
> >> > > > > > > > > > > jsirota AT apache DOT org
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Nick Allen <ni...@nickallen.org>.

Here is a PR that adds the input time constraints to the Batch Profiler
(METRON-1787): https://github.com/apache/metron/pull/1209.
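
To make the idea of an input time constraint concrete, here is a minimal Java
sketch. It is only an illustration and is not taken from the PR: the class
name `TimeWindowSketch`, its method names, and the choice of inclusive bounds
expressed in epoch milliseconds are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the code in the PR.
final class TimeWindowSketch {

    // True when the timestamp falls inside [begin, end].
    // All values are epoch milliseconds; inclusive bounds are an assumption.
    static boolean inWindow(long timestamp, long begin, long end) {
        return timestamp >= begin && timestamp <= end;
    }

    // Keeps only the timestamps inside the window, preserving input order.
    static List<Long> filter(List<Long> timestamps, long begin, long end) {
        List<Long> kept = new ArrayList<>();
        for (long ts : timestamps) {
            if (inWindow(ts, begin, end)) {
                kept.add(ts);
            }
        }
        return kept;
    }
}
```

With a window of [5, 10], the timestamps 0, 5, 10, 11 reduce to 5 and 10, so
telemetry outside the requested range never reaches the profile.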

The consensus seems to be that this is probably the last feature we need
before merging the feature branch (FB) into master.  The other two can wait
until after the feature branch has been merged.  Let me know if you disagree.

Thanks


On Thu, Sep 20, 2018 at 1:55 PM Nick Allen <ni...@nickallen.org> wrote:

> Yeah, agreed.  Per use case 3, when deploying to production there really
> wouldn't be a huge overlap like 3 months of already profiled data.  It's day
> 1, the profile was just deployed around the same time as you are running
> the Batch Profiler, so the overlap is in minutes, maybe hours.  But I can
> definitely see the usefulness of the feature for re-runs, etc. as you have
> described.
>
> Based on this discussion, I created a few JIRAs.  Thanks all for the great
> feedback and keep it coming.
>
> [1] METRON-1787 - Input Time Constraints for Batch Profiler
> [2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
> [3] METRON-1789 - MPack Should Define Default Input Path for Batch
> Profiler
>
>
> --
> [1] https://issues.apache.org/jira/browse/METRON-1787
> [2] https://issues.apache.org/jira/browse/METRON-1788
> [3] https://issues.apache.org/jira/browse/METRON-1789
>
>
>
>
>
>
> On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <
> michael.miklavcic@gmail.com> wrote:
>
>> I think we might want to allow the flexibility to choose the date range
>> then. I don't yet feel like I have a good enough understanding of all the
>> ways in which users would want to seed to force them to run the batch job
>> over all the data. It might also make it easier to deal with remediation,
>> i.e., an error doesn't force you to re-run over the entire history. Same
>> goes for testing out the profile seeding batch job in the first place.
>>
>> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <ni...@nickallen.org> wrote:
>>
>> > Assuming you have 9 months of data archived, yes.
>> >
>> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <
>> > michael.miklavcic@gmail.com> wrote:
>> >
>> > > So in the case of 3 - if you had 6 months of data that hadn't been
>> > profiled
>> > > and another 3 that had been profiled (9 months total data), in its
>> > current
>> > > form the batch job runs over all 9 months?
>> > >
>> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <ni...@nickallen.org>
>> wrote:
>> > >
>> > > > > How do we establish "tm" from 1.1 above? Any concerns about
>> overlap
>> > or
>> > > > gaps after the seeding is performed?
>> > > >
>> > > > Good point.  Right now, if the Streaming and Batch Profiler overlap,
>> > > > the last write wins.  And presumably the outputs of the Streaming and
>> > > > Batch Profilers are the same, so no worries, right? :)
>> > > >
>> > > > So it kind of works, but it is definitely not ideal for use case
>> 3.  I
>> > > > could add --begin and --end args to constrain the time frame over
>> which
>> > > the
>> > > > Batch Profiler runs.  I do not have that in the feature branch.  It
>> > would
>> > > > be easy enough to add though.
>> > > >
>> > > >
>> > > >
>> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <
>> > > > michael.miklavcic@gmail.com> wrote:
>> > > >
>> > > > > Ok, makes sense. That's sort of what I was thinking as well, Nick.
>> > > > Pulling
>> > > > > at this thread just a bit more...
>> > > > >
>> > > > >    1. I have an existing system that's been up a while, and I have
>> > > added
>> > > > k
>> > > > >    profiles - assume these are the first profiles I've created.
>> > > > >       1. I would have t0 - tm (where m is the time when the
>> profiles
>> > > were
>> > > > >       first installed) worth of data that has not been profiled
>> yet.
>> > > > >       2. The batch profiler process would be to take that exact
>> > profile
>> > > > >       definition from ZK and run the batch loader with that from
>> the
>> > > CLI.
>> > > > >       3. Profiles are now up to date from t0 - tCurrent
>> > > > >    2. I've already done #1 above. Time goes by and now I want to
>> add
>> > a
>> > > > new
>> > > > >    profile.
>> > > > >       1. Same first step above
>> > > > >       2. I would run the batch loader with *only* that new profile
>> > > > >       definition to seed?
>> > > > >
>> > > > > Forgive me if I missed this in PR's and discussion in the FB, but
>> how
>> > > do
>> > > > we
>> > > > > establish "tm" from 1.1 above? Any concerns about overlap or gaps
>> > after
>> > > > the
>> > > > > seeding is performed?
>> > > > >
>> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <ni...@nickallen.org>
>> > > wrote:
>> > > > >
>> > > > > > I think more often than not, you would want to load your profile
>> > > > > definition
>> > > > > > from a file.  This is why I considered the 'load from Zk' more
>> of a
>> > > > > > nice-to-have.
>> > > > > >
>> > > > > >    - In use case 1 and 2, this would definitely be the case.
>> The
>> > > > > profiles
>> > > > > >    I am working with are speculative and I am using the batch
>> > > profiler
>> > > > to
>> > > > > >    determine if they are worth keeping.  In this case, my
>> > speculative
>> > > > > > profiles
>> > > > > >    would not be in Zk (yet).
>> > > > > >    - In use case 3, I could see it go either way.  It might be
>> > useful
>> > > > to
>> > > > > >    load from Zk, but it certainly isn't a blocker.
>> > > > > >
>> > > > > >
>> > > > > > > So if the config does not correctly match the profiler config
>> > held
>> > > in
>> > > > > ZK
>> > > > > > and
>> > > > > > the user runs the batch seeding job, what happens?
>> > > > > >
>> > > > > > You would just get a profile that is slightly different over the
>> > > entire
>> > > > > > time span.  This is not a new risk.  If the user changes their
>> > > Profile
>> > > > > > definitions in Zk, the same thing would happen.
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <
>> > > > > > michael.miklavcic@gmail.com> wrote:
>> > > > > >
>> > > > > > > I think I'm torn on this, specifically because it's batch and
>> > would
>> > > > > > > generally be run as-needed. Justin, can you elaborate on your
>> > > > concerns
>> > > > > > > there? This feels functionally very similar to our flat file
>> > > loaders,
>> > > > > > which
>> > > > > > > all have inputs for config from the CLI only. On the other
>> hand,
>> > > our
>> > > > > flat
>> > > > > > > file loaders are not typically seeding an existing structure.
>> My
>> > > > > concern
>> > > > > > of
>> > > > > > > a local file profiler config stems from this stated goal:
>> > > > > > > > The goal would be to enable “profile seeding” which allows
>> > > profiles
>> > > > > to
>> > > > > > be
>> > > > > > > populated from a time before the profile was created.
>> > > > > > > So if the config does not correctly match the profiler config
>> > held
>> > > in
>> > > > > ZK
>> > > > > > > and the user runs the batch seeding job, what happens?
>> > > > > > >
>> > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <
>> > > justinjleet@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > The profile not being able to read from ZK feels like a
>> fairly
>> > > > > > > substantial,
>> > > > > > > > if subtle, set of potential problems.  I'd like to see that
>> in
>> > > > either
>> > > > > > > > before merging or at least pretty soon after merging.  Is
>> it a
>> > > lot
>> > > > of
>> > > > > > > work
>> > > > > > > > to add that functionality based on where things are right
>> now?
>> > > > > > > >
>> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <
>> nick@nickallen.org
>> > >
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > > > Here is another limitation that I just thought of. It can only
>> > > > > > > > > > read a profile definition from a file.  It probably also makes
>> > > > > > > > > > sense to add an option that allows it to read the current
>> > > > > > > > > > Profiler configuration from Zookeeper.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Is it worth setting up a default config that pulls from
>> the
>> > > > main
>> > > > > > > > indexing
>> > > > > > > > > output?
>> > > > > > > > >
>> > > > > > > > > Yes, I think that makes sense.  We want the Batch
>> Profiler to
>> > > > point
>> > > > > > to
>> > > > > > > > the
>> > > > > > > > > right HDFS URL, no matter where/how Metron is deployed.
>> When
>> > > > > Metron
>> > > > > > > gets
>> > > > > > > > > spun-up on a cluster, I should be able to just run the
>> Batch
>> > > > > Profiler
>> > > > > > > > > without having to fuss with the input path.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <
>> > > > justinjleet@gmail.com
>> > > > > >
>> > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > Re:
>> > > > > > > > > >
>> > > > > > > > > > >  * You do not configure the Batch Profiler in
>> Ambari.  It
>> > > is
>> > > > > > > > configured
>> > > > > > > > > > > and executed completely from the command-line.
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > Is it worth setting up a default config that pulls from
>> the
>> > > > main
>> > > > > > > > indexing
>> > > > > > > > > > output?  I'm a little on the fence about it, but it
>> seems
>> > > like
>> > > > > > making
>> > > > > > > > the
>> > > > > > > > > > most common case more or less built-in would be nice.
>> > > > > > > > > >
>> > > > > > > > > > Having said that, I do not consider that a requirement
>> for
>> > > > > merging
>> > > > > > > the
>> > > > > > > > > > feature branch.
>> > > > > > > > > >
>> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <
>> > > > > jsirota@apache.org>
>> > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > I think what you have outlined above is a good initial
>> > stab
>> > > > at
>> > > > > > the
>> > > > > > > > > > > feature.  Manual install of spark is not a big deal.
>> > > > > Configuring
>> > > > > > > via
>> > > > > > > > > > > command line while we mature this feature is ok as
>> well.
>> > > > > Doesn't
>> > > > > > > > look
>> > > > > > > > > > like
>> > > > > > > > > > > configuration steps are too hard.  I think you should
>> > > merge.
>> > > > > > > > > > >
>> > > > > > > > > > > James
>> > > > > > > > > > >
>> > > > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <ni...@nickallen.org>:
>> > > > > > > > > > > > I would like to open a discussion to get the Batch
>> > > Profiler
>> > > > > > > feature
>> > > > > > > > > > > branch
>> > > > > > > > > > > > merged into master as part of METRON-1699 [1] Create
>> > > Batch
>> > > > > > > > Profiler.
>> > > > > > > > > > All
>> > > > > > > > > > > > of the work that I had in mind for our first draft
>> of
>> > the
>> > > > > Batch
>> > > > > > > > > > Profiler
>> > > > > > > > > > > > has been completed. Please take a look through what
>> I
>> > > have
>> > > > > and
>> > > > > > > let
>> > > > > > > > me
>> > > > > > > > > > > know
>> > > > > > > > > > > > if there are other features that you think are
>> required
>> > > > > > *before*
>> > > > > > > we
>> > > > > > > > > > > merge.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Previous list discussions on this topic include [2]
>> and
>> > > > [3].
>> > > > > > > > > > > >
>> > > > > > > > > > > > (Q) What can I do with the feature branch?
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * With the Batch Profiler, you can backfill/seed
>> > > profiles
>> > > > > > using
>> > > > > > > > > > > archived
>> > > > > > > > > > > > telemetry. This enables the following types of use
>> > cases.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       1. As a Security Data Scientist, I want to
>> > > understand
>> > > > > the
>> > > > > > > > > > > historical
>> > > > > > > > > > > > behaviors and trends of a profile that I have
>> created
>> > so
>> > > > > that I
>> > > > > > > can
>> > > > > > > > > > > > determine if I have created a feature set that has
>> > > > predictive
>> > > > > > > value
>> > > > > > > > > for
>> > > > > > > > > > > > model building.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       2. As a Security Data Scientist, I want to
>> > > understand
>> > > > > the
>> > > > > > > > > > > historical
>> > > > > > > > > > > > behaviors and trends of a profile that I have
>> created
>> > so
>> > > > > that I
>> > > > > > > can
>> > > > > > > > > > > > determine if I have defined the profile correctly
>> and
>> > > > > created a
>> > > > > > > > > feature
>> > > > > > > > > > > set
>> > > > > > > > > > > > that matches reality.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       3. As a Security Platform Engineer, I want to
>> > > > generate
>> > > > > a
>> > > > > > > > > profile
>> > > > > > > > > > > > using archived telemetry when I deploy a new model
>> to
>> > > > > > production
>> > > > > > > so
>> > > > > > > > > > that
>> > > > > > > > > > > > models depending on that profile can function on
>> day 1.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * METRON-1699 [1] includes a more detailed
>> > description
>> > > of
>> > > > > the
>> > > > > > > > > > feature.
>> > > > > > > > > > > >
>> > > > > > > > > > > > (Q) What work was completed?
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * The Batch Profiler runs on Spark and was
>> > implemented
>> > > in
>> > > > > > Java
>> > > > > > > to
>> > > > > > > > > > > remain
>> > > > > > > > > > > > consistent with our current Java-heavy code base.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * The Batch Profiler is executed from the
>> > command-line.
>> > > > It
>> > > > > > can
>> > > > > > > be
>> > > > > > > > > > > > launched using a script or by calling
>> `spark-submit`,
>> > > which
>> > > > > may
>> > > > > > > be
>> > > > > > > > > > useful
>> > > > > > > > > > > > for advanced users.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * Input telemetry can be consumed from multiple
>> > > sources;
>> > > > > for
>> > > > > > > > > example
>> > > > > > > > > > > HDFS
>> > > > > > > > > > > > or the local file system.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * Input telemetry can be consumed in multiple
>> > formats;
>> > > > for
>> > > > > > > > example
>> > > > > > > > > > JSON
>> > > > > > > > > > > > or ORC.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * The 'output' profile measurements are persisted in HBase and
>> > > > > > > > > > > > are consistent with the Storm Profiler.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * It can be run on any underlying engine
>> supported by
>> > > > > Spark.
>> > > > > > I
>> > > > > > > > have
>> > > > > > > > > > > > tested it both in 'local' mode and on a YARN
>> cluster.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * It is installed automatically by the Metron
>> MPack.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * A README was added that documents usage
>> > instructions.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * The existing Profiler code was refactored so
>> that
>> > as
>> > > > much
>> > > > > > > code
>> > > > > > > > as
>> > > > > > > > > > > > possible is shared between the 3 Profiler ports;
>> Storm,
>> > > the
>> > > > > > > Stellar
>> > > > > > > > > > REPL,
>> > > > > > > > > > > > and Spark. For example, the logic which determines
>> the
>> > > > > > timestamp
>> > > > > > > > of a
>> > > > > > > > > > > > message was refactored so that it could be reused by
>> > all
>> > > > > ports.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       * metron-profiler-common: The common Profiler
>> > code
>> > > > > shared
>> > > > > > > > > amongst
>> > > > > > > > > > > > each port.
>> > > > > > > > > > > >       * metron-profiler-storm: Profiler on Storm
>> > > > > > > > > > > >       * metron-profiler-spark: Profiler on Spark
>> > > > > > > > > > > >       * metron-profiler-repl: Profiler on the
>> Stellar
>> > > REPL
>> > > > > > > > > > > >       * metron-profiler-client: The client code for
>> > > > > retrieving
>> > > > > > > > > profile
>> > > > > > > > > > > > data; for example PROFILE_GET.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * There are 3 separate RPM and DEB packages now
>> > created
>> > > > for
>> > > > > > the
>> > > > > > > > > > > Profiler.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       * metron-profiler-storm-*.rpm
>> > > > > > > > > > > >       * metron-profiler-spark-*.rpm
>> > > > > > > > > > > >       * metron-profiler-repl-*.rpm
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * The Profiler integration tests were enhanced to
>> > > > leverage
>> > > > > > the
>> > > > > > > > > > Profiler
>> > > > > > > > > > > > Client logic to validate the results.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * Review METRON-1699 [1] for a complete
>> break-down of
>> > > the
>> > > > > > tasks
>> > > > > > > > > that
>> > > > > > > > > > > have
>> > > > > > > > > > > > been completed on the feature branch.
>> > > > > > > > > > > >
>> > > > > > > > > > > > (Q) What limitations exist?
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * You must manually install Spark to use the Batch
>> > > > > Profiler.
>> > > > > > > The
>> > > > > > > > > > Metron
>> > > > > > > > > > > > MPack does not treat Spark as a Metron dependency
>> and
>> > so
>> > > > does
>> > > > > > not
>> > > > > > > > > > install
>> > > > > > > > > > > > it automatically.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * You do not configure the Batch Profiler in
>> Ambari.
>> > It
>> > > > is
>> > > > > > > > > configured
>> > > > > > > > > > > > and executed completely from the command-line.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * To run the Batch Profiler in 'Full Dev', you
>> have
>> > to
>> > > > take
>> > > > > > the
>> > > > > > > > > > > following
>> > > > > > > > > > > > manual steps. Some of these are arguably limitations
>> > with
>> > > > how
>> > > > > > > > Ambari
>> > > > > > > > > > > > installs Spark 2 in the version of HDP that we run.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       1. Install Spark 2 using Ambari.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       2. Tell Spark how to talk with HBase.
>> > > > > > > > > > > >
>> > > > > > > > > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
>> > > > > > > > > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
>> > > > > > > > > > > >
>> > > > > > > > > > > >       3. Create the Spark History directory in HDFS.
>> > > > > > > > > > > >
>> > > > > > > > > > > >         export HADOOP_USER_NAME=hdfs
>> > > > > > > > > > > >         hdfs dfs -mkdir /spark2-history
>> > > > > > > > > > > >
>> > > > > > > > > > > >       4. Change the default input path to `hdfs://localhost:8020/...`
>> > > > > > > > > > > > to match the port defined by HDP, instead of port 9000.
>> > > > > > > > > > > >
>> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
>> > > > > > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
>> > > > > > > > > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
>> > > > > > > > > > >
>> > > > > > > > > > > -------------------
>> > > > > > > > > > > Thank you,
>> > > > > > > > > > >
>> > > > > > > > > > > James Sirota
>> > > > > > > > > > > PMC- Apache Metron
>> > > > > > > > > > > jsirota AT apache DOT org
>> > > > > > > > > > >

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Nick Allen <ni...@nickallen.org>.

Yeah, agreed.  Per use case 3, when deploying to production there really
wouldn't be a huge overlap like 3 months of already profiled data.  It's day
1, the profile was just deployed around the same time as you are running
the Batch Profiler, so the overlap is in minutes, maybe hours.  But I can
definitely see the usefulness of the feature for re-runs, etc., as you have
described.

Based on this discussion, I created a few JIRAs.  Thanks all for the great
feedback and keep it coming.

[1] METRON-1787 - Input Time Constraints for Batch Profiler
[2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
[3] METRON-1789 - MPack Should Define Default Input Path for Batch Profiler


--
[1] https://issues.apache.org/jira/browse/METRON-1787
[2] https://issues.apache.org/jira/browse/METRON-1788
[3] https://issues.apache.org/jira/browse/METRON-1789
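
To make the idea behind [1] a little more concrete, here is a rough sketch of
what an input time constraint could look like.  This is only an illustration
and not code from the feature branch.  The function names, the half-open
[begin, end) window, and the assumption that each telemetry message carries an
epoch-millisecond `timestamp` field are all hypothetical choices made for the
example.

```python
from typing import Dict, Iterable, List, Optional


def within_window(timestamp_ms: int,
                  begin_ms: Optional[int],
                  end_ms: Optional[int]) -> bool:
    """Return True if the timestamp falls inside [begin_ms, end_ms).

    A bound of None means "unbounded" on that side (hypothetical convention).
    """
    if begin_ms is not None and timestamp_ms < begin_ms:
        return False
    if end_ms is not None and timestamp_ms >= end_ms:
        return False
    return True


def constrain(messages: Iterable[Dict],
              begin_ms: Optional[int] = None,
              end_ms: Optional[int] = None,
              field: str = "timestamp") -> List[Dict]:
    """Keep only the messages whose timestamp lies inside the window."""
    return [m for m in messages if within_window(m[field], begin_ms, end_ms)]


# Three archived messages; only the first two fall before the (assumed) moment
# at which the streaming profiler took over.
telemetry = [
    {"ip_src_addr": "10.0.0.1", "timestamp": 1000},
    {"ip_src_addr": "10.0.0.2", "timestamp": 2000},
    {"ip_src_addr": "10.0.0.3", "timestamp": 3000},
]
seeded = constrain(telemetry, begin_ms=1000, end_ms=3000)
```

With a window like this, a re-run after an error only has to cover the
affected time range, and a seeding run can stop where the streaming profiler
started, which keeps the overlap small.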






On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <
michael.miklavcic@gmail.com> wrote:

> I think we might want to allow the flexibility to choose the date range
> then. I don't yet feel like I have a good enough understanding of all the
> ways in which users would want to seed to force them to run the batch job
> over all the data. It might also make it easier to deal with remediation,
> i.e., an error doesn't force you to re-run over the entire history. Same goes
> for testing out the profile seeding batch job in the first place.
>
> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <ni...@nickallen.org> wrote:
>
> > Assuming you have 9 months of data archived, yes.
> >
> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <
> > michael.miklavcic@gmail.com> wrote:
> >
> > > So in the case of 3 - if you had 6 months of data that hadn't been
> > profiled
> > > and another 3 that had been profiled (9 months total data), in its
> > current
> > > form the batch job runs over all 9 months?
> > >
> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <ni...@nickallen.org>
> wrote:
> > >
> > > > > How do we establish "tm" from 1.1 above? Any concerns about overlap
> > or
> > > > gaps after the seeding is performed?
> > > >
> > > > Good point.  Right now, if the Streaming and Batch Profiler overlap
> the
> > > > last write wins.  And presumably the output of the Streaming and
> Batch
> > > > Profiler are the same, so no worries, right? :)
> > > >
> > > > So it kind of works, but it is definitely not ideal for use case 3.
> I
> > > > could add --begin and --end args to constrain the time frame over
> which
> > > the
> > > > Batch Profiler runs.  I do not have that in the feature branch.  It
> > would
> > > > be easy enough to add though.
> > > >
> > > >
> > > >
> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <
> > > > michael.miklavcic@gmail.com> wrote:
> > > >
> > > > > Ok, makes sense. That's sort of what I was thinking as well, Nick.
> > > > Pulling
> > > > > at this thread just a bit more...
> > > > >
> > > > >    1. I have an existing system that's been up a while, and I have
> > > added
> > > > k
> > > > >    profiles - assume these are the first profiles I've created.
> > > > >       1. I would have t0 - tm (where m is the time when the
> profiles
> > > were
> > > > >       first installed) worth of data that has not been profiled
> yet.
> > > > >       2. The batch profiler process would be to take that exact
> > profile
> > > > >       definition from ZK and run the batch loader with that from
> the
> > > CLI.
> > > > >       3. Profiles are now up to date from t0 - tCurrent
> > > > >    2. I've already done #1 above. Time goes by and now I want to
> add
> > a
> > > > new
> > > > >    profile.
> > > > >       1. Same first step above
> > > > >       2. I would run the batch loader with *only* that new profile
> > > > >       definition to seed?
> > > > >
> > > > > Forgive me if I missed this in PR's and discussion in the FB, but
> how
> > > do
> > > > we
> > > > > establish "tm" from 1.1 above? Any concerns about overlap or gaps
> > after
> > > > the
> > > > > seeding is performed?
> > > > >
> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <ni...@nickallen.org>
> > > wrote:
> > > > >
> > > > > > I think more often than not, you would want to load your profile
> > > > > definition
> > > > > > from a file.  This is why I considered the 'load from Zk' more
> of a
> > > > > > nice-to-have.
> > > > > >
> > > > > >    - In use case 1 and 2, this would definitely be the case.  The
> > > > > profiles
> > > > > >    I am working with are speculative and I am using the batch
> > > profiler
> > > > to
> > > > > >    determine if they are worth keeping.  In this case, my
> > speculative
> > > > > > profiles
> > > > > >    would not be in Zk (yet).
> > > > > >    - In use case 3, I could see it go either way.  It might be
> > useful
> > > > to
> > > > > >    load from Zk, but it certainly isn't a blocker.
> > > > > >
> > > > > >
> > > > > > > So if the config does not correctly match the profiler config
> > held
> > > in
> > > > > ZK
> > > > > > and
> > > > > > the user runs the batch seeding job, what happens?
> > > > > >
> > > > > > You would just get a profile that is slightly different over the
> > > entire
> > > > > > time span.  This is not a new risk.  If the user changes their
> > > Profile
> > > > > > definitions in Zk, the same thing would happen.
> > > > > >
> > > > > >
> > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <
> > > > > > michael.miklavcic@gmail.com> wrote:
> > > > > >
> > > > > > > I think I'm torn on this, specifically because it's batch and
> > would
> > > > > > > generally be run as-needed. Justin, can you elaborate on your
> > > > concerns
> > > > > > > there? This feels functionally very similar to our flat file
> > > loaders,
> > > > > > which
> > > > > > > all have inputs for config from the CLI only. On the other
> hand,
> > > our
> > > > > flat
> > > > > > > file loaders are not typically seeding an existing structure.
> My
> > > > > concern
> > > > > > of
> > > > > > > a local file profiler config stems from this stated goal:
> > > > > > > > The goal would be to enable “profile seeding” which allows
> > > profiles
> > > > > to
> > > > > > be
> > > > > > > populated from a time before the profile was created.
> > > > > > > So if the config does not correctly match the profiler config
> > held
> > > in
> > > > > ZK
> > > > > > > and the user runs the batch seeding job, what happens?
> > > > > > >
> > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <
> > > justinjleet@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > The profile not being able to read from ZK feels like a
> fairly
> > > > > > > substantial,
> > > > > > > > if subtle, set of potential problems.  I'd like to see that
> in
> > > > either
> > > > > > > > before merging or at least pretty soon after merging.  Is it
> a
> > > lot
> > > > of
> > > > > > > work
> > > > > > > > to add that functionality based on where things are right
> now?
> > > > > > > >
> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <
> nick@nickallen.org
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > > Here is another limitation that I just thought of. It can only
> > > > > > > > > > read a profile definition from a file.  It probably also makes
> > > > > > > > > > sense to add an option that allows it to read the current
> > > > > > > > > > Profiler configuration from Zookeeper.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Is it worth setting up a default config that pulls from
> the
> > > > main
> > > > > > > > indexing
> > > > > > > > > output?
> > > > > > > > >
> > > > > > > > > Yes, I think that makes sense.  We want the Batch Profiler
> to
> > > > point
> > > > > > to
> > > > > > > > the
> > > > > > > > > right HDFS URL, no matter where/how Metron is deployed.
> When
> > > > > Metron
> > > > > > > gets
> > > > > > > > > spun-up on a cluster, I should be able to just run the
> Batch
> > > > > Profiler
> > > > > > > > > without having to fuss with the input path.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <
> > > > justinjleet@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Re:
> > > > > > > > > >
> > > > > > > > > > >  * You do not configure the Batch Profiler in Ambari.
> It
> > > is
> > > > > > > > configured
> > > > > > > > > > > and executed completely from the command-line.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Is it worth setting up a default config that pulls from
> the
> > > > main
> > > > > > > > indexing
> > > > > > > > > > output?  I'm a little on the fence about it, but it seems
> > > like
> > > > > > making
> > > > > > > > the
> > > > > > > > > > most common case more or less built-in would be nice.
> > > > > > > > > >
> > > > > > > > > > Having said that, I do not consider that a requirement
> for
> > > > > merging
> > > > > > > the
> > > > > > > > > > feature branch.
> > > > > > > > > >
> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <
> > > > > jsirota@apache.org>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I think what you have outlined above is a good initial
> > stab
> > > > at
> > > > > > the
> > > > > > > > > > > feature.  Manual install of spark is not a big deal.
> > > > > Configuring
> > > > > > > via
> > > > > > > > > > > command line while we mature this feature is ok as
> well.
> > > > > Doesn't
> > > > > > > > look
> > > > > > > > > > like
> > > > > > > > > > > configuration steps are too hard.  I think you should
> > > merge.
> > > > > > > > > > >
> > > > > > > > > > > James
> > > > > > > > > > >
> > > > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <ni...@nickallen.org>:
> > > > > > > > > > > > I would like to open a discussion to get the Batch
> > > Profiler
> > > > > > > feature
> > > > > > > > > > > branch
> > > > > > > > > > > > merged into master as part of METRON-1699 [1] Create
> > > Batch
> > > > > > > > Profiler.
> > > > > > > > > > All
> > > > > > > > > > > > of the work that I had in mind for our first draft of
> > the
> > > > > Batch
> > > > > > > > > > Profiler
> > > > > > > > > > > > has been completed. Please take a look through what I
> > > have
> > > > > and
> > > > > > > let
> > > > > > > > me
> > > > > > > > > > > know
> > > > > > > > > > > > if there are other features that you think are
> required
> > > > > > *before*
> > > > > > > we
> > > > > > > > > > > merge.
> > > > > > > > > > > >
> > > > > > > > > > > > Previous list discussions on this topic include [2]
> and
> > > > [3].
> > > > > > > > > > > >
> > > > > > > > > > > > (Q) What can I do with the feature branch?
> > > > > > > > > > > >
> > > > > > > > > > > >   * With the Batch Profiler, you can backfill/seed
> > > profiles
> > > > > > using
> > > > > > > > > > > archived
> > > > > > > > > > > > telemetry. This enables the following types of use
> > cases.
> > > > > > > > > > > >
> > > > > > > > > > > >       1. As a Security Data Scientist, I want to
> > > understand
> > > > > the
> > > > > > > > > > > historical
> > > > > > > > > > > > behaviors and trends of a profile that I have created
> > so
> > > > > that I
> > > > > > > can
> > > > > > > > > > > > determine if I have created a feature set that has
> > > > predictive
> > > > > > > value
> > > > > > > > > for
> > > > > > > > > > > > model building.
> > > > > > > > > > > >
> > > > > > > > > > > >       2. As a Security Data Scientist, I want to
> > > understand
> > > > > the
> > > > > > > > > > > historical
> > > > > > > > > > > > behaviors and trends of a profile that I have created
> > so
> > > > > that I
> > > > > > > can
> > > > > > > > > > > > determine if I have defined the profile correctly and
> > > > > created a
> > > > > > > > > feature
> > > > > > > > > > > set
> > > > > > > > > > > > that matches reality.
> > > > > > > > > > > >
> > > > > > > > > > > >       3. As a Security Platform Engineer, I want to
> > > > generate
> > > > > a
> > > > > > > > > profile
> > > > > > > > > > > > using archived telemetry when I deploy a new model to
> > > > > > production
> > > > > > > so
> > > > > > > > > > that
> > > > > > > > > > > > models depending on that profile can function on day
> 1.
> > > > > > > > > > > >
> > > > > > > > > > > >   * METRON-1699 [1] includes a more detailed
> > description
> > > of
> > > > > the
> > > > > > > > > > feature.
> > > > > > > > > > > >
> > > > > > > > > > > > (Q) What work was completed?
> > > > > > > > > > > >
> > > > > > > > > > > >   * The Batch Profiler runs on Spark and was
> > implemented
> > > in
> > > > > > Java
> > > > > > > to
> > > > > > > > > > > remain
> > > > > > > > > > > > consistent with our current Java-heavy code base.
> > > > > > > > > > > >
> > > > > > > > > > > >   * The Batch Profiler is executed from the
> > command-line.
> > > > It
> > > > > > can
> > > > > > > be
> > > > > > > > > > > > launched using a script or by calling `spark-submit`,
> > > which
> > > > > may
> > > > > > > be
> > > > > > > > > > useful
> > > > > > > > > > > > for advanced users.
> > > > > > > > > > > >
> > > > > > > > > > > >   * Input telemetry can be consumed from multiple
> > > sources;
> > > > > for
> > > > > > > > > example
> > > > > > > > > > > HDFS
> > > > > > > > > > > > or the local file system.
> > > > > > > > > > > >
> > > > > > > > > > > >   * Input telemetry can be consumed in multiple
> > formats;
> > > > for
> > > > > > > > example
> > > > > > > > > > JSON
> > > > > > > > > > > > or ORC.
> > > > > > > > > > > >
> > > > > > > > > > > > >   * The 'output' profile measurements are persisted in HBase and
> > > > > > > > > > > > > are consistent with the Storm Profiler.
> > > > > > > > > > > >
> > > > > > > > > > > >   * It can be run on any underlying engine supported
> by
> > > > > Spark.
> > > > > > I
> > > > > > > > have
> > > > > > > > > > > > tested it both in 'local' mode and on a YARN cluster.
> > > > > > > > > > > >
> > > > > > > > > > > >   * It is installed automatically by the Metron
> MPack.
> > > > > > > > > > > >
> > > > > > > > > > > >   * A README was added that documents usage
> > instructions.
> > > > > > > > > > > >
> > > > > > > > > > > >   * The existing Profiler code was refactored so that
> > as
> > > > much
> > > > > > > code
> > > > > > > > as
> > > > > > > > > > > > possible is shared between the 3 Profiler ports;
> Storm,
> > > the
> > > > > > > Stellar
> > > > > > > > > > REPL,
> > > > > > > > > > > > and Spark. For example, the logic which determines
> the
> > > > > > timestamp
> > > > > > > > of a
> > > > > > > > > > > > message was refactored so that it could be reused by
> > all
> > > > > ports.
> > > > > > > > > > > >
> > > > > > > > > > > >       * metron-profiler-common: The common Profiler
> > code
> > > > > shared
> > > > > > > > > amongst
> > > > > > > > > > > > each port.
> > > > > > > > > > > >       * metron-profiler-storm: Profiler on Storm
> > > > > > > > > > > >       * metron-profiler-spark: Profiler on Spark
> > > > > > > > > > > >       * metron-profiler-repl: Profiler on the Stellar
> > > REPL
> > > > > > > > > > > >       * metron-profiler-client: The client code for
> > > > > retrieving
> > > > > > > > > profile
> > > > > > > > > > > > data; for example PROFILE_GET.
> > > > > > > > > > > >
> > > > > > > > > > > >   * There are 3 separate RPM and DEB packages now
> > created
> > > > for
> > > > > > the
> > > > > > > > > > > Profiler.
> > > > > > > > > > > >
> > > > > > > > > > > >       * metron-profiler-storm-*.rpm
> > > > > > > > > > > >       * metron-profiler-spark-*.rpm
> > > > > > > > > > > >       * metron-profiler-repl-*.rpm
> > > > > > > > > > > >
> > > > > > > > > > > >   * The Profiler integration tests were enhanced to
> > > > leverage
> > > > > > the
> > > > > > > > > > Profiler
> > > > > > > > > > > > Client logic to validate the results.
> > > > > > > > > > > >
> > > > > > > > > > > >   * Review METRON-1699 [1] for a complete break-down
> of
> > > the
> > > > > > tasks
> > > > > > > > > that
> > > > > > > > > > > have
> > > > > > > > > > > > been completed on the feature branch.
> > > > > > > > > > > >
> > > > > > > > > > > > (Q) What limitations exist?
> > > > > > > > > > > >
> > > > > > > > > > > >   * You must manually install Spark to use the Batch Profiler. The Metron
> > > > > > > > > > > > MPack does not treat Spark as a Metron dependency and so does not install
> > > > > > > > > > > > it automatically.
> > > > > > > > > > > >
> > > > > > > > > > > >   * You do not configure the Batch Profiler in Ambari. It is configured
> > > > > > > > > > > > and executed completely from the command-line.
> > > > > > > > > > > >
> > > > > > > > > > > >   * To run the Batch Profiler in 'Full Dev', you have to take the following
> > > > > > > > > > > > manual steps. Some of these are arguably limitations with how Ambari
> > > > > > > > > > > > installs Spark 2 in the version of HDP that we run.
> > > > > > > > > > > >
> > > > > > > > > > > >       1. Install Spark 2 using Ambari.
> > > > > > > > > > > >
> > > > > > > > > > > >       2. Tell Spark how to talk with HBase.
> > > > > > > > > > > >
> > > > > > > > > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
> > > > > > > > > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
> > > > > > > > > > > >
> > > > > > > > > > > >       3. Create the Spark History directory in HDFS.
> > > > > > > > > > > >
> > > > > > > > > > > >         export HADOOP_USER_NAME=hdfs
> > > > > > > > > > > >         hdfs dfs -mkdir /spark2-history
> > > > > > > > > > > >
> > > > > > > > > > > >       4. Change the default input path to `hdfs://localhost:8020/...` to
> > > > > > > > > > > > match the port defined by HDP, instead of port 9000.
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> > > > > > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> > > > > > > > > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> > > > > > > > > > >
> > > > > > > > > > > -------------------
> > > > > > > > > > > Thank you,
> > > > > > > > > > >
> > > > > > > > > > > James Sirota
> > > > > > > > > > > PMC- Apache Metron
> > > > > > > > > > > jsirota AT apache DOT org
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Michael Miklavcic <mi...@gmail.com>.

I think we might want to allow the flexibility to choose the date range
then. I don't yet feel like I have a good enough understanding of all the
ways in which users would want to seed to force them to run the batch job
over all the data. It might also make it easier to deal with remediation,
i.e., an error doesn't force you to re-run over the entire history. The same
goes for testing out the profile seeding batch job in the first place.
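
A minimal sketch of the date-range idea, assuming each telemetry message
carries a `timestamp` field in epoch milliseconds. The `begin` and `end`
parameters mirror the --begin and --end arguments proposed in the quoted
message below; they are not in the feature branch, and the function names
here are hypothetical. Python is used only for brevity.

```python
from datetime import datetime, timezone


def to_epoch_millis(iso_date):
    """Convert a naive ISO-8601 date or timestamp (assumed UTC) to epoch millis."""
    dt = datetime.fromisoformat(iso_date).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def within_range(messages, begin=None, end=None, timestamp_field="timestamp"):
    """Keep messages where begin <= timestamp < end; a bound of None is open."""
    lower = to_epoch_millis(begin) if begin else None
    upper = to_epoch_millis(end) if end else None
    kept = []
    for message in messages:
        ts = message[timestamp_field]
        if lower is not None and ts < lower:
            continue  # before the requested window
        if upper is not None and ts >= upper:
            continue  # at or after the end of the window
        kept.append(message)
    return kept
```

With 9 months archived and the last 3 already profiled, passing an `end`
equal to the time the profile was installed would restrict seeding to the
first 6 months.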

On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <ni...@nickallen.org> wrote:

> Assuming you have 9 months of data archived, yes.
>
> On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <
> michael.miklavcic@gmail.com> wrote:
>
> > So in the case of 3 - if you had 6 months of data that hadn't been
> > profiled and another 3 that had been profiled (9 months total data), in
> > its current form the batch job runs over all 9 months?
> >
> > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <ni...@nickallen.org> wrote:
> >
> > > > How do we establish "tm" from 1.1 above? Any concerns about overlap
> > > > or gaps after the seeding is performed?
> > >
> > > Good point.  Right now, if the Streaming and Batch Profiler overlap the
> > > last write wins.  And presumably the output of the Streaming and Batch
> > > Profiler are the same, so no worries, right? :)
> > >
> > > So it kind of works, but it is definitely not ideal for use case 3.  I
> > > could add --begin and --end args to constrain the time frame over which
> > > the Batch Profiler runs.  I do not have that in the feature branch.  It
> > > would be easy enough to add though.

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Nick Allen <ni...@nickallen.org>.

Assuming you have 9 months of data archived, yes.
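
To illustrate the "last write wins" behaviour described in the quoted
message below, here is a toy sketch. It assumes a measurement is keyed by
profile, entity and period, and uses a dict as a stand-in for the HBase
table; the key layout and names are illustrative assumptions, not the
Profiler's actual row key.

```python
def write_measurements(store, measurements):
    """Upsert each measurement by key; a later write replaces an earlier one."""
    for profile, entity, period, value in measurements:
        store[(profile, entity, period)] = value
    return store


# The streaming Profiler has already written the last 3 periods (6..8).
store = write_measurements(
    {}, [("example-profile", "host-1", n, "streaming") for n in range(6, 9)]
)

# The Batch Profiler then runs over all 9 archived periods (0..8).
write_measurements(
    store, [("example-profile", "host-1", n, "batch") for n in range(0, 9)]
)

# Periods 6..8 were written twice; only the batch values remain.
overlap = [store[("example-profile", "host-1", n)] for n in range(6, 9)]
```

If both Profilers produce the same value for a period, the overwrite is
harmless; the cost is only the redundant work over the overlapping periods.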

On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <
michael.miklavcic@gmail.com> wrote:

> So in the case of 3 - if you had 6 months of data that hadn't been profiled
> and another 3 that had been profiled (9 months total data), in its current
> form the batch job runs over all 9 months?
>
> On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <ni...@nickallen.org> wrote:
>
> > > How do we establish "tm" from 1.1 above? Any concerns about overlap or
> > > gaps after the seeding is performed?
> >
> > Good point.  Right now, if the Streaming and Batch Profiler overlap the
> > last write wins.  And presumably the output of the Streaming and Batch
> > Profiler are the same, so no worries, right? :)
> >
> > So it kind of works, but it is definitely not ideal for use case 3.  I
> > could add --begin and --end args to constrain the time frame over which
> > the Batch Profiler runs.  I do not have that in the feature branch.  It
> > would be easy enough to add though.
> >
> >
> >
> > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <
> > michael.miklavcic@gmail.com> wrote:
> >
> > > Ok, makes sense. That's sort of what I was thinking as well, Nick.
> > > Pulling at this thread just a bit more...
> > >
> > >    1. I have an existing system that's been up a while, and I have
> > >    added k profiles - assume these are the first profiles I've created.
> > >       1. I would have t0 - tm (where m is the time when the profiles
> > >       were first installed) worth of data that has not been profiled yet.
> > >       2. The batch profiler process would be to take that exact profile
> > >       definition from ZK and run the batch loader with that from the
> > >       CLI.
> > >       3. Profiles are now up to date from t0 - tCurrent
> > >    2. I've already done #1 above. Time goes by and now I want to add a
> > >    new profile.
> > >       1. Same first step above
> > >       2. I would run the batch loader with *only* that new profile
> > >       definition to seed?
> > >
> > > Forgive me if I missed this in PR's and discussion in the FB, but how
> > > do we establish "tm" from 1.1 above? Any concerns about overlap or gaps
> > > after the seeding is performed?
> > >
> > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <ni...@nickallen.org> wrote:
> > >
> > > > I think more often than not, you would want to load your profile
> > > > definition from a file.  This is why I considered the 'load from Zk'
> > > > more of a nice-to-have.
> > > >
> > > >    - In use case 1 and 2, this would definitely be the case.  The
> > > >    profiles I am working with are speculative and I am using the batch
> > > >    profiler to determine if they are worth keeping.  In this case, my
> > > >    speculative profiles would not be in Zk (yet).
> > > >    - In use case 3, I could see it go either way.  It might be useful
> > > >    to load from Zk, but it certainly isn't a blocker.
> > > >
> > > >
> > > > > So if the config does not correctly match the profiler config held
> > > > > in ZK and the user runs the batch seeding job, what happens?
> > > >
> > > > You would just get a profile that is slightly different over the
> > > > entire time span.  This is not a new risk.  If the user changes their
> > > > Profile definitions in Zk, the same thing would happen.
> > > >
> > > >
> > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <
> > > > michael.miklavcic@gmail.com> wrote:
> > > >
> > > > > I think I'm torn on this, specifically because it's batch and would
> > > > > generally be run as-needed. Justin, can you elaborate on your
> > > > > concerns there? This feels functionally very similar to our flat
> > > > > file loaders, which all have inputs for config from the CLI only. On
> > > > > the other hand, our flat file loaders are not typically seeding an
> > > > > existing structure. My concern of a local file profiler config stems
> > > > > from this stated goal:
> > > > > > The goal would be to enable “profile seeding” which allows profiles
> > > > > > to be populated from a time before the profile was created.
> > > > > So if the config does not correctly match the profiler config held
> > > > > in ZK and the user runs the batch seeding job, what happens?
> > > > >
> > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <justinjleet@gmail.com> wrote:
> > > > >
> > > > > > The profile not being able to read from ZK feels like a fairly
> > > > > > substantial, if subtle, set of potential problems.  I'd like to see
> > > > > > that in either before merging or at least pretty soon after merging.
> > > > > > Is it a lot of work to add that functionality based on where things
> > > > > > are right now?
> > > > > >
> > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <ni...@nickallen.org> wrote:
> > > > > >
> > > > > > > Here is another limitation that I just thought. It can only read a
> > > > > > > profile definition from a file.  It probably also makes sense to add
> > > > > > > an option that allows it to read the current Profiler configuration
> > > > > > > from Zookeeper.
> > > > > > >
> > > > > > >
> > > > > > > > Is it worth setting up a default config that pulls from the main
> > > > > > > > indexing output?
> > > > > > >
> > > > > > > Yes, I think that makes sense.  We want the Batch Profiler to point
> > > > > > > to the right HDFS URL, no matter where/how Metron is deployed.  When
> > > > > > > Metron gets spun-up on a cluster, I should be able to just run the
> > > > > > > Batch Profiler without having to fuss with the input path.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <justinjleet@gmail.com> wrote:
> > > > > > >
> > > > > > > > Re:
> > > > > > > >
> > > > > > > > >  * You do not configure the Batch Profiler in Ambari.  It is
> > > > > > > > > configured and executed completely from the command-line.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Is it worth setting up a default config that pulls from the main
> > > > > > > > indexing output?  I'm a little on the fence about it, but it seems
> > > > > > > > like making the most common case more or less built-in would be nice.
> > > > > > > >
> > > > > > > > Having said that, I do not consider that a requirement for merging
> > > > > > > > the feature branch.
> > > > > > > >
> > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <jsirota@apache.org> wrote:
> > > > > > > >
> > > > > > > > > I think what you have outlined above is a good initial stab at the
> > > > > > > > > feature.  Manual install of spark is not a big deal.  Configuring
> > > > > > > > > via command line while we mature this feature is ok as well.
> > > > > > > > > Doesn't look like configuration steps are too hard.  I think you
> > > > > > > > > should merge.
> > > > > > > > >
> > > > > > > > > James
> > > > > > > > >
> > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <ni...@nickallen.org>:
> > > > > > > > > > I would like to open a discussion to get the Batch Profiler feature branch
> > > > > > > > > > merged into master as part of METRON-1699 [1] Create Batch Profiler. All
> > > > > > > > > > of the work that I had in mind for our first draft of the Batch Profiler
> > > > > > > > > > has been completed. Please take a look through what I have and let me know
> > > > > > > > > > if there are other features that you think are required *before* we merge.
> > > > > > > > > >
> > > > > > > > > > Previous list discussions on this topic include [2] and [3].
> > > > > > > > > >
> > > > > > > > > > (Q) What can I do with the feature branch?
> > > > > > > > > >
> > > > > > > > > >   * With the Batch Profiler, you can backfill/seed profiles using archived
> > > > > > > > > > telemetry. This enables the following types of use cases.
> > > > > > > > > >
> > > > > > > > > >       1. As a Security Data Scientist, I want to understand the historical
> > > > > > > > > > behaviors and trends of a profile that I have created so that I can
> > > > > > > > > > determine if I have created a feature set that has predictive value for
> > > > > > > > > > model building.
> > > > > > > > > >
> > > > > > > > > >       2. As a Security Data Scientist, I want to understand the historical
> > > > > > > > > > behaviors and trends of a profile that I have created so that I can
> > > > > > > > > > determine if I have defined the profile correctly and created a feature set
> > > > > > > > > > that matches reality.
> > > > > > > > > >
> > > > > > > > > >       3. As a Security Platform Engineer, I want to generate a profile
> > > > > > > > > > using archived telemetry when I deploy a new model to production so that
> > > > > > > > > > models depending on that profile can function on day 1.
> > > > > > > > > >
> > > > > > > > > >   * METRON-1699 [1] includes a more detailed description of the feature.
> > > > > > > > > >
> > > > > > > > > > (Q) What work was completed?
> > > > > > > > > >
> > > > > > > > > >   * The Batch Profiler runs on Spark and was implemented in Java to remain
> > > > > > > > > > consistent with our current Java-heavy code base.
> > > > > > > > > >
> > > > > > > > > >   * The Batch Profiler is executed from the command-line. It can be
> > > > > > > > > > launched using a script or by calling `spark-submit`, which may be useful
> > > > > > > > > > for advanced users.
> > > > > > > > > >
> > > > > > > > > >   * Input telemetry can be consumed from multiple sources; for example HDFS
> > > > > > > > > > or the local file system.
> > > > > > > > > >
> > > > > > > > > >   * Input telemetry can be consumed in multiple formats; for example JSON
> > > > > > > > > > or ORC.
> > > > > > > > > >
> > > > > > > > > >   * The 'output' profile measurements are persisted in HBase and is
> > > > > > > > > > consistent with the Storm Profiler.
> > > > > > > > > >
> > > > > > > > > >   * It can be run on any underlying engine supported by Spark. I have
> > > > > > > > > > tested it both in 'local' mode and on a YARN cluster.
> > > > > > > > > >
> > > > > > > > > >   * It is installed automatically by the Metron MPack.
> > > > > > > > > >
> > > > > > > > > >   * A README was added that documents usage instructions.
> > > > > > > > > >
> > > > > > > > > >   * The existing Profiler code was refactored so that as much code as
> > > > > > > > > > possible is shared between the 3 Profiler ports; Storm, the Stellar REPL,
> > > > > > > > > > and Spark. For example, the logic which determines the timestamp of a
> > > > > > > > > > message was refactored so that it could be reused by all ports.
> > > > > > > > > >
> > > > > > > > > >       * metron-profiler-common: The common Profiler code shared amongst
> > > > > > > > > > each port.
> > > > > > > > > >       * metron-profiler-storm: Profiler on Storm
> > > > > > > > > >       * metron-profiler-spark: Profiler on Spark
> > > > > > > > > >       * metron-profiler-repl: Profiler on the Stellar REPL
> > > > > > > > > >       * metron-profiler-client: The client code for retrieving profile
> > > > > > > > > > data; for example PROFILE_GET.
> > > > > > > > > >
> > > > > > > > > >   * There are 3 separate RPM and DEB packages now created for the
> > > > > > > > > > Profiler.
> > > > > > > > > >
> > > > > > > > > >       * metron-profiler-storm-*.rpm
> > > > > > > > > >       * metron-profiler-spark-*.rpm
> > > > > > > > > >       * metron-profiler-repl-*.rpm
> > > > > > > > > >
> > > > > > > > > >   * The Profiler integration tests were enhanced to leverage the Profiler
> > > > > > > > > > Client logic to validate the results.
> > > > > > > > > >
> > > > > > > > > >   * Review METRON-1699 [1] for a complete break-down of
> the
> > > > tasks
> > > > > > > that
> > > > > > > > > have
> > > > > > > > > > been completed on the feature branch.
> > > > > > > > > >
> > > > > > > > > > (Q) What limitations exist?
> > > > > > > > > >
> > > > > > > > > >   * You must manually install Spark to use the Batch
> > > Profiler.
> > > > > The
> > > > > > > > Metron
> > > > > > > > > > MPack does not treat Spark as a Metron dependency and so
> > does
> > > > not
> > > > > > > > install
> > > > > > > > > > it automatically.
> > > > > > > > > >
> > > > > > > > > >   * You do not configure the Batch Profiler in Ambari. It
> > is
> > > > > > > configured
> > > > > > > > > > and executed completely from the command-line.
> > > > > > > > > >
> > > > > > > > > >   * To run the Batch Profiler in 'Full Dev', you have to
> > take
> > > > the
> > > > > > > > > following
> > > > > > > > > > manual steps. Some of these are arguably limitations with
> > how
> > > > > > Ambari
> > > > > > > > > > installs Spark 2 in the version of HDP that we run.
> > > > > > > > > >
> > > > > > > > > >       1. Install Spark 2 using Ambari.
> > > > > > > > > >
> > > > > > > > > >       2. Tell Spark how to talk with HBase.
> > > > > > > > > >
> > > > > > > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
> > > > > > > > > >         cp
> > /usr/hdp/current/hbase-client/conf/hbase-site.xml
> > > > > > > > > > $SPARK_HOME/conf/
> > > > > > > > > >
> > > > > > > > > >       3. Create the Spark History directory in HDFS.
> > > > > > > > > >
> > > > > > > > > >         export HADOOP_USER_NAME=hdfs
> > > > > > > > > >         hdfs dfs -mkdir /spark2-history
> > > > > > > > > >
> > > > > > > > > >       4. Change the default input path to
> > > > > > `hdfs://localhost:8020/...`
> > > > > > > > to
> > > > > > > > > > match the port defined by HDP, instead of port 9000.
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> > > > > > > > > > [2]
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> > > > > > > > > > [3]
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> > > > > > > > >
> > > > > > > > > -------------------
> > > > > > > > > Thank you,
> > > > > > > > >
> > > > > > > > > James Sirota
> > > > > > > > > PMC- Apache Metron
> > > > > > > > > jsirota AT apache DOT org
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Michael Miklavcic <mi...@gmail.com>.

So in use case 3, if you had 6 months of data that hadn't been profiled and
another 3 months that had been profiled (9 months of data in total), would
the batch job in its current form run over all 9 months?


Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Nick Allen <ni...@nickallen.org>.

> How do we establish "tm" from 1.1 above? Any concerns about overlap or
> gaps after the seeding is performed?

Good point.  Right now, if the Streaming and Batch Profilers overlap, the
last write wins.  And presumably the outputs of the Streaming and Batch
Profilers are the same, so no worries, right? :)
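
To make the overlap behavior concrete, here is a minimal sketch in Python.
It is not Metron code: a plain dictionary stands in for the HBase table, and
the (profile, entity, period) key, the `write_measurement` helper, and the
sample values are all assumptions made up for illustration.

```python
# Illustrative only: a dict stands in for the HBase table, and the
# (profile, entity, period) tuple is a simplified, hypothetical row key.
store = {}

def write_measurement(store, profile, entity, period, value):
    """Write a measurement; a later write to the same key replaces the earlier one."""
    store[(profile, entity, period)] = value

# The streaming profiler writes period 100 first...
write_measurement(store, "hello-world", "10.0.0.1", 100, 42)
# ...then the batch profiler recomputes the same period from archived telemetry.
write_measurement(store, "hello-world", "10.0.0.1", 100, 42)
# A period that only the batch profiler covers is simply added.
write_measurement(store, "hello-world", "10.0.0.1", 99, 17)

print(len(store))  # 2: the overlapping period is stored once
```

As long as both profilers compute the same value for a period, the overlap
is harmless; it only costs the redundant work of recomputing that period.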

So it kind of works, but it is definitely not ideal for use case 3.  I
could add --begin and --end args to constrain the time frame over which the
Batch Profiler runs.  Those are not in the feature branch yet, but they
would be easy enough to add.
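
As a rough sketch of what such a time constraint could look like, the Python
below filters telemetry to a half-open window [begin, end).  This is not the
feature branch implementation (the args do not exist there yet); the helper
names `to_epoch_millis` and `within_window` are hypothetical, and it assumes
each message carries a `timestamp` field in epoch milliseconds.

```python
from datetime import datetime, timezone

def to_epoch_millis(iso_text):
    """Parse an ISO-8601 timestamp (assumed to be UTC) into epoch milliseconds."""
    dt = datetime.fromisoformat(iso_text).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

def within_window(messages, begin=None, end=None):
    """Keep messages whose 'timestamp' falls in [begin, end); None means unbounded."""
    kept = []
    for message in messages:
        ts = message["timestamp"]
        if begin is not None and ts < begin:
            continue  # older than the requested window
        if end is not None and ts >= end:
            continue  # at or past the (exclusive) end of the window
        kept.append(message)
    return kept

# Hypothetical archive: one message before, one inside, one at the end bound.
telemetry = [
    {"timestamp": to_epoch_millis("2018-02-28T23:59:59")},
    {"timestamp": to_epoch_millis("2018-03-01T00:00:00")},
    {"timestamp": to_epoch_millis("2018-09-01T00:00:00")},
]
begin = to_epoch_millis("2018-03-01T00:00:00")
end = to_epoch_millis("2018-09-01T00:00:00")
selected = within_window(telemetry, begin, end)
```

Making the end bound exclusive means two back-to-back runs that share a
boundary would not both process the message sitting exactly on it.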




Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Michael Miklavcic <mi...@gmail.com>.

Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling
at this thread just a bit more...

   1. I have an existing system that's been up a while, and I have added k
   profiles - assume these are the first profiles I've created.
      1. I would have data from t0 to tm (where tm is the time when the
      profiles were first installed) that has not been profiled yet.
      2. The batch profiler process would be to take that exact profile
      definition from ZK and run the batch loader with that from the CLI.
      3. Profiles are now up to date from t0 to tCurrent.
   2. I've already done #1 above. Time goes by and now I want to add a new
   profile.
      1. Same first step as above.
      2. I would run the batch loader with *only* that new profile
      definition to seed?

Forgive me if I missed this in the PRs and discussion on the feature branch,
but how do we establish "tm" from 1.1 above? Any concerns about overlap or
gaps after the seeding is performed?
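
One way to reason about the overlap/gap question is in terms of fixed-width
profile periods. The sketch below is only an illustration. It assumes a
15-minute period duration and that a period is identified by integer division
of the epoch timestamp by that duration. The names `period_of`, `seed_range`
and `classify_boundary` are hypothetical and are not part of Metron.

```python
PERIOD_MS = 15 * 60 * 1000  # assumption: 15-minute profile periods


def period_of(ts_ms, period_ms=PERIOD_MS):
    """Index of the fixed-width period that contains an epoch-millis timestamp."""
    return ts_ms // period_ms


def seed_range(t0_ms, tm_ms, period_ms=PERIOD_MS):
    """Periods a batch run would fill: from the period containing t0 up to the
    last period that ends at or before tm. Returns (first, last), or None when
    no complete period lies before tm."""
    first = period_of(t0_ms, period_ms)
    last = period_of(tm_ms, period_ms) - 1
    return (first, last) if last >= first else None


def classify_boundary(last_batch_period, first_streaming_period):
    """Describe how the seeded periods meet the streaming Profiler's periods."""
    if first_streaming_period > last_batch_period + 1:
        return "gap"
    if first_streaming_period <= last_batch_period:
        return "overlap"
    return "contiguous"
```

Under these assumptions, seeding up to the period just before the one that
contains tm leaves the partially observed period to the streaming Profiler.
Whether that partial period should instead be recomputed by the batch job is
exactly the open question above.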

On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <ni...@nickallen.org> wrote:

> I think more often than not, you would want to load your profile definition
> from a file.  This is why I considered the 'load from Zk' more of a
> nice-to-have.
>
>    - In use case 1 and 2, this would definitely be the case.  The profiles
>    I am working with are speculative and I am using the batch profiler to
>    determine if they are worth keeping.  In this case, my speculative
>    profiles would not be in Zk (yet).
>    - In use case 3, I could see it go either way.  It might be useful to
>    load from Zk, but it certainly isn't a blocker.
>
>
> > So if the config does not correctly match the profiler config held in ZK
> > and the user runs the batch seeding job, what happens?
>
> You would just get a profile that is slightly different over the entire
> time span.  This is not a new risk.  If the user changes their Profile
> definitions in Zk, the same thing would happen.

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Nick Allen <ni...@nickallen.org>.

I think more often than not, you would want to load your profile definition
from a file.  This is why I considered the 'load from Zk' more of a
nice-to-have.

   - In use cases 1 and 2, this would definitely be the case.  The profiles
   I am working with are speculative and I am using the batch profiler to
   determine if they are worth keeping.  In this case, my speculative profiles
   would not be in Zk (yet).
   - In use case 3, I could see it go either way.  It might be useful to
   load from Zk, but it certainly isn't a blocker.


> So if the config does not correctly match the profiler config held in ZK and
> the user runs the batch seeding job, what happens?

You would just get a profile that is slightly different over the entire
time span.  This is not a new risk.  If the user changes their Profile
definitions in Zk, the same thing would happen.
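
To make the mismatch risk concrete, a pre-flight check could compare the
profile definition file against whatever is currently in Zk before seeding.
This is a minimal sketch, assuming both definitions are already available as
JSON strings with a top-level "profiles" list whose entries carry a "profile"
name. Fetching the JSON from Zookeeper is left out, and the function names
are hypothetical, not part of Metron.

```python
import json


def index_by_name(definition_json):
    """Map each profile name to its full definition."""
    doc = json.loads(definition_json)
    return {p["profile"]: p for p in doc.get("profiles", [])}


def compare_definitions(file_json, zk_json):
    """Report which profiles differ between the file and the Zk copy."""
    in_file = index_by_name(file_json)
    in_zk = index_by_name(zk_json)
    shared = in_file.keys() & in_zk.keys()
    return {
        "only_in_file": sorted(in_file.keys() - in_zk.keys()),
        "only_in_zk": sorted(in_zk.keys() - in_file.keys()),
        "changed": sorted(n for n in shared if in_file[n] != in_zk[n]),
    }
```

For the speculative profiles in use cases 1 and 2, everything would show up
under "only_in_file", which is expected. For use case 3, a non-empty "changed"
list would be the signal that the seeded history and the live profile may not
line up.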


On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <
michael.miklavcic@gmail.com> wrote:

> I think I'm torn on this, specifically because it's batch and would
> generally be run as-needed. Justin, can you elaborate on your concerns
> there? This feels functionally very similar to our flat file loaders, which
> all have inputs for config from the CLI only. On the other hand, our flat
> file loaders are not typically seeding an existing structure. My concern of
> a local file profiler config stems from this stated goal:
> > The goal would be to enable “profile seeding” which allows profiles to be
> > populated from a time before the profile was created.
> So if the config does not correctly match the profiler config held in ZK
> and the user runs the batch seeding job, what happens?

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Michael Miklavcic <mi...@gmail.com>.

I think I'm torn on this, specifically because it's batch and would
generally be run as-needed. Justin, can you elaborate on your concerns
there? This feels functionally very similar to our flat file loaders, which
all have inputs for config from the CLI only. On the other hand, our flat
file loaders are not typically seeding an existing structure. My concern with
a local file profiler config stems from this stated goal:
> The goal would be to enable “profile seeding” which allows profiles to be
> populated from a time before the profile was created.
So if the config does not correctly match the profiler config held in ZK
and the user runs the batch seeding job, what happens?

On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <ju...@gmail.com> wrote:

> The profile not being able to read from ZK feels like a fairly substantial,
> if subtle, set of potential problems.  I'd like to see that in either
> before merging or at least pretty soon after merging.  Is it a lot of work
> to add that functionality based on where things are right now?
>
> On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <ni...@nickallen.org> wrote:
>
> > Here is another limitation that I just thought. It can only read a
> profile
> > definition from a file.  It probably also makes sense to add an option
> that
> > allows it to read the current Profiler configuration from Zookeeper.
> >
> >
> > > Is it worth setting up a default config that pulls from the main
> indexing
> > output?
> >
> > Yes, I think that makes sense.  We want the Batch Profiler to point to
> the
> > right HDFS URL, no matter where/how Metron is deployed.  When Metron gets
> > spun-up on a cluster, I should be able to just run the Batch Profiler
> > without having to fuss with the input path.
> >
> >
> >
> >
> >
> > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <ju...@gmail.com>
> wrote:
> >
> > > Re:
> > >
> > > >  * You do not configure the Batch Profiler in Ambari.  It is
> configured
> > > > and executed completely from the command-line.
> > > >
> > >
> > > Is it worth setting up a default config that pulls from the main
> indexing
> > > output?  I'm a little on the fence about it, but it seems like making
> the
> > > most common case more or less built-in would be nice.
> > >
> > > Having said that, I do not consider that a requirement for merging the
> > > feature branch.
> > >
> > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <js...@apache.org>
> > wrote:
> > >
> > > > I think what you have outlined above is a good initial stab at the
> > > > feature.  Manual install of spark is not a big deal.  Configuring via
> > > > command line while we mature this feature is ok as well.  Doesn't
> look
> > > like
> > > > configuration steps are too hard.  I think you should merge.
> > > >
> > > > James
> > > >
> > > > 19.09.2018, 08:15, "Nick Allen" <ni...@nickallen.org>:
> > > > > I would like to open a discussion to get the Batch Profiler feature
> > > > branch
> > > > > merged into master as part of METRON-1699 [1] Create Batch
> Profiler.
> > > All
> > > > > of the work that I had in mind for our first draft of the Batch
> > > Profiler
> > > > > has been completed. Please take a look through what I have and let
> me
> > > > know
> > > > > if there are other features that you think are required *before* we
> > > > merge.
> > > > >
> > > > > Previous list discussions on this topic include [2] and [3].
> > > > >
> > > > > (Q) What can I do with the feature branch?
> > > > >
> > > > >   * With the Batch Profiler, you can backfill/seed profiles using
> > > > archived
> > > > > telemetry. This enables the following types of use cases.
> > > > >
> > > > >       1. As a Security Data Scientist, I want to understand the
> > > > historical
> > > > > behaviors and trends of a profile that I have created so that I can
> > > > > determine if I have created a feature set that has predictive value
> > for
> > > > > model building.
> > > > >
> > > > >       2. As a Security Data Scientist, I want to understand the
> > > > historical
> > > > > behaviors and trends of a profile that I have created so that I can
> > > > > determine if I have defined the profile correctly and created a
> > feature
> > > > set
> > > > > that matches reality.
> > > > >
> > > > >       3. As a Security Platform Engineer, I want to generate a
> > profile
> > > > > using archived telemetry when I deploy a new model to production so
> > > that
> > > > > models depending on that profile can function on day 1.
> > > > >
> > > > >   * METRON-1699 [1] includes a more detailed description of the
> > > feature.
> > > > >
> > > > > (Q) What work was completed?
> > > > >
> > > > >   * The Batch Profiler runs on Spark and was implemented in Java to
> > > > remain
> > > > > consistent with our current Java-heavy code base.
> > > > >
> > > > >   * The Batch Profiler is executed from the command-line. It can be
> > > > > launched using a script or by calling `spark-submit`, which may be
> > > useful
> > > > > for advanced users.
> > > > >
> > > > >   * Input telemetry can be consumed from multiple sources; for
> > example
> > > > HDFS
> > > > > or the local file system.
> > > > >
> > > > >   * Input telemetry can be consumed in multiple formats; for
> example
> > > JSON
> > > > > or ORC.
> > > > >
> > > > >   * The 'output' profile measurements are persisted in HBase and is
> > > > > consistent with the Storm Profiler.
> > > > >
> > > > >   * It can be run on any underlying engine supported by Spark. I
> have
> > > > > tested it both in 'local' mode and on a YARN cluster.
> > > > >
> > > > >   * It is installed automatically by the Metron MPack.
> > > > >
> > > > >   * A README was added that documents usage instructions.
> > > > >
> > > > >   * The existing Profiler code was refactored so that as much code
> as
> > > > > possible is shared between the 3 Profiler ports; Storm, the Stellar
> > > REPL,
> > > > > and Spark. For example, the logic which determines the timestamp
> of a
> > > > > message was refactored so that it could be reused by all ports.
> > > > >
> > > > >       * metron-profiler-common: The common Profiler code shared
> > amongst
> > > > > each port.
> > > > >       * metron-profiler-storm: Profiler on Storm
> > > > >       * metron-profiler-spark: Profiler on Spark
> > > > >       * metron-profiler-repl: Profiler on the Stellar REPL
> > > > >       * metron-profiler-client: The client code for retrieving
> > profile
> > > > > data; for example PROFILE_GET.
> > > > >
> > > > >   * There are 3 separate RPM and DEB packages now created for the
> > > > Profiler.
> > > > >
> > > > >       * metron-profiler-storm-*.rpm
> > > > >       * metron-profiler-spark-*.rpm
> > > > >       * metron-profiler-repl-*.rpm
> > > > >
> > > > >   * The Profiler integration tests were enhanced to leverage the
> > > Profiler
> > > > > Client logic to validate the results.
> > > > >
> > > > >   * Review METRON-1699 [1] for a complete break-down of the tasks
> > that
> > > > have
> > > > > been completed on the feature branch.
> > > > >
> > > > > (Q) What limitations exist?
> > > > >
> > > > >   * You must manually install Spark to use the Batch Profiler. The
> > > Metron
> > > > > MPack does not treat Spark as a Metron dependency and so does not
> > > install
> > > > > it automatically.
> > > > >
> > > > >   * You do not configure the Batch Profiler in Ambari. It is
> > configured
> > > > > and executed completely from the command-line.
> > > > >
> > > > >   * To run the Batch Profiler in 'Full Dev', you have to take the
> > > > following
> > > > > manual steps. Some of these are arguably limitations with how
> Ambari
> > > > > installs Spark 2 in the version of HDP that we run.
> > > > >
> > > > >       1. Install Spark 2 using Ambari.
> > > > >
> > > > >       2. Tell Spark how to talk with HBase.
> > > > >
> > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
> > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml
> > > > > $SPARK_HOME/conf/
> > > > >
> > > > >       3. Create the Spark History directory in HDFS.
> > > > >
> > > > >         export HADOOP_USER_NAME=hdfs
> > > > >         hdfs dfs -mkdir /spark2-history
> > > > >
> > > > >       4. Change the default input path to `hdfs://localhost:8020/...` to
> > > > > match the port defined by HDP, instead of port 9000.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> > > >
> > > > -------------------
> > > > Thank you,
> > > >
> > > > James Sirota
> > > > PMC- Apache Metron
> > > > jsirota AT apache DOT org
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Justin Leet <ju...@gmail.com>.

The Profiler not being able to read from ZK feels like a fairly substantial,
if subtle, source of potential problems.  I'd like to see that added either
before merging or at least pretty soon after merging.  Is it a lot of work
to add that functionality based on where things are right now?

On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <ni...@nickallen.org> wrote:

> Here is another limitation that I just thought of. It can only read a profile
> definition from a file.  It probably also makes sense to add an option that
> allows it to read the current Profiler configuration from Zookeeper.
>
>
> > Is it worth setting up a default config that pulls from the main indexing
> > output?
>
> Yes, I think that makes sense.  We want the Batch Profiler to point to the
> right HDFS URL, no matter where/how Metron is deployed.  When Metron gets
> spun-up on a cluster, I should be able to just run the Batch Profiler
> without having to fuss with the input path.
>
>
>
>
>

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Nick Allen <ni...@nickallen.org>.

Here is another limitation that I just thought of.  The Batch Profiler can
only read a profile definition from a file.  It probably also makes sense to
add an option that allows it to read the current Profiler configuration from
Zookeeper.
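
Until such an option exists, one possible workaround is to pull the
configuration that currently lives in Zookeeper down to local files and
point the Batch Profiler at the result.  The sketch below only wraps
Metron's `zk_load_configs.sh` in PULL mode.  The function name, the
`DRY_RUN` switch, and all three default values are assumptions made for
illustration; nothing here is part of the feature branch.

```shell
#!/usr/bin/env bash
# Sketch: copy the live Metron configuration out of Zookeeper into local
# files, so that a tool which only reads files can use it.
# All defaults below are hypothetical; override them for a real cluster.
METRON_HOME="${METRON_HOME:-/usr/metron/current}"
ZOOKEEPER="${ZOOKEEPER:-localhost:2181}"
OUTPUT_DIR="${OUTPUT_DIR:-/tmp/metron-zk-config}"

# Pull all configuration from Zookeeper into OUTPUT_DIR.
# DRY_RUN (hypothetical switch) defaults to 1, which only prints the command.
pull_zk_configs() {
  local cmd=("$METRON_HOME/bin/zk_load_configs.sh" -m PULL -z "$ZOOKEEPER" -o "$OUTPUT_DIR" -f)
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Calling `pull_zk_configs` with `DRY_RUN=0` would run the pull.  The profile
definition found under the output directory could then be handed to the
Batch Profiler as a file.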


> Is it worth setting up a default config that pulls from the main indexing
> output?

Yes, I think that makes sense.  We want the Batch Profiler to point to the
right HDFS URL, no matter where/how Metron is deployed.  When Metron gets
spun-up on a cluster, I should be able to just run the Batch Profiler
without having to fuss with the input path.
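
As a rough illustration of the idea, and not how the Batch Profiler works
today, the input URL could be derived from the cluster's own settings rather
than hard-coded.  The sketch asks Hadoop for `fs.defaultFS` with
`hdfs getconf` and joins it with a telemetry path.  The function names and
`/path/to/telemetry` are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Sketch: build the Batch Profiler input path from the cluster's default
# filesystem instead of hard-coding a host and port.

# Join a filesystem URI (e.g. hdfs://localhost:8020) and a path.
build_input_path() {
  local default_fs="${1%/}"   # drop one trailing slash, if present
  local path="/${2#/}"        # ensure exactly one leading slash
  printf '%s%s\n' "$default_fs" "$path"
}

# Ask Hadoop for the configured default filesystem.
resolve_default_fs() {
  hdfs getconf -confKey fs.defaultFS
}

# Example use on a node with the HDFS client installed
# (/path/to/telemetry is a placeholder, not a real Metron default):
#   build_input_path "$(resolve_default_fs)" "/path/to/telemetry"
```

On 'Full Dev' this would yield a URL on port 8020 without anyone having to
remember that HDP does not use port 9000.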





On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <ju...@gmail.com> wrote:

> Re:
>
> >  * You do not configure the Batch Profiler in Ambari.  It is configured
> > and executed completely from the command-line.
> >
>
> Is it worth setting up a default config that pulls from the main indexing
> output?  I'm a little on the fence about it, but it seems like making the
> most common case more or less built-in would be nice.
>
> Having said that, I do not consider that a requirement for merging the
> feature branch.
>

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by Justin Leet <ju...@gmail.com>.

Re:

>  * You do not configure the Batch Profiler in Ambari.  It is configured
> and executed completely from the command-line.
>

Is it worth setting up a default config that pulls from the main indexing
output?  I'm a little on the fence about it, but it seems like making the
most common case more or less built-in would be nice.

Having said that, I do not consider that a requirement for merging the
feature branch.

On Wed, Sep 19, 2018 at 11:23 AM James Sirota <js...@apache.org> wrote:

> I think what you have outlined above is a good initial stab at the
> feature.  Manual install of spark is not a big deal.  Configuring via
> command line while we mature this feature is ok as well.  Doesn't look like
> configuration steps are too hard.  I think you should merge.
>
> James
>
> -------------------
> Thank you,
>
> James Sirota
> PMC- Apache Metron
> jsirota AT apache DOT org
>
>

Re: [DISCUSS] Batch Profiler Feature Branch

Posted by James Sirota <js...@apache.org>.

I think what you have outlined above is a good initial stab at the feature.  Manual install of Spark is not a big deal.  Configuring via the command line while we mature this feature is OK as well.  It doesn't look like the configuration steps are too hard.  I think you should merge.

James 

19.09.2018, 08:15, "Nick Allen" <ni...@nickallen.org>:
> I would like to open a discussion to get the Batch Profiler feature branch
> merged into master as part of METRON-1699 [1] Create Batch Profiler. All
> of the work that I had in mind for our first draft of the Batch Profiler
> has been completed. Please take a look through what I have and let me know
> if there are other features that you think are required *before* we merge.
>
> Previous list discussions on this topic include [2] and [3].
>
> (Q) What can I do with the feature branch?
>
>   * With the Batch Profiler, you can backfill/seed profiles using archived
> telemetry. This enables the following types of use cases.
>
>       1. As a Security Data Scientist, I want to understand the historical
> behaviors and trends of a profile that I have created so that I can
> determine if I have created a feature set that has predictive value for
> model building.
>
>       2. As a Security Data Scientist, I want to understand the historical
> behaviors and trends of a profile that I have created so that I can
> determine if I have defined the profile correctly and created a feature set
> that matches reality.
>
>       3. As a Security Platform Engineer, I want to generate a profile
> using archived telemetry when I deploy a new model to production so that
> models depending on that profile can function on day 1.
>
>   * METRON-1699 [1] includes a more detailed description of the feature.
>
> (Q) What work was completed?
>
>   * The Batch Profiler runs on Spark and was implemented in Java to remain
> consistent with our current Java-heavy code base.
>
>   * The Batch Profiler is executed from the command-line. It can be
> launched using a script or by calling `spark-submit`, which may be useful
> for advanced users.
>
>   * Input telemetry can be consumed from multiple sources; for example HDFS
> or the local file system.
>
>   * Input telemetry can be consumed in multiple formats; for example JSON
> or ORC.
>
>   * The 'output' profile measurements are persisted in HBase and are
> consistent with the Storm Profiler.
>
>   * It can be run on any underlying engine supported by Spark. I have
> tested it both in 'local' mode and on a YARN cluster.
>
>   * It is installed automatically by the Metron MPack.
>
>   * A README was added that documents usage instructions.
>
>   * The existing Profiler code was refactored so that as much code as
> possible is shared between the 3 Profiler ports; Storm, the Stellar REPL,
> and Spark. For example, the logic which determines the timestamp of a
> message was refactored so that it could be reused by all ports.
>
>       * metron-profiler-common: The common Profiler code shared amongst
> each port.
>       * metron-profiler-storm: Profiler on Storm
>       * metron-profiler-spark: Profiler on Spark
>       * metron-profiler-repl: Profiler on the Stellar REPL
>       * metron-profiler-client: The client code for retrieving profile
> data; for example PROFILE_GET.
>
>   * There are 3 separate RPM and DEB packages now created for the Profiler.
>
>       * metron-profiler-storm-*.rpm
>       * metron-profiler-spark-*.rpm
>       * metron-profiler-repl-*.rpm
>
>   * The Profiler integration tests were enhanced to leverage the Profiler
> Client logic to validate the results.
>
>   * Review METRON-1699 [1] for a complete break-down of the tasks that have
> been completed on the feature branch.
>
> (Q) What limitations exist?
>
>   * You must manually install Spark to use the Batch Profiler. The Metron
> MPack does not treat Spark as a Metron dependency and so does not install
> it automatically.
>
>   * You do not configure the Batch Profiler in Ambari. It is configured
> and executed completely from the command-line.
>
>   * To run the Batch Profiler in 'Full Dev', you have to take the following
> manual steps. Some of these are arguably limitations with how Ambari
> installs Spark 2 in the version of HDP that we run.
>
>       1. Install Spark 2 using Ambari.
>
>       2. Tell Spark how to talk with HBase.
>
>         SPARK_HOME=/usr/hdp/current/spark2-client
>         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
>
>       3. Create the Spark History directory in HDFS.
>
>         export HADOOP_USER_NAME=hdfs
>         hdfs dfs -mkdir /spark2-history
>
>       4. Change the default input path to `hdfs://localhost:8020/...` to
> match the port defined by HDP, instead of port 9000.
>
> [1] https://issues.apache.org/jira/browse/METRON-1699
> [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
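
For reference, manual steps 2 and 3 from the quoted 'Full Dev' list could be
collected into a small script along these lines.  This is a sketch, not
something shipped on the feature branch.  The paths are the ones given in the
steps; the `run` helper, the `prepare_full_dev` name, and the `DRY_RUN` switch
are assumptions.  Step 1 (installing Spark 2 through Ambari) and step 4
(changing the input path) are not covered.

```shell
#!/usr/bin/env bash
# Sketch of 'Full Dev' manual steps 2 and 3.
# DRY_RUN (hypothetical switch) defaults to 1, so commands are only printed.
SPARK_HOME="${SPARK_HOME:-/usr/hdp/current/spark2-client}"
HBASE_SITE="${HBASE_SITE:-/usr/hdp/current/hbase-client/conf/hbase-site.xml}"

# Print the command in dry-run mode; otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

prepare_full_dev() {
  # Step 2: tell Spark how to talk with HBase.
  run cp "$HBASE_SITE" "$SPARK_HOME/conf/"
  # Step 3: create the Spark History directory in HDFS as the hdfs user.
  run env HADOOP_USER_NAME=hdfs hdfs dfs -mkdir /spark2-history
}
```

Running `prepare_full_dev` prints the two commands for review; setting
`DRY_RUN=0` first would execute them.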

------------------- 
Thank you,

James Sirota
PMC- Apache Metron
jsirota AT apache DOT org