Posted to dev@metron.apache.org by Casey Stella <ce...@gmail.com> on 2017/06/21 21:07:36 UTC

[DISCUSS] Mutation of Indexed Data

Hi All,

I know we've had a couple of these already, but we're due for another
discussion of a sensible approach to mutating indexed data.  The motivation
for this is that users will want to update fields to correct and augment data.
These corrections are invaluable for things like feedback for ML models or
just plain providing better context when evaluating alerts, etc.

Rather than posing a solution, I'd like to pose the characteristics of a
solution and we can fight about those first. ;)

In my mind, the following are the characteristics that I'd look for:

   - Changes should be considered additional or replacement fields for
   existing fields
   - Changes need to be available in the web view in near real time (on the
   order of milliseconds)
   - Changes should be available in the batch view
      - I'd be OK with this being eventually consistent with the web view; thoughts?
   - Changes should have lineage preserved
      - Current value is the optimized path
      - Lineage search is the less optimized path
   - If HBase is part of a solution
      - maintain a scan-free solution
      - maintain a coprocessor-free solution

Most of what I've thought of is something along these lines:

   - Diffs are stored as columns in HBase rows
      - row: GUID:current would have one column with the current
      representation
      - row: GUID:lineage would have an ordered set of columns representing
      the lineage diffs
   - Mutable indices (e.g. Solr or ES) are updated directly
   - We'd probably want to provide transparent read support downstream
   that merges the edits in for batch reads:
      - a spark dataframe
      - a hive serde
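
The row layout above can be sketched concretely. This is a minimal in-memory model, not a real HBase client; the `GUID:current` / `GUID:lineage` row keys come straight from the bullets, while the `d:` column family and the diff format are illustrative assumptions:

```python
import json

# In-memory stand-in for an HBase table: row key -> {column qualifier: value}.
# A real implementation would use the HBase client's Get/Put; the point here
# is that both read paths below are single-row point lookups, i.e. scan-free.
table = {}

def record_edit(guid, version, diff, current):
    """Store one edit: append a lineage diff column and refresh the
    pre-computed current representation."""
    # Ordered lineage: one column per edit; zero-padding the qualifier
    # keeps the columns sorted in edit order.
    table.setdefault(guid + ":lineage", {})["d:%06d" % version] = json.dumps(diff)
    # The current value lives in its own row with a single column, so the
    # optimized read path never has to merge diffs.
    table[guid + ":current"] = {"d:value": json.dumps(current)}

def read_current(guid):
    """Optimized path: a single point Get."""
    return json.loads(table[guid + ":current"]["d:value"])

def read_lineage(guid):
    """Less-optimized path: also a single point Get; diffs come back in edit order."""
    row = table.get(guid + ":lineage", {})
    return [json.loads(value) for _, value in sorted(row.items())]
```

Keeping the current value and the lineage in separate rows under the same GUID prefix means the hot path reads one small row, while a lineage search reads one wider row, with no scans or coprocessors involved.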

What I'd like to get out of this discussion is an architecture document
with a suggested approach and the necessary JIRAs to split this up. If
anyone has suggestions or comments about any of this, please speak up.  I'd
like to actually get this done in the near-term. :)

Best,

Casey

Re: [DISCUSS] Mutation of Indexed Data

Posted by Otto Fowler <ot...@gmail.com>.
This problem is not uncommon, I would think.  This should be implemented as
cleanly as possible so that it can be spun out.
It would also be a candidate for a feature/collaboration/long-lived branch.


On June 26, 2017 at 12:44:44, Casey Stella (cestella@gmail.com) wrote:


Re: [DISCUSS] Mutation of Indexed Data

Posted by Casey Stella <ce...@gmail.com>.
When we're talking about a "transaction log", an edit could involve
multiple deletions/additions, so are we proposing storing a diff of the JSON
map as the representation of a particular transaction?  I proposed
pre-caching the current value to lessen the burden on the reader (i.e. not
having to merge the transactions into the current state); what do we think of
that?

Also, I want to ensure we maintain a solution that is scan-free: the edits
should exist as separate columns rather than separate rows in the NoSQL
store.

Thoughts?
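
One way to picture the "diff as transaction" idea (the `delete`/`put` diff shape below is an illustrative assumption, not a proposed format): a single transaction can remove some fields and add or replace others, and pre-caching the current value means the fold happens once at write time instead of on every read.

```python
def apply_diff(doc, diff):
    """Apply one transaction to a JSON map: a single edit may delete
    several fields and add or replace several others."""
    out = dict(doc)
    for field in diff.get("delete", []):
        out.pop(field, None)
    out.update(diff.get("put", {}))
    return out

def replay(original, diffs):
    """What a reader would have to do WITHOUT a pre-cached current value:
    fold every transaction, in order, into the original document."""
    doc = dict(original)
    for diff in diffs:
        doc = apply_diff(doc, diff)
    return doc
```

Storing the result of `replay` alongside the diffs is exactly the pre-cached current value; the diffs themselves would sit in separate columns of the same row, keeping the store scan-free.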

On Mon, Jun 26, 2017 at 5:36 PM, James Sirota <js...@apache.org> wrote:


Re: [DISCUSS] Mutation of Indexed Data

Posted by James Sirota <js...@apache.org>.
It is clear to me that we need an independently-stored transaction log that is decoupled from any of our existing systems.  So Simon’s idea of storing the transaction logs in HBase and being able to reference them via a global ID resonates with me.  I like it for the following reasons:

- It makes Metron more pluggable as far as adding additional sources for data storage (for example, a graph database) as well as disabling existing data sources.

- It makes enforcing consistency of data between data sources easier.  Each data storage system can be pointed at the transaction log, so when a user modifies data in system X and it gets recorded in the transaction log, systems Y and Z can listen for this change and adjust their data accordingly based on the global ID.
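
A toy version of that flow (the names are illustrative, not a proposed API): edits recorded against a global ID fan out to whichever stores have subscribed, so no store listens to another store.

```python
class TransactionLog:
    """Append-only log of edits, keyed by global ID, decoupled from any
    one storage system; stores subscribe and apply changes themselves."""
    def __init__(self):
        self.entries = []
        self.listeners = []

    def subscribe(self, listener):
        self.listeners.append(listener)

    def record(self, global_id, change):
        # Record first, then notify: every store sees the same ordered log.
        self.entries.append((global_id, change))
        for listener in self.listeners:
            listener(global_id, change)

# Two toy downstream stores (think "system Y" and "system Z") that stay
# consistent by following the log rather than each other.
system_y, system_z = {}, {}
log = TransactionLog()
log.subscribe(lambda gid, change: system_y.setdefault(gid, {}).update(change))
log.subscribe(lambda gid, change: system_z.setdefault(gid, {}).update(change))
```

Adding or disabling a data source is then just adding or removing a subscriber; the log itself never changes.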

Thanks, James


22.06.2017, 14:09, "Justin Leet" <ju...@gmail.com>:

------------------- 
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org

Re: [DISCUSS] Mutation of Indexed Data

Posted by Justin Leet <ju...@gmail.com>.
Thanks, Jon, that looks like it should work for the key.  I didn't realize
that guid got handled that way, which makes life much easier there.  Almost
like we already needed to identify messages or something.  At that point we
should be good, since we can easily retrieve, update, and put on it.

We'll also need to make sure any long term storage solution also uses it.
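
That retrieve/update/put cycle can be sketched against an in-memory stand-in for the index (real ES or Solr clients differ in API, but the shape is the same): because the write-back reuses the same guid, the edit replaces the old document rather than creating a new, unlinked one.

```python
index = {}  # stand-in for a document index: guid -> document

def put_doc(guid, doc):
    """Index a document under its guid."""
    index[guid] = dict(doc)

def update_doc(guid, changes):
    """There is no in-place mutation in a document index, so an update is
    retrieve -> mutate -> reindex under the SAME key."""
    doc = dict(index[guid])  # retrieve the existing document by guid
    doc.update(changes)      # mutate it (possibly non-trivially)
    index[guid] = doc        # reindex: same guid, so no duplicate doc
    return doc
```

Without a stable guid the last step would create a second, similar document instead of replacing the first, which is exactly the failure mode described above.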


On Thu, Jun 22, 2017 at 12:52 PM, Zeolla@GMail.com <ze...@gmail.com> wrote:

> The key should be a solved problem as of METRON-765
> <https://github.com/apache/metron/commit/27b0d6e31de94317b085766349a892395f0d3309>,
> right?  It provides a single key for a given message that is globally stored
> with the message, regardless of where/how.
>
> Jon
>
> On Thu, Jun 22, 2017 at 9:01 AM Justin Leet <ju...@gmail.com> wrote:
>
> > First off, I agree with the characteristics.
> >
> > For the data stores, we'll need to be able to make sure we can actually
> > handle the collapsing of the updates into a single view.  Casey mentioned
> > making the long term stores transparent, but there's potentially work for
> > the near real time stores: we need to make sure we actually do updates,
> > rather than create new docs that aren't linked to the old ones.  This should
> > be entirely transparent and handled by a service layer, rather than
> > anything hardcoded to a datastore.
> >
> > For ES at least, the only way to do this is to retrieve, mutate it, and
> > then reindex (even the updates API does that dance under the hood for you,
> > and since we're potentially doing non trivial changes we might need to
> > manage it ourselves).  This implies the existence of a key, even if one
> > isn't enforced by ES (which I don't believe it will be).  We need to be
> > able to grab the doc(s?) to be updated, not end up with similar ones that
> > shouldn't be mutated.  I assume this is also true (at least the
> > generalities) of Solr as well.
> >
> > In concert with your other thread, couldn't part of this key end up being
> > metadata (either user defined or environment defined)?  For example, in a
> > situation where customer id is applied as metadata, it's possible two
> > customers feed off the same datasource, but may need to mutate
> > independently.  At this point, we have metadata that is effectively keyed.
> > We don't want to update both docs, but there's not a real way to
> > distinguish them.  And maybe that's something we push off for the short
> > term, but it seems potentially nontrivial.
> >
> > In terms of consistency, I'd definitely agree that the long-term storage
> > can be eventually consistent.  Any type of bulk spelunking, Spark jobs,
> > dashboarding, etc. shouldn't need up-to-the-millisecond data.
> >
> > Basically, I'm thinking the real time store is the snapshot of current
> > state, and the long term store is the full record complete with the
> > lineage history.
> >
> > I'm also interested in people's opinions on how we want to manage HDFS.
> > Assuming we do use HBase to store our updates, that means that every HDFS
> > op has to join onto that HBase table to get any updates that HDFS is
> > missing (unless we implement some writeback and merge for HDFS data).  I'm
> > worried that our two datastores are really: ES, HDFS+HBase.  And that
> > keeping that data actually synced to end users is going to be painful.
> >
> > Justin
> >
> >
> > On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <simon@simonellistonball.com> wrote:
> >
> > > I'd say that was an excellent set of requirements (very similar to the
> > > one we arrived at with the last discuss thread on this).
> > >
> > > My vote remains a transaction log in HBase.  Given the relatively low
> > > volume (human scale) I would not expect this to need anything fancy like
> > > compaction into HDFS state, but that does make a good argument for a
> > > long term dataframe solution for Spark, with a short term stop gap using
> > > a joined data frame and SHC.
> > >
> > > Simon
> > >
> > > Sent from my iPhone
> > >
> > > > On 22 Jun 2017, at 05:11, Otto Fowler <ot...@gmail.com>
> wrote:
> > > >
> > > > Can you clarify what data stores are at play here?
> > > >
> > > >
> > > > On June 21, 2017 at 17:07:42, Casey Stella (cestella@gmail.com)
> wrote:
> > > >
> > > > Hi All,
> > > >
> > > > I know we've had a couple of these already, but we're due for another
> > > > discussion of a sensible approach to mutating indexed data. The
> > > motivation
> > > > for this is users will want to update fields to correct and augment
> > data.
> > > > These corrections are invaluable for things like feedback for ML
> models
> > > or
> > > > just plain providing better context when evaluating alerts, etc.
> > > >
> > > > Rather than posing a solution, I'd like to pose the characteristics
> of
> > a
> > > > solution and we can fight about those first. ;)
> > > >
> > > > In my mind, the following are the characteristics that I'd look for:
> > > >
> > > > - Changes should be considered additional or replacement fields for
> > > > existing fields
> > > > - Changes need to be available in the web view in near real time (on
> > the
> > > > order of milliseconds)
> > > > - Changes should be available in the batch view
> > > > - I'd be ok with eventually consistent with the web view, thoughts?
> > > > - Changes should have lineage preserved
> > > > - Current value is the optimized path
> > > > - Lineage search is the less optimized path
> > > > - If HBase is part of a solution
> > > > - maintain a scan-free solution
> > > > - maintain a coprocessor-free solution
> > > >
> > > > Most of what I've thought of is something along the lines:
> > > >
> > > > - Diffs are stored in columns in a HBase row(s)
> > > > - row: GUID:current would have one column with the current
> > > > representation
> > > > - row: GUID:lineage would have an ordered set of columns representing
> > > > the lineage diffs
> > > > - Mutable indices is directly updated (e.g. solr or ES)
> > > > - We'd probably want to provide transparent read support downstream
> > > > which supports merging for batch read:
> > > > - a spark dataframe
> > > > - a hive serde
> > > >
> > > > What I'd like to get out of this discussion is an architecture
> document
> > > > with a suggested approach and the necessary JIRAs to split this up.
> If
> > > > anyone has suggestions or comments about any of this, please speak
> up.
> > > I'd
> > > > like to actually get this done in the near-term. :)
> > > >
> > > > Best,
> > > >
> > > > Casey
> > >
> >
> --
>
> Jon
>

Re: [DISCUSS] Mutation of Indexed Data

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
The key should be a solved problem as of METRON-765
<https://github.com/apache/metron/commit/27b0d6e31de94317b085766349a892395f0d3309>,
right?  It provides a single key for a given message that is globally
stored with the message, regardless of where/how.
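A minimal sketch of what that single global key buys us: the same GUID can address the record in the random-access index (as the document id) and in the update store (as a row-key prefix). The field name "guid" and the row-key shapes below are illustrative assumptions, not the actual Metron schema.

```python
import uuid

def assign_guid(message: dict) -> dict:
    """Attach a globally unique key to a message if it lacks one
    (the field name 'guid' is an assumption for illustration)."""
    message.setdefault("guid", str(uuid.uuid4()))
    return message

def es_doc_id(message: dict) -> str:
    """The same GUID addresses the document in the mutable index..."""
    return message["guid"]

def hbase_row_key(message: dict, suffix: str) -> bytes:
    """...and prefixes the row keys in the update store, e.g. GUID:current."""
    return f"{message['guid']}:{suffix}".encode("utf-8")

msg = assign_guid({"ip_src_addr": "10.0.0.1", "source.type": "bro"})
assert es_doc_id(msg) == msg["guid"]
assert hbase_row_key(msg, "current") == f"{msg['guid']}:current".encode()
```

Because the key travels with the message, neither store needs to know how the other derives it.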

Jon

On Thu, Jun 22, 2017 at 9:01 AM Justin Leet <ju...@gmail.com> wrote:

> First off, I agree with the characteristics.
>
> For the data stores, we'll need to be able to make sure we can actually
> handle the collapsing of the updates into a single view.  Casey mentioned
> making the long term stores transparent, but there's potentially work for
> the near real time stores: we need to make sure we actually do updates,
> rather than create new docs that aren't linked to the old ones. This should
> be entirely transparent and handled by a service layer, rather than
> anything hardcoded to a datastore.
>
> For ES at least, the only way to do this is to retrieve, mutate it, and
> then reindex (even the updates API does that dance under the hood for you,
> and since we're potentially doing non trivial changes we might need to
> manage it ourselves).  This implies the existence of a key, even if one
> isn't enforced by ES (Which I don't believe it will be).  We need to be
> able to grab the doc(s?) to be updated, not end up with similar ones that
> shouldn't be mutated.  I assume this is also true (at least the
> generalities) of Solr as well.
>
> In concert with your other thread, couldn't part of this key end up being
> metadata (either user defined or environment defined)?  For example, in a
> situation where customer id is applied as metadata, it's possible two
> customers feed off the same datasource, but may need to mutate
> independently.  At this point, we have metadata that is effectively keyed.
> We don't want to update both docs, but there's not a real way to
> distinguish them.  And maybe that's something we push off for the short
> term, but it seems potentially nontrivial.
>
> In terms of consistency, I'd definitely agree that the long-term storage
> can be eventually consistent.  Any type of bulk spelunking, Spark jobs,
> dashboarding, etc. shouldn't need up to the millisecond data.
>
> Basically, I'm thinking the real time store is the snapshot of current
> state, and the long term store is the full record complete with the lineage
> history.
>
> I'm also interested in people's opinions on how we want to manage HDFS.
> Assuming we do use HBase to store our updates, that means that every HDFS
> op has to join onto that HBase table to get any updates that HDFS is
> missing (unless we implement some writeback and merge for HDFS data).  I'm
> worried that our two datastores are really: ES, HDFS+HBase.  And that
> keeping that data actually synced to end users is going to be painful.
>
> Justin
>
>
> On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <
> simon@simonellistonball.com> wrote:
>
> > I'd say that was an excellent set of requirements (very similar to the
> one
> > we arrived on with the last discuss thread on this)
> >
> > My vote remains a transaction log in hbase given the relatively low
> volume
> > (human scale) i would not expect this to need anything fancy like
> > compaction into hdfs state, but that does make a good argument for a long
> > term dataframe solution for spark, with a short term stop gap using a
> > joined data frame and shc.
> >
> > Simon
> >
> > Sent from my iPhone
> >
> > > On 22 Jun 2017, at 05:11, Otto Fowler <ot...@gmail.com> wrote:
> > >
> > > Can you clarify what data stores are at play here?
> > >
> > >
> > > On June 21, 2017 at 17:07:42, Casey Stella (cestella@gmail.com) wrote:
> > >
> > > Hi All,
> > >
> > > I know we've had a couple of these already, but we're due for another
> > > discussion of a sensible approach to mutating indexed data. The
> > motivation
> > > for this is users will want to update fields to correct and augment
> data.
> > > These corrections are invaluable for things like feedback for ML models
> > or
> > > just plain providing better context when evaluating alerts, etc.
> > >
> > > Rather than posing a solution, I'd like to pose the characteristics of
> a
> > > solution and we can fight about those first. ;)
> > >
> > > In my mind, the following are the characteristics that I'd look for:
> > >
> > > - Changes should be considered additional or replacement fields for
> > > existing fields
> > > - Changes need to be available in the web view in near real time (on
> the
> > > order of milliseconds)
> > > - Changes should be available in the batch view
> > > - I'd be ok with eventually consistent with the web view, thoughts?
> > > - Changes should have lineage preserved
> > > - Current value is the optimized path
> > > - Lineage search is the less optimized path
> > > - If HBase is part of a solution
> > > - maintain a scan-free solution
> > > - maintain a coprocessor-free solution
> > >
> > > Most of what I've thought of is something along the lines:
> > >
> > > - Diffs are stored in columns in a HBase row(s)
> > > - row: GUID:current would have one column with the current
> > > representation
> > > - row: GUID:lineage would have an ordered set of columns representing
> > > the lineage diffs
> > > - Mutable indices is directly updated (e.g. solr or ES)
> > > - We'd probably want to provide transparent read support downstream
> > > which supports merging for batch read:
> > > - a spark dataframe
> > > - a hive serde
> > >
> > > What I'd like to get out of this discussion is an architecture document
> > > with a suggested approach and the necessary JIRAs to split this up. If
> > > anyone has suggestions or comments about any of this, please speak up.
> > I'd
> > > like to actually get this done in the near-term. :)
> > >
> > > Best,
> > >
> > > Casey
> >
>
-- 

Jon

Re: [DISCUSS] Mutation of Indexed Data

Posted by Justin Leet <ju...@gmail.com>.
First off, I agree with the characteristics.

For the data stores, we'll need to be able to make sure we can actually
handle the collapsing of the updates into a single view.  Casey mentioned
making the long term stores transparent, but there's potentially work for
the near real time stores: we need to make sure we actually do updates,
rather than create new docs that aren't linked to the old ones. This should
be entirely transparent and handled by a service layer, rather than
anything hardcoded to a datastore.

For ES at least, the only way to do this is to retrieve the doc, mutate it,
and then reindex (even the update API does that dance under the hood for you,
and since we're potentially doing non-trivial changes we might need to
manage it ourselves).  This implies the existence of a key, even if one
isn't enforced by ES (which I don't believe it will be).  We need to be
able to grab the doc(s?) to be updated, not end up with similar ones that
shouldn't be mutated.  I assume this is also true (at least the
generalities) of Solr as well.
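The merge step of that retrieve/mutate/reindex dance could look like the sketch below. This is a plain function a service layer might use between the fetch and the write-back, not the real ES or Solr client API; the field names are made up for illustration.

```python
def apply_patch(doc: dict, patch: dict) -> dict:
    """Return a new document with the patch's fields added or replaced.
    Mirrors the retrieve -> mutate -> reindex dance: the caller fetches
    the doc by its key, merges the user's changes, and writes the whole
    doc back under the same key so no unlinked duplicate is created."""
    merged = dict(doc)
    merged.update(patch)
    return merged

original = {"guid": "abc", "ip_src_addr": "10.0.0.1", "is_alert": False}
patched = apply_patch(original, {"is_alert": True, "analyst_note": "confirmed"})
assert patched["is_alert"] is True
assert patched["ip_src_addr"] == "10.0.0.1"
assert original["is_alert"] is False  # the fetched copy is not mutated in place
```

Keeping this merge in one service-layer function is what makes it transparent to callers, whichever index backs it.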

In concert with your other thread, couldn't part of this key end up being
metadata (either user defined or environment defined)?  For example, in a
situation where customer id is applied as metadata, it's possible two
customers feed off the same datasource, but may need to mutate
independently.  At this point, we have metadata that is effectively keyed.
We don't want to update both docs, but there's not a real way to
distinguish them.  And maybe that's something we push off for the short
term, but it seems potentially nontrivial.

In terms of consistency, I'd definitely agree that the long-term storage
can be eventually consistent.  Any type of bulk spelunking, Spark jobs,
dashboarding, etc. shouldn't need up to the millisecond data.

Basically, I'm thinking the real time store is the snapshot of current
state, and the long term store is the full record complete with the lineage
history.
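One way to picture that split: the long-term record keeps the original message plus an ordered list of diffs, and folding the diffs back onto the original must reproduce the real-time snapshot. A toy sketch (the record shapes are assumptions, not a proposed schema):

```python
def replay(base: dict, diffs: list) -> dict:
    """Fold an ordered list of field-level diffs onto the original record.
    The final state should equal the real-time store's current snapshot."""
    state = dict(base)
    for diff in diffs:
        state.update(diff)
    return state

base = {"guid": "abc", "severity": "low"}
lineage = [{"severity": "high"}, {"severity": "medium", "analyst": "jdoe"}]
snapshot = {"guid": "abc", "severity": "medium", "analyst": "jdoe"}
assert replay(base, lineage) == snapshot
```

That equality is also a handy consistency check between the two stores once the long-term side has caught up.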

I'm also interested in people's opinions on how we want to manage HDFS.
Assuming we do use HBase to store our updates, that means that every HDFS
op has to join onto that HBase table to get any updates that HDFS is
missing (unless we implement some writeback and merge for HDFS data).  I'm
worried that our two datastores are really: ES, HDFS+HBase.  And that
keeping that data actually synced to end users is going to be painful.
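The join in question is essentially a left join of the immutable HDFS batch onto the HBase patch table by GUID. A pure-Python sketch of the logic (a real implementation would be a Spark join, e.g. via SHC; the shapes here are illustrative):

```python
def merge_batch(hdfs_records: list, hbase_patches: dict) -> list:
    """Left-join the immutable HDFS batch onto a map of GUID -> patch,
    applying a patch where one exists. This is the merge every batch
    reader would have to do unless updates are written back to HDFS."""
    out = []
    for rec in hdfs_records:
        patch = hbase_patches.get(rec["guid"])
        out.append({**rec, **patch} if patch else dict(rec))
    return out

records = [{"guid": "a", "v": 1}, {"guid": "b", "v": 2}]
patches = {"b": {"v": 20, "note": "corrected"}}
merged = merge_batch(records, patches)
assert merged[0] == {"guid": "a", "v": 1}
assert merged[1] == {"guid": "b", "v": 20, "note": "corrected"}
```

Hiding this behind a DataFrame or SerDe is what would keep the HDFS+HBase pairing from leaking to end users.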

Justin


On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <
simon@simonellistonball.com> wrote:

> I'd say that was an excellent set of requirements (very similar to the one
> we arrived on with the last discuss thread on this)
>
> My vote remains a transaction log in hbase given the relatively low volume
> (human scale) i would not expect this to need anything fancy like
> compaction into hdfs state, but that does make a good argument for a long
> term dataframe solution for spark, with a short term stop gap using a
> joined data frame and shc.
>
> Simon
>
> Sent from my iPhone
>
> > On 22 Jun 2017, at 05:11, Otto Fowler <ot...@gmail.com> wrote:
> >
> > Can you clarify what data stores are at play here?
> >
> >
> > On June 21, 2017 at 17:07:42, Casey Stella (cestella@gmail.com) wrote:
> >
> > Hi All,
> >
> > I know we've had a couple of these already, but we're due for another
> > discussion of a sensible approach to mutating indexed data. The
> motivation
> > for this is users will want to update fields to correct and augment data.
> > These corrections are invaluable for things like feedback for ML models
> or
> > just plain providing better context when evaluating alerts, etc.
> >
> > Rather than posing a solution, I'd like to pose the characteristics of a
> > solution and we can fight about those first. ;)
> >
> > In my mind, the following are the characteristics that I'd look for:
> >
> > - Changes should be considered additional or replacement fields for
> > existing fields
> > - Changes need to be available in the web view in near real time (on the
> > order of milliseconds)
> > - Changes should be available in the batch view
> > - I'd be ok with eventually consistent with the web view, thoughts?
> > - Changes should have lineage preserved
> > - Current value is the optimized path
> > - Lineage search is the less optimized path
> > - If HBase is part of a solution
> > - maintain a scan-free solution
> > - maintain a coprocessor-free solution
> >
> > Most of what I've thought of is something along the lines:
> >
> > - Diffs are stored in columns in a HBase row(s)
> > - row: GUID:current would have one column with the current
> > representation
> > - row: GUID:lineage would have an ordered set of columns representing
> > the lineage diffs
> > - Mutable indices is directly updated (e.g. solr or ES)
> > - We'd probably want to provide transparent read support downstream
> > which supports merging for batch read:
> > - a spark dataframe
> > - a hive serde
> >
> > What I'd like to get out of this discussion is an architecture document
> > with a suggested approach and the necessary JIRAs to split this up. If
> > anyone has suggestions or comments about any of this, please speak up.
> I'd
> > like to actually get this done in the near-term. :)
> >
> > Best,
> >
> > Casey
>

Re: [DISCUSS] Mutation of Indexed Data

Posted by Simon Elliston Ball <si...@simonellistonball.com>.
I'd say that was an excellent set of requirements (very similar to the one we arrived at with the last discuss thread on this).

My vote remains a transaction log in HBase. Given the relatively low volume (human scale), I would not expect this to need anything fancy like compaction into HDFS state, but that does make a good argument for a long-term DataFrame solution for Spark, with a short-term stopgap using a joined DataFrame and SHC.

Simon 

Sent from my iPhone

> On 22 Jun 2017, at 05:11, Otto Fowler <ot...@gmail.com> wrote:
> 
> Can you clarify what data stores are at play here?
> 
> 
> On June 21, 2017 at 17:07:42, Casey Stella (cestella@gmail.com) wrote:
> 
> Hi All,
> 
> I know we've had a couple of these already, but we're due for another
> discussion of a sensible approach to mutating indexed data. The motivation
> for this is users will want to update fields to correct and augment data.
> These corrections are invaluable for things like feedback for ML models or
> just plain providing better context when evaluating alerts, etc.
> 
> Rather than posing a solution, I'd like to pose the characteristics of a
> solution and we can fight about those first. ;)
> 
> In my mind, the following are the characteristics that I'd look for:
> 
> - Changes should be considered additional or replacement fields for
> existing fields
> - Changes need to be available in the web view in near real time (on the
> order of milliseconds)
> - Changes should be available in the batch view
> - I'd be ok with eventually consistent with the web view, thoughts?
> - Changes should have lineage preserved
> - Current value is the optimized path
> - Lineage search is the less optimized path
> - If HBase is part of a solution
> - maintain a scan-free solution
> - maintain a coprocessor-free solution
> 
> Most of what I've thought of is something along the lines:
> 
> - Diffs are stored in columns in a HBase row(s)
> - row: GUID:current would have one column with the current
> representation
> - row: GUID:lineage would have an ordered set of columns representing
> the lineage diffs
> - Mutable indices is directly updated (e.g. solr or ES)
> - We'd probably want to provide transparent read support downstream
> which supports merging for batch read:
> - a spark dataframe
> - a hive serde
> 
> What I'd like to get out of this discussion is an architecture document
> with a suggested approach and the necessary JIRAs to split this up. If
> anyone has suggestions or comments about any of this, please speak up. I'd
> like to actually get this done in the near-term. :)
> 
> Best,
> 
> Casey

Re: [DISCUSS] Mutation of Indexed Data

Posted by Otto Fowler <ot...@gmail.com>.
Can you clarify what data stores are at play here?


On June 21, 2017 at 17:07:42, Casey Stella (cestella@gmail.com) wrote:

Hi All,

I know we've had a couple of these already, but we're due for another
discussion of a sensible approach to mutating indexed data. The motivation
for this is users will want to update fields to correct and augment data.
These corrections are invaluable for things like feedback for ML models or
just plain providing better context when evaluating alerts, etc.

Rather than posing a solution, I'd like to pose the characteristics of a
solution and we can fight about those first. ;)

In my mind, the following are the characteristics that I'd look for:

   - Changes should be considered additional or replacement fields for
   existing fields
   - Changes need to be available in the web view in near real time (on the
   order of milliseconds)
   - Changes should be available in the batch view
      - I'd be ok with eventually consistent with the web view, thoughts?
   - Changes should have lineage preserved
      - Current value is the optimized path
      - Lineage search is the less optimized path
   - If HBase is part of a solution
      - maintain a scan-free solution
      - maintain a coprocessor-free solution

Most of what I've thought of is something along the lines:

   - Diffs are stored in columns in a HBase row(s)
      - row: GUID:current would have one column with the current
      representation
      - row: GUID:lineage would have an ordered set of columns representing
      the lineage diffs
   - Mutable indices is directly updated (e.g. solr or ES)
   - We'd probably want to provide transparent read support downstream
   which supports merging for batch read:
      - a spark dataframe
      - a hive serde

What I'd like to get out of this discussion is an architecture document
with a suggested approach and the necessary JIRAs to split this up. If
anyone has suggestions or comments about any of this, please speak up. I'd
like to actually get this done in the near-term. :)

Best,

Casey
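The GUID:current / GUID:lineage row layout Casey proposes can be modeled in a few lines. This is a toy in-memory stand-in, not HBase client code; qualifier names and counter-based ordering are assumptions made for illustration. The point it demonstrates is that both reads are point lookups by row key, satisfying the scan-free requirement.

```python
class UpdateStore:
    """Toy in-memory stand-in for the proposed HBase layout: one row per
    GUID holding the current representation, one holding the ordered
    lineage diffs. Both reads are single-row gets, so no scans needed."""

    def __init__(self):
        self.rows = {}

    def record_update(self, guid: str, current: dict, diff: dict):
        self.rows[f"{guid}:current"] = {"doc": current}
        lineage = self.rows.setdefault(f"{guid}:lineage", {})
        # Zero-padded counter keeps column qualifiers sort-ordered.
        lineage[f"diff:{len(lineage):08d}"] = diff

    def get_current(self, guid: str) -> dict:
        return self.rows[f"{guid}:current"]["doc"]  # the optimized path

    def get_lineage(self, guid: str) -> list:
        row = self.rows.get(f"{guid}:lineage", {})
        return [row[q] for q in sorted(row)]  # the less optimized path

store = UpdateStore()
store.record_update("abc", {"severity": "high"}, {"severity": "high"})
store.record_update("abc", {"severity": "low"}, {"severity": "low"})
assert store.get_current("abc") == {"severity": "low"}
assert store.get_lineage("abc") == [{"severity": "high"}, {"severity": "low"}]
```

In real HBase the same idea would use column qualifiers (e.g. timestamp-ordered) within the lineage row, keeping both access patterns coprocessor-free as well.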