Posted to dev@metron.apache.org by Casey Stella <ce...@gmail.com> on 2017/06/21 20:24:21 UTC

[DISCUSS] Metadata Ingest

Hi All,

I wanted to call attention to a JIRA (METRON-1001) that I just submitted
and possibly discuss it more broadly than on the PR.

Currently, we only ingest data in Metron. Often, there is valuable metadata
constructed up-stream of Metron that is relevant to enrichment and
cross-cuts many data formats. Take, for instance, a multi-tenancy case
where multiple sources come in and you'd like to tag the data with the
customer ID. In this case you're stuck finding ways to add the metadata to
each data source's format. Rather than do that, we should allow metadata to
be ingested along with the data associated with it.

In my mind, there are two sources of metadata relevant to support:

   - User-defined metadata (e.g. customer IDs)
   - Environmental metadata (e.g. the actual Kafka topic in the case of a
   wildcard topic)

I propose the following:

   - The parsers allow metadata to be exposed as Stellar variables for use
   in field transformations
   - We use the Kafka key to pass user-defined metadata in the form of a
   JSON map
   - We expose the Kafka topic as metadata
   - We allow metadata handling to be turned on or off
   - We allow merging metadata with data to be turned on or off when
   metadata handling is on
   - This should be entirely backwards compatible, so parsers do not need to
   change.
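
To make the list above a bit more concrete, here is a minimal sketch in
Python of how those pieces could fit together. It is only an illustration
under stated assumptions, not what the PR implements: the function names,
the read_metadata and merge_metadata flags, the "metadata." prefix, and the
rule that parsed data wins over metadata on a name clash are all
hypothetical.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Hypothetical prefix used to namespace metadata fields (an assumption).
METADATA_PREFIX = "metadata."


def extract_metadata(key: Optional[bytes], topic: str) -> Dict[str, Any]:
    """Build a metadata map from the Kafka key (a JSON map) and the topic."""
    metadata: Dict[str, Any] = {}
    if key:
        try:
            parsed = json.loads(key.decode("utf-8"))
        except (UnicodeDecodeError, ValueError):
            parsed = None  # a key that is not valid JSON carries no metadata
        if isinstance(parsed, dict):
            # User-defined metadata, e.g. {"customer_id": "acme"}.
            for name, value in parsed.items():
                metadata[METADATA_PREFIX + name] = value
    # Environmental metadata: the concrete topic the message arrived on.
    # Set last, so it overrides a user-supplied "topic" entry.
    metadata[METADATA_PREFIX + "topic"] = topic
    return metadata


def handle_message(
    message: Dict[str, Any],
    key: Optional[bytes],
    topic: str,
    read_metadata: bool = False,
    merge_metadata: bool = False,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (output message, metadata visible to field transformations)."""
    if not read_metadata:
        # Metadata handling is off: behave exactly as before.
        return dict(message), {}
    metadata = extract_metadata(key, topic)
    if not merge_metadata:
        # Metadata is available as variables but is not written to the message.
        return dict(message), metadata
    # Merge, letting parsed data win on a name clash (an assumption).
    return {**metadata, **message}, metadata
```

With both flags left at their defaults the message passes through
untouched, which is the backwards-compatible case. With both on, a key of
{"customer_id": "acme"} arriving on some topic adds metadata.customer_id
and metadata.topic to the parsed message without the parser knowing
anything about either.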

I've coded up a reference implementation located at
https://github.com/apache/metron/pull/621 which I will be hacking on in
reaction to this discussion.

Thoughts?

Re: [DISCUSS] Metadata Ingest

Posted by Casey Stella <ce...@gmail.com>.
I resisted replying to this until I had some proper documentation written
for the PR and a test plan (see PR for test plan with worked examples).
Hopefully between the two, the motivation will be made a bit clearer.

You can find that documentation at
https://github.com/cestella/incubator-metron/blob/6ebfcc05cac0a41d22d06b633007f46012201d25/metron-platform/metron-parsers/README.md#metadata

This should get us started and we can move along from there if we want to
refine it.  Regarding what you said, Simon, about NiFi, it's very much the
aim of this to fit hand-in-glove with systems like NiFi (keeping metadata
and data separate and being able to pick up that handoff).

I hope this is a bit clearer now.

On Wed, Jun 21, 2017 at 11:50 PM, Simon Elliston Ball <
simon@simonellistonball.com> wrote:

> I really like this idea. A good use case I imagine would be to have
> something like asa data, tagged with some custom meta data (e.g. Tenant ID
> in a multi-tenant install) but not have to mess with the actual parser. To
> that extent it makes sense to expose said meta data via stellar so users
> can decide how to incorporate it into a metron object.
>
> That said, I think we should lay down some principles or conventions on
> the expected form of meta data, as we do with the main data fields to
> ensure some consistency across implementations, or at least get people
> started.
>
> I also think we should add the functionality to our reference application
> docs showing how to set meta data in keys in NiFi. This approach certainly
> complements the way NiFi tags and thinks about meta data well, and that
> would be worth highlighting in the example.
>
> Simon

Re: [DISCUSS] Metadata Ingest

Posted by Simon Elliston Ball <si...@simonellistonball.com>.
I really like this idea. A good use case, I imagine, would be to have something like ASA data tagged with some custom metadata (e.g. a tenant ID in a multi-tenant install) without having to mess with the actual parser. To that extent it makes sense to expose said metadata via Stellar so users can decide how to incorporate it into a Metron object.

That said, I think we should lay down some principles or conventions on the expected form of metadata, as we do with the main data fields, to ensure some consistency across implementations, or at least to get people started.

I also think we should add this functionality to our reference application docs, showing how to set metadata in keys in NiFi. This approach certainly complements the way NiFi tags and thinks about metadata, and that would be worth highlighting in the example.

Simon  

> On 22 Jun 2017, at 04:25, Otto Fowler <ot...@gmail.com> wrote:
> 
> First:  Thanks Casey.
> 
> I submitted a review in the PR, that I will not duplicate here.
> 
> I would say however the following:
> 
> - I would like to understand the problem we are trying to solve with this
> more.  This seems like a good idea, and a capability we obviously can
> imagine how to implement, but there are things we need to think through.
> 
> - While adding metadata “in context” is correct ( the kafka topic to the
> parser is in context ), I would like to talk about if some of this activity
> is more enrichment than not, and should be handled/exposed there, where we
> have the splitter/joiner pattern already.
> 
> - Other than exposing the metadata, I am not sure I understand the
> difference between this and just adding fields as you currently can.
> 

Re: [DISCUSS] Metadata Ingest

Posted by Otto Fowler <ot...@gmail.com>.
First:  Thanks Casey.

I submitted a review on the PR, which I will not duplicate here.

I would, however, say the following:

- I would like to better understand the problem we are trying to solve with
this.  This seems like a good idea, and a capability we can obviously
imagine how to implement, but there are things we need to think through.

- While adding metadata “in context” is correct (the Kafka topic is in
context for the parser), I would like to talk about whether some of this
activity is more enrichment than not, and should be handled/exposed there,
where we already have the splitter/joiner pattern.

- Other than exposing the metadata, I am not sure I understand the
difference between this and just adding fields as you currently can.


