Posted to dev@metron.apache.org by Nick Allen <ni...@nickallen.org> on 2017/05/02 13:24:35 UTC

Re: Normalization topology or separate normalization bolt for parsing topology

Before worrying about how to ingest this 'noisy' data, I would want to
better understand root cause.  If you cannot even get a valid date format,
are you sure the data can be trusted?

Rather than bending over backwards to try to ingest it, I would first make
sure the telemetry is not totally bogus to begin with.  Maybe it is better
that the data is dropped in cases like this.

IMHO, that is how I would tackle a problem like this.  Not all data can be
trusted.







On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian <al...@gmail.com> wrote:

> Are you sure? The syslog_host name is way more complicated than something
> that can be a coincidence. I need to double check with one of the security
> device experts, but I thought it was some kind of noise.
>
> Yes, we do have more use cases that seem to be corrupted. For example,
> having duplicate IP addresses or a corrupted date format. Please have a look
> at the following message. At least I am sure the date format is corrupted
> in this one.
>
> <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP connection
> 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to inside:*y.y.y.y/p2*
> *y.y.y.y/p2*
>
> Cheers,
> Ali
>
> On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <
> simon@simonellistonball.com> wrote:
>
> > In that instance, you're looking at valid syslog which should be parsed as
> > such. The repeated host is not really a host in syslog terms; it's an
> > application name header which happens to be the same. This is definitely a
> > parser bug which should be handled, especially since the header is
> > perfectly RFC compliant.
> >
> > Do you have any other such cases? My view is that parsers should be
> > written to be more tolerant in any case, so they should extract all the
> > fields they can from malformed logs rather than throwing exceptions, but
> > that's more about the way we write parsers than having some kind of
> > pre-clean.
> >
> > Simon
> >
> > Sent from my iPad
> >
> > > On 27 Apr 2017, at 08:04, Ali Nazemian <al...@gmail.com> wrote:
> > >
> > > I do agree there is a fair amount of overhead in using another bolt for
> > > this purpose. I am not prescribing an implementation. There might be a
> > > way to separate the two extension points without adding overhead; I
> > > haven't thought about it yet. However, the main issue is that sometimes
> > > the noise is something that generates an exception on the parsing side.
> > > For example, have a look at the following log:
> > >
> > > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP
> > > connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0
> > > (ryanmar)
> > >
> > > Clearly the duplicate syslog_host throws an exception on parsing, so how
> > > are we going to deal with that in a post-parse transformation? It cannot
> > > get past the parser. This is only a single example of the cases that might
> > > affect the production data. That is, unless a Stellar transformation is
> > > something that can be done pre-parse and on the entire message.
> > >
> > >
> > > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
> > > simon@simonellistonball.com> wrote:
> > >
> > >> Ali,
> > >>
> > >> Sounds very much like what you’re talking about when you say
> > >> normalization, and what I would understand it as, is the process
> > >> fulfilled by Stellar field transformations in the parser config. Agreed
> > >> that some of these will be general, based on a common Metron standard
> > >> schema, but others will be organisation specific (custom fields
> > >> overloaded with different meanings in CEF, for example). These are very
> > >> much one of the reasons we have the Stellar transformation step. I don’t
> > >> think that should be moved to a separate bolt, to be honest, because
> > >> that comes with a fair amount of overhead, but logically it is in the
> > >> parser config rather than the parser, so it seems to serve this purpose
> > >> in the post-parse transform, no?
> > >>
> > >> Simon
> > >>
> > >>
> > >>
> > >>> On 27 Apr 2017, at 02:08, Ali Nazemian <al...@gmail.com> wrote:
> > >>>
> > >>> Hi Simon,
> > >>>
> > >>> The reason I am asking for a specific normalisation step is that
> > >>> normalisation is not a general use case which can be used by other
> > >>> users. It is completely bound to our application. The way we have fixed
> > >>> it, for now, is to add a normalisation step to the parser and clean the
> > >>> incoming data so the parser step can work on it, but I don't like it.
> > >>> There is no point in creating a parser that can handle all of the
> > >>> possible noise that can exist in production data. Even if it is possible
> > >>> to predict every kind of noise in production data, there is no point in
> > >>> the Metron community focusing on building a general purpose parser for a
> > >>> specific device when they could spend that time on developing a cool
> > >>> feature. Even if it is possible to predict the noise and it is
> > >>> acceptable for the community to spend their time on creating that kind
> > >>> of parser, why would every Metron user need that extra normalisation? A
> > >>> user's data might be clean from the first step, and then it obviously
> > >>> only decreases the total throughput without any use for that specific
> > >>> user.
> > >>>
> > >>> Imagine there is an additional bolt for normalisation and there is a
> > >>> mechanism to customise the normalisation without changing the general
> > >>> parser for a specific device. We can have a general parser as a common
> > >>> parser for that device and leave the normalisation development to
> > >>> users. However, it is very important to make the normalisation step as
> > >>> fast as possible.
> > >>>
> > >>> Cheers,
> > >>> Ali
> > >>>
> > >>> On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <ce...@gmail.com> wrote:
> > >>>
> > >>>> Yeah, we definitely don't want to rewrite parsing in Stellar.  I
> > >>>> would expect the job of the parser, however, to be handling structural
> > >>>> issues.  In my mind, parsing is about transforming structures into
> > >>>> fields, and the role of the field transformations is to transform
> > >>>> values.  There's obvious overlap there, wherein parsers may do some
> > >>>> normalizations/transformations (e.g. look at how grok handles
> > >>>> timestamps), but it almost always gets us into trouble when parsers do
> > >>>> even moderately complex value transformations.
> > >>>>
> > >>>> As I type this, though, I think I see your point.  What you really
> > >>>> want is to chain parsers: have a pre-parser to bring you 80% of the way
> > >>>> there and hammer out all the structural issues so you might be able to
> > >>>> use a more generic parser down the chain.  I have often thought that
> > >>>> maybe we should expose parsers as Stellar functions which take raw data
> > >>>> and emit whole messages.  This would allow us to compose parsers, so
> > >>>> imagine the above example where you've written a Stellar function to
> > >>>> normalize the input and you're then passing it to a CSV parser: you
> > >>>> could run "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise
> > >>>> specify a parser.
> > >>>>
> > >>>> As for speed, the Stellar expression would get compiled into a Java
> > >>>> object, so there shouldn't be appreciable overhead since we no longer
> > >>>> lex and parse for every message.
> > >>>>
> > >>>> Is this kinda how you were seeing it?
> > >>>>
> > >>>> On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <
> > >>>> simon@simonellistonball.com> wrote:
> > >>>>
> > >>>>> The challenge there, I suspect, is going to be that you essentially
> > >>>>> end up with the actual parser doing very little of value, and then
> > >>>>> effectively trying to write a parser in Stellar against a few broad
> > >>>>> strings, which would likely give you all sorts of performance
> > >>>>> problems.
> > >>>>>
> > >>>>> One solution is to write a very defensive and flexible parser, but
> > >>>>> that would tend to be time consuming.
> > >>>>>
> > >>>>> There is also something to be said for doing some basic transformation
> > >>>>> before the parser's Kafka topic, in something like NiFi, but again,
> > >>>>> performance can be an issue there.
> > >>>>>
> > >>>>> If the noise is about broken structure, for example, maybe a simple
> > >>>>> pre-process step as part of your parser would make sense, e.g.
> > >>>>> stripping syslog headers, character set conversion, or removing very
> > >>>>> broken bits as part of the parse method.
> > >>>>>
> > >>>>> In terms of normalisation post-parse, I agree, that is 100% a job for
> > >>>>> Stellar and the fieldTransformations capability. Something I would
> > >>>>> like to see would be a means to use that transformation step to map to
> > >>>>> a well known (though loosely enforced) schema provided by a governance
> > >>>>> framework, but that is a much bigger topic of conversation.
> > >>>>>
> > >>>>> Note of course that not everything has to be parsed just because it’s
> > >>>>> in the message. A relatively loose fitting parser which pulls out the
> > >>>>> relevant data for the use case would be fine, and likely a lot more
> > >>>>> tolerant of noise than something that felt the need for every field.
> > >>>>> We do after all store the original_string for you if you really
> > >>>>> absolutely have to have everything, so a more schema-on-read
> > >>>>> philosophy certainly applies and will likely side-step a lot of your
> > >>>>> issues.
> > >>>>>
> > >>>>> Simon
> > >>>>>
> > >>>>>> On 26 Apr 2017, at 14:37, Casey Stella <ce...@gmail.com> wrote:
> > >>>>>>
> > >>>>>> Ok, that's another story.  Hmmmm, we don't generally pre-parse
> > >>>>>> because we try to not assume any particular format there (i.e. it
> > >>>>>> could be strings, could be byte arrays).  Maybe the right answer is
> > >>>>>> to pass the raw, non-normalized data (best effort type of thing)
> > >>>>>> through the parser and do the normalization post-parse... or is there
> > >>>>>> a problem with that?
> > >>>>>>
> > >>>>>> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <alinazemian@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Hi Casey,
> > >>>>>>>
> > >>>>>>> It is actually a pre-parse process, not a post-parse one. These
> > >>>>>>> types of noise affect the position of an attribute, for example, and
> > >>>>>>> give us a parsing exception. The timestamp example was not a good one
> > >>>>>>> because that is actually a post-parse exception.
> > >>>>>>>
> > >>>>>>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <cestella@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> So, further transformation post-parse was one of the motivating
> > >>>>>>>> reasons for Stellar.  Is there a capability that it's lacking that
> > >>>>>>>> we can add to fit your use case?
> > >>>>>>>>
> > >>>>>>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <alinazemian@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> I've created a Jira ticket regarding this feature.
> > >>>>>>>>>
> > >>>>>>>>> https://issues.apache.org/jira/browse/METRON-893
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <alinazemian@gmail.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Currently, we are using plain regexes in the Java source code to
> > >>>>>>>>>> handle those situations. However, it would be nice to have a
> > >>>>>>>>>> separate bolt and deal with them separately. Yeah, I can create a
> > >>>>>>>>>> Jira issue regarding that. The main reason I am asking for such a
> > >>>>>>>>>> feature is that the lack of it makes the process of creating a
> > >>>>>>>>>> parser for the community a little painful for us. We need to
> > >>>>>>>>>> maintain two different versions, one for the community and another
> > >>>>>>>>>> for the internal use case. Clearly, noise is an inevitable part of
> > >>>>>>>>>> real world use cases.
> > >>>>>>>>>>
> > >>>>>>>>>> Cheers,
> > >>>>>>>>>> Ali
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <ottobackwards@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Are you doing this cleansing all in the parser or are you using
> > >>>>>>>>>>> any Stellar to do it?
> > >>>>>>>>>>> Can you create a Jira?
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian (alinazemian@gmail.com) wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi all,
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> We are facing certain use cases in Metron production that are
> > >>>>>>>>>>> related to noisy streams: for example, a wrong timestamp, a
> > >>>>>>>>>>> duplicate hostname/IP address, etc. To deal with the
> > >>>>>>>>>>> normalization we have added an additional step to the
> > >>>>>>>>>>> corresponding parsers to do the data cleaning. Clearly, parsing
> > >>>>>>>>>>> is a standard step which mostly depends on the device that is
> > >>>>>>>>>>> generating the data and can be used for the same type of device
> > >>>>>>>>>>> everywhere, but normalization is very production dependent and
> > >>>>>>>>>>> there is no point in mixing normalization with parsing. It would
> > >>>>>>>>>>> be nice to have a separate bolt in the parsing topologies
> > >>>>>>>>>>> dedicated to the production-related cleaning process. In that
> > >>>>>>>>>>> case, everybody can easily contribute additional parsers to the
> > >>>>>>>>>>> Metron community without being worried about mixing parsers and
> > >>>>>>>>>>> the data cleaning process.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regards,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Ali
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> A.Nazemian
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> A.Nazemian
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> A.Nazemian
> > >>>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> A.Nazemian
> > >>
> > >>
> > >
> > >
> > > --
> > > A.Nazemian
> >
>
>
>
> --
> A.Nazemian
>

Re: Normalization topology or separate normalization bolt for parsing topology

Posted by Nick Allen <ni...@nickallen.org>.
> Clearly, a generic parser would be useful for the community not a type of
> parser that is highly customised for our noisy environment.

Increasing the number of generic parsers for the community is definitely a
good goal.  I agree with you there.

Could we achieve the same goal by making our parsers more configurable?  As
a simple example, maybe a user could configure particular fields to be
either required or optional.

   - For my use of Parser X, I am going to configure the "timestamp" field
   to be "required".  I want the parser to fail the message if the timestamp
   field is invalid.


   - But when you are using Parser X, you would configure the "timestamp"
   field as "optional".  When a malformed timestamp arrives, it ignores the
   timestamp (maybe stamps its own valid timestamp) and allows the message to
   continue on.

In ways like this we can provide some flexibility to users of Parser X to
achieve the very important goal that you outlined, but without an
architectural change.
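
To make the required/optional idea concrete, here is a rough sketch in Python rather than Metron's Java. It is only an illustration of the behaviour described above, not an existing Metron option: FieldPolicy, InvalidMessage, validate_syslog_timestamp and apply_field_policies are all hypothetical names made up for this example.

```python
import re
from enum import Enum


class FieldPolicy(Enum):
    """Hypothetical per-field setting a user of Parser X could configure."""
    REQUIRED = "required"
    OPTIONAL = "optional"


class InvalidMessage(Exception):
    """Raised when a field configured as required is missing or invalid."""


# Syslog-style timestamp without a year, e.g. "Apr 12 03:19:12".
_SYSLOG_TS = re.compile(r"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}")


def validate_syslog_timestamp(raw):
    """Raise ValueError if raw does not look like a syslog timestamp."""
    if not _SYSLOG_TS.fullmatch(raw):
        raise ValueError("malformed timestamp: %r" % raw)


def apply_field_policies(fields, policies, validators, defaults=None):
    """Check each configured field of an already-split message.

    required field + bad value -> the whole message fails
    optional field + bad value -> the value is dropped, or replaced by a
                                  default if one is supplied, and the
                                  message continues on
    """
    defaults = defaults or {}
    result = dict(fields)
    for name, policy in policies.items():
        try:
            validators[name](fields[name])  # KeyError if the field is missing
        except (KeyError, ValueError) as err:
            if policy is FieldPolicy.REQUIRED:
                raise InvalidMessage("required field %r: %s" % (name, err))
            result.pop(name, None)
            if name in defaults:
                result[name] = defaults[name]()
    return result
```

With the corrupted "Jan 22:42:12" timestamp from earlier in the thread, a "required" policy would fail the message, while an "optional" policy would let it through with the timestamp dropped or replaced by whatever default the user supplies.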




On May 2, 2017 9:05 PM, "Ali Nazemian" <al...@gmail.com> wrote:

Hi Nick,

I am happy to continue the development using the current architecture and
embed the pre-parsing steps in the parser code. However, this would be
against the policy to have a contribution to Metron community to expand the
range of supported devices. Clearly, a generic parser would be useful for
the community not a type of parser that is highly customised for our noisy
environment. I was looking for decoupling Parsing and Normalisation to
implement a generic parser which can be used by others as well.

I think this is more a type of strategic decision which can increase the
number of generic parsers that will be contributed back to the community in
future. Ideally, it would be better that official Metron developers focus
on Metron features instead of developing generic parsers.

Thanks,
Ali

On Wed, May 3, 2017 at 3:03 AM, Nick Allen <ni...@nickallen.org> wrote:

> Yes, and currently that normalization step is the Parsers.
>
> I am not saying the message has to be entirely clear and well-defined. But
> there is a minimum set of expectations that you must have of any data that
> you're ingesting.   Once it meets that "minimum set", the parser should be
> able to ingest and normalize the message.  Any oddities beyond that
> "minimum set" can be handled with Stellar either post-Parsing or in
> Enrichment.
>
> It is, of course, a judgement call as to what that minimum set is for you.
> You would just need a Parser that matches your definition of "minimum set".
>
> My main point here is that I am not seeing a need to re-architect
> anything.  I think we have the right tools, IMHO.
>
>
>
>
>
>
>
>
>
> On Tue, May 2, 2017 at 10:33 AM, Ali Nazemian <al...@gmail.com> wrote:
>
> > Hi Nick,
> >
> > The date could be corrupted for any reason, and sometimes we don't have
> > any control over the device. Obviously, it is not a big deal if we lose a
> > <166> severity message, but it could be a different situation for a <161>
> > severity message or an actual critical threat. However, I mentioned those
> > defects as an example to point out the importance of having a
> > normalisation step in the Metron processing chain.
> >
> > I still think there is no guarantee of an entirely clear and
> > well-defined message in real world use cases. If we recognise this
> > situation as a problem, then finding a high performance and flexible
> > solution is not very hard.
> >
> > Cheers,
> > Ali
> >
> > On Tue, May 2, 2017 at 11:24 PM, Nick Allen <ni...@nickallen.org> wrote:
> >
> > > Before worrying about how to ingest this 'noisy' data, I would want to
> > > better understand root cause.  If you cannot even get a valid date
> > format,
> > > are you sure the data can be trusted?
> > >
> > > Rather than bending over backwards to try to ingest it, I would first
> > make
> > > sure the telemetry is not totally bogus to begin with.  Maybe it is
> > better
> > > that the data is dropped in cases like this.
> > >
> > > IMHO, that is how I would tackle a problem like this.  Not all data
can
> > be
> > > trusted.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian <al...@gmail.com>
> > > wrote:
> > >
> > > > Are you sure? The syslog_host name is way more complicated than
> > something
> > > > that can be a coincidence. I need to double check with one of the
> > > security
> > > > device experts, but I thought it is some kind of noises.
> > > >
> > > > Yes, we do have more use cases that seem to be corrupted. For
> example,
> > > > having duplicate IP addresses or corrupted date format. Please have
a
> > > look
> > > > at the following message. At least I am sure the date format is
> > corrupted
> > > > in this one.
> > > >
> > > > <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP
> > > connection
> > > > 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to
> inside:*y.y.y.y/p2*
> > > > *y.y.y.y/p2*
> > > >
> > > > Cheers,
> > > > Ali
> > > >
> > > > On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <
> > > > simon@simonellistonball.com> wrote:
> > > >
> > > > > Is that instance, you're looking at valid syslog which should be
> > parsed
> > > > as
> > > > > such. The repeat host is not really a host in syslog terms, it's
an
> > > > > application name header which happens to be the same. This is
> > > definitely
> > > > a
> > > > > parser bug which should be handled, esp since the header is
> perfectly
> > > RFC
> > > > > compliant.
> > > > >
> > > > > Do you have any other such cases? My view is that parsers should
be
> > > > > written with more any case, so should extract all the fields they
> can
> > > > from
> > > > > malformed logs, rather than throwing exceptions, but that's more
> > about
> > > > the
> > > > > way we write parsers than having some kind of pre-clean.
> > > > >
> > > > > Simon
> > > > >
> > > > > Sent from my iPad
> > > > >
> > > > > > On 27 Apr 2017, at 08:04, Ali Nazemian <al...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > I do agree there is a fair amount of overhead for using another
> > bolt
> > > > for
> > > > > > this purpose. I am not pointing to the way of implementation. It
> > > might
> > > > > be a
> > > > > > way of implementation to segregate two extension points without
> > > adding
> > > > > > overhead; I haven't thought about it yet. However, the main
issue
> > is
> > > > > > sometimes the type of noise is something that generates an
> > exception
> > > on
> > > > > the
> > > > > > parsing side. For example, have a look at the following log:
> > > > > >
> > > > > > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown
> ICMP
> > > > > > connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0
> > > > > > (ryanmar)
> > > > > >
> > > > > > Clearly duplicate syslog_host throws an exception on parsing, so
> > how
> > > > > > are we going to deal with that at post-parse transformation? It
> > > cannot
> > > > > > pass the parsing. This is only a single example of cases that
> might
> > > > > > affect the production data. Unless Stellar transformation is
> > > something
> > > > > > that can be done at pre-parse and for the entire message.
> > > > > >
> > > > > >
> > > > > > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
> > > > > > simon@simonellistonball.com> wrote:
> > > > > >
> > > > > >> Ali,
> > > > > >>
> > > > > >> Sounds very much like what you’re talking about when you say
> > > > > >> normalization, and what I would understand it as, is the
process
> > > > > fulfilled
> > > > > >> by stellar field transformation in the parser config. Agreed
> that
> > > some
> > > > > of
> > > > > >> these will be general, based on common metron standard schema,
> but
> > > > > others
> > > > > >> will be organisation specific (custom fields overloaded with
> > > different
> > > > > >> meanings for instance in CEF, for example). These are very much
> > one
> > > of
> > > > > the
> > > > > >> reasons we have the stellar transformation step. I don’t think
> > that
> > > > > should
> > > > > >> be moved to a separate bolt to be honest, because that comes
> with
> > a
> > > > fair
> > > > > >> amount of overhead, but logically it is in the parser config
> > rather
> > > > than
> > > > > >> the parser, so seems to serve this purpose in the post-parse
> > > > transform,
> > > > > no?
> > > > > >>
> > > > > >> Simon
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>> On 27 Apr 2017, at 02:08, Ali Nazemian <al...@gmail.com>
> > > > wrote:
> > > > > >>>
> > > > > >>> Hi Simon,
> > > > > >>>
> > > > > >>> The reason I am asking for a specific normalisation step is
due
> > to
> > > > the
> > > > > >> fact
> > > > > >>> that normalisation is not a general use case which can be used
> by
> > > > other
> > > > > >>> users. It is completely bounded to our application. The way we
> > have
> > > > > fixed
> > > > > >>> it, for now, is to add a normalisation step to the parser and
> > clear
> > > > the
> > > > > >>> incoming data so the parser step can work on that, but I don't
> > like
> > > > it.
> > > > > >>> There is no point of creating a parser that can handle all of
> the
> > > > > >> possible
> > > > > >>> noises that can exist in the production data. Even if it is
> > > possible
> > > > to
> > > > > >>> predict every kind of noise in production data there is no
> point
> > > for
> > > > > >> Metron
> > > > > >>> community to focus on building a general purpose parser for a
> > > > specific
> > > > > >>> device while they can spend that time on developing a cool
> > feature.
> > > > > Even
> > > > > >> if
> > > > > >>> it is possible to predict noises and it is acceptable for the
> > > > community
> > > > > >> to
> > > > > >>> spend their time on creating that kind of parser why every
> Metron
> > > > user
> > > > > >> need
> > > > > >>> that extra normalisation? A user data might be clear at the
> first
> > > > step
> > > > > >> and
> > > > > >>> obviously, it only decreases the total throughput without any
> use
> > > for
> > > > > >> that
> > > > > >>> specific user.
> > > > > >>>
> > > > > >>> Imagine there is an additional bolt for normalisation and
there
> > is
> > > a
> > > > > >>> mechanism to customise the normalisation without changing the
> > > general
> > > > > >>> parser for a specific device. We can have a general parser as
a
> > > > common
> > > > > >>> parser for that device and leave the normalisation development
> to
> > > > > users.
> > > > > >>> However, it is very important to provide the normalisation
step
> > as
> > > > fast
> > > > > >> as
> > > > > >>> possible.
> > > > > >>>
> > > > > >>> Cheers,
> > > > > >>> Ali
> > > > > >>>
> > > > > >>> On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <
> > cestella@gmail.com
> > > >
> > > > > >> wrote:
> > > > > >>>
> > > > > >>>> Yeah, we definitely don't want to rewrite parsing in
> Stellar.  I
> > > > would
> > > > > >>>> expect the job of the parser, however, to handle structural
> > > issues.
> > > > > In
> > > > > >> my
> > > > > >>>> mind, parsing is about transforming structures into fields
and
> > the
> > > > > role
> > > > > >> of
> > > > > >>>> the field transformations are to transform values.  There's
> > > obvious
> > > > > >> overlap
> > > > > >>>> there wherein parsers may do some
> normalizations/transformations
> > > > (i.e.
> > > > > >> look
> > > > > >>>> how grok handles timestamps), but it almost always gets us
> into
> > > > > trouble
> > > > > >>>> when parsers do even moderately complex value
transformations.
> > > > > >>>>
> > > > > >>>> As I type this, though, I think I see your point.  What you
> > really
> > > > > want
> > > > > >> is
> > > > > >>>> to chain parsers, have a pre-parser to bring you 80% of the
> way
> > > > there
> > > > > >> and
> > > > > >>>> hammer out all the structural issues so you might be able to
> > use a
> > > > > more
> > > > > >>>> generic parser down the chain.  I have often thought that
> maybe
> > we
> > > > > >> should
> > > > > >>>> expose parsers as Stellar functions which take raw data and
> emit
> > > > whole
> > > > > >>>> messages.  This would allow us to compose parsers, so imagine
> > the
> > > > > above
> > > > > >>>> example where you've written a stellar function to normalize
> the
> > > > input
> > > > > >> and
> > > > > >>>> you're then passing it to a CSV parser, you could run
> > > > > >>>> "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise
> > > specify a
> > > > > >>>> parser.
> > > > > >>>>
> > > > > >>>> As for speed, the stellar expression would get compiled into
a
> > > java
> > > > > >> object,
> > > > > >>>> so it shouldn't be appreciable overhead since we no longer
lex
> > and
> > > > > parse
> > > > > >>>> for every message.
> > > > > >>>>
> > > > > >>>> Is this kinda how you were seeing it?
> > > > > >>>>
> > > > > >>>> On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <
> > > > > >>>> simon@simonellistonball.com> wrote:
> > > > > >>>>
> > > > > >>>>> The challenge there I suspect is going to be that you
> > essentially
> > > > end
> > > > > >> up
> > > > > >>>>> with the actual parser doing very little of value, and then
> > > > > effectively
> > > > > >>>>> trying to write a parser in stellar against a few broad
> > strings,
> > > > > which
> > > > > >>>>> would likely give you all sorts of performance problems.
> > > > > >>>>>
> > > > > >>>>> One solution is to write a very defensive and flexible
> parser,
> > > but
> > > > > that
> > > > > >>>>> would tend to be time consuming.
> > > > > >>>>>
> > > > > >>>>> There is also something to be said for doing some basic
> > > > > transformation
> > > > > >>>>> before the parser topic kafka in something like nifi, but
> > again,
> > > > > >>>>> performance can be an issue there.
> > > > > >>>>>
> > > > > >>>>> If the noise is about broken structure for example, maybe a
> > > simple
> > > > > >>>>> pre-process step as part of your parser would make sense,
> e.g.
> > > > > >> stripping
> > > > > >>>>> syslog headers, or character set conversion, removing very
> > broken
> > > > > bits
> > > > > >> as
> > > > > >>>>> part of the parse method.
> > > > > >>>>>
> > > > > >>>>> In terms of normalisation post-parse, I agree, that 100% a
> job
> > > for
> > > > > >>>>> Stellar, and the fieldTransformations capability. Something
I
> > > would
> > > > > >> like
> > > > > >>>> to
> > > > > >>>>> see would be a means to use that transformation step to map
> to
> > a
> > > > well
> > > > > >>>> known
> > > > > >>>>> (though loosely enforced) schema provided by a governance
> > > > framework,
> > > > > >> but
> > > > > >>>>> that is a much bigger topic of conversation.
> > > > > >>>>>
> > > > > >>>>> Not of course that not everything has to be parsed just
> because
> > > > it’s
> > > > > in
> > > > > >>>>> the message. A relatively loose fitting parser which pulls
> out
> > > the
> > > > > >>>> relevant
> > > > > >>>>> data for the use case would be fine, and likely a lot more
> > > tolerant
> > > > > of
> > > > > >>>>> noise than something that felt the need for every field. We
> do
> > > > after
> > > > > >> all
> > > > > >>>>> store the original_string for you if you really absolutely
> have
> > > to
> > > > > had
> > > > > >>>>> everything, so a more schema-on-read philosophy certainly
> > applies
> > > > and
> > > > > >>>> will
> > > > > >>>>> likely side-step a lot of your issues.
> > > > > >>>>>
> > > > > >>>>> Simon
> > > > > >>>>>
> > > > > >>>>>> On 26 Apr 2017, at 14:37, Casey Stella <ce...@gmail.com>
> > > > wrote:
> > > > > >>>>>>
> > > > > >>>>>> Ok, that's another story.  hmmmm, we don't generally
> pre-parse
> > > > > becuase
> > > > > >>>> we
> > > > > >>>>>> try to not assume any particular format there (i.e. it
could
> > be
> > > > > >>>> strings,
> > > > > >>>>>> could be byte arrays).  Maybe the right answer is to pass
> the
> > > raw,
> > > > > >>>>>> non-normalized data (best effort tyep of thing) through the
> > > parser
> > > > > and
> > > > > >>>> do
> > > > > >>>>>> the normalization post-parse..or is there a problem with
> that?
> > > > > >>>>>>
> > > > > > >>>>>> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <alinazemian@gmail.com> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>>> Hi Casey,
> > > > > > >>>>>>>
> > > > > > >>>>>>> It is actually a pre-parse process, not a post-parse one.
> > > > > > >>>>>>> These types of noise affect the position of an attribute,
> > > > > > >>>>>>> for example, and give us a parsing exception. The timestamp
> > > > > > >>>>>>> example was not a good one because that is actually a
> > > > > > >>>>>>> post-parse exception.
> > > > > >>>>>>>
> > > > > > >>>>>>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <cestella@gmail.com> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>>> So, further transformation post-parse was one of the
> > > > > > >>>>>>>> motivating reasons for Stellar. Is there a capability that
> > > > > > >>>>>>>> it's lacking that we can add to fit your use case?
> > > > > >>>>>>>>
> > > > > > >>>>>>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <alinazemian@gmail.com> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> I've created a Jira ticket regarding this feature.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> https://issues.apache.org/jira/browse/METRON-893
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > > >>>>>>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <alinazemian@gmail.com> wrote:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>> Currently, we are using plain regexes in the Java source
> > > > > > >>>>>>>>>> code to handle those situations. However, it would be nice
> > > > > > >>>>>>>>>> to have a separate bolt and deal with them separately.
> > > > > > >>>>>>>>>> Yeah, I can create a Jira issue regarding that. The main
> > > > > > >>>>>>>>>> reason I am asking for such a feature is that the lack of
> > > > > > >>>>>>>>>> it makes the process of creating a parser for the
> > > > > > >>>>>>>>>> community a little painful for us. We need to maintain two
> > > > > > >>>>>>>>>> different versions, one for the community and another for
> > > > > > >>>>>>>>>> the internal use case. Clearly, noise is an inevitable
> > > > > > >>>>>>>>>> part of real-world use cases.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Cheers,
> > > > > >>>>>>>>>> Ali
> > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <ottobackwards@gmail.com> wrote:
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>> Hi,
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> Are you doing this cleansing all in the parser or are you
> > > > > > >>>>>>>>>>> using any Stellar to do it?
> > > > > > >>>>>>>>>>> Can you create a jira?
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian (alinazemian@gmail.com) wrote:
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> Hi all,
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> We are facing certain use cases in Metron production that
> > > > > > >>>>>>>>>>> are related to noisy streams: for example, a wrong
> > > > > > >>>>>>>>>>> timestamp, a duplicate hostname/IP address, etc. To deal
> > > > > > >>>>>>>>>>> with the normalization we have added an additional step
> > > > > > >>>>>>>>>>> to the corresponding parsers to do the data cleaning.
> > > > > > >>>>>>>>>>> Clearly, parsing is a standard step which mostly depends
> > > > > > >>>>>>>>>>> on the device that is generating the data and can be
> > > > > > >>>>>>>>>>> reused for the same type of device everywhere, but
> > > > > > >>>>>>>>>>> normalization is very production-dependent and there is
> > > > > > >>>>>>>>>>> no point in mixing normalization with parsing. It would
> > > > > > >>>>>>>>>>> be nice to have a separate bolt in the parsing topologies
> > > > > > >>>>>>>>>>> dedicated to the production-related cleaning process. In
> > > > > > >>>>>>>>>>> that case, everybody can easily contribute additional
> > > > > > >>>>>>>>>>> parsers to the Metron community without worrying about
> > > > > > >>>>>>>>>>> mixing parsing and data cleaning.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Ali
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> --
> > > > > >>>>>>>>>> A.Nazemian
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> --
> > > > > >>>>>>>>> A.Nazemian
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> --
> > > > > >>>>>>> A.Nazemian
> > > > > >>>>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> A.Nazemian
> > > > > >>
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > A.Nazemian
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > A.Nazemian
> > > >
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
>



--
A.Nazemian

Re: Normalization topology or separate normalization bolt for parsing topology

Posted by Ali Nazemian <al...@gmail.com>.
Hi Nick,

I am happy to continue development using the current architecture and
embed the pre-parsing steps in the parser code. However, this would work
against our policy of contributing to the Metron community to expand the
range of supported devices. Clearly, a generic parser would be useful for
the community, not a parser that is highly customised for our noisy
environment. I was looking to decouple parsing and normalisation so that
we can implement a generic parser which can be used by others as well.

I think this is more of a strategic decision, one which can increase the
number of generic parsers contributed back to the community in future.
Ideally, the official Metron developers would focus on Metron features
instead of developing generic parsers.
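
To make the decoupling concrete, here is a minimal sketch in Python (chosen
only for brevity; a real Metron parser would be Java). Everything in it is
hypothetical and for illustration: `collapse_duplicate_host`,
`generic_asa_parse` and `parse_with_normalizers` are made-up names, the
regexes are a toy stand-in for a real ASA parser, and none of this is an
existing Metron API. The idea is only that site-specific cleaning lives in a
list of small functions applied before an unchanged generic parser:

```python
import re
from typing import Callable, Dict, List

# A normalizer is any function from raw message text to cleaned message text.
Normalizer = Callable[[str], str]


def collapse_duplicate_host(raw: str) -> str:
    """Site-specific cleaning: turn 'HH:MM:SS host host ' into 'HH:MM:SS host '."""
    return re.sub(r"(\d{2}:\d{2}:\d{2}) (\S+) \2 ", r"\1 \2 ", raw, count=1)


def generic_asa_parse(raw: str) -> Dict[str, str]:
    """Toy stand-in for a strict, generic parser; raises on unexpected structure."""
    m = re.match(
        r"<(\d+)>(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) %(ASA-\d-\d+): (.*)$",
        raw,
        re.S,
    )
    if m is None:
        raise ValueError("unparseable message")
    priority, timestamp, host, tag, message = m.groups()
    return {
        "priority": priority,
        "timestamp": timestamp,
        "syslog_host": host,
        "ciscotag": tag,
        "message": message,
        "original_string": raw,
    }


def parse_with_normalizers(
    raw: str,
    normalizers: List[Normalizer],
    parser: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    """Run the site-specific normalizers in order, then the generic parser."""
    cleaned = raw
    for normalize in normalizers:
        cleaned = normalize(cleaned)
    fields = parser(cleaned)
    # Keep the untouched input so nothing is lost by the cleaning step.
    fields["original_string"] = raw
    return fields
```

With this shape, the generic parser could be contributed as-is and each site
would only maintain its own list of normalizers.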

Thanks,
Ali

On Wed, May 3, 2017 at 3:03 AM, Nick Allen <ni...@nickallen.org> wrote:

> Yes, and currently that normalization step is the Parsers.
>
> I am not saying the message has to be entirely clear and well-defined.  But
> there is a minimum set of expectations that you must have of any data that
> you're ingesting.   Once it meets that "minimum set", the parser should be
> able to ingest and normalize the message.  Any oddities beyond that
> "minimum set" can be handled with Stellar either post-Parsing or in
> Enrichment.
>
> It is, of course, a judgement call as to what that minimum set is for you.
> You would just need a Parser that matches your definition of "minimum set".
>
> My main point here is that I am not seeing a need to re-architect
> anything.  I think we have the right tools, IMHO.



-- 
A.Nazemian

Re: Normalization topology or separate normalization bolt for parsing topology

Posted by Nick Allen <ni...@nickallen.org>.
Yes, and currently that normalization step is the Parsers.

I am not saying the message has to be entirely clear and well-defined.  But
there is a minimum set of expectations that you must have of any data that
you're ingesting.   Once it meets that "minimum set", the parser should be
able to ingest and normalize the message.  Any oddities beyond that
"minimum set" can be handled with Stellar either post-Parsing or in
Enrichment.

It is, of course, a judgement call as to what that minimum set is for you.
You would just need a Parser that matches your definition of "minimum set".

My main point here is that I am not seeing a need to re-architect
anything.  I think we have the right tools, IMHO.
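
As a rough illustration of what a "minimum set" could look like (a Python
sketch for brevity, not Metron's parser API; `lenient_parse` and the field
names other than original_string are hypothetical), the parser below insists
only on a syslog priority and an `%ASA-...` tag. A message without those is
dropped; a broken timestamp is simply left out for a later step to deal
with. The facility/severity split follows the standard syslog rule
PRI = facility * 8 + severity:

```python
import re
from typing import Dict, Optional

PRI_RE = re.compile(r"^<(\d{1,3})>")
TAG_RE = re.compile(r"%(ASA-\d-\d+): (.*)$", re.S)
TS_RE = re.compile(r"^<\d{1,3}>(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) ")


def lenient_parse(raw: str) -> Optional[Dict[str, object]]:
    """Extract what is reliably present; return None below the minimum set."""
    pri = PRI_RE.search(raw)
    tag = TAG_RE.search(raw)
    if pri is None or tag is None:
        # Not even a priority and a message tag: treat the data as untrustworthy.
        return None

    value = int(pri.group(1))
    fields: Dict[str, object] = {
        "original_string": raw,
        "facility": value // 8,
        "severity": value % 8,
        "ciscotag": tag.group(1),
        "message": tag.group(2),
    }

    # The timestamp is outside the minimum set: keep it only when well-formed,
    # and leave any repair to a post-parse step.
    ts = TS_RE.match(raw)
    if ts is not None:
        fields["timestamp_raw"] = ts.group(1)
    return fields
```

So `<166>` decodes to facility 20, severity 6 (informational) and `<161>` to
facility 20, severity 1 (alert), and a message with a mangled date still
gets ingested with its original_string intact.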









On Tue, May 2, 2017 at 10:33 AM, Ali Nazemian <al...@gmail.com> wrote:

> Hi Nick,
>
> The date could be corrupted for any reason, and sometimes we have no
> control over the device. Obviously, it is not a big deal if we lose a <166>
> severity message, but it could be a different situation for a <161>
> severity message or an actual critical threat. However, I mentioned those
> defects as examples to point out the importance of having a normalisation
> step in the Metron processing chain.
>
> I still think there is no guarantee of an entirely clear and
> well-defined message in real-world use cases. If we recognise this
> situation as a problem, then finding a high-performance and flexible
> solution is not very hard.
>
> Cheers,
> Ali
>
> On Tue, May 2, 2017 at 11:24 PM, Nick Allen <ni...@nickallen.org> wrote:
>
> > Before worrying about how to ingest this 'noisy' data, I would want to
> > better understand root cause.  If you cannot even get a valid date format,
> > are you sure the data can be trusted?
> >
> > Rather than bending over backwards to try to ingest it, I would first make
> > sure the telemetry is not totally bogus to begin with.  Maybe it is better
> > that the data is dropped in cases like this.
> >
> > IMHO, that is how I would tackle a problem like this.  Not all data can be
> > trusted.
> >
> >
> >
> >
> >
> >
> >
> > On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian <al...@gmail.com> wrote:
> >
> > > Are you sure? The syslog_host name is way more complicated than something
> > > that can be a coincidence. I need to double check with one of the security
> > > device experts, but I thought it was some kind of noise.
> > >
> > > Yes, we do have more cases that seem to be corrupted, for example
> > > duplicate IP addresses or a corrupted date format. Please have a look at
> > > the following message. At least I am sure the date format is corrupted
> > > in this one.
> > >
> > > <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP connection 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to inside:*y.y.y.y/p2* *y.y.y.y/p2*
> > >
> > > Cheers,
> > > Ali
> > >
> > > On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <
> > > simon@simonellistonball.com> wrote:
> > >
> > > > In that instance, you're looking at valid syslog which should be parsed
> > > > as such. The repeated host is not really a host in syslog terms; it's an
> > > > application name header which happens to be the same. This is definitely
> > > > a parser bug which should be handled, especially since the header is
> > > > perfectly RFC compliant.
> > > >
> > > > Do you have any other such cases? My view is that parsers should be
> > > > written to cope with more or less any case, so they should extract all
> > > > the fields they can from malformed logs rather than throwing exceptions,
> > > > but that's more about the way we write parsers than having some kind of
> > > > pre-clean.
> > > >
> > > > Simon
> > > >
> > > > Sent from my iPad
> > > >
> > > > > On 27 Apr 2017, at 08:04, Ali Nazemian <al...@gmail.com> wrote:
> > > > >
> > > > > > I do agree there is a fair amount of overhead in using another bolt
> > > > > > for this purpose. I am not prescribing an implementation. There might
> > > > > > be a way to separate the two extension points without adding
> > > > > > overhead; I haven't thought about it yet. However, the main issue is
> > > > > > that sometimes the noise is something that generates an exception on
> > > > > > the parsing side. For example, have a look at the following log:
> > > > > >
> > > > > > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0 (ryanmar)
> > > > > >
> > > > > > Clearly, the duplicate syslog_host throws an exception on parsing, so
> > > > > > how are we going to deal with that in a post-parse transformation? It
> > > > > > cannot get past parsing. This is only a single example of the cases
> > > > > > that might affect production data, unless a Stellar transformation is
> > > > > > something that can be done pre-parse and on the entire message.
> > > > >
> > > > >
> > > > > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
> > > > > simon@simonellistonball.com> wrote:
> > > > >
> > > > >> Ali,
> > > > >>
> > > > > >> It sounds very much like what you’re talking about when you say
> > > > > >> normalization, and what I would understand by it, is the process
> > > > > >> fulfilled by Stellar field transformations in the parser config.
> > > > > >> Agreed that some of these will be general, based on the common Metron
> > > > > >> standard schema, but others will be organisation-specific (custom
> > > > > >> fields overloaded with different meanings in CEF, for example). These
> > > > > >> are very much one of the reasons we have the Stellar transformation
> > > > > >> step. I don’t think that should be moved to a separate bolt, to be
> > > > > >> honest, because that comes with a fair amount of overhead, but
> > > > > >> logically it is in the parser config rather than the parser, so it
> > > > > >> seems to serve this purpose in the post-parse transform, no?
> > > > >>
> > > > >> Simon
> > > > >>
> > > > >>
> > > > >>
> > > > >>> On 27 Apr 2017, at 02:08, Ali Nazemian <al...@gmail.com> wrote:
> > > > >>>
> > > > >>> Hi Simon,
> > > > >>>
> > > > >>> The reason I am asking for a specific normalisation step is that
> > > > >>> this normalisation is not a general use case that other users can
> > > > >>> reuse. It is completely bound to our application. The way we have
> > > > >>> fixed it, for now, is to add a normalisation step to the parser
> > > > >>> that cleans the incoming data so the parser step can work on it,
> > > > >>> but I don't like it. There is no point in creating a parser that
> > > > >>> can handle every possible kind of noise that can exist in
> > > > >>> production data. Even if it were possible to predict every kind of
> > > > >>> noise, there is no point in the Metron community focusing on a
> > > > >>> general-purpose parser for a specific device when they could spend
> > > > >>> that time developing a cool feature. And even if it were possible
> > > > >>> to predict the noise, and acceptable for the community to spend
> > > > >>> its time on that kind of parser, why should every Metron user need
> > > > >>> that extra normalisation? A user's data might be clean to begin
> > > > >>> with, in which case it only decreases total throughput without any
> > > > >>> benefit to that user.
> > > > >>>
> > > > >>> Imagine there is an additional bolt for normalisation, with a
> > > > >>> mechanism to customise the normalisation without changing the
> > > > >>> general parser for a specific device. We can have a general parser
> > > > >>> as the common parser for that device and leave the normalisation
> > > > >>> development to users. However, it is very important to make the
> > > > >>> normalisation step as fast as possible.
> > > > >>>
> > > > >>> Cheers,
> > > > >>> Ali
> > > > >>>
> > > > >>> On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <cestella@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Yeah, we definitely don't want to rewrite parsing in Stellar.  I
> > > > >>>> would expect the job of the parser, however, to be handling
> > > > >>>> structural issues.  In my mind, parsing is about transforming
> > > > >>>> structures into fields, and the role of the field transformations
> > > > >>>> is to transform values.  There's obvious overlap there, wherein
> > > > >>>> parsers may do some normalizations/transformations (e.g. look at
> > > > >>>> how grok handles timestamps), but it almost always gets us into
> > > > >>>> trouble when parsers do even moderately complex value
> > > > >>>> transformations.
> > > > >>>>
> > > > >>>> As I type this, though, I think I see your point.  What you really
> > > > >>>> want is to chain parsers: have a pre-parser bring you 80% of the
> > > > >>>> way there and hammer out all the structural issues, so you might
> > > > >>>> be able to use a more generic parser down the chain.  I have often
> > > > >>>> thought that maybe we should expose parsers as Stellar functions
> > > > >>>> which take raw data and emit whole messages.  This would allow us
> > > > >>>> to compose parsers.  Imagine the above example, where you've
> > > > >>>> written a Stellar function to normalize the input and you're then
> > > > >>>> passing it to a CSV parser: you could specify
> > > > >>>> "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise specify
> > > > >>>> a parser.
> > > > >>>>
> > > > >>>> As for speed, the Stellar expression would get compiled into a
> > > > >>>> Java object, so it shouldn't add appreciable overhead, since we no
> > > > >>>> longer lex and parse for every message.
> > > > >>>>
> > > > >>>> Is this kinda how you were seeing it?
> > > > >>>>
> > > > >>>> On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <
> > > > >>>> simon@simonellistonball.com> wrote:
> > > > >>>>
> > > > >>>>> The challenge there, I suspect, is going to be that you
> > > > >>>>> essentially end up with the actual parser doing very little of
> > > > >>>>> value, and then effectively trying to write a parser in Stellar
> > > > >>>>> against a few broad strings, which would likely give you all
> > > > >>>>> sorts of performance problems.
> > > > >>>>>
> > > > >>>>> One solution is to write a very defensive and flexible parser,
> > > > >>>>> but that would tend to be time-consuming.
> > > > >>>>>
> > > > >>>>> There is also something to be said for doing some basic
> > > > >>>>> transformation before the parser's Kafka topic, in something
> > > > >>>>> like NiFi, but again, performance can be an issue there.
> > > > >>>>>
> > > > >>>>> If the noise is about broken structure, for example, maybe a
> > > > >>>>> simple pre-process step as part of your parser would make sense,
> > > > >>>>> e.g. stripping syslog headers, converting character sets, or
> > > > >>>>> removing very broken bits as part of the parse method.
> > > > >>>>>
> > > > >>>>> In terms of normalisation post-parse, I agree, that is 100% a
> > > > >>>>> job for Stellar and the fieldTransformations capability.
> > > > >>>>> Something I would like to see would be a means to use that
> > > > >>>>> transformation step to map to a well-known (though loosely
> > > > >>>>> enforced) schema provided by a governance framework, but that is
> > > > >>>>> a much bigger topic of conversation.
> > > > >>>>>
> > > > >>>>> Note, of course, that not everything has to be parsed just
> > > > >>>>> because it’s in the message. A relatively loose-fitting parser
> > > > >>>>> which pulls out the relevant data for the use case would be
> > > > >>>>> fine, and likely a lot more tolerant of noise than something
> > > > >>>>> that felt the need for every field. We do, after all, store the
> > > > >>>>> original_string for you if you really, absolutely have to have
> > > > >>>>> everything, so a more schema-on-read philosophy certainly
> > > > >>>>> applies and will likely side-step a lot of your issues.
> > > > >>>>>
> > > > >>>>> Simon
> > > > >>>>>
> > > > >>>>>> On 26 Apr 2017, at 14:37, Casey Stella <ce...@gmail.com> wrote:
> > > > >>>>>>
> > > > >>>>>> Ok, that's another story.  Hmm, we don't generally pre-parse
> > > > >>>>>> because we try not to assume any particular format there (i.e.
> > > > >>>>>> it could be strings, could be byte arrays).  Maybe the right
> > > > >>>>>> answer is to pass the raw, non-normalized data (best-effort type
> > > > >>>>>> of thing) through the parser and do the normalization
> > > > >>>>>> post-parse... or is there a problem with that?
> > > > >>>>>>
> > > > >>>>>> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <
> > > > alinazemian@gmail.com>
> > > > >>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> Hi Casey,
> > > > >>>>>>>
> > > > >>>>>>> It is actually a pre-parse process, not a post-parse one. This
> > > > >>>>>>> type of noise affects the position of an attribute, for
> > > > >>>>>>> example, and gives us a parsing exception. The timestamp
> > > > >>>>>>> example was not a good one, because that is actually a
> > > > >>>>>>> post-parse exception.
> > > > >>>>>>>
> > > > >>>>>>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <
> > > cestella@gmail.com
> > > > >
> > > > >>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> So, further transformation post-parse was one of the
> > > > >>>>>>>> motivating reasons for Stellar.  Is there a capability that
> > > > >>>>>>>> it's lacking that we can add to fit your use case?
> > > > >>>>>>>>
> > > > >>>>>>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <
> > > > >> alinazemian@gmail.com
> > > > >>>>>
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> I've created a Jira ticket regarding this feature.
> > > > >>>>>>>>>
> > > > >>>>>>>>> https://issues.apache.org/jira/browse/METRON-893
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <
> > > > >>>> alinazemian@gmail.com
> > > > >>>>>>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> Currently, we are using plain regexes in the Java source
> > > > >>>>>>>>>> code to handle those situations. However, it would be nice
> > > > >>>>>>>>>> to have a separate bolt and deal with them separately. Yeah,
> > > > >>>>>>>>>> I can create a Jira issue for that. The main reason I am
> > > > >>>>>>>>>> asking for such a feature is that the lack of it makes
> > > > >>>>>>>>>> creating parsers for the community a little painful for us.
> > > > >>>>>>>>>> We need to maintain two different versions, one for the
> > > > >>>>>>>>>> community and another for our internal use case. Clearly,
> > > > >>>>>>>>>> noise is an inevitable part of real-world use cases.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Cheers,
> > > > >>>>>>>>>> Ali
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <
> > > > >>>>>>> ottobackwards@gmail.com
> > > > >>>>>>>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> Hi,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Are you doing this cleansing all in the parser, or are you
> > > > >>>>>>>>>>> using any Stellar to do it?
> > > > >>>>>>>>>>> Can you create a Jira?
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian (
> > > > >>>> alinazemian@gmail.com)
> > > > >>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Hi all,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> We are facing certain use cases in Metron production that
> > > > >>>>>>>>>>> are related to noisy streams: for example, a wrong
> > > > >>>>>>>>>>> timestamp, a duplicate hostname/IP address, etc. To deal
> > > > >>>>>>>>>>> with the normalization, we have added an additional step to
> > > > >>>>>>>>>>> the corresponding parsers to do the data cleaning. Clearly,
> > > > >>>>>>>>>>> parsing is a standard concern that mostly depends on the
> > > > >>>>>>>>>>> device generating the data, and a parser can be used for
> > > > >>>>>>>>>>> the same type of device everywhere, but normalization is
> > > > >>>>>>>>>>> very production-dependent and there is no point in mixing
> > > > >>>>>>>>>>> normalization with parsing. It would be nice to have a
> > > > >>>>>>>>>>> separate bolt in the parsing topologies dedicated to the
> > > > >>>>>>>>>>> production-related cleaning process. In that case,
> > > > >>>>>>>>>>> everybody can easily contribute additional parsers to the
> > > > >>>>>>>>>>> Metron community without worrying about mixing parsers and
> > > > >>>>>>>>>>> data cleaning.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Regards,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Ali
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> --
> > > > >>>>>>>>>> A.Nazemian
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> --
> > > > >>>>>>>>> A.Nazemian
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> --
> > > > >>>>>>> A.Nazemian
> > > > >>>>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> --
> > > > >>> A.Nazemian
> > > > >>
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > A.Nazemian
> > > >
> > >
> > >
> > >
> > > --
> > > A.Nazemian
> > >
> >
>
>
>
> --
> A.Nazemian
>

Re: Normalization topology or separate normalization bolt for parsing topology

Posted by Ali Nazemian <al...@gmail.com>.
Hi Nick,

The date could be corrupted for any number of reasons, and sometimes we have
no control over the device. Obviously, it is not a big deal if we lose a <166>
(informational) message, but it could be a different story for a <161> (alert)
message or an actual critical threat. In any case, I mentioned those defects
only as examples, to point out the importance of having a normalisation step
in the Metron processing chain.

I still think there is no guarantee of getting entirely clean, well-defined
messages in real-world use cases. If we recognise this situation as a problem,
then finding a high-performance and flexible solution should not be very hard.
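
To make the idea concrete, here is a rough sketch of what a site-specific
pre-parse cleaning step could look like. It is only an illustration under
stated assumptions: the class PreParseNormalizer, its methods, and the two
regex rules are hypothetical and are not part of Metron. The rules target the
two sample lines quoted in this thread (hostnames and addresses are made up).
Whether a repeated token really is noise is a per-site decision; as noted
elsewhere in the thread, the repeated host may be a valid application name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

/** Hypothetical pre-parse cleaner; NOT part of Metron's API. */
final class PreParseNormalizer {
    // Rules run in insertion order against the raw line.
    private final List<UnaryOperator<String>> rules = new ArrayList<>();

    /** Registers a regex rewrite; the pattern is compiled once, up front. */
    PreParseNormalizer rule(String regex, String replacement) {
        Pattern pattern = Pattern.compile(regex);
        rules.add(line -> pattern.matcher(line).replaceAll(replacement));
        return this;
    }

    /** Applies every rule in order and returns the cleaned line. */
    String normalize(String raw) {
        String line = raw;
        for (UnaryOperator<String> r : rules) {
            line = r.apply(line);
        }
        return line;
    }

    /** Assumed, site-specific rules for the two samples in this thread. */
    static PreParseNormalizer asaExample() {
        return new PreParseNormalizer()
            // "<pri>Mon dd hh:mm:ss host host ..." -> keep a single host token.
            .rule("^(<\\d+>\\w{3}\\s+\\d+\\s+\\d{2}:\\d{2}:\\d{2}\\s+)(\\S+)\\s+\\2(?=\\s)", "$1$2")
            // "addr/port addr/port" -> keep a single addr/port token.
            .rule("(\\S+/\\S+)\\s+\\1(?=\\s|$)", "$1");
    }

    public static void main(String[] args) {
        String raw = "<166>Apr 12 03:19:12 fw01 fw01 %ASA-6-302021: Teardown ICMP connection";
        System.out.println(asaExample().normalize(raw));
    }
}
```

Because the patterns are compiled once and each rule is a plain string
rewrite, a step like this could sit in front of an unchanged general parser
and be swapped per deployment. A line that matches none of the rules passes
through untouched.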

Cheers,
Ali

On Tue, May 2, 2017 at 11:24 PM, Nick Allen <ni...@nickallen.org> wrote:

> Before worrying about how to ingest this 'noisy' data, I would want to
> better understand root cause.  If you cannot even get a valid date format,
> are you sure the data can be trusted?
>
> Rather than bending over backwards to try to ingest it, I would first make
> sure the telemetry is not totally bogus to begin with.  Maybe it is better
> that the data is dropped in cases like this.
>
> IMHO, that is how I would tackle a problem like this.  Not all data can be
> trusted.



-- 
A.Nazemian