Posted to dev@metron.apache.org by James Sirota <js...@apache.org> on 2017/01/20 18:16:57 UTC

[DISCUSS] Error Indexing

We already have the capability to capture bolt errors and validation errors and pipe them into a Kafka topic.  I want to propose that we attach a writer topology to the error and validation-failed Kafka topics so that we can (a) create a new ES index for these errors and (b) create a new Kibana dashboard to visualize them.  The benefit would be that errors and validation failures would be easier to see and analyze.

I am seeking feedback on the following:

- How granular would we want this feature to be?  Do we think we would want one index/dashboard per source, or would it be better to collapse everything into the same index?
- Do we care about storing these errors in HDFS as well?  Or is indexing them enough?
- What types of errors should we record?  I am proposing:

For error reporting:
--Message failed to parse
--Enrichment failed to enrich
--Threat intel feed failures 
--Generic catch-all for all other errors

For validation reporting:
--What part of the message failed validation
--Which Stellar validator caused the failure



-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org

Re: [DISCUSS] Error Indexing

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
Simply as a unique identifier of the original information that is failing
some step, giving you something to key on so you can count unique events and
prioritize issues without the concern of cyclical failures (if the issue is
with indexing a specific message, trying to index it again will just fail in
a loop).

Jon

On Wed, Feb 1, 2017 at 6:59 AM Dima Kovalyov <Di...@sstech.us>
wrote:

> That's a great topic of discussion.
>
> Throughout the thread, the idea of having a hash of the failed message has
> changed; can someone please explain why you plan to use this hash, and how?
>
> - Dima
>
> On 02/01/2017 06:23 AM, Zeolla@GMail.com wrote:
> > After thinking on this for a few days I recant my previous suggestion of
> > TupleHash256.  It's still a bit early for SHA-3 - no good reference
> > implementations/libraries exist (I did some searching and emailing), it
> is
> > optimized for hardware but no hardware implementation is widely
> accessible,
> > FIPS 140-3 is still not close to finalized, etc.
> >
> > I think we could simulate the benefits of tuplehash by sorting the
> tuples,
> > then doing SHA-256(len(tuple1) | tuple1 | ... | len(tuplen) | tuplen).
> > Happy to entertain opposing thoughts, such as BLAKE2, etc. but with the
> > likely users of Metron, I think sticking with FIPS 140-2 is a solid
> choice.
> >
> > Jon
> >
> > On Thu, Jan 26, 2017, 11:23 AM Zeolla@GMail.com <ze...@gmail.com>
> wrote:
> >
> > So one more thing regarding why I think we should throw an exception on a
> > failed enrichment.  If we do make something like username a constant
> field,
> > in cases where that is used to calculate rawMessage_hash, if it fails to
> > enrich, the hash would be different compared to when it succeeds.  Of
> > course I think the initial intent of adding username as a constant field
> > would be to handle it in the parsers, where that information is provided
> in
> > the messages themselves, but how would Threat Intel know the difference?
> > In my environment I am looking forward to a streaming enrichment that
> adds
> > the username, where applicable, anywhere I have an IP.
> >
> > My hesitant suggestion for a hashing algorithm would be to use
> > TupleHash256, as it is a NIST-provided implementation of SHA-3 (using
> > cSHAKE) for this use case.  Details here
> > <
> http://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-185.pdf>.
> > However, I haven't been able to find a reference implementation of this
> in
> > any language, so that's a bit of a downside.  A more general SHA3-256
> > implementation where we handle ordering could work as well, but would be
> > significantly less optimal.
> >
> > Jon
> >
> > On Thu, Jan 26, 2017 at 10:20 AM Ryan Merriman <me...@gmail.com>
> wrote:
> >
> > Jon, I misread the code in the GenericEnrichmentBolt.  The error is
> > forwarded on so no issues there.
> >
> > Defaulting to the common fields makes sense.  I will dig into the
> > GenericEnrichmentBolt more, maybe there is a way to get the error fields
> > without having to significantly change things.  Any opinion on a hashing
> > algorithm?
> >
> > On Wed, Jan 25, 2017 at 9:37 PM, Zeolla@GMail.com <ze...@gmail.com>
> wrote:
> >
> >> Although hashing the whole message is better than nothing, it misses a
> lot
> >> of the benefits we could get.
> >>
> >> While I'd love to have consistency for this field across all of the
> >> different error.types, it appears that may not be reasonably possible
> >> because of the parsers.  So, how about something like hash all of the
> >> constant
> >> fields
> >> <https://github.com/apache/incubator-metron/blob/master/
> >> metron-platform/metron-common/src/main/java/org/apache/
> >> metron/common/Constants.java>
> >> excluding
> >> timestamp and original_string unless it is a parser, in which case hash
> > the
> >> entire message?  This gives us some measure of event uniqueness and it
> can
> >> grow as we define additional constant fields (I recall discussing with
> >> someone else on the list regarding expanding those standard fields to
> >> include things like usernames but I can't find the specific email
> >> exchange).
> >>
> >> Because some enrichments can be heavily relied on, I think it makes
> sense
> >> to put a message onto the error queue when it throws an exception.  Not
> >> only does this help troubleshoot edge cases, but it makes issues more
> >> obvious when assembling a new enrichment in dev/test.  I can't think of
> a
> >> scenario currently where an enrichment would only be "best effort" and
> > that
> >> I wouldn't want that error indexed and retrievable.  However, this gets
> >> interesting when talking about the various options to solve the "Enrich
> >> enrichment" discussion from earlier in the month.  We can keep that part
> > of
> >> this separate though, as I don't think that's being actively pursued
> right
> >> now.
> >>
> >> Jon
> >>
> >> On Wed, Jan 25, 2017 at 10:49 AM David Lyle <dl...@gmail.com>
> wrote:
> >>
> >> RE: separate JIRA for MPack/Ansible. No objection to tracking them
> >> separately, but for this item to be complete, you'll need both the
> feature
> >> and the ability to install it.
> >>
> >> -D...
> >>
> >>
> >> On Tue, Jan 24, 2017 at 5:33 PM, Ryan Merriman <me...@gmail.com>
> >> wrote:
> >>
> >>> Assuming we're going to write all errors to a single error topic, I
> > think
> >>> it makes sense to agree on an error message schema and handle errors
> >> across
> >>> the 3 different topologies in the same way with a single
> implementation.
> >>> The implementation in ParserBolt (ErrorUtils.handleError) produces the
> >> most
> >>> verbose error object so I think it's a good candidate for the single
> >>> implementation.  Here is the message structure it currently produces:
> >>>
> >>> {
> >>>   "exception": "java.lang.Exception: there was an error",
> >>>   "hostname": "host",
> >>>   "stack": "java.lang.Exception: ...",
> >>>   "time": 1485295416563,
> >>>   "message": "there was an error",
> >>>   "rawMessage": "raw message",
> >>>   "rawMessage_bytes": [],
> >>>   "source.type": "bro_error"
> >>> }
> >>>
> >>> From our discussion so far we need to add a couple fields:  an error
> > type
> >>> and hash id.  Adding these to the message looks like:
> >>>
> >>> {
> >>>   "exception": "java.lang.Exception: there was an error",
> >>>   "hostname": "host",
> >>>   "stack": "java.lang.Exception: ...",
> >>>   "time": 1485295416563,
> >>>   "message": "there was an error",
> >>>   "rawMessage": "raw message",
> >>>   "rawMessage_bytes": [],
> >>>   "source.type": "bro_error",
> >>>   "error.type": "parser_error",
> >>>   "rawMessage_hash": "dde41b9920954f94066daf6291fb58a9"
> >>> }
> >>>
> >>> We should also consider expanding the error types I listed earlier.
> >>> Instead of just having "indexing_error" we could have
> >>> "elasticsearch_indexing_error", "hdfs_indexing_error" and so on.
> >>>
> >>> Jon, if an exception happens in an enrichment or threat intel bolt the
> >>> message is passed along with no error thrown (only logged).  Everywhere
> >>> else I'm having trouble identifying specific fields that should be
> >> hashed.
> >>> Would hashing the message in every case be acceptable?  Do you know of
> a
> >>> place where we could hash a field instead?  On the topic of exceptions
> > in
> >>> enrichments, are we ok with an error only being logged and not added to
> >> the
> >>> message or emitted to the error queue?
> >>>
> >>>
> >>>
> >>> On Tue, Jan 24, 2017 at 3:10 PM, Ryan Merriman <me...@gmail.com>
> >>> wrote:
> >>>
> >>>> That use case makes sense to me.  I don't think it will require that
> >> much
> >>>> additional effort either.
> >>>>
> >>>> On Tue, Jan 24, 2017 at 1:02 PM, Zeolla@GMail.com <ze...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Regarding error vs validation - Either way I'm not very concerned.  I
> >>>>> initially assumed they would be combined and agree with that
> > approach,
> >>> but
> >>>>> splitting them out isn't a very big deal to me either.
> >>>>>
> >>>>> Re: Ryan.  Yes, exactly.  In the case of a parser issue (or anywhere
> >>> else
> >>>>> where it's not possible to pick out the exact thing causing the
> > issue)
> >>> it
> >>>>> would be a hash of the complete message.
> >>>>>
> >>>>> Regarding the architecture, I mostly agree with James except that I
> >>> think
> >>>>> step 3 needs to also be able to somehow group errors via the original
> >>>>> data (identify
> >>>>> replays, identify repeat issues with data in a specific field, issues
> >>> with
> >>>>> consistently different data, etc.).  This is essentially the first
> >> step
> >>> of
> >>>>> troubleshooting, which I assume you are doing if you're looking at
> > the
> >>>>> error dashboard.
> >>>>>
> >>>>> If the hash gets moved out of the initial implementation, I'm fairly
> >>>>> certain you lose this ability.  The point here isn't to handle long
> >>> fields
> >>>>> (although that's a benefit of this approach), it's to attach a unique
> >>>>> identifier to the error/validation issue message that links it to the
> >>>>> original problem.  I'd be happy to consider alternative solutions to
> >>> this
> >>>>> problem (for instance, actually sending across the data itself) I
> > just
> >>>>> haven't been able to think of another way to do this that I like
> >> better.
> >>>>> Jon
> >>>>>
> >>>>> On Tue, Jan 24, 2017 at 1:13 PM Ryan Merriman <me...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> We also need a JIRA for any install/Ansible/MPack work needed.
> >>>>>>
> >>>>>> On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <js...@apache.org>
> >>>>> wrote:
> >>>>>>> Now that I had some time to think about it I would collapse all
> >>> error
> >>>>> and
> >>>>>>> validation topics into one.  We can differentiate between
> >> different
> >>>>> views
> >>>>>>> of the data (split by error source etc) via Kibana dashboards.  I
> >>>>> would
> >>>>>>> implement this feature incrementally.  First I would modify all
> >> the
> >>>>> bolts
> >>>>>>> to log to a single topic.  Second, I would get the error indexing
> >>>>> done by
> >>>>>>> attaching the indexing topology to the error topic. Third I would
> >>>>> create
> >>>>>>> the necessary dashboards to view errors and validation failures
> > by
> >>>>>> source.
> >>>>>>> Lastly, I would file a follow-on JIRA to introduce hashing of
> >> errors
> >>>>> or
> >>>>>>> fields that are too long.  It seems like a separate feature that
> >> we
> >>>>> need
> >>>>>> to
> >>>>>>> think through.  We may need a stellar function around that.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> James
> >>>>>>>
> >>>>>>> 24.01.2017, 10:25, "Ryan Merriman" <me...@gmail.com>:
> >>>>>>>> I understand what Jon is talking about. He's proposing we hash
> >> the
> >>>>>> value
> >>>>>>>> that caused the error, not necessarily the error message
> > itself.
> >>>>> For an
> >>>>>>>> enrichment this is easy. Just pass along the field value that
> >>> failed
> >>>>>>>> enrichment. For other cases the field that caused the error may
> >>> not
> >>>>> be
> >>>>>> so
> >>>>>>>> obvious. Take parser validation for example. The message is
> >>>>> validated
> >>>>>> as
> >>>>>>>> a whole and it may not be easy to determine which field is the
> >>>>> cause.
> >>>>>> In
> >>>>>>>> that case would a hash of the whole message work?
> >>>>>>>>
> >>>>>>>> There is a broader architectural discussion that needs to
> > happen
> >>>>> before
> >>>>>>> we
> >>>>>>>> can implement this. Currently we have an indexing topology that
> >>>>> reads
> >>>>>>> from
> >>>>>>>> 1 topic and writes messages to ES but errors are written to
> >>> several
> >>>>>>>> different topics:
> >>>>>>>>
> >>>>>>>>    - parser_error
> >>>>>>>>    - parser_invalid
> >>>>>>>>    - enrichments_error
> >>>>>>>>    - threatintel_error
> >>>>>>>>    - indexing_error
> >>>>>>>>
> >>>>>>>> I can see 4 possible approaches to implementing this:
> >>>>>>>>
> >>>>>>>>    1. Create an index topology for each error topic
> >>>>>>>>       1. Good because we can easily reuse the indexing topology
> >>> and
> >>>>>> would
> >>>>>>>>       require the least development effort
> >>>>>>>>       2. Bad because it would consume a lot of extra worker
> >> slots
> >>>>>>>>    2. Move the topic name into the error JSON message as a new
> >>>>>>> "error_type"
> >>>>>>>>    field and write all messages to the indexing topic
> >>>>>>>>       1. Good because we don't need to create a new topology
> >>>>>>>>       2. Bad because we would be flowing data and errors
> > through
> >>> the
> >>>>>> same
> >>>>>>>>       topology. A spike in errors could affect message
> > indexing.
> >>>>>>>>    3. Compromise between 1 and 2. Create another indexing
> >> topology
> >>>>> that
> >>>>>>> is
> >>>>>>>>    dedicated to indexing errors. Move the topic name into the
> >>> error
> >>>>>> JSON
> >>>>>>>>    message as a new "error_type" field and write all errors to
> > a
> >>>>> single
> >>>>>>> error
> >>>>>>>>    topic.
> >>>>>>>>    4. Write a completely new topology with multiple spouts (1
> >> for
> >>>>> each
> >>>>>>>>    error type listed above) that all feed into a single
> >>>>>>> BulkMessageWriterBolt.
> >>>>>>>>       1. Good because the current topologies would not need to
> >>>>> change
> >>>>>>>>       2. Bad because it would require the most development
> >> effort,
> >>>>>> would
> >>>>>>>>       not reuse existing topologies and takes up more worker
> >> slots
> >>>>>> than 3
> >>>>>>>> Are there other approaches I haven't thought of? I think 1 and
> > 2
> >>> are
> >>>>>> off
> >>>>>>>> the table because they are shortcuts and not good long-term
> >>>>> solutions.
> >>>>>> 3
> >>>>>>>> would be my choice because it introduces less complexity than
> > 4.
> >>>>>>> Thoughts?
> >>>>>>>> Ryan
> >>>>>>>>
> >>>>>>>> On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <
> >>> zeolla@gmail.com
> >>>>>>> wrote:
> >>>>>>>>>  In that case the hash would be of the value in the IP field,
> >>> such
> >>>>> as
> >>>>>>>>>  sha3(8.8.8.8).
> >>>>>>>>>
> >>>>>>>>>  Jon
> >>>>>>>>>
> >>>>>>>>>  On Mon, Jan 23, 2017, 6:41 PM James Sirota <
> >> jsirota@apache.org>
> >>>>>> wrote:
> >>>>>>>>>  > Jon,
> >>>>>>>>>  >
> >>>>>>>>>  > I am still not entirely following why we would want to use
> >>>>> hashing.
> >>>>>>> For
> >>>>>>>>>  > example if my error is "Your IP field is invalid and failed
> >>>>>>> validation"
> >>>>>>>>>  > hashing this error string will always result in the same
> >> hash.
> >>>>> Why
> >>>>>>> not
> >>>>>>>>>  > just use the actual error string? Can you provide an
> > example
> >>>>> where
> >>>>>>> you
> >>>>>>>>>  > would use it?
> >>>>>>>>>  >
> >>>>>>>>>  > Thanks,
> >>>>>>>>>  > James
> >>>>>>>>>  >
> >>>>>>>>>  > 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
> >>>>>>>>>  > > For 1 - I'm good with that.
> >>>>>>>>>  > >
> >>>>>>>>>  > > I'm talking about hashing the relevant content itself not
> >>> the
> >>>>>>> error.
> >>>>>>>>>  Some
> >>>>>>>>>  > > benefits are (1) minimize load on search index (there's
> >>>>> minimal
> >>>>>>> benefit
> >>>>>>>>>  > in
> >>>>>>>>>  > > spending the CPU and disk to keep it at full fidelity
> >>>>> (tokenize
> >>>>>> and
> >>>>>>>>>  > store))
> >>>>>>>>>  > > (2) provide something to key on for dashboards (assuming
> > a
> >>>>> good
> >>>>>>> hash
> >>>>>>>>>  > > algorithm that avoids collisions and is second preimage
> >>>>>> resistant)
> >>>>>>> and
> >>>>>>>>>  > (3)
> >>>>>>>>>  > > specific to errors, if the issue is that it failed to
> >>> index, a
> >>>>>> hash
> >>>>>>>>>  gives
> >>>>>>>>>  > > us some protection that the issue will not occur twice.
> >>>>>>>>>  > >
> >>>>>>>>>  > > Jon
> >>>>>>>>>  > >
> >>>>>>>>>  > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <
> >>>>> jsirota@apache.org>
> >>>>>>> wrote:
> >>>>>>>>>  > >
> >>>>>>>>>  > > Jon,
> >>>>>>>>>  > >
> >>>>>>>>>  > > With regards to 1, collapsing to a single dashboard for
> >> each
> >>>>>> would
> >>>>>>> be
> >>>>>>>>>  > > fine. So we would have one error index and one "failed to
> >>>>>> validate"
> >>>>>>>>>  > > index. The distinction is that errors would be things
> > that
> >>>>> went
> >>>>>>> wrong
> >>>>>>>>>  > > during stream processing (failed to parse, etc...), while
> >>>>>>> validation
> >>>>>>>>>  > > failures are messages that explicitly failed stellar
> >>>>>>> validation/schema
> >>>>>>>>>  > > enforcement. There should be relatively few of the second
> >>>>> type.
> >>>>>>>>>  > >
> >>>>>>>>>  > > With respect to 3, why do you want the error hashed? Why
> >> not
> >>>>> just
> >>>>>>>>>  search
> >>>>>>>>>  > > for the error text?
> >>>>>>>>>  > >
> >>>>>>>>>  > > Thanks,
> >>>>>>>>>  > > James
> >>>>>>>>>  > >
> >>>>>>>>>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> >>>>>>>>>  > >> As someone who currently fills the platform engineer
> >> role,
> >>> I
> >>>>> can
> >>>>>>> give
> >>>>>>>>>  > this
> >>>>>>>>>  > >> idea a huge +1. My thoughts:
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> 1. I think it depends on exactly what data is pushed
> > into
> >>> the
> >>>>>>> index
> >>>>>>>>>  > (#3).
> >>>>>>>>>  > >> However, assuming the errors you proposed recording, I
> >>> can't
> >>>>> see
> >>>>>>> huge
> >>>>>>>>>  > >> benefits to having more than one dashboard. I would be
> >>> happy
> >>>>> to
> >>>>>> be
> >>>>>>>>>  > >> persuaded otherwise.
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> 2. I would say yes, storing the errors in HDFS in
> >> addition
> >>> to
> >>>>>>>>>  indexing
> >>>>>>>>>  > is
> >>>>>>>>>  > >> a good thing. Using METRON-510
> >>>>>>>>>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a
> >>> case
> >>>>>>> study,
> >>>>>>>>>  > there
> >>>>>>>>>  > >> is the potential in this environment for
> >>> attacker-controlled
> >>>>>> data
> >>>>>>> to
> >>>>>>>>>  > >
> >>>>>>>>>  > > result
> >>>>>>>>>  > >> in processing errors which could be a method of evading
> >>>>> security
> >>>>>>>>>  > >> monitoring. Once an attack is identified, the long term
> >>> HDFS
> >>>>>>> storage
> >>>>>>>>>  > would
> >>>>>>>>>  > >> allow better historical analysis for
> >>> low-and-slow/persistent
> >>>>>>> attacks
> >>>>>>>>>  > (I'm
> >>>>>>>>>  > >> thinking of a method of data exfil that also won't
> >>>>> successfully
> >>>>>>> get
> >>>>>>>>>  > stored
> >>>>>>>>>  > >> in Lucene, but is hard to identify over a short period
> > of
> >>>>> time).
> >>>>>>>>>  > >> - Along this line, I think that there are various parts
> >> of
> >>>>>> Metron
> >>>>>>>>>  > (this
> >>>>>>>>>  > >> included) which could benefit from having method of
> >>>>> configuring
> >>>>>>> data
> >>>>>>>>>  > aging
> >>>>>>>>>  > >> by bucket in HDFS (Following Nick's comments here
> >>>>>>>>>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> 3. I would potentially add a hash of the content that
> >>> failed
> >>>>>>>>>  > validation to
> >>>>>>>>>  > >> help identify repeats over time with less of a concern
> >> that
> >>>>>> you'd
> >>>>>>>>>  have
> >>>>>>>>>  > >
> >>>>>>>>>  > > back
> >>>>>>>>>  > >> to back failures (i.e. instead of storing the value
> >>> itself).
> >>>>>>>>>  > Additionally,
> >>>>>>>>>  > >> I think it's helpful to be able to search all times
> > there
> >>>>> was an
> >>>>>>>>>  > indexing
> >>>>>>>>>  > >> error (instead of it hitting the catch-all).
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> Jon
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <
> >>>>>> jsirota@apache.org>
> >>>>>>>>>  > wrote:
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> We already have a capability to capture bolt errors and
> >>>>>> validation
> >>>>>>>>>  > errors
> >>>>>>>>>  > >> and pipe them into a Kafka topic. I want to propose that
> >> we
> >>>>>>> attach a
> >>>>>>>>>  > >> writer topology to the error and validation failed kafka
> >>>>> topics
> >>>>>> so
> >>>>>>>>>  > that we
> >>>>>>>>>  > >> can (a) create a new ES index for these errors and (b)
> >>>>> create a
> >>>>>>> new
> >>>>>>>>>  > Kibana
> >>>>>>>>>  > >> dashboard to visualize them. The benefit would be that
> >>> errors
> >>>>>> and
> >>>>>>>>>  > >> validation failures would be easier to see and analyze.
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> I am seeking feedback on the following:
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> - How granular would we want this feature to be? Think
> > we
> >>>>> would
> >>>>>>> want
> >>>>>>>>>  > one
> >>>>>>>>>  > >> index/dashboard per source? Or would it be better to
> >>> collapse
> >>>>>>>>>  > everything
> >>>>>>>>>  > >> into the same index?
> >>>>>>>>>  > >> - Do we care about storing these errors in HDFS as well?
> >> Or
> >>>>> is
> >>>>>>>>>  indexing
> >>>>>>>>>  > >> them enough?
> >>>>>>>>>  > >> - What types of errors should we record? I am proposing:
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> For error reporting:
> >>>>>>>>>  > >> --Message failed to parse
> >>>>>>>>>  > >> --Enrichment failed to enrich
> >>>>>>>>>  > >> --Threat intel feed failures
> >>>>>>>>>  > >> --Generic catch-all for all other errors
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> For validation reporting:
> >>>>>>>>>  > >> --What part of message failed validation
> >>>>>>>>>  > >> --What stellar validator caused the failure
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> -------------------
> >>>>>>>>>  > >> Thank you,
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> James Sirota
> >>>>>>>>>  > >> PPMC- Apache Metron (Incubating)
> >>>>>>>>>  > >> jsirota AT apache DOT org
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> --
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> Jon
> >>>>>>>>>  > >>
> >>>>>>>>>  > >> Sent from my mobile device
> >>>>>>>>>  > >
> >>>>>>>>>  > > -------------------
> >>>>>>>>>  > > Thank you,
> >>>>>>>>>  > >
> >>>>>>>>>  > > James Sirota
> >>>>>>>>>  > > PPMC- Apache Metron (Incubating)
> >>>>>>>>>  > > jsirota AT apache DOT org
> >>>>>>>>>  > >
> >>>>>>>>>  > > --
> >>>>>>>>>  > >
> >>>>>>>>>  > > Jon
> >>>>>>>>>  > >
> >>>>>>>>>  > > Sent from my mobile device
> >>>>>>>>>  >
> >>>>>>>>>  > -------------------
> >>>>>>>>>  > Thank you,
> >>>>>>>>>  >
> >>>>>>>>>  > James Sirota
> >>>>>>>>>  > PPMC- Apache Metron (Incubating)
> >>>>>>>>>  > jsirota AT apache DOT org
> >>>>>>>>>  >
> >>>>>>>>>  --
> >>>>>>>>>
> >>>>>>>>>  Jon
> >>>>>>>>>
> >>>>>>>>>  Sent from my mobile device
> >>>>>>> -------------------
> >>>>>>> Thank you,
> >>>>>>>
> >>>>>>> James Sirota
> >>>>>>> PPMC- Apache Metron (Incubating)
> >>>>>>> jsirota AT apache DOT org
> >>>>>>>
> >>>>> --
> >>>>>
> >>>>> Jon
> >>>>>
> >>>>> Sent from my mobile device
> >>>>>
> >>>>
> >> --
> >>
> >> Jon
> >>
> >> Sent from my mobile device
> >>
>
> --

Jon

Sent from my mobile device

Re: [DISCUSS] Error Indexing

Posted by Dima Kovalyov <Di...@sstech.us>.
That's a great topic of discussion.

Throughout the thread, the idea of having a hash of the failed message has
changed; can someone please explain why you plan to use this hash, and how?

- Dima



Re: [DISCUSS] Error Indexing

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
After thinking on this for a few days I recant my previous suggestion of
TupleHash256.  It's still a bit early for SHA-3 - no good reference
implementations/libraries exist (I did some searching and emailing), it is
optimized for hardware but no hardware implementation is widely accessible,
FIPS 140-3 is still not close to finalized, etc.

I think we could simulate the benefits of tuplehash by sorting the tuples,
then doing SHA-256(len(tuple1) | tuple1 | ... | len(tuplen) | tuplen).
Happy to entertain opposing thoughts, such as BLAKE2, etc. but with the
likely users of Metron, I think sticking with FIPS 140-2 is a solid choice.

Jon

> > >> > > >>  > > enforcement. There should be relatively few of the second
> > >> type.
> > >> > > >>  > >
> > >> > > >>  > > With respect to 3, why do you want the error hashed? Why
> not
> > >> just
> > >> > > >>  search
> > >> > > >>  > > for the error text?
> > >> > > >>  > >
> > >> > > >>  > > Thanks,
> > >> > > >>  > > James
> > >> > > >>  > >
> > >> > > >>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> > >> > > >>  > >> As someone who currently fills the platform engineer
> role,
> > I
> > >> can
> > >> > > give
> > >> > > >>  > this
> > >> > > >>  > >> idea a huge +1. My thoughts:
> > >> > > >>  > >>
> > >> > > >>  > >> 1. I think it depends on exactly what data is pushed
into
> > the
> > >> > > index
> > >> > > >>  > (#3).
> > >> > > >>  > >> However, assuming the errors you proposed recording, I
> > can't
> > >> see
> > >> > > huge
> > >> > > >>  > >> benefits to having more than one dashboard. I would be
> > happy
> > >> to
> > >> > be
> > >> > > >>  > >> persuaded otherwise.
> > >> > > >>  > >>
> > >> > > >>  > >> 2. I would say yes, storing the errors in HDFS in
> addition
> > to
> > >> > > >>  indexing
> > >> > > >>  > is
> > >> > > >>  > >> a good thing. Using METRON-510
> > >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a
> > case
> > >> > > study,
> > >> > > >>  > there
> > >> > > >>  > >> is the potential in this environment for
> > attacker-controlled
> > >> > data
> > >> > > to
> > >> > > >>  > >
> > >> > > >>  > > result
> > >> > > >>  > >> in processing errors which could be a method of evading
> > >> security
> > >> > > >>  > >> monitoring. Once an attack is identified, the long term
> > HDFS
> > >> > > storage
> > >> > > >>  > would
> > >> > > >>  > >> allow better historical analysis for
> > low-and-slow/persistent
> > >> > > attacks
> > >> > > >>  > (I'm
> > >> > > >>  > >> thinking of a method of data exfil that also won't
> > >> successfully
> > >> > > get
> > >> > > >>  > stored
> > >> > > >>  > >> in Lucene, but is hard to identify over a short period
of
> > >> time).
> > >> > > >>  > >> - Along this line, I think that there are various parts
> of
> > >> > Metron
> > >> > > >>  > (this
> > >> > > >>  > >> included) which could benefit from having method of
> > >> configuring
> > >> > > data
> > >> > > >>  > aging
> > >> > > >>  > >> included) which could benefit from having a method of
> > >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
> > >> > > >>  > >>
> > >> > > >>  > >> 3. I would potentially add a hash of the content that
> > failed
> > >> > > >>  > validation to
> > >> > > >>  > >> help identify repeats over time with less of a concern
> that
> > >> > you'd
> > >> > > >>  have
> > >> > > >>  > >
> > >> > > >>  > > back
> > >> > > >>  > >> to back failures (i.e. instead of storing the value
> > itself).
> > >> > > >>  > Additionally,
> > >> > > >>  > >> I think it's helpful to be able to search all times
there
> > >> was an
> > >> > > >>  > indexing
> > >> > > >>  > >> error (instead of it hitting the catch-all).
> > >> > > >>  > >>
> > >> > > >>  > >> Jon
> > >> > > >>  > >>
> > >> > > >>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <
> > >> > jsirota@apache.org>
> > >> > > >>  > wrote:
> > >> > > >>  > >>
> > >> > > >>  > >> We already have a capability to capture bolt errors and
> > >> > validation
> > >> > > >>  > errors
> > >> > > >>  > >> and pipe them into a Kafka topic. I want to propose that
> we
> > >> > > attach a
> > >> > > >>  > >> writer topology to the error and validation failed kafka
> > >> topics
> > >> > so
> > >> > > >>  > that we
> > >> > > >>  > >> can (a) create a new ES index for these errors and (b)
> > >> create a
> > >> > > new
> > >> > > >>  > Kibana
> > >> > > >>  > >> dashboard to visualize them. The benefit would be that
> > errors
> > >> > and
> > >> > > >>  > >> validation failures would be easier to see and analyze.
> > >> > > >>  > >>
> > >> > > >>  > >> I am seeking feedback on the following:
> > >> > > >>  > >>
> > >> > > >>  > >> - How granular would we want this feature to be? Think
we
> > >> would
> > >> > > want
> > >> > > >>  > one
> > >> > > >>  > >> index/dashboard per source? Or would it be better to
> > collapse
> > >> > > >>  > everything
> > >> > > >>  > >> into the same index?
> > >> > > >>  > >> - Do we care about storing these errors in HDFS as well?
> Or
> > >> is
> > >> > > >>  indexing
> > >> > > >>  > >> them enough?
> > >> > > >>  > >> - What types of errors should we record? I am proposing:
> > >> > > >>  > >>
> > >> > > >>  > >> For error reporting:
> > >> > > >>  > >> --Message failed to parse
> > >> > > >>  > >> --Enrichment failed to enrich
> > >> > > >>  > >> --Threat intel feed failures
> > >> > > >>  > >> --Generic catch-all for all other errors
> > >> > > >>  > >>
> > >> > > >>  > >> For validation reporting:
> > >> > > >>  > >> --What part of message failed validation
> > >> > > >>  > >> --What stellar validator caused the failure
> > >> > > >>  > >>
> > >> > > >>  > >> -------------------
> > >> > > >>  > >> Thank you,
> > >> > > >>  > >>
> > >> > > >>  > >> James Sirota
> > >> > > >>  > >> PPMC- Apache Metron (Incubating)
> > >> > > >>  > >> jsirota AT apache DOT org
> > >> > > >>  > >>
> > >> > > >>  > >> --
> > >> > > >>  > >>
> > >> > > >>  > >> Jon
> > >> > > >>  > >>
> > >> > > >>  > >> Sent from my mobile device
> > >> > > >>  > >
> > >> > > >>  > > -------------------
> > >> > > >>  > > Thank you,
> > >> > > >>  > >
> > >> > > >>  > > James Sirota
> > >> > > >>  > > PPMC- Apache Metron (Incubating)
> > >> > > >>  > > jsirota AT apache DOT org
> > >> > > >>  > >
> > >> > > >>  > > --
> > >> > > >>  > >
> > >> > > >>  > > Jon
> > >> > > >>  > >
> > >> > > >>  > > Sent from my mobile device
> > >> > > >>  >
> > >> > > >>  > -------------------
> > >> > > >>  > Thank you,
> > >> > > >>  >
> > >> > > >>  > James Sirota
> > >> > > >>  > PPMC- Apache Metron (Incubating)
> > >> > > >>  > jsirota AT apache DOT org
> > >> > > >>  >
> > >> > > >>  --
> > >> > > >>
> > >> > > >>  Jon
> > >> > > >>
> > >> > > >>  Sent from my mobile device
> > >> > >
> > >> > > -------------------
> > >> > > Thank you,
> > >> > >
> > >> > > James Sirota
> > >> > > PPMC- Apache Metron (Incubating)
> > >> > > jsirota AT apache DOT org
> > >> > >
> > >> >
> > >> --
> > >>
> > >> Jon
> > >>
> > >> Sent from my mobile device
> > >>
> > >
> > >
> >
>
> --
>
> Jon
>
> Sent from my mobile device
>

-- 

Jon

Sent from my mobile device

-- 

Jon

Sent from my mobile device

Re: [DISCUSS] Error Indexing

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
So one more thing regarding why I think we should throw an exception on a
failed enrichment.  If we do make something like username a constant field,
in cases where that is used to calculate rawMessage_hash, if it fails to
enrich, the hash would be different compared to when it succeeds.  Of
course I think the initial intent of adding username as a constant field
would be to handle it in the parsers, where that information is provided in
the messages themselves, but how would Threat Intel know the difference?
In my environment I am looking forward to a streaming enrichment that adds
the username, where applicable, anywhere I have an IP.

My hesitant suggestion for a hashing algorithm would be to use
TupleHash256, a NIST-specified SHA-3 derivative (built on cSHAKE) that is
designed for exactly this kind of use case.  Details here
<http://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-185.pdf>.
However, I haven't been able to find a reference implementation of it in
any language, which is a real downside.  A more general SHA3-256
implementation where we handle the ordering ourselves could work as well,
but it would be a less optimal solution.
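
To make the fallback concrete, here is a rough, untested sketch of what
"handle ordering ourselves" could look like with a plain JDK MessageDigest
(the class name is made up, and "SHA3-256" assumes a provider that ships
it, e.g. Java 9+ or BouncyCastle):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

public class FieldHasher {

  // Rough sketch, not Metron code: hash a message's fields in a
  // deterministic (sorted) order so the same content always produces the
  // same digest, regardless of how the fields were ordered in the JSON.
  public static String hashFields(Map<String, Object> fields) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("SHA3-256");
    for (Map.Entry<String, Object> entry : new TreeMap<>(fields).entrySet()) {
      byte[] key = entry.getKey().getBytes(StandardCharsets.UTF_8);
      byte[] value = String.valueOf(entry.getValue()).getBytes(StandardCharsets.UTF_8);
      // Length-prefix each piece so adjacent values can't collide by
      // shifting bytes between the key and the value.
      digest.update(ByteBuffer.allocate(4).putInt(key.length).array());
      digest.update(key);
      digest.update(ByteBuffer.allocate(4).putInt(value.length).array());
      digest.update(value);
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b & 0xff));
    }
    return hex.toString();
  }
}

Swapping the digest for something else later would only touch the
getInstance() line, so the algorithm choice doesn't have to block the rest
of the work.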

Jon

On Thu, Jan 26, 2017 at 10:20 AM Ryan Merriman <me...@gmail.com> wrote:

Jon, I misread the code in the GenericEnrichmentBolt.  The error is
forwarded on so no issues there.

Defaulting to the common fields makes sense.  I will dig into the
GenericEnrichmentBolt more, maybe there is a way to get the error fields
without having to significantly change things.  Any opinion on a hashing
algorithm?

On Wed, Jan 25, 2017 at 9:37 PM, Zeolla@GMail.com <ze...@gmail.com> wrote:

> Although hashing the whole message is better than nothing, it misses a lot
> of the benefits we could get.
>
> While I'd love to have consistency for this field across all of the
> different error.types, it appears that may not be reasonably possible
> because of the parsers.  So, how about something like hash all of the
> constant
> fields
> <https://github.com/apache/incubator-metron/blob/master/
> metron-platform/metron-common/src/main/java/org/apache/
> metron/common/Constants.java>
> excluding
> timestamp and original_string unless it is a parser, in which case hash
the
> entire message?  This gives us some measure of event uniqueness and it can
> grow as we define additional constant fields (I recall discussing with
> someone else on the list regarding expanding those standard fields to
> include things like usernames but I can't find the specific email
> exchange).
>
> Because some enrichments can be heavily relied on, I think it makes sense
> to put a message onto the error queue when it throws an exception.  Not
> only does this help troubleshoot edge cases, but it makes issues more
> obvious when assembling a new enrichment in dev/test.  I can't think of a
> scenario currently where an enrichment would only be "best effort" and
that
> I wouldn't want that error indexed and retrievable.  However, this gets
> interesting when talking about the various options to solve the "Enrich
> enrichment" discussion from earlier in the month.  We can keep that part
of
> this separate though, as I don't think that's being actively pursued right
> now.
>
> Jon
>
> On Wed, Jan 25, 2017 at 10:49 AM David Lyle <dl...@gmail.com> wrote:
>
> RE: separate JIRA for MPack/Ansible. No objection to tracking them
> separately, but for this item to be complete, you'll need both the feature
> and the ability to install it.
>
> -D...
>
>
> On Tue, Jan 24, 2017 at 5:33 PM, Ryan Merriman <me...@gmail.com>
> wrote:
>
> > Assuming we're going to write all errors to a single error topic, I
think
> > it makes sense to agree on an error message schema and handle errors
> across
> > the 3 different topologies in the same way with a single implementation.
> > The implementation in ParserBolt (ErrorUtils.handleError) produces the
> most
> > verbose error object so I think it's a good candidate for the single
> > implementation.  Here is the message structure it currently produces:
> >
> > {
> >   "exception": "java.lang.Exception: there was an error",
> >   "hostname": "host",
> >   "stack": "java.lang.Exception: ...",
> >   "time": 1485295416563,
> >   "message": "there was an error",
> >   "rawMessage": "raw message",
> >   "rawMessage_bytes": [],
> >   "source.type": "bro_error"
> > }
> >
> > From our discussion so far we need to add a couple fields:  an error
type
> > and hash id.  Adding these to the message looks like:
> >
> > {
> >   "exception": "java.lang.Exception: there was an error",
> >   "hostname": "host",
> >   "stack": "java.lang.Exception: ...",
> >   "time": 1485295416563,
> >   "message": "there was an error",
> >   "rawMessage": "raw message",
> >   "rawMessage_bytes": [],
> >   "source.type": "bro_error",
> >   "error.type": "parser_error",
> >   "rawMessage_hash": "dde41b9920954f94066daf6291fb58a9"
> > }
> >
> > We should also consider expanding the error types I listed earlier.
> > Instead of just having "indexing_error" we could have
> > "elasticsearch_indexing_error", "hdfs_indexing_error" and so on.
> >
> > Jon, if an exception happens in an enrichment or threat intel bolt the
> > message is passed along with no error thrown (only logged).  Everywhere
> > else I'm having trouble identifying specific fields that should be
> hashed.
> > Would hashing the message in every case be acceptable?  Do you know of a
> > place where we could hash a field instead?  On the topic of exceptions
in
> > enrichments, are we ok with an error only being logged and not added to
> the
> > message or emitted to the error queue?
> >
> >
> >
> > On Tue, Jan 24, 2017 at 3:10 PM, Ryan Merriman <me...@gmail.com>
> > wrote:
> >
> > > That use case makes sense to me.  I don't think it will require that
> much
> > > additional effort either.
> > >
> > > On Tue, Jan 24, 2017 at 1:02 PM, Zeolla@GMail.com <ze...@gmail.com>
> > > wrote:
> > >
> > >> Regarding error vs validation - Either way I'm not very concerned.  I
> > >> initially assumed they would be combined and agree with that
approach,
> > but
> > >> splitting them out isn't a very big deal to me either.
> > >>
> > >> Re: Ryan.  Yes, exactly.  In the case of a parser issue (or anywhere
> > else
> > >> where it's not possible to pick out the exact thing causing the
issue)
> > it
> > >> would be a hash of the complete message.
> > >>
> > >> Regarding the architecture, I mostly agree with James except that I
> > think
> > >> step 3 needs to also be able to somehow group errors via the original
> > >> data (identify
> > >> replays, identify repeat issues with data in a specific field, issues
> > with
> > >> consistently different data, etc.).  This is essentially the first
> step
> > of
> > >> troubleshooting, which I assume you are doing if you're looking at
the
> > >> error dashboard.
> > >>
> > >> If the hash gets moved out of the initial implementation, I'm fairly
> > >> certain you lose this ability.  The point here isn't to handle long
> > fields
> > >> (although that's a benefit of this approach), it's to attach a unique
> > >> identifier to the error/validation issue message that links it to the
> > >> original problem.  I'd be happy to consider alternative solutions to
> > this
> > >> problem (for instance, actually sending across the data itself) I
just
> > >> haven't been able to think of another way to do this that I like
> better.
> > >>
> > >> Jon
> > >>
> > >> On Tue, Jan 24, 2017 at 1:13 PM Ryan Merriman <me...@gmail.com>
> > >> wrote:
> > >>
> > >> > We also need a JIRA for any install/Ansible/MPack work needed.
> > >> >
> > >> > On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <js...@apache.org>
> > >> wrote:
> > >> >
> > >> > > Now that I had some time to think about it I would collapse all
> > error
> > >> and
> > >> > > validation topics into one.  We can differentiate between
> different
> > >> views
> > >> > > of the data (split by error source etc) via Kibana dashboards.  I
> > >> would
> > >> > > implement this feature incrementally.  First I would modify all
> the
> > >> bolts
> > >> > > to log to a single topic.  Second, I would get the error indexing
> > >> done by
> > >> > > attaching the indexing topology to the error topic. Third I would
> > >> create
> > >> > > the necessary dashboards to view errors and validation failures
by
> > >> > source.
> > >> > > Lastly, I would file a follow-on JIRA to introduce hashing of
> errors
> > >> or
> > >> > > fields that are too long.  It seems like a separate feature that
> we
> > >> need
> > >> > to
> > >> > > think through.  We may need a stellar function around that.
> > >> > >
> > >> > > Thanks,
> > >> > > James
> > >> > >
> > >> > > 24.01.2017, 10:25, "Ryan Merriman" <me...@gmail.com>:
> > >> > > > I understand what Jon is talking about. He's proposing we hash
> the
> > >> > value
> > >> > > > that caused the error, not necessarily the error message
itself.
> > >> For an
> > >> > > > enrichment this is easy. Just pass along the field value that
> > failed
> > >> > > > enrichment. For other cases the field that caused the error may
> > not
> > >> be
> > >> > so
> > >> > > > obvious. Take parser validation for example. The message is
> > >> validated
> > >> > as
> > >> > > > a whole and it may not be easy to determine which field is the
> > >> cause.
> > >> > In
> > >> > > > that case would a hash of the whole message work?
> > >> > > >
> > >> > > > There is a broader architectural discussion that needs to
happen
> > >> before
> > >> > > we
> > >> > > > can implement this. Currently we have an indexing topology that
> > >> reads
> > >> > > from
> > >> > > > 1 topic and writes messages to ES but errors are written to
> > several
> > >> > > > different topics:
> > >> > > >
> > >> > > >    - parser_error
> > >> > > >    - parser_invalid
> > >> > > >    - enrichments_error
> > >> > > >    - threatintel_error
> > >> > > >    - indexing_error
> > >> > > >
> > >> > > > I can see 4 possible approaches to implementing this:
> > >> > > >
> > >> > > >    1. Create an index topology for each error topic
> > >> > > >       1. Good because we can easily reuse the indexing topology
> > and
> > >> > would
> > >> > > >       require the least development effort
> > >> > > >       2. Bad because it would consume a lot of extra worker
> slots
> > >> > > >    2. Move the topic name into the error JSON message as a new
> > >> > > "error_type"
> > >> > > >    field and write all messages to the indexing topic
> > >> > > >       1. Good because we don't need to create a new topology
> > >> > > >       2. Bad because we would be flowing data and errors
through
> > the
> > >> > same
> > >> > > >       topology. A spike in errors could affect message
indexing.
> > >> > > >    3. Compromise between 1 and 2. Create another indexing
> topology
> > >> that
> > >> > > is
> > >> > > >    dedicated to indexing errors. Move the topic name into the
> > error
> > >> > JSON
> > >> > > >    message as a new "error_type" field and write all errors to
a
> > >> single
> > >> > > error
> > >> > > >    topic.
> > >> > > >    4. Write a completely new topology with multiple spouts (1
> for
> > >> each
> > >> > > >    error type listed above) that all feed into a single
> > >> > > BulkMessageWriterBolt.
> > >> > > >       1. Good because the current topologies would not need to
> > >> change
> > >> > > >       2. Bad because it would require the most development
> effort,
> > >> > would
> > >> > > >       not reuse existing topologies and takes up more worker
> slots
> > >> > than 3
> > >> > > >
> > >> > > > Are there other approaches I haven't thought of? I think 1 and
2
> > are
> > >> > off
> > >> > > > the table because they are shortcuts and not good long-term
> > >> solutions.
> > >> > 3
> > >> > > > would be my choice because it introduces less complexity than
4.
> > >> > > Thoughts?
> > >> > > >
> > >> > > > Ryan
> > >> > > >
> > >> > > > On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <
> > zeolla@gmail.com
> > >> >
> > >> > > wrote:
> > >> > > >
> > >> > > >>  In that case the hash would be of the value in the IP field,
> > such
> > >> as
> > >> > > >>  sha3(8.8.8.8).
> > >> > > >>
> > >> > > >>  Jon
> > >> > > >>
> > >> > > >>  On Mon, Jan 23, 2017, 6:41 PM James Sirota <
> jsirota@apache.org>
> > >> > wrote:
> > >> > > >>
> > >> > > >>  > Jon,
> > >> > > >>  >
> > >> > > >>  > I am still not entirely following why we would want to use
> > >> hashing.
> > >> > > For
> > >> > > >>  > example if my error is "Your IP field is invalid and failed
> > >> > > validation"
> > >> > > >>  > hashing this error string will always result in the same
> hash.
> > >> Why
> > >> > > not
> > >> > > >>  > just use the actual error string? Can you provide an
example
> > >> where
> > >> > > you
> > >> > > >>  > would use it?
> > >> > > >>  >
> > >> > > >>  > Thanks,
> > >> > > >>  > James
> > >> > > >>  >
> > >> > > >>  > 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
> > >> > > >>  > > For 1 - I'm good with that.
> > >> > > >>  > >
> > >> > > >>  > > I'm talking about hashing the relevant content itself not
> > the
> > >> > > error.
> > >> > > >>  Some
> > >> > > >>  > > benefits are (1) minimize load on search index (there's
> > >> minimal
> > >> > > benefit
> > >> > > >>  > in
> > >> > > >>  > > spending the CPU and disk to keep it at full fidelity
> > >> (tokenize
> > >> > and
> > >> > > >>  > store))
> > >> > > >>  > > (2) provide something to key on for dashboards (assuming
a
> > >> good
> > >> > > hash
> > >> > > >>  > > algorithm that avoids collisions and is second preimage
> > >> > resistant)
> > >> > > and
> > >> > > >>  > (3)
> > >> > > >>  > > specific to errors, if the issue is that it failed to
> > index, a
> > >> > hash
> > >> > > >>  gives
> > >> > > >>  > > us some protection that the issue will not occur twice.
> > >> > > >>  > >
> > >> > > >>  > > Jon
> > >> > > >>  > >
> > >> > > >>  > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <
> > >> jsirota@apache.org>
> > >> > > wrote:
> > >> > > >>  > >
> > >> > > >>  > > Jon,
> > >> > > >>  > >
> > >> > > >>  > > With regards to 1, collapsing to a single dashboard for
> each
> > >> > would
> > >> > > be
> > >> > > >>  > > fine. So we would have one error index and one "failed to
> > >> > validate"
> > >> > > >>  > > index. The distinction is that errors would be things
that
> > >> went
> > >> > > wrong
> > >> > > >>  > > during stream processing (failed to parse, etc...), while
> > >> > > validation
> > >> > > >>  > > failures are messages that explicitly failed stellar
> > >> > > validation/schema
> > >> > > >>  > > enforcement. There should be relatively few of the second
> > >> type.
> > >> > > >>  > >
> > >> > > >>  > > With respect to 3, why do you want the error hashed? Why
> not
> > >> just
> > >> > > >>  search
> > >> > > >>  > > for the error text?
> > >> > > >>  > >
> > >> > > >>  > > Thanks,
> > >> > > >>  > > James
> > >> > > >>  > >
> > >> > > >>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> > >> > > >>  > >> As someone who currently fills the platform engineer
> role,
> > I
> > >> can
> > >> > > give
> > >> > > >>  > this
> > >> > > >>  > >> idea a huge +1. My thoughts:
> > >> > > >>  > >>
> > >> > > >>  > >> 1. I think it depends on exactly what data is pushed
into
> > the
> > >> > > index
> > >> > > >>  > (#3).
> > >> > > >>  > >> However, assuming the errors you proposed recording, I
> > can't
> > >> see
> > >> > > huge
> > >> > > >>  > >> benefits to having more than one dashboard. I would be
> > happy
> > >> to
> > >> > be
> > >> > > >>  > >> persuaded otherwise.
> > >> > > >>  > >>
> > >> > > >>  > >> 2. I would say yes, storing the errors in HDFS in
> addition
> > to
> > >> > > >>  indexing
> > >> > > >>  > is
> > >> > > >>  > >> a good thing. Using METRON-510
> > >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a
> > case
> > >> > > study,
> > >> > > >>  > there
> > >> > > >>  > >> is the potential in this environment for
> > attacker-controlled
> > >> > data
> > >> > > to
> > >> > > >>  > >
> > >> > > >>  > > result
> > >> > > >>  > >> in processing errors which could be a method of evading
> > >> security
> > >> > > >>  > >> monitoring. Once an attack is identified, the long term
> > HDFS
> > >> > > storage
> > >> > > >>  > would
> > >> > > >>  > >> allow better historical analysis for
> > low-and-slow/persistent
> > >> > > attacks
> > >> > > >>  > (I'm
> > >> > > >>  > >> thinking of a method of data exfil that also won't
> > >> successfully
> > >> > > get
> > >> > > >>  > stored
> > >> > > >>  > >> in Lucene, but is hard to identify over a short period
of
> > >> time).
> > >> > > >>  > >> - Along this line, I think that there are various parts
> of
> > >> > Metron
> > >> > > >>  > (this
> > >> > > >>  > >> included) which could benefit from having a method of
> > >> configuring
> > >> > > data
> > >> > > >>  > aging
> > >> > > >>  > >> by bucket in HDFS (Following Nick's comments here
> > >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
> > >> > > >>  > >>
> > >> > > >>  > >> 3. I would potentially add a hash of the content that
> > failed
> > >> > > >>  > validation to
> > >> > > >>  > >> help identify repeats over time with less of a concern
> that
> > >> > you'd
> > >> > > >>  have
> > >> > > >>  > >
> > >> > > >>  > > back
> > >> > > >>  > >> to back failures (i.e. instead of storing the value
> > itself).
> > >> > > >>  > Additionally,
> > >> > > >>  > >> I think it's helpful to be able to search all times
there
> > >> was an
> > >> > > >>  > indexing
> > >> > > >>  > >> error (instead of it hitting the catch-all).
> > >> > > >>  > >>
> > >> > > >>  > >> Jon
> > >> > > >>  > >>
> > >> > > >>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <
> > >> > jsirota@apache.org>
> > >> > > >>  > wrote:
> > >> > > >>  > >>
> > >> > > >>  > >> We already have a capability to capture bolt errors and
> > >> > validation
> > >> > > >>  > errors
> > >> > > >>  > >> and pipe them into a Kafka topic. I want to propose that
> we
> > >> > > attach a
> > >> > > >>  > >> writer topology to the error and validation failed kafka
> > >> topics
> > >> > so
> > >> > > >>  > that we
> > >> > > >>  > >> can (a) create a new ES index for these errors and (b)
> > >> create a
> > >> > > new
> > >> > > >>  > Kibana
> > >> > > >>  > >> dashboard to visualize them. The benefit would be that
> > errors
> > >> > and
> > >> > > >>  > >> validation failures would be easier to see and analyze.
> > >> > > >>  > >>
> > >> > > >>  > >> I am seeking feedback on the following:
> > >> > > >>  > >>
> > >> > > >>  > >> - How granular would we want this feature to be? Think
we
> > >> would
> > >> > > want
> > >> > > >>  > one
> > >> > > >>  > >> index/dashboard per source? Or would it be better to
> > collapse
> > >> > > >>  > everything
> > >> > > >>  > >> into the same index?
> > >> > > >>  > >> - Do we care about storing these errors in HDFS as well?
> Or
> > >> is
> > >> > > >>  indexing
> > >> > > >>  > >> them enough?
> > >> > > >>  > >> - What types of errors should we record? I am proposing:
> > >> > > >>  > >>
> > >> > > >>  > >> For error reporting:
> > >> > > >>  > >> --Message failed to parse
> > >> > > >>  > >> --Enrichment failed to enrich
> > >> > > >>  > >> --Threat intel feed failures
> > >> > > >>  > >> --Generic catch-all for all other errors
> > >> > > >>  > >>
> > >> > > >>  > >> For validation reporting:
> > >> > > >>  > >> --What part of message failed validation
> > >> > > >>  > >> --What stellar validator caused the failure
> > >> > > >>  > >>
> > >> > > >>  > >> -------------------
> > >> > > >>  > >> Thank you,
> > >> > > >>  > >>
> > >> > > >>  > >> James Sirota
> > >> > > >>  > >> PPMC- Apache Metron (Incubating)
> > >> > > >>  > >> jsirota AT apache DOT org
> > >> > > >>  > >>
> > >> > > >>  > >> --
> > >> > > >>  > >>
> > >> > > >>  > >> Jon
> > >> > > >>  > >>
> > >> > > >>  > >> Sent from my mobile device
> > >> > > >>  > >
> > >> > > >>  > > -------------------
> > >> > > >>  > > Thank you,
> > >> > > >>  > >
> > >> > > >>  > > James Sirota
> > >> > > >>  > > PPMC- Apache Metron (Incubating)
> > >> > > >>  > > jsirota AT apache DOT org
> > >> > > >>  > >
> > >> > > >>  > > --
> > >> > > >>  > >
> > >> > > >>  > > Jon
> > >> > > >>  > >
> > >> > > >>  > > Sent from my mobile device
> > >> > > >>  >
> > >> > > >>  > -------------------
> > >> > > >>  > Thank you,
> > >> > > >>  >
> > >> > > >>  > James Sirota
> > >> > > >>  > PPMC- Apache Metron (Incubating)
> > >> > > >>  > jsirota AT apache DOT org
> > >> > > >>  >
> > >> > > >>  --
> > >> > > >>
> > >> > > >>  Jon
> > >> > > >>
> > >> > > >>  Sent from my mobile device
> > >> > >
> > >> > > -------------------
> > >> > > Thank you,
> > >> > >
> > >> > > James Sirota
> > >> > > PPMC- Apache Metron (Incubating)
> > >> > > jsirota AT apache DOT org
> > >> > >
> > >> >
> > >> --
> > >>
> > >> Jon
> > >>
> > >> Sent from my mobile device
> > >>
> > >
> > >
> >
>
> --
>
> Jon
>
> Sent from my mobile device
>

-- 

Jon

Sent from my mobile device

Re: [DISCUSS] Error Indexing

Posted by Ryan Merriman <me...@gmail.com>.
Jon, I misread the code in the GenericEnrichmentBolt.  The error is
forwarded on, so no issues there.

Defaulting to the common fields makes sense.  I will dig into the
GenericEnrichmentBolt more; maybe there is a way to get the error fields
without having to significantly change things.  Any opinion on a hashing
algorithm?

On Wed, Jan 25, 2017 at 9:37 PM, Zeolla@GMail.com <ze...@gmail.com> wrote:

> Although hashing the whole message is better than nothing, it misses a lot
> of the benefits we could get.
>
> While I'd love to have consistency for this field across all of the
> different error.types, it appears that may not be reasonably possible
> because of the parsers.  So, how about something like hash all of the
> constant
> fields
> <https://github.com/apache/incubator-metron/blob/master/
> metron-platform/metron-common/src/main/java/org/apache/
> metron/common/Constants.java>
> excluding
> timestamp and original_string unless it is a parser, in which case hash the
> entire message?  This gives us some measure of event uniqueness and it can
> grow as we define additional constant fields (I recall discussing with
> someone else on the list regarding expanding those standard fields to
> include things like usernames but I can't find the specific email
> exchange).
>
> Because some enrichments can be heavily relied on, I think it makes sense
> to put a message onto the error queue when it throws an exception.  Not
> only does this help troubleshoot edge cases, but it makes issues more
> obvious when assembling a new enrichment in dev/test.  I can't think of a
> scenario currently where an enrichment would only be "best effort" and that
> I wouldn't want that error indexed and retrievable.  However, this gets
> interesting when talking about the various options to solve the "Enrich
> enrichment" discussion from earlier in the month.  We can keep that part of
> this separate though, as I don't think that's being actively pursued right
> now.
>
> Jon
>
> On Wed, Jan 25, 2017 at 10:49 AM David Lyle <dl...@gmail.com> wrote:
>
> RE: separate JIRA for MPack/Ansible. No objection to tracking them
> separately, but for this item to be complete, you'll need both the feature
> and the ability to install it.
>
> -D...
>
>
> On Tue, Jan 24, 2017 at 5:33 PM, Ryan Merriman <me...@gmail.com>
> wrote:
>
> > Assuming we're going to write all errors to a single error topic, I think
> > it makes sense to agree on an error message schema and handle errors
> across
> > the 3 different topologies in the same way with a single implementation.
> > The implementation in ParserBolt (ErrorUtils.handleError) produces the
> most
> > verbose error object so I think it's a good candidate for the single
> > implementation.  Here is the message structure it currently produces:
> >
> > {
> >   "exception": "java.lang.Exception: there was an error",
> >   "hostname": "host",
> >   "stack": "java.lang.Exception: ...",
> >   "time": 1485295416563,
> >   "message": "there was an error",
> >   "rawMessage": "raw message",
> >   "rawMessage_bytes": [],
> >   "source.type": "bro_error"
> > }
> >
> > From our discussion so far we need to add a couple fields:  an error type
> > and hash id.  Adding these to the message looks like:
> >
> > {
> >   "exception": "java.lang.Exception: there was an error",
> >   "hostname": "host",
> >   "stack": "java.lang.Exception: ...",
> >   "time": 1485295416563,
> >   "message": "there was an error",
> >   "rawMessage": "raw message",
> >   "rawMessage_bytes": [],
> >   "source.type": "bro_error",
> >   "error.type": "parser_error",
> >   "rawMessage_hash": "dde41b9920954f94066daf6291fb58a9"
> > }
> >
> > We should also consider expanding the error types I listed earlier.
> > Instead of just having "indexing_error" we could have
> > "elasticsearch_indexing_error", "hdfs_indexing_error" and so on.
> >
> > Jon, if an exception happens in an enrichment or threat intel bolt the
> > message is passed along with no error thrown (only logged).  Everywhere
> > else I'm having trouble identifying specific fields that should be
> hashed.
> > Would hashing the message in every case be acceptable?  Do you know of a
> > place where we could hash a field instead?  On the topic of exceptions in
> > enrichments, are we ok with an error only being logged and not added to
> the
> > message or emitted to the error queue?
> >
> >
> >
> > On Tue, Jan 24, 2017 at 3:10 PM, Ryan Merriman <me...@gmail.com>
> > wrote:
> >
> > > That use case makes sense to me.  I don't think it will require that
> much
> > > additional effort either.
> > >
> > > On Tue, Jan 24, 2017 at 1:02 PM, Zeolla@GMail.com <ze...@gmail.com>
> > > wrote:
> > >
> > >> Regarding error vs validation - Either way I'm not very concerned.  I
> > >> initially assumed they would be combined and agree with that approach,
> > but
> > >> splitting them out isn't a very big deal to me either.
> > >>
> > >> Re: Ryan.  Yes, exactly.  In the case of a parser issue (or anywhere
> > else
> > >> where it's not possible to pick out the exact thing causing the issue)
> > it
> > >> would be a hash of the complete message.
> > >>
> > >> Regarding the architecture, I mostly agree with James except that I
> > think
> > >> step 3 needs to also be able to somehow group errors via the original
> > >> data (identify
> > >> replays, identify repeat issues with data in a specific field, issues
> > with
> > >> consistently different data, etc.).  This is essentially the first
> step
> > of
> > >> troubleshooting, which I assume you are doing if you're looking at the
> > >> error dashboard.
> > >>
> > >> If the hash gets moved out of the initial implementation, I'm fairly
> > >> certain you lose this ability.  The point here isn't to handle long
> > fields
> > >> (although that's a benefit of this approach), it's to attach a unique
> > >> identifier to the error/validation issue message that links it to the
> > >> original problem.  I'd be happy to consider alternative solutions to
> > this
> > >> problem (for instance, actually sending across the data itself) I just
> > >> haven't been able to think of another way to do this that I like
> better.
> > >>
> > >> Jon
> > >>
> > >> On Tue, Jan 24, 2017 at 1:13 PM Ryan Merriman <me...@gmail.com>
> > >> wrote:
> > >>
> > >> > We also need a JIRA for any install/Ansible/MPack work needed.
> > >> >
> > >> > On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <js...@apache.org>
> > >> wrote:
> > >> >
> > >> > > Now that I had some time to think about it I would collapse all
> > error
> > >> and
> > >> > > validation topics into one.  We can differentiate between
> different
> > >> views
> > >> > > of the data (split by error source etc) via Kibana dashboards.  I
> > >> would
> > >> > > implement this feature incrementally.  First I would modify all
> the
> > >> bolts
> > >> > > to log to a single topic.  Second, I would get the error indexing
> > >> done by
> > >> > > attaching the indexing topology to the error topic. Third I would
> > >> create
> > >> > > the necessary dashboards to view errors and validation failures by
> > >> > source.
> > >> > > Lastly, I would file a follow-on JIRA to introduce hashing of
> errors
> > >> or
> > >> > > fields that are too long.  It seems like a separate feature that
> we
> > >> need
> > >> > to
> > >> > > think through.  We may need a stellar function around that.
> > >> > >
> > >> > > Thanks,
> > >> > > James
> > >> > >
> > >> > > 24.01.2017, 10:25, "Ryan Merriman" <me...@gmail.com>:
> > >> > > > I understand what Jon is talking about. He's proposing we hash
> the
> > >> > value
> > >> > > > that caused the error, not necessarily the error message itself.
> > >> For an
> > >> > > > enrichment this is easy. Just pass along the field value that
> > failed
> > >> > > > enrichment. For other cases the field that caused the error may
> > not
> > >> be
> > >> > so
> > >> > > > obvious. Take parser validation for example. The message is
> > >> validated
> > >> > as
> > >> > > > a whole and it may not be easy to determine which field is the
> > >> cause.
> > >> > In
> > >> > > > that case would a hash of the whole message work?
> > >> > > >
> > >> > > > There is a broader architectural discussion that needs to happen
> > >> before
> > >> > > we
> > >> > > > can implement this. Currently we have an indexing topology that
> > >> reads
> > >> > > from
> > >> > > > 1 topic and writes messages to ES but errors are written to
> > several
> > >> > > > different topics:
> > >> > > >
> > >> > > >    - parser_error
> > >> > > >    - parser_invalid
> > >> > > >    - enrichments_error
> > >> > > >    - threatintel_error
> > >> > > >    - indexing_error
> > >> > > >
> > >> > > > I can see 4 possible approaches to implementing this:
> > >> > > >
> > >> > > >    1. Create an index topology for each error topic
> > >> > > >       1. Good because we can easily reuse the indexing topology
> > and
> > >> > would
> > >> > > >       require the least development effort
> > >> > > >       2. Bad because it would consume a lot of extra worker
> slots
> > >> > > >    2. Move the topic name into the error JSON message as a new
> > >> > > "error_type"
> > >> > > >    field and write all messages to the indexing topic
> > >> > > >       1. Good because we don't need to create a new topology
> > >> > > >       2. Bad because we would be flowing data and errors through
> > the
> > >> > same
> > >> > > >       topology. A spike in errors could affect message indexing.
> > >> > > >    3. Compromise between 1 and 2. Create another indexing
> topology
> > >> that
> > >> > > is
> > >> > > >    dedicated to indexing errors. Move the topic name into the
> > error
> > >> > JSON
> > >> > > >    message as a new "error_type" field and write all errors to a
> > >> single
> > >> > > error
> > >> > > >    topic.
> > >> > > >    4. Write a completely new topology with multiple spouts (1
> for
> > >> each
> > >> > > >    error type listed above) that all feed into a single
> > >> > > BulkMessageWriterBolt.
> > >> > > >       1. Good because the current topologies would not need to
> > >> change
> > >> > > >       2. Bad because it would require the most development
> effort,
> > >> > would
> > >> > > >       not reuse existing topologies and takes up more worker
> slots
> > >> > than 3
> > >> > > >
> > >> > > > Are there other approaches I haven't thought of? I think 1 and 2
> > are
> > >> > off
> > >> > > > the table because they are shortcuts and not good long-term
> > >> solutions.
> > >> > 3
> > >> > > > would be my choice because it introduces less complexity than 4.
> > >> > > Thoughts?
> > >> > > >
> > >> > > > Ryan
> > >> > > >
> > >> > > > On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <
> > zeolla@gmail.com
> > >> >
> > >> > > wrote:
> > >> > > >
> > >> > > >>  In that case the hash would be of the value in the IP field,
> > such
> > >> as
> > >> > > >>  sha3(8.8.8.8).
> > >> > > >>
> > >> > > >>  Jon
> > >> > > >>
> > >> > > >>  On Mon, Jan 23, 2017, 6:41 PM James Sirota <
> jsirota@apache.org>
> > >> > wrote:
> > >> > > >>
> > >> > > >>  > Jon,
> > >> > > >>  >
> > >> > > >>  > I am still not entirely following why we would want to use
> > >> hashing.
> > >> > > For
> > >> > > >>  > example if my error is "Your IP field is invalid and failed
> > >> > > validation"
> > >> > > >>  > hashing this error string will always result in the same
> hash.
> > >> Why
> > >> > > not
> > >> > > >>  > just use the actual error string? Can you provide an example
> > >> where
> > >> > > you
> > >> > > >>  > would use it?
> > >> > > >>  >
> > >> > > >>  > Thanks,
> > >> > > >>  > James
> > >> > > >>  >
> > >> > > >>  > 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
> > >> > > >>  > > For 1 - I'm good with that.
> > >> > > >>  > >
> > >> > > >>  > > I'm talking about hashing the relevant content itself not
> > the
> > >> > > error.
> > >> > > >>  Some
> > >> > > >>  > > benefits are (1) minimize load on search index (there's
> > >> minimal
> > >> > > benefit
> > >> > > >>  > in
> > >> > > >>  > > spending the CPU and disk to keep it at full fidelity
> > >> (tokenize
> > >> > and
> > >> > > >>  > store))
> > >> > > >>  > > (2) provide something to key on for dashboards (assuming a
> > >> good
> > >> > > hash
> > >> > > >>  > > algorithm that avoids collisions and is second preimage
> > >> > resistant)
> > >> > > and
> > >> > > >>  > (3)
> > >> > > >>  > > specific to errors, if the issue is that it failed to
> > index, a
> > >> > hash
> > >> > > >>  gives
> > >> > > >>  > > us some protection that the issue will not occur twice.
> > >> > > >>  > >
> > >> > > >>  > > Jon
> > >> > > >>  > >
> > >> > > >>  > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <
> > >> jsirota@apache.org>
> > >> > > wrote:
> > >> > > >>  > >
> > >> > > >>  > > Jon,
> > >> > > >>  > >
> > >> > > >>  > > With regards to 1, collapsing to a single dashboard for
> each
> > >> > would
> > >> > > be
> > >> > > >>  > > fine. So we would have one error index and one "failed to
> > >> > validate"
> > >> > > >>  > > index. The distinction is that errors would be things that
> > >> went
> > >> > > wrong
> > >> > > >>  > > during stream processing (failed to parse, etc...), while
> > >> > > validation
> > >> > > >>  > > failures are messages that explicitly failed stellar
> > >> > > validation/schema
> > >> > > >>  > > enforcement. There should be relatively few of the second
> > >> type.
> > >> > > >>  > >
> > >> > > >>  > > With respect to 3, why do you want the error hashed? Why
> not
> > >> just
> > >> > > >>  search
> > >> > > >>  > > for the error text?
> > >> > > >>  > >
> > >> > > >>  > > Thanks,
> > >> > > >>  > > James
> > >> > > >>  > >
> > >> > > >>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> > >> > > >>  > >> As someone who currently fills the platform engineer
> role,
> > I
> > >> can
> > >> > > give
> > >> > > >>  > this
> > >> > > >>  > >> idea a huge +1. My thoughts:
> > >> > > >>  > >>
> > >> > > >>  > >> 1. I think it depends on exactly what data is pushed into
> > the
> > >> > > index
> > >> > > >>  > (#3).
> > >> > > >>  > >> However, assuming the errors you proposed recording, I
> > can't
> > >> see
> > >> > > huge
> > >> > > >>  > >> benefits to having more than one dashboard. I would be
> > happy
> > >> to
> > >> > be
> > >> > > >>  > >> persuaded otherwise.
> > >> > > >>  > >>
> > >> > > >>  > >> 2. I would say yes, storing the errors in HDFS in
> addition
> > to
> > >> > > >>  indexing
> > >> > > >>  > is
> > >> > > >>  > >> a good thing. Using METRON-510
> > >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a
> > case
> > >> > > study,
> > >> > > >>  > there
> > >> > > >>  > >> is the potential in this environment for
> > attacker-controlled
> > >> > data
> > >> > > to
> > >> > > >>  > >
> > >> > > >>  > > result
> > >> > > >>  > >> in processing errors which could be a method of evading
> > >> security
> > >> > > >>  > >> monitoring. Once an attack is identified, the long term
> > HDFS
> > >> > > storage
> > >> > > >>  > would
> > >> > > >>  > >> allow better historical analysis for
> > low-and-slow/persistent
> > >> > > attacks
> > >> > > >>  > (I'm
> > >> > > >>  > >> thinking of a method of data exfil that also won't
> > >> successfully
> > >> > > get
> > >> > > >>  > stored
> > >> > > >>  > >> in Lucene, but is hard to identify over a short period of
> > >> time).
> > >> > > >>  > >> - Along this line, I think that there are various parts
> of
> > >> > Metron
> > >> > > >>  > (this
> > >> > > >>  > >> included) which could benefit from having a method of
> > >> configuring
> > >> > > data
> > >> > > >>  > aging
> > >> > > >>  > >> by bucket in HDFS (Following Nick's comments here
> > >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
> > >> > > >>  > >>
> > >> > > >>  > >> 3. I would potentially add a hash of the content that
> > failed
> > >> > > >>  > validation to
> > >> > > >>  > >> help identify repeats over time with less of a concern
> that
> > >> > you'd
> > >> > > >>  have
> > >> > > >>  > >
> > >> > > >>  > > back
> > >> > > >>  > >> to back failures (i.e. instead of storing the value
> > itself).
> > >> > > >>  > Additionally,
> > >> > > >>  > >> I think it's helpful to be able to search all times there
> > >> was an
> > >> > > >>  > indexing
> > >> > > >>  > >> error (instead of it hitting the catch-all).
> > >> > > >>  > >>
> > >> > > >>  > >> Jon
> > >> > > >>  > >>
> > >> > > >>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <
> > >> > jsirota@apache.org>
> > >> > > >>  > wrote:
> > >> > > >>  > >>
> > >> > > >>  > >> We already have a capability to capture bolt errors and
> > >> > validation
> > >> > > >>  > errors
> > >> > > >>  > >> and pipe them into a Kafka topic. I want to propose that
> we
> > >> > > attach a
> > >> > > >>  > >> writer topology to the error and validation failed kafka
> > >> topics
> > >> > so
> > >> > > >>  > that we
> > >> > > >>  > >> can (a) create a new ES index for these errors and (b)
> > >> create a
> > >> > > new
> > >> > > >>  > Kibana
> > >> > > >>  > >> dashboard to visualize them. The benefit would be that
> > errors
> > >> > and
> > >> > > >>  > >> validation failures would be easier to see and analyze.
> > >> > > >>  > >>
> > >> > > >>  > >> I am seeking feedback on the following:
> > >> > > >>  > >>
> > >> > > >>  > >> - How granular would we want this feature to be? Think we
> > >> would
> > >> > > want
> > >> > > >>  > one
> > >> > > >>  > >> index/dashboard per source? Or would it be better to
> > collapse
> > >> > > >>  > everything
> > >> > > >>  > >> into the same index?
> > >> > > >>  > >> - Do we care about storing these errors in HDFS as well?
> Or
> > >> is
> > >> > > >>  indexing
> > >> > > >>  > >> them enough?
> > >> > > >>  > >> - What types of errors should we record? I am proposing:
> > >> > > >>  > >>
> > >> > > >>  > >> For error reporting:
> > >> > > >>  > >> --Message failed to parse
> > >> > > >>  > >> --Enrichment failed to enrich
> > >> > > >>  > >> --Threat intel feed failures
> > >> > > >>  > >> --Generic catch-all for all other errors
> > >> > > >>  > >>
> > >> > > >>  > >> For validation reporting:
> > >> > > >>  > >> --What part of message failed validation
> > >> > > >>  > >> --What stellar validator caused the failure
> > >> > > >>  > >>
> > >> > > >>  > >> -------------------
> > >> > > >>  > >> Thank you,
> > >> > > >>  > >>
> > >> > > >>  > >> James Sirota
> > >> > > >>  > >> PPMC- Apache Metron (Incubating)
> > >> > > >>  > >> jsirota AT apache DOT org
> > >> > > >>  > >>
> > >> > > >>  > >> --
> > >> > > >>  > >>
> > >> > > >>  > >> Jon
> > >> > > >>  > >>
> > >> > > >>  > >> Sent from my mobile device
> > >> > > >>  > >
> > >> > > >>  > > -------------------
> > >> > > >>  > > Thank you,
> > >> > > >>  > >
> > >> > > >>  > > James Sirota
> > >> > > >>  > > PPMC- Apache Metron (Incubating)
> > >> > > >>  > > jsirota AT apache DOT org
> > >> > > >>  > >
> > >> > > >>  > > --
> > >> > > >>  > >
> > >> > > >>  > > Jon
> > >> > > >>  > >
> > >> > > >>  > > Sent from my mobile device
> > >> > > >>  >
> > >> > > >>  > -------------------
> > >> > > >>  > Thank you,
> > >> > > >>  >
> > >> > > >>  > James Sirota
> > >> > > >>  > PPMC- Apache Metron (Incubating)
> > >> > > >>  > jsirota AT apache DOT org
> > >> > > >>  >
> > >> > > >>  --
> > >> > > >>
> > >> > > >>  Jon
> > >> > > >>
> > >> > > >>  Sent from my mobile device
> > >> > >
> > >> > > -------------------
> > >> > > Thank you,
> > >> > >
> > >> > > James Sirota
> > >> > > PPMC- Apache Metron (Incubating)
> > >> > > jsirota AT apache DOT org
> > >> > >
> > >> >
> > >> --
> > >>
> > >> Jon
> > >>
> > >> Sent from my mobile device
> > >>
> > >
> > >
> >
>
> --
>
> Jon
>
> Sent from my mobile device
>

Re: [DISCUSS] Error Indexing

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
Although hashing the whole message is better than nothing, it misses a lot
of the benefits we could get.

While I'd love to have consistency for this field across all of the
different error.types, it appears that may not be reasonably possible
because of the parsers.  So, how about something like this: hash all of the
constant fields
<https://github.com/apache/incubator-metron/blob/master/metron-platform/metron-common/src/main/java/org/apache/metron/common/Constants.java>,
excluding timestamp and original_string, unless it is a parser error, in
which case hash the entire message.  This gives us some measure of event
uniqueness, and it can grow as we define additional constant fields (I
recall a discussion on the list about expanding those standard fields to
include things like usernames, but I can't find the specific email
exchange).
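
To make that a bit more concrete, here is a rough sketch of the selection
rule I have in mind (purely illustrative: the class and method names are
made up, and exactly which names belong in the constant-field set is still
an open question):

import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class ErrorHashInput {

  // Sketch only, not Metron code: pick which fields feed the hash.
  // 'constantFields' stands in for the field names declared in
  // Constants.java.
  public static Map<String, Object> selectFields(Map<String, Object> message,
                                                 Set<String> constantFields,
                                                 boolean isParserError) {
    if (isParserError) {
      // Parser failures may not have produced reliable fields yet,
      // so hash the entire message.
      return new TreeMap<>(message);
    }
    Map<String, Object> selected = new TreeMap<>();
    for (String field : constantFields) {
      if (!"timestamp".equals(field)
          && !"original_string".equals(field)
          && message.containsKey(field)) {
        selected.put(field, message.get(field));
      }
    }
    return selected;
  }
}

Whatever map comes out of that would then feed whichever digest we settle
on for rawMessage_hash.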

Because some enrichments can be heavily relied on, I think it makes sense
to put a message onto the error queue when it throws an exception.  Not
only does this help troubleshoot edge cases, but it makes issues more
obvious when assembling a new enrichment in dev/test.  I can't think of a
current scenario where an enrichment would be "best effort" only, such that
I wouldn't want the error indexed and retrievable.  However, this gets
interesting when talking about the various options for solving the "Enrich
enrichment" question from earlier in the month.  We can keep that part of
this separate though, as I don't think that's being actively pursued right
now.

Jon

On Wed, Jan 25, 2017 at 10:49 AM David Lyle <dl...@gmail.com> wrote:

RE: separate JIRA for MPack/Ansible. No objection to tracking them
separately, but for this item to be complete, you'll need both the feature
and the ability to install it.

-D...


On Tue, Jan 24, 2017 at 5:33 PM, Ryan Merriman <me...@gmail.com> wrote:

> Assuming we're going to write all errors to a single error topic, I think
> it makes sense to agree on an error message schema and handle errors
across
> the 3 different topologies in the same way with a single implementation.
> The implementation in ParserBolt (ErrorUtils.handleError) produces the
most
> verbose error object so I think it's a good candidate for the single
> implementation.  Here is the message structure it currently produces:
>
> {
>   "exception": "java.lang.Exception: there was an error",
>   "hostname": "host",
>   "stack": "java.lang.Exception: ...",
>   "time": 1485295416563,
>   "message": "there was an error",
>   "rawMessage": "raw message",
>   "rawMessage_bytes": [],
>   "source.type": "bro_error"
> }
>
> From our discussion so far we need to add a couple fields:  an error type
> and hash id.  Adding these to the message looks like:
>
> {
>   "exception": "java.lang.Exception: there was an error",
>   "hostname": "host",
>   "stack": "java.lang.Exception: ...",
>   "time": 1485295416563,
>   "message": "there was an error",
>   "rawMessage": "raw message",
>   "rawMessage_bytes": [],
>   "source.type": "bro_error",
>   "error.type": "parser_error",
>   "rawMessage_hash": "dde41b9920954f94066daf6291fb58a9"
> }
>
> We should also consider expanding the error types I listed earlier.
> Instead of just having "indexing_error" we could have
> "elasticsearch_indexing_error", "hdfs_indexing_error" and so on.
>
> Jon, if an exception happens in an enrichment or threat intel bolt the
> message is passed along with no error thrown (only logged).  Everywhere
> else I'm having trouble identifying specific fields that should be hashed.
> Would hashing the message in every case be acceptable?  Do you know of a
> place where we could hash a field instead?  On the topic of exceptions in
> enrichments, are we ok with an error only being logged and not added to
the
> message or emitted to the error queue?
>
>
>
> On Tue, Jan 24, 2017 at 3:10 PM, Ryan Merriman <me...@gmail.com>
> wrote:
>
> > That use case makes sense to me.  I don't think it will require that
much
> > additional effort either.
> >
> > On Tue, Jan 24, 2017 at 1:02 PM, Zeolla@GMail.com <ze...@gmail.com>
> > wrote:
> >
> >> Regarding error vs validation - Either way I'm not very concerned.  I
> >> initially assumed they would be combined and agree with that approach,
> but
> >> splitting them out isn't a very big deal to me either.
> >>
> >> Re: Ryan.  Yes, exactly.  In the case of a parser issue (or anywhere
> else
> >> where it's not possible to pick out the exact thing causing the issue)
> it
> >> would be a hash of the complete message.
> >>
> >> Regarding the architecture, I mostly agree with James except that I
> think
> >> step 3 needs to also be able to somehow group errors via the original
> >> data (identify
> >> replays, identify repeat issues with data in a specific field, issues
> with
> >> consistently different data, etc.).  This is essentially the first step
> of
> >> troubleshooting, which I assume you are doing if you're looking at the
> >> error dashboard.
> >>
> >> If the hash gets moved out of the initial implementation, I'm fairly
> >> certain you lose this ability.  The point here isn't to handle long
> fields
> >> (although that's a benefit of this approach), it's to attach a unique
> >> identifier to the error/validation issue message that links it to the
> >> original problem.  I'd be happy to consider alternative solutions to
> this
> >> problem (for instance, actually sending across the data itself) I just
> >> haven't been able to think of another way to do this that I like
better.
> >>
> >> Jon
> >>
> >> On Tue, Jan 24, 2017 at 1:13 PM Ryan Merriman <me...@gmail.com>
> >> wrote:
> >>
> >> > We also need a JIRA for any install/Ansible/MPack work needed.
> >> >
> >> > On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <js...@apache.org>
> >> wrote:
> >> >
> >> > > Now that I had some time to think about it I would collapse all
> error
> >> and
> >> > > validation topics into one.  We can differentiate between different
> >> views
> >> > > of the data (split by error source etc) via Kibana dashboards.  I
> >> would
> >> > > implement this feature incrementally.  First I would modify all the
> >> bolts
> >> > > to log to a single topic.  Second, I would get the error indexing
> >> done by
> >> > > attaching the indexing topology to the error topic. Third I would
> >> create
> >> > > the necessary dashboards to view errors and validation failures by
> >> > source.
> >> > > Lastly, I would file a follow-on JIRA to introduce hashing of
errors
> >> or
> >> > > fields that are too long.  It seems like a separate feature that we
> >> need
> >> > to
> >> > > think through.  We may need a stellar function around that.
> >> > >
> >> > > Thanks,
> >> > > James
> >> > >
> >> > > 24.01.2017, 10:25, "Ryan Merriman" <me...@gmail.com>:
> >> > > > I understand what Jon is talking about. He's proposing we hash
the
> >> > value
> >> > > > that caused the error, not necessarily the error message itself.
> >> For an
> >> > > > enrichment this is easy. Just pass along the field value that
> failed
> >> > > > enrichment. For other cases the field that caused the error may
> not
> >> be
> >> > so
> >> > > > obvious. Take parser validation for example. The message is
> >> validated
> >> > as
> >> > > > a whole and it may not be easy to determine which field is the
> >> cause.
> >> > In
> >> > > > that case would a hash of the whole message work?
> >> > > >
> >> > > > There is a broader architectural discussion that needs to happen
> >> before
> >> > > we
> >> > > > can implement this. Currently we have an indexing topology that
> >> reads
> >> > > from
> >> > > > 1 topic and writes messages to ES but errors are written to
> several
> >> > > > different topics:
> >> > > >
> >> > > >    - parser_error
> >> > > >    - parser_invalid
> >> > > >    - enrichments_error
> >> > > >    - threatintel_error
> >> > > >    - indexing_error
> >> > > >
> >> > > > I can see 4 possible approaches to implementing this:
> >> > > >
> >> > > >    1. Create an index topology for each error topic
> >> > > >       1. Good because we can easily reuse the indexing topology
> and
> >> > would
> >> > > >       require the least development effort
> >> > > >       2. Bad because it would consume a lot of extra worker slots
> >> > > >    2. Move the topic name into the error JSON message as a new
> >> > > "error_type"
> >> > > >    field and write all messages to the indexing topic
> >> > > >       1. Good because we don't need to create a new topology
> >> > > >       2. Bad because we would be flowing data and errors through
> the
> >> > same
> >> > > >       topology. A spike in errors could affect message indexing.
> >> > > >    3. Compromise between 1 and 2. Create another indexing
topology
> >> that
> >> > > is
> >> > > >    dedicated to indexing errors. Move the topic name into the
> error
> >> > JSON
> >> > > >    message as a new "error_type" field and write all errors to a
> >> single
> >> > > error
> >> > > >    topic.
> >> > > >    4. Write a completely new topology with multiple spouts (1 for
> >> each
> >> > > >    error type listed above) that all feed into a single
> >> > > BulkMessageWriterBolt.
> >> > > >       1. Good because the current topologies would not need to
> >> change
> >> > > >       2. Bad because it would require the most development
effort,
> >> > would
> >> > > >       not reuse existing topologies and takes up more worker
slots
> >> > than 3
> >> > > >
> >> > > > Are there other approaches I haven't thought of? I think 1 and 2
> are
> >> > off
> >> > > > the table because they are shortcuts and not good long-term
> >> solutions.
> >> > 3
> >> > > > would be my choice because it introduces less complexity than 4.
> >> > > Thoughts?
> >> > > >
> >> > > > Ryan
> >> > > >
> >> > > > On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <
> zeolla@gmail.com
> >> >
> >> > > wrote:
> >> > > >
> >> > > >>  In that case the hash would be of the value in the IP field,
> such
> >> as
> >> > > >>  sha3(8.8.8.8).
> >> > > >>
> >> > > >>  Jon
> >> > > >>
> >> > > >>  On Mon, Jan 23, 2017, 6:41 PM James Sirota <js...@apache.org>
> >> > wrote:
> >> > > >>
> >> > > >>  > Jon,
> >> > > >>  >
> >> > > >>  > I am still not entirely following why we would want to use
> >> hashing.
> >> > > For
> >> > > >>  > example if my error is "Your IP field is invalid and failed
> >> > > validation"
> >> > > >>  > hashing this error string will always result in the same
hash.
> >> Why
> >> > > not
> >> > > >>  > just use the actual error string? Can you provide an example
> >> where
> >> > > you
> >> > > >>  > would use it?
> >> > > >>  >
> >> > > >>  > Thanks,
> >> > > >>  > James
> >> > > >>  >
> >> > > >>  > 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
> >> > > >>  > > For 1 - I'm good with that.
> >> > > >>  > >
> >> > > >>  > > I'm talking about hashing the relevant content itself not
> the
> >> > > error.
> >> > > >>  Some
> >> > > >>  > > benefits are (1) minimize load on search index (there's
> >> minimal
> >> > > benefit
> >> > > >>  > in
> >> > > >>  > > spending the CPU and disk to keep it at full fidelity
> >> (tokenize
> >> > and
> >> > > >>  > store))
> >> > > >>  > > (2) provide something to key on for dashboards (assuming a
> >> good
> >> > > hash
> >> > > >>  > > algorithm that avoids collisions and is second preimage
> >> > resistant)
> >> > > and
> >> > > >>  > (3)
> >> > > >>  > > specific to errors, if the issue is that it failed to
> index, a
> >> > hash
> >> > > >>  gives
> >> > > >>  > > us some protection that the issue will not occur twice.
> >> > > >>  > >
> >> > > >>  > > Jon
> >> > > >>  > >
> >> > > >>  > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <
> >> jsirota@apache.org>
> >> > > wrote:
> >> > > >>  > >
> >> > > >>  > > Jon,
> >> > > >>  > >
> >> > > >>  > > With regards to 1, collapsing to a single dashboard for
each
> >> > would
> >> > > be
> >> > > >>  > > fine. So we would have one error index and one "failed to
> >> > validate"
> >> > > >>  > > index. The distinction is that errors would be things that
> >> went
> >> > > wrong
> >> > > >>  > > during stream processing (failed to parse, etc...), while
> >> > > validation
> >> > > >>  > > failures are messages that explicitly failed stellar
> >> > > validation/schema
> >> > > >>  > > enforcement. There should be relatively few of the second
> >> type.
> >> > > >>  > >
> >> > > >>  > > With respect to 3, why do you want the error hashed? Why
not
> >> just
> >> > > >>  search
> >> > > >>  > > for the error text?
> >> > > >>  > >
> >> > > >>  > > Thanks,
> >> > > >>  > > James
> >> > > >>  > >
> >> > > >>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> >> > > >>  > >> As someone who currently fills the platform engineer role,
> I
> >> can
> >> > > give
> >> > > >>  > this
> >> > > >>  > >> idea a huge +1. My thoughts:
> >> > > >>  > >>
> >> > > >>  > >> 1. I think it depends on exactly what data is pushed into
> the
> >> > > index
> >> > > >>  > (#3).
> >> > > >>  > >> However, assuming the errors you proposed recording, I
> can't
> >> see
> >> > > huge
> >> > > >>  > >> benefits to having more than one dashboard. I would be
> happy
> >> to
> >> > be
> >> > > >>  > >> persuaded otherwise.
> >> > > >>  > >>
> >> > > >>  > >> 2. I would say yes, storing the errors in HDFS in addition
> to
> >> > > >>  indexing
> >> > > >>  > is
> >> > > >>  > >> a good thing. Using METRON-510
> >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a
> case
> >> > > study,
> >> > > >>  > there
> >> > > >>  > >> is the potential in this environment for
> attacker-controlled
> >> > data
> >> > > to
> >> > > >>  > >
> >> > > >>  > > result
> >> > > >>  > >> in processing errors which could be a method of evading
> >> security
> >> > > >>  > >> monitoring. Once an attack is identified, the long term
> HDFS
> >> > > storage
> >> > > >>  > would
> >> > > >>  > >> allow better historical analysis for
> low-and-slow/persistent
> >> > > attacks
> >> > > >>  > (I'm
> >> > > >>  > >> thinking of a method of data exfil that also won't
> >> successfully
> >> > > get
> >> > > >>  > stored
> >> > > >>  > >> in Lucene, but is hard to identify over a short period of
> >> time).
> >> > > >>  > >> - Along this line, I think that there are various parts of
> >> > Metron
> >> > > >>  > (this
> >> > > >>  > >> included) which could benefit from having method of
> >> configuring
> >> > > data
> >> > > >>  > aging
> >> > > >>  > >> by bucket in HDFS (Following Nick's comments here
> >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
> >> > > >>  > >>
> >> > > >>  > >> 3. I would potentially add a hash of the content that
> failed
> >> > > >>  > validation to
> >> > > >>  > >> help identify repeats over time with less of a concern
that
> >> > you'd
> >> > > >>  have
> >> > > >>  > >
> >> > > >>  > > back
> >> > > >>  > >> to back failures (i.e. instead of storing the value
> itself).
> >> > > >>  > Additionally,
> >> > > >>  > >> I think it's helpful to be able to search all times there
> >> was an
> >> > > >>  > indexing
> >> > > >>  > >> error (instead of it hitting the catch-all).
> >> > > >>  > >>
> >> > > >>  > >> Jon
> >> > > >>  > >>
> >> > > >>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <
> >> > jsirota@apache.org>
> >> > > >>  > wrote:
> >> > > >>  > >>
> >> > > >>  > >> We already have a capability to capture bolt errors and
> >> > validation
> >> > > >>  > errors
> >> > > >>  > >> and pipe them into a Kafka topic. I want to propose that
we
> >> > > attach a
> >> > > >>  > >> writer topology to the error and validation failed kafka
> >> topics
> >> > so
> >> > > >>  > that we
> >> > > >>  > >> can (a) create a new ES index for these errors and (b)
> >> create a
> >> > > new
> >> > > >>  > Kibana
> >> > > >>  > >> dashboard to visualize them. The benefit would be that
> errors
> >> > and
> >> > > >>  > >> validation failures would be easier to see and analyze.
> >> > > >>  > >>
> >> > > >>  > >> I am seeking feedback on the following:
> >> > > >>  > >>
> >> > > >>  > >> - How granular would we want this feature to be? Think we
> >> would
> >> > > want
> >> > > >>  > one
> >> > > >>  > >> index/dashboard per source? Or would it be better to
> collapse
> >> > > >>  > everything
> >> > > >>  > >> into the same index?
> >> > > >>  > >> - Do we care about storing these errors in HDFS as well?
Or
> >> is
> >> > > >>  indexing
> >> > > >>  > >> them enough?
> >> > > >>  > >> - What types of errors should we record? I am proposing:
> >> > > >>  > >>
> >> > > >>  > >> For error reporting:
> >> > > >>  > >> --Message failed to parse
> >> > > >>  > >> --Enrichment failed to enrich
> >> > > >>  > >> --Threat intel feed failures
> >> > > >>  > >> --Generic catch-all for all other errors
> >> > > >>  > >>
> >> > > >>  > >> For validation reporting:
> >> > > >>  > >> --What part of message failed validation
> >> > > >>  > >> --What stellar validator caused the failure
> >> > > >>  > >>
> >> > > >>  > >> -------------------
> >> > > >>  > >> Thank you,
> >> > > >>  > >>
> >> > > >>  > >> James Sirota
> >> > > >>  > >> PPMC- Apache Metron (Incubating)
> >> > > >>  > >> jsirota AT apache DOT org
> >> > > >>  > >>
> >> > > >>  > >> --
> >> > > >>  > >>
> >> > > >>  > >> Jon
> >> > > >>  > >>
> >> > > >>  > >> Sent from my mobile device
> >> > > >>  > >
> >> > > >>  > > -------------------
> >> > > >>  > > Thank you,
> >> > > >>  > >
> >> > > >>  > > James Sirota
> >> > > >>  > > PPMC- Apache Metron (Incubating)
> >> > > >>  > > jsirota AT apache DOT org
> >> > > >>  > >
> >> > > >>  > > --
> >> > > >>  > >
> >> > > >>  > > Jon
> >> > > >>  > >
> >> > > >>  > > Sent from my mobile device
> >> > > >>  >
> >> > > >>  > -------------------
> >> > > >>  > Thank you,
> >> > > >>  >
> >> > > >>  > James Sirota
> >> > > >>  > PPMC- Apache Metron (Incubating)
> >> > > >>  > jsirota AT apache DOT org
> >> > > >>  >
> >> > > >>  --
> >> > > >>
> >> > > >>  Jon
> >> > > >>
> >> > > >>  Sent from my mobile device
> >> > >
> >> > > -------------------
> >> > > Thank you,
> >> > >
> >> > > James Sirota
> >> > > PPMC- Apache Metron (Incubating)
> >> > > jsirota AT apache DOT org
> >> > >
> >> >
> >> --
> >>
> >> Jon
> >>
> >> Sent from my mobile device
> >>
> >
> >
>

-- 

Jon

Sent from my mobile device

Re: [DISCUSS] Error Indexing

Posted by David Lyle <dl...@gmail.com>.
RE: separate JIRA for MPack/Ansible. No objection to tracking them
separately, but for this item to be complete, you'll need both the feature
and the ability to install it.

-D...


On Tue, Jan 24, 2017 at 5:33 PM, Ryan Merriman <me...@gmail.com> wrote:

> Assuming we're going to write all errors to a single error topic, I think
> it makes sense to agree on an error message schema and handle errors across
> the 3 different topologies in the same way with a single implementation.
> The implementation in ParserBolt (ErrorUtils.handleError) produces the most
> verbose error object so I think it's a good candidate for the single
> implementation.  Here is the message structure it currently produces:
>
> {
>   "exception": "java.lang.Exception: there was an error",
>   "hostname": "host",
>   "stack": "java.lang.Exception: ...",
>   "time": 1485295416563,
>   "message": "there was an error",
>   "rawMessage": "raw message",
>   "rawMessage_bytes": [],
>   "source.type": "bro_error"
> }
>
> From our discussion so far we need to add a couple fields:  an error type
> and hash id.  Adding these to the message looks like:
>
> {
>   "exception": "java.lang.Exception: there was an error",
>   "hostname": "host",
>   "stack": "java.lang.Exception: ...",
>   "time": 1485295416563,
>   "message": "there was an error",
>   "rawMessage": "raw message",
>   "rawMessage_bytes": [],
>   "source.type": "bro_error",
>   "error.type": "parser_error",
>   "rawMessage_hash": "dde41b9920954f94066daf6291fb58a9"
> }
>
> We should also consider expanding the error types I listed earlier.
> Instead of just having "indexing_error" we could have
> "elasticsearch_indexing_error", "hdfs_indexing_error" and so on.
>
> Jon, if an exception happens in an enrichment or threat intel bolt the
> message is passed along with no error thrown (only logged).  Everywhere
> else I'm having trouble identifying specific fields that should be hashed.
> Would hashing the message in every case be acceptable?  Do you know of a
> place where we could hash a field instead?  On the topic of exceptions in
> enrichments, are we ok with an error only being logged and not added to the
> message or emitted to the error queue?
>
>
>
> On Tue, Jan 24, 2017 at 3:10 PM, Ryan Merriman <me...@gmail.com>
> wrote:
>
> > That use case makes sense to me.  I don't think it will require that much
> > additional effort either.
> >
> > On Tue, Jan 24, 2017 at 1:02 PM, Zeolla@GMail.com <ze...@gmail.com>
> > wrote:
> >
> >> Regarding error vs validation - Either way I'm not very concerned.  I
> >> initially assumed they would be combined and agree with that approach,
> but
> >> splitting them out isn't a very big deal to me either.
> >>
> >> Re: Ryan.  Yes, exactly.  In the case of a parser issue (or anywhere
> else
> >> where it's not possible to pick out the exact thing causing the issue)
> it
> >> would be a hash of the complete message.
> >>
> >> Regarding the architecture, I mostly agree with James except that I
> think
> >> step 3 needs to also be able to somehow group errors via the original
> >> data (identify
> >> replays, identify repeat issues with data in a specific field, issues
> with
> >> consistently different data, etc.).  This is essentially the first step
> of
> >> troubleshooting, which I assume you are doing if you're looking at the
> >> error dashboard.
> >>
> >> If the hash gets moved out of the initial implementation, I'm fairly
> >> certain you lose this ability.  The point here isn't to handle long
> fields
> >> (although that's a benefit of this approach), it's to attach a unique
> >> identifier to the error/validation issue message that links it to the
> >> original problem.  I'd be happy to consider alternative solutions to
> this
> >> problem (for instance, actually sending across the data itself) I just
> >> haven't been able to think of another way to do this that I like better.
> >>
> >> Jon
> >>
> >> On Tue, Jan 24, 2017 at 1:13 PM Ryan Merriman <me...@gmail.com>
> >> wrote:
> >>
> >> > We also need a JIRA for any install/Ansible/MPack work needed.
> >> >
> >> > On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <js...@apache.org>
> >> wrote:
> >> >
> >> > > Now that I had some time to think about it I would collapse all
> error
> >> and
> >> > > validation topics into one.  We can differentiate between different
> >> views
> >> > > of the data (split by error source etc) via Kibana dashboards.  I
> >> would
> >> > > implement this feature incrementally.  First I would modify all the
> >> bolts
> >> > > to log to a single topic.  Second, I would get the error indexing
> >> done by
> >> > > attaching the indexing topology to the error topic. Third I would
> >> create
> >> > > the necessary dashboards to view errors and validation failures by
> >> > source.
> >> > > Lastly, I would file a follow-on JIRA to introduce hashing of errors
> >> or
> >> > > fields that are too long.  It seems like a separate feature that we
> >> need
> >> > to
> >> > > think through.  We may need a stellar function around that.
> >> > >
> >> > > Thanks,
> >> > > James
> >> > >
> >> > > 24.01.2017, 10:25, "Ryan Merriman" <me...@gmail.com>:
> >> > > > I understand what Jon is talking about. He's proposing we hash the
> >> > value
> >> > > > that caused the error, not necessarily the error message itself.
> >> For an
> >> > > > enrichment this is easy. Just pass along the field value that
> failed
> >> > > > enrichment. For other cases the field that caused the error may
> not
> >> be
> >> > so
> >> > > > obvious. Take parser validation for example. The message is
> >> validated
> >> > as
> >> > > > a whole and it may not be easy to determine which field is the
> >> cause.
> >> > In
> >> > > > that case would a hash of the whole message work?
> >> > > >
> >> > > > There is a broader architectural discussion that needs to happen
> >> before
> >> > > we
> >> > > > can implement this. Currently we have an indexing topology that
> >> reads
> >> > > from
> >> > > > 1 topic and writes messages to ES but errors are written to
> several
> >> > > > different topics:
> >> > > >
> >> > > >    - parser_error
> >> > > >    - parser_invalid
> >> > > >    - enrichments_error
> >> > > >    - threatintel_error
> >> > > >    - indexing_error
> >> > > >
> >> > > > I can see 4 possible approaches to implementing this:
> >> > > >
> >> > > >    1. Create an index topology for each error topic
> >> > > >       1. Good because we can easily reuse the indexing topology
> and
> >> > would
> >> > > >       require the least development effort
> >> > > >       2. Bad because it would consume a lot of extra worker slots
> >> > > >    2. Move the topic name into the error JSON message as a new
> >> > > "error_type"
> >> > > >    field and write all messages to the indexing topic
> >> > > >       1. Good because we don't need to create a new topology
> >> > > >       2. Bad because we would be flowing data and errors through
> the
> >> > same
> >> > > >       topology. A spike in errors could affect message indexing.
> >> > > >    3. Compromise between 1 and 2. Create another indexing topology
> >> that
> >> > > is
> >> > > >    dedicated to indexing errors. Move the topic name into the
> error
> >> > JSON
> >> > > >    message as a new "error_type" field and write all errors to a
> >> single
> >> > > error
> >> > > >    topic.
> >> > > >    4. Write a completely new topology with multiple spouts (1 for
> >> each
> >> > > >    error type listed above) that all feed into a single
> >> > > BulkMessageWriterBolt.
> >> > > >       1. Good because the current topologies would not need to
> >> change
> >> > > >       2. Bad because it would require the most development effort,
> >> > would
> >> > > >       not reuse existing topologies and takes up more worker slots
> >> > than 3
> >> > > >
> >> > > > Are there other approaches I haven't thought of? I think 1 and 2
> are
> >> > off
> >> > > > the table because they are shortcuts and not good long-term
> >> solutions.
> >> > 3
> >> > > > would be my choice because it introduces less complexity than 4.
> >> > > Thoughts?
> >> > > >
> >> > > > Ryan
> >> > > >
> >> > > > On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <
> zeolla@gmail.com
> >> >
> >> > > wrote:
> >> > > >
> >> > > >>  In that case the hash would be of the value in the IP field,
> such
> >> as
> >> > > >>  sha3(8.8.8.8).
> >> > > >>
> >> > > >>  Jon
> >> > > >>
> >> > > >>  On Mon, Jan 23, 2017, 6:41 PM James Sirota <js...@apache.org>
> >> > wrote:
> >> > > >>
> >> > > >>  > Jon,
> >> > > >>  >
> >> > > >>  > I am still not entirely following why we would want to use
> >> hashing.
> >> > > For
> >> > > >>  > example if my error is "Your IP field is invalid and failed
> >> > > validation"
> >> > > >>  > hashing this error string will always result in the same hash.
> >> Why
> >> > > not
> >> > > >>  > just use the actual error string? Can you provide an example
> >> where
> >> > > you
> >> > > >>  > would use it?
> >> > > >>  >
> >> > > >>  > Thanks,
> >> > > >>  > James
> >> > > >>  >
> >> > > >>  > 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
> >> > > >>  > > For 1 - I'm good with that.
> >> > > >>  > >
> >> > > >>  > > I'm talking about hashing the relevant content itself not
> the
> >> > > error.
> >> > > >>  Some
> >> > > >>  > > benefits are (1) minimize load on search index (there's
> >> minimal
> >> > > benefit
> >> > > >>  > in
> >> > > >>  > > spending the CPU and disk to keep it at full fidelity
> >> (tokenize
> >> > and
> >> > > >>  > store))
> >> > > >>  > > (2) provide something to key on for dashboards (assuming a
> >> good
> >> > > hash
> >> > > >>  > > algorithm that avoids collisions and is second preimage
> >> > resistant)
> >> > > and
> >> > > >>  > (3)
> >> > > >>  > > specific to errors, if the issue is that it failed to
> index, a
> >> > hash
> >> > > >>  gives
> >> > > >>  > > us some protection that the issue will not occur twice.
> >> > > >>  > >
> >> > > >>  > > Jon
> >> > > >>  > >
> >> > > >>  > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <
> >> jsirota@apache.org>
> >> > > wrote:
> >> > > >>  > >
> >> > > >>  > > Jon,
> >> > > >>  > >
> >> > > >>  > > With regards to 1, collapsing to a single dashboard for each
> >> > would
> >> > > be
> >> > > >>  > > fine. So we would have one error index and one "failed to
> >> > validate"
> >> > > >>  > > index. The distinction is that errors would be things that
> >> went
> >> > > wrong
> >> > > >>  > > during stream processing (failed to parse, etc...), while
> >> > > validation
> >> > > >>  > > failures are messages that explicitly failed stellar
> >> > > validation/schema
> >> > > >>  > > enforcement. There should be relatively few of the second
> >> type.
> >> > > >>  > >
> >> > > >>  > > With respect to 3, why do you want the error hashed? Why not
> >> just
> >> > > >>  search
> >> > > >>  > > for the error text?
> >> > > >>  > >
> >> > > >>  > > Thanks,
> >> > > >>  > > James
> >> > > >>  > >
> >> > > >>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> >> > > >>  > >> As someone who currently fills the platform engineer role,
> I
> >> can
> >> > > give
> >> > > >>  > this
> >> > > >>  > >> idea a huge +1. My thoughts:
> >> > > >>  > >>
> >> > > >>  > >> 1. I think it depends on exactly what data is pushed into
> the
> >> > > index
> >> > > >>  > (#3).
> >> > > >>  > >> However, assuming the errors you proposed recording, I
> can't
> >> see
> >> > > huge
> >> > > >>  > >> benefits to having more than one dashboard. I would be
> happy
> >> to
> >> > be
> >> > > >>  > >> persuaded otherwise.
> >> > > >>  > >>
> >> > > >>  > >> 2. I would say yes, storing the errors in HDFS in addition
> to
> >> > > >>  indexing
> >> > > >>  > is
> >> > > >>  > >> a good thing. Using METRON-510
> >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a
> case
> >> > > study,
> >> > > >>  > there
> >> > > >>  > >> is the potential in this environment for
> attacker-controlled
> >> > data
> >> > > to
> >> > > >>  > >
> >> > > >>  > > result
> >> > > >>  > >> in processing errors which could be a method of evading
> >> security
> >> > > >>  > >> monitoring. Once an attack is identified, the long term
> HDFS
> >> > > storage
> >> > > >>  > would
> >> > > >>  > >> allow better historical analysis for
> low-and-slow/persistent
> >> > > attacks
> >> > > >>  > (I'm
> >> > > >>  > >> thinking of a method of data exfil that also won't
> >> successfully
> >> > > get
> >> > > >>  > stored
> >> > > >>  > >> in Lucene, but is hard to identify over a short period of
> >> time).
> >> > > >>  > >> - Along this line, I think that there are various parts of
> >> > Metron
> >> > > >>  > (this
> >> > > >>  > >> included) which could benefit from having method of
> >> configuring
> >> > > data
> >> > > >>  > aging
> >> > > >>  > >> by bucket in HDFS (Following Nick's comments here
> >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
> >> > > >>  > >>
> >> > > >>  > >> 3. I would potentially add a hash of the content that
> failed
> >> > > >>  > validation to
> >> > > >>  > >> help identify repeats over time with less of a concern that
> >> > you'd
> >> > > >>  have
> >> > > >>  > >
> >> > > >>  > > back
> >> > > >>  > >> to back failures (i.e. instead of storing the value
> itself).
> >> > > >>  > Additionally,
> >> > > >>  > >> I think it's helpful to be able to search all times there
> >> was an
> >> > > >>  > indexing
> >> > > >>  > >> error (instead of it hitting the catch-all).
> >> > > >>  > >>
> >> > > >>  > >> Jon
> >> > > >>  > >>
> >> > > >>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <
> >> > jsirota@apache.org>
> >> > > >>  > wrote:
> >> > > >>  > >>
> >> > > >>  > >> We already have a capability to capture bolt errors and
> >> > validation
> >> > > >>  > errors
> >> > > >>  > >> and pipe them into a Kafka topic. I want to propose that we
> >> > > attach a
> >> > > >>  > >> writer topology to the error and validation failed kafka
> >> topics
> >> > so
> >> > > >>  > that we
> >> > > >>  > >> can (a) create a new ES index for these errors and (b)
> >> create a
> >> > > new
> >> > > >>  > Kibana
> >> > > >>  > >> dashboard to visualize them. The benefit would be that
> errors
> >> > and
> >> > > >>  > >> validation failures would be easier to see and analyze.
> >> > > >>  > >>
> >> > > >>  > >> I am seeking feedback on the following:
> >> > > >>  > >>
> >> > > >>  > >> - How granular would we want this feature to be? Think we
> >> would
> >> > > want
> >> > > >>  > one
> >> > > >>  > >> index/dashboard per source? Or would it be better to
> collapse
> >> > > >>  > everything
> >> > > >>  > >> into the same index?
> >> > > >>  > >> - Do we care about storing these errors in HDFS as well? Or
> >> is
> >> > > >>  indexing
> >> > > >>  > >> them enough?
> >> > > >>  > >> - What types of errors should we record? I am proposing:
> >> > > >>  > >>
> >> > > >>  > >> For error reporting:
> >> > > >>  > >> --Message failed to parse
> >> > > >>  > >> --Enrichment failed to enrich
> >> > > >>  > >> --Threat intel feed failures
> >> > > >>  > >> --Generic catch-all for all other errors
> >> > > >>  > >>
> >> > > >>  > >> For validation reporting:
> >> > > >>  > >> --What part of message failed validation
> >> > > >>  > >> --What stellar validator caused the failure
> >> > > >>  > >>
> >> > > >>  > >> -------------------
> >> > > >>  > >> Thank you,
> >> > > >>  > >>
> >> > > >>  > >> James Sirota
> >> > > >>  > >> PPMC- Apache Metron (Incubating)
> >> > > >>  > >> jsirota AT apache DOT org
> >> > > >>  > >>
> >> > > >>  > >> --
> >> > > >>  > >>
> >> > > >>  > >> Jon
> >> > > >>  > >>
> >> > > >>  > >> Sent from my mobile device
> >> > > >>  > >
> >> > > >>  > > -------------------
> >> > > >>  > > Thank you,
> >> > > >>  > >
> >> > > >>  > > James Sirota
> >> > > >>  > > PPMC- Apache Metron (Incubating)
> >> > > >>  > > jsirota AT apache DOT org
> >> > > >>  > >
> >> > > >>  > > --
> >> > > >>  > >
> >> > > >>  > > Jon
> >> > > >>  > >
> >> > > >>  > > Sent from my mobile device
> >> > > >>  >
> >> > > >>  > -------------------
> >> > > >>  > Thank you,
> >> > > >>  >
> >> > > >>  > James Sirota
> >> > > >>  > PPMC- Apache Metron (Incubating)
> >> > > >>  > jsirota AT apache DOT org
> >> > > >>  >
> >> > > >>  --
> >> > > >>
> >> > > >>  Jon
> >> > > >>
> >> > > >>  Sent from my mobile device
> >> > >
> >> > > -------------------
> >> > > Thank you,
> >> > >
> >> > > James Sirota
> >> > > PPMC- Apache Metron (Incubating)
> >> > > jsirota AT apache DOT org
> >> > >
> >> >
> >> --
> >>
> >> Jon
> >>
> >> Sent from my mobile device
> >>
> >
> >
>

Re: [DISCUSS] Error Indexing

Posted by Ryan Merriman <me...@gmail.com>.
Assuming we're going to write all errors to a single error topic, I think
it makes sense to agree on an error message schema and handle errors across
the 3 different topologies in the same way with a single implementation.
The implementation in ParserBolt (ErrorUtils.handleError) produces the most
verbose error object so I think it's a good candidate for the single
implementation.  Here is the message structure it currently produces:

{
  "exception": "java.lang.Exception: there was an error",
  "hostname": "host",
  "stack": "java.lang.Exception: ...",
  "time": 1485295416563,
  "message": "there was an error",
  "rawMessage": "raw message",
  "rawMessage_bytes": [],
  "source.type": "bro_error"
}

From our discussion so far we need to add a couple of fields: an error type
and a hash id.  Adding these to the message looks like:

{
  "exception": "java.lang.Exception: there was an error",
  "hostname": "host",
  "stack": "java.lang.Exception: ...",
  "time": 1485295416563,
  "message": "there was an error",
  "rawMessage": "raw message",
  "rawMessage_bytes": [],
  "source.type": "bro_error",
  "error.type": "parser_error",
  "rawMessage_hash": "dde41b9920954f94066daf6291fb58a9"
}

We should also consider expanding the error types I listed earlier.
Instead of just having "indexing_error" we could have
"elasticsearch_indexing_error", "hdfs_indexing_error" and so on.

Jon, if an exception happens in an enrichment or threat intel bolt, the
message is passed along with no error thrown (it is only logged).  Everywhere
else I'm having trouble identifying specific fields that should be hashed.
Would hashing the message in every case be acceptable?  Do you know of a
place where we could hash a field instead?  On the topic of exceptions in
enrichments, are we OK with an error only being logged and not added to the
message or emitted to the error queue?



On Tue, Jan 24, 2017 at 3:10 PM, Ryan Merriman <me...@gmail.com> wrote:

> That use case makes sense to me.  I don't think it will require that much
> additional effort either.
>
> On Tue, Jan 24, 2017 at 1:02 PM, Zeolla@GMail.com <ze...@gmail.com>
> wrote:
>
>> Regarding error vs validation - Either way I'm not very concerned.  I
>> initially assumed they would be combined and agree with that approach, but
>> splitting them out isn't a very big deal to me either.
>>
>> Re: Ryan.  Yes, exactly.  In the case of a parser issue (or anywhere else
>> where it's not possible to pick out the exact thing causing the issue) it
>> would be a hash of the complete message.
>>
>> Regarding the architecture, I mostly agree with James except that I think
>> step 3 needs to also be able to somehow group errors via the original
>> data (identify
>> replays, identify repeat issues with data in a specific field, issues with
>> consistently different data, etc.).  This is essentially the first step of
>> troubleshooting, which I assume you are doing if you're looking at the
>> error dashboard.
>>
>> If the hash gets moved out of the initial implementation, I'm fairly
>> certain you lose this ability.  The point here isn't to handle long fields
>> (although that's a benefit of this approach), it's to attach a unique
>> identifier to the error/validation issue message that links it to the
>> original problem.  I'd be happy to consider alternative solutions to this
>> problem (for instance, actually sending across the data itself) I just
>> haven't been able to think of another way to do this that I like better.
>>
>> Jon
>>
>> On Tue, Jan 24, 2017 at 1:13 PM Ryan Merriman <me...@gmail.com>
>> wrote:
>>
>> > We also need a JIRA for any install/Ansible/MPack work needed.
>> >
>> > On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <js...@apache.org>
>> wrote:
>> >
>> > > Now that I had some time to think about it I would collapse all error
>> and
>> > > validation topics into one.  We can differentiate between different
>> views
>> > > of the data (split by error source etc) via Kibana dashboards.  I
>> would
>> > > implement this feature incrementally.  First I would modify all the
>> bolts
>> > > to log to a single topic.  Second, I would get the error indexing
>> done by
>> > > attaching the indexing topology to the error topic. Third I would
>> create
>> > > the necessary dashboards to view errors and validation failures by
>> > source.
>> > > Lastly, I would file a follow-on JIRA to introduce hashing of errors
>> or
>> > > fields that are too long.  It seems like a separate feature that we
>> need
>> > to
>> > > think through.  We may need a stellar function around that.
>> > >
>> > > Thanks,
>> > > James
>> > >
>> > > 24.01.2017, 10:25, "Ryan Merriman" <me...@gmail.com>:
>> > > > I understand what Jon is talking about. He's proposing we hash the
>> > value
>> > > > that caused the error, not necessarily the error message itself.
>> For an
>> > > > enrichment this is easy. Just pass along the field value that failed
>> > > > enrichment. For other cases the field that caused the error may not
>> be
>> > so
>> > > > obvious. Take parser validation for example. The message is
>> validated
>> > as
>> > > > a whole and it may not be easy to determine which field is the
>> cause.
>> > In
>> > > > that case would a hash of the whole message work?
>> > > >
>> > > > There is a broader architectural discussion that needs to happen
>> before
>> > > we
>> > > > can implement this. Currently we have an indexing topology that
>> reads
>> > > from
>> > > > 1 topic and writes messages to ES but errors are written to several
>> > > > different topics:
>> > > >
>> > > >    - parser_error
>> > > >    - parser_invalid
>> > > >    - enrichments_error
>> > > >    - threatintel_error
>> > > >    - indexing_error
>> > > >
>> > > > I can see 4 possible approaches to implementing this:
>> > > >
>> > > >    1. Create an index topology for each error topic
>> > > >       1. Good because we can easily reuse the indexing topology and
>> > would
>> > > >       require the least development effort
>> > > >       2. Bad because it would consume a lot of extra worker slots
>> > > >    2. Move the topic name into the error JSON message as a new
>> > > "error_type"
>> > > >    field and write all messages to the indexing topic
>> > > >       1. Good because we don't need to create a new topology
>> > > >       2. Bad because we would be flowing data and errors through the
>> > same
>> > > >       topology. A spike in errors could affect message indexing.
>> > > >    3. Compromise between 1 and 2. Create another indexing topology
>> that
>> > > is
>> > > >    dedicated to indexing errors. Move the topic name into the error
>> > JSON
>> > > >    message as a new "error_type" field and write all errors to a
>> single
>> > > error
>> > > >    topic.
>> > > >    4. Write a completely new topology with multiple spouts (1 for
>> each
>> > > >    error type listed above) that all feed into a single
>> > > BulkMessageWriterBolt.
>> > > >       1. Good because the current topologies would not need to
>> change
>> > > >       2. Bad because it would require the most development effort,
>> > would
>> > > >       not reuse existing topologies and takes up more worker slots
>> > than 3
>> > > >
>> > > > Are there other approaches I haven't thought of? I think 1 and 2 are
>> > off
>> > > > the table because they are shortcuts and not good long-term
>> solutions.
>> > 3
>> > > > would be my choice because it introduces less complexity than 4.
>> > > Thoughts?
>> > > >
>> > > > Ryan
>> > > >
>> > > > On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <zeolla@gmail.com
>> >
>> > > wrote:
>> > > >
>> > > >>  In that case the hash would be of the value in the IP field, such
>> as
>> > > >>  sha3(8.8.8.8).
>> > > >>
>> > > >>  Jon
>> > > >>
>> > > >>  On Mon, Jan 23, 2017, 6:41 PM James Sirota <js...@apache.org>
>> > wrote:
>> > > >>
>> > > >>  > Jon,
>> > > >>  >
>> > > >>  > I am still not entirely following why we would want to use
>> hashing.
>> > > For
>> > > >>  > example if my error is "Your IP field is invalid and failed
>> > > validation"
>> > > >>  > hashing this error string will always result in the same hash.
>> Why
>> > > not
>> > > >>  > just use the actual error string? Can you provide an example
>> where
>> > > you
>> > > >>  > would use it?
>> > > >>  >
>> > > >>  > Thanks,
>> > > >>  > James
>> > > >>  >
>> > > >>  > 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
>> > > >>  > > For 1 - I'm good with that.
>> > > >>  > >
>> > > >>  > > I'm talking about hashing the relevant content itself not the
>> > > error.
>> > > >>  Some
>> > > >>  > > benefits are (1) minimize load on search index (there's
>> minimal
>> > > benefit
>> > > >>  > in
>> > > >>  > > spending the CPU and disk to keep it at full fidelity
>> (tokenize
>> > and
>> > > >>  > store))
>> > > >>  > > (2) provide something to key on for dashboards (assuming a
>> good
>> > > hash
>> > > >>  > > algorithm that avoids collisions and is second preimage
>> > resistant)
>> > > and
>> > > >>  > (3)
>> > > >>  > > specific to errors, if the issue is that it failed to index, a
>> > hash
>> > > >>  gives
>> > > >>  > > us some protection that the issue will not occur twice.
>> > > >>  > >
>> > > >>  > > Jon
>> > > >>  > >
>> > > >>  > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <
>> jsirota@apache.org>
>> > > wrote:
>> > > >>  > >
>> > > >>  > > Jon,
>> > > >>  > >
>> > > >>  > > With regards to 1, collapsing to a single dashboard for each
>> > would
>> > > be
>> > > >>  > > fine. So we would have one error index and one "failed to
>> > validate"
>> > > >>  > > index. The distinction is that errors would be things that
>> went
>> > > wrong
>> > > >>  > > during stream processing (failed to parse, etc...), while
>> > > validation
>> > > >>  > > failures are messages that explicitly failed stellar
>> > > validation/schema
>> > > >>  > > enforcement. There should be relatively few of the second
>> type.
>> > > >>  > >
>> > > >>  > > With respect to 3, why do you want the error hashed? Why not
>> just
>> > > >>  search
>> > > >>  > > for the error text?
>> > > >>  > >
>> > > >>  > > Thanks,
>> > > >>  > > James
>> > > >>  > >
>> > > >>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
>> > > >>  > >> As someone who currently fills the platform engineer role, I
>> can
>> > > give
>> > > >>  > this
>> > > >>  > >> idea a huge +1. My thoughts:
>> > > >>  > >>
>> > > >>  > >> 1. I think it depends on exactly what data is pushed into the
>> > > index
>> > > >>  > (#3).
>> > > >>  > >> However, assuming the errors you proposed recording, I can't
>> see
>> > > huge
>> > > >>  > >> benefits to having more than one dashboard. I would be happy
>> to
>> > be
>> > > >>  > >> persuaded otherwise.
>> > > >>  > >>
>> > > >>  > >> 2. I would say yes, storing the errors in HDFS in addition to
>> > > >>  indexing
>> > > >>  > is
>> > > >>  > >> a good thing. Using METRON-510
>> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a case
>> > > study,
>> > > >>  > there
>> > > >>  > >> is the potential in this environment for attacker-controlled
>> > data
>> > > to
>> > > >>  > >
>> > > >>  > > result
>> > > >>  > >> in processing errors which could be a method of evading
>> security
>> > > >>  > >> monitoring. Once an attack is identified, the long term HDFS
>> > > storage
>> > > >>  > would
>> > > >>  > >> allow better historical analysis for low-and-slow/persistent
>> > > attacks
>> > > >>  > (I'm
>> > > >>  > >> thinking of a method of data exfil that also won't
>> successfully
>> > > get
>> > > >>  > stored
>> > > >>  > >> in Lucene, but is hard to identify over a short period of
>> time).
>> > > >>  > >> - Along this line, I think that there are various parts of
>> > Metron
>> > > >>  > (this
>> > > >>  > >> included) which could benefit from having method of
>> configuring
>> > > data
>> > > >>  > aging
>> > > >>  > >> by bucket in HDFS (Following Nick's comments here
>> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
>> > > >>  > >>
>> > > >>  > >> 3. I would potentially add a hash of the content that failed
>> > > >>  > validation to
>> > > >>  > >> help identify repeats over time with less of a concern that
>> > you'd
>> > > >>  have
>> > > >>  > >
>> > > >>  > > back
>> > > >>  > >> to back failures (i.e. instead of storing the value itself).
>> > > >>  > Additionally,
>> > > >>  > >> I think it's helpful to be able to search all times there
>> was an
>> > > >>  > indexing
>> > > >>  > >> error (instead of it hitting the catch-all).
>> > > >>  > >>
>> > > >>  > >> Jon
>> > > >>  > >>
>> > > >>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <
>> > jsirota@apache.org>
>> > > >>  > wrote:
>> > > >>  > >>
>> > > >>  > >> We already have a capability to capture bolt errors and
>> > validation
>> > > >>  > errors
>> > > >>  > >> and pipe them into a Kafka topic. I want to propose that we
>> > > attach a
>> > > >>  > >> writer topology to the error and validation failed kafka
>> topics
>> > so
>> > > >>  > that we
>> > > >>  > >> can (a) create a new ES index for these errors and (b)
>> create a
>> > > new
>> > > >>  > Kibana
>> > > >>  > >> dashboard to visualize them. The benefit would be that errors
>> > and
>> > > >>  > >> validation failures would be easier to see and analyze.
>> > > >>  > >>
>> > > >>  > >> I am seeking feedback on the following:
>> > > >>  > >>
>> > > >>  > >> - How granular would we want this feature to be? Think we
>> would
>> > > want
>> > > >>  > one
>> > > >>  > >> index/dashboard per source? Or would it be better to collapse
>> > > >>  > everything
>> > > >>  > >> into the same index?
>> > > >>  > >> - Do we care about storing these errors in HDFS as well? Or
>> is
>> > > >>  indexing
>> > > >>  > >> them enough?
>> > > >>  > >> - What types of errors should we record? I am proposing:
>> > > >>  > >>
>> > > >>  > >> For error reporting:
>> > > >>  > >> --Message failed to parse
>> > > >>  > >> --Enrichment failed to enrich
>> > > >>  > >> --Threat intel feed failures
>> > > >>  > >> --Generic catch-all for all other errors
>> > > >>  > >>
>> > > >>  > >> For validation reporting:
>> > > >>  > >> --What part of message failed validation
>> > > >>  > >> --What stellar validator caused the failure
>> > > >>  > >>
>> > > >>  > >> -------------------
>> > > >>  > >> Thank you,
>> > > >>  > >>
>> > > >>  > >> James Sirota
>> > > >>  > >> PPMC- Apache Metron (Incubating)
>> > > >>  > >> jsirota AT apache DOT org
>> > > >>  > >>
>> > > >>  > >> --
>> > > >>  > >>
>> > > >>  > >> Jon
>> > > >>  > >>
>> > > >>  > >> Sent from my mobile device
>> > > >>  > >
>> > > >>  > > -------------------
>> > > >>  > > Thank you,
>> > > >>  > >
>> > > >>  > > James Sirota
>> > > >>  > > PPMC- Apache Metron (Incubating)
>> > > >>  > > jsirota AT apache DOT org
>> > > >>  > >
>> > > >>  > > --
>> > > >>  > >
>> > > >>  > > Jon
>> > > >>  > >
>> > > >>  > > Sent from my mobile device
>> > > >>  >
>> > > >>  > -------------------
>> > > >>  > Thank you,
>> > > >>  >
>> > > >>  > James Sirota
>> > > >>  > PPMC- Apache Metron (Incubating)
>> > > >>  > jsirota AT apache DOT org
>> > > >>  >
>> > > >>  --
>> > > >>
>> > > >>  Jon
>> > > >>
>> > > >>  Sent from my mobile device
>> > >
>> > > -------------------
>> > > Thank you,
>> > >
>> > > James Sirota
>> > > PPMC- Apache Metron (Incubating)
>> > > jsirota AT apache DOT org
>> > >
>> >
>> --
>>
>> Jon
>>
>> Sent from my mobile device
>>
>
>

Re: [DISCUSS] Error Indexing

Posted by Ryan Merriman <me...@gmail.com>.
That use case makes sense to me.  I don't think it will require that much
additional effort either.

On Tue, Jan 24, 2017 at 1:02 PM, Zeolla@GMail.com <ze...@gmail.com> wrote:

> Regarding error vs validation - Either way I'm not very concerned.  I
> initially assumed they would be combined and agree with that approach, but
> splitting them out isn't a very big deal to me either.
>
> Re: Ryan.  Yes, exactly.  In the case of a parser issue (or anywhere else
> where it's not possible to pick out the exact thing causing the issue) it
> would be a hash of the complete message.
>
> Regarding the architecture, I mostly agree with James except that I think
> step 3 needs to also be able to somehow group errors via the original
> data (identify
> replays, identify repeat issues with data in a specific field, issues with
> consistently different data, etc.).  This is essentially the first step of
> troubleshooting, which I assume you are doing if you're looking at the
> error dashboard.
>
> If the hash gets moved out of the initial implementation, I'm fairly
> certain you lose this ability.  The point here isn't to handle long fields
> (although that's a benefit of this approach), it's to attach a unique
> identifier to the error/validation issue message that links it to the
> original problem.  I'd be happy to consider alternative solutions to this
> problem (for instance, actually sending across the data itself) I just
> haven't been able to think of another way to do this that I like better.
>
> Jon
>
> On Tue, Jan 24, 2017 at 1:13 PM Ryan Merriman <me...@gmail.com> wrote:
>
> > We also need a JIRA for any install/Ansible/MPack work needed.
> >
> > On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <js...@apache.org>
> wrote:
> >
> > > Now that I had some time to think about it I would collapse all error
> and
> > > validation topics into one.  We can differentiate between different
> views
> > > of the data (split by error source etc) via Kibana dashboards.  I would
> > > implement this feature incrementally.  First I would modify all the
> bolts
> > > to log to a single topic.  Second, I would get the error indexing done
> by
> > > attaching the indexing topology to the error topic. Third I would
> create
> > > the necessary dashboards to view errors and validation failures by
> > source.
> > > Lastly, I would file a follow-on JIRA to introduce hashing of errors or
> > > fields that are too long.  It seems like a separate feature that we
> need
> > to
> > > think through.  We may need a stellar function around that.
> > >
> > > Thanks,
> > > James
> > >
> > > 24.01.2017, 10:25, "Ryan Merriman" <me...@gmail.com>:
> > > > I understand what Jon is talking about. He's proposing we hash the
> > value
> > > > that caused the error, not necessarily the error message itself. For
> an
> > > > enrichment this is easy. Just pass along the field value that failed
> > > > enrichment. For other cases the field that caused the error may not
> be
> > so
> > > > obvious. Take parser validation for example. The message is validated
> > as
> > > > a whole and it may not be easy to determine which field is the cause.
> > In
> > > > that case would a hash of the whole message work?
> > > >
> > > > There is a broader architectural discussion that needs to happen
> before
> > > we
> > > > can implement this. Currently we have an indexing topology that reads
> > > from
> > > > 1 topic and writes messages to ES but errors are written to several
> > > > different topics:
> > > >
> > > >    - parser_error
> > > >    - parser_invalid
> > > >    - enrichments_error
> > > >    - threatintel_error
> > > >    - indexing_error
> > > >
> > > > I can see 4 possible approaches to implementing this:
> > > >
> > > >    1. Create an index topology for each error topic
> > > >       1. Good because we can easily reuse the indexing topology and
> > would
> > > >       require the least development effort
> > > >       2. Bad because it would consume a lot of extra worker slots
> > > >    2. Move the topic name into the error JSON message as a new
> > > "error_type"
> > > >    field and write all messages to the indexing topic
> > > >       1. Good because we don't need to create a new topology
> > > >       2. Bad because we would be flowing data and errors through the
> > same
> > > >       topology. A spike in errors could affect message indexing.
> > > >    3. Compromise between 1 and 2. Create another indexing topology
> that
> > > is
> > > >    dedicated to indexing errors. Move the topic name into the error
> > JSON
> > > >    message as a new "error_type" field and write all errors to a
> single
> > > error
> > > >    topic.
> > > >    4. Write a completely new topology with multiple spouts (1 for
> each
> > > >    error type listed above) that all feed into a single
> > > BulkMessageWriterBolt.
> > > >       1. Good because the current topologies would not need to change
> > > >       2. Bad because it would require the most development effort,
> > would
> > > >       not reuse existing topologies and takes up more worker slots
> > than 3
> > > >
> > > > Are there other approaches I haven't thought of? I think 1 and 2 are
> > off
> > > > the table because they are shortcuts and not good long-term
> solutions.
> > 3
> > > > would be my choice because it introduces less complexity than 4.
> > > Thoughts?
> > > >
> > > > Ryan
> > > >
> > > > On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <ze...@gmail.com>
> > > wrote:
> > > >
> > > >>  In that case the hash would be of the value in the IP field, such
> as
> > > >>  sha3(8.8.8.8).
> > > >>
> > > >>  Jon
> > > >>
> > > >>  On Mon, Jan 23, 2017, 6:41 PM James Sirota <js...@apache.org>
> > wrote:
> > > >>
> > > >>  > Jon,
> > > >>  >
> > > >>  > I am still not entirely following why we would want to use
> hashing.
> > > For
> > > >>  > example if my error is "Your IP field is invalid and failed
> > > validation"
> > > >>  > hashing this error string will always result in the same hash.
> Why
> > > not
> > > >>  > just use the actual error string? Can you provide an example
> where
> > > you
> > > >>  > would use it?
> > > >>  >
> > > >>  > Thanks,
> > > >>  > James
> > > >>  >
> > > >>  > 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
> > > >>  > > For 1 - I'm good with that.
> > > >>  > >
> > > >>  > > I'm talking about hashing the relevant content itself not the
> > > error.
> > > >>  Some
> > > >>  > > benefits are (1) minimize load on search index (there's minimal
> > > benefit
> > > >>  > in
> > > >>  > > spending the CPU and disk to keep it at full fidelity (tokenize
> > and
> > > >>  > store))
> > > >>  > > (2) provide something to key on for dashboards (assuming a good
> > > hash
> > > >>  > > algorithm that avoids collisions and is second preimage
> > resistant)
> > > and
> > > >>  > (3)
> > > >>  > > specific to errors, if the issue is that it failed to index, a
> > hash
> > > >>  gives
> > > >>  > > us some protection that the issue will not occur twice.
> > > >>  > >
> > > >>  > > Jon
> > > >>  > >
> > > >>  > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <jsirota@apache.org
> >
> > > wrote:
> > > >>  > >
> > > >>  > > Jon,
> > > >>  > >
> > > >>  > > With regards to 1, collapsing to a single dashboard for each
> > would
> > > be
> > > >>  > > fine. So we would have one error index and one "failed to
> > validate"
> > > >>  > > index. The distinction is that errors would be things that went
> > > wrong
> > > >>  > > during stream processing (failed to parse, etc...), while
> > > validation
> > > >>  > > failures are messages that explicitly failed stellar
> > > validation/schema
> > > >>  > > enforcement. There should be relatively few of the second type.
> > > >>  > >
> > > >>  > > With respect to 3, why do you want the error hashed? Why not
> just
> > > >>  search
> > > >>  > > for the error text?
> > > >>  > >
> > > >>  > > Thanks,
> > > >>  > > James
> > > >>  > >
> > > >>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> > > >>  > >> As someone who currently fills the platform engineer role, I
> can
> > > give
> > > >>  > this
> > > >>  > >> idea a huge +1. My thoughts:
> > > >>  > >>
> > > >>  > >> 1. I think it depends on exactly what data is pushed into the
> > > index
> > > >>  > (#3).
> > > >>  > >> However, assuming the errors you proposed recording, I can't
> see
> > > huge
> > > >>  > >> benefits to having more than one dashboard. I would be happy
> to
> > be
> > > >>  > >> persuaded otherwise.
> > > >>  > >>
> > > >>  > >> 2. I would say yes, storing the errors in HDFS in addition to
> > > >>  indexing
> > > >>  > is
> > > >>  > >> a good thing. Using METRON-510
> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a case
> > > study,
> > > >>  > there
> > > >>  > >> is the potential in this environment for attacker-controlled
> > data
> > > to
> > > >>  > >
> > > >>  > > result
> > > >>  > >> in processing errors which could be a method of evading
> security
> > > >>  > >> monitoring. Once an attack is identified, the long term HDFS
> > > storage
> > > >>  > would
> > > >>  > >> allow better historical analysis for low-and-slow/persistent
> > > attacks
> > > >>  > (I'm
> > > >>  > >> thinking of a method of data exfil that also won't
> successfully
> > > get
> > > >>  > stored
> > > >>  > >> in Lucene, but is hard to identify over a short period of
> time).
> > > >>  > >> - Along this line, I think that there are various parts of
> > Metron
> > > >>  > (this
> > > >>  > >> included) which could benefit from having method of
> configuring
> > > data
> > > >>  > aging
> > > >>  > >> by bucket in HDFS (Following Nick's comments here
> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
> > > >>  > >>
> > > >>  > >> 3. I would potentially add a hash of the content that failed
> > > >>  > validation to
> > > >>  > >> help identify repeats over time with less of a concern that
> > you'd
> > > >>  have
> > > >>  > >
> > > >>  > > back
> > > >>  > >> to back failures (i.e. instead of storing the value itself).
> > > >>  > Additionally,
> > > >>  > >> I think it's helpful to be able to search all times there was
> an
> > > >>  > indexing
> > > >>  > >> error (instead of it hitting the catch-all).
> > > >>  > >>
> > > >>  > >> Jon
> > > >>  > >>
> > > >>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <
> > jsirota@apache.org>
> > > >>  > wrote:
> > > >>  > >>
> > > >>  > >> We already have a capability to capture bolt errors and
> > validation
> > > >>  > errors
> > > >>  > >> and pipe them into a Kafka topic. I want to propose that we
> > > attach a
> > > >>  > >> writer topology to the error and validation failed kafka
> topics
> > so
> > > >>  > that we
> > > >>  > >> can (a) create a new ES index for these errors and (b) create
> a
> > > new
> > > >>  > Kibana
> > > >>  > >> dashboard to visualize them. The benefit would be that errors
> > and
> > > >>  > >> validation failures would be easier to see and analyze.
> > > >>  > >>
> > > >>  > >> I am seeking feedback on the following:
> > > >>  > >>
> > > >>  > >> - How granular would we want this feature to be? Think we
> would
> > > want
> > > >>  > one
> > > >>  > >> index/dashboard per source? Or would it be better to collapse
> > > >>  > everything
> > > >>  > >> into the same index?
> > > >>  > >> - Do we care about storing these errors in HDFS as well? Or is
> > > >>  indexing
> > > >>  > >> them enough?
> > > >>  > >> - What types of errors should we record? I am proposing:
> > > >>  > >>
> > > >>  > >> For error reporting:
> > > >>  > >> --Message failed to parse
> > > >>  > >> --Enrichment failed to enrich
> > > >>  > >> --Threat intel feed failures
> > > >>  > >> --Generic catch-all for all other errors
> > > >>  > >>
> > > >>  > >> For validation reporting:
> > > >>  > >> --What part of message failed validation
> > > >>  > >> --What stellar validator caused the failure
> > > >>  > >>
> > > >>  > >> -------------------
> > > >>  > >> Thank you,
> > > >>  > >>
> > > >>  > >> James Sirota
> > > >>  > >> PPMC- Apache Metron (Incubating)
> > > >>  > >> jsirota AT apache DOT org
> > > >>  > >>
> > > >>  > >> --
> > > >>  > >>
> > > >>  > >> Jon
> > > >>  > >>
> > > >>  > >> Sent from my mobile device
> > > >>  > >
> > > >>  > > -------------------
> > > >>  > > Thank you,
> > > >>  > >
> > > >>  > > James Sirota
> > > >>  > > PPMC- Apache Metron (Incubating)
> > > >>  > > jsirota AT apache DOT org
> > > >>  > >
> > > >>  > > --
> > > >>  > >
> > > >>  > > Jon
> > > >>  > >
> > > >>  > > Sent from my mobile device
> > > >>  >
> > > >>  > -------------------
> > > >>  > Thank you,
> > > >>  >
> > > >>  > James Sirota
> > > >>  > PPMC- Apache Metron (Incubating)
> > > >>  > jsirota AT apache DOT org
> > > >>  >
> > > >>  --
> > > >>
> > > >>  Jon
> > > >>
> > > >>  Sent from my mobile device
> > >
> > > -------------------
> > > Thank you,
> > >
> > > James Sirota
> > > PPMC- Apache Metron (Incubating)
> > > jsirota AT apache DOT org
> > >
> >
> --
>
> Jon
>
> Sent from my mobile device
>

Re: [DISCUSS] Error Indexing

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
Regarding error vs validation - Either way I'm not very concerned.  I
initially assumed they would be combined and agree with that approach, but
splitting them out isn't a very big deal to me either.

Re: Ryan.  Yes, exactly.  In the case of a parser issue (or anywhere else
where it's not possible to pick out the exact thing causing the issue) it
would be a hash of the complete message.

Regarding the architecture, I mostly agree with James, except that I think
step 3 also needs to be able to group errors by the original data (to
identify replays, repeated issues with data in a specific field, issues with
consistently different data, etc.).  This is essentially the first step of
troubleshooting, which I assume you are doing if you're looking at the
error dashboard.

If the hash gets moved out of the initial implementation, I'm fairly
certain you lose this ability.  The point here isn't to handle long fields
(although that's a benefit of this approach), it's to attach a unique
identifier to the error/validation issue message that links it to the
original problem.  I'd be happy to consider alternative solutions to this
problem (for instance, actually sending across the data itself); I just
haven't been able to think of another way to do this that I like better.
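
To make the identifier idea concrete, here is roughly what I have in mind.
This is just a sketch, not actual Metron code, and the "raw_message_hash"
field name is only for illustration:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class RawMessageHash {

      // Digest of the original raw message; attached to the error record
      // (e.g. as "raw_message_hash") so repeats of the same offending input
      // can be counted and linked back to the thing that actually failed.
      public static String hashOf(byte[] rawMessage) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(rawMessage);
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString();
      }

      public static void main(String[] args) throws Exception {
        String raw = "<the original message that failed to parse>";
        System.out.println(hashOf(raw.getBytes(StandardCharsets.UTF_8)));
      }
    }

The same digest computed at error time and at search time is what lets you
count repeats of the same offending input without round-tripping the data
itself.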

Jon

On Tue, Jan 24, 2017 at 1:13 PM Ryan Merriman <me...@gmail.com> wrote:

> We also need a JIRA for any install/Ansible/MPack work needed.
>
> On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <js...@apache.org> wrote:
>
> > Now that I had some time to think about it I would collapse all error and
> > validation topics into one.  We can differentiate between different views
> > of the data (split by error source etc) via Kibana dashboards.  I would
> > implement this feature incrementally.  First I would modify all the bolts
> > to log to a single topic.  Second, I would get the error indexing done by
> > attaching the indexing topology to the error topic. Third I would create
> > the necessary dashboards to view errors and validation failures by
> source.
> > Lastly, I would file a follow-on JIRA to introduce hashing of errors or
> > fields that are too long.  It seems like a separate feature that we need
> to
> > think through.  We may need a stellar function around that.
> >
> > Thanks,
> > James
> >
> > 24.01.2017, 10:25, "Ryan Merriman" <me...@gmail.com>:
> > > I understand what Jon is talking about. He's proposing we hash the
> value
> > > that caused the error, not necessarily the error message itself. For an
> > > enrichment this is easy. Just pass along the field value that failed
> > > enrichment. For other cases the field that caused the error may not be
> so
> > > obvious. Take parser validation for example. The message is validated
> as
> > > a whole and it may not be easy to determine which field is the cause.
> In
> > > that case would a hash of the whole message work?
> > >
> > > There is a broader architectural discussion that needs to happen before
> > we
> > > can implement this. Currently we have an indexing topology that reads
> > from
> > > 1 topic and writes messages to ES but errors are written to several
> > > different topics:
> > >
> > >    - parser_error
> > >    - parser_invalid
> > >    - enrichments_error
> > >    - threatintel_error
> > >    - indexing_error
> > >
> > > I can see 4 possible approaches to implementing this:
> > >
> > >    1. Create an index topology for each error topic
> > >       1. Good because we can easily reuse the indexing topology and
> would
> > >       require the least development effort
> > >       2. Bad because it would consume a lot of extra worker slots
> > >    2. Move the topic name into the error JSON message as a new
> > "error_type"
> > >    field and write all messages to the indexing topic
> > >       1. Good because we don't need to create a new topology
> > >       2. Bad because we would be flowing data and errors through the
> same
> > >       topology. A spike in errors could affect message indexing.
> > >    3. Compromise between 1 and 2. Create another indexing topology that
> > is
> > >    dedicated to indexing errors. Move the topic name into the error
> JSON
> > >    message as a new "error_type" field and write all errors to a single
> > error
> > >    topic.
> > >    4. Write a completely new topology with multiple spouts (1 for each
> > >    error type listed above) that all feed into a single
> > BulkMessageWriterBolt.
> > >       1. Good because the current topologies would not need to change
> > >       2. Bad because it would require the most development effort,
> would
> > >       not reuse existing topologies and takes up more worker slots
> than 3
> > >
> > > Are there other approaches I haven't thought of? I think 1 and 2 are
> off
> > > the table because they are shortcuts and not good long-term solutions.
> 3
> > > would be my choice because it introduces less complexity than 4.
> > Thoughts?
> > >
> > > Ryan
> > >
> > > On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <ze...@gmail.com>
> > wrote:
> > >
> > >>  In that case the hash would be of the value in the IP field, such as
> > >>  sha3(8.8.8.8).
> > >>
> > >>  Jon
> > >>
> > >>  On Mon, Jan 23, 2017, 6:41 PM James Sirota <js...@apache.org>
> wrote:
> > >>
> > >>  > Jon,
> > >>  >
> > >>  > I am still not entirely following why we would want to use hashing.
> > For
> > >>  > example if my error is "Your IP field is invalid and failed
> > validation"
> > >>  > hashing this error string will always result in the same hash. Why
> > not
> > >>  > just use the actual error string? Can you provide an example where
> > you
> > >>  > would use it?
> > >>  >
> > >>  > Thanks,
> > >>  > James
> > >>  >
> > >>  > 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
> > >>  > > For 1 - I'm good with that.
> > >>  > >
> > >>  > > I'm talking about hashing the relevant content itself not the
> > error.
> > >>  Some
> > >>  > > benefits are (1) minimize load on search index (there's minimal
> > benefit
> > >>  > in
> > >>  > > spending the CPU and disk to keep it at full fidelity (tokenize
> and
> > >>  > store))
> > >>  > > (2) provide something to key on for dashboards (assuming a good
> > hash
> > >>  > > algorithm that avoids collisions and is second preimage
> resistant)
> > and
> > >>  > (3)
> > >>  > > specific to errors, if the issue is that it failed to index, a
> hash
> > >>  gives
> > >>  > > us some protection that the issue will not occur twice.
> > >>  > >
> > >>  > > Jon
> > >>  > >
> > >>  > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <js...@apache.org>
> > wrote:
> > >>  > >
> > >>  > > Jon,
> > >>  > >
> > >>  > > With regards to 1, collapsing to a single dashboard for each
> would
> > be
> > >>  > > fine. So we would have one error index and one "failed to
> validate"
> > >>  > > index. The distinction is that errors would be things that went
> > wrong
> > >>  > > during stream processing (failed to parse, etc...), while
> > validation
> > >>  > > failures are messages that explicitly failed stellar
> > validation/schema
> > >>  > > enforcement. There should be relatively few of the second type.
> > >>  > >
> > >>  > > With respect to 3, why do you want the error hashed? Why not just
> > >>  search
> > >>  > > for the error text?
> > >>  > >
> > >>  > > Thanks,
> > >>  > > James
> > >>  > >
> > >>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> > >>  > >> As someone who currently fills the platform engineer role, I can
> > give
> > >>  > this
> > >>  > >> idea a huge +1. My thoughts:
> > >>  > >>
> > >>  > >> 1. I think it depends on exactly what data is pushed into the
> > index
> > >>  > (#3).
> > >>  > >> However, assuming the errors you proposed recording, I can't see
> > huge
> > >>  > >> benefits to having more than one dashboard. I would be happy to
> be
> > >>  > >> persuaded otherwise.
> > >>  > >>
> > >>  > >> 2. I would say yes, storing the errors in HDFS in addition to
> > >>  indexing
> > >>  > is
> > >>  > >> a good thing. Using METRON-510
> > >>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a case
> > study,
> > >>  > there
> > >>  > >> is the potential in this environment for attacker-controlled
> data
> > to
> > >>  > >
> > >>  > > result
> > >>  > >> in processing errors which could be a method of evading security
> > >>  > >> monitoring. Once an attack is identified, the long term HDFS
> > storage
> > >>  > would
> > >>  > >> allow better historical analysis for low-and-slow/persistent
> > attacks
> > >>  > (I'm
> > >>  > >> thinking of a method of data exfil that also won't successfully
> > get
> > >>  > stored
> > >>  > >> in Lucene, but is hard to identify over a short period of time).
> > >>  > >> - Along this line, I think that there are various parts of
> Metron
> > >>  > (this
> > >>  > >> included) which could benefit from having method of configuring
> > data
> > >>  > aging
> > >>  > >> by bucket in HDFS (Following Nick's comments here
> > >>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
> > >>  > >>
> > >>  > >> 3. I would potentially add a hash of the content that failed
> > >>  > validation to
> > >>  > >> help identify repeats over time with less of a concern that
> you'd
> > >>  have
> > >>  > >
> > >>  > > back
> > >>  > >> to back failures (i.e. instead of storing the value itself).
> > >>  > Additionally,
> > >>  > >> I think it's helpful to be able to search all times there was an
> > >>  > indexing
> > >>  > >> error (instead of it hitting the catch-all).
> > >>  > >>
> > >>  > >> Jon
> > >>  > >>
> > >>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <
> jsirota@apache.org>
> > >>  > wrote:
> > >>  > >>
> > >>  > >> We already have a capability to capture bolt errors and
> validation
> > >>  > errors
> > >>  > >> and pipe them into a Kafka topic. I want to propose that we
> > attach a
> > >>  > >> writer topology to the error and validation failed kafka topics
> so
> > >>  > that we
> > >>  > >> can (a) create a new ES index for these errors and (b) create a
> > new
> > >>  > Kibana
> > >>  > >> dashboard to visualize them. The benefit would be that errors
> and
> > >>  > >> validation failures would be easier to see and analyze.
> > >>  > >>
> > >>  > >> I am seeking feedback on the following:
> > >>  > >>
> > >>  > >> - How granular would we want this feature to be? Think we would
> > want
> > >>  > one
> > >>  > >> index/dashboard per source? Or would it be better to collapse
> > >>  > everything
> > >>  > >> into the same index?
> > >>  > >> - Do we care about storing these errors in HDFS as well? Or is
> > >>  indexing
> > >>  > >> them enough?
> > >>  > >> - What types of errors should we record? I am proposing:
> > >>  > >>
> > >>  > >> For error reporting:
> > >>  > >> --Message failed to parse
> > >>  > >> --Enrichment failed to enrich
> > >>  > >> --Threat intel feed failures
> > >>  > >> --Generic catch-all for all other errors
> > >>  > >>
> > >>  > >> For validation reporting:
> > >>  > >> --What part of message failed validation
> > >>  > >> --What stellar validator caused the failure
> > >>  > >>
> > >>  > >> -------------------
> > >>  > >> Thank you,
> > >>  > >>
> > >>  > >> James Sirota
> > >>  > >> PPMC- Apache Metron (Incubating)
> > >>  > >> jsirota AT apache DOT org
> > >>  > >>
> > >>  > >> --
> > >>  > >>
> > >>  > >> Jon
> > >>  > >>
> > >>  > >> Sent from my mobile device
> > >>  > >
> > >>  > > -------------------
> > >>  > > Thank you,
> > >>  > >
> > >>  > > James Sirota
> > >>  > > PPMC- Apache Metron (Incubating)
> > >>  > > jsirota AT apache DOT org
> > >>  > >
> > >>  > > --
> > >>  > >
> > >>  > > Jon
> > >>  > >
> > >>  > > Sent from my mobile device
> > >>  >
> > >>  > -------------------
> > >>  > Thank you,
> > >>  >
> > >>  > James Sirota
> > >>  > PPMC- Apache Metron (Incubating)
> > >>  > jsirota AT apache DOT org
> > >>  >
> > >>  --
> > >>
> > >>  Jon
> > >>
> > >>  Sent from my mobile device
> >
> > -------------------
> > Thank you,
> >
> > James Sirota
> > PPMC- Apache Metron (Incubating)
> > jsirota AT apache DOT org
> >
>
-- 

Jon

Sent from my mobile device

Re: [DISCUSS] Error Indexing

Posted by Ryan Merriman <me...@gmail.com>.
We also need a JIRA for any install/Ansible/MPack work needed.

On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <js...@apache.org> wrote:

> Now that I had some time to think about it I would collapse all error and
> validation topics into one.  We can differentiate between different views
> of the data (split by error source etc) via Kibana dashboards.  I would
> implement this feature incrementally.  First I would modify all the bolts
> to log to a single topic.  Second, I would get the error indexing done by
> attaching the indexing topology to the error topic. Third I would create
> the necessary dashboards to view errors and validation failures by source.
> Lastly, I would file a follow-on JIRA to introduce hashing of errors or
> fields that are too long.  It seems like a separate feature that we need to
> think through.  We may need a stellar function around that.
>
> Thanks,
> James
>
> 24.01.2017, 10:25, "Ryan Merriman" <me...@gmail.com>:
> > I understand what Jon is talking about. He's proposing we hash the value
> > that caused the error, not necessarily the error message itself. For an
> > enrichment this is easy. Just pass along the field value that failed
> > enrichment. For other cases the field that caused the error may not be so
> > obvious. Take parser validation for example. The message is validated as
> > a whole and it may not be easy to determine which field is the cause. In
> > that case would a hash of the whole message work?
> >
> > There is a broader architectural discussion that needs to happen before
> we
> > can implement this. Currently we have an indexing topology that reads
> from
> > 1 topic and writes messages to ES but errors are written to several
> > different topics:
> >
> >    - parser_error
> >    - parser_invalid
> >    - enrichments_error
> >    - threatintel_error
> >    - indexing_error
> >
> > I can see 4 possible approaches to implementing this:
> >
> >    1. Create an index topology for each error topic
> >       1. Good because we can easily reuse the indexing topology and would
> >       require the least development effort
> >       2. Bad because it would consume a lot of extra worker slots
> >    2. Move the topic name into the error JSON message as a new
> "error_type"
> >    field and write all messages to the indexing topic
> >       1. Good because we don't need to create a new topology
> >       2. Bad because we would be flowing data and errors through the same
> >       topology. A spike in errors could affect message indexing.
> >    3. Compromise between 1 and 2. Create another indexing topology that
> is
> >    dedicated to indexing errors. Move the topic name into the error JSON
> >    message as a new "error_type" field and write all errors to a single
> error
> >    topic.
> >    4. Write a completely new topology with multiple spouts (1 for each
> >    error type listed above) that all feed into a single
> BulkMessageWriterBolt.
> >       1. Good because the current topologies would not need to change
> >       2. Bad because it would require the most development effort, would
> >       not reuse existing topologies and takes up more worker slots than 3
> >
> > Are there other approaches I haven't thought of? I think 1 and 2 are off
> > the table because they are shortcuts and not good long-term solutions. 3
> > would be my choice because it introduces less complexity than 4.
> Thoughts?
> >
> > Ryan
> >
> > On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <ze...@gmail.com>
> wrote:
> >
> >>  In that case the hash would be of the value in the IP field, such as
> >>  sha3(8.8.8.8).
> >>
> >>  Jon
> >>
> >>  On Mon, Jan 23, 2017, 6:41 PM James Sirota <js...@apache.org> wrote:
> >>
> >>  > Jon,
> >>  >
> >>  > I am still not entirely following why we would want to use hashing.
> For
> >>  > example if my error is "Your IP field is invalid and failed
> validation"
> >>  > hashing this error string will always result in the same hash. Why
> not
> >>  > just use the actual error string? Can you provide an example where
> you
> >>  > would use it?
> >>  >
> >>  > Thanks,
> >>  > James
> >>  >
> >>  > 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
> >>  > > For 1 - I'm good with that.
> >>  > >
> >>  > > I'm talking about hashing the relevant content itself not the
> error.
> >>  Some
> >>  > > benefits are (1) minimize load on search index (there's minimal
> benefit
> >>  > in
> >>  > > spending the CPU and disk to keep it at full fidelity (tokenize and
> >>  > store))
> >>  > > (2) provide something to key on for dashboards (assuming a good
> hash
> >>  > > algorithm that avoids collisions and is second preimage resistant)
> and
> >>  > (3)
> >>  > > specific to errors, if the issue is that it failed to index, a hash
> >>  gives
> >>  > > us some protection that the issue will not occur twice.
> >>  > >
> >>  > > Jon
> >>  > >
> >>  > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <js...@apache.org>
> wrote:
> >>  > >
> >>  > > Jon,
> >>  > >
> >>  > > With regards to 1, collapsing to a single dashboard for each would
> be
> >>  > > fine. So we would have one error index and one "failed to validate"
> >>  > > index. The distinction is that errors would be things that went
> wrong
> >>  > > during stream processing (failed to parse, etc...), while
> validation
> >>  > > failures are messages that explicitly failed stellar
> validation/schema
> >>  > > enforcement. There should be relatively few of the second type.
> >>  > >
> >>  > > With respect to 3, why do you want the error hashed? Why not just
> >>  search
> >>  > > for the error text?
> >>  > >
> >>  > > Thanks,
> >>  > > James
> >>  > >
> >>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> >>  > >> As someone who currently fills the platform engineer role, I can
> give
> >>  > this
> >>  > >> idea a huge +1. My thoughts:
> >>  > >>
> >>  > >> 1. I think it depends on exactly what data is pushed into the
> index
> >>  > (#3).
> >>  > >> However, assuming the errors you proposed recording, I can't see
> huge
> >>  > >> benefits to having more than one dashboard. I would be happy to be
> >>  > >> persuaded otherwise.
> >>  > >>
> >>  > >> 2. I would say yes, storing the errors in HDFS in addition to
> >>  indexing
> >>  > is
> >>  > >> a good thing. Using METRON-510
> >>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a case
> study,
> >>  > there
> >>  > >> is the potential in this environment for attacker-controlled data
> to
> >>  > >
> >>  > > result
> >>  > >> in processing errors which could be a method of evading security
> >>  > >> monitoring. Once an attack is identified, the long term HDFS
> storage
> >>  > would
> >>  > >> allow better historical analysis for low-and-slow/persistent
> attacks
> >>  > (I'm
> >>  > >> thinking of a method of data exfil that also won't successfully
> get
> >>  > stored
> >>  > >> in Lucene, but is hard to identify over a short period of time).
> >>  > >> - Along this line, I think that there are various parts of Metron
> >>  > (this
> >>  > >> included) which could benefit from having method of configuring
> data
> >>  > aging
> >>  > >> by bucket in HDFS (Following Nick's comments here
> >>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
> >>  > >>
> >>  > >> 3. I would potentially add a hash of the content that failed
> >>  > validation to
> >>  > >> help identify repeats over time with less of a concern that you'd
> >>  have
> >>  > >
> >>  > > back
> >>  > >> to back failures (i.e. instead of storing the value itself).
> >>  > Additionally,
> >>  > >> I think it's helpful to be able to search all times there was an
> >>  > indexing
> >>  > >> error (instead of it hitting the catch-all).
> >>  > >>
> >>  > >> Jon
> >>  > >>
> >>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <js...@apache.org>
> >>  > wrote:
> >>  > >>
> >>  > >> We already have a capability to capture bolt errors and validation
> >>  > errors
> >>  > >> and pipe them into a Kafka topic. I want to propose that we
> attach a
> >>  > >> writer topology to the error and validation failed kafka topics so
> >>  > that we
> >>  > >> can (a) create a new ES index for these errors and (b) create a
> new
> >>  > Kibana
> >>  > >> dashboard to visualize them. The benefit would be that errors and
> >>  > >> validation failures would be easier to see and analyze.
> >>  > >>
> >>  > >> I am seeking feedback on the following:
> >>  > >>
> >>  > >> - How granular would we want this feature to be? Think we would
> want
> >>  > one
> >>  > >> index/dashboard per source? Or would it be better to collapse
> >>  > everything
> >>  > >> into the same index?
> >>  > >> - Do we care about storing these errors in HDFS as well? Or is
> >>  indexing
> >>  > >> them enough?
> >>  > >> - What types of errors should we record? I am proposing:
> >>  > >>
> >>  > >> For error reporting:
> >>  > >> --Message failed to parse
> >>  > >> --Enrichment failed to enrich
> >>  > >> --Threat intel feed failures
> >>  > >> --Generic catch-all for all other errors
> >>  > >>
> >>  > >> For validation reporting:
> >>  > >> --What part of message failed validation
> >>  > >> --What stellar validator caused the failure
> >>  > >>
> >>  > >> -------------------
> >>  > >> Thank you,
> >>  > >>
> >>  > >> James Sirota
> >>  > >> PPMC- Apache Metron (Incubating)
> >>  > >> jsirota AT apache DOT org
> >>  > >>
> >>  > >> --
> >>  > >>
> >>  > >> Jon
> >>  > >>
> >>  > >> Sent from my mobile device
> >>  > >
> >>  > > -------------------
> >>  > > Thank you,
> >>  > >
> >>  > > James Sirota
> >>  > > PPMC- Apache Metron (Incubating)
> >>  > > jsirota AT apache DOT org
> >>  > >
> >>  > > --
> >>  > >
> >>  > > Jon
> >>  > >
> >>  > > Sent from my mobile device
> >>  >
> >>  > -------------------
> >>  > Thank you,
> >>  >
> >>  > James Sirota
> >>  > PPMC- Apache Metron (Incubating)
> >>  > jsirota AT apache DOT org
> >>  >
> >>  --
> >>
> >>  Jon
> >>
> >>  Sent from my mobile device
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>

Re: [DISCUSS] Error Indexing

Posted by James Sirota <js...@apache.org>.
Now that I've had some time to think about it, I would collapse all error and validation topics into one.  We can differentiate between different views of the data (split by error source, etc.) via Kibana dashboards.  I would implement this feature incrementally.  First, I would modify all the bolts to log to a single topic.  Second, I would get the error indexing done by attaching the indexing topology to the error topic.  Third, I would create the necessary dashboards to view errors and validation failures by source.  Lastly, I would file a follow-on JIRA to introduce hashing of errors or fields that are too long.  That seems like a separate feature that we need to think through, and we may need a Stellar function around it.
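
For that follow-on piece, the rough shape I have in mind, whether it ends up
as a Stellar function or a plain utility, is something like the sketch below.
The length threshold and the choice of SHA-256 are placeholders, not a
proposal:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Base64;

    public class FieldHasher {

      private static final int MAX_FIELD_LENGTH = 1024;  // arbitrary cutoff, illustration only

      // Replace an overly long field value with a fixed-length digest so the
      // error record stays small and indexable but can still be keyed on.
      public static String hashIfTooLong(String value) throws Exception {
        if (value == null || value.length() <= MAX_FIELD_LENGTH) {
          return value;
        }
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(value.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(hash);  // hex would work just as well
      }
    }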

Thanks,
James 

24.01.2017, 10:25, "Ryan Merriman" <me...@gmail.com>:
> I understand what Jon is talking about. He's proposing we hash the value
> that caused the error, not necessarily the error message itself. For an
> enrichment this is easy. Just pass along the field value that failed
> enrichment. For other cases the field that caused the error may not be so
> obvious. Take parser validation for example. The message is validated as
> a whole and it may not be easy to determine which field is the cause. In
> that case would a hash of the whole message work?
>
> There is a broader architectural discussion that needs to happen before we
> can implement this. Currently we have an indexing topology that reads from
> 1 topic and writes messages to ES but errors are written to several
> different topics:
>
>    - parser_error
>    - parser_invalid
>    - enrichments_error
>    - threatintel_error
>    - indexing_error
>
> I can see 4 possible approaches to implementing this:
>
>    1. Create an index topology for each error topic
>       1. Good because we can easily reuse the indexing topology and would
>       require the least development effort
>       2. Bad because it would consume a lot of extra worker slots
>    2. Move the topic name into the error JSON message as a new "error_type"
>    field and write all messages to the indexing topic
>       1. Good because we don't need to create a new topology
>       2. Bad because we would be flowing data and errors through the same
>       topology. A spike in errors could affect message indexing.
>    3. Compromise between 1 and 2. Create another indexing topology that is
>    dedicated to indexing errors. Move the topic name into the error JSON
>    message as a new "error_type" field and write all errors to a single error
>    topic.
>    4. Write a completely new topology with multiple spouts (1 for each
>    error type listed above) that all feed into a single BulkMessageWriterBolt.
>       1. Good because the current topologies would not need to change
>       2. Bad because it would require the most development effort, would
>       not reuse existing topologies and takes up more worker slots than 3
>
> Are there other approaches I haven't thought of? I think 1 and 2 are off
> the table because they are shortcuts and not good long-term solutions. 3
> would be my choice because it introduces less complexity than 4. Thoughts?
>
> Ryan
>
> On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <ze...@gmail.com> wrote:
>
>> �In that case the hash would be of the value in the IP field, such as
>> �sha3(8.8.8.8).
>>
>> �Jon
>>
>> �On Mon, Jan 23, 2017, 6:41 PM James Sirota <js...@apache.org> wrote:
>>
>> �> Jon,
>> �>
>> �> I am still not entirely following why we would want to use hashing. For
>> �> example if my error is "Your IP field is invalid and failed validation"
>> �> hashing this error string will always result in the same hash. Why not
>> �> just use the actual error string? Can you provide an example where you
>> �> would use it?
>> �>
>> �> Thanks,
>> �> James
>> �>
>> �> 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
>> �> > For 1 - I'm good with that.
>> �> >
>> �> > I'm talking about hashing the relevant content itself not the error.
>> �Some
>> �> > benefits are (1) minimize load on search index (there's minimal benefit
>> �> in
>> �> > spending the CPU and disk to keep it at full fidelity (tokenize and
>> �> store))
>> �> > (2) provide something to key on for dashboards (assuming a good hash
>> �> > algorithm that avoids collisions and is second preimage resistant) and
>> �> (3)
>> �> > specific to errors, if the issue is that it failed to index, a hash
>> �gives
>> �> > us some protection that the issue will not occur twice.
>> �> >
>> �> > Jon
>> �> >
>> �> > On Mon, Jan 23, 2017, 2:47 PM James Sirota <js...@apache.org> wrote:
>> �> >
>> �> > Jon,
>> �> >
>> �> > With regards to 1, collapsing to a single dashboard for each would be
>> �> > fine. So we would have one error index and one "failed to validate"
>> �> > index. The distinction is that errors would be things that went wrong
>> �> > during stream processing (failed to parse, etc...), while validation
>> �> > failures are messages that explicitly failed stellar validation/schema
>> �> > enforcement. There should be relatively few of the second type.
>> �> >
>> �> > With respect to 3, why do you want the error hashed? Why not just
>> �search
>> �> > for the error text?
>> �> >
>> �> > Thanks,
>> �> > James
>> �> >
>> �> > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
>> �> >> As someone who currently fills the platform engineer role, I can give
>> �> this
>> �> >> idea a huge +1. My thoughts:
>> �> >>
>> �> >> 1. I think it depends on exactly what data is pushed into the index
>> �> (#3).
>> �> >> However, assuming the errors you proposed recording, I can't see huge
>> �> >> benefits to having more than one dashboard. I would be happy to be
>> �> >> persuaded otherwise.
>> �> >>
>> �> >> 2. I would say yes, storing the errors in HDFS in addition to
>> �indexing
>> �> is
>> �> >> a good thing. Using METRON-510
>> �> >> <https://issues.apache.org/jira/browse/METRON-510> as a case study,
>> �> there
>> �> >> is the potential in this environment for attacker-controlled data to
>> �> >
>> �> > result
>> �> >> in processing errors which could be a method of evading security
>> �> >> monitoring. Once an attack is identified, the long term HDFS storage
>> �> would
>> �> >> allow better historical analysis for low-and-slow/persistent attacks
>> �> (I'm
>> �> >> thinking of a method of data exfil that also won't successfully get
>> �> stored
>> �> >> in Lucene, but is hard to identify over a short period of time).
>> �> >> - Along this line, I think that there are various parts of Metron
>> �> (this
>> �> >> included) which could benefit from having method of configuring data
>> �> aging
>> �> >> by bucket in HDFS (Following Nick's comments here
>> �> >> <https://issues.apache.org/jira/browse/METRON-477>).
>> �> >>
>> �> >> 3. I would potentially add a hash of the content that failed
>> �> validation to
>> �> >> help identify repeats over time with less of a concern that you'd
>> �have
>> �> >
>> �> > back
>> �> >> to back failures (i.e. instead of storing the value itself).
>> �> Additionally,
>> �> >> I think it's helpful to be able to search all times there was an
>> �> indexing
>> �> >> error (instead of it hitting the catch-all).
>> �> >>
>> �> >> Jon
>> �> >>
>> �> >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <js...@apache.org>
>> �> wrote:
>> �> >>
>> �> >> We already have a capability to capture bolt errors and validation
>> �> errors
>> �> >> and pipe them into a Kafka topic. I want to propose that we attach a
>> �> >> writer topology to the error and validation failed kafka topics so
>> �> that we
>> �> >> can (a) create a new ES index for these errors and (b) create a new
>> �> Kibana
>> �> >> dashboard to visualize them. The benefit would be that errors and
>> �> >> validation failures would be easier to see and analyze.
>> �> >>
>> �> >> I am seeking feedback on the following:
>> �> >>
>> �> >> - How granular would we want this feature to be? Think we would want
>> �> one
>> �> >> index/dashboard per source? Or would it be better to collapse
>> �> everything
>> �> >> into the same index?
>> �> >> - Do we care about storing these errors in HDFS as well? Or is
>> �indexing
>> �> >> them enough?
>> �> >> - What types of errors should we record? I am proposing:
>> �> >>
>> �> >> For error reporting:
>> �> >> --Message failed to parse
>> �> >> --Enrichment failed to enrich
>> �> >> --Threat intel feed failures
>> �> >> --Generic catch-all for all other errors
>> �> >>
>> �> >> For validation reporting:
>> �> >> --What part of message failed validation
>> �> >> --What stellar validator caused the failure
>> �> >>
>> �> >> -------------------
>> �> >> Thank you,
>> �> >>
>> �> >> James Sirota
>> �> >> PPMC- Apache Metron (Incubating)
>> �> >> jsirota AT apache DOT org
>> �> >>
>> �> >> --
>> �> >>
>> �> >> Jon
>> �> >>
>> �> >> Sent from my mobile device
>> �> >
>> �> > -------------------
>> �> > Thank you,
>> �> >
>> �> > James Sirota
>> �> > PPMC- Apache Metron (Incubating)
>> �> > jsirota AT apache DOT org
>> �> >
>> �> > --
>> �> >
>> �> > Jon
>> �> >
>> �> > Sent from my mobile device
>> �>
>> �> -------------------
>> �> Thank you,
>> �>
>> �> James Sirota
>> �> PPMC- Apache Metron (Incubating)
>> �> jsirota AT apache DOT org
>> �>
>> �--
>>
>> �Jon
>>
>> �Sent from my mobile device

-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org

Re: [DISCUSS] Error Indexing

Posted by Ryan Merriman <me...@gmail.com>.
I understand what Jon is talking about.  He's proposing we hash the value
that caused the error, not necessarily the error message itself.  For an
enrichment this is easy.  Just pass along the field value that failed
enrichment.  For other cases the field that caused the error may not be so
obvious.  Take parser validation for example.  The message is validated as
a whole and it may not be easy to determine which field is the cause.  In
that case would a hash of the whole message work?

There is a broader architectural discussion that needs to happen before we
can implement this.  Currently we have an indexing topology that reads from
1 topic and writes messages to ES but errors are written to several
different topics:

   - parser_error
   - parser_invalid
   - enrichments_error
   - threatintel_error
   - indexing_error

I can see 4 possible approaches to implementing this:

   1. Create an index topology for each error topic
      1. Good because we can easily reuse the indexing topology and would
      require the least development effort
      2. Bad because it would consume a lot of extra worker slots
   2. Move the topic name into the error JSON message as a new "error_type"
   field and write all messages to the indexing topic
      1. Good because we don't need to create a new topology
      2. Bad because we would be flowing data and errors through the same
      topology.  A spike in errors could affect message indexing.
   3. Compromise between 1 and 2.  Create another indexing topology that is
   dedicated to indexing errors.  Move the topic name into the error JSON
   message as a new "error_type" field and write all errors to a single error
   topic.
   4. Write a completely new topology with multiple spouts (1 for each
   error type listed above) that all feed into a single BulkMessageWriterBolt.
      1. Good because the current topologies would not need to change
      2. Bad because it would require the most development effort, would
      not reuse existing topologies and takes up more worker slots than 3

Are there other approaches I haven't thought of?  I think 1 and 2 are off
the table because they are shortcuts and not good long-term solutions.  3
would be my choice because it introduces less complexity than 4.  Thoughts?
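
To make option 3 concrete, the change at each error-producing bolt would be
small, something along these lines.  This is a sketch only: the unified topic
name, the field names and the broker address are placeholders, and I'm using
json-simple plus the plain Kafka producer purely to illustrate the shape of
the message:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.json.simple.JSONObject;

    public class ErrorForwarder {

      @SuppressWarnings("unchecked")
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:6667");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          JSONObject error = new JSONObject();
          error.put("error_type", "enrichments_error");          // what used to be the topic name
          error.put("message", "Enrichment failed to enrich");
          error.put("raw_message_hash", "<digest of the offending value>");

          // Every bolt writes to the same error topic; the dedicated error
          // indexing topology reads from it, and dashboards split on error_type.
          producer.send(new ProducerRecord<>("error", error.toJSONString()));
        }
      }
    }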

Ryan


On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <ze...@gmail.com> wrote:

> In that case the hash would be of the value in the IP field, such as
> sha3(8.8.8.8).
>
> Jon
>
> On Mon, Jan 23, 2017, 6:41 PM James Sirota <js...@apache.org> wrote:
>
> > Jon,
> >
> > I am still not entirely following why we would want to use hashing.  For
> > example if my error is "Your IP field is invalid and failed validation"
> > hashing this error string will always result in the same hash.  Why not
> > just use the actual error string? Can you provide an example where you
> > would use it?
> >
> > Thanks,
> > James
> >
> > 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
> > > For 1 - I'm good with that.
> > >
> > > I'm talking about hashing the relevant content itself not the error.
> Some
> > > benefits are (1) minimize load on search index (there's minimal benefit
> > in
> > > spending the CPU and disk to keep it at full fidelity (tokenize and
> > store))
> > > (2) provide something to key on for dashboards (assuming a good hash
> > > algorithm that avoids collisions and is second preimage resistant) and
> > (3)
> > > specific to errors, if the issue is that it failed to index, a hash
> gives
> > > us some protection that the issue will not occur twice.
> > >
> > > Jon
> > >
> > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <js...@apache.org> wrote:
> > >
> > > Jon,
> > >
> > > With regards to 1, collapsing to a single dashboard for each would be
> > > fine. So we would have one error index and one "failed to validate"
> > > index. The distinction is that errors would be things that went wrong
> > > during stream processing (failed to parse, etc...), while validation
> > > failures are messages that explicitly failed stellar validation/schema
> > > enforcement. There should be relatively few of the second type.
> > >
> > > With respect to 3, why do you want the error hashed? Why not just
> search
> > > for the error text?
> > >
> > > Thanks,
> > > James
> > >
> > > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> > >>  As someone who currently fills the platform engineer role, I can give
> > this
> > >>  idea a huge +1. My thoughts:
> > >>
> > >>  1. I think it depends on exactly what data is pushed into the index
> > (#3).
> > >>  However, assuming the errors you proposed recording, I can't see huge
> > >>  benefits to having more than one dashboard. I would be happy to be
> > >>  persuaded otherwise.
> > >>
> > >>  2. I would say yes, storing the errors in HDFS in addition to
> indexing
> > is
> > >>  a good thing. Using METRON-510
> > >>  <https://issues.apache.org/jira/browse/METRON-510> as a case study,
> > there
> > >>  is the potential in this environment for attacker-controlled data to
> > >
> > > result
> > >>  in processing errors which could be a method of evading security
> > >>  monitoring. Once an attack is identified, the long term HDFS storage
> > would
> > >>  allow better historical analysis for low-and-slow/persistent attacks
> > (I'm
> > >>  thinking of a method of data exfil that also won't successfully get
> > stored
> > >>  in Lucene, but is hard to identify over a short period of time).
> > >>   - Along this line, I think that there are various parts of Metron
> > (this
> > >>  included) which could benefit from having method of configuring data
> > aging
> > >>  by bucket in HDFS (Following Nick's comments here
> > >>  <https://issues.apache.org/jira/browse/METRON-477>).
> > >>
> > >>  3. I would potentially add a hash of the content that failed
> > validation to
> > >>  help identify repeats over time with less of a concern that you'd
> have
> > >
> > > back
> > >>  to back failures (i.e. instead of storing the value itself).
> > Additionally,
> > >>  I think it's helpful to be able to search all times there was an
> > indexing
> > >>  error (instead of it hitting the catch-all).
> > >>
> > >>  Jon
> > >>
> > >>  On Fri, Jan 20, 2017 at 1:17 PM James Sirota <js...@apache.org>
> > wrote:
> > >>
> > >>  We already have a capability to capture bolt errors and validation
> > errors
> > >>  and pipe them into a Kafka topic. I want to propose that we attach a
> > >>  writer topology to the error and validation failed kafka topics so
> > that we
> > >>  can (a) create a new ES index for these errors and (b) create a new
> > Kibana
> > >>  dashboard to visualize them. The benefit would be that errors and
> > >>  validation failures would be easier to see and analyze.
> > >>
> > >>  I am seeking feedback on the following:
> > >>
> > >>  - How granular would we want this feature to be? Think we would want
> > one
> > >>  index/dashboard per source? Or would it be better to collapse
> > everything
> > >>  into the same index?
> > >>  - Do we care about storing these errors in HDFS as well? Or is
> indexing
> > >>  them enough?
> > >>  - What types of errors should we record? I am proposing:
> > >>
> > >>  For error reporting:
> > >>  --Message failed to parse
> > >>  --Enrichment failed to enrich
> > >>  --Threat intel feed failures
> > >>  --Generic catch-all for all other errors
> > >>
> > >>  For validation reporting:
> > >>  --What part of message failed validation
> > >>  --What stellar validator caused the failure
> > >>
> > >>  -------------------
> > >>  Thank you,
> > >>
> > >>  James Sirota
> > >>  PPMC- Apache Metron (Incubating)
> > >>  jsirota AT apache DOT org
> > >>
> > >>  --
> > >>
> > >>  Jon
> > >>
> > >>  Sent from my mobile device
> > >
> > > -------------------
> > > Thank you,
> > >
> > > James Sirota
> > > PPMC- Apache Metron (Incubating)
> > > jsirota AT apache DOT org
> > >
> > > --
> > >
> > > Jon
> > >
> > > Sent from my mobile device
> >
> > -------------------
> > Thank you,
> >
> > James Sirota
> > PPMC- Apache Metron (Incubating)
> > jsirota AT apache DOT org
> >
> --
>
> Jon
>
> Sent from my mobile device
>

Re: [DISCUSS] Error Indexing

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
In that case the hash would be of the value in the IP field, such as
sha3(8.8.8.8).

Jon

On Mon, Jan 23, 2017, 6:41 PM James Sirota <js...@apache.org> wrote:

> Jon,
>
> I am still not entirely following why we would want to use hashing.  For
> example if my error is "Your IP field is invalid and failed validation"
> hashing this error string will always result in the same hash.  Why not
> just use the actual error string? Can you provide an example where you
> would use it?
>
> Thanks,
> James
>
> 23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
> > For 1 - I'm good with that.
> >
> > I'm talking about hashing the relevant content itself not the error. Some
> > benefits are (1) minimize load on search index (there's minimal benefit
> in
> > spending the CPU and disk to keep it at full fidelity (tokenize and
> store))
> > (2) provide something to key on for dashboards (assuming a good hash
> > algorithm that avoids collisions and is second preimage resistant) and
> (3)
> > specific to errors, if the issue is that it failed to index, a hash gives
> > us some protection that the issue will not occur twice.
> >
> > Jon
> >
> > On Mon, Jan 23, 2017, 2:47 PM James Sirota <js...@apache.org> wrote:
> >
> > Jon,
> >
> > With regards to 1, collapsing to a single dashboard for each would be
> > fine. So we would have one error index and one "failed to validate"
> > index. The distinction is that errors would be things that went wrong
> > during stream processing (failed to parse, etc...), while validation
> > failures are messages that explicitly failed stellar validation/schema
> > enforcement. There should be relatively few of the second type.
> >
> > With respect to 3, why do you want the error hashed? Why not just search
> > for the error text?
> >
> > Thanks,
> > James
> >
> > 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> >>  As someone who currently fills the platform engineer role, I can give
> this
> >>  idea a huge +1. My thoughts:
> >>
> >>  1. I think it depends on exactly what data is pushed into the index
> (#3).
> >>  However, assuming the errors you proposed recording, I can't see huge
> >>  benefits to having more than one dashboard. I would be happy to be
> >>  persuaded otherwise.
> >>
> >>  2. I would say yes, storing the errors in HDFS in addition to indexing
> is
> >>  a good thing. Using METRON-510
> >>  <https://issues.apache.org/jira/browse/METRON-510> as a case study,
> there
> >>  is the potential in this environment for attacker-controlled data to
> >
> > result
> >>  in processing errors which could be a method of evading security
> >>  monitoring. Once an attack is identified, the long term HDFS storage
> would
> >>  allow better historical analysis for low-and-slow/persistent attacks
> (I'm
> >>  thinking of a method of data exfil that also won't successfully get
> stored
> >>  in Lucene, but is hard to identify over a short period of time).
> >>   - Along this line, I think that there are various parts of Metron
> (this
> >>  included) which could benefit from having method of configuring data
> aging
> >>  by bucket in HDFS (Following Nick's comments here
> >>  <https://issues.apache.org/jira/browse/METRON-477>).
> >>
> >>  3. I would potentially add a hash of the content that failed
> validation to
> >>  help identify repeats over time with less of a concern that you'd have
> >
> > back
> >>  to back failures (i.e. instead of storing the value itself).
> Additionally,
> >>  I think it's helpful to be able to search all times there was an
> indexing
> >>  error (instead of it hitting the catch-all).
> >>
> >>  Jon
> >>
> >>  On Fri, Jan 20, 2017 at 1:17 PM James Sirota <js...@apache.org>
> wrote:
> >>
> >>  We already have a capability to capture bolt errors and validation
> errors
> >>  and pipe them into a Kafka topic. I want to propose that we attach a
> >>  writer topology to the error and validation failed kafka topics so
> that we
> >>  can (a) create a new ES index for these errors and (b) create a new
> Kibana
> >>  dashboard to visualize them. The benefit would be that errors and
> >>  validation failures would be easier to see and analyze.
> >>
> >>  I am seeking feedback on the following:
> >>
> >>  - How granular would we want this feature to be? Think we would want
> one
> >>  index/dashboard per source? Or would it be better to collapse
> everything
> >>  into the same index?
> >>  - Do we care about storing these errors in HDFS as well? Or is indexing
> >>  them enough?
> >>  - What types of errors should we record? I am proposing:
> >>
> >>  For error reporting:
> >>  --Message failed to parse
> >>  --Enrichment failed to enrich
> >>  --Threat intel feed failures
> >>  --Generic catch-all for all other errors
> >>
> >>  For validation reporting:
> >>  --What part of message failed validation
> >>  --What stellar validator caused the failure
> >>
> >>  -------------------
> >>  Thank you,
> >>
> >>  James Sirota
> >>  PPMC- Apache Metron (Incubating)
> >>  jsirota AT apache DOT org
> >>
> >>  --
> >>
> >>  Jon
> >>
> >>  Sent from my mobile device
> >
> > -------------------
> > Thank you,
> >
> > James Sirota
> > PPMC- Apache Metron (Incubating)
> > jsirota AT apache DOT org
> >
> > --
> >
> > Jon
> >
> > Sent from my mobile device
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>
-- 

Jon

Sent from my mobile device

Re: [DISCUSS] Error Indexing

Posted by James Sirota <js...@apache.org>.
Jon,

I am still not entirely following why we would want to use hashing.  For example if my error is "Your IP field is invalid and failed validation" hashing this error string will always result in the same hash.  Why not just use the actual error string? Can you provide an example where you would use it?

Thanks,
James

23.01.2017, 16:29, "Zeolla@GMail.com" <ze...@gmail.com>:
> For 1 - I'm good with that.
>
> I'm talking about hashing the relevant content itself not the error. Some
> benefits are (1) minimize load on search index (there's minimal benefit in
> spending the CPU and disk to keep it at full fidelity (tokenize and store))
> (2) provide something to key on for dashboards (assuming a good hash
> algorithm that avoids collisions and is second preimage resistant) and (3)
> specific to errors, if the issue is that it failed to index, a hash gives
> us some protection that the issue will not occur twice.
>
> Jon
>
> On Mon, Jan 23, 2017, 2:47 PM James Sirota <js...@apache.org> wrote:
>
> Jon,
>
> With regards to 1, collapsing to a single dashboard for each would be
> fine. So we would have one error index and one "failed to validate"
> index. The distinction is that errors would be things that went wrong
> during stream processing (failed to parse, etc...), while validation
> failures are messages that explicitly failed stellar validation/schema
> enforcement. There should be relatively few of the second type.
>
> With respect to 3, why do you want the error hashed? Why not just search
> for the error text?
>
> Thanks,
> James
>
> 20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
>> �As someone who currently fills the platform engineer role, I can give this
>> �idea a huge +1. My thoughts:
>>
>> �1. I think it depends on exactly what data is pushed into the index (#3).
>> �However, assuming the errors you proposed recording, I can't see huge
>> �benefits to having more than one dashboard. I would be happy to be
>> �persuaded otherwise.
>>
>> �2. I would say yes, storing the errors in HDFS in addition to indexing is
>> �a good thing. Using METRON-510
>> �<https://issues.apache.org/jira/browse/METRON-510> as a case study, there
>> �is the potential in this environment for attacker-controlled data to
>
> result
>> �in processing errors which could be a method of evading security
>> �monitoring. Once an attack is identified, the long term HDFS storage would
>> �allow better historical analysis for low-and-slow/persistent attacks (I'm
>> �thinking of a method of data exfil that also won't successfully get stored
>> �in Lucene, but is hard to identify over a short period of time).
>> ��- Along this line, I think that there are various parts of Metron (this
>> �included) which could benefit from having method of configuring data aging
>> �by bucket in HDFS (Following Nick's comments here
>> �<https://issues.apache.org/jira/browse/METRON-477>).
>>
>> �3. I would potentially add a hash of the content that failed validation to
>> �help identify repeats over time with less of a concern that you'd have
>
> back
>> �to back failures (i.e. instead of storing the value itself). Additionally,
>> �I think it's helpful to be able to search all times there was an indexing
>> �error (instead of it hitting the catch-all).
>>
>> �Jon
>>
>> �On Fri, Jan 20, 2017 at 1:17 PM James Sirota <js...@apache.org> wrote:
>>
>> �We already have a capability to capture bolt errors and validation errors
>> �and pipe them into a Kafka topic. I want to propose that we attach a
>> �writer topology to the error and validation failed kafka topics so that we
>> �can (a) create a new ES index for these errors and (b) create a new Kibana
>> �dashboard to visualize them. The benefit would be that errors and
>> �validation failures would be easier to see and analyze.
>>
>> �I am seeking feedback on the following:
>>
>> �- How granular would we want this feature to be? Think we would want one
>> �index/dashboard per source? Or would it be better to collapse everything
>> �into the same index?
>> �- Do we care about storing these errors in HDFS as well? Or is indexing
>> �them enough?
>> �- What types of errors should we record? I am proposing:
>>
>> �For error reporting:
>> �--Message failed to parse
>> �--Enrichment failed to enrich
>> �--Threat intel feed failures
>> �--Generic catch-all for all other errors
>>
>> �For validation reporting:
>> �--What part of message failed validation
>> �--What stellar validator caused the failure
>>
>> �-------------------
>> �Thank you,
>>
>> �James Sirota
>> �PPMC- Apache Metron (Incubating)
>> �jsirota AT apache DOT org
>>
>> �--
>>
>> �Jon
>>
>> �Sent from my mobile device
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>
> --
>
> Jon
>
> Sent from my mobile device

-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org

Re: [DISCUSS] Error Indexing

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
For 1 - I'm good with that.

I'm talking about hashing the relevant content itself, not the error.  Some
benefits are (1) it minimizes load on the search index (there's minimal
benefit in spending the CPU and disk to keep the value at full fidelity,
tokenized and stored), (2) it provides something to key on for dashboards
(assuming a good hash algorithm that avoids collisions and is second-preimage
resistant), and (3) specific to errors, if the issue is that the value failed
to index, a hash gives us some protection that the same failure won't happen
again when we index the error record.
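
For (2), what I'm picturing is that a dashboard panel boils down to a terms
aggregation on the hash field, roughly like the query below.  The "value_hash"
field name is made up, and it would need to be mapped as an unanalyzed/keyword
field for the aggregation to be meaningful:

    {
      "size": 0,
      "aggs": {
        "repeat_offenders": {
          "terms": { "field": "value_hash", "size": 20 }
        }
      }
    }

That gives a count of how many times the same offending value showed up,
without ever storing the value itself.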

Jon

On Mon, Jan 23, 2017, 2:47 PM James Sirota <js...@apache.org> wrote:

Jon,

With regards to 1, collapsing to a single dashboard for each would be
fine.  So we would have one error index and one "failed to validate"
index.  The distinction is that errors would be things that went wrong
during stream processing (failed to parse, etc...), while validation
failures are messages that explicitly failed stellar validation/schema
enforcement.  There should be relatively few of the second type.


With respect to 3, why do you want the error hashed?  Why not just search
for the error text?

Thanks,
James


20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> As someone who currently fills the platform engineer role, I can give this
> idea a huge +1. My thoughts:
>
> 1. I think it depends on exactly what data is pushed into the index (#3).
> However, assuming the errors you proposed recording, I can't see huge
> benefits to having more than one dashboard. I would be happy to be
> persuaded otherwise.
>
> 2. I would say yes, storing the errors in HDFS in addition to indexing is
> a good thing. Using METRON-510
> <https://issues.apache.org/jira/browse/METRON-510> as a case study, there
> is the potential in this environment for attacker-controlled data to
result
> in processing errors which could be a method of evading security
> monitoring. Once an attack is identified, the long term HDFS storage would
> allow better historical analysis for low-and-slow/persistent attacks (I'm
> thinking of a method of data exfil that also won't successfully get stored
> in Lucene, but is hard to identify over a short period of time).
>  - Along this line, I think that there are various parts of Metron (this
> included) which could benefit from having method of configuring data aging
> by bucket in HDFS (Following Nick's comments here
> <https://issues.apache.org/jira/browse/METRON-477>).
>
> 3. I would potentially add a hash of the content that failed validation to
> help identify repeats over time with less of a concern that you'd have
back
> to back failures (i.e. instead of storing the value itself). Additionally,
> I think it's helpful to be able to search all times there was an indexing
> error (instead of it hitting the catch-all).
>
> Jon
>
> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <js...@apache.org> wrote:
>
> We already have a capability to capture bolt errors and validation errors
> and pipe them into a Kafka topic. I want to propose that we attach a
> writer topology to the error and validation failed kafka topics so that we
> can (a) create a new ES index for these errors and (b) create a new Kibana
> dashboard to visualize them. The benefit would be that errors and
> validation failures would be easier to see and analyze.
>
> I am seeking feedback on the following:
>
> - How granular would we want this feature to be? Think we would want one
> index/dashboard per source? Or would it be better to collapse everything
> into the same index?
> - Do we care about storing these errors in HDFS as well? Or is indexing
> them enough?
> - What types of errors should we record? I am proposing:
>
> For error reporting:
> --Message failed to parse
> --Enrichment failed to enrich
> --Threat intel feed failures
> --Generic catch-all for all other errors
>
> For validation reporting:
> --What part of message failed validation
> --What stellar validator caused the failure
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>
> --
>
> Jon
>
> Sent from my mobile device

-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org

-- 

Jon

Sent from my mobile device

Re: [DISCUSS] Error Indexing

Posted by James Sirota <js...@apache.org>.
Jon,

With regards to 1, collapsing to a single dashboard for each would be fine.  So we would have one error index and one "failed to validate" index.  The distinction is that errors would be things that went wrong during stream processing (failed to parse, etc...), while validation failures are messages that explicitly failed stellar validation/schema enforcement.  There should be relatively few of the second type.
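
For reference, the second type is driven by the field validation block in the
global config, roughly like the example below.  I'm writing this from memory,
so the exact keys are worth double checking against the docs, and the field
names and condition are just examples:

    {
      "fieldValidations": [
        {
          "input": [ "ip_src_addr", "ip_dst_addr" ],
          "validation": "STELLAR",
          "config": {
            "condition": "IS_IP(ip_src_addr) and IS_IP(ip_dst_addr)"
          }
        }
      ]
    }

A message that fails a condition like that is what I mean by a validation
failure, as opposed to an exception thrown somewhere in the topology.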


With respect to 3, why do you want the error hashed?  Why not just search for the error text?

Thanks,
James 


20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> As someone who currently fills the platform engineer role, I can give this
> idea a huge +1. My thoughts:
>
> 1. I think it depends on exactly what data is pushed into the index (#3).
> However, assuming the errors you proposed recording, I can't see huge
> benefits to having more than one dashboard. I would be happy to be
> persuaded otherwise.
>
> 2. I would say yes, storing the errors in HDFS in addition to indexing is
> a good thing. Using METRON-510
> <https://issues.apache.org/jira/browse/METRON-510> as a case study, there
> is the potential in this environment for attacker-controlled data to result
> in processing errors which could be a method of evading security
> monitoring. Once an attack is identified, the long term HDFS storage would
> allow better historical analysis for low-and-slow/persistent attacks (I'm
> thinking of a method of data exfil that also won't successfully get stored
> in Lucene, but is hard to identify over a short period of time).
>  - Along this line, I think that there are various parts of Metron (this
> included) which could benefit from having a method of configuring data aging
> by bucket in HDFS (Following Nick's comments here
> <https://issues.apache.org/jira/browse/METRON-477>).
>
> 3. I would potentially add a hash of the content that failed validation to
> help identify repeats over time with less of a concern that you'd have back
> to back failures (i.e. instead of storing the value itself). Additionally,
> I think it's helpful to be able to search all times there was an indexing
> error (instead of it hitting the catch-all).
>
> Jon
>
> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <js...@apache.org> wrote:
>
> We already have a capability to capture bolt errors and validation errors
> and pipe them into a Kafka topic. I want to propose that we attach a
> writer topology to the error and validation failed kafka topics so that we
> can (a) create a new ES index for these errors and (b) create a new Kibana
> dashboard to visualize them. The benefit would be that errors and
> validation failures would be easier to see and analyze.
>
> I am seeking feedback on the following:
>
> - How granular would we want this feature to be? Think we would want one
> index/dashboard per source? Or would it be better to collapse everything
> into the same index?
> - Do we care about storing these errors in HDFS as well? Or is indexing
> them enough?
> - What types of errors should we record? I am proposing:
>
> For error reporting:
> --Message failed to parse
> --Enrichment failed to enrich
> --Threat intel feed failures
> --Generic catch-all for all other errors
>
> For validation reporting:
> --What part of message failed validation
> --What stellar validator caused the failure
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>
> --
>
> Jon
>
> Sent from my mobile device

-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org

Re: [DISCUSS] Error Indexing

Posted by James Sirota <js...@apache.org>.
Thanks for your feedback, Jon.  Anyone else has interest/feedback for this feature? 

20.01.2017, 14:01, "Zeolla@GMail.com" <ze...@gmail.com>:
> As someone who currently fills the platform engineer role, I can give this
> idea a huge +1. My thoughts:
>
> 1. I think it depends on exactly what data is pushed into the index (#3).
> However, assuming the errors you proposed recording, I can't see huge
> benefits to having more than one dashboard. I would be happy to be
> persuaded otherwise.
>
> 2. I would say yes, storing the errors in HDFS in addition to indexing is
> a good thing. Using METRON-510
> <https://issues.apache.org/jira/browse/METRON-510> as a case study, there
> is the potential in this environment for attacker-controlled data to result
> in processing errors which could be a method of evading security
> monitoring. Once an attack is identified, the long term HDFS storage would
> allow better historical analysis for low-and-slow/persistent attacks (I'm
> thinking of a method of data exfil that also won't successfully get stored
> in Lucene, but is hard to identify over a short period of time).
>  - Along this line, I think that there are various parts of Metron (this
> included) which could benefit from having a method of configuring data aging
> by bucket in HDFS (Following Nick's comments here
> <https://issues.apache.org/jira/browse/METRON-477>).
>
> 3. I would potentially add a hash of the content that failed validation to
> help identify repeats over time with less of a concern that you'd have back
> to back failures (i.e. instead of storing the value itself). Additionally,
> I think it's helpful to be able to search all times there was an indexing
> error (instead of it hitting the catch-all).
>
> Jon
>
> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <js...@apache.org> wrote:
>
> We already have a capability to capture bolt errors and validation errors
> and pipe them into a Kafka topic. I want to propose that we attach a
> writer topology to the error and validation failed kafka topics so that we
> can (a) create a new ES index for these errors and (b) create a new Kibana
> dashboard to visualize them. The benefit would be that errors and
> validation failures would be easier to see and analyze.
>
> I am seeking feedback on the following:
>
> - How granular would we want this feature to be? Think we would want one
> index/dashboard per source? Or would it be better to collapse everything
> into the same index?
> - Do we care about storing these errors in HDFS as well? Or is indexing
> them enough?
> - What types of errors should we record? I am proposing:
>
> For error reporting:
> --Message failed to parse
> --Enrichment failed to enrich
> --Threat intel feed failures
> --Generic catch-all for all other errors
>
> For validation reporting:
> --What part of message failed validation
> --What stellar validator caused the failure
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>
> --
>
> Jon
>
> Sent from my mobile device

-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org

Re: [DISCUSS] Error Indexing

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
As someone who currently fills the platform engineer role, I can give this
idea a huge +1.  My thoughts:

1.  I think it depends on exactly what data is pushed into the index (#3).
However, assuming the errors you proposed recording, I can't see huge
benefits to having more than one dashboard.  I would be happy to be
persuaded otherwise.

2.  I would say yes, storing the errors in HDFS in addition to indexing is
a good thing.  Using METRON-510
<https://issues.apache.org/jira/browse/METRON-510> as a case study, there
is the potential in this environment for attacker-controlled data to result
in processing errors which could be a method of evading security
monitoring.  Once an attack is identified, the long term HDFS storage would
allow better historical analysis for low-and-slow/persistent attacks (I'm
thinking of a method of data exfil that also won't successfully get stored
in Lucene, but is hard to identify over a short period of time).
 - Along this line, I think that there are various parts of Metron (this
included) which could benefit from having a method of configuring data aging
by bucket in HDFS (Following Nick's comments here
<https://issues.apache.org/jira/browse/METRON-477>).

3.  I would potentially add a hash of the content that failed validation to
help identify repeats over time with less of a concern that you'd have back
to back failures (i.e. instead of storing the value itself).  Additionally,
I think it's helpful to be able to search all times there was an indexing
error (instead of it hitting the catch-all).
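
For what it's worth, a minimal sketch of how such a content hash could be computed follows: just SHA-256 over the raw message bytes, so that repeated failures of the same content collapse to one key. The class and method names are made up for illustration; this is not an existing Metron API.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ErrorHashExample {

  // Hash the raw message bytes so repeated failures of the same content
  // map to a single, fixed-length key instead of storing the value itself.
  public static String rawMessageHash(String rawMessage) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(rawMessage.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 should always be available", e);
    }
  }

  public static void main(String[] args) {
    // Hypothetical failing payload, for demonstration only.
    String failedMessage = "{\"ip_src_addr\":\"not-an-ip\"}";
    System.out.println(rawMessageHash(failedMessage));
  }
}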

Jon

On Fri, Jan 20, 2017 at 1:17 PM James Sirota <js...@apache.org> wrote:

We already have a capability to capture bolt errors and validation errors
and pipe them into a Kafka topic.  I want to propose that we attach a
writer topology to the error and validation failed kafka topics so that we
can (a) create a new ES index for these errors and (b) create a new Kibana
dashboard to visualize them.  The benefit would be that errors and
validation failures would be easier to see and analyze.

I am seeking feedback on the following:

- How granular would we want this feature to be?  Think we would want one
index/dashboard per source?  Or would it be better to collapse everything
into the same index?
- Do we care about storing these errors in HDFS as well?  Or is indexing
them enough?
- What types of errors should we record?  I am proposing:

For error reporting:
--Message failed to parse
--Enrichment failed to enrich
--Threat intel feed failures
--Generic catch-all for all other errors

For validation reporting:
--What part of message failed validation
--What stellar validator caused the failure



-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org

-- 

Jon

Sent from my mobile device