Posted to dev@ignite.apache.org by Alexey Kukushkin <al...@yahoo.com.INVALID> on 2017/08/14 12:02:07 UTC

Ignite not friendly for Monitoring

Igniters,
While preparing some Ignite materials for Administrators I found that Ignite is not friendly to such a critical DevOps practice as monitoring.
TL;DR: I think Ignite lacks structured descriptions of abnormal events, tied to event IDs in the logs that do not change as new versions are released.
MORE DETAILS
I call an application “monitoring friendly” if it allows DevOps to:
1. immediately receive a notification (email, SMS, etc.)
2. understand what the problem is without involving developers 
3. provide automated recovery actions.

Large enterprises do not implement custom solutions. They usually use tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the enterprise consistently. All such tools have a similar architecture: a dashboard showing apps as “green/yellow/red”, and numerous “connectors” that look for events in text logs, ESBs, database tables, etc.

For each app, DevOps build a “health model” - a diagram displaying the app’s “manageable” components and the app’s boundaries. A “manageable” component is something that can be started/stopped/configured in isolation. The “system boundary” is the list of external apps that the monitored app interacts with.

The main attribute of a manageable component is a list of “operationally significant events”. Those are the events that DevOps can do something about. For example, “failed to connect to cache store” is significant, while “user input validation failed” is not.

Events shall be as specific as possible so that DevOps do not spend time on further analysis. For example, a “database failure” event is not good. There should be “database connection failure”, “invalid database schema”, “database authentication failure”, etc. events.  

An “event” is NOT the same as an exception occurring in the code. Events identify a specific problem from the DevOps point of view. For example, even if a “connection to cache store failed” exception might be thrown from several places in the code, it is still the same event. On the other hand, even if SqlServerConnectionTimeout and OracleConnectionTimeout exceptions might be caught in the same place, they are different events, since MS SQL Server and Oracle are usually handled by different DevOps groups in large enterprises!
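
To make the distinction concrete, here is a minimal sketch in Java. The event catalog, IDs and helper class below are hypothetical, not an existing Ignite API; the SQLSTATE check is only one possible way to classify the root cause:

    import java.net.ConnectException;
    import java.sql.SQLException;

    // Hypothetical event catalog: one stable ID per DevOps-visible problem.
    final class OpsEvents {
        static final int CACHE_STORE_CONNECTION_FAILED = 20100;
        static final int CACHE_STORE_AUTH_FAILED       = 20110;

        private OpsEvents() {}
    }

    class CacheStoreEventExample {
        // The same ID (20100) may be logged from many code paths, while a single
        // catch site may emit different IDs depending on the root cause.
        static void report(Exception e, org.apache.ignite.IgniteLogger log) {
            if (e instanceof ConnectException)
                log.warning(OpsEvents.CACHE_STORE_CONNECTION_FAILED
                    + " Could not connect to cache store: " + e.getMessage());
            else if (e instanceof SQLException && "28000".equals(((SQLException)e).getSQLState()))
                log.warning(OpsEvents.CACHE_STORE_AUTH_FAILED
                    + " Cache store authentication failed: " + e.getMessage());
            else
                log.error("Unclassified cache store error", e);
        }
    }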

The operationally significant event IDs must be stable: they must not change from one release to another. This is like a contract between developers and DevOps.

It should be the developers’ responsibility to publish and maintain a table with the following attributes:
 
- Event ID
- Severity: Critical (Red) - the system is not operational; Warning (Yellow) - the system is operational but its health is degraded; None - just an info.
- Description: concise, but enough for DevOps to act without the developer’s help
- Recovery actions: what DevOps shall do to fix the issue without the developer’s help. DevOps might create automated recovery scripts based on this information.

For example:
10100 - Critical - Could not connect to ZooKeeper to discover nodes - 1) Open the Ignite configuration and find the ZooKeeper connection string 2) Make sure ZooKeeper is running
10200 - Warning - Ignite node left the cluster.
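
For illustration only, such a table could also be published as a small catalog in code, e.g. a Java enum. The enum, IDs, severities and texts below are a hypothetical sketch based on the example above, not an existing Ignite class:

    // Hypothetical sketch of the event table above as code.
    enum Severity { CRITICAL, WARNING, NONE }

    enum OpsEvent {
        ZK_DISCOVERY_UNREACHABLE(10100, Severity.CRITICAL,
            "Could not connect to ZooKeeper to discover nodes",
            "Check the ZooKeeper connection string in the Ignite configuration; make sure ZooKeeper is running."),
        NODE_LEFT(10200, Severity.WARNING,
            "Ignite node left the cluster",
            "Check the logs of the node that left; restart it if the shutdown was not intentional.");

        final int id;            // stable across releases - the contract with DevOps
        final Severity severity;
        final String description;
        final String recovery;

        OpsEvent(int id, Severity severity, String description, String recovery) {
            this.id = id;
            this.severity = severity;
            this.description = description;
            this.recovery = recovery;
        }
    }

A catalog like this would also make it trivial to generate the published table and to assert in tests that IDs never change between releases.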

Back to Ignite: it looks to me like we do not design for operations as described above. We have no event IDs: our logging is subject to change in new versions, so any patterns DevOps might use to detect significant events would stop working after an upgrade.

If I am not the only one who has such concerns, then we might open a ticket to address this.


Best regards, Alexey

Re: Ignite not friendly for Monitoring

Posted by Serge Puchnin <se...@gmail.com>.
This might be the issue:
https://issues.apache.org/jira/browse/IGNITE-3690

I'm going to update it once the domain list is agreed upon by the
community. 





Re: Ignite not friendly for Monitoring

Posted by Dmitriy Setrakyan <ds...@apache.org>.
Is there a Jira ticket for it?

On Mon, Jan 15, 2018 at 7:48 AM, Serge Puchnin <se...@gmail.com>
wrote:

> Igniters,
>
> It's a right idea!
>
> Let's try to revitalize it and make a move on.
>
> As a first step, I would like to propose a list of a top-level domain.
>
> -- the phase 1
>     1. UnExpected, UnKnown
>     2. Cluster and Topology
>         Discovery
>         Segmentation
>         Node Startup
>         Communication
>         Queue
>         Activate, startup process
>         Base line topology
>         Marshaller
>         Metadata
>         Topology Validate
>     3. Cache and Storage
>         Partition map exchange
>         Balancing
>         Long-running transactions
>         Checkpoint
>         Create cache
>         Destroy cache
>         Data loading & streaming
>     4. SQL
>         Long-running queries
>         Parsing
>         Queries
>         Scan Queries
>         SqlLine
>     5. Compute
>         Deployment
>         spi.checkpoint
>         spi.collision
>         Job Schedule
>
> -- the phase 2
>     6. Service
>     7. Security
>     8. ML
>     9. External Adapters
>     10. WebConsole
>     11. Vendor Specific
>         GG
>
>
> For every second-level domain is planning to reserve one hundred error
> codes. Sum of second-level domains (rounded up to next thousand) gives us
> count for top-level.
>
> Every error code has a severity level:
>
> Critical (Red) - the system is not operational;
> Warning (Yellow) - the system is operational but health is degraded;
> Info - just an info.
>
> And two or three letter prefix. It allows to find an issue more easily
> without complex grep rules (something like grep
> "10[2][5-9][0-5][0-9]|10[3][0-5][0-6][0-9]" * to find codes between 102500
> до 103569)
>
>
> Domains from the first phase look fine but from the second are vague.
> Initially, we can focus only the first phase.
>
> Please share your thoughts on proposed design.
>
> Serge.
>
>
>
> --
> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
>

Re: Ignite not friendly for Monitoring

Posted by Dmitriy Setrakyan <ds...@apache.org>.
Assigned the version 2.5 to the ticket. Let's try to make progress on this
before then.

On Tue, Jan 16, 2018 at 12:03 PM, Denis Magda <dm...@apache.org> wrote:

> Serge,
>
> Thanks for taking over this. Think we’re moving in a right direction with
> your proposal:
>
> * I would add a top-level domain for “Integrations”. All the integrations
> with Kafka, Spark, Storm, etc. should go there.
>
> * Second-level domains number can grow over the time per a top-level
> layer. Let’s book a decent range for this possible grow.
>
> * Guess external adapters should go to the “Integrations” which sounds
> better to me.
>
> * Agree that this ticket should be used to track the progress in JIRA:
> https://issues.apache.org/jira/browse/IGNITE-3690 <
> https://issues.apache.org/jira/browse/IGNITE-3690>
>
>
> On top of this, this effort has to be tested using a 3rd party tool such
> as DynoTrace or Nagios. If the tools can pick up and analyze our logs to
> automate classic DevOps tasks then the goal will be achieved. Can you
> include this as a required task for QA?
>
> —
> Denis
>
> > On Jan 15, 2018, at 7:48 AM, Serge Puchnin <se...@gmail.com>
> wrote:
> >
> > Igniters,
> >
> > It's a right idea!
> >
> > Let's try to revitalize it and make a move on.
> >
> > As a first step, I would like to propose a list of a top-level domain.
> >
> > -- the phase 1
> >    1. UnExpected, UnKnown
> >    2. Cluster and Topology
> >        Discovery
> >        Segmentation
> >        Node Startup
> >        Communication
> >        Queue
> >        Activate, startup process
> >        Base line topology
> >        Marshaller
> >        Metadata
> >        Topology Validate
> >    3. Cache and Storage
> >        Partition map exchange
> >        Balancing
> >        Long-running transactions
> >        Checkpoint
> >        Create cache
> >        Destroy cache
> >        Data loading & streaming
> >    4. SQL
> >        Long-running queries
> >        Parsing
> >        Queries
> >        Scan Queries
> >        SqlLine
> >    5. Compute
> >        Deployment
> >        spi.checkpoint
> >        spi.collision
> >        Job Schedule
> >
> > -- the phase 2
> >    6. Service
> >    7. Security
> >    8. ML
> >    9. External Adapters
> >    10. WebConsole
> >    11. Vendor Specific
> >        GG
> >
> >
> > For every second-level domain is planning to reserve one hundred error
> > codes. Sum of second-level domains (rounded up to next thousand) gives us
> > count for top-level.
> >
> > Every error code has a severity level:
> >
> > Critical (Red) - the system is not operational;
> > Warning (Yellow) - the system is operational but health is degraded;
> > Info - just an info.
> >
> > And two or three letter prefix. It allows to find an issue more easily
> > without complex grep rules (something like grep
> > "10[2][5-9][0-5][0-9]|10[3][0-5][0-6][0-9]" * to find codes between
> 102500
> > до 103569)
> >
> >
> > Domains from the first phase look fine but from the second are vague.
> > Initially, we can focus only the first phase.
> >
> > Please share your thoughts on proposed design.
> >
> > Serge.
> >
> >
> >
> > --
> > Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
>
>

Re: Ignite not friendly for Monitoring

Posted by Denis Magda <dm...@apache.org>.
Serge,

Thanks for taking over this. Think we’re moving in the right direction with your proposal:

* I would add a top-level domain for “Integrations”. All the integrations with Kafka, Spark, Storm, etc. should go there.

* The number of second-level domains can grow over time within a top-level layer. Let’s book a decent range for this possible growth.

* Guess external adapters should go under “Integrations”, which sounds better to me.

* Agree that this ticket should be used to track the progress in JIRA: https://issues.apache.org/jira/browse/IGNITE-3690 <https://issues.apache.org/jira/browse/IGNITE-3690>


On top of this, this effort has to be tested using a 3rd party tool such as DynaTrace or Nagios. If the tools can pick up and analyze our logs to automate classic DevOps tasks, then the goal will be achieved. Can you include this as a required task for QA?

—
Denis

> On Jan 15, 2018, at 7:48 AM, Serge Puchnin <se...@gmail.com> wrote:
> 
> Igniters, 
> 
> It's a right idea!
> 
> Let's try to revitalize it and make a move on. 
> 
> As a first step, I would like to propose a list of a top-level domain.
> 
> -- the phase 1
>    1. UnExpected, UnKnown
>    2. Cluster and Topology 
>        Discovery
>        Segmentation 
>        Node Startup
>        Communication
>        Queue
>        Activate, startup process
>        Base line topology
>        Marshaller
>        Metadata
>        Topology Validate
>    3. Cache and Storage
>        Partition map exchange
>        Balancing
>        Long-running transactions
>        Checkpoint
>        Create cache
>        Destroy cache
>        Data loading & streaming
>    4. SQL
>        Long-running queries
>        Parsing
>        Queries
>        Scan Queries
>        SqlLine
>    5. Compute
>        Deployment
>        spi.checkpoint
>        spi.collision
>        Job Schedule
> 
> -- the phase 2
>    6. Service
>    7. Security
>    8. ML
>    9. External Adapters 
>    10. WebConsole
>    11. Vendor Specific 
>        GG
> 
> 
> For every second-level domain is planning to reserve one hundred error
> codes. Sum of second-level domains (rounded up to next thousand) gives us
> count for top-level.
> 
> Every error code has a severity level:
> 
> Critical (Red) - the system is not operational; 
> Warning (Yellow) - the system is operational but health is degraded; 
> Info - just an info.
> 
> And two or three letter prefix. It allows to find an issue more easily
> without complex grep rules (something like grep
> "10[2][5-9][0-5][0-9]|10[3][0-5][0-6][0-9]" * to find codes between 102500 
> до 103569)
> 
> 
> Domains from the first phase look fine but from the second are vague.  
> Initially, we can focus only the first phase. 
> 
> Please share your thoughts on proposed design.
> 
> Serge.
> 
> 
> 
> --
> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/


Re: Ignite not friendly for Monitoring

Posted by Serge Puchnin <se...@gmail.com>.
Igniters, 

It's the right idea!

Let's try to revitalize it and move it forward. 

As a first step, I would like to propose a list of top-level domains.

-- the phase 1
    1. UnExpected, UnKnown
    2. Cluster and Topology 
        Discovery
        Segmentation 
        Node Startup
        Communication
        Queue
        Activate, startup process
        Base line topology
        Marshaller
        Metadata
        Topology Validate
    3. Cache and Storage
        Partition map exchange
        Balancing
        Long-running transactions
        Checkpoint
        Create cache
        Destroy cache
        Data loading & streaming
    4. SQL
        Long-running queries
        Parsing
        Queries
        Scan Queries
        SqlLine
    5. Compute
        Deployment
        spi.checkpoint
        spi.collision
        Job Schedule

-- the phase 2
    6. Service
    7. Security
    8. ML
    9. External Adapters 
    10. WebConsole
    11. Vendor Specific 
        GG


For every second-level domain I'm planning to reserve one hundred error
codes. The sum over the second-level domains (rounded up to the next thousand)
gives us the size of each top-level range.
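
As a minimal sketch of such an allocation (the domain names and base numbers below are made up for illustration):

    // Hypothetical allocation: 100 codes per second-level domain,
    // each top-level domain rounded up to its own thousand.
    enum Domain {
        CLUSTER_DISCOVERY   (102_000),
        CLUSTER_SEGMENTATION(102_100),
        CACHE_PME           (103_000),
        CACHE_REBALANCING   (103_100),
        SQL_PARSING         (104_000);

        final int base;                 // first code of the domain's 100-code block

        Domain(int base) { this.base = base; }

        int code(int offset) {          // offset 0..99 within the block
            if (offset < 0 || offset > 99)
                throw new IllegalArgumentException("offset");
            return base + offset;
        }
    }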

Every error code has a severity level:

Critical (Red) - the system is not operational; 
Warning (Yellow) - the system is operational but health is degraded; 
Info - just an info.

And a two- or three-letter prefix. It allows finding an issue more easily
without complex grep rules (something like grep
"10[2][5-9][0-5][0-9]|10[3][0-5][0-6][0-9]" * to find codes between 102500
and 103569).
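
A quick sketch of how a prefixed, fixed-width code would simplify such rules (the "CL" prefix and the sample log line are hypothetical):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogScanExample {
        // Hypothetical "CL" prefix for the "Cluster and Topology" domain;
        // matches e.g. CL-102501 regardless of where the numeric range starts or ends.
        private static final Pattern CLUSTER_EVENT = Pattern.compile("\\bCL-(\\d{6})\\b");

        public static void main(String[] args) {
            String line = "[2018-01-16 12:00:00][WARN ] CL-102501 Node left topology";
            Matcher m = CLUSTER_EVENT.matcher(line);
            if (m.find())
                System.out.println("Cluster/topology event, code " + m.group(1));
        }
    }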


Domains from the first phase look fine, but those from the second are vague.
Initially, we can focus only on the first phase.

Please share your thoughts on the proposed design.

Serge.




Re: Ignite not friendly for Monitoring

Posted by Vladimir Ozerov <vo...@gridgain.com>.
Dima,

Please see the latest comments in the ticket [1]. There is a special
specification called SQLSTATE governing what error codes are thrown from
SQL operations [2]. It is applicable to both JDBC and ODBC. Apart from the
standard codes, a database vendor can add its own codes as a separate field,
or even extend the error codes from the standard. However, as a first iteration
we should start respecting the SQLSTATE spec without our own Ignite-specific
error codes.

[1] https://issues.apache.org/jira/browse/IGNITE-5620
[2]
https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/codes/src/tpc/db2z_sqlstatevalues.html#db2z_sqlstatevalues__code07
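
For illustration, this is how a client would consume SQLSTATE through plain JDBC once the driver populates it (the host, the query and the Ignite JDBC driver being on the classpath are assumptions of this sketch):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SqlStateExample {
        public static void main(String[] args) {
            // Ignite JDBC thin driver URL; the host is a placeholder.
            try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
                 Statement stmt = conn.createStatement()) {
                stmt.execute("SELECT * FROM missing_table");
            }
            catch (SQLException e) {
                // The 5-character SQLSTATE is the stable, vendor-neutral part of the error;
                // a monitoring rule can key on it instead of the free-form message text.
                System.err.println("SQLSTATE=" + e.getSQLState() + ", message=" + e.getMessage());
            }
        }
    }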

On Mon, Aug 28, 2017 at 3:23 PM, Dmitriy Setrakyan <ds...@apache.org>
wrote:

> On Mon, Aug 28, 2017 at 1:22 AM, Vladimir Ozerov <vo...@gridgain.com>
> wrote:
>
> > IGNITE-5620 is about error codes thrown from drivers. This is completely
> > different story, as every driver has specification with it's own specific
> > error codes. There is no common denominator.
> >
>
> Vova, I am not sure I understand. I would expect that drivers should
> provide the same SQL error codes as the underlying database. Perhaps,
> drivers have their custom codes for the errors in the driver itself, not in
> SQL.
>
> Can you please clarify?
>
>
> >
> > On Thu, Aug 17, 2017 at 11:10 PM, Denis Magda <dm...@apache.org> wrote:
> >
> > > Vladimir,
> > >
> > > I would disagree. In IGNITE-5620 we’re going to introduce some constant
> > > error codes and prepare a sheet that will elaborate on every error.
> > That’s
> > > a part of bigger endeavor when the whole platform should be covered by
> > > special unique IDs for errors, warning and events.
> > >
> > > Now, we need to agree at least on the IDs range for SQL.
> > >
> > > —
> > > Denis
> > >
> > > > On Aug 15, 2017, at 11:10 PM, Vladimir Ozerov <vo...@gridgain.com>
> > > wrote:
> > > >
> > > > Denis,
> > > >
> > > > IGNITE-5620 is completely different thing. Let's do not mix cluster
> > > > monitoring and parser errors.
> > > >
> > > > ср, 16 авг. 2017 г. в 2:57, Denis Magda <dm...@apache.org>:
> > > >
> > > >> Alexey,
> > > >>
> > > >> Didn’t know that such an improvement as consistent IDs for errors
> and
> > > >> events can be used as an integration point with the DevOps tools.
> > Thanks
> > > >> for sharing your experience with us.
> > > >>
> > > >> Would you step in as a architect for this task and make out a JIRA
> > > ticket
> > > >> with all the required information.
> > > >>
> > > >> In general, we’ve already planned to do something around this
> starting
> > > >> with SQL:
> > > >> https://issues.apache.org/jira/browse/IGNITE-5620 <
> > > >> https://issues.apache.org/jira/browse/IGNITE-5620>
> > > >>
> > > >> It makes sense to consider your input before the work on IGNITE-5620
> > is
> > > >> started.
> > > >>
> > > >> —
> > > >> Denis
> > > >>
> > > >>> On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin <
> > > >> alexeykukushkin@yahoo.com.INVALID> wrote:
> > > >>>
> > > >>> Hi Alexey,
> > > >>> A nice thing about delegating alerting to 3rd party enterprise
> > systems
> > > >> is that those systems already deal with lots of things including
> > > >> distributed apps.
> > > >>> What is needed from Ignite is to consistently write to log files
> > (again
> > > >> that means stable event IDs, proper event granularity, no
> repetition,
> > > >> documentation). This would be 3rd party monitoring system's
> > > responsibility
> > > >> to monitor log files on all nodes, filter, aggregate, process,
> > visualize
> > > >> and notify on events.
> > > >>> How a monitoring tool would deal with an event like "node left":
> > > >>> The only thing needed from Ignite is to write an entry like below
> to
> > > log
> > > >> files on all Ignite servers. In this example 3300 identifies this
> > "node
> > > >> left" event and will never change in the future even if text
> > description
> > > >> changes:
> > > >>> [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left
> the
> > > >> cluster
> > > >>> Then we document somewhere on the web that Ignite has event 3300
> and
> > it
> > > >> means a node left the cluster. Maybe provide documentation how to
> deal
> > > with
> > > >> it. Some examples:Oracle Web Cache events:
> > > >> https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/
> > > event.htm#sthref2393MS
> > > >> SQL Server events:
> > > >> https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
> > > >>> That is all for Ignite! Everything else is handled by specific
> > > >> monitoring system configured by DevOps on the customer side.
> > > >>> Basing on the Ignite documentation similar to above, DevOps of a
> > > company
> > > >> where Ignite is going to be used will configure their monitoring
> > system
> > > to
> > > >> understand Ignite events. Consider the "node left" event as an
> > example.
> > > >>> - This event is output on every node but DevOps do not want to be
> > > >> notified many times. To address this, they will build an "Ignite
> > model"
> > > >> where there will be a parent-child dependency between components
> > "Ignite
> > > >> Cluster" and "Ignite Node". For example, this is how you do it in
> > > Nagios:
> > > >> https://assets.nagios.com/downloads/nagioscore/docs/
> > > nagioscore/4/en/dependencies.html
> > > >> and this is how you do it in Microsoft SCSM:
> > > >> https://docs.microsoft.com/en-us/system-center/scsm/auth-classes.
> > Then
> > > >> DevOps will configure "node left" monitors in SCSM (or a "checks" in
> > > >> Nagios) for parent "Ignite Cluster" and child "Ignite Service"
> > > components.
> > > >> State change (OK -> WARNING) and notification (email, SMS, whatever)
> > > will
> > > >> be configured only for the "Ignite Cluster"'s "node left" monitor.-
> > Now
> > > >> suppose a node left. The "node left" monitor (that uses log file
> > > monitoring
> > > >> plugin) on "Ignite Node" will detect the event and pass it to the
> > > parent.
> > > >> This will trigger "Ignite Cluster" state change from OK to WARNING
> and
> > > send
> > > >> a notification. No more notification will be sent unless the "Ignite
> > > >> Cluster" state is reset back to OK, which happens either manually or
> > on
> > > >> timeout or automatically on "node joined".
> > > >>> This was just FYI. We, Ignite developers, do not care about how
> > > >> monitoring works - this is responsibility of customer's DevOps. Our
> > > >> responsibility is consistent event logging.
> > > >>> Thank you!
> > > >>>
> > > >>>
> > > >>> Best regards, Alexey
> > > >>>
> > > >>>
> > > >>> On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov <
> > > >> akuznetsov@apache.org> wrote:
> > > >>>
> > > >>> Alexey,
> > > >>>
> > > >>> How you are going to deal with distributed nature of Ignite
> cluster?
> > > >>> And how do you propose handle nodes restart / stop?
> > > >>>
> > > >>> On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <
> > > >>> alexeykukushkin@yahoo.com.invalid> wrote:
> > > >>>
> > > >>>> Hi Denis,
> > > >>>> Monitoring tools simply watch event logs for patterns (regex in
> case
> > > of
> > > >>>> unstructured logs like text files). A stable (not changing in new
> > > >> releases)
> > > >>>> event ID identifying specific issue would be such a pattern.
> > > >>>> We need to introduce such event IDs according to the principles I
> > > >>>> described in my previous mail.
> > > >>>> Best regards, Alexey
> > > >>>>
> > > >>>>
> > > >>>> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <
> > > >>>> dmagda@apache.org> wrote:
> > > >>>>
> > > >>>> Hello Alexey,
> > > >>>>
> > > >>>> Thanks for the detailed input.
> > > >>>>
> > > >>>> Assuming that Ignite supported the suggested events based model,
> how
> > > can
> > > >>>> it be integrated with mentioned tools like DynaTrace or Nagios? Is
> > > this
> > > >> all
> > > >>>> we need?
> > > >>>>
> > > >>>> —
> > > >>>> Denis
> > > >>>>
> > > >>>>> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin <
> > > >> alexeykukushkin@yahoo.com
> > > >>>> .INVALID> wrote:
> > > >>>>>
> > > >>>>> Igniters,
> > > >>>>> While preparing some Ignite materials for Administrators I found
> > > Ignite
> > > >>>> is not friendly for such a critical DevOps practice as monitoring.
> > > >>>>> TL;DRI think Ignite misses structured descriptions of abnormal
> > events
> > > >>>> with references to event IDs in the logs not changing as new
> > versions
> > > >> are
> > > >>>> released.
> > > >>>>> MORE DETAILS
> > > >>>>> I call an application “monitoring friendly” if it allows DevOps
> to:
> > > >>>>> 1. immediately receive a notification (email, SMS, etc.)
> > > >>>>> 2. understand what a problem is without involving developers
> > > >>>>> 3. provide automated recovery action.
> > > >>>>>
> > > >>>>> Large enterprises do not implement custom solutions. They usually
> > use
> > > >>>> tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in
> the
> > > >>>> enterprise consistently. All such tools have similar architecture
> > > >> providing
> > > >>>> a dashboard showing apps as “green/yellow/red”, and numerous
> > > >> “connectors”
> > > >>>> to look for events in text logs, ESBs, database tables, etc.
> > > >>>>>
> > > >>>>> For each app DevOps build a “health model” - a diagram displaying
> > the
> > > >>>> app’s “manageable” components and the app boundaries. A
> “manageable”
> > > >>>> component is something that can be started/stopped/configured in
> > > >> isolation.
> > > >>>> “System boundary” is a list of external apps that the monitored
> app
> > > >>>> interacts with.
> > > >>>>>
> > > >>>>> The main attribute of a manageable component is a list of
> > > >> “operationally
> > > >>>> significant events”. Those are the events that DevOps can do
> > something
> > > >>>> with. For example, “failed to connect to cache store” is
> > significant,
> > > >> while
> > > >>>> “user input validation failed” is not.
> > > >>>>>
> > > >>>>> Events shall be as specific as possible so that DevOps do not
> spend
> > > >> time
> > > >>>> for further analysis. For example, a “database failure” event is
> not
> > > >> good.
> > > >>>> There should be “database connection failure”, “invalid database
> > > >> schema”,
> > > >>>> “database authentication failure”, etc. events.
> > > >>>>>
> > > >>>>> “Event” is NOT the same as exception occurred in the code. Events
> > > >>>> identify specific problem from the DevOps point of view. For
> > example,
> > > >> even
> > > >>>> if “connection to cache store failed” exception might be thrown
> from
> > > >>>> several places in the code, that is still the same event. On the
> > other
> > > >>>> side, even if a SqlServerConnectionTimeout and
> > OracleConnectionTimeout
> > > >>>> exceptions might be caught in the same place, those are different
> > > events
> > > >>>> since MS SQL Server and Oracle are usually different DevOps groups
> > in
> > > >> large
> > > >>>> enterprises!
> > > >>>>>
> > > >>>>> The operationally significant event IDs must be stable: they must
> > not
> > > >>>> change from one release to another. This is like a contract
> between
> > > >>>> developers and DevOps.
> > > >>>>>
> > > >>>>> This should be the developer’s responsibility to publish and
> > > maintain a
> > > >>>> table with attributes:
> > > >>>>>
> > > >>>>> - Event ID
> > > >>>>> - Severity: Critical (Red) - the system is not operational;
> Warning
> > > >>>> (Yellow) - the system is operational but health is degraded; None
> -
> > > >> just an
> > > >>>> info.
> > > >>>>> - Description: concise but enough for DevOps to act without
> > > developer’s
> > > >>>> help
> > > >>>>> - Recovery actions: what DevOps shall do to fix the issue without
> > > >>>> developer’s help. DevOps might create automated recovery scripts
> > based
> > > >> on
> > > >>>> this information.
> > > >>>>>
> > > >>>>> For example:
> > > >>>>> 10100 - Critical - Could not connect to Zookeeper to discovery
> > nodes
> > > -
> > > >>>> 1) Open ignite configuration and find zookeeper connection string
> 2)
> > > >> Make
> > > >>>> sure the Zookeeper is running
> > > >>>>> 10200 - Warning - Ignite node left the cluster.
> > > >>>>>
> > > >>>>> Back to Ignite: it looks to me we do not design for operations as
> > > >>>> described above. We have no event IDs: our logging is subject to
> > > change
> > > >> in
> > > >>>> new version so that any patterns DevOps might use to detect
> > > significant
> > > >>>> events would stop working after upgrade.
> > > >>>>>
> > > >>>>> If I am not the only one how have such concerns then we might
> open
> > a
> > > >>>> ticket to address this.
> > > >>>>>
> > > >>>>>
> > > >>>>> Best regards, Alexey
> > > >>>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Alexey Kuznetsov
> > > >>
> > > >>
> > >
> > >
> >
>

Re: Ignite not friendly for Monitoring

Posted by Dmitriy Setrakyan <ds...@apache.org>.
On Mon, Aug 28, 2017 at 1:22 AM, Vladimir Ozerov <vo...@gridgain.com>
wrote:

> IGNITE-5620 is about error codes thrown from drivers. This is completely
> different story, as every driver has specification with it's own specific
> error codes. There is no common denominator.
>

Vova, I am not sure I understand. I would expect that drivers should
provide the same SQL error codes as the underlying database. Perhaps
drivers have their own custom codes for errors in the driver itself, not in
SQL.

Can you please clarify?


>
> On Thu, Aug 17, 2017 at 11:10 PM, Denis Magda <dm...@apache.org> wrote:
>
> > Vladimir,
> >
> > I would disagree. In IGNITE-5620 we’re going to introduce some constant
> > error codes and prepare a sheet that will elaborate on every error.
> That’s
> > a part of bigger endeavor when the whole platform should be covered by
> > special unique IDs for errors, warning and events.
> >
> > Now, we need to agree at least on the IDs range for SQL.
> >
> > —
> > Denis
> >
> > > On Aug 15, 2017, at 11:10 PM, Vladimir Ozerov <vo...@gridgain.com>
> > wrote:
> > >
> > > Denis,
> > >
> > > IGNITE-5620 is completely different thing. Let's do not mix cluster
> > > monitoring and parser errors.
> > >
> > > ср, 16 авг. 2017 г. в 2:57, Denis Magda <dm...@apache.org>:
> > >
> > >> Alexey,
> > >>
> > >> Didn’t know that such an improvement as consistent IDs for errors and
> > >> events can be used as an integration point with the DevOps tools.
> Thanks
> > >> for sharing your experience with us.
> > >>
> > >> Would you step in as a architect for this task and make out a JIRA
> > ticket
> > >> with all the required information.
> > >>
> > >> In general, we’ve already planned to do something around this starting
> > >> with SQL:
> > >> https://issues.apache.org/jira/browse/IGNITE-5620 <
> > >> https://issues.apache.org/jira/browse/IGNITE-5620>
> > >>
> > >> It makes sense to consider your input before the work on IGNITE-5620
> is
> > >> started.
> > >>
> > >> —
> > >> Denis
> > >>
> > >>> On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin <
> > >> alexeykukushkin@yahoo.com.INVALID> wrote:
> > >>>
> > >>> Hi Alexey,
> > >>> A nice thing about delegating alerting to 3rd party enterprise
> systems
> > >> is that those systems already deal with lots of things including
> > >> distributed apps.
> > >>> What is needed from Ignite is to consistently write to log files
> (again
> > >> that means stable event IDs, proper event granularity, no repetition,
> > >> documentation). This would be 3rd party monitoring system's
> > responsibility
> > >> to monitor log files on all nodes, filter, aggregate, process,
> visualize
> > >> and notify on events.
> > >>> How a monitoring tool would deal with an event like "node left":
> > >>> The only thing needed from Ignite is to write an entry like below to
> > log
> > >> files on all Ignite servers. In this example 3300 identifies this
> "node
> > >> left" event and will never change in the future even if text
> description
> > >> changes:
> > >>> [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the
> > >> cluster
> > >>> Then we document somewhere on the web that Ignite has event 3300 and
> it
> > >> means a node left the cluster. Maybe provide documentation how to deal
> > with
> > >> it. Some examples:Oracle Web Cache events:
> > >> https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/
> > event.htm#sthref2393MS
> > >> SQL Server events:
> > >> https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
> > >>> That is all for Ignite! Everything else is handled by specific
> > >> monitoring system configured by DevOps on the customer side.
> > >>> Basing on the Ignite documentation similar to above, DevOps of a
> > company
> > >> where Ignite is going to be used will configure their monitoring
> system
> > to
> > >> understand Ignite events. Consider the "node left" event as an
> example.
> > >>> - This event is output on every node but DevOps do not want to be
> > >> notified many times. To address this, they will build an "Ignite
> model"
> > >> where there will be a parent-child dependency between components
> "Ignite
> > >> Cluster" and "Ignite Node". For example, this is how you do it in
> > Nagios:
> > >> https://assets.nagios.com/downloads/nagioscore/docs/
> > nagioscore/4/en/dependencies.html
> > >> and this is how you do it in Microsoft SCSM:
> > >> https://docs.microsoft.com/en-us/system-center/scsm/auth-classes.
> Then
> > >> DevOps will configure "node left" monitors in SCSM (or a "checks" in
> > >> Nagios) for parent "Ignite Cluster" and child "Ignite Service"
> > components.
> > >> State change (OK -> WARNING) and notification (email, SMS, whatever)
> > will
> > >> be configured only for the "Ignite Cluster"'s "node left" monitor.-
> Now
> > >> suppose a node left. The "node left" monitor (that uses log file
> > monitoring
> > >> plugin) on "Ignite Node" will detect the event and pass it to the
> > parent.
> > >> This will trigger "Ignite Cluster" state change from OK to WARNING and
> > send
> > >> a notification. No more notification will be sent unless the "Ignite
> > >> Cluster" state is reset back to OK, which happens either manually or
> on
> > >> timeout or automatically on "node joined".
> > >>> This was just FYI. We, Ignite developers, do not care about how
> > >> monitoring works - this is responsibility of customer's DevOps. Our
> > >> responsibility is consistent event logging.
> > >>> Thank you!
> > >>>
> > >>>
> > >>> Best regards, Alexey
> > >>>
> > >>>
> > >>> On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov <
> > >> akuznetsov@apache.org> wrote:
> > >>>
> > >>> Alexey,
> > >>>
> > >>> How you are going to deal with distributed nature of Ignite cluster?
> > >>> And how do you propose handle nodes restart / stop?
> > >>>
> > >>> On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <
> > >>> alexeykukushkin@yahoo.com.invalid> wrote:
> > >>>
> > >>>> Hi Denis,
> > >>>> Monitoring tools simply watch event logs for patterns (regex in case
> > of
> > >>>> unstructured logs like text files). A stable (not changing in new
> > >> releases)
> > >>>> event ID identifying specific issue would be such a pattern.
> > >>>> We need to introduce such event IDs according to the principles I
> > >>>> described in my previous mail.
> > >>>> Best regards, Alexey
> > >>>>
> > >>>>
> > >>>> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <
> > >>>> dmagda@apache.org> wrote:
> > >>>>
> > >>>> Hello Alexey,
> > >>>>
> > >>>> Thanks for the detailed input.
> > >>>>
> > >>>> Assuming that Ignite supported the suggested events based model, how
> > can
> > >>>> it be integrated with mentioned tools like DynaTrace or Nagios? Is
> > this
> > >> all
> > >>>> we need?
> > >>>>
> > >>>> —
> > >>>> Denis
> > >>>>
> > >>>>> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin <
> > >> alexeykukushkin@yahoo.com
> > >>>> .INVALID> wrote:
> > >>>>>
> > >>>>> Igniters,
> > >>>>> While preparing some Ignite materials for Administrators I found
> > Ignite
> > >>>> is not friendly for such a critical DevOps practice as monitoring.
> > >>>>> TL;DRI think Ignite misses structured descriptions of abnormal
> events
> > >>>> with references to event IDs in the logs not changing as new
> versions
> > >> are
> > >>>> released.
> > >>>>> MORE DETAILS
> > >>>>> I call an application “monitoring friendly” if it allows DevOps to:
> > >>>>> 1. immediately receive a notification (email, SMS, etc.)
> > >>>>> 2. understand what a problem is without involving developers
> > >>>>> 3. provide automated recovery action.
> > >>>>>
> > >>>>> Large enterprises do not implement custom solutions. They usually
> use
> > >>>> tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the
> > >>>> enterprise consistently. All such tools have similar architecture
> > >> providing
> > >>>> a dashboard showing apps as “green/yellow/red”, and numerous
> > >> “connectors”
> > >>>> to look for events in text logs, ESBs, database tables, etc.
> > >>>>>
> > >>>>> For each app DevOps build a “health model” - a diagram displaying
> the
> > >>>> app’s “manageable” components and the app boundaries. A “manageable”
> > >>>> component is something that can be started/stopped/configured in
> > >> isolation.
> > >>>> “System boundary” is a list of external apps that the monitored app
> > >>>> interacts with.
> > >>>>>
> > >>>>> The main attribute of a manageable component is a list of
> > >> “operationally
> > >>>> significant events”. Those are the events that DevOps can do
> something
> > >>>> with. For example, “failed to connect to cache store” is
> significant,
> > >> while
> > >>>> “user input validation failed” is not.
> > >>>>>
> > >>>>> Events shall be as specific as possible so that DevOps do not spend
> > >> time
> > >>>> for further analysis. For example, a “database failure” event is not
> > >> good.
> > >>>> There should be “database connection failure”, “invalid database
> > >> schema”,
> > >>>> “database authentication failure”, etc. events.
> > >>>>>
> > >>>>> “Event” is NOT the same as exception occurred in the code. Events
> > >>>> identify specific problem from the DevOps point of view. For
> example,
> > >> even
> > >>>> if “connection to cache store failed” exception might be thrown from
> > >>>> several places in the code, that is still the same event. On the
> other
> > >>>> side, even if a SqlServerConnectionTimeout and
> OracleConnectionTimeout
> > >>>> exceptions might be caught in the same place, those are different
> > events
> > >>>> since MS SQL Server and Oracle are usually different DevOps groups
> in
> > >> large
> > >>>> enterprises!
> > >>>>>
> > >>>>> The operationally significant event IDs must be stable: they must
> not
> > >>>> change from one release to another. This is like a contract between
> > >>>> developers and DevOps.
> > >>>>>
> > >>>>> This should be the developer’s responsibility to publish and
> > maintain a
> > >>>> table with attributes:
> > >>>>>
> > >>>>> - Event ID
> > >>>>> - Severity: Critical (Red) - the system is not operational; Warning
> > >>>> (Yellow) - the system is operational but health is degraded; None -
> > >> just an
> > >>>> info.
> > >>>>> - Description: concise but enough for DevOps to act without
> > developer’s
> > >>>> help
> > >>>>> - Recovery actions: what DevOps shall do to fix the issue without
> > >>>> developer’s help. DevOps might create automated recovery scripts
> based
> > >> on
> > >>>> this information.
> > >>>>>
> > >>>>> For example:
> > >>>>> 10100 - Critical - Could not connect to Zookeeper to discovery
> nodes
> > -
> > >>>> 1) Open ignite configuration and find zookeeper connection string 2)
> > >> Make
> > >>>> sure the Zookeeper is running
> > >>>>> 10200 - Warning - Ignite node left the cluster.
> > >>>>>
> > >>>>> Back to Ignite: it looks to me we do not design for operations as
> > >>>> described above. We have no event IDs: our logging is subject to
> > change
> > >> in
> > >>>> new version so that any patterns DevOps might use to detect
> > significant
> > >>>> events would stop working after upgrade.
> > >>>>>
> > >>>>> If I am not the only one how have such concerns then we might open
> a
> > >>>> ticket to address this.
> > >>>>>
> > >>>>>
> > >>>>> Best regards, Alexey
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Alexey Kuznetsov
> > >>
> > >>
> >
> >
>

Re: Ignite not friendly for Monitoring

Posted by Vladimir Ozerov <vo...@gridgain.com>.
IGNITE-5620 is about error codes thrown from drivers. This is a completely
different story, as every driver has a specification with its own specific
error codes. There is no common denominator.

On Thu, Aug 17, 2017 at 11:10 PM, Denis Magda <dm...@apache.org> wrote:

> Vladimir,
>
> I would disagree. In IGNITE-5620 we’re going to introduce some constant
> error codes and prepare a sheet that will elaborate on every error. That’s
> a part of bigger endeavor when the whole platform should be covered by
> special unique IDs for errors, warning and events.
>
> Now, we need to agree at least on the IDs range for SQL.
>
> —
> Denis
>
> > On Aug 15, 2017, at 11:10 PM, Vladimir Ozerov <vo...@gridgain.com>
> wrote:
> >
> > Denis,
> >
> > IGNITE-5620 is completely different thing. Let's do not mix cluster
> > monitoring and parser errors.
> >
> > ср, 16 авг. 2017 г. в 2:57, Denis Magda <dm...@apache.org>:
> >
> >> Alexey,
> >>
> >> Didn’t know that such an improvement as consistent IDs for errors and
> >> events can be used as an integration point with the DevOps tools. Thanks
> >> for sharing your experience with us.
> >>
> >> Would you step in as a architect for this task and make out a JIRA
> ticket
> >> with all the required information.
> >>
> >> In general, we’ve already planned to do something around this starting
> >> with SQL:
> >> https://issues.apache.org/jira/browse/IGNITE-5620 <
> >> https://issues.apache.org/jira/browse/IGNITE-5620>
> >>
> >> It makes sense to consider your input before the work on IGNITE-5620 is
> >> started.
> >>
> >> —
> >> Denis
> >>
> >>> On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin <
> >> alexeykukushkin@yahoo.com.INVALID> wrote:
> >>>
> >>> Hi Alexey,
> >>> A nice thing about delegating alerting to 3rd party enterprise systems
> >> is that those systems already deal with lots of things including
> >> distributed apps.
> >>> What is needed from Ignite is to consistently write to log files (again
> >> that means stable event IDs, proper event granularity, no repetition,
> >> documentation). This would be 3rd party monitoring system's
> responsibility
> >> to monitor log files on all nodes, filter, aggregate, process, visualize
> >> and notify on events.
> >>> How a monitoring tool would deal with an event like "node left":
> >>> The only thing needed from Ignite is to write an entry like below to
> log
> >> files on all Ignite servers. In this example 3300 identifies this "node
> >> left" event and will never change in the future even if text description
> >> changes:
> >>> [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the
> >> cluster
> >>> Then we document somewhere on the web that Ignite has event 3300 and it
> >> means a node left the cluster. Maybe provide documentation how to deal
> with
> >> it. Some examples:Oracle Web Cache events:
> >> https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/
> event.htm#sthref2393MS
> >> SQL Server events:
> >> https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
> >>> That is all for Ignite! Everything else is handled by specific
> >> monitoring system configured by DevOps on the customer side.
> >>> Basing on the Ignite documentation similar to above, DevOps of a
> company
> >> where Ignite is going to be used will configure their monitoring system
> to
> >> understand Ignite events. Consider the "node left" event as an example.
> >>> - This event is output on every node but DevOps do not want to be
> >> notified many times. To address this, they will build an "Ignite model"
> >> where there will be a parent-child dependency between components "Ignite
> >> Cluster" and "Ignite Node". For example, this is how you do it in
> Nagios:
> >> https://assets.nagios.com/downloads/nagioscore/docs/
> nagioscore/4/en/dependencies.html
> >> and this is how you do it in Microsoft SCSM:
> >> https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. Then
> >> DevOps will configure "node left" monitors in SCSM (or a "checks" in
> >> Nagios) for parent "Ignite Cluster" and child "Ignite Service"
> components.
> >> State change (OK -> WARNING) and notification (email, SMS, whatever)
> will
> >> be configured only for the "Ignite Cluster"'s "node left" monitor.- Now
> >> suppose a node left. The "node left" monitor (that uses log file
> monitoring
> >> plugin) on "Ignite Node" will detect the event and pass it to the
> parent.
> >> This will trigger "Ignite Cluster" state change from OK to WARNING and
> send
> >> a notification. No more notification will be sent unless the "Ignite
> >> Cluster" state is reset back to OK, which happens either manually or on
> >> timeout or automatically on "node joined".
> >>> This was just FYI. We, Ignite developers, do not care about how
> >> monitoring works - this is responsibility of customer's DevOps. Our
> >> responsibility is consistent event logging.
> >>> Thank you!
> >>>
> >>>
> >>> Best regards, Alexey
> >>>
> >>>
> >>> On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov <
> >> akuznetsov@apache.org> wrote:
> >>>
> >>> Alexey,
> >>>
> >>> How you are going to deal with distributed nature of Ignite cluster?
> >>> And how do you propose handle nodes restart / stop?
> >>>
> >>> On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <
> >>> alexeykukushkin@yahoo.com.invalid> wrote:
> >>>
> >>>> Hi Denis,
> >>>> Monitoring tools simply watch event logs for patterns (regex in case
> of
> >>>> unstructured logs like text files). A stable (not changing in new
> >> releases)
> >>>> event ID identifying specific issue would be such a pattern.
> >>>> We need to introduce such event IDs according to the principles I
> >>>> described in my previous mail.
> >>>> Best regards, Alexey
> >>>>
> >>>>
> >>>> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <
> >>>> dmagda@apache.org> wrote:
> >>>>
> >>>> Hello Alexey,
> >>>>
> >>>> Thanks for the detailed input.
> >>>>
> >>>> Assuming that Ignite supported the suggested events based model, how
> can
> >>>> it be integrated with mentioned tools like DynaTrace or Nagios? Is
> this
> >> all
> >>>> we need?
> >>>>
> >>>> —
> >>>> Denis
> >>>>
> >>>>> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin <
> >> alexeykukushkin@yahoo.com
> >>>> .INVALID> wrote:
> >>>>>
> >>>>> Igniters,
> >>>>> While preparing some Ignite materials for Administrators I found
> Ignite
> >>>> is not friendly for such a critical DevOps practice as monitoring.
> >>>>> TL;DRI think Ignite misses structured descriptions of abnormal events
> >>>> with references to event IDs in the logs not changing as new versions
> >> are
> >>>> released.
> >>>>> MORE DETAILS
> >>>>> I call an application “monitoring friendly” if it allows DevOps to:
> >>>>> 1. immediately receive a notification (email, SMS, etc.)
> >>>>> 2. understand what a problem is without involving developers
> >>>>> 3. provide automated recovery action.
> >>>>>
> >>>>> Large enterprises do not implement custom solutions. They usually use
> >>>> tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the
> >>>> enterprise consistently. All such tools have similar architecture
> >> providing
> >>>> a dashboard showing apps as “green/yellow/red”, and numerous
> >> “connectors”
> >>>> to look for events in text logs, ESBs, database tables, etc.
> >>>>>
> >>>>> For each app DevOps build a “health model” - a diagram displaying the
> >>>> app’s “manageable” components and the app boundaries. A “manageable”
> >>>> component is something that can be started/stopped/configured in
> >> isolation.
> >>>> “System boundary” is a list of external apps that the monitored app
> >>>> interacts with.
> >>>>>
> >>>>> The main attribute of a manageable component is a list of
> >> “operationally
> >>>> significant events”. Those are the events that DevOps can do something
> >>>> with. For example, “failed to connect to cache store” is significant,
> >> while
> >>>> “user input validation failed” is not.
> >>>>>
> >>>>> Events shall be as specific as possible so that DevOps do not spend
> >> time
> >>>> for further analysis. For example, a “database failure” event is not
> >> good.
> >>>> There should be “database connection failure”, “invalid database
> >> schema”,
> >>>> “database authentication failure”, etc. events.
> >>>>>
> >>>>> “Event” is NOT the same as exception occurred in the code. Events
> >>>> identify specific problem from the DevOps point of view. For example,
> >> even
> >>>> if “connection to cache store failed” exception might be thrown from
> >>>> several places in the code, that is still the same event. On the other
> >>>> side, even if a SqlServerConnectionTimeout and OracleConnectionTimeout
> >>>> exceptions might be caught in the same place, those are different
> events
> >>>> since MS SQL Server and Oracle are usually different DevOps groups in
> >> large
> >>>> enterprises!
> >>>>>
> >>>>> The operationally significant event IDs must be stable: they must not
> >>>> change from one release to another. This is like a contract between
> >>>> developers and DevOps.
> >>>>>
> >>>>> This should be the developer’s responsibility to publish and
> maintain a
> >>>> table with attributes:
> >>>>>
> >>>>> - Event ID
> >>>>> - Severity: Critical (Red) - the system is not operational; Warning
> >>>> (Yellow) - the system is operational but health is degraded; None -
> >> just an
> >>>> info.
> >>>>> - Description: concise but enough for DevOps to act without
> developer’s
> >>>> help
> >>>>> - Recovery actions: what DevOps shall do to fix the issue without
> >>>> developer’s help. DevOps might create automated recovery scripts based
> >> on
> >>>> this information.
> >>>>>
> >>>>> For example:
> >>>>> 10100 - Critical - Could not connect to Zookeeper to discovery nodes
> -
> >>>> 1) Open ignite configuration and find zookeeper connection string 2)
> >> Make
> >>>> sure the Zookeeper is running
> >>>>> 10200 - Warning - Ignite node left the cluster.
> >>>>>
> >>>>> Back to Ignite: it looks to me we do not design for operations as
> >>>> described above. We have no event IDs: our logging is subject to
> change
> >> in
> >>>> new version so that any patterns DevOps might use to detect
> significant
> >>>> events would stop working after upgrade.
> >>>>>
> >>>>> If I am not the only one how have such concerns then we might open a
> >>>> ticket to address this.
> >>>>>
> >>>>>
> >>>>> Best regards, Alexey
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Alexey Kuznetsov
> >>
> >>
>
>

Re: Ignite not friendly for Monitoring

Posted by Denis Magda <dm...@apache.org>.
Vladimir,

I would disagree. In IGNITE-5620 we’re going to introduce some constant error codes and prepare a sheet that will elaborate on every error. That’s part of a bigger endeavor in which the whole platform should be covered by special unique IDs for errors, warnings and events. 

Now, we need to agree at least on the ID range for SQL.

—
Denis

> On Aug 15, 2017, at 11:10 PM, Vladimir Ozerov <vo...@gridgain.com> wrote:
> 
> Denis,
> 
> IGNITE-5620 is completely different thing. Let's do not mix cluster
> monitoring and parser errors.
> 
> ср, 16 авг. 2017 г. в 2:57, Denis Magda <dm...@apache.org>:
> 
>> Alexey,
>> 
>> Didn’t know that such an improvement as consistent IDs for errors and
>> events can be used as an integration point with the DevOps tools. Thanks
>> for sharing your experience with us.
>> 
>> Would you step in as a architect for this task and make out a JIRA ticket
>> with all the required information.
>> 
>> In general, we’ve already planned to do something around this starting
>> with SQL:
>> https://issues.apache.org/jira/browse/IGNITE-5620 <
>> https://issues.apache.org/jira/browse/IGNITE-5620>
>> 
>> It makes sense to consider your input before the work on IGNITE-5620 is
>> started.
>> 
>> —
>> Denis
>> 
>>> On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin <
>> alexeykukushkin@yahoo.com.INVALID> wrote:
>>> 
>>> Hi Alexey,
>>> A nice thing about delegating alerting to 3rd party enterprise systems
>> is that those systems already deal with lots of things including
>> distributed apps.
>>> What is needed from Ignite is to consistently write to log files (again
>> that means stable event IDs, proper event granularity, no repetition,
>> documentation). This would be 3rd party monitoring system's responsibility
>> to monitor log files on all nodes, filter, aggregate, process, visualize
>> and notify on events.
>>> How a monitoring tool would deal with an event like "node left":
>>> The only thing needed from Ignite is to write an entry like below to log
>> files on all Ignite servers. In this example 3300 identifies this "node
>> left" event and will never change in the future even if text description
>> changes:
>>> [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the
>> cluster
>>> Then we document somewhere on the web that Ignite has event 3300 and it
>> means a node left the cluster. Maybe provide documentation how to deal with
>> it. Some examples:Oracle Web Cache events:
>> https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393MS
>> SQL Server events:
>> https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
>>> That is all for Ignite! Everything else is handled by specific
>> monitoring system configured by DevOps on the customer side.
>>> Basing on the Ignite documentation similar to above, DevOps of a company
>> where Ignite is going to be used will configure their monitoring system to
>> understand Ignite events. Consider the "node left" event as an example.
>>> - This event is output on every node but DevOps do not want to be
>> notified many times. To address this, they will build an "Ignite model"
>> where there will be a parent-child dependency between components "Ignite
>> Cluster" and "Ignite Node". For example, this is how you do it in Nagios:
>> https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html
>> and this is how you do it in Microsoft SCSM:
>> https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. Then
>> DevOps will configure "node left" monitors in SCSM (or a "checks" in
>> Nagios) for parent "Ignite Cluster" and child "Ignite Service" components.
>> State change (OK -> WARNING) and notification (email, SMS, whatever) will
>> be configured only for the "Ignite Cluster"'s "node left" monitor.- Now
>> suppose a node left. The "node left" monitor (that uses log file monitoring
>> plugin) on "Ignite Node" will detect the event and pass it to the parent.
>> This will trigger "Ignite Cluster" state change from OK to WARNING and send
>> a notification. No more notification will be sent unless the "Ignite
>> Cluster" state is reset back to OK, which happens either manually or on
>> timeout or automatically on "node joined".
>>> This was just FYI. We, Ignite developers, do not care about how
>> monitoring works - this is responsibility of customer's DevOps. Our
>> responsibility is consistent event logging.
>>> Thank you!
>>> 
>>> 
>>> Best regards, Alexey
>>> 
>>> 


Re: Ignite not friendly for Monitoring

Posted by Vladimir Ozerov <vo...@gridgain.com>.
Denis,

IGNITE-5620 is a completely different thing. Let's not mix cluster
monitoring and parser errors.


Re: Ignite not friendly for Monitoring

Posted by Denis Magda <dm...@apache.org>.
Alexey,

Didn’t know that such an improvement as consistent IDs for errors and events could be used as an integration point with DevOps tools. Thanks for sharing your experience with us.

Would you step in as an architect for this task and create a JIRA ticket with all the required information?

In general, we’ve already planned to do something around this, starting with SQL:
https://issues.apache.org/jira/browse/IGNITE-5620

It makes sense to consider your input before the work on IGNITE-5620 is started.

—
Denis



Re: Ignite not friendly for Monitoring

Posted by Alexey Kukushkin <al...@yahoo.com.INVALID>.
Hi Alexey,
A nice thing about delegating alerting to 3rd-party enterprise systems is that those systems already deal with many such concerns, including distributed applications.
What is needed from Ignite is to write log entries consistently (again, that means stable event IDs, proper event granularity, no repetition, and documentation). It is then the 3rd-party monitoring system's responsibility to watch the log files on all nodes and to filter, aggregate, process, visualize and notify on events.
How a monitoring tool would deal with an event like "node left":
The only thing needed from Ignite is to write an entry like the one below to the log files on all Ignite servers. In this example 3300 identifies the "node left" event and will never change in the future, even if the text description changes:
[2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster
Then we document somewhere on the web that Ignite has event 3300 and that it means a node left the cluster, and perhaps describe how to deal with it. Some examples: Oracle Web Cache events: https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393 and MS SQL Server events: https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
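Just to make this concrete, here is a minimal Java sketch of how such an entry could be produced today from user code through Ignite's public event API (an illustration only, not a proposal for the final implementation). It assumes discovery events are enabled in the node configuration; the ID 3300 is the hypothetical value from the example above, not an official Ignite event code.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;

import java.util.logging.Logger;

public class StableEventIdLogger {
    /** Hypothetical stable ID for the "node left" event (not an official Ignite code). */
    private static final int EVT_ID_NODE_LEFT = 3300;

    private static final Logger LOG = Logger.getLogger("ignite.ops.events");

    public static void main(String[] args) {
        // Events are disabled by default; enable the "node left" discovery event explicitly.
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setIncludeEventTypes(EventType.EVT_NODE_LEFT);

        Ignite ignite = Ignition.start(cfg);

        // Re-log every "node left" discovery event with the stable numeric ID so that
        // a monitoring tool can match on the ID instead of the free-form message text.
        ignite.events().localListen(evt -> {
            DiscoveryEvent discoEvt = (DiscoveryEvent)evt;

            LOG.warning(String.format("%d Node %s left the cluster",
                EVT_ID_NODE_LEFT, discoEvt.eventNode().id()));

            return true; // Keep listening.
        }, EventType.EVT_NODE_LEFT);
    }
}

A monitoring agent can then match on the leading numeric ID rather than on the message text, which remains free to change between releases.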
That is all for Ignite! Everything else is handled by the specific monitoring system that DevOps configure on the customer side.
Based on Ignite documentation similar to the above, the DevOps team of a company where Ignite is going to be used will configure their monitoring system to understand Ignite events. Consider the "node left" event as an example.
- This event is output on every node, but DevOps do not want to be notified many times. To address this, they will build an "Ignite model" with a parent-child dependency between the components "Ignite Cluster" and "Ignite Node". For example, this is how you do it in Nagios: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html and this is how you do it in Microsoft SCSM: https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. Then DevOps will configure "node left" monitors in SCSM (or "checks" in Nagios) for the parent "Ignite Cluster" and the child "Ignite Node" components. The state change (OK -> WARNING) and the notification (email, SMS, whatever) will be configured only for the "Ignite Cluster" component's "node left" monitor.
- Now suppose a node left. The "node left" monitor (which uses a log file monitoring plugin) on "Ignite Node" will detect the event and pass it to the parent. This will trigger an "Ignite Cluster" state change from OK to WARNING and send a notification. No further notifications will be sent until the "Ignite Cluster" state is reset back to OK, which happens either manually, on timeout, or automatically on "node joined".
This was just FYI. We, Ignite developers, do not have to care about how monitoring works - that is the responsibility of the customer's DevOps. Our responsibility is consistent event logging.
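To show the roll-up in neutral terms, here is a conceptual Java sketch of that parent-child idea (an illustration of the behaviour only, not Nagios or SCSM code, and the names in it are made up): many per-node "node left" events collapse into a single "Ignite Cluster" state, and only the OK -> WARNING transition produces a notification.

import java.util.concurrent.atomic.AtomicReference;

public class ClusterHealthAggregator {
    public enum State { OK, WARNING }

    private final AtomicReference<State> clusterState = new AtomicReference<>(State.OK);

    /** Called by every per-node "node left" check (the child monitors). */
    public void onNodeLeft(String nodeId) {
        // Only the first transition from OK to WARNING triggers a notification;
        // further "node left" events while already in WARNING are silently absorbed.
        if (clusterState.compareAndSet(State.OK, State.WARNING))
            notifyDevOps("Ignite Cluster degraded: node " + nodeId + " left (event 3300)");
    }

    /** Called on "node joined", on timeout, or manually by an operator. */
    public void reset() {
        clusterState.set(State.OK);
    }

    private void notifyDevOps(String msg) {
        // Placeholder for whatever channel the monitoring tool provides (email, SMS, etc.).
        System.out.println("[ALERT] " + msg);
    }
}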
Thank you!


Best regards, Alexey



Re: Ignite not friendly for Monitoring

Posted by Alexey Kuznetsov <ak...@apache.org>.
Alexey,

How are you going to deal with the distributed nature of an Ignite cluster?
And how do you propose to handle node restarts/stops?


-- 
Alexey Kuznetsov

Re: Ignite not friendly for Monitoring

Posted by Alexey Kukushkin <al...@yahoo.com.INVALID>.
Hi Denis,
Monitoring tools simply watch event logs for patterns (regex in the case of unstructured logs like text files). A stable event ID (one that does not change in new releases) identifying a specific issue would be such a pattern.
We need to introduce such event IDs according to the principles I described in my previous mail.
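As an illustration (a sketch of a trivial log-watching check, not any particular tool's actual plugin), matching the stable event ID in a log line of the format discussed earlier in this thread could look like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EventIdMatcher {
    // Assumed log layout from the earlier example:
    // [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster
    private static final Pattern EVENT_LINE =
        Pattern.compile("^\\[[^\\]]+\\] \\[(\\w+)\\] (\\d{4,5}) (.*)$");

    public static void main(String[] args) {
        String line = "[2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster";

        Matcher m = EVENT_LINE.matcher(line);

        if (m.matches()) {
            String severity = m.group(1);                 // WARN
            int eventId = Integer.parseInt(m.group(2));   // 3300, stable across releases
            String message = m.group(3);

            // The check reacts to the ID, not to the wording of the message.
            if (eventId == 3300)
                System.out.println("Node-left detected (" + severity + "): " + message);
        }
    }
}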
Best regards, Alexey



Re: Ignite not friendly for Monitoring

Posted by Denis Magda <dm...@apache.org>.
Hello Alexey,

Thanks for the detailed input.

Assuming that Ignite supported the suggested event-based model, how could it be integrated with the mentioned tools like DynaTrace or Nagios? Is this all we need?

—
Denis
 