Posted to user@cassandra.apache.org by Matthew Stump <ms...@vorstella.com> on 2019/02/19 23:34:45 UTC

Looking for feedback on automated root-cause system

Howdy,

I’ve been engaged in the Cassandra user community for a long time, almost 8
years, and have worked on hundreds of Cassandra deployments. One of the
things I’ve noticed in myself and in a lot of my peers who have done
consulting or support, or worked on really big deployments, is that we get
burnt out. We fight a lot of the same fires over and over again, and don’t
get to work on new or interesting stuff. Also, what we do is really hard to
transfer to other people because it’s based on experience.

Over the past year my team and I have been working to overcome that gap,
creating an assistant that’s able to scale some of this knowledge. We’ve
got it to the point where it’s able to classify known root causes for an
outage or an SLA breach in Cassandra with an accuracy greater than 90%. It
can accurately diagnose bugs, data-modeling issues, or misuse of certain
features, and when it does, it gives you specific remediation steps with
links to knowledge base articles.

We think we’ve seeded our database with enough root causes that it’ll catch
the vast majority of issues but there is always the possibility that we’ll
run into something previously unknown like CASSANDRA-11170 (one of the
issues our system found in the wild).

We’re looking for feedback and would like to know if anyone is interested
in giving the product a trial. The process would be a collaboration, where
we both get to learn from each other and improve how we’re doing things.

Thanks,
Matt Stump

RE: Looking for feedback on automated root-cause system

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
I see they have a website now at https://vorstella.com/

 

 



RE: Looking for feedback on automated root-cause system

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
Sounds like a promising step forward.  I’d certainly like to know when the blog posts are up. 

 

Kenneth Brotman

 



RE: Looking for feedback on automated root-cause system

Posted by Kenneth Brotman <ke...@yahoo.com>.
I found their YouTube video, Machine Learning & The future of DevOps – An Intro to Vorstella: https://www.youtube.com/watch?v=YZ5_LAXvUUo

 

 


RE: Looking for feedback on automated root-cause system

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
You are the real deal. I know you’ve been a top notch person in the community for a long time.  Glad to hear that this is coming.  It’s very exciting!

 


Re: Looking for feedback on automated root-cause system

Posted by Matthew Stump <ms...@vorstella.com>.
We probably will, that'll come soon-ish (a couple of weeks perhaps). Right
now we're limited by who we can engage with in order to collect feedback.


RE: Looking for feedback on automated root-cause system

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
Simulators will never get you there.  Why don’t you let everyone plug in to the NOC in exchange for standard features or limited scale, and make some money on the big cats that you can make the value proposition attractive for anyway.  You get the data you have to have, and for free; everyone’s Cassandra cluster gets smart!

 

 


Re: Looking for feedback on automated root-cause system

Posted by Matthew Stump <ms...@vorstella.com>.
Getting people to send data to us can be a little bit of a PITA, but it's
doable. We've got data from regulated/secure environments streaming in.
None of the data we collect is a risk, but the default is to say no and
you've got to overcome that barrier. We've been through the audit a bunch
of times; it gets easier each time because everyone asks more or less the
same questions and requires the same set of disclosures.

Cold start for AI is always an issue but we overcame it via two routes:

We had customers from a pre-existing line of business. We were probably the
first ones to run production Cassandra workloads at scale in k8s. We funded
the work behind some of the initial blog posts and had to figure out
most of the ins-and-outs of making it work. This data is good for helping
to identify edge cases and bugs that you wouldn't normally encounter, but
it's super noisy and you've got to do a lot to isolate and/or derive value
from the data in the beginning if you're attempting to do root-cause analysis.

Leveraging the above, we built out an extensive simulations pipeline. It
initially started as Python scripts targeting k8s, but it's since been
fully automated with Spinnaker.  We have a couple of simulations running
all the time doing continuous integration with the models, collectors and
pipeline code, but will burst out to a couple hundred clusters if we need
to test something complicated. It takes just a couple of minutes to have
it spin up hundreds of different load generators, targeting different
versions of C*, running with different topologies, using clean disks or
restoring from previous snapshots.
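
To give a feel for the shape of that matrix, here is a simplified Python
sketch: take the cross product of Cassandra versions, topologies, load
profiles, and disk states and emit one simulation job per combination. The
specific versions, sizes, and field names below are illustrative only, not
our Spinnaker pipeline definitions.

    # Illustrative sketch of fanning a simulation matrix out into per-cluster jobs.
    # Versions, topologies, and load profiles here are examples only.
    import itertools
    from typing import Dict, List

    CASSANDRA_VERSIONS = ["3.11.4", "4.0-alpha"]
    TOPOLOGIES = [
        {"nodes": 3, "racks": 1},
        {"nodes": 9, "racks": 3},
    ]
    LOAD_PROFILES = [
        {"tool": "cassandra-stress", "write_pct": 100},
        {"tool": "cassandra-stress", "write_pct": 20},
    ]
    DISK_STATES = ["clean", "restore-from-snapshot"]

    def build_jobs() -> List[Dict]:
        """One job per (version, topology, load, disk) combination."""
        jobs = []
        for version, topo, load, disk in itertools.product(
                CASSANDRA_VERSIONS, TOPOLOGIES, LOAD_PROFILES, DISK_STATES):
            jobs.append({
                "name": f"sim-{version}-{topo['nodes']}n-{load['write_pct']}w-{disk}",
                "cassandra_version": version,
                "topology": topo,
                "load": load,
                "disk_state": disk,
            })
        return jobs

    if __name__ == "__main__":
        jobs = build_jobs()
        print(f"{len(jobs)} simulated clusters to launch")  # 2*2*2*2 = 16 here
        for job in jobs[:3]:
            print(job["name"])

In practice each of those job definitions gets handed to the pipeline, which
provisions the cluster and the load generator in k8s.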

As the corpus grows, simulations matter less, and it's easier to get signal
from noise in a customer cluster.


RE: Looking for feedback on automated root-cause system

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
Matt,

 

Do you anticipate having trouble getting clients to allow the collector to send data up to your NOC?  Wouldn’t a lot of companies be unable or uneasy about that?

 

Your ML can only work if it’s got LOTS of data from many different scenarios.  How are you addressing that?  How are you able to get that much good quality data?

 

Kenneth Brotman

 


Re: Looking for feedback on automated root-cause system

Posted by Matt Stump <mr...@gmail.com>.
For some reason responses to the thread didn't hit my work email; I didn't
see them until I checked from my personal account.

The way that the system works is that we install a collector that pulls a
bunch of metrics from each node and sends it up to our NOC every minute.
We've got a bunch of stream processors that take this data and do a bunch
of things with it. We've got some dumb ones that check for common
misconfigurations, bugs, etc.; they also populate dashboards and a couple
of minimal graphs. The more intelligent agents take a look at the metrics
and start generating a bunch of calculated/scaled metrics and events.
If one of these crosses a threshold, we kick off the ML that uses
the stored data to classify the root cause and point
you to the correct knowledge base article with remediation steps. Because
we've got the cluster history we can identify an SLA breach and give you a
root cause in about 1 minute. The goal is to get you from 0 to resolution as
quickly as possible.
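
To make that flow concrete, here is a rough Python sketch of the
threshold-then-classify loop described above. The metric names, thresholds,
toy classifier, and knowledge base entries are all made up for illustration;
this is not our actual pipeline code, just the shape of it.

    # Toy sketch of the "threshold triggers classification" flow described above.
    # All names, thresholds, and rules are illustrative, not the real pipeline.
    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class MetricSample:
        node: str
        metrics: Dict[str, float]  # e.g. {"pending_compactions": 40, "p99_read_ms": 310}

    # Map a classified root cause to a knowledge base article with remediation steps.
    KNOWLEDGE_BASE = {
        "compaction_backlog": "https://kb.example.com/cassandra/compaction-backlog",
        "wide_partition":     "https://kb.example.com/cassandra/wide-partitions",
    }

    def derive_events(sample: MetricSample) -> List[str]:
        """Cheap 'dumb' checks plus calculated metrics that can trip a threshold."""
        events = []
        if sample.metrics.get("pending_compactions", 0) > 30:
            events.append("compaction_pressure")
        if sample.metrics.get("p99_read_ms", 0) > 250:
            events.append("sla_breach")
        return events

    def classify_root_cause(history: List[MetricSample]) -> str:
        """Stand-in for the ML classifier: looks at stored history, returns a label."""
        backlog = sum(s.metrics.get("pending_compactions", 0) for s in history)
        return "compaction_backlog" if backlog > 100 else "wide_partition"

    def handle(sample: MetricSample, history: List[MetricSample]) -> None:
        history.append(sample)
        if "sla_breach" in derive_events(sample):
            cause = classify_root_cause(history)
            print(f"{sample.node}: breach detected, likely cause={cause}, "
                  f"see {KNOWLEDGE_BASE[cause]}")

    if __name__ == "__main__":
        history: List[MetricSample] = []
        handle(MetricSample("node1", {"pending_compactions": 60, "p99_read_ms": 320}), history)

The real checks, derived metrics, and classifier are obviously far richer,
but the control flow is what's described above: cheap checks run on every
sample, and classification only runs once a threshold trips.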

We're looking for feedback on the existing system: do these events make
sense, do I need to beef up a knowledge base article, did it classify
correctly, or is there some big bug that everyone is running into that
needs to be publicized? We're also looking for where to go next: which
models are going to make your life easier?

The system works for C*, Elastic and Kafka. We'll be doing some blog posts
explaining in more detail how it works and some of the interesting things
we've found. For example: everything everyone thought they knew about
Cassandra thread pool tuning is wrong, nobody really knows how to tune
Kafka for large messages, and there are major issues with the
Kubernetes charts that people are using.




RE: Looking for feedback on automated root-cause system

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
Any information you can share on the inputs it needs/uses would be helpful.

 

Kenneth Brotman

 


Re: Looking for feedback on automated root-cause system

Posted by daemeon reiydelle <da...@gmail.com>.
Welcome to the world of testing predictive analytics. I will pass this on
to my folks at Accenture; I know of a couple of C* clients we run and am
wondering what you had in mind?


Daemeon C.M. Reiydelle

email: daemeonr@gmail.com
San Francisco 1.415.501.0198 / London 44 020 8144 9872 / Skype daemeon.c.m.reiydelle


