Posted to infrastructure-dev@apache.org by Christofer Dutz <ch...@c-ware.de> on 2016/08/18 09:13:56 UTC

Tool proposal for helping run and monitor the ASF Infra Services

Hi,



I have been on the Infra Hipchat for a few weeks now while trying to migrate the Flex project to Maven and back to the ASF Infra build system. Thanks for your support in this and even more thanks for the trust in granting me access and Admin rights on the windows1 build agent.



In the chat I observed your daily work, maintaining quite a zoo of different systems on different platforms. Some of the problems you were having seem quite easy to track down ... if the hard disk is full, you clean up. But not all problems are that easy. Take the problems with repository.apache.org ... here the cause was the proxy being flooded with connections (I think this was the case) ... regular restarts helped temporarily, but I don't think that helps in the long term, as no one had an idea why those connections were hanging there in the first place.



A few years ago the company I work for - codecentric - founded a company called Instana. They are developing an agent-based system for monitoring IT infrastructure. In contrast to most established solutions, they use machine learning strategies to analyze the root cause of problems. While you can probably achieve similar results with normal tools, you need very detailed domain knowledge to do so, and in a regularly changing environment you need to keep adjusting your metrics continuously. Instana does this automatically. I think you can imagine how tricky it is to trace the root cause of bad response times through a network of interconnected services.



Investing almost all of my free time (and a lot of my paid time) in Apache, and noticing a lot of the problems you have to deal with every day, I asked Instana if they would be willing to provide their service to the ASF for free. They agreed and immediately set up a dedicated instance.



I wanted to try the thing out, as I would prefer to grab a few beers with you at ApacheCon in Seville and not get punched in the face for recommending something bad ;-) ... so I tried this on my private server playground. I unpacked and started the agent, and the host appeared on the web console and reported the problems it was having (ones I didn't even know about) as well as the other systems it communicates with ... as soon as I added agents on these machines, the analytics started doing their work across systems and it built up a map view of my services and their correlation. So it's really a system that needs almost no configuration at all :-)


I uploaded the internal product presentation here: https://public.centerdevice.de/1a9dc4ed-515e-482e-9fd6-6d60a5562598 (please don't share this outside of the ASF)

Please use the password: 4p4cheR0cks (I'll remove that document in about two weeks)


By the way ... the screenshots in the presentation are real ... I was amazed to see a 3D web UI in production for the first time ;-)



So if there is any interest in this offer, I would be more than happy to provide credentials and assist you in getting started, so you could easily try it out. The guys at Instana would also be delighted to give you an online demo and answer any questions you might have. Feel free to contact Mirko directly for this: mirko.novakovic@codecentric.de



Chris

AW: AW: Tool proposal for helping run and monitor the ASF Infra Services

Posted by Christofer Dutz <ch...@c-ware.de>.
Hi Daniel,

I could imagine that this offer is not intended to be time-limited, but I'm not directly involved in Instana. I have asked the CEO of Instana to subscribe here and answer your questions.

But I am pretty sure he will be able to provide answers you will like ;-)

Chris



Sent from my Samsung Galaxy smartphone.


-------- Original message --------
From: Daniel Gruno <hu...@apache.org>
Date: 19.08.16 09:16 (GMT+01:00)
To: infrastructure-dev@apache.org
Subject: Re: AW: Tool proposal for helping run and monitor the ASF Infra Services

On 08/19/2016 08:20 AM, Christofer Dutz wrote:
> Hi Chris,
>
> I knew that someone had asked exactly the "how does it compare to datadog" question somewhere. Here's the link to that thread: https://news.ycombinator.com/item?id=12147219
>
> And I can confirm the shortcomings of the time series approach, because in Jenkins, I'd say about 70% of recent failures of Flex builds were due to timeouts when uploading Maven artifacts to Nexus. The current solution doesn't seem to detect that. Not only that, I couldn't see any HipChat notifications either. The infra guys always had to start looking for the real reason for the timeouts, as Nexus itself wasn't having any problems at all.

I'd say the real reason we weren't being notified is because httpd
wasn't telling us it was jammed, so it would have made little difference
whether we used product X or Y to monitor it, when it wasn't sending out
data that could lead us to what the problem was. It wasn't a question of
granularity, the data just wasn't enabled for any agent to see.

I would imagine this would be something _instead_ of datadog, which
leads me to some questions:

- For how long into the future could we have a guarantee that this isn't
gonna cost us $$$/year or whatever the price would end up being with a
non-free version?

- What is understood by real time checks here? and what exactly is checked?

- How does this relate to more advanced monitoring systems like
Circonus? As you may know, we had to drop that as it proved to be rather
time consuming.

- What sort of integrations does this system have? How are alerts
dispatched?

I'd also be interested in learning about plugins and how to customize
the agents.

With regards,
Daniel.



Re: AW: Tool proposal for helping run and monitor the ASF Infra Services

Posted by Daniel Gruno <hu...@apache.org>.
On 08/19/2016 08:20 AM, Christofer Dutz wrote:
> Hi Chris,
> 
> I knew that someone had asked exactly the "how does it compare to datadog" question somewhere. Here's the link to that thread: https://news.ycombinator.com/item?id=12147219
>
> And I can confirm the shortcomings of the time series approach, because in Jenkins, I'd say about 70% of recent failures of Flex builds were due to timeouts when uploading Maven artifacts to Nexus. The current solution doesn't seem to detect that. Not only that, I couldn't see any HipChat notifications either. The infra guys always had to start looking for the real reason for the timeouts, as Nexus itself wasn't having any problems at all.

I'd say the real reason we weren't being notified is because httpd
wasn't telling us it was jammed, so it would have made little difference
whether we used product X or Y to monitor it, when it wasn't sending out
data that could lead us to what the problem was. It wasn't a question of
granularity, the data just wasn't enabled for any agent to see.

I would imagine this would be something _instead_ of datadog, which
leads me to some questions:

- For how long into the future could we have a guarantee that this isn't
gonna cost us $$$/year or whatever the price would end up being with a
non-free version?

- What is understood by real time checks here? and what exactly is checked?

- How does this relate to more advanced monitoring systems like
Circonus? As you may know, we had to drop that as it proved to be rather
time consuming.

- What sort of integrations does this system have? How are alerts
dispatched?

I'd also be interested in learning about plugins and how to customize
the agents.

With regards,
Daniel.



AW: Tool proposal for helping run and monitor the ASF Infra Services

Posted by Christofer Dutz <ch...@c-ware.de>.
Hi Chris,

I knew that someone had asked exactly the "how does it compare to datadog" question somewhere. Here's the link to that thread: https://news.ycombinator.com/item?id=12147219

And I can confirm the shortcomings of the time series approach, because in Jenkins, I'd say about 70% of recent failures of Flex builds were due to timeouts when uploading Maven artifacts to Nexus. The current solution doesn't seem to detect that. Not only that, I couldn't see any HipChat notifications either. The infra guys always had to start looking for the real reason for the timeouts, as Nexus itself wasn't having any problems at all.

And I really like the feature of tracing the response time of one service back to other servers to find the real reason for a system being slow (have a look at the presentation for this; there's a great slide on it).

Chris


Sent from my Samsung Galaxy smartphone.


-------- Original message --------
From: Chris Lambertus <cm...@apache.org>
Date: 19.08.16 07:03 (GMT+01:00)
To: infrastructure-dev@apache.org, Christofer Dutz <ch...@c-ware.de>
Cc: mirko.novakovic@codecentric.de
Subject: Re: Tool proposal for helping run and monitor the ASF Infra Services



Hiya Chris,

Thanks for the info and the legwork on this. We currently use DataDog, which is very similar to what Instana appears to provide — an agent-based monitoring solution that gives us that kind of look into our infra. We also have a number of internal tools that report on various goings-on as well. You might see some of this in #asfinfra on hipchat from SNMP2HipChat. DataDog also reports various problems there, as does our monitoring via PingMyBox.

Since you’re not root@, you may not see some of the stuff that we see, but I think by and large, the majority of the monitors do direct to #asfinfra. Have you noticed gaps in the monitoring? Since we moved to DataDog, we’ve been quite happy with the resolution and metrics we’ve been able to get. It’s been on my back burner for awhile to expose some of our DD dashboards as public, but for right now it’s somewhat limited access. In the interests of transparency (but not at the expense of security,) I’d be happy to work with you to expose more of this, and I’m happy to address any questions or concerns about shortcomings in our monitoring.

Many thanks to Instana for offering the ASF free services! I’d definitely like to hear more about what they might be able to offer on top of what we already get from DataDog. I’ll take a look at the info you sent out. Please feel free to follow up with me directly, either via email or hipchat.

Cheers,
-Chris






Re: Tool proposal for helping run and monitor the ASF Infra Services

Posted by Chris Lambertus <cm...@apache.org>.

Hiya Chris,

Thanks for the info and the legwork on this. We currently use DataDog, which is very similar to what Instana appears to provide — an agent-based monitoring solution that gives us that kind of look into our infra. We also have a number of internal tools that report on various goings-on as well. You might see some of this in #asfinfra on hipchat from SNMP2HipChat. DataDog also reports various problems there, as does our monitoring via PingMyBox.

Since you’re not root@, you may not see some of the stuff that we see, but I think by and large, the majority of the monitors do direct to #asfinfra. Have you noticed gaps in the monitoring? Since we moved to DataDog, we’ve been quite happy with the resolution and metrics we’ve been able to get. It’s been on my back burner for awhile to expose some of our DD dashboards as public, but for right now it’s somewhat limited access. In the interests of transparency (but not at the expense of security,) I’d be happy to work with you to expose more of this, and I’m happy to address any questions or concerns about shortcomings in our monitoring.

Many thanks to Instana for offering the ASF free services! I’d definitely like to hear more about what they might be able to offer on top of what we already get from DataDog. I’ll take a look at the info you sent out. Please feel free to follow up with me directly, either via email or hipchat.

Cheers,
-Chris



