Posted to user@mesos.apache.org by Jeremy Olexa <jo...@spscommerce.com> on 2015/10/04 10:45:18 UTC

Advice on agent disconnect and reconnect

Hello,


We have been observing agent process disconnects when our agents run in one datacenter, A, and connect to the master cluster in another datacenter, B. I would like to mitigate this issue because it ejects all of the running applications, and then the sandbox links, etc., are unavailable because the slave is "lost".


I have attached the disconnect portion of the log here:

https://gist.github.com/jolexa/1a80e26a4b017846d083


I am curious if anyone can offer some advice on making the relevant Mesos processes more resilient in this regard. I'm confused by all the timeout options and I don't know exactly what is safe to tweak.


Thanks for any assistance!

-Jeremy


Re: Advice on agent disconnect and reconnect

Posted by Benjamin Mahler <be...@gmail.com>.
There are a few things going on there: you're having ZooKeeper connectivity
issues, and the master is not able to health check the agent.

I would recommend triaging what occurred in your network, but you can
increase the master's health check timeouts as a mitigation. You can also
control the maximum rate at which the master removes unhealthy agents.
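
As a rough sketch of what that tuning could look like: the helper below is
hypothetical (the class and method names are made up for illustration), and
the flag names and defaults are assumptions based on the master options of
this era (--slave_ping_timeout, --max_slave_ping_timeouts,
--slave_removal_rate_limit), so check them against `mesos-master --help` for
your version. It only computes how long an agent can miss pings before the
master gives up on it, and prints the matching flags.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthCheckSettings:
    """Hypothetical helper; not part of Mesos."""

    # Assumed defaults: 15 second ping timeout, 5 missed pings allowed.
    ping_timeout_secs: int = 15
    max_ping_timeouts: int = 5
    # Assumed format "<count>/<duration>", e.g. "1/10mins"; empty = no limit.
    removal_rate_limit: str = ""

    def removal_window_secs(self) -> int:
        """Seconds of consecutive missed pings before the agent is removed."""
        return self.ping_timeout_secs * self.max_ping_timeouts

    def master_flags(self) -> List[str]:
        """Render the settings as master command-line flags (assumed names)."""
        flags = [
            f"--slave_ping_timeout={self.ping_timeout_secs}secs",
            f"--max_slave_ping_timeouts={self.max_ping_timeouts}",
        ]
        if self.removal_rate_limit:
            flags.append(f"--slave_removal_rate_limit={self.removal_rate_limit}")
        return flags


default = HealthCheckSettings()
relaxed = HealthCheckSettings(
    ping_timeout_secs=30,
    max_ping_timeouts=10,
    removal_rate_limit="1/10mins",
)

print("default window:", default.removal_window_secs(), "seconds")
print("relaxed window:", relaxed.removal_window_secs(), "seconds")
print(" ".join(relaxed.master_flags()))
```

A longer window trades slower detection of agents that are really dead for
tolerance of short cross-datacenter network blips, so pick values based on
how long your outages usually last.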
