Posted to issues@mesos.apache.org by "Sergei Hanus (Jira)" <ji...@apache.org> on 2020/11/04 12:51:00 UTC

[jira] [Created] (MESOS-10197) One of processes gets incorrect status after stopping and starting mesos-master and mesos-agent simultaneously

Sergei Hanus created MESOS-10197:
------------------------------------

             Summary: One of processes gets incorrect status after stopping and starting mesos-master and mesos-agent simultaneously
                 Key: MESOS-10197
                 URL: https://issues.apache.org/jira/browse/MESOS-10197
             Project: Mesos
          Issue Type: Bug
            Reporter: Sergei Hanus


We are using Mesos 1.8.0 together with Marathon 1.7.50.

We run several child services under Marathon. When we stop and start all services (including mesos-master and mesos-agent), or simply reboot the server, everything usually returns to a working state.

However, we sometimes observe that one of the child services is reported as healthy although no such process exists on the server. When we restart mesos-agent once more, this child service appears as a process and actually starts working.

 

At the same time, we observe the following messages in the agent log:

 
{code:java}
I1103 01:48:08.291822  6542 slave.cpp:5491] Killing un-reregistered executor 'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' of framework a99f25dd-d176-4ffd-9351-e70a357c1872-0000 at executor(1)@10.100.5.141:36452
I1103 01:48:08.291896  6542 slave.cpp:7848] Finished recovery
{code}
What could be the reason for this behavior, and how can we avoid it? If this service's state is stuck somewhere in the agent's internal structures (a metadata file on disk, or something like that), how could we clean up this state?
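
To see how often this happens across restarts, the agent log can be scanned for the message quoted above. The following is only a rough sketch: the regular expression is modeled on the single log line shown in this report, so other Mesos versions may word it differently, and the function name `find_killed_executors` is made up for illustration.

```python
import re

# Pattern modeled on the one "Killing un-reregistered executor" line quoted
# in this report. Assumption: other agent versions may format it differently.
KILL_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
    r"slave\.cpp:\d+\] Killing un-reregistered executor '(?P<executor>[^']+)' "
    r"of framework (?P<framework>\S+)"
)


def find_killed_executors(lines):
    """Return (date, time, executor, framework) tuples for matching lines."""
    hits = []
    for line in lines:
        m = KILL_RE.match(line)
        if m:
            hits.append((m["date"], m["time"], m["executor"], m["framework"]))
    return hits


# The two log lines from this report, used as sample input.
sample = [
    "I1103 01:48:08.291822  6542 slave.cpp:5491] Killing un-reregistered "
    "executor 'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' of "
    "framework a99f25dd-d176-4ffd-9351-e70a357c1872-0000 at "
    "executor(1)@10.100.5.141:36452",
    "I1103 01:48:08.291896  6542 slave.cpp:7848] Finished recovery",
]

for hit in find_killed_executors(sample):
    print(hit)
```

To run it against a real log, pass an open file object instead of `sample`; the path to the agent log depends on the installation and is not given here.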

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)