Posted to issues@mesos.apache.org by "Sergei Hanus (Jira)" <ji...@apache.org> on 2020/11/11 06:28:00 UTC

[jira] [Commented] (MESOS-10197) One of processes gets incorrect status after stopping and starting mesos-master and mesos-agent simultaneously

    [ https://issues.apache.org/jira/browse/MESOS-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229772#comment-17229772 ] 

Sergei Hanus commented on MESOS-10197:
--------------------------------------

I also tried to check this with the latest 1.10 release - still the same behavior. The agent does not report the failed service state, even after I restart the agent. Only cleaning up the corresponding executor in the meta folder restores the service to a functional state.
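
A minimal sketch of that cleanup, assuming the default agent work-directory layout (<work_dir>/meta/slaves/<slave_id>/frameworks/<framework_id>/executors/<executor_id>); the work_dir path and the IDs below are illustrative, taken from the log excerpts quoted further down, and mesos-agent should be stopped before the checkpoint data is touched:

{code:python}
#!/usr/bin/env python3
# Hypothetical cleanup sketch: remove the checkpointed state of one stale
# executor so the agent no longer tries to recover it on restart.
# Adjust WORK_DIR and the IDs to your deployment; stop mesos-agent first.
import os
import shutil

WORK_DIR = "/var/lib/mesos"  # agent --work_dir (assumption)
SLAVE_ID = "a99f25dd-d176-4ffd-9351-e70a357c1872-S1"
FRAMEWORK_ID = "a99f25dd-d176-4ffd-9351-e70a357c1872-0000"
EXECUTOR_ID = "ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b"

executor_meta = os.path.join(
    WORK_DIR, "meta", "slaves", SLAVE_ID,
    "frameworks", FRAMEWORK_ID, "executors", EXECUTOR_ID)

if os.path.isdir(executor_meta):
    shutil.rmtree(executor_meta)  # drop the stale executor checkpoint
    print("Removed stale executor metadata:", executor_meta)
else:
    print("Nothing to clean at:", executor_meta)
{code}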

 

> One of processes gets incorrect status after stopping and starting mesos-master and mesos-agent simultaneously
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-10197
>                 URL: https://issues.apache.org/jira/browse/MESOS-10197
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Sergei Hanus
>            Priority: Major
>
> We are using Mesos 1.8.0 together with Marathon 1.7.50.
> We run several child services under Marathon. When we stop and start all services (including mesos-master and mesos-agent), or simply reboot the server, everything usually returns to a functional state.
> But sometimes we observe that one of the child services is reported as healthy, even though there is no such process on the server. When we restart mesos-agent once more, this child service appears as a process and actually starts working.
>  
> At the same time, we observe the following messages in the executor's stderr:
>  
> {code:java}
> I1103 01:46:18.725567 5922 exec.cpp:518] Agent exited, but framework has checkpointing enabled. Waiting 20secs to reconnect with agent a99f25dd-d176-4ffd-9351-e70a357c1872-S1
> I1103 01:46:25.777845 5921 checker_process.cpp:986] COMMAND health check for task 'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' returned: 0
> I1103 01:46:38.732237 5925 exec.cpp:499] Recovery timeout of 20secs exceeded; Shutting down
> I1103 01:46:38.732295 5925 exec.cpp:445] Executor asked to shutdown
> I1103 01:46:38.732455 5925 executor.cpp:190] Received SHUTDOWN event
> I1103 01:46:38.732482 5925 executor.cpp:829] Shutting down
> I1103 01:46:38.733742 5925 executor.cpp:942] Sending SIGTERM to process tree at pid 5927
> W1103 01:46:38.741338 5926 process.cpp:1890] Failed to send 'mesos.internal.StatusUpdateMessage' to '10.100.5.141:5051', connect: Failed to connect to 10.100.5.141:5051: Connection refused
> I1103 01:46:38.771276 5925 executor.cpp:955] Sent SIGTERM to the following process trees:{code}
>  
> And in mesos-slave.log:
>  
> {code:java}
> I1103 01:48:08.291822  6542 slave.cpp:5491] Killing un-reregistered executor 'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' of framework a99f25dd-d176-4ffd-9351-e70a357c1872-0000 at executor(1)@10.100.5.141:36452
> I1103 01:48:08.291896  6542 slave.cpp:7848] Finished recovery
> {code}
>  
> Also, in mesos-slave.log I see a number of messages for services that were successfully restarted, like this:
>  
> {code:java}
> I1103 01:48:06.278928  6542 containerizer.cpp:854] Recovering container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 for executor 'ia-cloud_worker-management-service.735fd7ad-1d67-11eb-ad1d-12962e9c065b' of framework a99f25dd-d176-4ffd-9351-e70a357c1872-0000
> I1103 01:48:06.285398  6541 containerizer.cpp:3117] Container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 has exited
> I1103 01:48:06.285406  6541 containerizer.cpp:2576] Destroying container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 in RUNNING state
> I1103 01:48:06.285413  6541 containerizer.cpp:3278] Transitioning the state of container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 from RUNNING to DESTROYING
> I1103 01:48:06.311017  6541 launcher.cpp:161] Asked to destroy container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9
> W1103 01:48:06.398953  6540 containerizer.cpp:2375] Ignoring update for unknown container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9
> {code}
>  
> But for the problematic service the messages are different (there is no event indicating that the container has exited):
>  
>  
> {code:java}
> I1103 01:48:06.278280 6542 containerizer.cpp:854] Recovering container 570461ae-ac26-445e-8cbf-42e366d5ee91 for executor 'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' of framework a99f25dd-d176-4ffd-9351-e70a357c1872-0000
> I1103 01:48:08.292357 6542 containerizer.cpp:2576] Destroying container 570461ae-ac26-445e-8cbf-42e366d5ee91 in RUNNING state
> I1103 01:48:08.292371 6542 containerizer.cpp:3278] Transitioning the state of container 570461ae-ac26-445e-8cbf-42e366d5ee91 from RUNNING to DESTROYING
> I1103 01:48:08.292414 6542 launcher.cpp:161] Asked to destroy container 570461ae-ac26-445e-8cbf-42e366d5ee91
> {code}
>  
>  
>  
> What could be the reason for such behavior, and how can we avoid it? If this service's state is stuck somewhere in the agent's internal structures (a metadata file on disk, or something like that), how could we clean up this state? (See the diagnostic sketch below.)
>  
>  
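
A rough diagnostic sketch for the question above, assuming the agent's /state HTTP endpoint (port 5051, as in the quoted logs) is reachable: it lists the executors and task states the agent still reports, so they can be compared by hand against the processes actually running on the host. The JSON field names used here ("frameworks", "executors", "tasks") are the expected ones for the agent state output; verify them against your Mesos version.

{code:python}
#!/usr/bin/env python3
# Diagnostic sketch: print every executor and task the agent currently
# reports via /state. A stale executor with no matching process on the
# host would still show up here until its checkpointed state is cleaned up.
import json
from urllib.request import urlopen

AGENT = "http://10.100.5.141:5051"  # agent address taken from the quoted logs

with urlopen(AGENT + "/state") as response:
    state = json.load(response)

for framework in state.get("frameworks", []):
    print("framework:", framework.get("id"))
    for executor in framework.get("executors", []):
        print("  executor:", executor.get("id"))
        for task in executor.get("tasks", []):
            print("    task:", task.get("id"), "state:", task.get("state"))
{code}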



--
This message was sent by Atlassian Jira
(v8.3.4#803005)