You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Elizabeth Lingg (JIRA)" <ji...@apache.org> on 2015/04/29 19:31:06 UTC

[jira] [Resolved] (MESOS-2605) The slave sometimes does not send active executors during reregistration

     [ https://issues.apache.org/jira/browse/MESOS-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elizabeth Lingg resolved MESOS-2605.
------------------------------------
    Resolution: Fixed

Tested this with 0.22.1 RC6, and it seems to be fixed.

> The slave sometimes does not send active executors during reregistration
> ------------------------------------------------------------------------
>
>                 Key: MESOS-2605
>                 URL: https://issues.apache.org/jira/browse/MESOS-2605
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.22.0
>            Reporter: Elizabeth Lingg
>              Labels: mesosphere
>
> The slave sometimes does not send active executors during reregistration. Framework checkpointing is enabled, and the executor successfully reregisters. However, the tasks in that executor are LOST (by abnormal executor termination) because the executor is removed by the mesos master as unknown. See the example below, task.journalnode.journalnode.NodeExecutor.1428609184051.
> See the Slave Logs here for the Task:
> {code}
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.778790 25126 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.779013 25126 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781788 25123 slave.cpp:2753] Forwarding the update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 to master@10.142.250.253:5050
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781889 25123 slave.cpp:2686] Sending acknowledgement for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 to executor(1)@10.168.119.78:47638
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784503 25124 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784567 25124 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
> {code}
> Master Logs:
> {code}
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: W0409 20:19:43.008666  1067 master.cpp:4015] Executor executor.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 possibly unknown to the slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008652  1074 hierarchical.hpp:648] Recovered cpus(*):0.1; mem(*):1536 (total allocatable: cpus(*):3.5; mem(*):21113; disk(*):142210; ports(*):[3889-5044, 5046-5049, 2182-2958, 2960-3887, 1025-2180, 8082-9041, 9043-9159, 9161-9999, 5052-6999, 7002-7198, 7200-8079, 10001-65535]) on slave 20150407-233647-2059219722-5050-1659-S5 from framework 20150408-002100-4261056010-5050-1047-0008
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008712  1067 master.cpp:4714] Removing executor 'executor.journalnode.NodeExecutor.1428609184051' with resources cpus(*):0.1; mem(*):1536 of framework 20150408-002100-4261056010-5050-1047-0008 on slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.010372  1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013746  1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013767  1067 master.cpp:3336] Forwarding status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)