Posted to user@mesos.apache.org by Mike Barborak <Mi...@MindLakes.com> on 2015/09/14 15:38:39 UTC

how to overcome orphaned tasks after master failure

Hi,

I'm trying to understand a chain of events that occurred this weekend to see if I'm doing something wrong in my custom framework.

The cluster in question had a single Mesos master and < 10 slaves. At some point the master got a fatal error and apparently respawned:

Log file created at: 2015/09/12 20:42:18
Running on machine: ip-172-31-11-45
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0912 20:42:18.452498  2832 sorter.cpp:213] Check failed: total.resources.contains(slaveId)

More from syslog:

Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]:     @     0x7ff0c08ed8dd  google::LogMessage::Fail()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]:     @     0x7ff0c08ef71d  google::LogMessage::SendToLog()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]:     @     0x7ff0c08ed4cc  google::LogMessage::Flush()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]:     @     0x7ff0c08f0019  google::LogMessageFatal::~LogMessageFatal()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]:     @     0x7ff0c03d6c40  mesos::internal::master::allocator::DRFSorter::remove()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]:     @     0x7ff0c02ba2ec  mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]:     @     0x7ff0c089f2c9  process::ProcessManager::resume()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]:     @     0x7ff0c089f5bf  process::schedule()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]:     @     0x7ff0bf02c182  start_thread
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]:     @     0x7ff0bed5947d  (unknown)
Sep 12 20:42:42 ip-172-31-11-45 kernel: [9183384.713130] init: mesos-master main process (2811) killed by ABRT signal
Sep 12 20:42:42 ip-172-31-11-45 kernel: [9183384.713146] init: mesos-master main process ended, respawning

The new master process was elected as the new leader, and then my problems began. All of the tasks on the cluster were logged as "possibly orphaned" and, while they weren't stopped, they seemed to get disconnected from the Mesos master. (I'm not sure exactly what I mean by disconnected. The symptoms are that the tasks' entries in Mesos DNS were lost and I no longer saw them listed on the landing page of the Mesos master UI. The processes continued to run on the slaves, though.) Here is the log of the master restarting and noting the possibly orphaned tasks:

Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451133 10692 master.hpp:159] Adding task 0 with resources cpus(service):4; mem(service):50000 on slave 20150908-084257-755703724-5050-2811-S0 (ip-172-31-19-210.ec2.internal)
...
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: W0912 20:42:42.451499 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0006 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451741 10692 master.cpp:3597] Re-registered slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal) with cpus(service):16; mem(service):63395; disk(service):35056; ports(service):[31000-32000]
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451797 10692 master.cpp:3625] Sending updated checkpointed resources  to slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451910 10692 hierarchical.hpp:528] Added slave 20150908-084257-755703724-5050-2811-S0 (ip-172-31-19-210.ec2.internal) with cpus(service):16; mem(service):63395; disk(service):35056; ports(service):[31000-32000] (allocated: cpus(service):6.25; mem(service):52432)

Here is the log on the slave referenced in those entries, just before and after the master respawn.

Sep 12 20:42:18 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:18.450093 54718 slave.cpp:1946] Asked to shut down framework 20150908-084257-755703724-5050-2811-0041 by master@172.31.11.45:5050
Sep 12 20:42:18 ip-172-31-19-210 mesos-slave[54699]: W0912 20:42:18.450150 54718 slave.cpp:1961] Cannot shut down unknown framework 20150908-084257-755703724-5050-2811-0041
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003684 54713 detector.cpp:138] Detected a new leader: None
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003851 54720 slave.cpp:677] Lost leading master
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003882 54720 slave.cpp:720] Detecting new master
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003856 54715 status_update_manager.cpp:176] Pausing sending status updates
Sep 12 20:42:30 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:30.451076 54720 slave.cpp:4179] Querying resource estimator for oversubscribable resources

I'm not sure if this is interesting, but there was some logging from the framework scheduler that showed "Shutdown failed" messages I hadn't seen before:

I0908 10:15:37.936308   856 exec.cpp:132] Version: 0.23.0
I0908 10:15:37.938036   858 exec.cpp:206] Executor registered on slave 20150825-081928-755703724-5050-18248-S0
I0908 10:15:38.398712   873 sched.cpp:157] Version: 0.23.0
I0908 10:15:38.400921   891 sched.cpp:254] New master detected at master@172.31.11.45:5050
I0908 10:15:38.401202   891 sched.cpp:264] No credentials provided. Attempting to register without authentication
I0908 10:15:38.402683   891 sched.cpp:448] Framework registered with 20150908-084257-755703724-5050-2811-0000
E0909 18:22:57.000236   865 socket.hpp:107] Shutdown failed on fd=10: Transport endpoint is not connected [107]
E0909 18:23:26.991600   898 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]
E0909 20:15:44.286170   898 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]

Is there something I need to do such that tasks don't get orphaned and disconnected from their schedulers after failures? Is there something I can write in my custom framework that would let me reconnect the scheduler to its tasks?

Thanks,
Mike


RE: how to overcome orphaned tasks after master failure

Posted by Mike Barborak <Mi...@MindLakes.com>.
Hi,

This is what I see regarding the framework in the logs on the master:

mesos-master.INFO:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
mesos-master.ip-172-31-11-45.invalid-user.log.INFO.20150912-204242.10672:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
mesos-master.ip-172-31-11-45.invalid-user.log.WARNING.20150912-204242.10672:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
mesos-master.WARNING:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)

The framework itself logged this at startup:

Starting framework with previous ID value: "20150908-084257-755703724-5050-2811-0000"
Registered! ID = 20150908-084257-755703724-5050-2811-0000
No status messages in 5s. Assuming reconciliation complete.
task 0 null
Received offer 20150908-084257-755703724-5050-2811-O525 with cpus: 16.0 and mem: 63395.0
Launching task 0 using offer 20150908-084257-755703724-5050-2811-O525
Status update: task 0 is in state TASK_RUNNING

and then nothing during the reregistration.

On the slave 20150908-084257-755703724-5050-2811-S0, what I see at this time is this:

mesos-slave.INFO:I0912 20:42:15.450817 54714 slave.cpp:4179] Querying resource estimator for oversubscribable resources
mesos-slave.INFO:I0912 20:42:15.450942 54723 slave.cpp:4193] Received oversubscribable resources  from the resource estimator
mesos-slave.INFO:I0912 20:42:18.450093 54718 slave.cpp:1946] Asked to shut down framework 20150908-084257-755703724-5050-2811-0041 by master@172.31.11.45:5050
mesos-slave.INFO:W0912 20:42:18.450150 54718 slave.cpp:1961] Cannot shut down unknown framework 20150908-084257-755703724-5050-2811-0041
mesos-slave.INFO:I0912 20:42:28.003684 54713 detector.cpp:138] Detected a new leader: None
mesos-slave.INFO:I0912 20:42:28.003851 54720 slave.cpp:677] Lost leading master
mesos-slave.INFO:I0912 20:42:28.003882 54720 slave.cpp:720] Detecting new master
mesos-slave.INFO:I0912 20:42:28.003856 54715 status_update_manager.cpp:176] Pausing sending status updates
mesos-slave.INFO:I0912 20:42:30.451076 54720 slave.cpp:4179] Querying resource estimator for oversubscribable resources
mesos-slave.INFO:I0912 20:42:30.451254 54728 slave.cpp:4193] Received oversubscribable resources  from the resource estimator
mesos-slave.INFO:I0912 20:42:42.157058 54722 slave.cpp:3087] master@172.31.11.45:5050 exited
mesos-slave.INFO:W0912 20:42:42.157120 54722 slave.cpp:3090] Master disconnected! Waiting for a new master to be elected
mesos-slave.INFO:I0912 20:42:42.301710 54720 detector.cpp:138] Detected a new leader: (id='29')
mesos-slave.INFO:I0912 20:42:42.301834 54724 group.cpp:656] Trying to get '/mesos/info_0000000029' in ZooKeeper
mesos-slave.INFO:W0912 20:42:42.303886 54724 detector.cpp:444] Leading master master@172.31.11.45:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
mesos-slave.INFO:I0912 20:42:42.303947 54724 detector.cpp:481] A new leading master (UPID=master@172.31.11.45:5050) is detected
mesos-slave.INFO:I0912 20:42:42.304061 54724 slave.cpp:684] New master detected at master@172.31.11.45:5050
mesos-slave.INFO:I0912 20:42:42.304075 54716 status_update_manager.cpp:176] Pausing sending status updates
mesos-slave.INFO:I0912 20:42:42.304178 54724 slave.cpp:709] No credentials provided. Attempting to register without authentication
mesos-slave.INFO:I0912 20:42:42.304221 54724 slave.cpp:720] Detecting new master
mesos-slave.INFO:I0912 20:42:42.453955 54723 slave.cpp:959] Re-registered with master master@172.31.11.45:5050
mesos-slave.INFO:I0912 20:42:42.454052 54723 slave.cpp:995] Forwarding total oversubscribed resources
mesos-slave.INFO:I0912 20:42:42.454056 54720 status_update_manager.cpp:183] Resuming sending status updates
mesos-slave.INFO:I0912 20:42:42.454222 54723 slave.cpp:2202] Updated checkpointed resources from  to
mesos-slave.INFO:W0912 20:42:43.504453 54725 slave.cpp:2105] Ignoring updating pid for framework 20150501-182221-755703724-5050-17276-0000 because it does not exist
mesos-slave.INFO:I0912 20:42:45.451882 54722 slave.cpp:4179] Querying resource estimator for oversubscribable resources
mesos-slave.INFO:I0912 20:42:45.452005 54720 slave.cpp:4193] Received oversubscribable resources  from the resource estimator

The line about “Ignoring updating pid” is confusing to me. The framework 20150501-182221-755703724-5050-17276-0000 mentioned in that log entry is Marathon. The way things work on my cluster, framework 20150908-084257-755703724-5050-2811-0000 is started by Marathon as a command. In this case, Marathon created a task called service-clara-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 to launch the framework, on a different slave than the one whose log is shown above. These are the entries on the master about this task being a possibly orphaned task of Marathon.

mesos-master.INFO:I0912 20:42:43.249708 10689 master.hpp:159] Adding task service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 with resources cpus(marathon):0.01; mem(marathon):200; ports(marathon):[31238-31238] on slave 20150825-081928-755703724-5050-18248-S0 (dev1.ml.com)
mesos-master.INFO:W0912 20:42:43.250097 10689 master.cpp:4926] Possibly orphaned task service-clara-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of framework 20150501-182221-755703724-5050-17276-0000 running on slave 20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 (dev1.ml.com)
mesos-master.ip-172-31-11-45.invalid-user.log.INFO.20150912-204242.10672:I0912 20:42:43.249708 10689 master.hpp:159] Adding task service-clara-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 with resources cpus(marathon):0.01; mem(marathon):200; ports(marathon):[31238-31238] on slave 20150825-081928-755703724-5050-18248-S0 (dev1.ml.com)
mesos-master.ip-172-31-11-45.invalid-user.log.INFO.20150912-204242.10672:W0912 20:42:43.250097 10689 master.cpp:4926] Possibly orphaned task service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of framework 20150501-182221-755703724-5050-17276-0000 running on slave 20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 (dev1.ml.com)
mesos-master.ip-172-31-11-45.invalid-user.log.WARNING.20150912-204242.10672:W0912 20:42:43.250097 10689 master.cpp:4926] Possibly orphaned task service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of framework 20150501-182221-755703724-5050-17276-0000 running on slave 20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 (dev1.ml.com)
mesos-master.WARNING:W0912 20:42:43.250097 10689 master.cpp:4926] Possibly orphaned task service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of framework 20150501-182221-755703724-5050-17276-0000 running on slave 20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 (dev1.ml.com)

Perhaps my problem has to do with launching the framework via a command through Marathon? I will try to reproduce this tomorrow with and without Marathon.

Thanks for looking this over.

Best,
Mike


From: Vinod Kone [mailto:vinodkone@gmail.com]
Sent: Monday, September 14, 2015 4:07 PM
To: user@mesos.apache.org
Subject: Re: how to overcome orphaned tasks after master failure


On Mon, Sep 14, 2015 at 12:40 PM, Mike Barborak <Mi...@mindlakes.com> wrote:
Sorry for my ignorance, but what is the “scheduler driver?” My framework is based on the Java example:

Some details about the driver should be here: http://mesos.apache.org/documentation/latest/app-framework-development-guide/

Looks like you are using the driver. Can you paste the logs of the scheduler and master (related to the framework) after master failover?

Re: how to overcome orphaned tasks after master failure

Posted by Vinod Kone <vi...@gmail.com>.
On Mon, Sep 14, 2015 at 12:40 PM, Mike Barborak <Mi...@mindlakes.com> wrote:

> Sorry for my ignorance, but what is the “scheduler driver?” My framework
> is based on the Java example:
>
>
Some details about the driver should be here:
http://mesos.apache.org/documentation/latest/app-framework-development-guide/

Looks like you are using the driver. Can you paste the logs of the
scheduler and master (related to the framework) after master failover?

RE: how to overcome orphaned tasks after master failure

Posted by Mike Barborak <Mi...@MindLakes.com>.
Hi,

Sorry for my ignorance, but what is the “scheduler driver?” My framework is based on the Java example:

https://github.com/apache/mesos/blob/master/src/examples/java/TestFramework.java

Just guessing, but is there something I should be doing in the reregistered method? My understanding was that this was more of a status callback than something needing functionality.

public void reregistered(SchedulerDriver driver, MasterInfo masterInfo) {
    // currently a no-op
}
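
For comparison, here is a minimal sketch (purely illustrative, not something I'm running) of what explicit reconciliation from this callback could look like with the stock org.apache.mesos Java bindings. Passing an empty list to reconcileTasks asks the master to send the latest status update for every task it knows about for this framework.

// Fragment of a Scheduler implementation.
// Imports assumed: java.util.Collections, org.apache.mesos.SchedulerDriver,
// org.apache.mesos.Protos.MasterInfo, org.apache.mesos.Protos.TaskStatus.
public void reregistered(SchedulerDriver driver, MasterInfo masterInfo) {
    // Implicit reconciliation: an empty status list means "tell me about
    // every task the master knows for this framework."
    driver.reconcileTasks(Collections.<TaskStatus>emptyList());
}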

Thanks,
Mike




From: Vinod Kone [mailto:vinodkone@gmail.com]
Sent: Monday, September 14, 2015 1:16 PM
To: user@mesos.apache.org
Subject: Re: how to overcome orphaned tasks after master failure

The framework is expected to re-register with a failed over master with the same framework id as before. If you are using the scheduler driver, that should happen automatically.



Re: how to overcome orphaned tasks after master failure

Posted by Vinod Kone <vi...@gmail.com>.
The framework is expected to re-register with a failed over master with the
same framework id as before. If you are using the scheduler driver, that
should happen automatically.
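
As a rough sketch of the "same framework id" part (illustrative only; the class name, file location, framework name, and failover timeout below are placeholder choices): the driver keeps the id across a master failover on its own, while persisting the id and setting it on FrameworkInfo matters when the scheduler process itself restarts. A non-zero failover_timeout tells the master how long to keep the framework's tasks while the scheduler is away.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.mesos.MesosSchedulerDriver;
import org.apache.mesos.Protos.FrameworkID;
import org.apache.mesos.Protos.FrameworkInfo;
import org.apache.mesos.Scheduler;

public final class DriverLauncher {
    // Hypothetical location for the saved framework id; any durable store works.
    private static final Path ID_FILE = Paths.get("framework.id");

    public static MesosSchedulerDriver launch(Scheduler scheduler, String master)
            throws IOException {
        FrameworkInfo.Builder framework = FrameworkInfo.newBuilder()
            .setUser("")                        // let Mesos fill in the current user
            .setName("my-framework")            // placeholder name
            .setFailoverTimeout(7 * 24 * 3600); // placeholder: keep tasks for a week

        // Reuse the previous framework id, if one was saved, so the master
        // treats this scheduler as the same framework rather than a new one.
        if (Files.exists(ID_FILE)) {
            String previousId =
                new String(Files.readAllBytes(ID_FILE), StandardCharsets.UTF_8).trim();
            framework.setId(FrameworkID.newBuilder().setValue(previousId).build());
        }

        return new MesosSchedulerDriver(scheduler, framework.build(), master);
    }

    // Call this from Scheduler.registered() to persist the assigned id.
    public static void saveFrameworkId(FrameworkID id) throws IOException {
        Files.write(ID_FILE, id.getValue().getBytes(StandardCharsets.UTF_8));
    }
}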
