Posted to issues@mesos.apache.org by "Joseph Wu (JIRA)" <ji...@apache.org> on 2018/02/28 17:19:00 UTC

[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster

    [ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380688#comment-16380688 ] 

Joseph Wu commented on MESOS-8623:
----------------------------------

This CHECK can be hit whenever the master attempts to send a message to a framework that was recovered via agent registration (i.e., an agent reported that the framework has tasks running on it) before the framework itself has reconnected to the master.

The master should be guarding against sending messages to disconnected frameworks, so we'll have to track down which code path is responsible for sending this message.
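
As a rough illustration only ({{forwardToFramework}} is a hypothetical helper; the actual Mesos signatures and member names differ), the kind of guard a caller needs before handing a message to the framework looks something like this:

{code}
// Hypothetical sketch -- not the actual Mesos source. It only illustrates
// the guard that should sit in front of Framework::send().
void forwardToFramework(
    Framework* framework,
    const ExecutorToFrameworkMessage& message)
{
  // A framework the master only learned about from an agent's
  // re-registration has no scheduler PID or HTTP connection yet, so
  // sending to it must be skipped (or deferred) until it re-subscribes.
  if (framework == nullptr || !framework->connected()) {
    LOG(WARNING) << "Dropping executor message for disconnected framework";
    return;
  }

  framework->send(message);
}
{code}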

A cursory {{git blame}} suggests that this crash could have happened from 1.2.0 onwards, but this also depends on which message is being sent:
https://github.com/apache/mesos/commit/0dbceafa3b7caba9e541cd64bce3d5421e3b1262

Note: This commit removed a {{return;}} that would have hidden this bug.
https://github.com/apache/mesos/commit/65efb347301f90e638361a50282cb74980c4c081
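
For context, the send path implied by the stack trace and the {{master.hpp:2372}}/{{master.hpp:2382}} lines in the log below looks roughly like this (a simplified sketch, not the exact 1.4.1 source):

{code}
// Simplified sketch of Framework::send() from master.hpp, reconstructed
// from the warning and the CHECK failure in the log below; member names
// and error handling are approximations.
template <typename Message>
void Framework::send(const Message& message)
{
  if (!connected()) {
    LOG(WARNING) << "Master attempted to send message to disconnected"
                 << " framework " << *this;
    // The commit referenced above removed a 'return;' here, so execution
    // now falls through to the CHECK below.
  }

  if (http.isSome()) {
    // Framework subscribed over the HTTP API.
    http.get().send(message);
  } else {
    // Framework subscribed via a libprocess PID. For a framework recovered
    // only from an agent's re-registration, 'pid' is NONE, so this CHECK
    // aborts the master ("CHECK_SOME(pid): is NONE").
    CHECK_SOME(pid);
    master->send(pid.get(), message);
  }
}
{code}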

> Crashed framework brings down the whole Mesos cluster
> -----------------------------------------------------
>
>                 Key: MESOS-8623
>                 URL: https://issues.apache.org/jira/browse/MESOS-8623
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.1
>         Environment: Debian 8
> Mesos 1.4.1
>            Reporter: Tomas Barton
>            Priority: Critical
>
> It might be hard to replicate, but when it happens, your Mesos cluster is gone.
> The issue was caused by an unresponsive Docker engine on a single agent node. Unfortunately, even after fixing the Docker issues, all Mesos masters repeatedly failed to start. In despair I deleted all {{replicated_log}} data from the masters and from ZooKeeper. Even after that, messages from the agents' {{replicated_log}} were replayed and the master crashed again. The average lifetime of a Mesos master was less than one minute.
> {code}
> mesos-master[3814]: I0228 00:25:55.269835  3828 network.hpp:436] ZooKeeper group memberships changed
> mesos-master[3814]: I0228 00:25:55.269979  3832 group.cpp:700] Trying to get '/mesos/log_replicas/0000002519' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.271117  3832 group.cpp:700] Trying to get '/mesos/log_replicas/0000002520' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.277971  3832 group.cpp:700] Trying to get '/mesos/log_replicas/0000002521' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.279296  3827 network.hpp:484] ZooKeeper group PIDs: { log-replica(1)
> mesos-master[3814]: W0228 00:26:15.261255  3831 master.hpp:2372] Master attempted to send message to disconnected framework 911c4b47-2ba7-4959-b59e-c48d896fe210-0005 (kafka)
> mesos-master[3814]: F0228 00:26:15.261318  3831 master.hpp:2382] CHECK_SOME(pid): is NONE
> mesos-master[3814]: *** Check failure stack trace: ***
> mesos-master[3814]: @     0x7f7187ca073d  google::LogMessage::Fail()
> mesos-master[3814]: @     0x7f7187ca23bd  google::LogMessage::SendToLog()
> mesos-master[3814]: @     0x7f7187ca0302  google::LogMessage::Flush()
> mesos-master[3814]: @     0x7f7187ca2da9  google::LogMessageFatal::~LogMessageFatal()
> mesos-master[3814]: @     0x7f7186d6d769  _CheckFatal::~_CheckFatal()
> mesos-master[3814]: @     0x7f71870465d5  mesos::internal::master::Framework::send<>()
> mesos-master[3814]: @     0x7f7186fcfe8a  mesos::internal::master::Master::executorMessage()
> mesos-master[3814]: @     0x7f718706b1a1  ProtobufProcess<>::handler4<>()
> mesos-master[3814]: @     0x7f7187008e36  std::_Function_handler<>::_M_invoke()
> mesos-master[3814]: @     0x7f71870293d1  ProtobufProcess<>::visit()
> mesos-master[3814]: @     0x7f7186fb7ee4  mesos::internal::master::Master::_visit()
> mesos-master[3814]: @     0x7f7186fd0d5d  mesos::internal::master::Master::visit()
> mesos-master[3814]: @     0x7f7187c02e22  process::ProcessManager::resume()
> mesos-master[3814]: @     0x7f7187c08d46  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vE
> mesos-master[3814]: @     0x7f7185babca0  (unknown)
> mesos-master[3814]: @     0x7f71853c6064  start_thread
> mesos-master[3814]: @     0x7f71850fb62d  (unknown)
> systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT
> systemd[1]: Unit mesos-master.service entered failed state.
> systemd[1]: mesos-master.service holdoff time over, scheduling restart.
> systemd[1]: Stopping Mesos Master...
> systemd[1]: Starting Mesos Master...
> systemd[1]: Started Mesos Master.
> mesos-master[27840]: WARNING: Logging before InitGoogleLogging() is written to STDERR
> mesos-master[27840]: I0228 01:32:38.294122 27829 main.cpp:232] Build: 2017-11-18 02:15:41 by admin
> mesos-master[27840]: I0228 01:32:38.294168 27829 main.cpp:233] Version: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294178 27829 main.cpp:236] Git tag: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294186 27829 main.cpp:240] Git SHA: c844db9ac7c0cef59be87438c6781bfb71adcc42
> mesos-master[27840]: I0228 01:32:38.296067 27829 main.cpp:340] Using 'HierarchicalDRF' allocator
> mesos-master[27840]: I0228 01:32:38.411576 27829 replica.cpp:779] Replica recovered with log positions 13 -> 14 with 0 holes and 0 unlearned
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@730: Client environment:host.name=svc01
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
> mesos-master[27840]: I0228 01:32:38.412711 27841 log.cpp:107] Attempting to join replica to ZooKeeper group
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab4775b700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab4775b700):ZOO_INFO@log_env@730: Client environment:host.name=svc01
> mesos-master[27840]: I0228 01:32:38.412932 27841 recover.cpp:451] Starting replica recovery
> mesos-master[27840]: I0228 01:32:38.413024 27846 recover.cpp:477] Replica is in VOTING status
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab4775b700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
> mesos-master[27840]: 2018-02-28 01:32:38,413:27829(0x7fab4775b700):ZOO_INFO@log_env@738: Client environment:os.arch=3.16.0-5-amd64
> mesos-master[27840]: 2018-02-28 01:32:38,413:27829(0x7fab4775b700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08)
> mesos-master[27840]: 2018-02-28 01:32:38,413:27829(0x7fab47f5c700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
> {code}
> On the agent, Mesos tries many times to kill all instances of the kafka framework:
> {code}
> I0227 23:14:57.993875 26049 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> W0227 23:14:57.993914 26049 slave.cpp:3099] Ignoring kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f because the executor 'ser
> I0227 23:15:02.993985 26048 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> W0227 23:15:02.994027 26048 slave.cpp:3099] Ignoring kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f because the executor 'ser
> I0227 23:15:07.992681 26041 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> W0227 23:15:07.992720 26041 slave.cpp:3099] Ignoring kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f because the executor 'ser
> I0227 23:15:12.992703 26039 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> {code}
> The Docker daemon stopped responding, so tasks were failing:
> {code}
> E0227 23:06:20.440865 26044 slave.cpp:5292] Container '10763986-e133-4483-b34e-cfe10903a46f' for executor 'api_rapi-a.d25fc1e0-1c12-11e8-a38c-fef9d3423
> *** Aborted at 1519772780 (unix time) try "date -d @1519772780" if you are using GNU date ***
> PC: @     0x7f1326961067 (unknown)
> PC: @     0x7f1326961067 (unknown)
> *** SIGABRT (@0x47c0) received by PID 18368 (TID 0x7f131c27b700) from PID 18368; stack trace: ***
> @     0x7f1326ce6890 (unknown)
> @     0x7f1326961067 (unknown)
> @     0x7f1326962448 (unknown)
> @     0x560218e2f740 (unknown)
> @     0x560218e2f77c (unknown)
> @     0x7f1329261ff9 (unknown)
> @     0x7f1329261c81 (unknown)
> @     0x7f1329261c41 (unknown)
> @     0x7f1329263ecf (unknown)
> @     0x7f1329261561 (unknown)
> @     0x7f1328422509 (unknown)
> @     0x7f13284253ff (unknown)
> @     0x7f1328433134 (unknown)
> @     0x7f132844ab26 (unknown)
> @     0x7f1328398a47 (unknown)
> @     0x7f132839a9ed (unknown)
> @     0x7f1329260bcc (unknown)
> @     0x7f1328398a47 (unknown)
> @     0x7f132839a9ed (unknown)
> @     0x7f13289ec817 (unknown)
> @     0x7f13289ec91f (unknown)
> @     0x7f1329259818 (unknown)
> @     0x7f1329259ac3 (unknown)
> @     0x7f1329218e29 (unknown)
> @     0x7f132921ed16 (unknown)
> @     0x7f13271c3ca0 (unknown)
> @     0x7f1326cdf064 start_thread
> @     0x7f1326a1462d (unknown)
> '
> I0227 23:06:20.441429 26043 slave.cpp:5405] Executor 'api_rapi-a.d25fc1e0-1c12-11e8-a38c-fef9d3423c7f' of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d13
> I0227 23:06:20.441520 26043 slave.cpp:4399] Handling status update TASK_FAILED (UUID: 003e072e-341b-48ce-abb5-79849128ba6f) for task api_rapi-a.d25fc1e
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)