Posted to issues@mesos.apache.org by "Joseph Wu (JIRA)" <ji...@apache.org> on 2018/02/28 17:19:00 UTC
[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster
[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380688#comment-16380688 ]
Joseph Wu commented on MESOS-8623:
----------------------------------
This CHECK can be hit whenever the master attempts to send any message to a framework recovered via Agent registration (i.e. Agent reports that the framework is running tasks on it), but before the framework has reconnected to the master.
The master should be guarding against sending messages to disconnected frameworks, so we'll have to track down which code path is responsible for sending this message.
A cursory {{git blame}} suggests that this crash has been possible from 1.2.0 onwards, though this also depends on which message is being sent:
https://github.com/apache/mesos/commit/0dbceafa3b7caba9e541cd64bce3d5421e3b1262
Note: This commit removed a {{return;}} that would have hidden this bug.
https://github.com/apache/mesos/commit/65efb347301f90e638361a50282cb74980c4c081
> Crashed framework brings down the whole Mesos cluster
> -----------------------------------------------------
>
> Key: MESOS-8623
> URL: https://issues.apache.org/jira/browse/MESOS-8623
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.4.1
> Environment: Debian 8
> Mesos 1.4.1
> Reporter: Tomas Barton
> Priority: Critical
>
> It might be hard to replicate, but when you do, your Mesos cluster is gone.
> The issue was caused by an unresponsive Docker engine on a single agent node. Unfortunately, even after fixing the Docker issues, all Mesos masters repeatedly failed to start. In despair I deleted all {{replicated_log}} data from the master and from ZooKeeper. Even after that, messages from the agent's {{replicated_log}} got replayed and the master crashed again. The average lifetime of a Mesos master was less than one minute.
> {code}
> mesos-master[3814]: I0228 00:25:55.269835 3828 network.hpp:436] ZooKeeper group memberships changed
> mesos-master[3814]: I0228 00:25:55.269979 3832 group.cpp:700] Trying to get '/mesos/log_replicas/0000002519' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.271117 3832 group.cpp:700] Trying to get '/mesos/log_replicas/0000002520' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.277971 3832 group.cpp:700] Trying to get '/mesos/log_replicas/0000002521' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.279296 3827 network.hpp:484] ZooKeeper group PIDs: { log-replica(1)
> mesos-master[3814]: W0228 00:26:15.261255 3831 master.hpp:2372] Master attempted to send message to disconnected framework 911c4b47-2ba7-4959-b59e-c48d896fe210-0005 (kafka)
> mesos-master[3814]: F0228 00:26:15.261318 3831 master.hpp:2382] CHECK_SOME(pid): is NONE
> mesos-master[3814]: *** Check failure stack trace: ***
> mesos-master[3814]: @ 0x7f7187ca073d google::LogMessage::Fail()
> mesos-master[3814]: @ 0x7f7187ca23bd google::LogMessage::SendToLog()
> mesos-master[3814]: @ 0x7f7187ca0302 google::LogMessage::Flush()
> mesos-master[3814]: @ 0x7f7187ca2da9 google::LogMessageFatal::~LogMessageFatal()
> mesos-master[3814]: @ 0x7f7186d6d769 _CheckFatal::~_CheckFatal()
> mesos-master[3814]: @ 0x7f71870465d5 mesos::internal::master::Framework::send<>()
> mesos-master[3814]: @ 0x7f7186fcfe8a mesos::internal::master::Master::executorMessage()
> mesos-master[3814]: @ 0x7f718706b1a1 ProtobufProcess<>::handler4<>()
> mesos-master[3814]: @ 0x7f7187008e36 std::_Function_handler<>::_M_invoke()
> mesos-master[3814]: @ 0x7f71870293d1 ProtobufProcess<>::visit()
> mesos-master[3814]: @ 0x7f7186fb7ee4 mesos::internal::master::Master::_visit()
> mesos-master[3814]: @ 0x7f7186fd0d5d mesos::internal::master::Master::visit()
> mesos-master[3814]: @ 0x7f7187c02e22 process::ProcessManager::resume()
> mesos-master[3814]: @ 0x7f7187c08d46 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vE
> mesos-master[3814]: @ 0x7f7185babca0 (unknown)
> mesos-master[3814]: @ 0x7f71853c6064 start_thread
> mesos-master[3814]: @ 0x7f71850fb62d (unknown)
> systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT
> systemd[1]: Unit mesos-master.service entered failed state.
> systemd[1]: mesos-master.service holdoff time over, scheduling restart.
> systemd[1]: Stopping Mesos Master...
> systemd[1]: Starting Mesos Master...
> systemd[1]: Started Mesos Master.
> mesos-master[27840]: WARNING: Logging before InitGoogleLogging() is written to STDERR
> mesos-master[27840]: I0228 01:32:38.294122 27829 main.cpp:232] Build: 2017-11-18 02:15:41 by admin
> mesos-master[27840]: I0228 01:32:38.294168 27829 main.cpp:233] Version: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294178 27829 main.cpp:236] Git tag: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294186 27829 main.cpp:240] Git SHA: c844db9ac7c0cef59be87438c6781bfb71adcc42
> mesos-master[27840]: I0228 01:32:38.296067 27829 main.cpp:340] Using 'HierarchicalDRF' allocator
> mesos-master[27840]: I0228 01:32:38.411576 27829 replica.cpp:779] Replica recovered with log positions 13 -> 14 with 0 holes and 0 unlearned
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@730: Client environment:host.name=svc01
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
> mesos-master[27840]: I0228 01:32:38.412711 27841 log.cpp:107] Attempting to join replica to ZooKeeper group
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab4775b700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab4775b700):ZOO_INFO@log_env@730: Client environment:host.name=svc01
> mesos-master[27840]: I0228 01:32:38.412932 27841 recover.cpp:451] Starting replica recovery
> mesos-master[27840]: I0228 01:32:38.413024 27846 recover.cpp:477] Replica is in VOTING status
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab4775b700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
> mesos-master[27840]: 2018-02-28 01:32:38,413:27829(0x7fab4775b700):ZOO_INFO@log_env@738: Client environment:os.arch=3.16.0-5-amd64
> mesos-master[27840]: 2018-02-28 01:32:38,413:27829(0x7fab4775b700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08)
> mesos-master[27840]: 2018-02-28 01:32:38,413:27829(0x7fab47f5c700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
> {code}
> On the agent, Mesos repeatedly tries to kill all instances of the Kafka framework:
> {code}
> I0227 23:14:57.993875 26049 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> W0227 23:14:57.993914 26049 slave.cpp:3099] Ignoring kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f because the executor 'ser
> I0227 23:15:02.993985 26048 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> W0227 23:15:02.994027 26048 slave.cpp:3099] Ignoring kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f because the executor 'ser
> I0227 23:15:07.992681 26041 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> W0227 23:15:07.992720 26041 slave.cpp:3099] Ignoring kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f because the executor 'ser
> I0227 23:15:12.992703 26039 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> {code}
> The Docker daemon stopped responding, so all tasks were failing:
> {code}
> E0227 23:06:20.440865 26044 slave.cpp:5292] Container '10763986-e133-4483-b34e-cfe10903a46f' for executor 'api_rapi-a.d25fc1e0-1c12-11e8-a38c-fef9d3423
> *** Aborted at 1519772780 (unix time) try "date -d @1519772780" if you are using GNU date ***
> PC: @ 0x7f1326961067 (unknown)
> *** SIGABRT (@0x47c0) received by PID 18368 (TID 0x7f131c27b700) from PID 18368; stack trace: ***
> @ 0x7f1326ce6890 (unknown)
> @ 0x7f1326961067 (unknown)
> @ 0x7f1326962448 (unknown)
> @ 0x560218e2f740 (unknown)
> @ 0x560218e2f77c (unknown)
> @ 0x7f1329261ff9 (unknown)
> @ 0x7f1329261c81 (unknown)
> @ 0x7f1329261c41 (unknown)
> @ 0x7f1329263ecf (unknown)
> @ 0x7f1329261561 (unknown)
> @ 0x7f1328422509 (unknown)
> @ 0x7f13284253ff (unknown)
> @ 0x7f1328433134 (unknown)
> @ 0x7f132844ab26 (unknown)
> @ 0x7f1328398a47 (unknown)
> @ 0x7f132839a9ed (unknown)
> @ 0x7f1329260bcc (unknown)
> @ 0x7f1328398a47 (unknown)
> @ 0x7f132839a9ed (unknown)
> @ 0x7f13289ec817 (unknown)
> @ 0x7f13289ec91f (unknown)
> @ 0x7f1329259818 (unknown)
> @ 0x7f1329259ac3 (unknown)
> @ 0x7f1329218e29 (unknown)
> @ 0x7f132921ed16 (unknown)
> @ 0x7f13271c3ca0 (unknown)
> @ 0x7f1326cdf064 start_thread
> @ 0x7f1326a1462d (unknown)
> '
> I0227 23:06:20.441429 26043 slave.cpp:5405] Executor 'api_rapi-a.d25fc1e0-1c12-11e8-a38c-fef9d3423c7f' of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d13
> I0227 23:06:20.441520 26043 slave.cpp:4399] Handling status update TASK_FAILED (UUID: 003e072e-341b-48ce-abb5-79849128ba6f) for task api_rapi-a.d25fc1e
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)