Posted to issues@mesos.apache.org by "adaibee (JIRA)" <ji...@apache.org> on 2017/09/19 03:17:00 UTC

[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error

    [ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171058#comment-16171058 ] 

adaibee commented on MESOS-7966:
--------------------------------

{code:shell}
# rpm -qa mesos
mesos-1.2.0-1.2.0.x86_64
{code}
Cluster info:
3 mesos-master (mesos_quorum=1)
3 mesos-slave
We ran a loop doing the maintenance job on the three mesos-slaves (a rough sketch of the API calls follows the list):
1. mesos-maintenance-schedule
2. machine-down
3. machine-up
4. mesos-maintenance-schedule-cancel
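For reference, the loop boils down to the sequence of maintenance calls below. This is only a rough sketch, not our actual Ansible code: the master URL, agent hostnames/IPs, and window length are placeholders, and some setups may need a /master/ prefix on the endpoint paths.
{code:python}
# Rough sketch of the maintenance loop (placeholder master URL and agents).
import time
import requests

MASTER = "http://mesos-master:5050"
AGENTS = [{"hostname": "slave-a", "ip": "10.0.0.1"}]

# One unavailability window starting now, expressed in nanoseconds.
window = {
    "machine_ids": AGENTS,
    "unavailability": {
        "start": {"nanoseconds": int(time.time() * 1e9)},
        "duration": {"nanoseconds": int(3600 * 1e9)},
    },
}

# 1. mesos-maintenance-schedule: post the schedule.
requests.post(MASTER + "/maintenance/schedule", json={"windows": [window]})

# 2. machine-down: start maintenance on the agents.
requests.post(MASTER + "/machine/down", json=AGENTS)

# ... do the actual maintenance work here ...

# 3. machine-up: bring the agents back.
requests.post(MASTER + "/machine/up", json=AGENTS)

# 4. mesos-maintenance-schedule-cancel: post an empty schedule.
requests.post(MASTER + "/maintenance/schedule", json={"windows": []})
{code}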
But during the loop, we found that one of the mesos-masters crashed and another mesos-master was elected.
We found the following in mesos.slave.FATAL:
2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome()
And in mesos.slave.INFO:
2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome()
2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: ***
2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail()
2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog()
2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush()
2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal()
2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer()
2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_
2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()()
2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit()
2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit()
2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve()
2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume()
2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown)
2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread
2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone
2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT

This case can be reproduced by running `for i in {1..8}; do python call.py; done` (call.py gist: https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49 ).

Looks like something goes wrong when /maintenance/schedule is called concurrently.
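Roughly, the reproduction amounts to posting to /maintenance/schedule from several callers at once. A hedged approximation (not the actual call.py, whose contents are in the gist above; hostnames, IPs, and the window length are made up) might look like:
{code:python}
# Approximation of the concurrent reproduction; the real script is the
# call.py gist linked above. Hostnames, IPs, and durations are placeholders.
import time
import threading
import requests

MASTER = "http://mesos-master:5050"

def post_schedule(hostname, ip):
    window = {
        "machine_ids": [{"hostname": hostname, "ip": ip}],
        "unavailability": {
            "start": {"nanoseconds": int(time.time() * 1e9)},
            "duration": {"nanoseconds": int(600 * 1e9)},
        },
    }
    # Each caller posts its own full schedule, so concurrent calls race to
    # replace the master's maintenance schedule.
    requests.post(MASTER + "/maintenance/schedule", json={"windows": [window]})

threads = [
    threading.Thread(target=post_schedule, args=("slave-%d" % i, "10.0.0.%d" % i))
    for i in range(1, 4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
{code}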

We hit this case because we wrote a service based on Ansible that manages the Mesos cluster. When we create a task to update slave configs for a certain number of workers, the flow looks roughly like this (illustrative payloads follow the list):

1. Call /maintenance/schedule for 3 machines: a, b, c.
2. When machine a is done, the maintenance window updates to: b, c.
3. When another machine "d" is assigned immediately after a, the window updates to: b, c, d.
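Illustratively (the machine IDs and times below are placeholders, not our real hosts), the schedule we re-post at each step evolves like this:
{code:python}
# Illustrative payloads only: each step re-posts the full schedule to
# /maintenance/schedule, replacing whatever was there before.
A = {"hostname": "a", "ip": "10.0.0.1"}
B = {"hostname": "b", "ip": "10.0.0.2"}
C = {"hostname": "c", "ip": "10.0.0.3"}
D = {"hostname": "d", "ip": "10.0.0.4"}

def window(machines):
    return {
        "machine_ids": machines,
        "unavailability": {
            "start": {"nanoseconds": 0},  # placeholder start time
            "duration": {"nanoseconds": int(3600 * 1e9)},
        },
    }

step1 = {"windows": [window([A, B, C])]}  # 1. schedule a, b, c
step2 = {"windows": [window([B, C])]}     # 2. a is done, keep b, c
step3 = {"windows": [window([B, C, D])]}  # 3. d arrives right after: b, c, d
{code}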

These changes sometimes happen within a very short interval. Then we see the fatal log mentioned in Bayou's mail.

What's the right way to update the maintenance window? Thanks for any reply.

> check for maintenance on agent causes fatal error
> -------------------------------------------------
>
>                 Key: MESOS-7966
>                 URL: https://issues.apache.org/jira/browse/MESOS-7966
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0
>            Reporter: Rob Johnson
>            Priority: Critical
>
> We interact with the maintenance API frequently to orchestrate gracefully draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with the api. This happens relatively frequently, and impacts us when downstream frameworks (marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're happy to provide any other logs you need - please let me know what would be useful for debugging.
> Thanks.


