Posted to user@mesos.apache.org by Geoffroy Jabouley <ge...@gmail.com> on 2015/03/06 13:26:03 UTC

Weird behavior when stopping the mesos master leader of a HA mesos cluster

Hello

we are facing some unexpected issues when testing the high availability
behavior of our mesos cluster.

*Our use case:*

*State*: the mesos cluster is up (3 machines), 1 docker task is running on
each slave (started from marathon)

*Action*: stop the mesos master leader process

*Expected*: mesos master leader has changed, *active tasks remain unchanged*

*Seen*: mesos master leader has changed, *all active tasks are now FAILED
but docker containers are still running*, marathon detects FAILED tasks and
starts new tasks. We end with 2 docker containers running on each machine,
but only one is linked to a RUNNING mesos task.


Is the seen behavior correct?

Have we misunderstood the high availability concept? We thought that doing
this use case would not have any impact on the current cluster state
(except leader re-election)

Thanks in advance for your help
Regards

---------------------------------------------------

our setup is the following:
3 identical mesos nodes with:
    + zookeeper
    + docker 1.5
    + mesos master 0.21.1 configured in HA mode
    + mesos slave 0.21.1 configured with checkpointing, strict and reconnect
    + marathon 0.8.0 configured in HA mode with checkpointing

---------------------------------------------------

Command lines:


*mesos-master*
/usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
--cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
--quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos

*mesos-slave*
/usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,
10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos
--executor_registration_timeout=5mins --hostname=10.195.30.19
--ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
--recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]

*marathon*
java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
-Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
/usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
--local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
--hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,10.195.30.20:2181
,10.195.30.21:2181/mesos
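
For reference, here is roughly how we run the failover test from a shell (a
minimal sketch on our side; "sudo stop mesos-master" matches our upstart
setup, and using /master/state.json to find the leader is an assumption):

# 1. ask any master who the current leader is
curl -s http://10.195.30.19:5050/master/state.json | grep -o '"leader":"[^"]*"'

# 2. on the leader node, stop the master process
sudo stop mesos-master

# 3. check the new leader from one of the surviving masters
curl -s http://10.195.30.20:5050/master/state.json | grep -o '"leader":"[^"]*"'

# 4. compare what is actually running on each node with what the mesos UI shows
docker ps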

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

Posted by Geoffroy Jabouley <ge...@gmail.com>.
Thanks a lot Dario for the workaround! It works fine and can be scripted
with ansible.

For the record, the github issue is available here:
https://github.com/mesosphere/marathon/issues/1292
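
In case it is useful to someone else, the ordering boils down to something
like this (a rough sketch of what our playbook does; the host addresses, the
ssh calls and the upstart job name "marathon" are specific to our setup):

# 1. on one node only (here 10.195.30.19), start a single Marathon
sudo start marathon

# 2. wait until it has registered with Mesos and stored its FrameworkID in
#    ZooKeeper, e.g. by polling the master state until the "marathon"
#    framework shows up
until curl -s http://10.195.30.19:5050/master/state.json | grep -q '"name":"marathon"'; do
  sleep 2
done

# 3. only then start Marathon on the remaining nodes
ssh 10.195.30.20 sudo start marathon
ssh 10.195.30.21 sudo start marathon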

2015-03-12 17:27 GMT+01:00 Dario Rexin <da...@mesosphere.io>:

> Hi Geoffroy,
>
> we identified the issue and will fix it in Marathon 0.8.2. To prevent this
> behaviour for now, you just have to make sure that in a fresh setup
> (Marathon was never connected to Mesos) you first start up a single
> Marathon and let it register with Mesos and then start the other Marathon
> instances. The problem is a race in first registration with Mesos and
> fetching the FrameworkID from Zookeeper. Please let me know if the
> workaround does not help you.
>
> Cheers,
> Dario
>
> On 12 Mar 2015, at 09:20, Alex Rukletsov <al...@mesosphere.io> wrote:
>
> Geoffroy,
>
> yes, it looks like a marathon issue, so feel free to post it there as well.
>
> On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley <
> geoffroy.jabouley@gmail.com> wrote:
>
>> Thanks Alex for your answer. I will have a look.
>>
>> Would it be better to (cross-)post this discussion on the marathon
>> mailing list?
>>
>> Anyway, the issue is "fixed" for 0.8.0, which is the version I'm using.
>>
>> 2015-03-11 22:18 GMT+01:00 Alex Rukletsov <al...@mesosphere.io>:
>>
>>> Geoffroy,
>>>
>>> most probably you're hitting this bug:
>>> https://github.com/mesosphere/marathon/issues/1063. The problem is that
>>> Marathon can register instead of re-registering when a master fails
>>> over. From the master's point of view, it's a new framework, which is why the
>>> previous task is gone and a new one (that technically belongs to a new
>>> framework) is started. You can see that frameworks have two different IDs
>>> (check lines 11:31:40.055496 and 11:31:40.785038) in your example.
>>>
>>> Hope that helps,
>>> Alex
>>>
>>> On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley <
>>> geoffroy.jabouley@gmail.com> wrote:
>>>
>>>> Hello
>>>>
>>>> thanks for your interest. Following are the requested logs, which will
>>>> result in a pretty big mail.
>>>>
>>>> Mesos/Marathon are *NOT running inside docker*; we only use Docker as
>>>> our mesos containerizer.
>>>>
>>>> As a reminder, here is the use case we performed to get the log files:
>>>>
>>>> --------------------------------
>>>>
>>>> Our cluster: 3 identical mesos nodes with:
>>>>     + zookeeper
>>>>     + docker 1.5
>>>>     + mesos master 0.21.1 configured in HA mode
>>>>     + mesos slave 0.21.1 configured with checkpointing, strict and
>>>> reconnect
>>>>     + marathon 0.8.0 configured in HA mode with checkpointing
>>>>
>>>> --------------------------------
>>>>
>>>> *Begin State: *
>>>> + the mesos cluster is up (3 machines)
>>>> + mesos master leader is 10.195.30.19
>>>> + marathon leader is 10.195.30.21
>>>> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
>>>>
>>>> *Action*: stop the mesos master leader process (sudo stop mesos-master)
>>>>
>>>> *Expected*: mesos master leader has changed, active tasks / frameworks
>>>> remain unchanged
>>>>
>>>> *End state: *
>>>> + mesos master leader *has changed, now 10.195.30.21*
>>>> + the previously running APPTASK on slave 10.195.30.21 has "disappeared"
>>>> (no longer shown in the mesos UI), but *the docker container is still
>>>> running*
>>>> + a *new APPTASK is now running on slave 10.195.30.19*
>>>> + marathon framework "registration time" in mesos UI shows "Just now"
>>>> + marathon leader *has changed, now 10.195.30.20*
>>>>
>>>>
>>>> --------------------------------
>>>>
>>>> Here are the 6 requested logs, which might contain interesting/relevant
>>>> information, but as a newcomer to mesos I find them hard to read...
>>>>
>>>>
>>>> *from previous MESOS master leader 10.195.30.19:*
>>>> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal
>>>> SIGTERM from process 1 of user 0; exiting
>>>>
>>>>
>>>> *from new MESOS master leader 10.195.30.21:*
>>>> I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.011823   922 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group
>>>> memberships changed
>>>> I0310 11:31:40.015847   915 group.cpp:659] Trying to get
>>>> '/mesos/log_replicas/0000000000' in ZooKeeper
>>>> I0310 11:31:40.016047   922 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader
>>>> is master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
>>>> I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading
>>>> master!
>>>> I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
>>>> I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
>>>> I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
>>>> I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit
>>>> promise request with proposal 2
>>>> I0310 11:31:40.017503   915 group.cpp:659] Trying to get
>>>> '/mesos/log_replicas/0000000003' in ZooKeeper
>>>> I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8
>>>> bytes) to leveldb took 893672ns
>>>> I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
>>>> I0310 11:31:40.018817   915 network.hpp:466] ZooKeeper group PIDs: {
>>>> log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
>>>> I0310 11:31:40.023022   923 coordinator.cpp:230] Coordinator attemping
>>>> to fill missing position
>>>> I0310 11:31:40.023110   923 log.cpp:672] Writer started with ending
>>>> position 8
>>>> I0310 11:31:40.023293   923 leveldb.cpp:438] Reading position from
>>>> leveldb took 13195ns
>>>> I0310 11:31:40.023309   923 leveldb.cpp:438] Reading position from
>>>> leveldb took 3120ns
>>>> I0310 11:31:40.023619   922 registrar.cpp:346] Successfully fetched the
>>>> registry (610B) in 7.385856ms
>>>> I0310 11:31:40.023679   922 registrar.cpp:445] Applied 1 operations in
>>>> 9263ns; attempting to update the 'registry'
>>>> I0310 11:31:40.024238   922 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.024279   923 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 9
>>>> I0310 11:31:40.024435   923 replica.cpp:508] Replica received write
>>>> request for position 9
>>>> I0310 11:31:40.025707   923 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 1.259338ms
>>>> I0310 11:31:40.025722   923 replica.cpp:676] Persisted action at 9
>>>> I0310 11:31:40.026074   923 replica.cpp:655] Replica received learned
>>>> notice for position 9
>>>> I0310 11:31:40.026495   923 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 404795ns
>>>> I0310 11:31:40.026507   923 replica.cpp:676] Persisted action at 9
>>>> I0310 11:31:40.026511   923 replica.cpp:661] Replica learned APPEND
>>>> action at position 9
>>>> I0310 11:31:40.026726   923 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 3.029248ms
>>>> I0310 11:31:40.026765   923 registrar.cpp:376] Successfully recovered
>>>> registrar
>>>> I0310 11:31:40.026814   923 log.cpp:699] Attempting to truncate the log
>>>> to 9
>>>> I0310 11:31:40.026880   923 master.cpp:1121] Recovered 3 slaves from
>>>> the Registry (608B) ; allowing 1days for slaves to re-register
>>>> I0310 11:31:40.026897   923 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 10
>>>> I0310 11:31:40.026988   923 replica.cpp:508] Replica received write
>>>> request for position 10
>>>> I0310 11:31:40.027640   923 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 641018ns
>>>> I0310 11:31:40.027652   923 replica.cpp:676] Persisted action at 10
>>>> I0310 11:31:40.030848   923 replica.cpp:655] Replica received learned
>>>> notice for position 10
>>>> I0310 11:31:40.031883   923 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 1.008914ms
>>>> I0310 11:31:40.031963   923 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 46724ns
>>>> I0310 11:31:40.031977   923 replica.cpp:676] Persisted action at 10
>>>> I0310 11:31:40.031986   923 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 10
>>>> I0310 11:31:40.055415   918 master.cpp:1383] Received registration
>>>> request for framework 'marathon' at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.055496   918 master.cpp:1447] Registering framework
>>>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.055642   919 hierarchical_allocator_process.hpp:329]
>>>> Added framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.189151   919 master.cpp:3246] Re-registering slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19)
>>>> I0310 11:31:40.189280   919 registrar.cpp:445] Applied 1 operations in
>>>> 15452ns; attempting to update the 'registry'
>>>> I0310 11:31:40.189949   919 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.189978   919 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 11
>>>> I0310 11:31:40.190112   919 replica.cpp:508] Replica received write
>>>> request for position 11
>>>> I0310 11:31:40.190563   919 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 437440ns
>>>> I0310 11:31:40.190577   919 replica.cpp:676] Persisted action at 11
>>>> I0310 11:31:40.191249   921 replica.cpp:655] Replica received learned
>>>> notice for position 11
>>>> I0310 11:31:40.192159   921 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 892767ns
>>>> I0310 11:31:40.192178   921 replica.cpp:676] Persisted action at 11
>>>> I0310 11:31:40.192184   921 replica.cpp:661] Replica learned APPEND
>>>> action at position 11
>>>> I0310 11:31:40.192350   921 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 3.0528ms
>>>> I0310 11:31:40.192387   919 log.cpp:699] Attempting to truncate the log
>>>> to 11
>>>> I0310 11:31:40.192415   919 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 12
>>>> I0310 11:31:40.192539   915 replica.cpp:508] Replica received write
>>>> request for position 12
>>>> I0310 11:31:40.192600   921 master.cpp:3314] Re-registered slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148
>>>> I0310 11:31:40.192680   917 hierarchical_allocator_process.hpp:442]
>>>> Added slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>>>> available)
>>>> I0310 11:31:40.192847   917 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.193164   915 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 610664ns
>>>> I0310 11:31:40.193181   915 replica.cpp:676] Persisted action at 12
>>>> I0310 11:31:40.193568   915 replica.cpp:655] Replica received learned
>>>> notice for position 12
>>>> I0310 11:31:40.193948   915 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 364062ns
>>>> I0310 11:31:40.193979   915 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 12256ns
>>>> I0310 11:31:40.193985   915 replica.cpp:676] Persisted action at 12
>>>> I0310 11:31:40.193990   915 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 12
>>>> I0310 11:31:40.248615   915 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O0 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.248744   915 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.774416   915 master.cpp:3246] Re-registering slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21)
>>>> I0310 11:31:40.774976   915 registrar.cpp:445] Applied 1 operations in
>>>> 42342ns; attempting to update the 'registry'
>>>> I0310 11:31:40.777273   921 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.777436   921 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 13
>>>> I0310 11:31:40.777989   921 replica.cpp:508] Replica received write
>>>> request for position 13
>>>> I0310 11:31:40.779558   921 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 1.513714ms
>>>> I0310 11:31:40.779633   921 replica.cpp:676] Persisted action at 13
>>>> I0310 11:31:40.781821   919 replica.cpp:655] Replica received learned
>>>> notice for position 13
>>>> I0310 11:31:40.784417   919 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 2.542036ms
>>>> I0310 11:31:40.784446   919 replica.cpp:676] Persisted action at 13
>>>> I0310 11:31:40.784452   919 replica.cpp:661] Replica learned APPEND
>>>> action at position 13
>>>> I0310 11:31:40.784711   920 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 9.68192ms
>>>> I0310 11:31:40.784762   917 log.cpp:699] Attempting to truncate the log
>>>> to 13
>>>> I0310 11:31:40.784808   920 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 14
>>>> I0310 11:31:40.784865   917 master.hpp:877] Adding task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with
>>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
>>>> I0310 11:31:40.784955   919 replica.cpp:508] Replica received write
>>>> request for position 14
>>>> W0310 11:31:40.785038   917 master.cpp:4468] Possibly orphaned task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of
>>>> framework 20150310-112310-320783114-5050-24289-0000 running on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21)
>>>> I0310 11:31:40.785105   917 master.cpp:3314] Re-registered slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148
>>>> I0310 11:31:40.785162   920 hierarchical_allocator_process.hpp:442]
>>>> Added slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>>> ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148
>>>> available)
>>>> I0310 11:31:40.785679   921 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.786429   919 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 1.454211ms
>>>> I0310 11:31:40.786455   919 replica.cpp:676] Persisted action at 14
>>>> I0310 11:31:40.786782   919 replica.cpp:655] Replica received learned
>>>> notice for position 14
>>>> I0310 11:31:40.787833   919 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 1.027014ms
>>>> I0310 11:31:40.787873   919 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 14085ns
>>>> I0310 11:31:40.787883   919 replica.cpp:676] Persisted action at 14
>>>> I0310 11:31:40.787889   919 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 14
>>>> I0310 11:31:40.792536   922 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O1 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.792670   922 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.819602   921 master.cpp:3246] Re-registering slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20)
>>>> I0310 11:31:40.819736   921 registrar.cpp:445] Applied 1 operations in
>>>> 16656ns; attempting to update the 'registry'
>>>> I0310 11:31:40.820617   921 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.820726   918 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 15
>>>> I0310 11:31:40.820938   918 replica.cpp:508] Replica received write
>>>> request for position 15
>>>> I0310 11:31:40.821641   918 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 670583ns
>>>> I0310 11:31:40.821663   918 replica.cpp:676] Persisted action at 15
>>>> I0310 11:31:40.822265   917 replica.cpp:655] Replica received learned
>>>> notice for position 15
>>>> I0310 11:31:40.823463   917 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 1.178687ms
>>>> I0310 11:31:40.823490   917 replica.cpp:676] Persisted action at 15
>>>> I0310 11:31:40.823498   917 replica.cpp:661] Replica learned APPEND
>>>> action at position 15
>>>> I0310 11:31:40.823755   917 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 3.97696ms
>>>> I0310 11:31:40.823823   917 log.cpp:699] Attempting to truncate the log
>>>> to 15
>>>> I0310 11:31:40.824147   922 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 16
>>>> I0310 11:31:40.824482   922 hierarchical_allocator_process.hpp:442]
>>>> Added slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>>>> available)
>>>> I0310 11:31:40.824597   921 replica.cpp:508] Replica received write
>>>> request for position 16
>>>> I0310 11:31:40.824128   917 master.cpp:3314] Re-registered slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148
>>>> I0310 11:31:40.824975   917 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.831900   921 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 7.228682ms
>>>> I0310 11:31:40.832031   921 replica.cpp:676] Persisted action at 16
>>>> I0310 11:31:40.832456   917 replica.cpp:655] Replica received learned
>>>> notice for position 16
>>>> I0310 11:31:40.835178   917 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 2.674392ms
>>>> I0310 11:31:40.835297   917 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 45220ns
>>>> I0310 11:31:40.835322   917 replica.cpp:676] Persisted action at 16
>>>> I0310 11:31:40.835341   917 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 16
>>>> I0310 11:31:40.838281   923 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O2 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.838389   923 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.948725   919 http.cpp:344] HTTP request for
>>>> '/master/redirect'
>>>> I0310 11:31:41.479118   918 http.cpp:478] HTTP request for
>>>> '/master/state.json'
>>>> I0310 11:31:45.368074   918 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:45.385144   917 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O3 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:45.385292   917 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:46.368450   917 master.cpp:3843] Sending 2 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:46.375222   920 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O4 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:46.375360   920 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O5 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:46.375530   920 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:46.375599   920 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:48.031230   915 http.cpp:478] HTTP request for
>>>> '/master/state.json'
>>>> I0310 11:31:51.374285   922 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:51.379391   921 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O6 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:51.379487   921 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:51.482094   923 http.cpp:478] HTTP request for
>>>> '/master/state.json'
>>>> I0310 11:31:52.375326   917 master.cpp:3843] Sending 2 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:52.391376   919 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O7 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:52.391512   919 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:52.391659   921 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O8 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:52.391751   921 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:55.062060   918 master.cpp:3611] Performing explicit task
>>>> state reconciliation for 1 tasks of framework
>>>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:55.062588   919 master.cpp:3556] Performing implicit task
>>>> state reconciliation for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:56.140990   923 http.cpp:344] HTTP request for
>>>> '/master/redirect'
>>>> I0310 11:31:57.379288   918 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:57.430888   918 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O9 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:57.431068   918 master.hpp:877] Adding task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with
>>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
>>>> I0310 11:31:57.431089   918 master.cpp:2503] Launching task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with
>>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19)
>>>> I0310 11:31:57.431205   918 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.682133   919 master.cpp:3446] Forwarding status update
>>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.682186   919 master.cpp:3418] Status update TASK_RUNNING
>>>> (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 from slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19)
>>>> I0310 11:31:57.682199   919 master.cpp:4693] Updating the latest state
>>>> of task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799
>>>> of framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
>>>>
>>>>
>>>> *from MESOS slave 10.195.30.21*
>>>> I0310 11:31:28.750200  1074 slave.cpp:2623] master@10.195.30.19:5050
>>>> exited
>>>> W0310 11:31:28.750249  1074 slave.cpp:2626] Master disconnected!
>>>> Waiting for a new master to be elected
>>>> I0310 11:31:40.012516  1075 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.012899  1073 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.017143  1072 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.017408  1072 slave.cpp:602] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.017546  1076 status_update_manager.cpp:171] Pausing
>>>> sending status updates
>>>> I0310 11:31:40.018673  1072 slave.cpp:627] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.018689  1072 slave.cpp:638] Detecting new master
>>>> I0310 11:31:40.785364  1075 slave.cpp:824] Re-registered with master
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.785398  1075 status_update_manager.cpp:178] Resuming
>>>> sending status updates
>>>> I0310 11:32:10.639506  1075 slave.cpp:3321] Current usage 12.27%. Max
>>>> allowed age: 5.441217749539572days
>>>>
>>>>
>>>> *from MESOS slave 10.195.30.19*
>>>> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050
>>>> exited
>>>> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected!
>>>> Waiting for a new master to be elected
>>>> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
>>>> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing
>>>> sending status updates
>>>> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming
>>>> sending status updates
>>>> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 in work directory
>>>> '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>>>> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for
>>>> executor
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework '20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container
>>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and
>>>> executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of
>>>> framework '20150310-112310-354337546-5050-895-0000'
>>>> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to
>>>> '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
>>>> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>>> framework '20150310-112310-354337546-5050-895-0000' in container
>>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>>>> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for
>>>> executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>>>> 10.195.30.19:56574
>>>> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>>> executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update
>>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>>>> 10.195.30.19:56574
>>>> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received
>>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>>>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.678411 24461 status_update_manager.hpp:346]
>>>> Checkpointing UPDATE for status update TASK_RUNNING (UUID:
>>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update
>>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 to
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for
>>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>>>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 to executor(1)@
>>>> 10.195.30.19:56574
>>>> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received
>>>> status update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e)
>>>> for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.689028 24461 status_update_manager.hpp:346]
>>>> Checkpointing ACK for status update TASK_RUNNING (UUID:
>>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to
>>>> 204 at
>>>> /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>>> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated
>>>> 'memory.soft_limit_in_bytes' to 160MB for container
>>>> a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>>> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated
>>>> 'memory.limit_in_bytes' to 160MB at
>>>> /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>>> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max
>>>> allowed age: 5.555425200437824days
>>>>
>>>>
>>>> *From previous Marathon leader 10.195.30.21:*
>>>> I0310 11:31:40.017248  1115 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.017334  1115 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.017727  1115 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> [2015-03-10 11:31:40,017] WARN Disconnected
>>>> (mesosphere.marathon.MarathonScheduler:224)
>>>> [2015-03-10 11:31:40,019] INFO Abdicating
>>>> (mesosphere.marathon.MarathonSchedulerService:312)
>>>> [2015-03-10 11:31:40,019] INFO Defeat leadership
>>>> (mesosphere.marathon.MarathonSchedulerService:285)
>>>> [INFO] [03/10/2015 11:31:40.019]
>>>> [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> [INFO] [03/10/2015 11:31:40.019]
>>>> [marathon-akka.actor.default-dispatcher-5] [
>>>> akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
>>>> [2015-03-10 11:31:40,021] INFO Stopping driver
>>>> (mesosphere.marathon.MarathonSchedulerService:221)
>>>> I0310 11:31:40.022001  1115 sched.cpp:1286] Asked to stop the driver
>>>> [2015-03-10 11:31:40,024] INFO Setting framework ID to
>>>> 20150310-112310-320783114-5050-24289-0000
>>>> (mesosphere.marathon.MarathonSchedulerService:73)
>>>> I0310 11:31:40.026274  1115 sched.cpp:234] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.026418  1115 sched.cpp:242] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.026458  1115 sched.cpp:752] Stopping framework
>>>> '20150310-112310-320783114-5050-24289-0000'
>>>> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing
>>>> optional abdication command.
>>>> (mesosphere.marathon.MarathonSchedulerService:192)
>>>> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface)
>>>> (mesosphere.marathon.MarathonSchedulerService:246)
>>>> [2015-03-10 11:31:40,032] INFO Defeat leadership
>>>> (mesosphere.marathon.MarathonSchedulerService:285)
>>>> [2015-03-10 11:31:40,032] INFO Stopping driver
>>>> (mesosphere.marathon.MarathonSchedulerService:221)
>>>> I0310 11:31:40.032588  1107 sched.cpp:1286] Asked to stop the driver
>>>> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500
>>>> milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334)
>>>> [2015-03-10 11:31:40,033] INFO Setting framework ID to
>>>> 20150310-112310-320783114-5050-24289-0000
>>>> (mesosphere.marathon.MarathonSchedulerService:73)
>>>> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is
>>>> not a candidate for leader, current voting: [member_0000000001,
>>>> member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
>>>> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering
>>>> leadership (mesosphere.marathon.MarathonSchedulerService:341)
>>>> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003
>>>> (com.twitter.common.zookeeper.Group:426)
>>>> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is
>>>> not a candidate for leader, current voting: [member_0000000001,
>>>> member_0000000002, member_0000000003]
>>>> (com.twitter.common.zookeeper.CandidateImpl:144)
>>>> [2015-03-10 11:31:40,568] INFO Candidate
>>>> /marathon/leader/member_0000000003 waiting for the next leader election,
>>>> current voting: [member_0000000001, member_0000000002, member_0000000003]
>>>> (com.twitter.common.zookeeper.CandidateImpl:165)
>>>>
>>>>
>>>> *From new Marathon leader 10.195.30.20:*
>>>> [2015-03-10 11:31:40,029] INFO Candidate
>>>> /marathon/leader/member_0000000001 is now leader of group:
>>>> [member_0000000001, member_0000000002]
>>>> (com.twitter.common.zookeeper.CandidateImpl:152)
>>>> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface)
>>>> (mesosphere.marathon.MarathonSchedulerService:253)
>>>> [2015-03-10 11:31:40,044] INFO Elect leadership
>>>> (mesosphere.marathon.MarathonSchedulerService:299)
>>>> [2015-03-10 11:31:40,044] INFO Running driver
>>>> (mesosphere.marathon.MarathonSchedulerService:184)
>>>> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712:
>>>> Client environment:zookeeper.version=zookeeper C client 3.4.5
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716:
>>>> Client environment:host.name=srv-d2u-9-virtip20
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723:
>>>> Client environment:os.name=Linux
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724:
>>>> Client environment:os.arch=3.13.0-44-generic
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725:
>>>> Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733:
>>>> Client environment:user.name=(null)
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741:
>>>> Client environment:user.home=/root
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753:
>>>> Client environment:user.dir=/
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786:
>>>> Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,
>>>> 10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0
>>>> sessionId=0 sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
>>>> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff
>>>> (mesosphere.marathon.MarathonSchedulerService:329)
>>>> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703:
>>>> initiated connection to server [10.195.30.19:2181]
>>>> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750:
>>>> session establishment complete on server [10.195.30.19:2181],
>>>> sessionId=0x14c0335ad7e000d, negotiated timeout=10000
>>>> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@
>>>> 10.195.30.20:45771) connected to ZooKeeper
>>>> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations:
>>>> queue size (joins, cancels, datas) = (0, 0, 0)
>>>> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path
>>>> '/mesos' in ZooKeeper
>>>> [INFO] [03/10/2015 11:31:40.047]
>>>> [marathon-akka.actor.default-dispatcher-2] [
>>>> akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
>>>> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> [2015-03-10 11:31:40,053] INFO Migration successfully applied for
>>>> version Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
>>>> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with
>>>> 20150310-112310-354337546-5050-895-0000
>>>> [2015-03-10 11:31:40,056] INFO Registered as
>>>> 20150310-112310-354337546-5050-895-0000 to master
>>>> '20150310-112310-354337546-5050-895'
>>>> (mesosphere.marathon.MarathonScheduler:72)
>>>> [2015-03-10 11:31:40,063] INFO Stored framework ID
>>>> '20150310-112310-354337546-5050-895-0000'
>>>> (mesosphere.mesos.util.FrameworkIdUtil:49)
>>>> [INFO] [03/10/2015 11:31:40.065]
>>>> [marathon-akka.actor.default-dispatcher-6] [
>>>> akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
>>>> [INFO] [03/10/2015 11:31:40.067]
>>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> ...
>>>> ...
>>>> ...
>>>> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps
>>>> (mesosphere.marathon.SchedulerActions:403)
>>>> [INFO] [03/10/2015 11:31:55.053]
>>>> [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
>>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>>>> Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
>>>> encountered. This logging can be turned off or adjusted with configuration
>>>> settings 'akka.log-dead-letters' and
>>>> 'akka.log-dead-letters-during-shutdown'.
>>>> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
>>>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>>>> [2015-03-10 11:31:55,064] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
>>>> TASK_LOST (Reconciliation: Task is unknown to the slave)
>>>> (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:31:55,069] INFO Need to scale
>>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>>>> (mesosphere.marathon.SchedulerActions:488)
>>>> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
>>>> /ffaas-backoffice-app-nopersist (0 queued)
>>>> (mesosphere.marathon.SchedulerActions:494)
>>>> [2015-03-10 11:31:55,069] INFO Task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
>>>> expunged and removed from TaskTracker
>>>> (mesosphere.marathon.tasks.TaskTracker:107)
>>>> [2015-03-10 11:31:55,070] INFO Sending event notification.
>>>> (mesosphere.marathon.MarathonScheduler:262)
>>>> [INFO] [03/10/2015 11:31:55.072]
>>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> [2015-03-10 11:31:55,073] INFO Need to scale
>>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>>>> (mesosphere.marathon.SchedulerActions:488)
>>>> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
>>>> /ffaas-backoffice-app-nopersist. Not scaling.
>>>> (mesosphere.marathon.SchedulerActions:498)
>>>> ...
>>>> ...
>>>> ...
>>>> [2015-03-10 11:31:57,682] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>>> TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:31:57,694] INFO Sending event notification.
>>>> (mesosphere.marathon.MarathonScheduler:262)
>>>> [INFO] [03/10/2015 11:31:57.694]
>>>> [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> ...
>>>> ...
>>>> ...
>>>> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
>>>> (mesosphere.marathon.tasks.TaskTracker:170)
>>>> [INFO] [03/10/2015 11:36:55.050]
>>>> [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters]
>>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>>>> Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
>>>> encountered. This logging can be turned off or adjusted with configuration
>>>> settings 'akka.log-dead-letters' and
>>>> 'akka.log-dead-letters-during-shutdown'.
>>>> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
>>>> (mesosphere.marathon.SchedulerActions:403)
>>>> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
>>>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>>>> [2015-03-10 11:36:55,063] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>>> TASK_RUNNING (Reconciliation: Latest task state)
>>>> (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:36:55,065] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>>> TASK_RUNNING (Reconciliation: Latest task state)
>>>> (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:36:55,066] INFO Already running 1 instances of
>>>> /ffaas-backoffice-app-nopersist. Not scaling.
>>>> (mesosphere.marathon.SchedulerActions:512)
>>>>
>>>>
>>>>
>>>> -- End of logs
>>>>
>>>>
>>>>
>>>> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <ad...@mesosphere.io>:
>>>>
>>>>> This is certainly not the expected/desired behavior when failing over
>>>>> a mesos master in HA mode. In addition to the master logs Alex requested,
>>>>> can you also provide relevant portions of the slave logs for these tasks?
>>>>> If the slave processes themselves never failed over, checkpointing and
>>>>> slave recovery should be irrelevant. Are you running the mesos-slave itself
>>>>> inside a Docker, or any other non-traditional setup?
>>>>>
>>>>> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
>>>>> defaults to "reconnect", and --strict defaults to true, so none of those
>>>>> are necessary.
>>>>>
>>>>> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <al...@mesosphere.io>
>>>>> wrote:
>>>>>
>>>>>> Geoffroy,
>>>>>>
>>>>>> could you please provide master logs (both from killed and taking
>>>>>> over masters)?
>>>>>>
>>>>>> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <
>>>>>> geoffroy.jabouley@gmail.com> wrote:
>>>>>>
>>>>>>> Hello
>>>>>>>
>>>>>>> we are facing some unexpected issues when testing the high availability
>>>>>>> behavior of our mesos cluster.
>>>>>>>
>>>>>>> *Our use case:*
>>>>>>>
>>>>>>> *State*: the mesos cluster is up (3 machines), 1 docker task is
>>>>>>> running on each slave (started from marathon)
>>>>>>>
>>>>>>> *Action*: stop the mesos master leader process
>>>>>>>
>>>>>>> *Expected*: mesos master leader has changed, *active tasks remain
>>>>>>> unchanged*
>>>>>>>
>>>>>>> *Seen*: mesos master leader has changed, *all active tasks are now
>>>>>>> FAILED but docker containers are still running*, marathon detects
>>>>>>> FAILED tasks and starts new tasks. We end with 2 docker containers running
>>>>>>> on each machine, but only one is linked to a RUNNING mesos task.
>>>>>>>
>>>>>>>
>>>>>>> Is the seen behavior correct?
>>>>>>>
>>>>>>> Have we misunderstood the high availability concept? We thought that
>>>>>>> doing this use case would not have any impact on the current cluster state
>>>>>>> (except leader re-election)
>>>>>>>
>>>>>>> Thanks in advance for your help
>>>>>>> Regards
>>>>>>>
>>>>>>> ---------------------------------------------------
>>>>>>>
>>>>>>> our setup is the following:
>>>>>>> 3 identical mesos nodes with:
>>>>>>>     + zookeeper
>>>>>>>     + docker 1.5
>>>>>>>     + mesos master 0.21.1 configured in HA mode
>>>>>>>     + mesos slave 0.21.1 configured with checkpointing, strict and
>>>>>>> reconnect
>>>>>>>     + marathon 0.8.0 configured in HA mode with checkpointing
>>>>>>>
>>>>>>> ---------------------------------------------------
>>>>>>>
>>>>>>> Command lines:
>>>>>>>
>>>>>>>
>>>>>>> *mesos-master*
>>>>>>> /usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
>>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
>>>>>>> --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
>>>>>>> --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
>>>>>>>
>>>>>>> *mesos-slave*
>>>>>>> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,
>>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint
>>>>>>> --containerizers=docker,mesos --executor_registration_timeout=5mins
>>>>>>> --hostname=10.195.30.19 --ip=10.195.30.19
>>>>>>> --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
>>>>>>> --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>>>>>>>
>>>>>>> *marathon*
>>>>>>> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
>>>>>>> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
>>>>>>> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
>>>>>>> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
>>>>>>> --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
>>>>>>> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
>>>>>>> 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
>>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

Posted by Dario Rexin <da...@mesosphere.io>.
Hi Geoffrey,

we identified the issue and will fix it in Marathon 0.8.2. To prevent this behaviour for now, you just have to make sure that in a fresh setup (Marathon was never connected to Mesos) you first start up a single Marathon and let it register with Mesos and then start the other Marathon instances. The problem is a race in first registration with Mesos and fetching the FrameworkID from Zookeeper. Please let me know if the workaround does not help you.

Cheers,
Dario

> On 12 Mar 2015, at 09:20, Alex Rukletsov <al...@mesosphere.io> wrote:
> 
> Geoffroy,
> 
> yes, it looks like a marathon issue, so feel free to post it there as well.
> 
> On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley <geoffroy.jabouley@gmail.com> wrote:
> Thanks Alex for your answer. I will have a look.
> 
> Would it be better to (cross-)post this discussion on the marathon mailing list?
> 
> Anyway, the issue is "fixed" for 0.8.0, which is the version I'm using.
> 
> 2015-03-11 22:18 GMT+01:00 Alex Rukletsov <alex@mesosphere.io>:
> Geoffroy,
> 
> most probably you're hitting this bug: https://github.com/mesosphere/marathon/issues/1063. The problem is that Marathon can register instead of re-registering when a master fails over. From the master's point of view, it's a new framework, which is why the previous task is gone and a new one (that technically belongs to a new framework) is started. You can see that the frameworks have two different IDs (check lines 11:31:40.055496 and 11:31:40.785038) in your example.
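>
> One way to watch this from the outside (just a sketch; it assumes jq is installed, and the field names are as the master's state.json reports them in 0.21.x) is to dump the framework IDs and registration times from the leading master before and after the failover:
>
>   curl -s http://10.195.30.21:5050/master/state.json | jq '.frameworks[] | {id, name, registered_time}'
>
> A framework that really re-registered keeps its old ID; a fresh registration shows up with a new one.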
> 
> Hope that helps,
> Alex
> 
> On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley <geoffroy.jabouley@gmail.com> wrote:
> Hello
> 
> Thanks for your interest. Following are the requested logs, which will result in a pretty big mail.
> 
> Mesos/Marathon are NOT running inside Docker; we only use Docker as our Mesos containerizer.
> 
> As a reminder, here is the use case performed to produce the log files:
> 
> --------------------------------
> 
> Our cluster: 3 identical mesos nodes with:
>     + zookeeper
>     + docker 1.5
>     + mesos master 0.21.1 configured in HA mode
>     + mesos slave 0.21.1 configured with checkpointing, strict and reconnect
>     + marathon 0.8.0 configured in HA mode with checkpointing
>     
> --------------------------------
> 
> Begin State: 
> + the mesos cluster is up (3 machines)
> + mesos master leader is 10.195.30.19
> + marathon leader is 10.195.30.21
> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
> 
> Action: stop the mesos master leader process (sudo stop mesos-master)
> 
> Expected: mesos master leader has changed, active tasks / frameworks remain unchanged
> 
> End state: 
> + mesos master leader has changed, now 10.195.30.21 (see the quick check sketched after this list)
> + the previously running APPTASK on slave 10.195.30.21 has "disappeared" (no longer shown in the Mesos UI), but its docker container is still running
> + a new APPTASK is now running on slave 10.195.30.19
> + marathon framework "registration time" in mesos UI shows "Just now"
> + marathon leader has changed, now 10.195.30.20
> 
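> Sketch of the quick check mentioned above (assuming curl is available): any master answers /master/redirect with a redirect to the current leader, so the Location header names it.
> 
>   curl -s -D - -o /dev/null http://10.195.30.19:5050/master/redirect | grep -i '^Location'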
> 
> --------------------------------
> 
> Now come the 6 requested logs, which might contain interesting/relevant information, but as a newcomer to Mesos I find them hard to read...
> 
> 
> from previous MESOS master leader 10.195.30.19:
> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM from process 1 of user 0; exiting
> 
> 
> from new MESOS master leader 10.195.30.21:
> I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader: (id='2')
> I0310 11:31:40.011823   922 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group memberships changed
> I0310 11:31:40.015847   915 group.cpp:659] Trying to get '/mesos/log_replicas/0000000000' in ZooKeeper
> I0310 11:31:40.016047   922 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>) is detected
> I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader is master@10.195.30.21:5050 <http://master@10.195.30.21:5050/> with id 20150310-112310-354337546-5050-895
> I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading master!
> I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
> I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
> I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
> I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit promise request with proposal 2
> I0310 11:31:40.017503   915 group.cpp:659] Trying to get '/mesos/log_replicas/0000000003' in ZooKeeper
> I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 893672ns
> I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
> I0310 11:31:40.018817   915 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@10.195.30.20:5050 <http://10.195.30.20:5050/>, log-replica(1)@10.195.30.21:5050 <http://10.195.30.21:5050/> }
> I0310 11:31:40.023022   923 coordinator.cpp:230] Coordinator attemping to fill missing position
> I0310 11:31:40.023110   923 log.cpp:672] Writer started with ending position 8
> I0310 11:31:40.023293   923 leveldb.cpp:438] Reading position from leveldb took 13195ns
> I0310 11:31:40.023309   923 leveldb.cpp:438] Reading position from leveldb took 3120ns
> I0310 11:31:40.023619   922 registrar.cpp:346] Successfully fetched the registry (610B) in 7.385856ms
> I0310 11:31:40.023679   922 registrar.cpp:445] Applied 1 operations in 9263ns; attempting to update the 'registry'
> I0310 11:31:40.024238   922 log.cpp:680] Attempting to append 647 bytes to the log
> I0310 11:31:40.024279   923 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 9
> I0310 11:31:40.024435   923 replica.cpp:508] Replica received write request for position 9
> I0310 11:31:40.025707   923 leveldb.cpp:343] Persisting action (666 bytes) to leveldb took 1.259338ms
> I0310 11:31:40.025722   923 replica.cpp:676] Persisted action at 9
> I0310 11:31:40.026074   923 replica.cpp:655] Replica received learned notice for position 9
> I0310 11:31:40.026495   923 leveldb.cpp:343] Persisting action (668 bytes) to leveldb took 404795ns
> I0310 11:31:40.026507   923 replica.cpp:676] Persisted action at 9
> I0310 11:31:40.026511   923 replica.cpp:661] Replica learned APPEND action at position 9
> I0310 11:31:40.026726   923 registrar.cpp:490] Successfully updated the 'registry' in 3.029248ms
> I0310 11:31:40.026765   923 registrar.cpp:376] Successfully recovered registrar
> I0310 11:31:40.026814   923 log.cpp:699] Attempting to truncate the log to 9
> I0310 11:31:40.026880   923 master.cpp:1121] Recovered 3 slaves from the Registry (608B) ; allowing 1days for slaves to re-register
> I0310 11:31:40.026897   923 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 10
> I0310 11:31:40.026988   923 replica.cpp:508] Replica received write request for position 10
> I0310 11:31:40.027640   923 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 641018ns
> I0310 11:31:40.027652   923 replica.cpp:676] Persisted action at 10
> I0310 11:31:40.030848   923 replica.cpp:655] Replica received learned notice for position 10
> I0310 11:31:40.031883   923 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 1.008914ms
> I0310 11:31:40.031963   923 leveldb.cpp:401] Deleting ~2 keys from leveldb took 46724ns
> I0310 11:31:40.031977   923 replica.cpp:676] Persisted action at 10
> I0310 11:31:40.031986   923 replica.cpp:661] Replica learned TRUNCATE action at position 10
> I0310 11:31:40.055415   918 master.cpp:1383] Received registration request for framework 'marathon' at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:40.055496   918 master.cpp:1447] Registering framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:40.055642   919 hierarchical_allocator_process.hpp:329] Added framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.189151   919 master.cpp:3246] Re-registering slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 <http://10.195.30.19:5051/> (10.195.30.19)
> I0310 11:31:40.189280   919 registrar.cpp:445] Applied 1 operations in 15452ns; attempting to update the 'registry'
> I0310 11:31:40.189949   919 log.cpp:680] Attempting to append 647 bytes to the log
> I0310 11:31:40.189978   919 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 11
> I0310 11:31:40.190112   919 replica.cpp:508] Replica received write request for position 11
> I0310 11:31:40.190563   919 leveldb.cpp:343] Persisting action (666 bytes) to leveldb took 437440ns
> I0310 11:31:40.190577   919 replica.cpp:676] Persisted action at 11
> I0310 11:31:40.191249   921 replica.cpp:655] Replica received learned notice for position 11
> I0310 11:31:40.192159   921 leveldb.cpp:343] Persisting action (668 bytes) to leveldb took 892767ns
> I0310 11:31:40.192178   921 replica.cpp:676] Persisted action at 11
> I0310 11:31:40.192184   921 replica.cpp:661] Replica learned APPEND action at position 11
> I0310 11:31:40.192350   921 registrar.cpp:490] Successfully updated the 'registry' in 3.0528ms
> I0310 11:31:40.192387   919 log.cpp:699] Attempting to truncate the log to 11
> I0310 11:31:40.192415   919 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 12
> I0310 11:31:40.192539   915 replica.cpp:508] Replica received write request for position 12
> I0310 11:31:40.192600   921 master.cpp:3314] Re-registered slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 <http://10.195.30.19:5051/> (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
> I0310 11:31:40.192680   917 hierarchical_allocator_process.hpp:442] Added slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 available)
> I0310 11:31:40.192847   917 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:40.193164   915 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 610664ns
> I0310 11:31:40.193181   915 replica.cpp:676] Persisted action at 12
> I0310 11:31:40.193568   915 replica.cpp:655] Replica received learned notice for position 12
> I0310 11:31:40.193948   915 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 364062ns
> I0310 11:31:40.193979   915 leveldb.cpp:401] Deleting ~2 keys from leveldb took 12256ns
> I0310 11:31:40.193985   915 replica.cpp:676] Persisted action at 12
> I0310 11:31:40.193990   915 replica.cpp:661] Replica learned TRUNCATE action at position 12
> I0310 11:31:40.248615   915 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O0 ] on slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 <http://10.195.30.19:5051/> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:40.248744   915 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S1 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.774416   915 master.cpp:3246] Re-registering slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 <http://10.195.30.21:5051/> (10.195.30.21)
> I0310 11:31:40.774976   915 registrar.cpp:445] Applied 1 operations in 42342ns; attempting to update the 'registry'
> I0310 11:31:40.777273   921 log.cpp:680] Attempting to append 647 bytes to the log
> I0310 11:31:40.777436   921 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 13
> I0310 11:31:40.777989   921 replica.cpp:508] Replica received write request for position 13
> I0310 11:31:40.779558   921 leveldb.cpp:343] Persisting action (666 bytes) to leveldb took 1.513714ms
> I0310 11:31:40.779633   921 replica.cpp:676] Persisted action at 13
> I0310 11:31:40.781821   919 replica.cpp:655] Replica received learned notice for position 13
> I0310 11:31:40.784417   919 leveldb.cpp:343] Persisting action (668 bytes) to leveldb took 2.542036ms
> I0310 11:31:40.784446   919 replica.cpp:676] Persisted action at 13
> I0310 11:31:40.784452   919 replica.cpp:661] Replica learned APPEND action at position 13
> I0310 11:31:40.784711   920 registrar.cpp:490] Successfully updated the 'registry' in 9.68192ms
> I0310 11:31:40.784762   917 log.cpp:699] Attempting to truncate the log to 13
> I0310 11:31:40.784808   920 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 14
> I0310 11:31:40.784865   917 master.hpp:877] Adding task ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
> I0310 11:31:40.784955   919 replica.cpp:508] Replica received write request for position 14
> W0310 11:31:40.785038   917 master.cpp:4468] Possibly orphaned task ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of framework 20150310-112310-320783114-5050-24289-0000 running on slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 <http://10.195.30.21:5051/> (10.195.30.21)
> I0310 11:31:40.785105   917 master.cpp:3314] Re-registered slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 <http://10.195.30.21:5051/> (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
> I0310 11:31:40.785162   920 hierarchical_allocator_process.hpp:442] Added slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148 available)
> I0310 11:31:40.785679   921 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:40.786429   919 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 1.454211ms
> I0310 11:31:40.786455   919 replica.cpp:676] Persisted action at 14
> I0310 11:31:40.786782   919 replica.cpp:655] Replica received learned notice for position 14
> I0310 11:31:40.787833   919 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 1.027014ms
> I0310 11:31:40.787873   919 leveldb.cpp:401] Deleting ~2 keys from leveldb took 14085ns
> I0310 11:31:40.787883   919 replica.cpp:676] Persisted action at 14
> I0310 11:31:40.787889   919 replica.cpp:661] Replica learned TRUNCATE action at position 14
> I0310 11:31:40.792536   922 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O1 ] on slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 <http://10.195.30.21:5051/> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:40.792670   922 hierarchical_allocator_process.hpp:563] Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S2 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.819602   921 master.cpp:3246] Re-registering slave 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 <http://10.195.30.20:5051/> (10.195.30.20)
> I0310 11:31:40.819736   921 registrar.cpp:445] Applied 1 operations in 16656ns; attempting to update the 'registry'
> I0310 11:31:40.820617   921 log.cpp:680] Attempting to append 647 bytes to the log
> I0310 11:31:40.820726   918 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 15
> I0310 11:31:40.820938   918 replica.cpp:508] Replica received write request for position 15
> I0310 11:31:40.821641   918 leveldb.cpp:343] Persisting action (666 bytes) to leveldb took 670583ns
> I0310 11:31:40.821663   918 replica.cpp:676] Persisted action at 15
> I0310 11:31:40.822265   917 replica.cpp:655] Replica received learned notice for position 15
> I0310 11:31:40.823463   917 leveldb.cpp:343] Persisting action (668 bytes) to leveldb took 1.178687ms
> I0310 11:31:40.823490   917 replica.cpp:676] Persisted action at 15
> I0310 11:31:40.823498   917 replica.cpp:661] Replica learned APPEND action at position 15
> I0310 11:31:40.823755   917 registrar.cpp:490] Successfully updated the 'registry' in 3.97696ms
> I0310 11:31:40.823823   917 log.cpp:699] Attempting to truncate the log to 15
> I0310 11:31:40.824147   922 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 16
> I0310 11:31:40.824482   922 hierarchical_allocator_process.hpp:442] Added slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 available)
> I0310 11:31:40.824597   921 replica.cpp:508] Replica received write request for position 16
> I0310 11:31:40.824128   917 master.cpp:3314] Re-registered slave 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 <http://10.195.30.20:5051/> (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
> I0310 11:31:40.824975   917 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:40.831900   921 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 7.228682ms
> I0310 11:31:40.832031   921 replica.cpp:676] Persisted action at 16
> I0310 11:31:40.832456   917 replica.cpp:655] Replica received learned notice for position 16
> I0310 11:31:40.835178   917 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 2.674392ms
> I0310 11:31:40.835297   917 leveldb.cpp:401] Deleting ~2 keys from leveldb took 45220ns
> I0310 11:31:40.835322   917 replica.cpp:676] Persisted action at 16
> I0310 11:31:40.835341   917 replica.cpp:661] Replica learned TRUNCATE action at position 16
> I0310 11:31:40.838281   923 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O2 ] on slave 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 <http://10.195.30.20:5051/> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:40.838389   923 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S0 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.948725   919 http.cpp:344] HTTP request for '/master/redirect'
> I0310 11:31:41.479118   918 http.cpp:478] HTTP request for '/master/state.json'
> I0310 11:31:45.368074   918 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:45.385144   917 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O3 ] on slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 <http://10.195.30.19:5051/> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:45.385292   917 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S1 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:46.368450   917 master.cpp:3843] Sending 2 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:46.375222   920 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O4 ] on slave 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 <http://10.195.30.20:5051/> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:46.375360   920 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O5 ] on slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 <http://10.195.30.21:5051/> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:46.375530   920 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S0 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:46.375599   920 hierarchical_allocator_process.hpp:563] Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S2 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:48.031230   915 http.cpp:478] HTTP request for '/master/state.json'
> I0310 11:31:51.374285   922 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:51.379391   921 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O6 ] on slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 <http://10.195.30.19:5051/> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:51.379487   921 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S1 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:51.482094   923 http.cpp:478] HTTP request for '/master/state.json'
> I0310 11:31:52.375326   917 master.cpp:3843] Sending 2 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:52.391376   919 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O7 ] on slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 <http://10.195.30.21:5051/> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:52.391512   919 hierarchical_allocator_process.hpp:563] Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S2 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:52.391659   921 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O8 ] on slave 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 <http://10.195.30.20:5051/> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:52.391751   921 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S0 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:55.062060   918 master.cpp:3611] Performing explicit task state reconciliation for 1 tasks of framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:55.062588   919 master.cpp:3556] Performing implicit task state reconciliation for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:56.140990   923 http.cpp:344] HTTP request for '/master/redirect'
> I0310 11:31:57.379288   918 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:57.430888   918 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O9 ] on slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 <http://10.195.30.19:5051/> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/>
> I0310 11:31:57.431068   918 master.hpp:877] Adding task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
> I0310 11:31:57.431089   918 master.cpp:2503] Launching task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 <http://scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771/> with resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 <http://10.195.30.19:5051/> (10.195.30.19)
> I0310 11:31:57.431205   918 hierarchical_allocator_process.hpp:563] Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S1 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.682133   919 master.cpp:3446] Forwarding status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.682186   919 master.cpp:3418] Status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 from slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 <http://10.195.30.19:5051/> (10.195.30.19)
> I0310 11:31:57.682199   919 master.cpp:4693] Updating the latest state of task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
> 
> 
> from MESOS slave 10.195.30.21
> I0310 11:31:28.750200  1074 slave.cpp:2623] master@10.195.30.19:5050 <http://master@10.195.30.19:5050/> exited
> W0310 11:31:28.750249  1074 slave.cpp:2626] Master disconnected! Waiting for a new master to be elected
> I0310 11:31:40.012516  1075 detector.cpp:138] Detected a new leader: (id='2')
> I0310 11:31:40.012899  1073 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.017143  1072 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>) is detected
> I0310 11:31:40.017408  1072 slave.cpp:602] New master detected at master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>
> I0310 11:31:40.017546  1076 status_update_manager.cpp:171] Pausing sending status updates
> I0310 11:31:40.018673  1072 slave.cpp:627] No credentials provided. Attempting to register without authentication
> I0310 11:31:40.018689  1072 slave.cpp:638] Detecting new master
> I0310 11:31:40.785364  1075 slave.cpp:824] Re-registered with master master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>
> I0310 11:31:40.785398  1075 status_update_manager.cpp:178] Resuming sending status updates
> I0310 11:32:10.639506  1075 slave.cpp:3321] Current usage 12.27%. Max allowed age: 5.441217749539572days
> 
> 
> from MESOS slave 10.195.30.19
> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050 <http://master@10.195.30.19:5050/> exited
> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected! Waiting for a new master to be elected
> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader: (id='2')
> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>) is detected
> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>
> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided. Attempting to register without authentication
> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing sending status updates
> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>
> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming sending status updates
> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 in work directory '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for executor ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework '20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and executor 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of framework '20150310-112310-354337546-5050-895-0000'
> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of framework '20150310-112310-354337546-5050-895-0000' in container 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for executor 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of framework 20150310-112310-354337546-5050-895-0000 from executor(1)@10.195.30.19:56574 <http://10.195.30.19:56574/>
> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for executor 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 from executor(1)@10.195.30.19:56574 <http://10.195.30.19:56574/>
> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.678411 24461 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 to master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>
> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 to executor(1)@10.195.30.19:56574 <http://10.195.30.19:56574/>
> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.689028 24461 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to 204 at /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2 for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated 'memory.soft_limit_in_bytes' to 160MB for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated 'memory.limit_in_bytes' to 160MB at /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2 for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max allowed age: 5.555425200437824days
> 
> 
> From previous Marathon leader 10.195.30.21:
> I0310 11:31:40.017248  1115 detector.cpp:138] Detected a new leader: (id='2')
> I0310 11:31:40.017334  1115 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.017727  1115 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>) is detected
> [2015-03-10 11:31:40,017] WARN Disconnected (mesosphere.marathon.MarathonScheduler:224)
> [2015-03-10 11:31:40,019] INFO Abdicating (mesosphere.marathon.MarathonSchedulerService:312)
> [2015-03-10 11:31:40,019] INFO Defeat leadership (mesosphere.marathon.MarathonSchedulerService:285)
> [INFO] [03/10/2015 11:31:40.019] [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b] POSTing to all endpoints.
> [INFO] [03/10/2015 11:31:40.019] [marathon-akka.actor.default-dispatcher-5] [akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
> [2015-03-10 11:31:40,021] INFO Stopping driver (mesosphere.marathon.MarathonSchedulerService:221)
> I0310 11:31:40.022001  1115 sched.cpp:1286] Asked to stop the driver
> [2015-03-10 11:31:40,024] INFO Setting framework ID to 20150310-112310-320783114-5050-24289-0000 (mesosphere.marathon.MarathonSchedulerService:73)
> I0310 11:31:40.026274  1115 sched.cpp:234] New master detected at master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>
> I0310 11:31:40.026418  1115 sched.cpp:242] No credentials provided. Attempting to register without authentication
> I0310 11:31:40.026458  1115 sched.cpp:752] Stopping framework '20150310-112310-320783114-5050-24289-0000'
> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing optional abdication command. (mesosphere.marathon.MarathonSchedulerService:192)
> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface) (mesosphere.marathon.MarathonSchedulerService:246)
> [2015-03-10 11:31:40,032] INFO Defeat leadership (mesosphere.marathon.MarathonSchedulerService:285)
> [2015-03-10 11:31:40,032] INFO Stopping driver (mesosphere.marathon.MarathonSchedulerService:221)
> I0310 11:31:40.032588  1107 sched.cpp:1286] Asked to stop the driver
> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500 milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334)
> [2015-03-10 11:31:40,033] INFO Setting framework ID to 20150310-112310-320783114-5050-24289-0000 (mesosphere.marathon.MarathonSchedulerService:73)
> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is not a candidate for leader, current voting: [member_0000000001, member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering leadership (mesosphere.marathon.MarathonSchedulerService:341)
> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003 (com.twitter.common.zookeeper.Group:426)
> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is not a candidate for leader, current voting: [member_0000000001, member_0000000002, member_0000000003] (com.twitter.common.zookeeper.CandidateImpl:144)
> [2015-03-10 11:31:40,568] INFO Candidate /marathon/leader/member_0000000003 waiting for the next leader election, current voting: [member_0000000001, member_0000000002, member_0000000003] (com.twitter.common.zookeeper.CandidateImpl:165)
> 
> 
> From new Marathon leader 10.195.30.20:
> [2015-03-10 11:31:40,029] INFO Candidate /marathon/leader/member_0000000001 is now leader of group: [member_0000000001, member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:152)
> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface) (mesosphere.marathon.MarathonSchedulerService:253)
> [2015-03-10 11:31:40,044] INFO Elect leadership (mesosphere.marathon.MarathonSchedulerService:299)
> [2015-03-10 11:31:40,044] INFO Running driver (mesosphere.marathon.MarathonSchedulerService:184)
> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716: Client environment:host.name <http://host.name/>=srv-d2u-9-virtip20
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723: Client environment:os.name <http://os.name/>=Linux
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-44-generic
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725: Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733: Client environment:user.name <http://user.name/>=(null)
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741: Client environment:user.home=/root
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753: Client environment:user.dir=/
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.195.30.19:2181 <http://10.195.30.19:2181/>,10.195.30.20:2181 <http://10.195.30.20:2181/>,10.195.30.21:2181 <http://10.195.30.21:2181/> sessionTimeout=10000 watcher=0x7fda9da9a6a0 sessionId=0 sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff (mesosphere.marathon.MarathonSchedulerService:329)
> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703: initiated connection to server [10.195.30.19:2181 <http://10.195.30.19:2181/>]
> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750: session establishment complete on server [10.195.30.19:2181 <http://10.195.30.19:2181/>], sessionId=0x14c0335ad7e000d, negotiated timeout=10000
> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@10.195.30.20:45771 <http://10.195.30.20:45771/>) connected to ZooKeeper
> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
> [INFO] [03/10/2015 11:31:40.047] [marathon-akka.actor.default-dispatcher-2] [akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader: (id='2')
> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> [2015-03-10 11:31:40,053] INFO Migration successfully applied for version Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>) is detected
> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at master@10.195.30.21:5050 <http://master@10.195.30.21:5050/>
> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided. Attempting to register without authentication
> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with 20150310-112310-354337546-5050-895-0000
> [2015-03-10 11:31:40,056] INFO Registered as 20150310-112310-354337546-5050-895-0000 to master '20150310-112310-354337546-5050-895' (mesosphere.marathon.MarathonScheduler:72)
> [2015-03-10 11:31:40,063] INFO Stored framework ID '20150310-112310-354337546-5050-895-0000' (mesosphere.mesos.util.FrameworkIdUtil:49)
> [INFO] [03/10/2015 11:31:40.065] [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
> [INFO] [03/10/2015 11:31:40.067] [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b] POSTing to all endpoints.
> ...
> ...
> ...
> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps (mesosphere.marathon.SchedulerActions:403)
> [INFO] [03/10/2015 11:31:55.053] [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters] Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the Mesos master (mesosphere.marathon.SchedulerActions:430)
> [2015-03-10 11:31:55,064] INFO Received status update for task ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799: TASK_LOST (Reconciliation: Task is unknown to the slave) (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:31:55,069] INFO Need to scale /ffaas-backoffice-app-nopersist from 0 up to 1 instances (mesosphere.marathon.SchedulerActions:488)
> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for /ffaas-backoffice-app-nopersist (0 queued) (mesosphere.marathon.SchedulerActions:494)
> [2015-03-10 11:31:55,069] INFO Task ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 expunged and removed from TaskTracker (mesosphere.marathon.tasks.TaskTracker:107)
> [2015-03-10 11:31:55,070] INFO Sending event notification. (mesosphere.marathon.MarathonScheduler:262)
> [INFO] [03/10/2015 11:31:55.072] [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b] POSTing to all endpoints.
> [2015-03-10 11:31:55,073] INFO Need to scale /ffaas-backoffice-app-nopersist from 0 up to 1 instances (mesosphere.marathon.SchedulerActions:488)
> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for /ffaas-backoffice-app-nopersist. Not scaling. (mesosphere.marathon.SchedulerActions:498)
> ...
> ...
> ...
> [2015-03-10 11:31:57,682] INFO Received status update for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:31:57,694] INFO Sending event notification. (mesosphere.marathon.MarathonScheduler:262)
> [INFO] [03/10/2015 11:31:57.694] [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b] POSTing to all endpoints.
> ...
> ...
> ...
> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store (mesosphere.marathon.tasks.TaskTracker:170)
> [INFO] [03/10/2015 11:36:55.050] [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters] Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps (mesosphere.marathon.SchedulerActions:403)
> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the Mesos master (mesosphere.marathon.SchedulerActions:430)
> [2015-03-10 11:36:55,063] INFO Received status update for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: TASK_RUNNING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:36:55,065] INFO Received status update for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: TASK_RUNNING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:36:55,066] INFO Already running 1 instances of /ffaas-backoffice-app-nopersist. Not scaling. (mesosphere.marathon.SchedulerActions:512)
> 
> 
> 
> -- End of logs
> 
> 
> 
> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <adam@mesosphere.io>:
> This is certainly not the expected/desired behavior when failing over a mesos master in HA mode. In addition to the master logs Alex requested, can you also provide relevant portions of the slave logs for these tasks? If the slave processes themselves never failed over, checkpointing and slave recovery should be irrelevant. Are you running the mesos-slave itself inside a Docker, or any other non-traditional setup?
> 
> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover defaults to "reconnect", and --strict defaults to true, so none of those are necessary.
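> 
> In other words, the slave line from the setup above could be trimmed to something like this (just a sketch, keeping only the non-default flags):
> 
>   /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=10.195.30.19 --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recovery_timeout=120mins --resources=ports:[31000-32000,80,443]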
> 
> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <alex@mesosphere.io> wrote:
> Geoffroy, 
> 
> could you please provide master logs (both from killed and taking over masters)?
> 


Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

Posted by Alex Rukletsov <al...@mesosphere.io>.
Geoffroy,

yes, it looks like a marathon issue, so feel free to post it there as well.

On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley <
geoffroy.jabouley@gmail.com> wrote:

> Thanks Alex for your answer. I will have a look.
>
> Would it be better to (cross-)post this discussion on the marathon mailing
> list?
>
> Anyway, the issue is "fixed" for 0.8.0, which is the version I'm using.
>
> 2015-03-11 22:18 GMT+01:00 Alex Rukletsov <al...@mesosphere.io>:
>
>> Geoffroy,
>>
>> most probably you're hitting this bug:
>> https://github.com/mesosphere/marathon/issues/1063. The problem is that
>> Marathon can register instead of re-registering when a master fails
>> over. From the master's point of view, it's a new framework, which is why
>> the previous task is gone and a new one (that technically belongs to a new
>> framework) is started. You can see that the frameworks have two different IDs
>> (check lines 11:31:40.055496 and 11:31:40.785038) in your example.
>>
>> Hope that helps,
>> Alex
>>
>> On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley <
>> geoffroy.jabouley@gmail.com> wrote:
>>
>>> Hello
>>>
>>> Thanks for your interest. Following are the requested logs, which will
>>> result in a pretty big mail.
>>>
>>> Mesos/Marathon are *NOT running inside Docker*; we only use Docker as
>>> our Mesos containerizer.
>>>
>>> As a reminder, here is the use case performed to produce the log files:
>>>
>>> --------------------------------
>>>
>>> Our cluster: 3 identical mesos nodes with:
>>>     + zookeeper
>>>     + docker 1.5
>>>     + mesos master 0.21.1 configured in HA mode
>>>     + mesos slave 0.21.1 configured with checkpointing, strict and
>>> reconnect
>>>     + marathon 0.8.0 configured in HA mode with checkpointing
>>>
>>> --------------------------------
>>>
>>> *Begin State: *
>>> + the mesos cluster is up (3 machines)
>>> + mesos master leader is 10.195.30.19
>>> + marathon leader is 10.195.30.21
>>> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
>>>
>>> *Action*: stop the mesos master leader process (sudo stop mesos-master)
>>>
>>> *Expected*: mesos master leader has changed, active tasks / frameworks
>>> remain unchanged
>>>
>>> *End state: *
>>> + mesos master leader *has changed, now 10.195.30.21*
>>> + the previously running APPTASK on slave 10.195.30.21 has "disappeared"
>>> (no longer shown in the Mesos UI), but *the docker container is still
>>> running*
>>> + a *new APPTASK is now running on slave 10.195.30.19*
>>> + marathon framework "registration time" in mesos UI shows "Just now"
>>> + marathon leader *has changed, now 10.195.30.20*
>>>
>>>
>>> --------------------------------
>>>
>>> Now come the 6 requested logs, which might contain interesting/relevant
>>> information, but as a newcomer to Mesos I find them hard to read...
>>>
>>>
>>> *from previous MESOS master leader 10.195.30.19:*
>>> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM
>>> from process 1 of user 0; exiting
>>>
>>>
>>> *from new MESOS master leader 10.195.30.21:*
>>> I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader:
>>> (id='2')
>>> I0310 11:31:40.011823   922 group.cpp:659] Trying to get
>>> '/mesos/info_0000000002' in ZooKeeper
>>> I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group memberships
>>> changed
>>> I0310 11:31:40.015847   915 group.cpp:659] Trying to get
>>> '/mesos/log_replicas/0000000000' in ZooKeeper
>>> I0310 11:31:40.016047   922 detector.cpp:433] A new leading master (UPID=
>>> master@10.195.30.21:5050) is detected
>>> I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader is
>>> master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
>>> I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading
>>> master!
>>> I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
>>> I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
>>> I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
>>> I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit
>>> promise request with proposal 2
>>> I0310 11:31:40.017503   915 group.cpp:659] Trying to get
>>> '/mesos/log_replicas/0000000003' in ZooKeeper
>>> I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8
>>> bytes) to leveldb took 893672ns
>>> I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
>>> I0310 11:31:40.018817   915 network.hpp:466] ZooKeeper group PIDs: {
>>> log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
>>> I0310 11:31:40.023022   923 coordinator.cpp:230] Coordinator attemping
>>> to fill missing position
>>> I0310 11:31:40.023110   923 log.cpp:672] Writer started with ending
>>> position 8
>>> I0310 11:31:40.023293   923 leveldb.cpp:438] Reading position from
>>> leveldb took 13195ns
>>> I0310 11:31:40.023309   923 leveldb.cpp:438] Reading position from
>>> leveldb took 3120ns
>>> I0310 11:31:40.023619   922 registrar.cpp:346] Successfully fetched the
>>> registry (610B) in 7.385856ms
>>> I0310 11:31:40.023679   922 registrar.cpp:445] Applied 1 operations in
>>> 9263ns; attempting to update the 'registry'
>>> I0310 11:31:40.024238   922 log.cpp:680] Attempting to append 647 bytes
>>> to the log
>>> I0310 11:31:40.024279   923 coordinator.cpp:340] Coordinator attempting
>>> to write APPEND action at position 9
>>> I0310 11:31:40.024435   923 replica.cpp:508] Replica received write
>>> request for position 9
>>> I0310 11:31:40.025707   923 leveldb.cpp:343] Persisting action (666
>>> bytes) to leveldb took 1.259338ms
>>> I0310 11:31:40.025722   923 replica.cpp:676] Persisted action at 9
>>> I0310 11:31:40.026074   923 replica.cpp:655] Replica received learned
>>> notice for position 9
>>> I0310 11:31:40.026495   923 leveldb.cpp:343] Persisting action (668
>>> bytes) to leveldb took 404795ns
>>> I0310 11:31:40.026507   923 replica.cpp:676] Persisted action at 9
>>> I0310 11:31:40.026511   923 replica.cpp:661] Replica learned APPEND
>>> action at position 9
>>> I0310 11:31:40.026726   923 registrar.cpp:490] Successfully updated the
>>> 'registry' in 3.029248ms
>>> I0310 11:31:40.026765   923 registrar.cpp:376] Successfully recovered
>>> registrar
>>> I0310 11:31:40.026814   923 log.cpp:699] Attempting to truncate the log
>>> to 9
>>> I0310 11:31:40.026880   923 master.cpp:1121] Recovered 3 slaves from the
>>> Registry (608B) ; allowing 1days for slaves to re-register
>>> I0310 11:31:40.026897   923 coordinator.cpp:340] Coordinator attempting
>>> to write TRUNCATE action at position 10
>>> I0310 11:31:40.026988   923 replica.cpp:508] Replica received write
>>> request for position 10
>>> I0310 11:31:40.027640   923 leveldb.cpp:343] Persisting action (16
>>> bytes) to leveldb took 641018ns
>>> I0310 11:31:40.027652   923 replica.cpp:676] Persisted action at 10
>>> I0310 11:31:40.030848   923 replica.cpp:655] Replica received learned
>>> notice for position 10
>>> I0310 11:31:40.031883   923 leveldb.cpp:343] Persisting action (18
>>> bytes) to leveldb took 1.008914ms
>>> I0310 11:31:40.031963   923 leveldb.cpp:401] Deleting ~2 keys from
>>> leveldb took 46724ns
>>> I0310 11:31:40.031977   923 replica.cpp:676] Persisted action at 10
>>> I0310 11:31:40.031986   923 replica.cpp:661] Replica learned TRUNCATE
>>> action at position 10
>>> I0310 11:31:40.055415   918 master.cpp:1383] Received registration
>>> request for framework 'marathon' at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.055496   918 master.cpp:1447] Registering framework
>>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.055642   919 hierarchical_allocator_process.hpp:329]
>>> Added framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:40.189151   919 master.cpp:3246] Re-registering slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19)
>>> I0310 11:31:40.189280   919 registrar.cpp:445] Applied 1 operations in
>>> 15452ns; attempting to update the 'registry'
>>> I0310 11:31:40.189949   919 log.cpp:680] Attempting to append 647 bytes
>>> to the log
>>> I0310 11:31:40.189978   919 coordinator.cpp:340] Coordinator attempting
>>> to write APPEND action at position 11
>>> I0310 11:31:40.190112   919 replica.cpp:508] Replica received write
>>> request for position 11
>>> I0310 11:31:40.190563   919 leveldb.cpp:343] Persisting action (666
>>> bytes) to leveldb took 437440ns
>>> I0310 11:31:40.190577   919 replica.cpp:676] Persisted action at 11
>>> I0310 11:31:40.191249   921 replica.cpp:655] Replica received learned
>>> notice for position 11
>>> I0310 11:31:40.192159   921 leveldb.cpp:343] Persisting action (668
>>> bytes) to leveldb took 892767ns
>>> I0310 11:31:40.192178   921 replica.cpp:676] Persisted action at 11
>>> I0310 11:31:40.192184   921 replica.cpp:661] Replica learned APPEND
>>> action at position 11
>>> I0310 11:31:40.192350   921 registrar.cpp:490] Successfully updated the
>>> 'registry' in 3.0528ms
>>> I0310 11:31:40.192387   919 log.cpp:699] Attempting to truncate the log
>>> to 11
>>> I0310 11:31:40.192415   919 coordinator.cpp:340] Coordinator attempting
>>> to write TRUNCATE action at position 12
>>> I0310 11:31:40.192539   915 replica.cpp:508] Replica received write
>>> request for position 12
>>> I0310 11:31:40.192600   921 master.cpp:3314] Re-registered slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148
>>> I0310 11:31:40.192680   917 hierarchical_allocator_process.hpp:442]
>>> Added slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with
>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>>> available)
>>> I0310 11:31:40.192847   917 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.193164   915 leveldb.cpp:343] Persisting action (16
>>> bytes) to leveldb took 610664ns
>>> I0310 11:31:40.193181   915 replica.cpp:676] Persisted action at 12
>>> I0310 11:31:40.193568   915 replica.cpp:655] Replica received learned
>>> notice for position 12
>>> I0310 11:31:40.193948   915 leveldb.cpp:343] Persisting action (18
>>> bytes) to leveldb took 364062ns
>>> I0310 11:31:40.193979   915 leveldb.cpp:401] Deleting ~2 keys from
>>> leveldb took 12256ns
>>> I0310 11:31:40.193985   915 replica.cpp:676] Persisted action at 12
>>> I0310 11:31:40.193990   915 replica.cpp:661] Replica learned TRUNCATE
>>> action at position 12
>>> I0310 11:31:40.248615   915 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O0 ] on slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.248744   915 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:40.774416   915 master.cpp:3246] Re-registering slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21)
>>> I0310 11:31:40.774976   915 registrar.cpp:445] Applied 1 operations in
>>> 42342ns; attempting to update the 'registry'
>>> I0310 11:31:40.777273   921 log.cpp:680] Attempting to append 647 bytes
>>> to the log
>>> I0310 11:31:40.777436   921 coordinator.cpp:340] Coordinator attempting
>>> to write APPEND action at position 13
>>> I0310 11:31:40.777989   921 replica.cpp:508] Replica received write
>>> request for position 13
>>> I0310 11:31:40.779558   921 leveldb.cpp:343] Persisting action (666
>>> bytes) to leveldb took 1.513714ms
>>> I0310 11:31:40.779633   921 replica.cpp:676] Persisted action at 13
>>> I0310 11:31:40.781821   919 replica.cpp:655] Replica received learned
>>> notice for position 13
>>> I0310 11:31:40.784417   919 leveldb.cpp:343] Persisting action (668
>>> bytes) to leveldb took 2.542036ms
>>> I0310 11:31:40.784446   919 replica.cpp:676] Persisted action at 13
>>> I0310 11:31:40.784452   919 replica.cpp:661] Replica learned APPEND
>>> action at position 13
>>> I0310 11:31:40.784711   920 registrar.cpp:490] Successfully updated the
>>> 'registry' in 9.68192ms
>>> I0310 11:31:40.784762   917 log.cpp:699] Attempting to truncate the log
>>> to 13
>>> I0310 11:31:40.784808   920 coordinator.cpp:340] Coordinator attempting
>>> to write TRUNCATE action at position 14
>>> I0310 11:31:40.784865   917 master.hpp:877] Adding task
>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with
>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>> 20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
>>> I0310 11:31:40.784955   919 replica.cpp:508] Replica received write
>>> request for position 14
>>> W0310 11:31:40.785038   917 master.cpp:4468] Possibly orphaned task
>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of
>>> framework 20150310-112310-320783114-5050-24289-0000 running on slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21)
>>> I0310 11:31:40.785105   917 master.cpp:3314] Re-registered slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148
>>> I0310 11:31:40.785162   920 hierarchical_allocator_process.hpp:442]
>>> Added slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with
>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>> ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148
>>> available)
>>> I0310 11:31:40.785679   921 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.786429   919 leveldb.cpp:343] Persisting action (16
>>> bytes) to leveldb took 1.454211ms
>>> I0310 11:31:40.786455   919 replica.cpp:676] Persisted action at 14
>>> I0310 11:31:40.786782   919 replica.cpp:655] Replica received learned
>>> notice for position 14
>>> I0310 11:31:40.787833   919 leveldb.cpp:343] Persisting action (18
>>> bytes) to leveldb took 1.027014ms
>>> I0310 11:31:40.787873   919 leveldb.cpp:401] Deleting ~2 keys from
>>> leveldb took 14085ns
>>> I0310 11:31:40.787883   919 replica.cpp:676] Persisted action at 14
>>> I0310 11:31:40.787889   919 replica.cpp:661] Replica learned TRUNCATE
>>> action at position 14
>>> I0310 11:31:40.792536   922 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O1 ] on slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.792670   922 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:40.819602   921 master.cpp:3246] Re-registering slave
>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>> (10.195.30.20)
>>> I0310 11:31:40.819736   921 registrar.cpp:445] Applied 1 operations in
>>> 16656ns; attempting to update the 'registry'
>>> I0310 11:31:40.820617   921 log.cpp:680] Attempting to append 647 bytes
>>> to the log
>>> I0310 11:31:40.820726   918 coordinator.cpp:340] Coordinator attempting
>>> to write APPEND action at position 15
>>> I0310 11:31:40.820938   918 replica.cpp:508] Replica received write
>>> request for position 15
>>> I0310 11:31:40.821641   918 leveldb.cpp:343] Persisting action (666
>>> bytes) to leveldb took 670583ns
>>> I0310 11:31:40.821663   918 replica.cpp:676] Persisted action at 15
>>> I0310 11:31:40.822265   917 replica.cpp:655] Replica received learned
>>> notice for position 15
>>> I0310 11:31:40.823463   917 leveldb.cpp:343] Persisting action (668
>>> bytes) to leveldb took 1.178687ms
>>> I0310 11:31:40.823490   917 replica.cpp:676] Persisted action at 15
>>> I0310 11:31:40.823498   917 replica.cpp:661] Replica learned APPEND
>>> action at position 15
>>> I0310 11:31:40.823755   917 registrar.cpp:490] Successfully updated the
>>> 'registry' in 3.97696ms
>>> I0310 11:31:40.823823   917 log.cpp:699] Attempting to truncate the log
>>> to 15
>>> I0310 11:31:40.824147   922 coordinator.cpp:340] Coordinator attempting
>>> to write TRUNCATE action at position 16
>>> I0310 11:31:40.824482   922 hierarchical_allocator_process.hpp:442]
>>> Added slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with
>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>>> available)
>>> I0310 11:31:40.824597   921 replica.cpp:508] Replica received write
>>> request for position 16
>>> I0310 11:31:40.824128   917 master.cpp:3314] Re-registered slave
>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>> (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148
>>> I0310 11:31:40.824975   917 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.831900   921 leveldb.cpp:343] Persisting action (16
>>> bytes) to leveldb took 7.228682ms
>>> I0310 11:31:40.832031   921 replica.cpp:676] Persisted action at 16
>>> I0310 11:31:40.832456   917 replica.cpp:655] Replica received learned
>>> notice for position 16
>>> I0310 11:31:40.835178   917 leveldb.cpp:343] Persisting action (18
>>> bytes) to leveldb took 2.674392ms
>>> I0310 11:31:40.835297   917 leveldb.cpp:401] Deleting ~2 keys from
>>> leveldb took 45220ns
>>> I0310 11:31:40.835322   917 replica.cpp:676] Persisted action at 16
>>> I0310 11:31:40.835341   917 replica.cpp:661] Replica learned TRUNCATE
>>> action at position 16
>>> I0310 11:31:40.838281   923 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O2 ] on slave
>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.838389   923 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:40.948725   919 http.cpp:344] HTTP request for
>>> '/master/redirect'
>>> I0310 11:31:41.479118   918 http.cpp:478] HTTP request for
>>> '/master/state.json'
>>> I0310 11:31:45.368074   918 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:45.385144   917 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O3 ] on slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:45.385292   917 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:46.368450   917 master.cpp:3843] Sending 2 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:46.375222   920 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O4 ] on slave
>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:46.375360   920 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O5 ] on slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:46.375530   920 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:46.375599   920 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:48.031230   915 http.cpp:478] HTTP request for
>>> '/master/state.json'
>>> I0310 11:31:51.374285   922 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:51.379391   921 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O6 ] on slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:51.379487   921 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:51.482094   923 http.cpp:478] HTTP request for
>>> '/master/state.json'
>>> I0310 11:31:52.375326   917 master.cpp:3843] Sending 2 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:52.391376   919 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O7 ] on slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:52.391512   919 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:52.391659   921 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O8 ] on slave
>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:52.391751   921 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:55.062060   918 master.cpp:3611] Performing explicit task
>>> state reconciliation for 1 tasks of framework
>>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:55.062588   919 master.cpp:3556] Performing implicit task
>>> state reconciliation for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:56.140990   923 http.cpp:344] HTTP request for
>>> '/master/redirect'
>>> I0310 11:31:57.379288   918 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:57.430888   918 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O9 ] on slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:57.431068   918 master.hpp:877] Adding task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with
>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>> 20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
>>> I0310 11:31:57.431089   918 master.cpp:2503] Launching task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with
>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19)
>>> I0310 11:31:57.431205   918 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.682133   919 master.cpp:3446] Forwarding status update
>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.682186   919 master.cpp:3418] Status update TASK_RUNNING
>>> (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 from slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19)
>>> I0310 11:31:57.682199   919 master.cpp:4693] Updating the latest state
>>> of task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799
>>> of framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
>>>
>>>
>>> *from MESOS slave 10.195.30.21*
>>> I0310 11:31:28.750200  1074 slave.cpp:2623] master@10.195.30.19:5050
>>> exited
>>> W0310 11:31:28.750249  1074 slave.cpp:2626] Master disconnected! Waiting
>>> for a new master to be elected
>>> I0310 11:31:40.012516  1075 detector.cpp:138] Detected a new leader:
>>> (id='2')
>>> I0310 11:31:40.012899  1073 group.cpp:659] Trying to get
>>> '/mesos/info_0000000002' in ZooKeeper
>>> I0310 11:31:40.017143  1072 detector.cpp:433] A new leading master (UPID=
>>> master@10.195.30.21:5050) is detected
>>> I0310 11:31:40.017408  1072 slave.cpp:602] New master detected at
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.017546  1076 status_update_manager.cpp:171] Pausing
>>> sending status updates
>>> I0310 11:31:40.018673  1072 slave.cpp:627] No credentials provided.
>>> Attempting to register without authentication
>>> I0310 11:31:40.018689  1072 slave.cpp:638] Detecting new master
>>> I0310 11:31:40.785364  1075 slave.cpp:824] Re-registered with master
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.785398  1075 status_update_manager.cpp:178] Resuming
>>> sending status updates
>>> I0310 11:32:10.639506  1075 slave.cpp:3321] Current usage 12.27%. Max
>>> allowed age: 5.441217749539572days
>>>
>>>
>>> *from MESOS slave 10.195.30.19*
>>> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050
>>> exited
>>> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected! Waiting
>>> for a new master to be elected
>>> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader:
>>> (id='2')
>>> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get
>>> '/mesos/info_0000000002' in ZooKeeper
>>> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master (UPID=
>>> master@10.195.30.21:5050) is detected
>>> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided.
>>> Attempting to register without authentication
>>> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
>>> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing
>>> sending status updates
>>> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming
>>> sending status updates
>>> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 in work directory
>>> '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>>> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for
>>> executor
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework '20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container
>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and
>>> executor
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of
>>> framework '20150310-112310-354337546-5050-895-0000'
>>> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to
>>> '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
>>> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>> framework '20150310-112310-354337546-5050-895-0000' in container
>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>>> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for
>>> executor
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>>> 10.195.30.19:56574
>>> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>> executor
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update
>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>>> 10.195.30.19:56574
>>> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received
>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.678411 24461 status_update_manager.hpp:346] Checkpointing
>>> UPDATE for status update TASK_RUNNING (UUID:
>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update
>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 to
>>> master@10.195.30.21:5050
>>> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for
>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 to executor(1)@
>>> 10.195.30.19:56574
>>> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received
>>> status update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e)
>>> for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.689028 24461 status_update_manager.hpp:346] Checkpointing
>>> ACK for status update TASK_RUNNING (UUID:
>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to 204
>>> at
>>> /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated
>>> 'memory.soft_limit_in_bytes' to 160MB for container
>>> a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated
>>> 'memory.limit_in_bytes' to 160MB at
>>> /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max
>>> allowed age: 5.555425200437824days
>>>
>>>
>>> *From previous Marathon leader 10.195.30.21 <http://10.195.30.21>:*
>>> I0310 11:31:40.017248  1115 detector.cpp:138] Detected a new leader:
>>> (id='2')
>>> I0310 11:31:40.017334  1115 group.cpp:659] Trying to get
>>> '/mesos/info_0000000002' in ZooKeeper
>>> I0310 11:31:40.017727  1115 detector.cpp:433] A new leading master (UPID=
>>> master@10.195.30.21:5050) is detected
>>> [2015-03-10 11:31:40,017] WARN Disconnected
>>> (mesosphere.marathon.MarathonScheduler:224)
>>> [2015-03-10 11:31:40,019] INFO Abdicating
>>> (mesosphere.marathon.MarathonSchedulerService:312)
>>> [2015-03-10 11:31:40,019] INFO Defeat leadership
>>> (mesosphere.marathon.MarathonSchedulerService:285)
>>> [INFO] [03/10/2015 11:31:40.019]
>>> [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b]
>>> POSTing to all endpoints.
>>> [INFO] [03/10/2015 11:31:40.019]
>>> [marathon-akka.actor.default-dispatcher-5]
>>> [akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
>>> [2015-03-10 11:31:40,021] INFO Stopping driver
>>> (mesosphere.marathon.MarathonSchedulerService:221)
>>> I0310 11:31:40.022001  1115 sched.cpp:1286] Asked to stop the driver
>>> [2015-03-10 11:31:40,024] INFO Setting framework ID to
>>> 20150310-112310-320783114-5050-24289-0000
>>> (mesosphere.marathon.MarathonSchedulerService:73)
>>> I0310 11:31:40.026274  1115 sched.cpp:234] New master detected at
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.026418  1115 sched.cpp:242] No credentials provided.
>>> Attempting to register without authentication
>>> I0310 11:31:40.026458  1115 sched.cpp:752] Stopping framework
>>> '20150310-112310-320783114-5050-24289-0000'
>>> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing
>>> optional abdication command.
>>> (mesosphere.marathon.MarathonSchedulerService:192)
>>> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface)
>>> (mesosphere.marathon.MarathonSchedulerService:246)
>>> [2015-03-10 11:31:40,032] INFO Defeat leadership
>>> (mesosphere.marathon.MarathonSchedulerService:285)
>>> [2015-03-10 11:31:40,032] INFO Stopping driver
>>> (mesosphere.marathon.MarathonSchedulerService:221)
>>> I0310 11:31:40.032588  1107 sched.cpp:1286] Asked to stop the driver
>>> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500
>>> milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334)
>>> [2015-03-10 11:31:40,033] INFO Setting framework ID to
>>> 20150310-112310-320783114-5050-24289-0000
>>> (mesosphere.marathon.MarathonSchedulerService:73)
>>> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is
>>> not a candidate for leader, current voting: [member_0000000001,
>>> member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
>>> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering
>>> leadership (mesosphere.marathon.MarathonSchedulerService:341)
>>> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003
>>> (com.twitter.common.zookeeper.Group:426)
>>> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is
>>> not a candidate for leader, current voting: [member_0000000001,
>>> member_0000000002, member_0000000003]
>>> (com.twitter.common.zookeeper.CandidateImpl:144)
>>> [2015-03-10 11:31:40,568] INFO Candidate
>>> /marathon/leader/member_0000000003 waiting for the next leader election,
>>> current voting: [member_0000000001, member_0000000002, member_0000000003]
>>> (com.twitter.common.zookeeper.CandidateImpl:165)
>>>
>>>
>>> *From new Marathon leader 10.195.30.20 <http://10.195.30.20>:*
>>> [2015-03-10 11:31:40,029] INFO Candidate
>>> /marathon/leader/member_0000000001 is now leader of group:
>>> [member_0000000001, member_0000000002]
>>> (com.twitter.common.zookeeper.CandidateImpl:152)
>>> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface)
>>> (mesosphere.marathon.MarathonSchedulerService:253)
>>> [2015-03-10 11:31:40,044] INFO Elect leadership
>>> (mesosphere.marathon.MarathonSchedulerService:299)
>>> [2015-03-10 11:31:40,044] INFO Running driver
>>> (mesosphere.marathon.MarathonSchedulerService:184)
>>> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712:
>>> Client environment:zookeeper.version=zookeeper C client 3.4.5
>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716:
>>> Client environment:host.name=srv-d2u-9-virtip20
>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723:
>>> Client environment:os.name=Linux
>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724:
>>> Client environment:os.arch=3.13.0-44-generic
>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725:
>>> Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733:
>>> Client environment:user.name=(null)
>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741:
>>> Client environment:user.home=/root
>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753:
>>> Client environment:user.dir=/
>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786:
>>> Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,
>>> 10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0
>>> sessionId=0 sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
>>> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff
>>> (mesosphere.marathon.MarathonSchedulerService:329)
>>> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703:
>>> initiated connection to server [10.195.30.19:2181]
>>> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750:
>>> session establishment complete on server [10.195.30.19:2181],
>>> sessionId=0x14c0335ad7e000d, negotiated timeout=10000
>>> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@
>>> 10.195.30.20:45771) connected to ZooKeeper
>>> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations:
>>> queue size (joins, cancels, datas) = (0, 0, 0)
>>> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path
>>> '/mesos' in ZooKeeper
>>> [INFO] [03/10/2015 11:31:40.047]
>>> [marathon-akka.actor.default-dispatcher-2]
>>> [akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
>>> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader:
>>> (id='2')
>>> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get
>>> '/mesos/info_0000000002' in ZooKeeper
>>> [2015-03-10 11:31:40,053] INFO Migration successfully applied for
>>> version Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
>>> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master (UPID=
>>> master@10.195.30.21:5050) is detected
>>> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided.
>>> Attempting to register without authentication
>>> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with
>>> 20150310-112310-354337546-5050-895-0000
>>> [2015-03-10 11:31:40,056] INFO Registered as
>>> 20150310-112310-354337546-5050-895-0000 to master
>>> '20150310-112310-354337546-5050-895'
>>> (mesosphere.marathon.MarathonScheduler:72)
>>> [2015-03-10 11:31:40,063] INFO Stored framework ID
>>> '20150310-112310-354337546-5050-895-0000'
>>> (mesosphere.mesos.util.FrameworkIdUtil:49)
>>> [INFO] [03/10/2015 11:31:40.065]
>>> [marathon-akka.actor.default-dispatcher-6]
>>> [akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
>>> [INFO] [03/10/2015 11:31:40.067]
>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>>> POSTing to all endpoints.
>>> ...
>>> ...
>>> ...
>>> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps
>>> (mesosphere.marathon.SchedulerActions:403)
>>> [INFO] [03/10/2015 11:31:55.053]
>>> [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>>> Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
>>> encountered. This logging can be turned off or adjusted with configuration
>>> settings 'akka.log-dead-letters' and
>>> 'akka.log-dead-letters-during-shutdown'.
>>> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
>>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>>> [2015-03-10 11:31:55,064] INFO Received status update for task
>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
>>> TASK_LOST (Reconciliation: Task is unknown to the slave)
>>> (mesosphere.marathon.MarathonScheduler:148)
>>> [2015-03-10 11:31:55,069] INFO Need to scale
>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>>> (mesosphere.marathon.SchedulerActions:488)
>>> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
>>> /ffaas-backoffice-app-nopersist (0 queued)
>>> (mesosphere.marathon.SchedulerActions:494)
>>> [2015-03-10 11:31:55,069] INFO Task
>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
>>> expunged and removed from TaskTracker
>>> (mesosphere.marathon.tasks.TaskTracker:107)
>>> [2015-03-10 11:31:55,070] INFO Sending event notification.
>>> (mesosphere.marathon.MarathonScheduler:262)
>>> [INFO] [03/10/2015 11:31:55.072]
>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>>> POSTing to all endpoints.
>>> [2015-03-10 11:31:55,073] INFO Need to scale
>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>>> (mesosphere.marathon.SchedulerActions:488)
>>> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
>>> /ffaas-backoffice-app-nopersist. Not scaling.
>>> (mesosphere.marathon.SchedulerActions:498)
>>> ...
>>> ...
>>> ...
>>> [2015-03-10 11:31:57,682] INFO Received status update for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>> TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
>>> [2015-03-10 11:31:57,694] INFO Sending event notification.
>>> (mesosphere.marathon.MarathonScheduler:262)
>>> [INFO] [03/10/2015 11:31:57.694]
>>> [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
>>> POSTing to all endpoints.
>>> ...
>>> ...
>>> ...
>>> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
>>> (mesosphere.marathon.tasks.TaskTracker:170)
>>> [INFO] [03/10/2015 11:36:55.050]
>>> [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters]
>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>>> Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
>>> encountered. This logging can be turned off or adjusted with configuration
>>> settings 'akka.log-dead-letters' and
>>> 'akka.log-dead-letters-during-shutdown'.
>>> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
>>> (mesosphere.marathon.SchedulerActions:403)
>>> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
>>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>>> [2015-03-10 11:36:55,063] INFO Received status update for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>> TASK_RUNNING (Reconciliation: Latest task state)
>>> (mesosphere.marathon.MarathonScheduler:148)
>>> [2015-03-10 11:36:55,065] INFO Received status update for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>> TASK_RUNNING (Reconciliation: Latest task state)
>>> (mesosphere.marathon.MarathonScheduler:148)
>>> [2015-03-10 11:36:55,066] INFO Already running 1 instances of
>>> /ffaas-backoffice-app-nopersist. Not scaling.
>>> (mesosphere.marathon.SchedulerActions:512)
>>>
>>>
>>>
>>> -- End of logs
>>>
>>>
>>>
>>> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <ad...@mesosphere.io>:
>>>
>>>> This is certainly not the expected/desired behavior when failing over a
>>>> mesos master in HA mode. In addition to the master logs Alex requested, can
>>>> you also provide relevant portions of the slave logs for these tasks? If
>>>> the slave processes themselves never failed over, checkpointing and slave
>>>> recovery should be irrelevant. Are you running the mesos-slave itself
>>>> inside a Docker, or any other non-traditional setup?
>>>>
>>>> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
>>>> defaults to "reconnect", and --strict defaults to true, so none of those
>>>> are necessary.
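
With those defaults, the mesos-slave command line quoted further down in this
thread can be trimmed to just the flags that change behaviour; a sketch of the
equivalent invocation, keeping the same values as in the original setup:

    # --checkpoint, --recover=reconnect and --strict left out: per the above,
    # they are already the defaults in 0.21.x
    /usr/sbin/mesos-slave \
      --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos \
      --containerizers=docker,mesos \
      --executor_registration_timeout=5mins \
      --hostname=10.195.30.19 --ip=10.195.30.19 \
      --isolation=cgroups/cpu,cgroups/mem \
      --recovery_timeout=120mins \
      --resources=ports:[31000-32000,80,443]
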
>>>>
>>>> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <al...@mesosphere.io>
>>>> wrote:
>>>>
>>>>> Geoffroy,
>>>>>
>>>>> could you please provide master logs (both from killed and taking over
>>>>> masters)?
>>>>>
>>>>> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <
>>>>> geoffroy.jabouley@gmail.com> wrote:
>>>>>
>>>>>> Hello
>>>>>>
>>>>>> we are facing some unexpecting issues when testing high availability
>>>>>> behaviors of our mesos cluster.
>>>>>>
>>>>>> *Our use case:*
>>>>>>
>>>>>> *State*: the mesos cluster is up (3 machines), 1 docker task is
>>>>>> running on each slave (started from marathon)
>>>>>>
>>>>>> *Action*: stop the mesos master leader process
>>>>>>
>>>>>> *Expected*: mesos master leader has changed, *active tasks remain
>>>>>> unchanged*
>>>>>>
>>>>>> *Seen*: mesos master leader has changed, *all active tasks are now
>>>>>> FAILED but docker containers are still running*, marathon detects
>>>>>> FAILED tasks and starts new tasks. We end with 2 docker containers running
>>>>>> on each machine, but only one is linked to a RUNNING mesos task.
>>>>>>
>>>>>>
>>>>>> Is the seen behavior correct?
>>>>>>
>>>>>> Have we misunderstood the high availability concept? We thought that
>>>>>> doing this use case would not have any impact on the current cluster state
>>>>>> (except leader re-election)
>>>>>>
>>>>>> Thanks in advance for your help
>>>>>> Regards
>>>>>>
>>>>>> ---------------------------------------------------
>>>>>>
>>>>>> our setup is the following:
>>>>>> 3 identical mesos nodes with:
>>>>>>     + zookeeper
>>>>>>     + docker 1.5
>>>>>>     + mesos master 0.21.1 configured in HA mode
>>>>>>     + mesos slave 0.21.1 configured with checkpointing, strict and
>>>>>> reconnect
>>>>>>     + marathon 0.8.0 configured in HA mode with checkpointing
>>>>>>
>>>>>> ---------------------------------------------------
>>>>>>
>>>>>> Command lines:
>>>>>>
>>>>>>
>>>>>> *mesos-master*usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
>>>>>> --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
>>>>>> --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
>>>>>>
>>>>>> *mesos-slave*
>>>>>> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,
>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint
>>>>>> --containerizers=docker,mesos --executor_registration_timeout=5mins
>>>>>> --hostname=10.195.30.19 --ip=10.195.30.19
>>>>>> --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
>>>>>> --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>>>>>>
>>>>>> *marathon*
>>>>>> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
>>>>>> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
>>>>>> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
>>>>>> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
>>>>>> --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
>>>>>> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
>>>>>> 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

Posted by Geoffroy Jabouley <ge...@gmail.com>.
Thanks Alex for your answer. I will have a look.

Would it be better to (cross-)post this discussion on the marathon mailing
list?

Anyway, the issue is "fixed" for 0.8.0, which is the version I'm using.

2015-03-11 22:18 GMT+01:00 Alex Rukletsov <al...@mesosphere.io>:

> Geoffroy,
>
> most probably you're hitting this bug:
> https://github.com/mesosphere/marathon/issues/1063. The problem is that
> Marathon can register instead of re-registering when a master fails
> over. From the master's point of view, it's a new framework, which is why
> the previous task is gone and a new one (that technically belongs to a new
> framework) is started. You can see that the frameworks have two different IDs
> (check lines 11:31:40.055496 and 11:31:40.785038) in your example.
>
> Hope that helps,
> Alex
>
> On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley <
> geoffroy.jabouley@gmail.com> wrote:
>
>> Hello
>>
>> thanks for your interest. Following are the requested logs, which will
>> result in a pretty big mail.
>>
>> Mesos/Marathon are *NOT running inside docker*; we only use Docker as
>> our mesos containerizer.
>>
>> For reminder, here is the use case performed to get the logs file:
>>
>> --------------------------------
>>
>> Our cluster: 3 identical mesos nodes with:
>>     + zookeeper
>>     + docker 1.5
>>     + mesos master 0.21.1 configured in HA mode
>>     + mesos slave 0.21.1 configured with checkpointing, strict and
>> reconnect
>>     + marathon 0.8.0 configured in HA mode with checkpointing
>>
>> --------------------------------
>>
>> *Begin State: *
>> + the mesos cluster is up (3 machines)
>> + mesos master leader is 10.195.30.19
>> + marathon leader is 10.195.30.21
>> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
>>
>> *Action*: stop the mesos master leader process (sudo stop mesos-master)
>>
>> *Expected*: mesos master leader has changed, active tasks / frameworks
>> remain unchanged
>>
>> *End state: *
>> + mesos master leader *has changed, now 10.195.30.21*
>> + the previously running APPTASK on slave 10.195.30.21 has "disappeared"
>> (no longer showing in the mesos UI), but *its docker container is still
>> running*
>> + a *new APPTASK is now running on slave 10.195.30.19*
>> + marathon framework "registration time" in mesos UI shows "Just now"
>> + marathon leader *has changed, now 10.195.30.20*
>>
>>
>> --------------------------------
>>
>> Now come the 6 requested logs, which might contain interesting/relevant
>> information, but as a newcomer to mesos I find them hard to read...
>>
>>
>> *from previous MESOS master leader 10.195.30.19 <http://10.195.30.19>:*
>> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM
>> from process 1 of user 0; exiting
>>
>>
>> *from new MESOS master leader 10.195.30.21 <http://10.195.30.21>:*
>> I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader:
>> (id='2')
>> I0310 11:31:40.011823   922 group.cpp:659] Trying to get
>> '/mesos/info_0000000002' in ZooKeeper
>> I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group memberships
>> changed
>> I0310 11:31:40.015847   915 group.cpp:659] Trying to get
>> '/mesos/log_replicas/0000000000' in ZooKeeper
>> I0310 11:31:40.016047   922 detector.cpp:433] A new leading master (UPID=
>> master@10.195.30.21:5050) is detected
>> I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader is
>> master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
>> I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading
>> master!
>> I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
>> I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
>> I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
>> I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit
>> promise request with proposal 2
>> I0310 11:31:40.017503   915 group.cpp:659] Trying to get
>> '/mesos/log_replicas/0000000003' in ZooKeeper
>> I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8
>> bytes) to leveldb took 893672ns
>> I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
>> I0310 11:31:40.018817   915 network.hpp:466] ZooKeeper group PIDs: {
>> log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
>> I0310 11:31:40.023022   923 coordinator.cpp:230] Coordinator attemping to
>> fill missing position
>> I0310 11:31:40.023110   923 log.cpp:672] Writer started with ending
>> position 8
>> I0310 11:31:40.023293   923 leveldb.cpp:438] Reading position from
>> leveldb took 13195ns
>> I0310 11:31:40.023309   923 leveldb.cpp:438] Reading position from
>> leveldb took 3120ns
>> I0310 11:31:40.023619   922 registrar.cpp:346] Successfully fetched the
>> registry (610B) in 7.385856ms
>> I0310 11:31:40.023679   922 registrar.cpp:445] Applied 1 operations in
>> 9263ns; attempting to update the 'registry'
>> I0310 11:31:40.024238   922 log.cpp:680] Attempting to append 647 bytes
>> to the log
>> I0310 11:31:40.024279   923 coordinator.cpp:340] Coordinator attempting
>> to write APPEND action at position 9
>> I0310 11:31:40.024435   923 replica.cpp:508] Replica received write
>> request for position 9
>> I0310 11:31:40.025707   923 leveldb.cpp:343] Persisting action (666
>> bytes) to leveldb took 1.259338ms
>> I0310 11:31:40.025722   923 replica.cpp:676] Persisted action at 9
>> I0310 11:31:40.026074   923 replica.cpp:655] Replica received learned
>> notice for position 9
>> I0310 11:31:40.026495   923 leveldb.cpp:343] Persisting action (668
>> bytes) to leveldb took 404795ns
>> I0310 11:31:40.026507   923 replica.cpp:676] Persisted action at 9
>> I0310 11:31:40.026511   923 replica.cpp:661] Replica learned APPEND
>> action at position 9
>> I0310 11:31:40.026726   923 registrar.cpp:490] Successfully updated the
>> 'registry' in 3.029248ms
>> I0310 11:31:40.026765   923 registrar.cpp:376] Successfully recovered
>> registrar
>> I0310 11:31:40.026814   923 log.cpp:699] Attempting to truncate the log
>> to 9
>> I0310 11:31:40.026880   923 master.cpp:1121] Recovered 3 slaves from the
>> Registry (608B) ; allowing 1days for slaves to re-register
>> I0310 11:31:40.026897   923 coordinator.cpp:340] Coordinator attempting
>> to write TRUNCATE action at position 10
>> I0310 11:31:40.026988   923 replica.cpp:508] Replica received write
>> request for position 10
>> I0310 11:31:40.027640   923 leveldb.cpp:343] Persisting action (16 bytes)
>> to leveldb took 641018ns
>> I0310 11:31:40.027652   923 replica.cpp:676] Persisted action at 10
>> I0310 11:31:40.030848   923 replica.cpp:655] Replica received learned
>> notice for position 10
>> I0310 11:31:40.031883   923 leveldb.cpp:343] Persisting action (18 bytes)
>> to leveldb took 1.008914ms
>> I0310 11:31:40.031963   923 leveldb.cpp:401] Deleting ~2 keys from
>> leveldb took 46724ns
>> I0310 11:31:40.031977   923 replica.cpp:676] Persisted action at 10
>> I0310 11:31:40.031986   923 replica.cpp:661] Replica learned TRUNCATE
>> action at position 10
>> I0310 11:31:40.055415   918 master.cpp:1383] Received registration
>> request for framework 'marathon' at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.055496   918 master.cpp:1447] Registering framework
>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.055642   919 hierarchical_allocator_process.hpp:329] Added
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:40.189151   919 master.cpp:3246] Re-registering slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19)
>> I0310 11:31:40.189280   919 registrar.cpp:445] Applied 1 operations in
>> 15452ns; attempting to update the 'registry'
>> I0310 11:31:40.189949   919 log.cpp:680] Attempting to append 647 bytes
>> to the log
>> I0310 11:31:40.189978   919 coordinator.cpp:340] Coordinator attempting
>> to write APPEND action at position 11
>> I0310 11:31:40.190112   919 replica.cpp:508] Replica received write
>> request for position 11
>> I0310 11:31:40.190563   919 leveldb.cpp:343] Persisting action (666
>> bytes) to leveldb took 437440ns
>> I0310 11:31:40.190577   919 replica.cpp:676] Persisted action at 11
>> I0310 11:31:40.191249   921 replica.cpp:655] Replica received learned
>> notice for position 11
>> I0310 11:31:40.192159   921 leveldb.cpp:343] Persisting action (668
>> bytes) to leveldb took 892767ns
>> I0310 11:31:40.192178   921 replica.cpp:676] Persisted action at 11
>> I0310 11:31:40.192184   921 replica.cpp:661] Replica learned APPEND
>> action at position 11
>> I0310 11:31:40.192350   921 registrar.cpp:490] Successfully updated the
>> 'registry' in 3.0528ms
>> I0310 11:31:40.192387   919 log.cpp:699] Attempting to truncate the log
>> to 11
>> I0310 11:31:40.192415   919 coordinator.cpp:340] Coordinator attempting
>> to write TRUNCATE action at position 12
>> I0310 11:31:40.192539   915 replica.cpp:508] Replica received write
>> request for position 12
>> I0310 11:31:40.192600   921 master.cpp:3314] Re-registered slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148
>> I0310 11:31:40.192680   917 hierarchical_allocator_process.hpp:442] Added
>> slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with
>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>> available)
>> I0310 11:31:40.192847   917 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.193164   915 leveldb.cpp:343] Persisting action (16 bytes)
>> to leveldb took 610664ns
>> I0310 11:31:40.193181   915 replica.cpp:676] Persisted action at 12
>> I0310 11:31:40.193568   915 replica.cpp:655] Replica received learned
>> notice for position 12
>> I0310 11:31:40.193948   915 leveldb.cpp:343] Persisting action (18 bytes)
>> to leveldb took 364062ns
>> I0310 11:31:40.193979   915 leveldb.cpp:401] Deleting ~2 keys from
>> leveldb took 12256ns
>> I0310 11:31:40.193985   915 replica.cpp:676] Persisted action at 12
>> I0310 11:31:40.193990   915 replica.cpp:661] Replica learned TRUNCATE
>> action at position 12
>> I0310 11:31:40.248615   915 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O0 ] on slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.248744   915 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S1 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:40.774416   915 master.cpp:3246] Re-registering slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21)
>> I0310 11:31:40.774976   915 registrar.cpp:445] Applied 1 operations in
>> 42342ns; attempting to update the 'registry'
>> I0310 11:31:40.777273   921 log.cpp:680] Attempting to append 647 bytes
>> to the log
>> I0310 11:31:40.777436   921 coordinator.cpp:340] Coordinator attempting
>> to write APPEND action at position 13
>> I0310 11:31:40.777989   921 replica.cpp:508] Replica received write
>> request for position 13
>> I0310 11:31:40.779558   921 leveldb.cpp:343] Persisting action (666
>> bytes) to leveldb took 1.513714ms
>> I0310 11:31:40.779633   921 replica.cpp:676] Persisted action at 13
>> I0310 11:31:40.781821   919 replica.cpp:655] Replica received learned
>> notice for position 13
>> I0310 11:31:40.784417   919 leveldb.cpp:343] Persisting action (668
>> bytes) to leveldb took 2.542036ms
>> I0310 11:31:40.784446   919 replica.cpp:676] Persisted action at 13
>> I0310 11:31:40.784452   919 replica.cpp:661] Replica learned APPEND
>> action at position 13
>> I0310 11:31:40.784711   920 registrar.cpp:490] Successfully updated the
>> 'registry' in 9.68192ms
>> I0310 11:31:40.784762   917 log.cpp:699] Attempting to truncate the log
>> to 13
>> I0310 11:31:40.784808   920 coordinator.cpp:340] Coordinator attempting
>> to write TRUNCATE action at position 14
>> I0310 11:31:40.784865   917 master.hpp:877] Adding task
>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with
>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>> 20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
>> I0310 11:31:40.784955   919 replica.cpp:508] Replica received write
>> request for position 14
>> W0310 11:31:40.785038   917 master.cpp:4468] Possibly orphaned task
>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of
>> framework 20150310-112310-320783114-5050-24289-0000 running on slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21)
>> I0310 11:31:40.785105   917 master.cpp:3314] Re-registered slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148
>> I0310 11:31:40.785162   920 hierarchical_allocator_process.hpp:442] Added
>> slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with
>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>> ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148
>> available)
>> I0310 11:31:40.785679   921 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.786429   919 leveldb.cpp:343] Persisting action (16 bytes)
>> to leveldb took 1.454211ms
>> I0310 11:31:40.786455   919 replica.cpp:676] Persisted action at 14
>> I0310 11:31:40.786782   919 replica.cpp:655] Replica received learned
>> notice for position 14
>> I0310 11:31:40.787833   919 leveldb.cpp:343] Persisting action (18 bytes)
>> to leveldb took 1.027014ms
>> I0310 11:31:40.787873   919 leveldb.cpp:401] Deleting ~2 keys from
>> leveldb took 14085ns
>> I0310 11:31:40.787883   919 replica.cpp:676] Persisted action at 14
>> I0310 11:31:40.787889   919 replica.cpp:661] Replica learned TRUNCATE
>> action at position 14
>> I0310 11:31:40.792536   922 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O1 ] on slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.792670   922 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S2 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:40.819602   921 master.cpp:3246] Re-registering slave
>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>> (10.195.30.20)
>> I0310 11:31:40.819736   921 registrar.cpp:445] Applied 1 operations in
>> 16656ns; attempting to update the 'registry'
>> I0310 11:31:40.820617   921 log.cpp:680] Attempting to append 647 bytes
>> to the log
>> I0310 11:31:40.820726   918 coordinator.cpp:340] Coordinator attempting
>> to write APPEND action at position 15
>> I0310 11:31:40.820938   918 replica.cpp:508] Replica received write
>> request for position 15
>> I0310 11:31:40.821641   918 leveldb.cpp:343] Persisting action (666
>> bytes) to leveldb took 670583ns
>> I0310 11:31:40.821663   918 replica.cpp:676] Persisted action at 15
>> I0310 11:31:40.822265   917 replica.cpp:655] Replica received learned
>> notice for position 15
>> I0310 11:31:40.823463   917 leveldb.cpp:343] Persisting action (668
>> bytes) to leveldb took 1.178687ms
>> I0310 11:31:40.823490   917 replica.cpp:676] Persisted action at 15
>> I0310 11:31:40.823498   917 replica.cpp:661] Replica learned APPEND
>> action at position 15
>> I0310 11:31:40.823755   917 registrar.cpp:490] Successfully updated the
>> 'registry' in 3.97696ms
>> I0310 11:31:40.823823   917 log.cpp:699] Attempting to truncate the log
>> to 15
>> I0310 11:31:40.824147   922 coordinator.cpp:340] Coordinator attempting
>> to write TRUNCATE action at position 16
>> I0310 11:31:40.824482   922 hierarchical_allocator_process.hpp:442] Added
>> slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with
>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>> available)
>> I0310 11:31:40.824597   921 replica.cpp:508] Replica received write
>> request for position 16
>> I0310 11:31:40.824128   917 master.cpp:3314] Re-registered slave
>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>> (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148
>> I0310 11:31:40.824975   917 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.831900   921 leveldb.cpp:343] Persisting action (16 bytes)
>> to leveldb took 7.228682ms
>> I0310 11:31:40.832031   921 replica.cpp:676] Persisted action at 16
>> I0310 11:31:40.832456   917 replica.cpp:655] Replica received learned
>> notice for position 16
>> I0310 11:31:40.835178   917 leveldb.cpp:343] Persisting action (18 bytes)
>> to leveldb took 2.674392ms
>> I0310 11:31:40.835297   917 leveldb.cpp:401] Deleting ~2 keys from
>> leveldb took 45220ns
>> I0310 11:31:40.835322   917 replica.cpp:676] Persisted action at 16
>> I0310 11:31:40.835341   917 replica.cpp:661] Replica learned TRUNCATE
>> action at position 16
>> I0310 11:31:40.838281   923 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O2 ] on slave
>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.838389   923 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S0 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:40.948725   919 http.cpp:344] HTTP request for
>> '/master/redirect'
>> I0310 11:31:41.479118   918 http.cpp:478] HTTP request for
>> '/master/state.json'
>> I0310 11:31:45.368074   918 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:45.385144   917 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O3 ] on slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:45.385292   917 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S1 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:46.368450   917 master.cpp:3843] Sending 2 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:46.375222   920 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O4 ] on slave
>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:46.375360   920 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O5 ] on slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:46.375530   920 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S0 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:46.375599   920 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S2 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:48.031230   915 http.cpp:478] HTTP request for
>> '/master/state.json'
>> I0310 11:31:51.374285   922 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:51.379391   921 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O6 ] on slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:51.379487   921 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S1 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:51.482094   923 http.cpp:478] HTTP request for
>> '/master/state.json'
>> I0310 11:31:52.375326   917 master.cpp:3843] Sending 2 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:52.391376   919 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O7 ] on slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:52.391512   919 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S2 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:52.391659   921 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O8 ] on slave
>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:52.391751   921 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S0 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:55.062060   918 master.cpp:3611] Performing explicit task
>> state reconciliation for 1 tasks of framework
>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:55.062588   919 master.cpp:3556] Performing implicit task
>> state reconciliation for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:56.140990   923 http.cpp:344] HTTP request for
>> '/master/redirect'
>> I0310 11:31:57.379288   918 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:57.430888   918 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O9 ] on slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:57.431068   918 master.hpp:877] Adding task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with
>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>> 20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
>> I0310 11:31:57.431089   918 master.cpp:2503] Launching task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with
>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19)
>> I0310 11:31:57.431205   918 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S1 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.682133   919 master.cpp:3446] Forwarding status update
>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.682186   919 master.cpp:3418] Status update TASK_RUNNING
>> (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 from slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19)
>> I0310 11:31:57.682199   919 master.cpp:4693] Updating the latest state of
>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
>>
>>
>> *from MESOS slave 10.195.30.21*
>> I0310 11:31:28.750200  1074 slave.cpp:2623] master@10.195.30.19:5050
>> exited
>> W0310 11:31:28.750249  1074 slave.cpp:2626] Master disconnected! Waiting
>> for a new master to be elected
>> I0310 11:31:40.012516  1075 detector.cpp:138] Detected a new leader:
>> (id='2')
>> I0310 11:31:40.012899  1073 group.cpp:659] Trying to get
>> '/mesos/info_0000000002' in ZooKeeper
>> I0310 11:31:40.017143  1072 detector.cpp:433] A new leading master (UPID=
>> master@10.195.30.21:5050) is detected
>> I0310 11:31:40.017408  1072 slave.cpp:602] New master detected at
>> master@10.195.30.21:5050
>> I0310 11:31:40.017546  1076 status_update_manager.cpp:171] Pausing
>> sending status updates
>> I0310 11:31:40.018673  1072 slave.cpp:627] No credentials provided.
>> Attempting to register without authentication
>> I0310 11:31:40.018689  1072 slave.cpp:638] Detecting new master
>> I0310 11:31:40.785364  1075 slave.cpp:824] Re-registered with master
>> master@10.195.30.21:5050
>> I0310 11:31:40.785398  1075 status_update_manager.cpp:178] Resuming
>> sending status updates
>> I0310 11:32:10.639506  1075 slave.cpp:3321] Current usage 12.27%. Max
>> allowed age: 5.441217749539572days
>>
>>
>> *from MESOS slave 10.195.30.19*
>> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050
>> exited
>> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected! Waiting
>> for a new master to be elected
>> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader:
>> (id='2')
>> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get
>> '/mesos/info_0000000002' in ZooKeeper
>> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master (UPID=
>> master@10.195.30.21:5050) is detected
>> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at
>> master@10.195.30.21:5050
>> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided.
>> Attempting to register without authentication
>> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
>> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing
>> sending status updates
>> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master
>> master@10.195.30.21:5050
>> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming
>> sending status updates
>> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 in work directory
>> '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for
>> executor
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework '20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container
>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and
>> executor
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of
>> framework '20150310-112310-354337546-5050-895-0000'
>> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to
>> '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
>> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>> framework '20150310-112310-354337546-5050-895-0000' in container
>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for executor
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>> 10.195.30.19:56574
>> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>> executor
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update
>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>> 10.195.30.19:56574
>> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received
>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.678411 24461 status_update_manager.hpp:346] Checkpointing
>> UPDATE for status update TASK_RUNNING (UUID:
>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update
>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 to
>> master@10.195.30.21:5050
>> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for
>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 to executor(1)@
>> 10.195.30.19:56574
>> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received
>> status update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e)
>> for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.689028 24461 status_update_manager.hpp:346] Checkpointing
>> ACK for status update TASK_RUNNING (UUID:
>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to 204
>> at
>> /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated
>> 'memory.soft_limit_in_bytes' to 160MB for container
>> a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated
>> 'memory.limit_in_bytes' to 160MB at
>> /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max
>> allowed age: 5.555425200437824days
>>
>>
>> *From previous Marathon leader 10.195.30.21:*
>> I0310 11:31:40.017248  1115 detector.cpp:138] Detected a new leader:
>> (id='2')
>> I0310 11:31:40.017334  1115 group.cpp:659] Trying to get
>> '/mesos/info_0000000002' in ZooKeeper
>> I0310 11:31:40.017727  1115 detector.cpp:433] A new leading master (UPID=
>> master@10.195.30.21:5050) is detected
>> [2015-03-10 11:31:40,017] WARN Disconnected
>> (mesosphere.marathon.MarathonScheduler:224)
>> [2015-03-10 11:31:40,019] INFO Abdicating
>> (mesosphere.marathon.MarathonSchedulerService:312)
>> [2015-03-10 11:31:40,019] INFO Defeat leadership
>> (mesosphere.marathon.MarathonSchedulerService:285)
>> [INFO] [03/10/2015 11:31:40.019]
>> [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b]
>> POSTing to all endpoints.
>> [INFO] [03/10/2015 11:31:40.019]
>> [marathon-akka.actor.default-dispatcher-5]
>> [akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
>> [2015-03-10 11:31:40,021] INFO Stopping driver
>> (mesosphere.marathon.MarathonSchedulerService:221)
>> I0310 11:31:40.022001  1115 sched.cpp:1286] Asked to stop the driver
>> [2015-03-10 11:31:40,024] INFO Setting framework ID to
>> 20150310-112310-320783114-5050-24289-0000
>> (mesosphere.marathon.MarathonSchedulerService:73)
>> I0310 11:31:40.026274  1115 sched.cpp:234] New master detected at
>> master@10.195.30.21:5050
>> I0310 11:31:40.026418  1115 sched.cpp:242] No credentials provided.
>> Attempting to register without authentication
>> I0310 11:31:40.026458  1115 sched.cpp:752] Stopping framework
>> '20150310-112310-320783114-5050-24289-0000'
>> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing
>> optional abdication command.
>> (mesosphere.marathon.MarathonSchedulerService:192)
>> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface)
>> (mesosphere.marathon.MarathonSchedulerService:246)
>> [2015-03-10 11:31:40,032] INFO Defeat leadership
>> (mesosphere.marathon.MarathonSchedulerService:285)
>> [2015-03-10 11:31:40,032] INFO Stopping driver
>> (mesosphere.marathon.MarathonSchedulerService:221)
>> I0310 11:31:40.032588  1107 sched.cpp:1286] Asked to stop the driver
>> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500
>> milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334)
>> [2015-03-10 11:31:40,033] INFO Setting framework ID to
>> 20150310-112310-320783114-5050-24289-0000
>> (mesosphere.marathon.MarathonSchedulerService:73)
>> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is
>> not a candidate for leader, current voting: [member_0000000001,
>> member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
>> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering leadership
>> (mesosphere.marathon.MarathonSchedulerService:341)
>> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003
>> (com.twitter.common.zookeeper.Group:426)
>> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is
>> not a candidate for leader, current voting: [member_0000000001,
>> member_0000000002, member_0000000003]
>> (com.twitter.common.zookeeper.CandidateImpl:144)
>> [2015-03-10 11:31:40,568] INFO Candidate
>> /marathon/leader/member_0000000003 waiting for the next leader election,
>> current voting: [member_0000000001, member_0000000002, member_0000000003]
>> (com.twitter.common.zookeeper.CandidateImpl:165)
>>
>>
>> *From new Marathon leader 10.195.30.20:*
>> [2015-03-10 11:31:40,029] INFO Candidate
>> /marathon/leader/member_0000000001 is now leader of group:
>> [member_0000000001, member_0000000002]
>> (com.twitter.common.zookeeper.CandidateImpl:152)
>> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface)
>> (mesosphere.marathon.MarathonSchedulerService:253)
>> [2015-03-10 11:31:40,044] INFO Elect leadership
>> (mesosphere.marathon.MarathonSchedulerService:299)
>> [2015-03-10 11:31:40,044] INFO Running driver
>> (mesosphere.marathon.MarathonSchedulerService:184)
>> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712:
>> Client environment:zookeeper.version=zookeeper C client 3.4.5
>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716:
>> Client environment:host.name=srv-d2u-9-virtip20
>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723:
>> Client environment:os.name=Linux
>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724:
>> Client environment:os.arch=3.13.0-44-generic
>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725:
>> Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733:
>> Client environment:user.name=(null)
>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741:
>> Client environment:user.home=/root
>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753:
>> Client environment:user.dir=/
>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786:
>> Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,
>> 10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0
>> sessionId=0 sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
>> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff
>> (mesosphere.marathon.MarathonSchedulerService:329)
>> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703:
>> initiated connection to server [10.195.30.19:2181]
>> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750:
>> session establishment complete on server [10.195.30.19:2181],
>> sessionId=0x14c0335ad7e000d, negotiated timeout=10000
>> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@
>> 10.195.30.20:45771) connected to ZooKeeper
>> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations:
>> queue size (joins, cancels, datas) = (0, 0, 0)
>> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path '/mesos'
>> in ZooKeeper
>> [INFO] [03/10/2015 11:31:40.047]
>> [marathon-akka.actor.default-dispatcher-2]
>> [akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
>> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader:
>> (id='2')
>> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get
>> '/mesos/info_0000000002' in ZooKeeper
>> [2015-03-10 11:31:40,053] INFO Migration successfully applied for version
>> Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
>> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master (UPID=
>> master@10.195.30.21:5050) is detected
>> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at
>> master@10.195.30.21:5050
>> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided.
>> Attempting to register without authentication
>> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with
>> 20150310-112310-354337546-5050-895-0000
>> [2015-03-10 11:31:40,056] INFO Registered as
>> 20150310-112310-354337546-5050-895-0000 to master
>> '20150310-112310-354337546-5050-895'
>> (mesosphere.marathon.MarathonScheduler:72)
>> [2015-03-10 11:31:40,063] INFO Stored framework ID
>> '20150310-112310-354337546-5050-895-0000'
>> (mesosphere.mesos.util.FrameworkIdUtil:49)
>> [INFO] [03/10/2015 11:31:40.065]
>> [marathon-akka.actor.default-dispatcher-6]
>> [akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
>> [INFO] [03/10/2015 11:31:40.067]
>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>> POSTing to all endpoints.
>> ...
>> ...
>> ...
>> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps
>> (mesosphere.marathon.SchedulerActions:403)
>> [INFO] [03/10/2015 11:31:55.053]
>> [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>> Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
>> encountered. This logging can be turned off or adjusted with configuration
>> settings 'akka.log-dead-letters' and
>> 'akka.log-dead-letters-during-shutdown'.
>> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>> [2015-03-10 11:31:55,064] INFO Received status update for task
>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
>> TASK_LOST (Reconciliation: Task is unknown to the slave)
>> (mesosphere.marathon.MarathonScheduler:148)
>> [2015-03-10 11:31:55,069] INFO Need to scale
>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>> (mesosphere.marathon.SchedulerActions:488)
>> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
>> /ffaas-backoffice-app-nopersist (0 queued)
>> (mesosphere.marathon.SchedulerActions:494)
>> [2015-03-10 11:31:55,069] INFO Task
>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
>> expunged and removed from TaskTracker
>> (mesosphere.marathon.tasks.TaskTracker:107)
>> [2015-03-10 11:31:55,070] INFO Sending event notification.
>> (mesosphere.marathon.MarathonScheduler:262)
>> [INFO] [03/10/2015 11:31:55.072]
>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>> POSTing to all endpoints.
>> [2015-03-10 11:31:55,073] INFO Need to scale
>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>> (mesosphere.marathon.SchedulerActions:488)
>> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
>> /ffaas-backoffice-app-nopersist. Not scaling.
>> (mesosphere.marathon.SchedulerActions:498)
>> ...
>> ...
>> ...
>> [2015-03-10 11:31:57,682] INFO Received status update for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>> TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
>> [2015-03-10 11:31:57,694] INFO Sending event notification.
>> (mesosphere.marathon.MarathonScheduler:262)
>> [INFO] [03/10/2015 11:31:57.694]
>> [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
>> POSTing to all endpoints.
>> ...
>> ...
>> ...
>> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
>> (mesosphere.marathon.tasks.TaskTracker:170)
>> [INFO] [03/10/2015 11:36:55.050]
>> [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters]
>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>> Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
>> encountered. This logging can be turned off or adjusted with configuration
>> settings 'akka.log-dead-letters' and
>> 'akka.log-dead-letters-during-shutdown'.
>> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
>> (mesosphere.marathon.SchedulerActions:403)
>> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>> [2015-03-10 11:36:55,063] INFO Received status update for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>> TASK_RUNNING (Reconciliation: Latest task state)
>> (mesosphere.marathon.MarathonScheduler:148)
>> [2015-03-10 11:36:55,065] INFO Received status update for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>> TASK_RUNNING (Reconciliation: Latest task state)
>> (mesosphere.marathon.MarathonScheduler:148)
>> [2015-03-10 11:36:55,066] INFO Already running 1 instances of
>> /ffaas-backoffice-app-nopersist. Not scaling.
>> (mesosphere.marathon.SchedulerActions:512)
>>
>>
>>
>> -- End of logs
>>
>>
>>
>> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <ad...@mesosphere.io>:
>>
>>> This is certainly not the expected/desired behavior when failing over a
>>> mesos master in HA mode. In addition to the master logs Alex requested, can
>>> you also provide relevant portions of the slave logs for these tasks? If
>>> the slave processes themselves never failed over, checkpointing and slave
>>> recovery should be irrelevant. Are you running the mesos-slave itself
>>> inside a Docker container, or in any other non-traditional setup?
>>>
>>> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
>>> defaults to "reconnect", and --strict defaults to true, so none of those
>>> are necessary.
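>>>
>>> In other words, the slave command line above could be trimmed to something
>>> like the following (an illustrative sketch only, keeping the non-default
>>> flags from your invocation):
>>>
>>>     /usr/sbin/mesos-slave \
>>>       --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos \
>>>       --containerizers=docker,mesos --executor_registration_timeout=5mins \
>>>       --hostname=10.195.30.19 --ip=10.195.30.19 \
>>>       --isolation=cgroups/cpu,cgroups/mem --recovery_timeout=120mins \
>>>       --resources=ports:[31000-32000,80,443]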
>>>
>>> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <al...@mesosphere.io>
>>> wrote:
>>>
>>>> Geoffroy,
>>>>
>>>> could you please provide master logs (both from killed and taking over
>>>> masters)?
>>>>
>>>> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <
>>>> geoffroy.jabouley@gmail.com> wrote:
>>>>
>>>>> Hello
>>>>>
>>>>> we are facing some unexpecting issues when testing high availability
>>>>> behaviors of our mesos cluster.
>>>>>
>>>>> *Our use case:*
>>>>>
>>>>> *State*: the mesos cluster is up (3 machines), 1 docker task is
>>>>> running on each slave (started from marathon)
>>>>>
>>>>> *Action*: stop the mesos master leader process
>>>>>
>>>>> *Expected*: mesos master leader has changed, *active tasks remain
>>>>> unchanged*
>>>>>
>>>>> *Seen*: mesos master leader has changed, *all active tasks are now
>>>>> FAILED but docker containers are still running*, marathon detects
>>>>> FAILED tasks and starts new tasks. We end with 2 docker containers running
>>>>> on each machine, but only one is linked to a RUNNING mesos task.
>>>>>
>>>>>
>>>>> Is the seen behavior correct?
>>>>>
>>>>> Have we misunderstood the high availability concept? We thought that
>>>>> doing this use case would not have any impact on the current cluster state
>>>>> (except leader re-election)
>>>>>
>>>>> Thanks in advance for your help
>>>>> Regards
>>>>>
>>>>> ---------------------------------------------------
>>>>>
>>>>> our setup is the following:
>>>>> 3 identical mesos nodes with:
>>>>>     + zookeeper
>>>>>     + docker 1.5
>>>>>     + mesos master 0.21.1 configured in HA mode
>>>>>     + mesos slave 0.21.1 configured with checkpointing, strict and
>>>>> reconnect
>>>>>     + marathon 0.8.0 configured in HA mode with checkpointing
>>>>>
>>>>> ---------------------------------------------------
>>>>>
>>>>> Command lines:
>>>>>
>>>>>
>>>>> *mesos-master*usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
>>>>> --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
>>>>> --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
>>>>>
>>>>> *mesos-slave*
>>>>> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,
>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint
>>>>> --containerizers=docker,mesos --executor_registration_timeout=5mins
>>>>> --hostname=10.195.30.19 --ip=10.195.30.19
>>>>> --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
>>>>> --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>>>>>
>>>>> *marathon*
>>>>> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
>>>>> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
>>>>> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
>>>>> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
>>>>> --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
>>>>> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
>>>>> 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

Posted by Alex Rukletsov <al...@mesosphere.io>.
Geoffroy,

most probably you're hitting this bug:
https://github.com/mesosphere/marathon/issues/1063. The problem is that
Marathon can register instead of re-registering when a master fails
over. From the master's point of view, it's a new framework; that's why the
previous task is gone and a new one (that technically belongs to a new
framework) is started. You can see that the frameworks have two different
IDs (check the log entries at 11:31:40.055496 and 11:31:40.785038) in your
example.
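
A quick way to see this in the logs themselves (a hypothetical command,
assuming the new leader's log was saved to mesos-master.log; adjust the path
to your setup) is to pull out those two entries and compare the framework IDs
they mention:

    grep -E '11:31:40\.(055496|785038)' mesos-master.log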

Hope that helps,
Alex

On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley <
geoffroy.jabouley@gmail.com> wrote:

> Hello
>
> thanks for your interest. Following are the requested logs, which will
> result in a pretty big mail.
>
> Mesos/Marathon are *NOT running inside docker*; we only use Docker as our
> mesos containerizer.
>
> As a reminder, here is the use case performed to get the log files:
>
> --------------------------------
>
> Our cluster: 3 identical mesos nodes with:
>     + zookeeper
>     + docker 1.5
>     + mesos master 0.21.1 configured in HA mode
>     + mesos slave 0.21.1 configured with checkpointing, strict and
> reconnect
>     + marathon 0.8.0 configured in HA mode with checkpointing
>
> --------------------------------
>
> *Begin State: *
> + the mesos cluster is up (3 machines)
> + mesos master leader is 10.195.30.19
> + marathon leader is 10.195.30.21
> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
>
> *Action*: stop the mesos master leader process (sudo stop mesos-master)
>
> *Expected*: mesos master leader has changed, active tasks / frameworks
> remain unchanged
>
> *End state: *
> + mesos master leader *has changed, now 10.195.30.21*
> + the previously running APPTASK on slave 10.195.30.21 has "disappeared"
> (not showing anymore in the mesos UI), but *the docker container is still
> running*
> + a *new APPTASK is now running on slave 10.195.30.19*
> + marathon framework "registration time" in the mesos UI shows "Just now"
> + marathon leader *has changed, now 10.195.30.20*
>
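> As a quick sanity check after the failover, the new leading master can also
> be read from the state.json endpoint that shows up in the logs below (a
> minimal sketch; it assumes the endpoint exposes a "leader" field and that
> any of the three masters can be queried):
>
>     curl -s http://10.195.30.19:5050/master/state.json | grep -o '"leader":"[^"]*"'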
>
> --------------------------------
>
> Now come the 6 requested logs, which might contain interesting/relevant
> information, but as a newcomer to mesos I find them hard to read...
>
>
> *from previous MESOS master leader 10.195.30.19:*
> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM
> from process 1 of user 0; exiting
>
>
> *from new MESOS master leader 10.195.30.21:*
> I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader:
> (id='2')
> I0310 11:31:40.011823   922 group.cpp:659] Trying to get
> '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group memberships
> changed
> I0310 11:31:40.015847   915 group.cpp:659] Trying to get
> '/mesos/log_replicas/0000000000' in ZooKeeper
> I0310 11:31:40.016047   922 detector.cpp:433] A new leading master (UPID=
> master@10.195.30.21:5050) is detected
> I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader is
> master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
> I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading master!
> I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
> I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
> I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
> I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit
> promise request with proposal 2
> I0310 11:31:40.017503   915 group.cpp:659] Trying to get
> '/mesos/log_replicas/0000000003' in ZooKeeper
> I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8 bytes)
> to leveldb took 893672ns
> I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
> I0310 11:31:40.018817   915 network.hpp:466] ZooKeeper group PIDs: {
> log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
> I0310 11:31:40.023022   923 coordinator.cpp:230] Coordinator attemping to
> fill missing position
> I0310 11:31:40.023110   923 log.cpp:672] Writer started with ending
> position 8
> I0310 11:31:40.023293   923 leveldb.cpp:438] Reading position from leveldb
> took 13195ns
> I0310 11:31:40.023309   923 leveldb.cpp:438] Reading position from leveldb
> took 3120ns
> I0310 11:31:40.023619   922 registrar.cpp:346] Successfully fetched the
> registry (610B) in 7.385856ms
> I0310 11:31:40.023679   922 registrar.cpp:445] Applied 1 operations in
> 9263ns; attempting to update the 'registry'
> I0310 11:31:40.024238   922 log.cpp:680] Attempting to append 647 bytes to
> the log
> I0310 11:31:40.024279   923 coordinator.cpp:340] Coordinator attempting to
> write APPEND action at position 9
> I0310 11:31:40.024435   923 replica.cpp:508] Replica received write
> request for position 9
> I0310 11:31:40.025707   923 leveldb.cpp:343] Persisting action (666 bytes)
> to leveldb took 1.259338ms
> I0310 11:31:40.025722   923 replica.cpp:676] Persisted action at 9
> I0310 11:31:40.026074   923 replica.cpp:655] Replica received learned
> notice for position 9
> I0310 11:31:40.026495   923 leveldb.cpp:343] Persisting action (668 bytes)
> to leveldb took 404795ns
> I0310 11:31:40.026507   923 replica.cpp:676] Persisted action at 9
> I0310 11:31:40.026511   923 replica.cpp:661] Replica learned APPEND action
> at position 9
> I0310 11:31:40.026726   923 registrar.cpp:490] Successfully updated the
> 'registry' in 3.029248ms
> I0310 11:31:40.026765   923 registrar.cpp:376] Successfully recovered
> registrar
> I0310 11:31:40.026814   923 log.cpp:699] Attempting to truncate the log to
> 9
> I0310 11:31:40.026880   923 master.cpp:1121] Recovered 3 slaves from the
> Registry (608B) ; allowing 1days for slaves to re-register
> I0310 11:31:40.026897   923 coordinator.cpp:340] Coordinator attempting to
> write TRUNCATE action at position 10
> I0310 11:31:40.026988   923 replica.cpp:508] Replica received write
> request for position 10
> I0310 11:31:40.027640   923 leveldb.cpp:343] Persisting action (16 bytes)
> to leveldb took 641018ns
> I0310 11:31:40.027652   923 replica.cpp:676] Persisted action at 10
> I0310 11:31:40.030848   923 replica.cpp:655] Replica received learned
> notice for position 10
> I0310 11:31:40.031883   923 leveldb.cpp:343] Persisting action (18 bytes)
> to leveldb took 1.008914ms
> I0310 11:31:40.031963   923 leveldb.cpp:401] Deleting ~2 keys from leveldb
> took 46724ns
> I0310 11:31:40.031977   923 replica.cpp:676] Persisted action at 10
> I0310 11:31:40.031986   923 replica.cpp:661] Replica learned TRUNCATE
> action at position 10
> I0310 11:31:40.055415   918 master.cpp:1383] Received registration request
> for framework 'marathon' at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.055496   918 master.cpp:1447] Registering framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.055642   919 hierarchical_allocator_process.hpp:329] Added
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.189151   919 master.cpp:3246] Re-registering slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19)
> I0310 11:31:40.189280   919 registrar.cpp:445] Applied 1 operations in
> 15452ns; attempting to update the 'registry'
> I0310 11:31:40.189949   919 log.cpp:680] Attempting to append 647 bytes to
> the log
> I0310 11:31:40.189978   919 coordinator.cpp:340] Coordinator attempting to
> write APPEND action at position 11
> I0310 11:31:40.190112   919 replica.cpp:508] Replica received write
> request for position 11
> I0310 11:31:40.190563   919 leveldb.cpp:343] Persisting action (666 bytes)
> to leveldb took 437440ns
> I0310 11:31:40.190577   919 replica.cpp:676] Persisted action at 11
> I0310 11:31:40.191249   921 replica.cpp:655] Replica received learned
> notice for position 11
> I0310 11:31:40.192159   921 leveldb.cpp:343] Persisting action (668 bytes)
> to leveldb took 892767ns
> I0310 11:31:40.192178   921 replica.cpp:676] Persisted action at 11
> I0310 11:31:40.192184   921 replica.cpp:661] Replica learned APPEND action
> at position 11
> I0310 11:31:40.192350   921 registrar.cpp:490] Successfully updated the
> 'registry' in 3.0528ms
> I0310 11:31:40.192387   919 log.cpp:699] Attempting to truncate the log to
> 11
> I0310 11:31:40.192415   919 coordinator.cpp:340] Coordinator attempting to
> write TRUNCATE action at position 12
> I0310 11:31:40.192539   915 replica.cpp:508] Replica received write
> request for position 12
> I0310 11:31:40.192600   921 master.cpp:3314] Re-registered slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148
> I0310 11:31:40.192680   917 hierarchical_allocator_process.hpp:442] Added
> slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with
> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
> available)
> I0310 11:31:40.192847   917 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.193164   915 leveldb.cpp:343] Persisting action (16 bytes)
> to leveldb took 610664ns
> I0310 11:31:40.193181   915 replica.cpp:676] Persisted action at 12
> I0310 11:31:40.193568   915 replica.cpp:655] Replica received learned
> notice for position 12
> I0310 11:31:40.193948   915 leveldb.cpp:343] Persisting action (18 bytes)
> to leveldb took 364062ns
> I0310 11:31:40.193979   915 leveldb.cpp:401] Deleting ~2 keys from leveldb
> took 12256ns
> I0310 11:31:40.193985   915 replica.cpp:676] Persisted action at 12
> I0310 11:31:40.193990   915 replica.cpp:661] Replica learned TRUNCATE
> action at position 12
> I0310 11:31:40.248615   915 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O0 ] on slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.248744   915 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S1 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.774416   915 master.cpp:3246] Re-registering slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21)
> I0310 11:31:40.774976   915 registrar.cpp:445] Applied 1 operations in
> 42342ns; attempting to update the 'registry'
> I0310 11:31:40.777273   921 log.cpp:680] Attempting to append 647 bytes to
> the log
> I0310 11:31:40.777436   921 coordinator.cpp:340] Coordinator attempting to
> write APPEND action at position 13
> I0310 11:31:40.777989   921 replica.cpp:508] Replica received write
> request for position 13
> I0310 11:31:40.779558   921 leveldb.cpp:343] Persisting action (666 bytes)
> to leveldb took 1.513714ms
> I0310 11:31:40.779633   921 replica.cpp:676] Persisted action at 13
> I0310 11:31:40.781821   919 replica.cpp:655] Replica received learned
> notice for position 13
> I0310 11:31:40.784417   919 leveldb.cpp:343] Persisting action (668 bytes)
> to leveldb took 2.542036ms
> I0310 11:31:40.784446   919 replica.cpp:676] Persisted action at 13
> I0310 11:31:40.784452   919 replica.cpp:661] Replica learned APPEND action
> at position 13
> I0310 11:31:40.784711   920 registrar.cpp:490] Successfully updated the
> 'registry' in 9.68192ms
> I0310 11:31:40.784762   917 log.cpp:699] Attempting to truncate the log to
> 13
> I0310 11:31:40.784808   920 coordinator.cpp:340] Coordinator attempting to
> write TRUNCATE action at position 14
> I0310 11:31:40.784865   917 master.hpp:877] Adding task
> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with
> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
> 20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
> I0310 11:31:40.784955   919 replica.cpp:508] Replica received write
> request for position 14
> W0310 11:31:40.785038   917 master.cpp:4468] Possibly orphaned task
> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of
> framework 20150310-112310-320783114-5050-24289-0000 running on slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21)
> I0310 11:31:40.785105   917 master.cpp:3314] Re-registered slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148
> I0310 11:31:40.785162   920 hierarchical_allocator_process.hpp:442] Added
> slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with
> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
> ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148
> available)
> I0310 11:31:40.785679   921 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.786429   919 leveldb.cpp:343] Persisting action (16 bytes)
> to leveldb took 1.454211ms
> I0310 11:31:40.786455   919 replica.cpp:676] Persisted action at 14
> I0310 11:31:40.786782   919 replica.cpp:655] Replica received learned
> notice for position 14
> I0310 11:31:40.787833   919 leveldb.cpp:343] Persisting action (18 bytes)
> to leveldb took 1.027014ms
> I0310 11:31:40.787873   919 leveldb.cpp:401] Deleting ~2 keys from leveldb
> took 14085ns
> I0310 11:31:40.787883   919 replica.cpp:676] Persisted action at 14
> I0310 11:31:40.787889   919 replica.cpp:661] Replica learned TRUNCATE
> action at position 14
> I0310 11:31:40.792536   922 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O1 ] on slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.792670   922 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S2 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.819602   921 master.cpp:3246] Re-registering slave
> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
> (10.195.30.20)
> I0310 11:31:40.819736   921 registrar.cpp:445] Applied 1 operations in
> 16656ns; attempting to update the 'registry'
> I0310 11:31:40.820617   921 log.cpp:680] Attempting to append 647 bytes to
> the log
> I0310 11:31:40.820726   918 coordinator.cpp:340] Coordinator attempting to
> write APPEND action at position 15
> I0310 11:31:40.820938   918 replica.cpp:508] Replica received write
> request for position 15
> I0310 11:31:40.821641   918 leveldb.cpp:343] Persisting action (666 bytes)
> to leveldb took 670583ns
> I0310 11:31:40.821663   918 replica.cpp:676] Persisted action at 15
> I0310 11:31:40.822265   917 replica.cpp:655] Replica received learned
> notice for position 15
> I0310 11:31:40.823463   917 leveldb.cpp:343] Persisting action (668 bytes)
> to leveldb took 1.178687ms
> I0310 11:31:40.823490   917 replica.cpp:676] Persisted action at 15
> I0310 11:31:40.823498   917 replica.cpp:661] Replica learned APPEND action
> at position 15
> I0310 11:31:40.823755   917 registrar.cpp:490] Successfully updated the
> 'registry' in 3.97696ms
> I0310 11:31:40.823823   917 log.cpp:699] Attempting to truncate the log to
> 15
> I0310 11:31:40.824147   922 coordinator.cpp:340] Coordinator attempting to
> write TRUNCATE action at position 16
> I0310 11:31:40.824482   922 hierarchical_allocator_process.hpp:442] Added
> slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with
> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
> available)
> I0310 11:31:40.824597   921 replica.cpp:508] Replica received write
> request for position 16
> I0310 11:31:40.824128   917 master.cpp:3314] Re-registered slave
> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
> (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148
> I0310 11:31:40.824975   917 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.831900   921 leveldb.cpp:343] Persisting action (16 bytes)
> to leveldb took 7.228682ms
> I0310 11:31:40.832031   921 replica.cpp:676] Persisted action at 16
> I0310 11:31:40.832456   917 replica.cpp:655] Replica received learned
> notice for position 16
> I0310 11:31:40.835178   917 leveldb.cpp:343] Persisting action (18 bytes)
> to leveldb took 2.674392ms
> I0310 11:31:40.835297   917 leveldb.cpp:401] Deleting ~2 keys from leveldb
> took 45220ns
> I0310 11:31:40.835322   917 replica.cpp:676] Persisted action at 16
> I0310 11:31:40.835341   917 replica.cpp:661] Replica learned TRUNCATE
> action at position 16
> I0310 11:31:40.838281   923 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O2 ] on slave
> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.838389   923 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S0 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.948725   919 http.cpp:344] HTTP request for
> '/master/redirect'
> I0310 11:31:41.479118   918 http.cpp:478] HTTP request for
> '/master/state.json'
> I0310 11:31:45.368074   918 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:45.385144   917 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O3 ] on slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:45.385292   917 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S1 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:46.368450   917 master.cpp:3843] Sending 2 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:46.375222   920 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O4 ] on slave
> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:46.375360   920 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O5 ] on slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:46.375530   920 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S0 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:46.375599   920 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S2 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:48.031230   915 http.cpp:478] HTTP request for
> '/master/state.json'
> I0310 11:31:51.374285   922 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:51.379391   921 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O6 ] on slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:51.379487   921 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S1 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:51.482094   923 http.cpp:478] HTTP request for
> '/master/state.json'
> I0310 11:31:52.375326   917 master.cpp:3843] Sending 2 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:52.391376   919 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O7 ] on slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:52.391512   919 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S2 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:52.391659   921 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O8 ] on slave
> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:52.391751   921 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S0 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:55.062060   918 master.cpp:3611] Performing explicit task
> state reconciliation for 1 tasks of framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:55.062588   919 master.cpp:3556] Performing implicit task
> state reconciliation for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:56.140990   923 http.cpp:344] HTTP request for
> '/master/redirect'
> I0310 11:31:57.379288   918 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:57.430888   918 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O9 ] on slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:57.431068   918 master.hpp:877] Adding task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with
> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
> 20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
> I0310 11:31:57.431089   918 master.cpp:2503] Launching task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with
> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19)
> I0310 11:31:57.431205   918 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S1 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.682133   919 master.cpp:3446] Forwarding status update
> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.682186   919 master.cpp:3418] Status update TASK_RUNNING
> (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 from slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19)
> I0310 11:31:57.682199   919 master.cpp:4693] Updating the latest state of
> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
>
>
> *from MESOS slave 10.195.30.21*
> I0310 11:31:28.750200  1074 slave.cpp:2623] master@10.195.30.19:5050
> exited
> W0310 11:31:28.750249  1074 slave.cpp:2626] Master disconnected! Waiting
> for a new master to be elected
> I0310 11:31:40.012516  1075 detector.cpp:138] Detected a new leader:
> (id='2')
> I0310 11:31:40.012899  1073 group.cpp:659] Trying to get
> '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.017143  1072 detector.cpp:433] A new leading master (UPID=
> master@10.195.30.21:5050) is detected
> I0310 11:31:40.017408  1072 slave.cpp:602] New master detected at
> master@10.195.30.21:5050
> I0310 11:31:40.017546  1076 status_update_manager.cpp:171] Pausing sending
> status updates
> I0310 11:31:40.018673  1072 slave.cpp:627] No credentials provided.
> Attempting to register without authentication
> I0310 11:31:40.018689  1072 slave.cpp:638] Detecting new master
> I0310 11:31:40.785364  1075 slave.cpp:824] Re-registered with master
> master@10.195.30.21:5050
> I0310 11:31:40.785398  1075 status_update_manager.cpp:178] Resuming
> sending status updates
> I0310 11:32:10.639506  1075 slave.cpp:3321] Current usage 12.27%. Max
> allowed age: 5.441217749539572days
>
>
> *from MESOS slave 10.195.30.19*
> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050
> exited
> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected! Waiting
> for a new master to be elected
> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader:
> (id='2')
> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get
> '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master (UPID=
> master@10.195.30.21:5050) is detected
> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at
> master@10.195.30.21:5050
> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided.
> Attempting to register without authentication
> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing sending
> status updates
> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master
> master@10.195.30.21:5050
> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming
> sending status updates
> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 in work directory
> '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for
> executor
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework '20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container
> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and
> executor
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of
> framework '20150310-112310-354337546-5050-895-0000'
> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to
> '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
> framework '20150310-112310-354337546-5050-895-0000' in container
> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for executor
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
> 10.195.30.19:56574
> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
> executor
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update
> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
> 10.195.30.19:56574
> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received status
> update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.678411 24461 status_update_manager.hpp:346] Checkpointing
> UPDATE for status update TASK_RUNNING (UUID:
> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update
> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 to
> master@10.195.30.21:5050
> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for
> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 to executor(1)@
> 10.195.30.19:56574
> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received status
> update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.689028 24461 status_update_manager.hpp:346] Checkpointing
> ACK for status update TASK_RUNNING (UUID:
> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to 204
> at
> /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated
> 'memory.soft_limit_in_bytes' to 160MB for container
> a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated
> 'memory.limit_in_bytes' to 160MB at
> /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max
> allowed age: 5.555425200437824days
>
>
> *From previous Marathon leader 10.195.30.21 <http://10.195.30.21>:*
> I0310 11:31:40.017248  1115 detector.cpp:138] Detected a new leader:
> (id='2')
> I0310 11:31:40.017334  1115 group.cpp:659] Trying to get
> '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.017727  1115 detector.cpp:433] A new leading master (UPID=
> master@10.195.30.21:5050) is detected
> [2015-03-10 11:31:40,017] WARN Disconnected
> (mesosphere.marathon.MarathonScheduler:224)
> [2015-03-10 11:31:40,019] INFO Abdicating
> (mesosphere.marathon.MarathonSchedulerService:312)
> [2015-03-10 11:31:40,019] INFO Defeat leadership
> (mesosphere.marathon.MarathonSchedulerService:285)
> [INFO] [03/10/2015 11:31:40.019]
> [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b]
> POSTing to all endpoints.
> [INFO] [03/10/2015 11:31:40.019]
> [marathon-akka.actor.default-dispatcher-5]
> [akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
> [2015-03-10 11:31:40,021] INFO Stopping driver
> (mesosphere.marathon.MarathonSchedulerService:221)
> I0310 11:31:40.022001  1115 sched.cpp:1286] Asked to stop the driver
> [2015-03-10 11:31:40,024] INFO Setting framework ID to
> 20150310-112310-320783114-5050-24289-0000
> (mesosphere.marathon.MarathonSchedulerService:73)
> I0310 11:31:40.026274  1115 sched.cpp:234] New master detected at
> master@10.195.30.21:5050
> I0310 11:31:40.026418  1115 sched.cpp:242] No credentials provided.
> Attempting to register without authentication
> I0310 11:31:40.026458  1115 sched.cpp:752] Stopping framework
> '20150310-112310-320783114-5050-24289-0000'
> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing optional
> abdication command. (mesosphere.marathon.MarathonSchedulerService:192)
> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface)
> (mesosphere.marathon.MarathonSchedulerService:246)
> [2015-03-10 11:31:40,032] INFO Defeat leadership
> (mesosphere.marathon.MarathonSchedulerService:285)
> [2015-03-10 11:31:40,032] INFO Stopping driver
> (mesosphere.marathon.MarathonSchedulerService:221)
> I0310 11:31:40.032588  1107 sched.cpp:1286] Asked to stop the driver
> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500
> milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334)
> [2015-03-10 11:31:40,033] INFO Setting framework ID to
> 20150310-112310-320783114-5050-24289-0000
> (mesosphere.marathon.MarathonSchedulerService:73)
> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is not
> a candidate for leader, current voting: [member_0000000001,
> member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering leadership
> (mesosphere.marathon.MarathonSchedulerService:341)
> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003
> (com.twitter.common.zookeeper.Group:426)
> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is not
> a candidate for leader, current voting: [member_0000000001,
> member_0000000002, member_0000000003]
> (com.twitter.common.zookeeper.CandidateImpl:144)
> [2015-03-10 11:31:40,568] INFO Candidate
> /marathon/leader/member_0000000003 waiting for the next leader election,
> current voting: [member_0000000001, member_0000000002, member_0000000003]
> (com.twitter.common.zookeeper.CandidateImpl:165)
>
>
> *From new Marathon leader 10.195.30.20 <http://10.195.30.20>:*
> [2015-03-10 11:31:40,029] INFO Candidate
> /marathon/leader/member_0000000001 is now leader of group:
> [member_0000000001, member_0000000002]
> (com.twitter.common.zookeeper.CandidateImpl:152)
> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface)
> (mesosphere.marathon.MarathonSchedulerService:253)
> [2015-03-10 11:31:40,044] INFO Elect leadership
> (mesosphere.marathon.MarathonSchedulerService:299)
> [2015-03-10 11:31:40,044] INFO Running driver
> (mesosphere.marathon.MarathonSchedulerService:184)
> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712:
> Client environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716:
> Client environment:host.name=srv-d2u-9-virtip20
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723:
> Client environment:os.name=Linux
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724:
> Client environment:os.arch=3.13.0-44-generic
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725:
> Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733:
> Client environment:user.name=(null)
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741:
> Client environment:user.home=/root
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753:
> Client environment:user.dir=/
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786:
> Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,
> 10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0 sessionId=0
> sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff
> (mesosphere.marathon.MarathonSchedulerService:329)
> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703:
> initiated connection to server [10.195.30.19:2181]
> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750:
> session establishment complete on server [10.195.30.19:2181],
> sessionId=0x14c0335ad7e000d, negotiated timeout=10000
> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@
> 10.195.30.20:45771) connected to ZooKeeper
> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path '/mesos'
> in ZooKeeper
> [INFO] [03/10/2015 11:31:40.047]
> [marathon-akka.actor.default-dispatcher-2]
> [akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader:
> (id='2')
> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get
> '/mesos/info_0000000002' in ZooKeeper
> [2015-03-10 11:31:40,053] INFO Migration successfully applied for version
> Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master (UPID=
> master@10.195.30.21:5050) is detected
> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at
> master@10.195.30.21:5050
> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided.
> Attempting to register without authentication
> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with
> 20150310-112310-354337546-5050-895-0000
> [2015-03-10 11:31:40,056] INFO Registered as
> 20150310-112310-354337546-5050-895-0000 to master
> '20150310-112310-354337546-5050-895'
> (mesosphere.marathon.MarathonScheduler:72)
> [2015-03-10 11:31:40,063] INFO Stored framework ID
> '20150310-112310-354337546-5050-895-0000'
> (mesosphere.mesos.util.FrameworkIdUtil:49)
> [INFO] [03/10/2015 11:31:40.065]
> [marathon-akka.actor.default-dispatcher-6]
> [akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
> [INFO] [03/10/2015 11:31:40.067]
> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
> POSTing to all endpoints.
> ...
> ...
> ...
> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps
> (mesosphere.marathon.SchedulerActions:403)
> [INFO] [03/10/2015 11:31:55.053]
> [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
> Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
> encountered. This logging can be turned off or adjusted with configuration
> settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
> Mesos master (mesosphere.marathon.SchedulerActions:430)
> [2015-03-10 11:31:55,064] INFO Received status update for task
> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
> TASK_LOST (Reconciliation: Task is unknown to the slave)
> (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:31:55,069] INFO Need to scale
> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
> (mesosphere.marathon.SchedulerActions:488)
> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
> /ffaas-backoffice-app-nopersist (0 queued)
> (mesosphere.marathon.SchedulerActions:494)
> [2015-03-10 11:31:55,069] INFO Task
> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
> expunged and removed from TaskTracker
> (mesosphere.marathon.tasks.TaskTracker:107)
> [2015-03-10 11:31:55,070] INFO Sending event notification.
> (mesosphere.marathon.MarathonScheduler:262)
> [INFO] [03/10/2015 11:31:55.072]
> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
> POSTing to all endpoints.
> [2015-03-10 11:31:55,073] INFO Need to scale
> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
> (mesosphere.marathon.SchedulerActions:488)
> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
> /ffaas-backoffice-app-nopersist. Not scaling.
> (mesosphere.marathon.SchedulerActions:498)
> ...
> ...
> ...
> [2015-03-10 11:31:57,682] INFO Received status update for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
> TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:31:57,694] INFO Sending event notification.
> (mesosphere.marathon.MarathonScheduler:262)
> [INFO] [03/10/2015 11:31:57.694]
> [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
> POSTing to all endpoints.
> ...
> ...
> ...
> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
> (mesosphere.marathon.tasks.TaskTracker:170)
> [INFO] [03/10/2015 11:36:55.050]
> [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters]
> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
> Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
> encountered. This logging can be turned off or adjusted with configuration
> settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
> (mesosphere.marathon.SchedulerActions:403)
> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
> Mesos master (mesosphere.marathon.SchedulerActions:430)
> [2015-03-10 11:36:55,063] INFO Received status update for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
> TASK_RUNNING (Reconciliation: Latest task state)
> (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:36:55,065] INFO Received status update for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
> TASK_RUNNING (Reconciliation: Latest task state)
> (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:36:55,066] INFO Already running 1 instances of
> /ffaas-backoffice-app-nopersist. Not scaling.
> (mesosphere.marathon.SchedulerActions:512)
>
>
>
> -- End of logs
>
>
>
> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <ad...@mesosphere.io>:
>
>> This is certainly not the expected/desired behavior when failing over a
>> mesos master in HA mode. In addition to the master logs Alex requested, can
>> you also provide relevant portions of the slave logs for these tasks? If
>> the slave processes themselves never failed over, checkpointing and slave
>> recovery should be irrelevant. Are you running the mesos-slave itself
>> inside a Docker, or any other non-traditional setup?
>>
>> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
>> defaults to "reconnect", and --strict defaults to true, so none of those
>> are necessary.
>>
>> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <al...@mesosphere.io>
>> wrote:
>>
>>> Geoffroy,
>>>
>>> could you please provide master logs (both from killed and taking over
>>> masters)?
>>>
>>> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <
>>> geoffroy.jabouley@gmail.com> wrote:
>>>
>>>> Hello
>>>>
>>>> we are facing some unexpecting issues when testing high availability
>>>> behaviors of our mesos cluster.
>>>>
>>>> *Our use case:*
>>>>
>>>> *State*: the mesos cluster is up (3 machines), 1 docker task is
>>>> running on each slave (started from marathon)
>>>>
>>>> *Action*: stop the mesos master leader process
>>>>
>>>> *Expected*: mesos master leader has changed, *active tasks remain
>>>> unchanged*
>>>>
>>>> *Seen*: mesos master leader has changed, *all active tasks are now
>>>> FAILED but docker containers are still running*, marathon detects
>>>> FAILED tasks and starts new tasks. We end with 2 docker containers running
>>>> on each machine, but only one is linked to a RUNNING mesos task.
>>>>
>>>>
>>>> Is the seen behavior correct?
>>>>
>>>> Have we misunderstood the high availability concept? We thought that
>>>> doing this use case would not have any impact on the current cluster state
>>>> (except leader re-election)
>>>>
>>>> Thanks in advance for your help
>>>> Regards
>>>>
>>>> ---------------------------------------------------
>>>>
>>>> our setup is the following:
>>>> 3 identical mesos nodes with:
>>>>     + zookeeper
>>>>     + docker 1.5
>>>>     + mesos master 0.21.1 configured in HA mode
>>>>     + mesos slave 0.21.1 configured with checkpointing, strict and
>>>> reconnect
>>>>     + marathon 0.8.0 configured in HA mode with checkpointing
>>>>
>>>> ---------------------------------------------------
>>>>
>>>> Command lines:
>>>>
>>>>
>>>> *mesos-master*usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
>>>> --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
>>>> --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
>>>>
>>>> *mesos-slave*
>>>> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181
>>>> ,10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos
>>>> --executor_registration_timeout=5mins --hostname=10.195.30.19
>>>> --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
>>>> --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>>>>
>>>> *marathon*
>>>> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
>>>> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
>>>> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
>>>> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
>>>> --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
>>>> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
>>>> 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos
>>>>
>>>
>>>
>>
>

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

Posted by Geoffroy Jabouley <ge...@gmail.com>.
Hello

Thanks for your interest. Following are the requested logs, which will
result in a pretty big mail.

Mesos/Marathon are *NOT running inside docker*; we only use Docker as our
mesos containerizer.

As a reminder, here is the use case performed to obtain the log files:

--------------------------------

Our cluster: 3 identical mesos nodes with:
    + zookeeper
    + docker 1.5
    + mesos master 0.21.1 configured in HA mode
    + mesos slave 0.21.1 configured with checkpointing, strict and reconnect
    + marathon 0.8.0 configured in HA mode with checkpointing
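
As Adam pointed out, --checkpoint, --recover=reconnect and --strict are
already the defaults in 0.21.1, so an equivalent slave invocation only
needs the remaining flags. A minimal sketch, assuming the same ZooKeeper
ensemble, hostname and resources as the command lines quoted earlier in
the thread (illustrative only, not a verified configuration):

/usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos \
    --containerizers=docker,mesos --executor_registration_timeout=5mins \
    --hostname=10.195.30.19 --ip=10.195.30.19 \
    --isolation=cgroups/cpu,cgroups/mem --recovery_timeout=120mins \
    --resources=ports:[31000-32000,80,443]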

--------------------------------

*Begin State: *
+ the mesos cluster is up (3 machines)
+ mesos master leader is 10.195.30.19
+ marathon leader is 10.195.30.21
+ 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21

*Action*: stop the mesos master leader process (sudo stop mesos-master)

*Expected*: mesos master leader has changed, active tasks / frameworks
remain unchanged

*End state: *
+ mesos master leader *has changed, now 10.195.30.21*
+ previously running APPTASK on the slave 10.195.30.21 has "disappeared"
(no longer showing in the mesos UI), but the *docker container is still
running*
+ a *new APPTASK is now running on slave 10.195.30.19*
+ marathon framework "registration time" in mesos UI shows "Just now"
+ marathon leader *has changed, now 10.195.30.20*
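
A quick way to double-check this end state on a node (a rough sketch,
assuming curl and python are available, the default ports from the
command lines quoted earlier, and the state.json layout of Mesos 0.21):

# tasks the current leading master knows about
curl -s http://10.195.30.21:5050/master/state.json \
  | python -c 'import json,sys; d=json.load(sys.stdin); print([t["id"] for f in d["frameworks"] for t in f["tasks"]])'

# containers the Docker daemon on this slave is still running
docker ps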


--------------------------------

Now come the 6 requested logs, which might contain interesting/relevant
information, but as a newcomer to mesos I find them hard to read...


*from previous MESOS master leader 10.195.30.19 <http://10.195.30.19>:*
W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM
from process 1 of user 0; exiting


*from new MESOS master leader 10.195.30.21 <http://10.195.30.21>:*
I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader:
(id='2')
I0310 11:31:40.011823   922 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group memberships
changed
I0310 11:31:40.015847   915 group.cpp:659] Trying to get
'/mesos/log_replicas/0000000000' in ZooKeeper
I0310 11:31:40.016047   922 detector.cpp:433] A new leading master (UPID=
master@10.195.30.21:5050) is detected
I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader is
master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading master!
I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit
promise request with proposal 2
I0310 11:31:40.017503   915 group.cpp:659] Trying to get
'/mesos/log_replicas/0000000003' in ZooKeeper
I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8 bytes)
to leveldb took 893672ns
I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
I0310 11:31:40.018817   915 network.hpp:466] ZooKeeper group PIDs: {
log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
I0310 11:31:40.023022   923 coordinator.cpp:230] Coordinator attemping to
fill missing position
I0310 11:31:40.023110   923 log.cpp:672] Writer started with ending
position 8
I0310 11:31:40.023293   923 leveldb.cpp:438] Reading position from leveldb
took 13195ns
I0310 11:31:40.023309   923 leveldb.cpp:438] Reading position from leveldb
took 3120ns
I0310 11:31:40.023619   922 registrar.cpp:346] Successfully fetched the
registry (610B) in 7.385856ms
I0310 11:31:40.023679   922 registrar.cpp:445] Applied 1 operations in
9263ns; attempting to update the 'registry'
I0310 11:31:40.024238   922 log.cpp:680] Attempting to append 647 bytes to
the log
I0310 11:31:40.024279   923 coordinator.cpp:340] Coordinator attempting to
write APPEND action at position 9
I0310 11:31:40.024435   923 replica.cpp:508] Replica received write request
for position 9
I0310 11:31:40.025707   923 leveldb.cpp:343] Persisting action (666 bytes)
to leveldb took 1.259338ms
I0310 11:31:40.025722   923 replica.cpp:676] Persisted action at 9
I0310 11:31:40.026074   923 replica.cpp:655] Replica received learned
notice for position 9
I0310 11:31:40.026495   923 leveldb.cpp:343] Persisting action (668 bytes)
to leveldb took 404795ns
I0310 11:31:40.026507   923 replica.cpp:676] Persisted action at 9
I0310 11:31:40.026511   923 replica.cpp:661] Replica learned APPEND action
at position 9
I0310 11:31:40.026726   923 registrar.cpp:490] Successfully updated the
'registry' in 3.029248ms
I0310 11:31:40.026765   923 registrar.cpp:376] Successfully recovered
registrar
I0310 11:31:40.026814   923 log.cpp:699] Attempting to truncate the log to 9
I0310 11:31:40.026880   923 master.cpp:1121] Recovered 3 slaves from the
Registry (608B) ; allowing 1days for slaves to re-register
I0310 11:31:40.026897   923 coordinator.cpp:340] Coordinator attempting to
write TRUNCATE action at position 10
I0310 11:31:40.026988   923 replica.cpp:508] Replica received write request
for position 10
I0310 11:31:40.027640   923 leveldb.cpp:343] Persisting action (16 bytes)
to leveldb took 641018ns
I0310 11:31:40.027652   923 replica.cpp:676] Persisted action at 10
I0310 11:31:40.030848   923 replica.cpp:655] Replica received learned
notice for position 10
I0310 11:31:40.031883   923 leveldb.cpp:343] Persisting action (18 bytes)
to leveldb took 1.008914ms
I0310 11:31:40.031963   923 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 46724ns
I0310 11:31:40.031977   923 replica.cpp:676] Persisted action at 10
I0310 11:31:40.031986   923 replica.cpp:661] Replica learned TRUNCATE
action at position 10
I0310 11:31:40.055415   918 master.cpp:1383] Received registration request
for framework 'marathon' at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.055496   918 master.cpp:1447] Registering framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.055642   919 hierarchical_allocator_process.hpp:329] Added
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:40.189151   919 master.cpp:3246] Re-registering slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19)
I0310 11:31:40.189280   919 registrar.cpp:445] Applied 1 operations in
15452ns; attempting to update the 'registry'
I0310 11:31:40.189949   919 log.cpp:680] Attempting to append 647 bytes to
the log
I0310 11:31:40.189978   919 coordinator.cpp:340] Coordinator attempting to
write APPEND action at position 11
I0310 11:31:40.190112   919 replica.cpp:508] Replica received write request
for position 11
I0310 11:31:40.190563   919 leveldb.cpp:343] Persisting action (666 bytes)
to leveldb took 437440ns
I0310 11:31:40.190577   919 replica.cpp:676] Persisted action at 11
I0310 11:31:40.191249   921 replica.cpp:655] Replica received learned
notice for position 11
I0310 11:31:40.192159   921 leveldb.cpp:343] Persisting action (668 bytes)
to leveldb took 892767ns
I0310 11:31:40.192178   921 replica.cpp:676] Persisted action at 11
I0310 11:31:40.192184   921 replica.cpp:661] Replica learned APPEND action
at position 11
I0310 11:31:40.192350   921 registrar.cpp:490] Successfully updated the
'registry' in 3.0528ms
I0310 11:31:40.192387   919 log.cpp:699] Attempting to truncate the log to
11
I0310 11:31:40.192415   919 coordinator.cpp:340] Coordinator attempting to
write TRUNCATE action at position 12
I0310 11:31:40.192539   915 replica.cpp:508] Replica received write request
for position 12
I0310 11:31:40.192600   921 master.cpp:3314] Re-registered slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148
I0310 11:31:40.192680   917 hierarchical_allocator_process.hpp:442] Added
slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
available)
I0310 11:31:40.192847   917 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.193164   915 leveldb.cpp:343] Persisting action (16 bytes)
to leveldb took 610664ns
I0310 11:31:40.193181   915 replica.cpp:676] Persisted action at 12
I0310 11:31:40.193568   915 replica.cpp:655] Replica received learned
notice for position 12
I0310 11:31:40.193948   915 leveldb.cpp:343] Persisting action (18 bytes)
to leveldb took 364062ns
I0310 11:31:40.193979   915 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 12256ns
I0310 11:31:40.193985   915 replica.cpp:676] Persisted action at 12
I0310 11:31:40.193990   915 replica.cpp:661] Replica learned TRUNCATE
action at position 12
I0310 11:31:40.248615   915 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O0 ] on slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.248744   915 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S1 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:40.774416   915 master.cpp:3246] Re-registering slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21)
I0310 11:31:40.774976   915 registrar.cpp:445] Applied 1 operations in
42342ns; attempting to update the 'registry'
I0310 11:31:40.777273   921 log.cpp:680] Attempting to append 647 bytes to
the log
I0310 11:31:40.777436   921 coordinator.cpp:340] Coordinator attempting to
write APPEND action at position 13
I0310 11:31:40.777989   921 replica.cpp:508] Replica received write request
for position 13
I0310 11:31:40.779558   921 leveldb.cpp:343] Persisting action (666 bytes)
to leveldb took 1.513714ms
I0310 11:31:40.779633   921 replica.cpp:676] Persisted action at 13
I0310 11:31:40.781821   919 replica.cpp:655] Replica received learned
notice for position 13
I0310 11:31:40.784417   919 leveldb.cpp:343] Persisting action (668 bytes)
to leveldb took 2.542036ms
I0310 11:31:40.784446   919 replica.cpp:676] Persisted action at 13
I0310 11:31:40.784452   919 replica.cpp:661] Replica learned APPEND action
at position 13
I0310 11:31:40.784711   920 registrar.cpp:490] Successfully updated the
'registry' in 9.68192ms
I0310 11:31:40.784762   917 log.cpp:699] Attempting to truncate the log to
13
I0310 11:31:40.784808   920 coordinator.cpp:340] Coordinator attempting to
write TRUNCATE action at position 14
I0310 11:31:40.784865   917 master.hpp:877] Adding task
ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with
resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
I0310 11:31:40.784955   919 replica.cpp:508] Replica received write request
for position 14
W0310 11:31:40.785038   917 master.cpp:4468] Possibly orphaned task
ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of
framework 20150310-112310-320783114-5050-24289-0000 running on slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21)
I0310 11:31:40.785105   917 master.cpp:3314] Re-registered slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148
I0310 11:31:40.785162   920 hierarchical_allocator_process.hpp:442] Added
slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148
available)
I0310 11:31:40.785679   921 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.786429   919 leveldb.cpp:343] Persisting action (16 bytes)
to leveldb took 1.454211ms
I0310 11:31:40.786455   919 replica.cpp:676] Persisted action at 14
I0310 11:31:40.786782   919 replica.cpp:655] Replica received learned
notice for position 14
I0310 11:31:40.787833   919 leveldb.cpp:343] Persisting action (18 bytes)
to leveldb took 1.027014ms
I0310 11:31:40.787873   919 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 14085ns
I0310 11:31:40.787883   919 replica.cpp:676] Persisted action at 14
I0310 11:31:40.787889   919 replica.cpp:661] Replica learned TRUNCATE
action at position 14
I0310 11:31:40.792536   922 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O1 ] on slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.792670   922 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S2 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:40.819602   921 master.cpp:3246] Re-registering slave
20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
(10.195.30.20)
I0310 11:31:40.819736   921 registrar.cpp:445] Applied 1 operations in
16656ns; attempting to update the 'registry'
I0310 11:31:40.820617   921 log.cpp:680] Attempting to append 647 bytes to
the log
I0310 11:31:40.820726   918 coordinator.cpp:340] Coordinator attempting to
write APPEND action at position 15
I0310 11:31:40.820938   918 replica.cpp:508] Replica received write request
for position 15
I0310 11:31:40.821641   918 leveldb.cpp:343] Persisting action (666 bytes)
to leveldb took 670583ns
I0310 11:31:40.821663   918 replica.cpp:676] Persisted action at 15
I0310 11:31:40.822265   917 replica.cpp:655] Replica received learned
notice for position 15
I0310 11:31:40.823463   917 leveldb.cpp:343] Persisting action (668 bytes)
to leveldb took 1.178687ms
I0310 11:31:40.823490   917 replica.cpp:676] Persisted action at 15
I0310 11:31:40.823498   917 replica.cpp:661] Replica learned APPEND action
at position 15
I0310 11:31:40.823755   917 registrar.cpp:490] Successfully updated the
'registry' in 3.97696ms
I0310 11:31:40.823823   917 log.cpp:699] Attempting to truncate the log to
15
I0310 11:31:40.824147   922 coordinator.cpp:340] Coordinator attempting to
write TRUNCATE action at position 16
I0310 11:31:40.824482   922 hierarchical_allocator_process.hpp:442] Added
slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
available)
I0310 11:31:40.824597   921 replica.cpp:508] Replica received write request
for position 16
I0310 11:31:40.824128   917 master.cpp:3314] Re-registered slave
20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
(10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148
I0310 11:31:40.824975   917 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.831900   921 leveldb.cpp:343] Persisting action (16 bytes)
to leveldb took 7.228682ms
I0310 11:31:40.832031   921 replica.cpp:676] Persisted action at 16
I0310 11:31:40.832456   917 replica.cpp:655] Replica received learned
notice for position 16
I0310 11:31:40.835178   917 leveldb.cpp:343] Persisting action (18 bytes)
to leveldb took 2.674392ms
I0310 11:31:40.835297   917 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 45220ns
I0310 11:31:40.835322   917 replica.cpp:676] Persisted action at 16
I0310 11:31:40.835341   917 replica.cpp:661] Replica learned TRUNCATE
action at position 16
I0310 11:31:40.838281   923 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O2 ] on slave
20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
(10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.838389   923 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S0 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:40.948725   919 http.cpp:344] HTTP request for
'/master/redirect'
I0310 11:31:41.479118   918 http.cpp:478] HTTP request for
'/master/state.json'
I0310 11:31:45.368074   918 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:45.385144   917 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O3 ] on slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:45.385292   917 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S1 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:46.368450   917 master.cpp:3843] Sending 2 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:46.375222   920 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O4 ] on slave
20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
(10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:46.375360   920 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O5 ] on slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:46.375530   920 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S0 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:46.375599   920 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S2 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:48.031230   915 http.cpp:478] HTTP request for
'/master/state.json'
I0310 11:31:51.374285   922 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:51.379391   921 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O6 ] on slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:51.379487   921 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S1 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:51.482094   923 http.cpp:478] HTTP request for
'/master/state.json'
I0310 11:31:52.375326   917 master.cpp:3843] Sending 2 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:52.391376   919 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O7 ] on slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:52.391512   919 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S2 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:52.391659   921 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O8 ] on slave
20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
(10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:52.391751   921 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S0 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:55.062060   918 master.cpp:3611] Performing explicit task state
reconciliation for 1 tasks of framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:55.062588   919 master.cpp:3556] Performing implicit task state
reconciliation for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:56.140990   923 http.cpp:344] HTTP request for
'/master/redirect'
I0310 11:31:57.379288   918 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:57.430888   918 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O9 ] on slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:57.431068   918 master.hpp:877] Adding task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with
resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
I0310 11:31:57.431089   918 master.cpp:2503] Launching task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with
resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19)
I0310 11:31:57.431205   918 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S1 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:57.682133   919 master.cpp:3446] Forwarding status update
TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.682186   919 master.cpp:3418] Status update TASK_RUNNING
(UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 from slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19)
I0310 11:31:57.682199   919 master.cpp:4693] Updating the latest state of
task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
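
(For reference, the '/master/state.json' requests logged above are plain HTTP GETs against the leading master; the same endpoint can be queried by hand to see which frameworks and tasks the newly elected leader knows about, for example:

# ask the new leading Mesos master for its current view of frameworks/tasks
curl -s http://10.195.30.21:5050/master/state.json | python -m json.tool
)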


*From Mesos slave 10.195.30.21:*
I0310 11:31:28.750200  1074 slave.cpp:2623] master@10.195.30.19:5050 exited
W0310 11:31:28.750249  1074 slave.cpp:2626] Master disconnected! Waiting
for a new master to be elected
I0310 11:31:40.012516  1075 detector.cpp:138] Detected a new leader:
(id='2')
I0310 11:31:40.012899  1073 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
I0310 11:31:40.017143  1072 detector.cpp:433] A new leading master (UPID=
master@10.195.30.21:5050) is detected
I0310 11:31:40.017408  1072 slave.cpp:602] New master detected at
master@10.195.30.21:5050
I0310 11:31:40.017546  1076 status_update_manager.cpp:171] Pausing sending
status updates
I0310 11:31:40.018673  1072 slave.cpp:627] No credentials provided.
Attempting to register without authentication
I0310 11:31:40.018689  1072 slave.cpp:638] Detecting new master
I0310 11:31:40.785364  1075 slave.cpp:824] Re-registered with master
master@10.195.30.21:5050
I0310 11:31:40.785398  1075 status_update_manager.cpp:178] Resuming sending
status updates
I0310 11:32:10.639506  1075 slave.cpp:3321] Current usage 12.27%. Max
allowed age: 5.441217749539572days


*From Mesos slave 10.195.30.19:*
I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050 exited
W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected! Waiting
for a new master to be elected
I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader:
(id='2')
I0310 11:31:40.013530 24458 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master (UPID=
master@10.195.30.21:5050) is detected
I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at
master@10.195.30.21:5050
I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided.
Attempting to register without authentication
I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing sending
status updates
I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master
master@10.195.30.21:5050
I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming sending
status updates
I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 in work directory
'/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for
executor
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework '20150310-112310-354337546-5050-895-0000
I0310 11:31:57.434217 24461 docker.cpp:927] Starting container
'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and
executor
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of
framework '20150310-112310-354337546-5050-895-0000'
I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to
'/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
framework '20150310-112310-354337546-5050-895-0000' in container
'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for executor
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
10.195.30.19:56574
I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
executor
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update
TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
10.195.30.19:56574
I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received status
update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.678411 24461 status_update_manager.hpp:346] Checkpointing
UPDATE for status update TASK_RUNNING (UUID:
2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update
TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 to
master@10.195.30.21:5050
I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for
status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 to executor(1)@
10.195.30.19:56574
I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received status
update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.689028 24461 status_update_manager.hpp:346] Checkpointing
ACK for status update TASK_RUNNING (UUID:
2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to 204 at
/sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
I0310 11:31:57.755570 24461 docker.cpp:1333] Updated
'memory.soft_limit_in_bytes' to 160MB for container
a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
I0310 11:31:57.756013 24461 docker.cpp:1359] Updated
'memory.limit_in_bytes' to 160MB at
/sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max
allowed age: 5.555425200437824days
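
(To double-check that the container launched above survived the failover, the container ID from the slave log can be matched against the local Docker daemon; the Docker containerizer includes the Mesos container ID in the Docker container name, so grepping for the UUID from the log should find it:

# list running containers on this slave and look for the one started by Mesos
docker ps | grep a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
)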


*From previous Marathon leader 10.195.30.21:*
I0310 11:31:40.017248  1115 detector.cpp:138] Detected a new leader:
(id='2')
I0310 11:31:40.017334  1115 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
I0310 11:31:40.017727  1115 detector.cpp:433] A new leading master (UPID=
master@10.195.30.21:5050) is detected
[2015-03-10 11:31:40,017] WARN Disconnected
(mesosphere.marathon.MarathonScheduler:224)
[2015-03-10 11:31:40,019] INFO Abdicating
(mesosphere.marathon.MarathonSchedulerService:312)
[2015-03-10 11:31:40,019] INFO Defeat leadership
(mesosphere.marathon.MarathonSchedulerService:285)
[INFO] [03/10/2015 11:31:40.019] [marathon-akka.actor.default-dispatcher-6]
[akka://marathon/user/$b] POSTing to all endpoints.
[INFO] [03/10/2015 11:31:40.019] [marathon-akka.actor.default-dispatcher-5]
[akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
[2015-03-10 11:31:40,021] INFO Stopping driver
(mesosphere.marathon.MarathonSchedulerService:221)
I0310 11:31:40.022001  1115 sched.cpp:1286] Asked to stop the driver
[2015-03-10 11:31:40,024] INFO Setting framework ID to
20150310-112310-320783114-5050-24289-0000
(mesosphere.marathon.MarathonSchedulerService:73)
I0310 11:31:40.026274  1115 sched.cpp:234] New master detected at
master@10.195.30.21:5050
I0310 11:31:40.026418  1115 sched.cpp:242] No credentials provided.
Attempting to register without authentication
I0310 11:31:40.026458  1115 sched.cpp:752] Stopping framework
'20150310-112310-320783114-5050-24289-0000'
[2015-03-10 11:31:40,026] INFO Driver future completed. Executing optional
abdication command. (mesosphere.marathon.MarathonSchedulerService:192)
[2015-03-10 11:31:40,032] INFO Defeated (Leader Interface)
(mesosphere.marathon.MarathonSchedulerService:246)
[2015-03-10 11:31:40,032] INFO Defeat leadership
(mesosphere.marathon.MarathonSchedulerService:285)
[2015-03-10 11:31:40,032] INFO Stopping driver
(mesosphere.marathon.MarathonSchedulerService:221)
I0310 11:31:40.032588  1107 sched.cpp:1286] Asked to stop the driver
[2015-03-10 11:31:40,033] INFO Will offer leadership after 500 milliseconds
backoff (mesosphere.marathon.MarathonSchedulerService:334)
[2015-03-10 11:31:40,033] INFO Setting framework ID to
20150310-112310-320783114-5050-24289-0000
(mesosphere.marathon.MarathonSchedulerService:73)
[2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is not
a candidate for leader, current voting: [member_0000000001,
member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
[2015-03-10 11:31:40,552] INFO Using HA and therefore offering leadership
(mesosphere.marathon.MarathonSchedulerService:341)
[2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003
(com.twitter.common.zookeeper.Group:426)
[2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is not
a candidate for leader, current voting: [member_0000000001,
member_0000000002, member_0000000003]
(com.twitter.common.zookeeper.CandidateImpl:144)
[2015-03-10 11:31:40,568] INFO Candidate /marathon/leader/member_0000000003
waiting for the next leader election, current voting: [member_0000000001,
member_0000000002, member_0000000003]
(com.twitter.common.zookeeper.CandidateImpl:165)
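
(The member_000000000N entries above are the ZooKeeper election znodes used by Marathon; when in doubt, the current set of candidates can be listed directly with the standard ZooKeeper CLI, assuming zkCli.sh from the ZooKeeper distribution is available:

# list the Marathon leader-election candidates registered in ZooKeeper
zkCli.sh -server 10.195.30.19:2181 ls /marathon/leader
)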


*From new Marathon leader 10.195.30.20:*
[2015-03-10 11:31:40,029] INFO Candidate /marathon/leader/member_0000000001
is now leader of group: [member_0000000001, member_0000000002]
(com.twitter.common.zookeeper.CandidateImpl:152)
[2015-03-10 11:31:40,030] INFO Elected (Leader Interface)
(mesosphere.marathon.MarathonSchedulerService:253)
[2015-03-10 11:31:40,044] INFO Elect leadership
(mesosphere.marathon.MarathonSchedulerService:299)
[2015-03-10 11:31:40,044] INFO Running driver
(mesosphere.marathon.MarathonSchedulerService:184)
I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712: Client
environment:zookeeper.version=zookeeper C client 3.4.5
2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716: Client
environment:host.name=srv-d2u-9-virtip20
2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723: Client
environment:os.name=Linux
2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724: Client
environment:os.arch=3.13.0-44-generic
2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725: Client
environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733: Client
environment:user.name=(null)
2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741: Client
environment:user.home=/root
2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753: Client
environment:user.dir=/
2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786:
Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,
10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0 sessionId=0
sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
[2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff
(mesosphere.marathon.MarathonSchedulerService:329)
2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703:
initiated connection to server [10.195.30.19:2181]
2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750:
session establishment complete on server [10.195.30.19:2181],
sessionId=0x14c0335ad7e000d, negotiated timeout=10000
I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@
10.195.30.20:45771) connected to ZooKeeper
I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path '/mesos'
in ZooKeeper
[INFO] [03/10/2015 11:31:40.047] [marathon-akka.actor.default-dispatcher-2]
[akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader:
(id='2')
I0310 11:31:40.053530 22641 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
[2015-03-10 11:31:40,053] INFO Migration successfully applied for version
Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master (UPID=
master@10.195.30.21:5050) is detected
I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at
master@10.195.30.21:5050
I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided.
Attempting to register without authentication
I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with
20150310-112310-354337546-5050-895-0000
[2015-03-10 11:31:40,056] INFO Registered as
20150310-112310-354337546-5050-895-0000 to master
'20150310-112310-354337546-5050-895'
(mesosphere.marathon.MarathonScheduler:72)
[2015-03-10 11:31:40,063] INFO Stored framework ID
'20150310-112310-354337546-5050-895-0000'
(mesosphere.mesos.util.FrameworkIdUtil:49)
[INFO] [03/10/2015 11:31:40.065] [marathon-akka.actor.default-dispatcher-6]
[akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
[INFO] [03/10/2015 11:31:40.067] [marathon-akka.actor.default-dispatcher-7]
[akka://marathon/user/$b] POSTing to all endpoints.
...
...
...
[2015-03-10 11:31:55,052] INFO Syncing tasks for all apps
(mesosphere.marathon.SchedulerActions:403)
[INFO] [03/10/2015 11:31:55.053]
[marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
encountered. This logging can be turned off or adjusted with configuration
settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
[2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
Mesos master (mesosphere.marathon.SchedulerActions:430)
[2015-03-10 11:31:55,064] INFO Received status update for task
ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
TASK_LOST (Reconciliation: Task is unknown to the slave)
(mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:31:55,069] INFO Need to scale
/ffaas-backoffice-app-nopersist from 0 up to 1 instances
(mesosphere.marathon.SchedulerActions:488)
[2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
/ffaas-backoffice-app-nopersist (0 queued)
(mesosphere.marathon.SchedulerActions:494)
[2015-03-10 11:31:55,069] INFO Task
ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
expunged and removed from TaskTracker
(mesosphere.marathon.tasks.TaskTracker:107)
[2015-03-10 11:31:55,070] INFO Sending event notification.
(mesosphere.marathon.MarathonScheduler:262)
[INFO] [03/10/2015 11:31:55.072] [marathon-akka.actor.default-dispatcher-7]
[akka://marathon/user/$b] POSTing to all endpoints.
[2015-03-10 11:31:55,073] INFO Need to scale
/ffaas-backoffice-app-nopersist from 0 up to 1 instances
(mesosphere.marathon.SchedulerActions:488)
[2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
/ffaas-backoffice-app-nopersist. Not scaling.
(mesosphere.marathon.SchedulerActions:498)
...
...
...
[2015-03-10 11:31:57,682] INFO Received status update for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:31:57,694] INFO Sending event notification.
(mesosphere.marathon.MarathonScheduler:262)
[INFO] [03/10/2015 11:31:57.694]
[marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
POSTing to all endpoints.
...
...
...
[2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
(mesosphere.marathon.tasks.TaskTracker:170)
[INFO] [03/10/2015 11:36:55.050] [marathon-akka.actor.default-dispatcher-2]
[akka://marathon/deadLetters] Message
[mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
encountered. This logging can be turned off or adjusted with configuration
settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
[2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
(mesosphere.marathon.SchedulerActions:403)
[2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
Mesos master (mesosphere.marathon.SchedulerActions:430)
[2015-03-10 11:36:55,063] INFO Received status update for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
TASK_RUNNING (Reconciliation: Latest task state)
(mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:36:55,065] INFO Received status update for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
TASK_RUNNING (Reconciliation: Latest task state)
(mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:36:55,066] INFO Already running 1 instances of
/ffaas-backoffice-app-nopersist. Not scaling.
(mesosphere.marathon.SchedulerActions:512)



-- End of logs



2015-03-10 10:25 GMT+01:00 Adam Bordelon <ad...@mesosphere.io>:

> This is certainly not the expected/desired behavior when failing over a
> mesos master in HA mode. In addition to the master logs Alex requested, can
> you also provide relevant portions of the slave logs for these tasks? If
> the slave processes themselves never failed over, checkpointing and slave
> recovery should be irrelevant. Are you running the mesos-slave itself
> inside a Docker, or any other non-traditional setup?
>
> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
> defaults to "reconnect", and --strict defaults to true, so none of those
> are necessary.
>
> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <al...@mesosphere.io>
> wrote:
>
>> Geoffroy,
>>
>> could you please provide master logs (both from killed and taking over
>> masters)?

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

Posted by Adam Bordelon <ad...@mesosphere.io>.
This is certainly not the expected/desired behavior when failing over a
mesos master in HA mode. In addition to the master logs Alex requested, can
you also provide relevant portions of the slave logs for these tasks? If
the slave processes themselves never failed over, checkpointing and slave
recovery should be irrelevant. Are you running the mesos-slave itself
inside a Docker, or any other non-traditional setup?

FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
defaults to "reconnect", and --strict defaults to true, so none of those
are necessary.
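
In other words, the slave invocation from the original post could be trimmed to something like the following (same flags as before, minus the three that just restate defaults):

/usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos \
  --containerizers=docker,mesos --executor_registration_timeout=5mins \
  --hostname=10.195.30.19 --ip=10.195.30.19 \
  --isolation=cgroups/cpu,cgroups/mem --recovery_timeout=120mins \
  --resources=ports:[31000-32000,80,443]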

On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <al...@mesosphere.io> wrote:

> Geoffroy,
>
> could you please provide master logs (both from killed and taking over
> masters)?

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

Posted by Alex Rukletsov <al...@mesosphere.io>.
Geoffroy,

could you please provide master logs (both from killed and taking over
masters)?
