Posted to user@mesos.apache.org by Geoffroy Jabouley <ge...@gmail.com> on 2015/03/06 13:26:03 UTC
Weird behavior when stopping the mesos master leader of a HA mesos cluster
Hello
we are facing some unexpected issues while testing the high availability
behavior of our mesos cluster.
*Our use case:*
*State*: the mesos cluster is up (3 machines), 1 docker task is running on
each slave (started from marathon)
*Action*: stop the mesos master leader process
*Expected*: mesos master leader has changed, *active tasks remain unchanged*
*Seen*: mesos master leader has changed, *all active tasks are now FAILED
but docker containers are still running*, marathon detects FAILED tasks and
starts new tasks. We end with 2 docker containers running on each machine,
but only one is linked to a RUNNING mesos task.
Is the observed behavior correct?
Have we misunderstood the high availability concept? We thought that
performing this scenario would have no impact on the current cluster state
(apart from leader re-election).
Thanks in advance for your help
Regards
---------------------------------------------------
our setup is the following:
3 identical mesos nodes with:
+ zookeeper
+ docker 1.5
+ mesos master 0.21.1 configured in HA mode
+ mesos slave 0.21.1 configured with checkpointing, strict and reconnect
+ marathon 0.8.0 configured in HA mode with checkpointing
---------------------------------------------------
Command lines:
*mesos-master*
/usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
--cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
--quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
*mesos-slave*
/usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,
10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos
--executor_registration_timeout=5mins --hostname=10.195.30.19
--ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
--recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
*marathon*
java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
-Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
/usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
--local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
--hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,10.195.30.20:2181
,10.195.30.21:2181/mesos
Re: Weird behavior when stopping the mesos master leader of a HA
mesos cluster
Posted by Geoffroy Jabouley <ge...@gmail.com>.
Thanks a lot Dario for the workaround! It works fine and can be scripted
with ansible.
For the record, the github issue is available here:
https://github.com/mesosphere/marathon/issues/1292
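For anyone scripting the same workaround, the ordering can be sketched roughly as below: start one Marathon, wait until it shows up as an active framework on the master, then start the rest. This is a minimal Python sketch, not Marathon's or Ansible's actual code; the dict layout follows the Mesos 0.21 /master/state.json endpoint, and the polling helper is an assumption:

```python
import json
import time
import urllib.request

def marathon_framework_id(state):
    """Return the FrameworkID of an active 'marathon' framework, else None."""
    for fw in state.get("frameworks", []):
        if fw.get("name") == "marathon" and fw.get("active", False):
            return fw.get("id")
    return None

def wait_for_marathon(master_url, timeout=60):
    """Poll the master's /master/state.json until Marathon has registered."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(master_url + "/master/state.json") as resp:
            state = json.load(resp)
        fid = marathon_framework_id(state)
        if fid:
            return fid  # safe to start the remaining Marathon instances now
        time.sleep(2)
    raise TimeoutError("Marathon never registered with the master")
```

An Ansible play could start the first Marathon instance, call something like wait_for_marathon("http://10.195.30.19:5050"), and only then start the other two instances, avoiding the registration race.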
2015-03-12 17:27 GMT+01:00 Dario Rexin <da...@mesosphere.io>:
> Hi Geoffrey,
>
> we identified the issue and will fix it in Marathon 0.8.2. To prevent this
> behaviour for now, you just have to make sure that in a fresh setup
> (Marathon was never connected to Mesos) you first start up a single
> Marathon and let it register with Mesos and then start the other Marathon
> instances. The problem is a race in first registration with Mesos and
> fetching the FrameworkID from Zookeeper. Please let me know if the
> workaround does not help you.
>
> Cheers,
> Dario
>
> On 12 Mar 2015, at 09:20, Alex Rukletsov <al...@mesosphere.io> wrote:
>
> Geoffroy,
>
> yes, it looks like a marathon issue, so feel free to post it there as well.
>
> On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley <
> geoffroy.jabouley@gmail.com> wrote:
>
>> Thanks Alex for your answer. I will have a look.
>>
>> Would it be better to (cross-)post this discussion on the marathon
>> mailing list?
>>
>> Anyway, the issue is "fixed" for 0.8.0, which is the version I'm using.
>>
>> 2015-03-11 22:18 GMT+01:00 Alex Rukletsov <al...@mesosphere.io>:
>>
>>> Geoffroy,
>>>
>>> most probably you're hitting this bug:
>>> https://github.com/mesosphere/marathon/issues/1063. The problem is that
>>> Marathon can register instead of re-registering when a master fails
>>> over. From the master's point of view, it's a new framework, which is why the
>>> previous task is gone and a new one (that technically belongs to a new
>>> framework) is started. You can see that frameworks have two different IDs
>>> (check lines 11:31:40.055496 and 11:31:40.785038) in your example.
>>>
>>> Hope that helps,
>>> Alex
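The register-vs-re-register distinction Alex describes hinges on whether the scheduler presents an existing FrameworkID when it connects to the new leader. A minimal illustration of the rule (a hypothetical helper, not Mesos source code):

```python
def registration_kind(framework_info):
    """How a Mesos master classifies an incoming scheduler connection.

    A scheduler that presents an existing FrameworkID is re-registered as
    the same framework; one that connects without an ID is registered as a
    brand-new framework and receives a fresh FrameworkID. The Marathon bug
    discussed here is a race in which Marathon connects before it has
    fetched its stored FrameworkID from ZooKeeper, so the master registers
    it as new and the old framework's tasks become orphans.
    """
    return "re-register" if framework_info.get("id") else "register"
```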
>>>
>>> On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley <
>>> geoffroy.jabouley@gmail.com> wrote:
>>>
>>>> Hello
>>>>
>>>> thanks for your interest. Following are the requested logs, which will
>>>> result in a pretty big mail.
>>>>
>>>> Mesos/Marathon are *NOT running inside docker*, we only use Docker as
>>>> our mesos containerizer.
>>>>
>>>> For reminder, here is the use case performed to get the logs file:
>>>>
>>>> --------------------------------
>>>>
>>>> Our cluster: 3 identical mesos nodes with:
>>>> + zookeeper
>>>> + docker 1.5
>>>> + mesos master 0.21.1 configured in HA mode
>>>> + mesos slave 0.21.1 configured with checkpointing, strict and
>>>> reconnect
>>>> + marathon 0.8.0 configured in HA mode with checkpointing
>>>>
>>>> --------------------------------
>>>>
>>>> *Begin State: *
>>>> + the mesos cluster is up (3 machines)
>>>> + mesos master leader is 10.195.30.19
>>>> + marathon leader is 10.195.30.21
>>>> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
>>>>
>>>> *Action*: stop the mesos master leader process (sudo stop mesos-master)
>>>>
>>>> *Expected*: mesos master leader has changed, active tasks / frameworks
>>>> remain unchanged
>>>>
>>>> *End state: *
>>>> + mesos master leader *has changed, now 10.195.30.21*
>>>> + previously running APPTASK on the slave 10.195.30.21 has "disappeared"
>>>> (not showing anymore on the mesos UI), but *docker container is still
>>>> running*
>>>> + a *new APPTASK is now running on slave 10.195.30.19*
>>>> + marathon framework "registration time" in mesos UI shows "Just now"
>>>> + marathon leader *has changed, now 10.195.30.20*
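One way to see the duplication described above from the master's side is to count TASK_RUNNING tasks per slave in /master/state.json and compare against `docker ps` on each node. A minimal sketch (the helper name is made up; the dict shapes follow the Mesos 0.21 state.json layout):

```python
def running_tasks_by_slave(state):
    """Count TASK_RUNNING tasks per slave_id in a master /state.json dict."""
    counts = {}
    for fw in state.get("frameworks", []):
        for task in fw.get("tasks", []):
            if task.get("state") == "TASK_RUNNING":
                sid = task.get("slave_id")
                counts[sid] = counts.get(sid, 0) + 1
    return counts
```

If a node shows more running containers in `docker ps` than this reports for its slave_id, it is carrying orphaned containers like the old APPTASK here.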
>>>>
>>>>
>>>> --------------------------------
>>>>
>>>> Now come the 6 requested logs, which might contain
>>>> interesting/relevant information, but as a newcomer to mesos I find them
>>>> hard to read...
>>>>
>>>>
>>>> *from previous MESOS master leader 10.195.30.19:*
>>>> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal
>>>> SIGTERM from process 1 of user 0; exiting
>>>>
>>>>
>>>> *from new MESOS master leader 10.195.30.21:*
>>>> I0310 11:31:40.011545 922 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.011823 922 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.015496 915 network.hpp:424] ZooKeeper group
>>>> memberships changed
>>>> I0310 11:31:40.015847 915 group.cpp:659] Trying to get
>>>> '/mesos/log_replicas/0000000000' in ZooKeeper
>>>> I0310 11:31:40.016047 922 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.016074 922 master.cpp:1263] The newly elected leader
>>>> is master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
>>>> I0310 11:31:40.016089 922 master.cpp:1276] Elected as the leading
>>>> master!
>>>> I0310 11:31:40.016108 922 master.cpp:1094] Recovering from registrar
>>>> I0310 11:31:40.016188 918 registrar.cpp:313] Recovering registrar
>>>> I0310 11:31:40.016542 918 log.cpp:656] Attempting to start the writer
>>>> I0310 11:31:40.016918 918 replica.cpp:474] Replica received implicit
>>>> promise request with proposal 2
>>>> I0310 11:31:40.017503 915 group.cpp:659] Trying to get
>>>> '/mesos/log_replicas/0000000003' in ZooKeeper
>>>> I0310 11:31:40.017832 918 leveldb.cpp:306] Persisting metadata (8
>>>> bytes) to leveldb took 893672ns
>>>> I0310 11:31:40.017848 918 replica.cpp:342] Persisted promised to 2
>>>> I0310 11:31:40.018817 915 network.hpp:466] ZooKeeper group PIDs: {
>>>> log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
>>>> I0310 11:31:40.023022 923 coordinator.cpp:230] Coordinator attemping
>>>> to fill missing position
>>>> I0310 11:31:40.023110 923 log.cpp:672] Writer started with ending
>>>> position 8
>>>> I0310 11:31:40.023293 923 leveldb.cpp:438] Reading position from
>>>> leveldb took 13195ns
>>>> I0310 11:31:40.023309 923 leveldb.cpp:438] Reading position from
>>>> leveldb took 3120ns
>>>> I0310 11:31:40.023619 922 registrar.cpp:346] Successfully fetched the
>>>> registry (610B) in 7.385856ms
>>>> I0310 11:31:40.023679 922 registrar.cpp:445] Applied 1 operations in
>>>> 9263ns; attempting to update the 'registry'
>>>> I0310 11:31:40.024238 922 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.024279 923 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 9
>>>> I0310 11:31:40.024435 923 replica.cpp:508] Replica received write
>>>> request for position 9
>>>> I0310 11:31:40.025707 923 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 1.259338ms
>>>> I0310 11:31:40.025722 923 replica.cpp:676] Persisted action at 9
>>>> I0310 11:31:40.026074 923 replica.cpp:655] Replica received learned
>>>> notice for position 9
>>>> I0310 11:31:40.026495 923 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 404795ns
>>>> I0310 11:31:40.026507 923 replica.cpp:676] Persisted action at 9
>>>> I0310 11:31:40.026511 923 replica.cpp:661] Replica learned APPEND
>>>> action at position 9
>>>> I0310 11:31:40.026726 923 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 3.029248ms
>>>> I0310 11:31:40.026765 923 registrar.cpp:376] Successfully recovered
>>>> registrar
>>>> I0310 11:31:40.026814 923 log.cpp:699] Attempting to truncate the log
>>>> to 9
>>>> I0310 11:31:40.026880 923 master.cpp:1121] Recovered 3 slaves from
>>>> the Registry (608B) ; allowing 1days for slaves to re-register
>>>> I0310 11:31:40.026897 923 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 10
>>>> I0310 11:31:40.026988 923 replica.cpp:508] Replica received write
>>>> request for position 10
>>>> I0310 11:31:40.027640 923 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 641018ns
>>>> I0310 11:31:40.027652 923 replica.cpp:676] Persisted action at 10
>>>> I0310 11:31:40.030848 923 replica.cpp:655] Replica received learned
>>>> notice for position 10
>>>> I0310 11:31:40.031883 923 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 1.008914ms
>>>> I0310 11:31:40.031963 923 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 46724ns
>>>> I0310 11:31:40.031977 923 replica.cpp:676] Persisted action at 10
>>>> I0310 11:31:40.031986 923 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 10
>>>> I0310 11:31:40.055415 918 master.cpp:1383] Received registration
>>>> request for framework 'marathon' at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.055496 918 master.cpp:1447] Registering framework
>>>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.055642 919 hierarchical_allocator_process.hpp:329]
>>>> Added framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.189151 919 master.cpp:3246] Re-registering slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19)
>>>> I0310 11:31:40.189280 919 registrar.cpp:445] Applied 1 operations in
>>>> 15452ns; attempting to update the 'registry'
>>>> I0310 11:31:40.189949 919 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.189978 919 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 11
>>>> I0310 11:31:40.190112 919 replica.cpp:508] Replica received write
>>>> request for position 11
>>>> I0310 11:31:40.190563 919 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 437440ns
>>>> I0310 11:31:40.190577 919 replica.cpp:676] Persisted action at 11
>>>> I0310 11:31:40.191249 921 replica.cpp:655] Replica received learned
>>>> notice for position 11
>>>> I0310 11:31:40.192159 921 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 892767ns
>>>> I0310 11:31:40.192178 921 replica.cpp:676] Persisted action at 11
>>>> I0310 11:31:40.192184 921 replica.cpp:661] Replica learned APPEND
>>>> action at position 11
>>>> I0310 11:31:40.192350 921 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 3.0528ms
>>>> I0310 11:31:40.192387 919 log.cpp:699] Attempting to truncate the log
>>>> to 11
>>>> I0310 11:31:40.192415 919 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 12
>>>> I0310 11:31:40.192539 915 replica.cpp:508] Replica received write
>>>> request for position 12
>>>> I0310 11:31:40.192600 921 master.cpp:3314] Re-registered slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148
>>>> I0310 11:31:40.192680 917 hierarchical_allocator_process.hpp:442]
>>>> Added slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>>>> available)
>>>> I0310 11:31:40.192847 917 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.193164 915 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 610664ns
>>>> I0310 11:31:40.193181 915 replica.cpp:676] Persisted action at 12
>>>> I0310 11:31:40.193568 915 replica.cpp:655] Replica received learned
>>>> notice for position 12
>>>> I0310 11:31:40.193948 915 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 364062ns
>>>> I0310 11:31:40.193979 915 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 12256ns
>>>> I0310 11:31:40.193985 915 replica.cpp:676] Persisted action at 12
>>>> I0310 11:31:40.193990 915 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 12
>>>> I0310 11:31:40.248615 915 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O0 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.248744 915 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.774416 915 master.cpp:3246] Re-registering slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21)
>>>> I0310 11:31:40.774976 915 registrar.cpp:445] Applied 1 operations in
>>>> 42342ns; attempting to update the 'registry'
>>>> I0310 11:31:40.777273 921 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.777436 921 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 13
>>>> I0310 11:31:40.777989 921 replica.cpp:508] Replica received write
>>>> request for position 13
>>>> I0310 11:31:40.779558 921 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 1.513714ms
>>>> I0310 11:31:40.779633 921 replica.cpp:676] Persisted action at 13
>>>> I0310 11:31:40.781821 919 replica.cpp:655] Replica received learned
>>>> notice for position 13
>>>> I0310 11:31:40.784417 919 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 2.542036ms
>>>> I0310 11:31:40.784446 919 replica.cpp:676] Persisted action at 13
>>>> I0310 11:31:40.784452 919 replica.cpp:661] Replica learned APPEND
>>>> action at position 13
>>>> I0310 11:31:40.784711 920 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 9.68192ms
>>>> I0310 11:31:40.784762 917 log.cpp:699] Attempting to truncate the log
>>>> to 13
>>>> I0310 11:31:40.784808 920 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 14
>>>> I0310 11:31:40.784865 917 master.hpp:877] Adding task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with
>>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
>>>> I0310 11:31:40.784955 919 replica.cpp:508] Replica received write
>>>> request for position 14
>>>> W0310 11:31:40.785038 917 master.cpp:4468] Possibly orphaned task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of
>>>> framework 20150310-112310-320783114-5050-24289-0000 running on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21)
>>>> I0310 11:31:40.785105 917 master.cpp:3314] Re-registered slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148
>>>> I0310 11:31:40.785162 920 hierarchical_allocator_process.hpp:442]
>>>> Added slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>>> ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148
>>>> available)
>>>> I0310 11:31:40.785679 921 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.786429 919 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 1.454211ms
>>>> I0310 11:31:40.786455 919 replica.cpp:676] Persisted action at 14
>>>> I0310 11:31:40.786782 919 replica.cpp:655] Replica received learned
>>>> notice for position 14
>>>> I0310 11:31:40.787833 919 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 1.027014ms
>>>> I0310 11:31:40.787873 919 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 14085ns
>>>> I0310 11:31:40.787883 919 replica.cpp:676] Persisted action at 14
>>>> I0310 11:31:40.787889 919 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 14
>>>> I0310 11:31:40.792536 922 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O1 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.792670 922 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.819602 921 master.cpp:3246] Re-registering slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20)
>>>> I0310 11:31:40.819736 921 registrar.cpp:445] Applied 1 operations in
>>>> 16656ns; attempting to update the 'registry'
>>>> I0310 11:31:40.820617 921 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.820726 918 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 15
>>>> I0310 11:31:40.820938 918 replica.cpp:508] Replica received write
>>>> request for position 15
>>>> I0310 11:31:40.821641 918 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 670583ns
>>>> I0310 11:31:40.821663 918 replica.cpp:676] Persisted action at 15
>>>> I0310 11:31:40.822265 917 replica.cpp:655] Replica received learned
>>>> notice for position 15
>>>> I0310 11:31:40.823463 917 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 1.178687ms
>>>> I0310 11:31:40.823490 917 replica.cpp:676] Persisted action at 15
>>>> I0310 11:31:40.823498 917 replica.cpp:661] Replica learned APPEND
>>>> action at position 15
>>>> I0310 11:31:40.823755 917 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 3.97696ms
>>>> I0310 11:31:40.823823 917 log.cpp:699] Attempting to truncate the log
>>>> to 15
>>>> I0310 11:31:40.824147 922 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 16
>>>> I0310 11:31:40.824482 922 hierarchical_allocator_process.hpp:442]
>>>> Added slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>>>> available)
>>>> I0310 11:31:40.824597 921 replica.cpp:508] Replica received write
>>>> request for position 16
>>>> I0310 11:31:40.824128 917 master.cpp:3314] Re-registered slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148
>>>> I0310 11:31:40.824975 917 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.831900 921 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 7.228682ms
>>>> I0310 11:31:40.832031 921 replica.cpp:676] Persisted action at 16
>>>> I0310 11:31:40.832456 917 replica.cpp:655] Replica received learned
>>>> notice for position 16
>>>> I0310 11:31:40.835178 917 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 2.674392ms
>>>> I0310 11:31:40.835297 917 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 45220ns
>>>> I0310 11:31:40.835322 917 replica.cpp:676] Persisted action at 16
>>>> I0310 11:31:40.835341 917 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 16
>>>> I0310 11:31:40.838281 923 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O2 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.838389 923 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.948725 919 http.cpp:344] HTTP request for
>>>> '/master/redirect'
>>>> I0310 11:31:41.479118 918 http.cpp:478] HTTP request for
>>>> '/master/state.json'
>>>> I0310 11:31:45.368074 918 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:45.385144 917 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O3 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:45.385292 917 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:46.368450 917 master.cpp:3843] Sending 2 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:46.375222 920 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O4 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:46.375360 920 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O5 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:46.375530 920 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:46.375599 920 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:48.031230 915 http.cpp:478] HTTP request for
>>>> '/master/state.json'
>>>> I0310 11:31:51.374285 922 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:51.379391 921 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O6 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:51.379487 921 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:51.482094 923 http.cpp:478] HTTP request for
>>>> '/master/state.json'
>>>> I0310 11:31:52.375326 917 master.cpp:3843] Sending 2 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:52.391376 919 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O7 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:52.391512 919 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:52.391659 921 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O8 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:52.391751 921 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:55.062060 918 master.cpp:3611] Performing explicit task
>>>> state reconciliation for 1 tasks of framework
>>>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:55.062588 919 master.cpp:3556] Performing implicit task
>>>> state reconciliation for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:56.140990 923 http.cpp:344] HTTP request for
>>>> '/master/redirect'
>>>> I0310 11:31:57.379288 918 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:57.430888 918 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O9 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:57.431068 918 master.hpp:877] Adding task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with
>>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
>>>> I0310 11:31:57.431089 918 master.cpp:2503] Launching task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with
>>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19)
>>>> I0310 11:31:57.431205 918 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.682133 919 master.cpp:3446] Forwarding status update
>>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.682186 919 master.cpp:3418] Status update TASK_RUNNING
>>>> (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 from slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19)
>>>> I0310 11:31:57.682199 919 master.cpp:4693] Updating the latest state
>>>> of task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799
>>>> of framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
>>>>
>>>>
>>>> *from MESOS slave 10.195.30.21*
>>>> I0310 11:31:28.750200 1074 slave.cpp:2623] master@10.195.30.19:5050
>>>> exited
>>>> W0310 11:31:28.750249 1074 slave.cpp:2626] Master disconnected!
>>>> Waiting for a new master to be elected
>>>> I0310 11:31:40.012516 1075 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.012899 1073 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.017143 1072 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.017408 1072 slave.cpp:602] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.017546 1076 status_update_manager.cpp:171] Pausing
>>>> sending status updates
>>>> I0310 11:31:40.018673 1072 slave.cpp:627] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.018689 1072 slave.cpp:638] Detecting new master
>>>> I0310 11:31:40.785364 1075 slave.cpp:824] Re-registered with master
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.785398 1075 status_update_manager.cpp:178] Resuming
>>>> sending status updates
>>>> I0310 11:32:10.639506 1075 slave.cpp:3321] Current usage 12.27%. Max
>>>> allowed age: 5.441217749539572days
>>>>
>>>>
>>>> *from MESOS slave 10.195.30.19*
>>>> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050
>>>> exited
>>>> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected!
>>>> Waiting for a new master to be elected
>>>> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
>>>> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing
>>>> sending status updates
>>>> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming
>>>> sending status updates
>>>> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 in work directory
>>>> '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>>>> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for
>>>> executor
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework '20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container
>>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and
>>>> executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of
>>>> framework '20150310-112310-354337546-5050-895-0000'
>>>> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to
>>>> '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
>>>> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>>> framework '20150310-112310-354337546-5050-895-0000' in container
>>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>>>> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for
>>>> executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>>>> 10.195.30.19:56574
>>>> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>>> executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update
>>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>>>> 10.195.30.19:56574
>>>> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received
>>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>>>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.678411 24461 status_update_manager.hpp:346]
>>>> Checkpointing UPDATE for status update TASK_RUNNING (UUID:
>>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update
>>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 to
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for
>>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>>>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 to executor(1)@
>>>> 10.195.30.19:56574
>>>> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received
>>>> status update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e)
>>>> for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.689028 24461 status_update_manager.hpp:346]
>>>> Checkpointing ACK for status update TASK_RUNNING (UUID:
>>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to
>>>> 204 at
>>>> /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>>> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated
>>>> 'memory.soft_limit_in_bytes' to 160MB for container
>>>> a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>>> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated
>>>> 'memory.limit_in_bytes' to 160MB at
>>>> /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>>> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max
>>>> allowed age: 5.555425200437824days
>>>>
>>>>
>>>> *From previous Marathon leader 10.195.30.21:*
>>>> I0310 11:31:40.017248 1115 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.017334 1115 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.017727 1115 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> [2015-03-10 11:31:40,017] WARN Disconnected
>>>> (mesosphere.marathon.MarathonScheduler:224)
>>>> [2015-03-10 11:31:40,019] INFO Abdicating
>>>> (mesosphere.marathon.MarathonSchedulerService:312)
>>>> [2015-03-10 11:31:40,019] INFO Defeat leadership
>>>> (mesosphere.marathon.MarathonSchedulerService:285)
>>>> [INFO] [03/10/2015 11:31:40.019]
>>>> [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> [INFO] [03/10/2015 11:31:40.019]
>>>> [marathon-akka.actor.default-dispatcher-5] [
>>>> akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
>>>> [2015-03-10 11:31:40,021] INFO Stopping driver
>>>> (mesosphere.marathon.MarathonSchedulerService:221)
>>>> I0310 11:31:40.022001 1115 sched.cpp:1286] Asked to stop the driver
>>>> [2015-03-10 11:31:40,024] INFO Setting framework ID to
>>>> 20150310-112310-320783114-5050-24289-0000
>>>> (mesosphere.marathon.MarathonSchedulerService:73)
>>>> I0310 11:31:40.026274 1115 sched.cpp:234] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.026418 1115 sched.cpp:242] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.026458 1115 sched.cpp:752] Stopping framework
>>>> '20150310-112310-320783114-5050-24289-0000'
>>>> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing
>>>> optional abdication command.
>>>> (mesosphere.marathon.MarathonSchedulerService:192)
>>>> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface)
>>>> (mesosphere.marathon.MarathonSchedulerService:246)
>>>> [2015-03-10 11:31:40,032] INFO Defeat leadership
>>>> (mesosphere.marathon.MarathonSchedulerService:285)
>>>> [2015-03-10 11:31:40,032] INFO Stopping driver
>>>> (mesosphere.marathon.MarathonSchedulerService:221)
>>>> I0310 11:31:40.032588 1107 sched.cpp:1286] Asked to stop the driver
>>>> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500
>>>> milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334)
>>>> [2015-03-10 11:31:40,033] INFO Setting framework ID to
>>>> 20150310-112310-320783114-5050-24289-0000
>>>> (mesosphere.marathon.MarathonSchedulerService:73)
>>>> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is
>>>> not a candidate for leader, current voting: [member_0000000001,
>>>> member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
>>>> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering
>>>> leadership (mesosphere.marathon.MarathonSchedulerService:341)
>>>> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003
>>>> (com.twitter.common.zookeeper.Group:426)
>>>> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is
>>>> not a candidate for leader, current voting: [member_0000000001,
>>>> member_0000000002, member_0000000003]
>>>> (com.twitter.common.zookeeper.CandidateImpl:144)
>>>> [2015-03-10 11:31:40,568] INFO Candidate
>>>> /marathon/leader/member_0000000003 waiting for the next leader election,
>>>> current voting: [member_0000000001, member_0000000002, member_0000000003]
>>>> (com.twitter.common.zookeeper.CandidateImpl:165)
>>>>
>>>>
>>>> *From new Marathon leader 10.195.30.20:*
>>>> [2015-03-10 11:31:40,029] INFO Candidate
>>>> /marathon/leader/member_0000000001 is now leader of group:
>>>> [member_0000000001, member_0000000002]
>>>> (com.twitter.common.zookeeper.CandidateImpl:152)
>>>> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface)
>>>> (mesosphere.marathon.MarathonSchedulerService:253)
>>>> [2015-03-10 11:31:40,044] INFO Elect leadership
>>>> (mesosphere.marathon.MarathonSchedulerService:299)
>>>> [2015-03-10 11:31:40,044] INFO Running driver
>>>> (mesosphere.marathon.MarathonSchedulerService:184)
>>>> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712:
>>>> Client environment:zookeeper.version=zookeeper C client 3.4.5
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716:
>>>> Client environment:host.name=srv-d2u-9-virtip20
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723:
>>>> Client environment:os.name=Linux
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724:
>>>> Client environment:os.arch=3.13.0-44-generic
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725:
>>>> Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733:
>>>> Client environment:user.name=(null)
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741:
>>>> Client environment:user.home=/root
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753:
>>>> Client environment:user.dir=/
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786:
>>>> Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,
>>>> 10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0
>>>> sessionId=0 sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
>>>> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff
>>>> (mesosphere.marathon.MarathonSchedulerService:329)
>>>> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703:
>>>> initiated connection to server [10.195.30.19:2181]
>>>> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750:
>>>> session establishment complete on server [10.195.30.19:2181],
>>>> sessionId=0x14c0335ad7e000d, negotiated timeout=10000
>>>> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@
>>>> 10.195.30.20:45771) connected to ZooKeeper
>>>> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations:
>>>> queue size (joins, cancels, datas) = (0, 0, 0)
>>>> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path
>>>> '/mesos' in ZooKeeper
>>>> [INFO] [03/10/2015 11:31:40.047]
>>>> [marathon-akka.actor.default-dispatcher-2] [
>>>> akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
>>>> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> [2015-03-10 11:31:40,053] INFO Migration successfully applied for
>>>> version Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
>>>> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with
>>>> 20150310-112310-354337546-5050-895-0000
>>>> [2015-03-10 11:31:40,056] INFO Registered as
>>>> 20150310-112310-354337546-5050-895-0000 to master
>>>> '20150310-112310-354337546-5050-895'
>>>> (mesosphere.marathon.MarathonScheduler:72)
>>>> [2015-03-10 11:31:40,063] INFO Stored framework ID
>>>> '20150310-112310-354337546-5050-895-0000'
>>>> (mesosphere.mesos.util.FrameworkIdUtil:49)
>>>> [INFO] [03/10/2015 11:31:40.065]
>>>> [marathon-akka.actor.default-dispatcher-6] [
>>>> akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
>>>> [INFO] [03/10/2015 11:31:40.067]
>>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> ...
>>>> ...
>>>> ...
>>>> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps
>>>> (mesosphere.marathon.SchedulerActions:403)
>>>> [INFO] [03/10/2015 11:31:55.053]
>>>> [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
>>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>>>> Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
>>>> encountered. This logging can be turned off or adjusted with configuration
>>>> settings 'akka.log-dead-letters' and
>>>> 'akka.log-dead-letters-during-shutdown'.
>>>> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
>>>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>>>> [2015-03-10 11:31:55,064] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
>>>> TASK_LOST (Reconciliation: Task is unknown to the slave)
>>>> (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:31:55,069] INFO Need to scale
>>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>>>> (mesosphere.marathon.SchedulerActions:488)
>>>> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
>>>> /ffaas-backoffice-app-nopersist (0 queued)
>>>> (mesosphere.marathon.SchedulerActions:494)
>>>> [2015-03-10 11:31:55,069] INFO Task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
>>>> expunged and removed from TaskTracker
>>>> (mesosphere.marathon.tasks.TaskTracker:107)
>>>> [2015-03-10 11:31:55,070] INFO Sending event notification.
>>>> (mesosphere.marathon.MarathonScheduler:262)
>>>> [INFO] [03/10/2015 11:31:55.072]
>>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> [2015-03-10 11:31:55,073] INFO Need to scale
>>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>>>> (mesosphere.marathon.SchedulerActions:488)
>>>> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
>>>> /ffaas-backoffice-app-nopersist. Not scaling.
>>>> (mesosphere.marathon.SchedulerActions:498)
>>>> ...
>>>> ...
>>>> ...
>>>> [2015-03-10 11:31:57,682] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>>> TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:31:57,694] INFO Sending event notification.
>>>> (mesosphere.marathon.MarathonScheduler:262)
>>>> [INFO] [03/10/2015 11:31:57.694]
>>>> [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> ...
>>>> ...
>>>> ...
>>>> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
>>>> (mesosphere.marathon.tasks.TaskTracker:170)
>>>> [INFO] [03/10/2015 11:36:55.050]
>>>> [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters]
>>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>>>> Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
>>>> encountered. This logging can be turned off or adjusted with configuration
>>>> settings 'akka.log-dead-letters' and
>>>> 'akka.log-dead-letters-during-shutdown'.
>>>> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
>>>> (mesosphere.marathon.SchedulerActions:403)
>>>> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
>>>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>>>> [2015-03-10 11:36:55,063] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>>> TASK_RUNNING (Reconciliation: Latest task state)
>>>> (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:36:55,065] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>>> TASK_RUNNING (Reconciliation: Latest task state)
>>>> (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:36:55,066] INFO Already running 1 instances of
>>>> /ffaas-backoffice-app-nopersist. Not scaling.
>>>> (mesosphere.marathon.SchedulerActions:512)
>>>>
>>>>
>>>>
>>>> -- End of logs
>>>>
>>>>
>>>>
>>>> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <ad...@mesosphere.io>:
>>>>
>>>>> This is certainly not the expected/desired behavior when failing over
>>>>> a mesos master in HA mode. In addition to the master logs Alex requested,
>>>>> can you also provide relevant portions of the slave logs for these tasks?
>>>>> If the slave processes themselves never failed over, checkpointing and
>>>>> slave recovery should be irrelevant. Are you running the mesos-slave itself
>>>>> inside a Docker, or any other non-traditional setup?
>>>>>
>>>>> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
>>>>> defaults to "reconnect", and --strict defaults to true, so none of those
>>>>> are necessary.
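
Given those defaults, the mesos-slave command line from the setup above can be trimmed. A minimal equivalent sketch (same IPs and values as in this thread, assuming 0.21.x defaults; verify against your own deployment):

```shell
# Sketch only: the slave invocation with the redundant 0.21.x defaults removed.
# --checkpoint, --recover=reconnect and --strict are already the defaults,
# so dropping them does not change behavior.
/usr/sbin/mesos-slave \
  --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos \
  --containerizers=docker,mesos \
  --executor_registration_timeout=5mins \
  --hostname=10.195.30.19 \
  --ip=10.195.30.19 \
  --isolation=cgroups/cpu,cgroups/mem \
  --recovery_timeout=120mins \
  --resources='ports:[31000-32000,80,443]'
```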
>>>>>
>>>>> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <al...@mesosphere.io>
>>>>> wrote:
>>>>>
>>>>>> Geoffroy,
>>>>>>
>>>>>> could you please provide master logs (both from killed and taking
>>>>>> over masters)?
>>>>>>
>>>>>> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <
>>>>>> geoffroy.jabouley@gmail.com> wrote:
>>>>>>
>>>>>>> Hello
>>>>>>>
>>>>>>> we are facing some unexpected issues when testing high availability
>>>>>>> behaviors of our mesos cluster.
>>>>>>>
>>>>>>> *Our use case:*
>>>>>>>
>>>>>>> *State*: the mesos cluster is up (3 machines), 1 docker task is
>>>>>>> running on each slave (started from marathon)
>>>>>>>
>>>>>>> *Action*: stop the mesos master leader process
>>>>>>>
>>>>>>> *Expected*: mesos master leader has changed, *active tasks remain
>>>>>>> unchanged*
>>>>>>>
>>>>>>> *Seen*: mesos master leader has changed, *all active tasks are now
>>>>>>> FAILED but docker containers are still running*, marathon detects
>>>>>>> FAILED tasks and starts new tasks. We end with 2 docker containers running
>>>>>>> on each machine, but only one is linked to a RUNNING mesos task.
>>>>>>>
>>>>>>>
>>>>>>> Is the seen behavior correct?
>>>>>>>
>>>>>>> Have we misunderstood the high availability concept? We thought that
>>>>>>> doing this use case would not have any impact on the current cluster state
>>>>>>> (except leader re-election)
>>>>>>>
>>>>>>> Thanks in advance for your help
>>>>>>> Regards
>>>>>>>
>>>>>>> ---------------------------------------------------
>>>>>>>
>>>>>>> our setup is the following:
>>>>>>> 3 identical mesos nodes with:
>>>>>>> + zookeeper
>>>>>>> + docker 1.5
>>>>>>> + mesos master 0.21.1 configured in HA mode
>>>>>>> + mesos slave 0.21.1 configured with checkpointing, strict and
>>>>>>> reconnect
>>>>>>> + marathon 0.8.0 configured in HA mode with checkpointing
>>>>>>>
>>>>>>> ---------------------------------------------------
>>>>>>>
>>>>>>> Command lines:
>>>>>>>
>>>>>>>
>>>>>>> *mesos-master*usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
>>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
>>>>>>> --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
>>>>>>> --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
>>>>>>>
>>>>>>> *mesos-slave*
>>>>>>> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,
>>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint
>>>>>>> --containerizers=docker,mesos --executor_registration_timeout=5mins
>>>>>>> --hostname=10.195.30.19 --ip=10.195.30.19
>>>>>>> --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
>>>>>>> --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>>>>>>>
>>>>>>> *marathon*
>>>>>>> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
>>>>>>> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
>>>>>>> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
>>>>>>> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
>>>>>>> --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
>>>>>>> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
>>>>>>> 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
>>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster
Posted by Dario Rexin <da...@mesosphere.io>.
Hi Geoffroy,
we identified the issue and will fix it in Marathon 0.8.2. To prevent this behaviour for now, make sure that in a fresh setup (one where Marathon has never connected to Mesos) you first start a single Marathon instance and let it register with Mesos, and only then start the other Marathon instances. The problem is a race between the first registration with Mesos and fetching the framework ID from ZooKeeper. Please let me know if the workaround does not help you.
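
For intuition, here is a toy Python sketch of that race (an illustrative model, not Marathon's actual code; all class and variable names are made up). Each scheduler instance fetches the stored framework ID, registers with the master, then stores the ID it received. If two fresh instances both read an empty ID before either has registered, the master creates two distinct frameworks — which is exactly why the old task appears FAILED while its container keeps running:

```python
# Toy model of the Marathon 0.8.x first-registration race (assumption:
# heavily simplified; only the ID bookkeeping is modeled).

class FakeZk:
    """Stands in for the ZooKeeper node holding the framework ID."""
    def __init__(self):
        self.framework_id = None

class FakeMaster:
    """Stands in for the Mesos master's framework registry."""
    def __init__(self):
        self.next_id = 0
        self.frameworks = set()

    def register(self, framework_id=None):
        # Re-registration with a known ID keeps the framework and its tasks.
        if framework_id in self.frameworks:
            return framework_id
        # Registration without a known ID creates a brand-new framework,
        # orphaning any tasks that belonged to the old one.
        new_id = "framework-%d" % self.next_id
        self.next_id += 1
        self.frameworks.add(new_id)
        return new_id

def connect(zk, master):
    stored = zk.framework_id          # step 1: fetch stored ID
    fid = master.register(stored)     # step 2: (re-)register
    zk.framework_id = fid             # step 3: store the ID
    return fid

# Racy startup: both instances read the (still empty) stored ID before
# either has registered, so each registers as a new framework.
zk, master = FakeZk(), FakeMaster()
stored_a = zk.framework_id
stored_b = zk.framework_id
id_a = master.register(stored_a)
id_b = master.register(stored_b)
assert id_a != id_b                   # two frameworks -> duplicated tasks

# Workaround from this thread: start ONE instance first, let it register
# and store its ID, then start the others.
zk2, master2 = FakeZk(), FakeMaster()
first = connect(zk2, master2)
second = connect(zk2, master2)
assert first == second                # re-registration, tasks preserved
```

The workaround serializes step 1 through 3 for the first instance, so every later instance sees a stored ID and re-registers instead of registering anew.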
Cheers,
Dario
> On 12 Mar 2015, at 09:20, Alex Rukletsov <al...@mesosphere.io> wrote:
>
> Geoffroy,
>
> yes, it looks like a marathon issue, so feel free to post it there as well.
>
> On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley <geoffroy.jabouley@gmail.com> wrote:
> Thanks Alex for your answer. I will have a look.
>
> Would it be better to (cross-)post this discussion on the marathon mailing list?
>
> Anyway, the issue is "fixed" for 0.8.0, which is the version I'm using.
>
> 2015-03-11 22:18 GMT+01:00 Alex Rukletsov <alex@mesosphere.io>:
> Geoffroy,
>
> most probably you're hitting this bug: https://github.com/mesosphere/marathon/issues/1063. The problem is that Marathon can register instead of re-registering when a master fails over. From the master's point of view, it is a new framework, which is why the previous task is gone and a new one (that technically belongs to a new framework) is started. You can see that the frameworks have two different IDs (check the log lines at 11:31:40.055496 and 11:31:40.785038) in your example.
>
> Hope that helps,
> Alex
>
> On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley <geoffroy.jabouley@gmail.com> wrote:
> Hello
>
> thanks for your interest. Following are the requested logs, which will result in a pretty big mail.
>
> Mesos/Marathon are NOT running inside docker, we only use Docker as our mesos containerizer.
>
> For reminder, here is the use case performed to get the logs file:
>
> --------------------------------
>
> Our cluster: 3 identical mesos nodes with:
> + zookeeper
> + docker 1.5
> + mesos master 0.21.1 configured in HA mode
> + mesos slave 0.21.1 configured with checkpointing, strict and reconnect
> + marathon 0.8.0 configured in HA mode with checkpointing
>
> --------------------------------
>
> Begin State:
> + the mesos cluster is up (3 machines)
> + mesos master leader is 10.195.30.19
> + marathon leader is 10.195.30.21
> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
>
> Action: stop the mesos master leader process (sudo stop mesos-master)
>
> Expected: mesos master leader has changed, active tasks / frameworks remain unchanged
>
> End state:
> + mesos master leader has changed, now 10.195.30.21
> + previously running APPTASK on slave 10.195.30.21 has "disappeared" (it is no longer shown in the Mesos UI), but its docker container is still running
> + a new APPTASK is now running on slave 10.195.30.19
> + marathon framework "registration time" in mesos UI shows "Just now"
> + marathon leader has changed, now 10.195.30.20
>
>
> --------------------------------
>
> Now come the 6 requested logs, which might contain interesting/relevant information, but as a newcomer to Mesos I find them hard to read...
>
>
> from previous MESOS master leader 10.195.30.19:
> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM from process 1 of user 0; exiting
>
>
> from new MESOS master leader 10.195.30.21:
> I0310 11:31:40.011545 922 detector.cpp:138] Detected a new leader: (id='2')
> I0310 11:31:40.011823 922 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.015496 915 network.hpp:424] ZooKeeper group memberships changed
> I0310 11:31:40.015847 915 group.cpp:659] Trying to get '/mesos/log_replicas/0000000000' in ZooKeeper
> I0310 11:31:40.016047 922 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050) is detected
> I0310 11:31:40.016074 922 master.cpp:1263] The newly elected leader is master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
> I0310 11:31:40.016089 922 master.cpp:1276] Elected as the leading master!
> I0310 11:31:40.016108 922 master.cpp:1094] Recovering from registrar
> I0310 11:31:40.016188 918 registrar.cpp:313] Recovering registrar
> I0310 11:31:40.016542 918 log.cpp:656] Attempting to start the writer
> I0310 11:31:40.016918 918 replica.cpp:474] Replica received implicit promise request with proposal 2
> I0310 11:31:40.017503 915 group.cpp:659] Trying to get '/mesos/log_replicas/0000000003' in ZooKeeper
> I0310 11:31:40.017832 918 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 893672ns
> I0310 11:31:40.017848 918 replica.cpp:342] Persisted promised to 2
> I0310 11:31:40.018817 915 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
> I0310 11:31:40.023022 923 coordinator.cpp:230] Coordinator attemping to fill missing position
> I0310 11:31:40.023110 923 log.cpp:672] Writer started with ending position 8
> I0310 11:31:40.023293 923 leveldb.cpp:438] Reading position from leveldb took 13195ns
> I0310 11:31:40.023309 923 leveldb.cpp:438] Reading position from leveldb took 3120ns
> I0310 11:31:40.023619 922 registrar.cpp:346] Successfully fetched the registry (610B) in 7.385856ms
> I0310 11:31:40.023679 922 registrar.cpp:445] Applied 1 operations in 9263ns; attempting to update the 'registry'
> I0310 11:31:40.024238 922 log.cpp:680] Attempting to append 647 bytes to the log
> I0310 11:31:40.024279 923 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 9
> I0310 11:31:40.024435 923 replica.cpp:508] Replica received write request for position 9
> I0310 11:31:40.025707 923 leveldb.cpp:343] Persisting action (666 bytes) to leveldb took 1.259338ms
> I0310 11:31:40.025722 923 replica.cpp:676] Persisted action at 9
> I0310 11:31:40.026074 923 replica.cpp:655] Replica received learned notice for position 9
> I0310 11:31:40.026495 923 leveldb.cpp:343] Persisting action (668 bytes) to leveldb took 404795ns
> I0310 11:31:40.026507 923 replica.cpp:676] Persisted action at 9
> I0310 11:31:40.026511 923 replica.cpp:661] Replica learned APPEND action at position 9
> I0310 11:31:40.026726 923 registrar.cpp:490] Successfully updated the 'registry' in 3.029248ms
> I0310 11:31:40.026765 923 registrar.cpp:376] Successfully recovered registrar
> I0310 11:31:40.026814 923 log.cpp:699] Attempting to truncate the log to 9
> I0310 11:31:40.026880 923 master.cpp:1121] Recovered 3 slaves from the Registry (608B) ; allowing 1days for slaves to re-register
> I0310 11:31:40.026897 923 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 10
> I0310 11:31:40.026988 923 replica.cpp:508] Replica received write request for position 10
> I0310 11:31:40.027640 923 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 641018ns
> I0310 11:31:40.027652 923 replica.cpp:676] Persisted action at 10
> I0310 11:31:40.030848 923 replica.cpp:655] Replica received learned notice for position 10
> I0310 11:31:40.031883 923 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 1.008914ms
> I0310 11:31:40.031963 923 leveldb.cpp:401] Deleting ~2 keys from leveldb took 46724ns
> I0310 11:31:40.031977 923 replica.cpp:676] Persisted action at 10
> I0310 11:31:40.031986 923 replica.cpp:661] Replica learned TRUNCATE action at position 10
> I0310 11:31:40.055415 918 master.cpp:1383] Received registration request for framework 'marathon' at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.055496 918 master.cpp:1447] Registering framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.055642 919 hierarchical_allocator_process.hpp:329] Added framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.189151 919 master.cpp:3246] Re-registering slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 (10.195.30.19)
> I0310 11:31:40.189280 919 registrar.cpp:445] Applied 1 operations in 15452ns; attempting to update the 'registry'
> I0310 11:31:40.189949 919 log.cpp:680] Attempting to append 647 bytes to the log
> I0310 11:31:40.189978 919 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 11
> I0310 11:31:40.190112 919 replica.cpp:508] Replica received write request for position 11
> I0310 11:31:40.190563 919 leveldb.cpp:343] Persisting action (666 bytes) to leveldb took 437440ns
> I0310 11:31:40.190577 919 replica.cpp:676] Persisted action at 11
> I0310 11:31:40.191249 921 replica.cpp:655] Replica received learned notice for position 11
> I0310 11:31:40.192159 921 leveldb.cpp:343] Persisting action (668 bytes) to leveldb took 892767ns
> I0310 11:31:40.192178 921 replica.cpp:676] Persisted action at 11
> I0310 11:31:40.192184 921 replica.cpp:661] Replica learned APPEND action at position 11
> I0310 11:31:40.192350 921 registrar.cpp:490] Successfully updated the 'registry' in 3.0528ms
> I0310 11:31:40.192387 919 log.cpp:699] Attempting to truncate the log to 11
> I0310 11:31:40.192415 919 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 12
> I0310 11:31:40.192539 915 replica.cpp:508] Replica received write request for position 12
> I0310 11:31:40.192600 921 master.cpp:3314] Re-registered slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
> I0310 11:31:40.192680 917 hierarchical_allocator_process.hpp:442] Added slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 available)
> I0310 11:31:40.192847 917 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.193164 915 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 610664ns
> I0310 11:31:40.193181 915 replica.cpp:676] Persisted action at 12
> I0310 11:31:40.193568 915 replica.cpp:655] Replica received learned notice for position 12
> I0310 11:31:40.193948 915 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 364062ns
> I0310 11:31:40.193979 915 leveldb.cpp:401] Deleting ~2 keys from leveldb took 12256ns
> I0310 11:31:40.193985 915 replica.cpp:676] Persisted action at 12
> I0310 11:31:40.193990 915 replica.cpp:661] Replica learned TRUNCATE action at position 12
> I0310 11:31:40.248615 915 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O0 ] on slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.248744 915 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S1 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.774416 915 master.cpp:3246] Re-registering slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 (10.195.30.21)
> I0310 11:31:40.774976 915 registrar.cpp:445] Applied 1 operations in 42342ns; attempting to update the 'registry'
> I0310 11:31:40.777273 921 log.cpp:680] Attempting to append 647 bytes to the log
> I0310 11:31:40.777436 921 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 13
> I0310 11:31:40.777989 921 replica.cpp:508] Replica received write request for position 13
> I0310 11:31:40.779558 921 leveldb.cpp:343] Persisting action (666 bytes) to leveldb took 1.513714ms
> I0310 11:31:40.779633 921 replica.cpp:676] Persisted action at 13
> I0310 11:31:40.781821 919 replica.cpp:655] Replica received learned notice for position 13
> I0310 11:31:40.784417 919 leveldb.cpp:343] Persisting action (668 bytes) to leveldb took 2.542036ms
> I0310 11:31:40.784446 919 replica.cpp:676] Persisted action at 13
> I0310 11:31:40.784452 919 replica.cpp:661] Replica learned APPEND action at position 13
> I0310 11:31:40.784711 920 registrar.cpp:490] Successfully updated the 'registry' in 9.68192ms
> I0310 11:31:40.784762 917 log.cpp:699] Attempting to truncate the log to 13
> I0310 11:31:40.784808 920 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 14
> I0310 11:31:40.784865 917 master.hpp:877] Adding task ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
> I0310 11:31:40.784955 919 replica.cpp:508] Replica received write request for position 14
> W0310 11:31:40.785038 917 master.cpp:4468] Possibly orphaned task ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of framework 20150310-112310-320783114-5050-24289-0000 running on slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 (10.195.30.21)
> I0310 11:31:40.785105 917 master.cpp:3314] Re-registered slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
> I0310 11:31:40.785162 920 hierarchical_allocator_process.hpp:442] Added slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148 available)
> I0310 11:31:40.785679 921 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.786429 919 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 1.454211ms
> I0310 11:31:40.786455 919 replica.cpp:676] Persisted action at 14
> I0310 11:31:40.786782 919 replica.cpp:655] Replica received learned notice for position 14
> I0310 11:31:40.787833 919 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 1.027014ms
> I0310 11:31:40.787873 919 leveldb.cpp:401] Deleting ~2 keys from leveldb took 14085ns
> I0310 11:31:40.787883 919 replica.cpp:676] Persisted action at 14
> I0310 11:31:40.787889 919 replica.cpp:661] Replica learned TRUNCATE action at position 14
> I0310 11:31:40.792536 922 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O1 ] on slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.792670 922 hierarchical_allocator_process.hpp:563] Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S2 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.819602 921 master.cpp:3246] Re-registering slave 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 (10.195.30.20)
> I0310 11:31:40.819736 921 registrar.cpp:445] Applied 1 operations in 16656ns; attempting to update the 'registry'
> I0310 11:31:40.820617 921 log.cpp:680] Attempting to append 647 bytes to the log
> I0310 11:31:40.820726 918 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 15
> I0310 11:31:40.820938 918 replica.cpp:508] Replica received write request for position 15
> I0310 11:31:40.821641 918 leveldb.cpp:343] Persisting action (666 bytes) to leveldb took 670583ns
> I0310 11:31:40.821663 918 replica.cpp:676] Persisted action at 15
> I0310 11:31:40.822265 917 replica.cpp:655] Replica received learned notice for position 15
> I0310 11:31:40.823463 917 leveldb.cpp:343] Persisting action (668 bytes) to leveldb took 1.178687ms
> I0310 11:31:40.823490 917 replica.cpp:676] Persisted action at 15
> I0310 11:31:40.823498 917 replica.cpp:661] Replica learned APPEND action at position 15
> I0310 11:31:40.823755 917 registrar.cpp:490] Successfully updated the 'registry' in 3.97696ms
> I0310 11:31:40.823823 917 log.cpp:699] Attempting to truncate the log to 15
> I0310 11:31:40.824147 922 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 16
> I0310 11:31:40.824482 922 hierarchical_allocator_process.hpp:442] Added slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 available)
> I0310 11:31:40.824597 921 replica.cpp:508] Replica received write request for position 16
> I0310 11:31:40.824128 917 master.cpp:3314] Re-registered slave 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
> I0310 11:31:40.824975 917 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.831900 921 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 7.228682ms
> I0310 11:31:40.832031 921 replica.cpp:676] Persisted action at 16
> I0310 11:31:40.832456 917 replica.cpp:655] Replica received learned notice for position 16
> I0310 11:31:40.835178 917 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 2.674392ms
> I0310 11:31:40.835297 917 leveldb.cpp:401] Deleting ~2 keys from leveldb took 45220ns
> I0310 11:31:40.835322 917 replica.cpp:676] Persisted action at 16
> I0310 11:31:40.835341 917 replica.cpp:661] Replica learned TRUNCATE action at position 16
> I0310 11:31:40.838281 923 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O2 ] on slave 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.838389 923 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S0 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.948725 919 http.cpp:344] HTTP request for '/master/redirect'
> I0310 11:31:41.479118 918 http.cpp:478] HTTP request for '/master/state.json'
> I0310 11:31:45.368074 918 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:45.385144 917 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O3 ] on slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:45.385292 917 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S1 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:46.368450 917 master.cpp:3843] Sending 2 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:46.375222 920 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O4 ] on slave 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:46.375360 920 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O5 ] on slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:46.375530 920 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S0 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:46.375599 920 hierarchical_allocator_process.hpp:563] Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S2 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:48.031230 915 http.cpp:478] HTTP request for '/master/state.json'
> I0310 11:31:51.374285 922 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:51.379391 921 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O6 ] on slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:51.379487 921 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S1 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:51.482094 923 http.cpp:478] HTTP request for '/master/state.json'
> I0310 11:31:52.375326 917 master.cpp:3843] Sending 2 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:52.391376 919 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O7 ] on slave 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:52.391512 919 hierarchical_allocator_process.hpp:563] Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S2 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:52.391659 921 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O8 ] on slave 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:52.391751 921 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S0 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:55.062060 918 master.cpp:3611] Performing explicit task state reconciliation for 1 tasks of framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:55.062588 919 master.cpp:3556] Performing implicit task state reconciliation for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:56.140990 923 http.cpp:344] HTTP request for '/master/redirect'
> I0310 11:31:57.379288 918 master.cpp:3843] Sending 1 offers to framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:57.430888 918 master.cpp:2344] Processing reply for offers: [ 20150310-112310-354337546-5050-895-O9 ] on slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:57.431068 918 master.hpp:877] Adding task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
> I0310 11:31:57.431089 918 master.cpp:2503] Launching task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 (10.195.30.19)
> I0310 11:31:57.431205 918 hierarchical_allocator_process.hpp:563] Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148) on slave 20150310-112310-320783114-5050-24289-S1 from framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.682133 919 master.cpp:3446] Forwarding status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.682186 919 master.cpp:3418] Status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 from slave 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 (10.195.30.19)
> I0310 11:31:57.682199 919 master.cpp:4693] Updating the latest state of task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
>
>
> from MESOS slave 10.195.30.21
> I0310 11:31:28.750200 1074 slave.cpp:2623] master@10.195.30.19:5050 exited
> W0310 11:31:28.750249 1074 slave.cpp:2626] Master disconnected! Waiting for a new master to be elected
> I0310 11:31:40.012516 1075 detector.cpp:138] Detected a new leader: (id='2')
> I0310 11:31:40.012899 1073 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.017143 1072 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050) is detected
> I0310 11:31:40.017408 1072 slave.cpp:602] New master detected at master@10.195.30.21:5050
> I0310 11:31:40.017546 1076 status_update_manager.cpp:171] Pausing sending status updates
> I0310 11:31:40.018673 1072 slave.cpp:627] No credentials provided. Attempting to register without authentication
> I0310 11:31:40.018689 1072 slave.cpp:638] Detecting new master
> I0310 11:31:40.785364 1075 slave.cpp:824] Re-registered with master master@10.195.30.21:5050
> I0310 11:31:40.785398 1075 status_update_manager.cpp:178] Resuming sending status updates
> I0310 11:32:10.639506 1075 slave.cpp:3321] Current usage 12.27%. Max allowed age: 5.441217749539572days
>
>
> from MESOS slave 10.195.30.19
> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050 exited
> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected! Waiting for a new master to be elected
> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader: (id='2')
> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050) is detected
> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at master@10.195.30.21:5050
> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided. Attempting to register without authentication
> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing sending status updates
> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master master@10.195.30.21:5050
> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming sending status updates
> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 in work directory '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for executor ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework '20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and executor 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of framework '20150310-112310-354337546-5050-895-0000'
> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of framework '20150310-112310-354337546-5050-895-0000' in container 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for executor 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of framework 20150310-112310-354337546-5050-895-0000 from executor(1)@10.195.30.19:56574
> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for executor 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 from executor(1)@10.195.30.19:56574
> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.678411 24461 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 to master@10.195.30.21:5050
> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000 to executor(1)@10.195.30.19:56574
> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.689028 24461 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to 204 at /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2 for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated 'memory.soft_limit_in_bytes' to 160MB for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated 'memory.limit_in_bytes' to 160MB at /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2 for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max allowed age: 5.555425200437824days
>
>
> From previous Marathon leader 10.195.30.21:
> I0310 11:31:40.017248 1115 detector.cpp:138] Detected a new leader: (id='2')
> I0310 11:31:40.017334 1115 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.017727 1115 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050) is detected
> [2015-03-10 11:31:40,017] WARN Disconnected (mesosphere.marathon.MarathonScheduler:224)
> [2015-03-10 11:31:40,019] INFO Abdicating (mesosphere.marathon.MarathonSchedulerService:312)
> [2015-03-10 11:31:40,019] INFO Defeat leadership (mesosphere.marathon.MarathonSchedulerService:285)
> [INFO] [03/10/2015 11:31:40.019] [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b] POSTing to all endpoints.
> [INFO] [03/10/2015 11:31:40.019] [marathon-akka.actor.default-dispatcher-5] [akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
> [2015-03-10 11:31:40,021] INFO Stopping driver (mesosphere.marathon.MarathonSchedulerService:221)
> I0310 11:31:40.022001 1115 sched.cpp:1286] Asked to stop the driver
> [2015-03-10 11:31:40,024] INFO Setting framework ID to 20150310-112310-320783114-5050-24289-0000 (mesosphere.marathon.MarathonSchedulerService:73)
> I0310 11:31:40.026274 1115 sched.cpp:234] New master detected at master@10.195.30.21:5050
> I0310 11:31:40.026418 1115 sched.cpp:242] No credentials provided. Attempting to register without authentication
> I0310 11:31:40.026458 1115 sched.cpp:752] Stopping framework '20150310-112310-320783114-5050-24289-0000'
> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing optional abdication command. (mesosphere.marathon.MarathonSchedulerService:192)
> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface) (mesosphere.marathon.MarathonSchedulerService:246)
> [2015-03-10 11:31:40,032] INFO Defeat leadership (mesosphere.marathon.MarathonSchedulerService:285)
> [2015-03-10 11:31:40,032] INFO Stopping driver (mesosphere.marathon.MarathonSchedulerService:221)
> I0310 11:31:40.032588 1107 sched.cpp:1286] Asked to stop the driver
> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500 milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334)
> [2015-03-10 11:31:40,033] INFO Setting framework ID to 20150310-112310-320783114-5050-24289-0000 (mesosphere.marathon.MarathonSchedulerService:73)
> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is not a candidate for leader, current voting: [member_0000000001, member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering leadership (mesosphere.marathon.MarathonSchedulerService:341)
> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003 (com.twitter.common.zookeeper.Group:426)
> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is not a candidate for leader, current voting: [member_0000000001, member_0000000002, member_0000000003] (com.twitter.common.zookeeper.CandidateImpl:144)
> [2015-03-10 11:31:40,568] INFO Candidate /marathon/leader/member_0000000003 waiting for the next leader election, current voting: [member_0000000001, member_0000000002, member_0000000003] (com.twitter.common.zookeeper.CandidateImpl:165)
>
>
> From new Marathon leader 10.195.30.20:
> [2015-03-10 11:31:40,029] INFO Candidate /marathon/leader/member_0000000001 is now leader of group: [member_0000000001, member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:152)
> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface) (mesosphere.marathon.MarathonSchedulerService:253)
> [2015-03-10 11:31:40,044] INFO Elect leadership (mesosphere.marathon.MarathonSchedulerService:299)
> [2015-03-10 11:31:40,044] INFO Running driver (mesosphere.marathon.MarathonSchedulerService:184)
> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716: Client environment:host.name=srv-d2u-9-virtip20
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-44-generic
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725: Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741: Client environment:user.home=/root
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753: Client environment:user.dir=/
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0 sessionId=0 sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff (mesosphere.marathon.MarathonSchedulerService:329)
> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703: initiated connection to server [10.195.30.19:2181]
> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750: session establishment complete on server [10.195.30.19:2181], sessionId=0x14c0335ad7e000d, negotiated timeout=10000
> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@10.195.30.20:45771) connected to ZooKeeper
> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
> [INFO] [03/10/2015 11:31:40.047] [marathon-akka.actor.default-dispatcher-2] [akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader: (id='2')
> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> [2015-03-10 11:31:40,053] INFO Migration successfully applied for version Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050) is detected
> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at master@10.195.30.21:5050
> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided. Attempting to register without authentication
> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with 20150310-112310-354337546-5050-895-0000
> [2015-03-10 11:31:40,056] INFO Registered as 20150310-112310-354337546-5050-895-0000 to master '20150310-112310-354337546-5050-895' (mesosphere.marathon.MarathonScheduler:72)
> [2015-03-10 11:31:40,063] INFO Stored framework ID '20150310-112310-354337546-5050-895-0000' (mesosphere.mesos.util.FrameworkIdUtil:49)
> [INFO] [03/10/2015 11:31:40.065] [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
> [INFO] [03/10/2015 11:31:40.067] [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b] POSTing to all endpoints.
> ...
> ...
> ...
> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps (mesosphere.marathon.SchedulerActions:403)
> [INFO] [03/10/2015 11:31:55.053] [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters] Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the Mesos master (mesosphere.marathon.SchedulerActions:430)
> [2015-03-10 11:31:55,064] INFO Received status update for task ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799: TASK_LOST (Reconciliation: Task is unknown to the slave) (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:31:55,069] INFO Need to scale /ffaas-backoffice-app-nopersist from 0 up to 1 instances (mesosphere.marathon.SchedulerActions:488)
> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for /ffaas-backoffice-app-nopersist (0 queued) (mesosphere.marathon.SchedulerActions:494)
> [2015-03-10 11:31:55,069] INFO Task ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 expunged and removed from TaskTracker (mesosphere.marathon.tasks.TaskTracker:107)
> [2015-03-10 11:31:55,070] INFO Sending event notification. (mesosphere.marathon.MarathonScheduler:262)
> [INFO] [03/10/2015 11:31:55.072] [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b] POSTing to all endpoints.
> [2015-03-10 11:31:55,073] INFO Need to scale /ffaas-backoffice-app-nopersist from 0 up to 1 instances (mesosphere.marathon.SchedulerActions:488)
> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for /ffaas-backoffice-app-nopersist. Not scaling. (mesosphere.marathon.SchedulerActions:498)
> ...
> ...
> ...
> [2015-03-10 11:31:57,682] INFO Received status update for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:31:57,694] INFO Sending event notification. (mesosphere.marathon.MarathonScheduler:262)
> [INFO] [03/10/2015 11:31:57.694] [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b] POSTing to all endpoints.
> ...
> ...
> ...
> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store (mesosphere.marathon.tasks.TaskTracker:170)
> [INFO] [03/10/2015 11:36:55.050] [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters] Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps (mesosphere.marathon.SchedulerActions:403)
> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the Mesos master (mesosphere.marathon.SchedulerActions:430)
> [2015-03-10 11:36:55,063] INFO Received status update for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: TASK_RUNNING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:36:55,065] INFO Received status update for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: TASK_RUNNING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:36:55,066] INFO Already running 1 instances of /ffaas-backoffice-app-nopersist. Not scaling. (mesosphere.marathon.SchedulerActions:512)
>
>
>
> -- End of logs
>
>
>
> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <adam@mesosphere.io>:
> This is certainly not the expected/desired behavior when failing over a mesos master in HA mode. In addition to the master logs Alex requested, can you also provide relevant portions of the slave logs for these tasks? If the slave processes themselves never failed over, checkpointing and slave recovery should be irrelevant. Are you running the mesos-slave itself inside a Docker, or any other non-traditional setup?
>
> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover defaults to "reconnect", and --strict defaults to true, so none of those are necessary.
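>
> Concretely, given those defaults, the slave invocation quoted later in this thread could likely be trimmed to the following (an untested sketch based on the flags listed in the original mail; `--checkpoint`, `--recover=reconnect` and `--strict` are dropped because they match the 0.21.x defaults):

```shell
# Slave command line relying on 0.21.x defaults:
# --checkpoint=true, --recover=reconnect, --strict=true are implicit.
/usr/sbin/mesos-slave \
  --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos \
  --containerizers=docker,mesos \
  --executor_registration_timeout=5mins \
  --hostname=10.195.30.19 \
  --ip=10.195.30.19 \
  --isolation=cgroups/cpu,cgroups/mem \
  --recovery_timeout=120mins \
  --resources='ports:[31000-32000,80,443]'
```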
>
> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <alex@mesosphere.io> wrote:
> Geoffroy,
>
> could you please provide master logs (both from killed and taking over masters)?
>
> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <geoffroy.jabouley@gmail.com> wrote:
> Hello
>
> we are facing some unexpected issues when testing the high availability behavior of our mesos cluster.
>
> Our use case:
>
> State: the mesos cluster is up (3 machines), 1 docker task is running on each slave (started from marathon)
>
> Action: stop the mesos master leader process
>
> Expected: mesos master leader has changed, active tasks remain unchanged
>
> Seen: mesos master leader has changed, all active tasks are now FAILED but docker containers are still running, marathon detects FAILED tasks and starts new tasks. We end with 2 docker containers running on each machine, but only one is linked to a RUNNING mesos task.
>
>
> Is the seen behavior correct?
>
> Have we misunderstood the high availability concept? We thought that doing this use case would not have any impact on the current cluster state (except leader re-election)
>
> Thanks in advance for your help
> Regards
>
> ---------------------------------------------------
>
> our setup is the following:
> 3 identical mesos nodes with:
> + zookeeper
> + docker 1.5
> + mesos master 0.21.1 configured in HA mode
> + mesos slave 0.21.1 configured with checkpointing, strict and reconnect
> + marathon 0.8.0 configured in HA mode with checkpointing
>
> ---------------------------------------------------
>
> Command lines:
>
> mesos-master
> usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050 --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19 --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
>
> mesos-slave
> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=10.195.30.19 --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>
> marathon
> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64 -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000 --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080 --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
>
>
>
>
>
>
Re: Weird behavior when stopping the mesos master leader of a HA
mesos cluster
Posted by Alex Rukletsov <al...@mesosphere.io>.
Geoffroy,
yes, it looks like a marathon issue, so feel free to post it there as well.
On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley <
geoffroy.jabouley@gmail.com> wrote:
> Thanks Alex for your answer. I will have a look.
>
> Would it be better to (cross-)post this discussion on the marathon mailing
> list?
>
> Anyway, the issue is "fixed" for 0.8.0, which is the version I'm using.
>
> 2015-03-11 22:18 GMT+01:00 Alex Rukletsov <al...@mesosphere.io>:
>
>> Geoffroy,
>>
>> most probably you're hitting this bug:
>> https://github.com/mesosphere/marathon/issues/1063. The problem is that
>> Marathon can register instead of re-registering when a master fails
>> over. From the master's point of view, it's a new framework, which is why
>> the previous task is gone and a new one (that technically belongs to a new
>> framework) is started. You can see that the frameworks have two different
>> IDs (check the log lines at 11:31:40.055496 and 11:31:40.785038) in your
>> example.
>>
>> Hope that helps,
>> Alex
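
The register-vs-re-register distinction described above can be sketched as follows. This is a minimal illustration, not Marathon's or Mesos' actual code; the function name and return values are hypothetical, though the framework IDs come from the logs in this thread:

```python
# Sketch of how a Mesos master treats an incoming scheduler connection,
# depending on whether the scheduler presents a previously assigned
# framework ID. Illustrative only; not a real Mesos API.

def registration_kind(framework_id):
    """Return how the master treats a connecting scheduler.

    An empty FrameworkInfo.id means "register": the master assigns a
    brand-new framework ID, and tasks of the old framework become
    orphaned. A non-empty ID means "re-register": the framework keeps
    ownership of its already-running tasks across a master failover.
    """
    if framework_id:          # e.g. "20150310-112310-...-0000"
        return "re-register"  # existing tasks stay attached
    return "register"         # new framework; old tasks orphaned

# The bug above: after master failover, Marathon connected without its
# stored ID, so the master registered it as a new framework:
assert registration_kind("") == "register"
# The intended path, with the persisted framework ID restored first:
assert registration_kind(
    "20150310-112310-320783114-5050-24289-0000") == "re-register"
```

This matches the log evidence: the old tasks ran under framework 20150310-112310-320783114-5050-24289-0000, while the master registered Marathon anew as 20150310-112310-354337546-5050-895-0000 and flagged the old task as "Possibly orphaned".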
>>
>> On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley <
>> geoffroy.jabouley@gmail.com> wrote:
>>
>>> Hello
>>>
>>> thanks for your interest. Following are the requested logs, which will
>>> result in a pretty big mail.
>>>
>>> Mesos/Marathon are *NOT running inside docker*, we only use Docker as
>>> our mesos containerizer.
>>>
>>> For reminder, here is the use case performed to get the logs file:
>>>
>>> --------------------------------
>>>
>>> Our cluster: 3 identical mesos nodes with:
>>> + zookeeper
>>> + docker 1.5
>>> + mesos master 0.21.1 configured in HA mode
>>> + mesos slave 0.21.1 configured with checkpointing, strict and
>>> reconnect
>>> + marathon 0.8.0 configured in HA mode with checkpointing
>>>
>>> --------------------------------
>>>
>>> *Begin State: *
>>> + the mesos cluster is up (3 machines)
>>> + mesos master leader is 10.195.30.19
>>> + marathon leader is 10.195.30.21
>>> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
>>>
>>> *Action*: stop the mesos master leader process (sudo stop mesos-master)
>>>
>>> *Expected*: mesos master leader has changed, active tasks / frameworks
>>> remain unchanged
>>>
>>> *End state: *
>>> + mesos master leader *has changed, now 10.195.30.21*
>>> + the previously running APPTASK on slave 10.195.30.21 has "disappeared"
>>> (not showing anymore in the mesos UI), but *its docker container is still
>>> running*
>>> + a *new APPTASK is now running on slave 10.195.30.19*
>>> + marathon framework "registration time" in mesos UI shows "Just now"
>>> + marathon leader *has changed, now 10.195.30.20*
>>>
>>>
>>> --------------------------------
>>>
>>> Now come the 6 requested logs, which might contain interesting/relevant
>>> information, but as a newcomer to mesos I find them hard to read...
>>>
>>>
>>> *from previous MESOS master leader 10.195.30.19:*
>>> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM
>>> from process 1 of user 0; exiting
>>>
>>>
>>> *from new MESOS master leader 10.195.30.21:*
>>> I0310 11:31:40.011545 922 detector.cpp:138] Detected a new leader:
>>> (id='2')
>>> I0310 11:31:40.011823 922 group.cpp:659] Trying to get
>>> '/mesos/info_0000000002' in ZooKeeper
>>> I0310 11:31:40.015496 915 network.hpp:424] ZooKeeper group memberships
>>> changed
>>> I0310 11:31:40.015847 915 group.cpp:659] Trying to get
>>> '/mesos/log_replicas/0000000000' in ZooKeeper
>>> I0310 11:31:40.016047 922 detector.cpp:433] A new leading master (UPID=
>>> master@10.195.30.21:5050) is detected
>>> I0310 11:31:40.016074 922 master.cpp:1263] The newly elected leader is
>>> master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
>>> I0310 11:31:40.016089 922 master.cpp:1276] Elected as the leading
>>> master!
>>> I0310 11:31:40.016108 922 master.cpp:1094] Recovering from registrar
>>> I0310 11:31:40.016188 918 registrar.cpp:313] Recovering registrar
>>> I0310 11:31:40.016542 918 log.cpp:656] Attempting to start the writer
>>> I0310 11:31:40.016918 918 replica.cpp:474] Replica received implicit
>>> promise request with proposal 2
>>> I0310 11:31:40.017503 915 group.cpp:659] Trying to get
>>> '/mesos/log_replicas/0000000003' in ZooKeeper
>>> I0310 11:31:40.017832 918 leveldb.cpp:306] Persisting metadata (8
>>> bytes) to leveldb took 893672ns
>>> I0310 11:31:40.017848 918 replica.cpp:342] Persisted promised to 2
>>> I0310 11:31:40.018817 915 network.hpp:466] ZooKeeper group PIDs: {
>>> log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
>>> I0310 11:31:40.023022 923 coordinator.cpp:230] Coordinator attemping
>>> to fill missing position
>>> I0310 11:31:40.023110 923 log.cpp:672] Writer started with ending
>>> position 8
>>> I0310 11:31:40.023293 923 leveldb.cpp:438] Reading position from
>>> leveldb took 13195ns
>>> I0310 11:31:40.023309 923 leveldb.cpp:438] Reading position from
>>> leveldb took 3120ns
>>> I0310 11:31:40.023619 922 registrar.cpp:346] Successfully fetched the
>>> registry (610B) in 7.385856ms
>>> I0310 11:31:40.023679 922 registrar.cpp:445] Applied 1 operations in
>>> 9263ns; attempting to update the 'registry'
>>> I0310 11:31:40.024238 922 log.cpp:680] Attempting to append 647 bytes
>>> to the log
>>> I0310 11:31:40.024279 923 coordinator.cpp:340] Coordinator attempting
>>> to write APPEND action at position 9
>>> I0310 11:31:40.024435 923 replica.cpp:508] Replica received write
>>> request for position 9
>>> I0310 11:31:40.025707 923 leveldb.cpp:343] Persisting action (666
>>> bytes) to leveldb took 1.259338ms
>>> I0310 11:31:40.025722 923 replica.cpp:676] Persisted action at 9
>>> I0310 11:31:40.026074 923 replica.cpp:655] Replica received learned
>>> notice for position 9
>>> I0310 11:31:40.026495 923 leveldb.cpp:343] Persisting action (668
>>> bytes) to leveldb took 404795ns
>>> I0310 11:31:40.026507 923 replica.cpp:676] Persisted action at 9
>>> I0310 11:31:40.026511 923 replica.cpp:661] Replica learned APPEND
>>> action at position 9
>>> I0310 11:31:40.026726 923 registrar.cpp:490] Successfully updated the
>>> 'registry' in 3.029248ms
>>> I0310 11:31:40.026765 923 registrar.cpp:376] Successfully recovered
>>> registrar
>>> I0310 11:31:40.026814 923 log.cpp:699] Attempting to truncate the log
>>> to 9
>>> I0310 11:31:40.026880 923 master.cpp:1121] Recovered 3 slaves from the
>>> Registry (608B) ; allowing 1days for slaves to re-register
>>> I0310 11:31:40.026897 923 coordinator.cpp:340] Coordinator attempting
>>> to write TRUNCATE action at position 10
>>> I0310 11:31:40.026988 923 replica.cpp:508] Replica received write
>>> request for position 10
>>> I0310 11:31:40.027640 923 leveldb.cpp:343] Persisting action (16
>>> bytes) to leveldb took 641018ns
>>> I0310 11:31:40.027652 923 replica.cpp:676] Persisted action at 10
>>> I0310 11:31:40.030848 923 replica.cpp:655] Replica received learned
>>> notice for position 10
>>> I0310 11:31:40.031883 923 leveldb.cpp:343] Persisting action (18
>>> bytes) to leveldb took 1.008914ms
>>> I0310 11:31:40.031963 923 leveldb.cpp:401] Deleting ~2 keys from
>>> leveldb took 46724ns
>>> I0310 11:31:40.031977 923 replica.cpp:676] Persisted action at 10
>>> I0310 11:31:40.031986 923 replica.cpp:661] Replica learned TRUNCATE
>>> action at position 10
>>> I0310 11:31:40.055415 918 master.cpp:1383] Received registration
>>> request for framework 'marathon' at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.055496 918 master.cpp:1447] Registering framework
>>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.055642 919 hierarchical_allocator_process.hpp:329]
>>> Added framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:40.189151 919 master.cpp:3246] Re-registering slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19)
>>> I0310 11:31:40.189280 919 registrar.cpp:445] Applied 1 operations in
>>> 15452ns; attempting to update the 'registry'
>>> I0310 11:31:40.189949 919 log.cpp:680] Attempting to append 647 bytes
>>> to the log
>>> I0310 11:31:40.189978 919 coordinator.cpp:340] Coordinator attempting
>>> to write APPEND action at position 11
>>> I0310 11:31:40.190112 919 replica.cpp:508] Replica received write
>>> request for position 11
>>> I0310 11:31:40.190563 919 leveldb.cpp:343] Persisting action (666
>>> bytes) to leveldb took 437440ns
>>> I0310 11:31:40.190577 919 replica.cpp:676] Persisted action at 11
>>> I0310 11:31:40.191249 921 replica.cpp:655] Replica received learned
>>> notice for position 11
>>> I0310 11:31:40.192159 921 leveldb.cpp:343] Persisting action (668
>>> bytes) to leveldb took 892767ns
>>> I0310 11:31:40.192178 921 replica.cpp:676] Persisted action at 11
>>> I0310 11:31:40.192184 921 replica.cpp:661] Replica learned APPEND
>>> action at position 11
>>> I0310 11:31:40.192350 921 registrar.cpp:490] Successfully updated the
>>> 'registry' in 3.0528ms
>>> I0310 11:31:40.192387 919 log.cpp:699] Attempting to truncate the log
>>> to 11
>>> I0310 11:31:40.192415 919 coordinator.cpp:340] Coordinator attempting
>>> to write TRUNCATE action at position 12
>>> I0310 11:31:40.192539 915 replica.cpp:508] Replica received write
>>> request for position 12
>>> I0310 11:31:40.192600 921 master.cpp:3314] Re-registered slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148
>>> I0310 11:31:40.192680 917 hierarchical_allocator_process.hpp:442]
>>> Added slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with
>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>>> available)
>>> I0310 11:31:40.192847 917 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.193164 915 leveldb.cpp:343] Persisting action (16
>>> bytes) to leveldb took 610664ns
>>> I0310 11:31:40.193181 915 replica.cpp:676] Persisted action at 12
>>> I0310 11:31:40.193568 915 replica.cpp:655] Replica received learned
>>> notice for position 12
>>> I0310 11:31:40.193948 915 leveldb.cpp:343] Persisting action (18
>>> bytes) to leveldb took 364062ns
>>> I0310 11:31:40.193979 915 leveldb.cpp:401] Deleting ~2 keys from
>>> leveldb took 12256ns
>>> I0310 11:31:40.193985 915 replica.cpp:676] Persisted action at 12
>>> I0310 11:31:40.193990 915 replica.cpp:661] Replica learned TRUNCATE
>>> action at position 12
>>> I0310 11:31:40.248615 915 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O0 ] on slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.248744 915 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:40.774416 915 master.cpp:3246] Re-registering slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21)
>>> I0310 11:31:40.774976 915 registrar.cpp:445] Applied 1 operations in
>>> 42342ns; attempting to update the 'registry'
>>> I0310 11:31:40.777273 921 log.cpp:680] Attempting to append 647 bytes
>>> to the log
>>> I0310 11:31:40.777436 921 coordinator.cpp:340] Coordinator attempting
>>> to write APPEND action at position 13
>>> I0310 11:31:40.777989 921 replica.cpp:508] Replica received write
>>> request for position 13
>>> I0310 11:31:40.779558 921 leveldb.cpp:343] Persisting action (666
>>> bytes) to leveldb took 1.513714ms
>>> I0310 11:31:40.779633 921 replica.cpp:676] Persisted action at 13
>>> I0310 11:31:40.781821 919 replica.cpp:655] Replica received learned
>>> notice for position 13
>>> I0310 11:31:40.784417 919 leveldb.cpp:343] Persisting action (668
>>> bytes) to leveldb took 2.542036ms
>>> I0310 11:31:40.784446 919 replica.cpp:676] Persisted action at 13
>>> I0310 11:31:40.784452 919 replica.cpp:661] Replica learned APPEND
>>> action at position 13
>>> I0310 11:31:40.784711 920 registrar.cpp:490] Successfully updated the
>>> 'registry' in 9.68192ms
>>> I0310 11:31:40.784762 917 log.cpp:699] Attempting to truncate the log
>>> to 13
>>> I0310 11:31:40.784808 920 coordinator.cpp:340] Coordinator attempting
>>> to write TRUNCATE action at position 14
>>> I0310 11:31:40.784865 917 master.hpp:877] Adding task
>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with
>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>> 20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
>>> I0310 11:31:40.784955 919 replica.cpp:508] Replica received write
>>> request for position 14
>>> W0310 11:31:40.785038 917 master.cpp:4468] Possibly orphaned task
>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of
>>> framework 20150310-112310-320783114-5050-24289-0000 running on slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21)
>>> I0310 11:31:40.785105 917 master.cpp:3314] Re-registered slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148
>>> I0310 11:31:40.785162 920 hierarchical_allocator_process.hpp:442]
>>> Added slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with
>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>> ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148
>>> available)
>>> I0310 11:31:40.785679 921 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.786429 919 leveldb.cpp:343] Persisting action (16
>>> bytes) to leveldb took 1.454211ms
>>> I0310 11:31:40.786455 919 replica.cpp:676] Persisted action at 14
>>> I0310 11:31:40.786782 919 replica.cpp:655] Replica received learned
>>> notice for position 14
>>> I0310 11:31:40.787833 919 leveldb.cpp:343] Persisting action (18
>>> bytes) to leveldb took 1.027014ms
>>> I0310 11:31:40.787873 919 leveldb.cpp:401] Deleting ~2 keys from
>>> leveldb took 14085ns
>>> I0310 11:31:40.787883 919 replica.cpp:676] Persisted action at 14
>>> I0310 11:31:40.787889 919 replica.cpp:661] Replica learned TRUNCATE
>>> action at position 14
>>> I0310 11:31:40.792536 922 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O1 ] on slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.792670 922 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:40.819602 921 master.cpp:3246] Re-registering slave
>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>> (10.195.30.20)
>>> I0310 11:31:40.819736 921 registrar.cpp:445] Applied 1 operations in
>>> 16656ns; attempting to update the 'registry'
>>> I0310 11:31:40.820617 921 log.cpp:680] Attempting to append 647 bytes
>>> to the log
>>> I0310 11:31:40.820726 918 coordinator.cpp:340] Coordinator attempting
>>> to write APPEND action at position 15
>>> I0310 11:31:40.820938 918 replica.cpp:508] Replica received write
>>> request for position 15
>>> I0310 11:31:40.821641 918 leveldb.cpp:343] Persisting action (666
>>> bytes) to leveldb took 670583ns
>>> I0310 11:31:40.821663 918 replica.cpp:676] Persisted action at 15
>>> I0310 11:31:40.822265 917 replica.cpp:655] Replica received learned
>>> notice for position 15
>>> I0310 11:31:40.823463 917 leveldb.cpp:343] Persisting action (668
>>> bytes) to leveldb took 1.178687ms
>>> I0310 11:31:40.823490 917 replica.cpp:676] Persisted action at 15
>>> I0310 11:31:40.823498 917 replica.cpp:661] Replica learned APPEND
>>> action at position 15
>>> I0310 11:31:40.823755 917 registrar.cpp:490] Successfully updated the
>>> 'registry' in 3.97696ms
>>> I0310 11:31:40.823823 917 log.cpp:699] Attempting to truncate the log
>>> to 15
>>> I0310 11:31:40.824147 922 coordinator.cpp:340] Coordinator attempting
>>> to write TRUNCATE action at position 16
>>> I0310 11:31:40.824482 922 hierarchical_allocator_process.hpp:442]
>>> Added slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with
>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>>> available)
>>> I0310 11:31:40.824597 921 replica.cpp:508] Replica received write
>>> request for position 16
>>> I0310 11:31:40.824128 917 master.cpp:3314] Re-registered slave
>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>> (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148
>>> I0310 11:31:40.824975 917 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.831900 921 leveldb.cpp:343] Persisting action (16
>>> bytes) to leveldb took 7.228682ms
>>> I0310 11:31:40.832031 921 replica.cpp:676] Persisted action at 16
>>> I0310 11:31:40.832456 917 replica.cpp:655] Replica received learned
>>> notice for position 16
>>> I0310 11:31:40.835178 917 leveldb.cpp:343] Persisting action (18
>>> bytes) to leveldb took 2.674392ms
>>> I0310 11:31:40.835297 917 leveldb.cpp:401] Deleting ~2 keys from
>>> leveldb took 45220ns
>>> I0310 11:31:40.835322 917 replica.cpp:676] Persisted action at 16
>>> I0310 11:31:40.835341 917 replica.cpp:661] Replica learned TRUNCATE
>>> action at position 16
>>> I0310 11:31:40.838281 923 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O2 ] on slave
>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:40.838389 923 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:40.948725 919 http.cpp:344] HTTP request for
>>> '/master/redirect'
>>> I0310 11:31:41.479118 918 http.cpp:478] HTTP request for
>>> '/master/state.json'
>>> I0310 11:31:45.368074 918 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:45.385144 917 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O3 ] on slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:45.385292 917 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:46.368450 917 master.cpp:3843] Sending 2 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:46.375222 920 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O4 ] on slave
>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:46.375360 920 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O5 ] on slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:46.375530 920 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:46.375599 920 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:48.031230 915 http.cpp:478] HTTP request for
>>> '/master/state.json'
>>> I0310 11:31:51.374285 922 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:51.379391 921 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O6 ] on slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:51.379487 921 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:51.482094 923 http.cpp:478] HTTP request for
>>> '/master/state.json'
>>> I0310 11:31:52.375326 917 master.cpp:3843] Sending 2 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:52.391376 919 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O7 ] on slave
>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:52.391512 919 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:52.391659 921 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O8 ] on slave
>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:52.391751 921 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:55.062060 918 master.cpp:3611] Performing explicit task
>>> state reconciliation for 1 tasks of framework
>>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:55.062588 919 master.cpp:3556] Performing implicit task
>>> state reconciliation for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:56.140990 923 http.cpp:344] HTTP request for
>>> '/master/redirect'
>>> I0310 11:31:57.379288 918 master.cpp:3843] Sending 1 offers to
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:57.430888 918 master.cpp:2344] Processing reply for
>>> offers: [ 20150310-112310-354337546-5050-895-O9 ] on slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>> (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>> I0310 11:31:57.431068 918 master.hpp:877] Adding task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with
>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>> 20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
>>> I0310 11:31:57.431089 918 master.cpp:2503] Launching task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with
>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19)
>>> I0310 11:31:57.431205 918 hierarchical_allocator_process.hpp:563]
>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>> 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.682133 919 master.cpp:3446] Forwarding status update
>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.682186 919 master.cpp:3418] Status update TASK_RUNNING
>>> (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 from slave
>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>> (10.195.30.19)
>>> I0310 11:31:57.682199 919 master.cpp:4693] Updating the latest state
>>> of task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799
>>> of framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
>>>
>>>
>>> *from MESOS slave 10.195.30.21*
>>> I0310 11:31:28.750200 1074 slave.cpp:2623] master@10.195.30.19:5050
>>> exited
>>> W0310 11:31:28.750249 1074 slave.cpp:2626] Master disconnected! Waiting
>>> for a new master to be elected
>>> I0310 11:31:40.012516 1075 detector.cpp:138] Detected a new leader:
>>> (id='2')
>>> I0310 11:31:40.012899 1073 group.cpp:659] Trying to get
>>> '/mesos/info_0000000002' in ZooKeeper
>>> I0310 11:31:40.017143 1072 detector.cpp:433] A new leading master (UPID=
>>> master@10.195.30.21:5050) is detected
>>> I0310 11:31:40.017408 1072 slave.cpp:602] New master detected at
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.017546 1076 status_update_manager.cpp:171] Pausing
>>> sending status updates
>>> I0310 11:31:40.018673 1072 slave.cpp:627] No credentials provided.
>>> Attempting to register without authentication
>>> I0310 11:31:40.018689 1072 slave.cpp:638] Detecting new master
>>> I0310 11:31:40.785364 1075 slave.cpp:824] Re-registered with master
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.785398 1075 status_update_manager.cpp:178] Resuming
>>> sending status updates
>>> I0310 11:32:10.639506 1075 slave.cpp:3321] Current usage 12.27%. Max
>>> allowed age: 5.441217749539572days
>>>
>>>
>>> *from MESOS slave 10.195.30.19*
>>> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050
>>> exited
>>> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected! Waiting
>>> for a new master to be elected
>>> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader:
>>> (id='2')
>>> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get
>>> '/mesos/info_0000000002' in ZooKeeper
>>> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master (UPID=
>>> master@10.195.30.21:5050) is detected
>>> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided.
>>> Attempting to register without authentication
>>> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
>>> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing
>>> sending status updates
>>> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming
>>> sending status updates
>>> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 in work directory
>>> '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>>> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for
>>> executor
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework '20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container
>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and
>>> executor
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of
>>> framework '20150310-112310-354337546-5050-895-0000'
>>> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to
>>> '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
>>> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>> framework '20150310-112310-354337546-5050-895-0000' in container
>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>>> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for
>>> executor
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>>> 10.195.30.19:56574
>>> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>> executor
>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update
>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>>> 10.195.30.19:56574
>>> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received
>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.678411 24461 status_update_manager.hpp:346] Checkpointing
>>> UPDATE for status update TASK_RUNNING (UUID:
>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update
>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 to
>>> master@10.195.30.21:5050
>>> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for
>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000 to executor(1)@
>>> 10.195.30.19:56574
>>> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received
>>> status update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e)
>>> for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.689028 24461 status_update_manager.hpp:346] Checkpointing
>>> ACK for status update TASK_RUNNING (UUID:
>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>> framework 20150310-112310-354337546-5050-895-0000
>>> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to 204
>>> at
>>> /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated
>>> 'memory.soft_limit_in_bytes' to 160MB for container
>>> a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated
>>> 'memory.limit_in_bytes' to 160MB at
>>> /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max
>>> allowed age: 5.555425200437824days
>>>
>>>
>>> *From previous Marathon leader 10.195.30.21 <http://10.195.30.21>:*
>>> I0310 11:31:40.017248 1115 detector.cpp:138] Detected a new leader:
>>> (id='2')
>>> I0310 11:31:40.017334 1115 group.cpp:659] Trying to get
>>> '/mesos/info_0000000002' in ZooKeeper
>>> I0310 11:31:40.017727 1115 detector.cpp:433] A new leading master (UPID=
>>> master@10.195.30.21:5050) is detected
>>> [2015-03-10 11:31:40,017] WARN Disconnected
>>> (mesosphere.marathon.MarathonScheduler:224)
>>> [2015-03-10 11:31:40,019] INFO Abdicating
>>> (mesosphere.marathon.MarathonSchedulerService:312)
>>> [2015-03-10 11:31:40,019] INFO Defeat leadership
>>> (mesosphere.marathon.MarathonSchedulerService:285)
>>> [INFO] [03/10/2015 11:31:40.019]
>>> [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b]
>>> POSTing to all endpoints.
>>> [INFO] [03/10/2015 11:31:40.019]
>>> [marathon-akka.actor.default-dispatcher-5]
>>> [akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
>>> [2015-03-10 11:31:40,021] INFO Stopping driver
>>> (mesosphere.marathon.MarathonSchedulerService:221)
>>> I0310 11:31:40.022001 1115 sched.cpp:1286] Asked to stop the driver
>>> [2015-03-10 11:31:40,024] INFO Setting framework ID to
>>> 20150310-112310-320783114-5050-24289-0000
>>> (mesosphere.marathon.MarathonSchedulerService:73)
>>> I0310 11:31:40.026274 1115 sched.cpp:234] New master detected at
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.026418 1115 sched.cpp:242] No credentials provided.
>>> Attempting to register without authentication
>>> I0310 11:31:40.026458 1115 sched.cpp:752] Stopping framework
>>> '20150310-112310-320783114-5050-24289-0000'
>>> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing
>>> optional abdication command.
>>> (mesosphere.marathon.MarathonSchedulerService:192)
>>> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface)
>>> (mesosphere.marathon.MarathonSchedulerService:246)
>>> [2015-03-10 11:31:40,032] INFO Defeat leadership
>>> (mesosphere.marathon.MarathonSchedulerService:285)
>>> [2015-03-10 11:31:40,032] INFO Stopping driver
>>> (mesosphere.marathon.MarathonSchedulerService:221)
>>> I0310 11:31:40.032588 1107 sched.cpp:1286] Asked to stop the driver
>>> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500
>>> milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334)
>>> [2015-03-10 11:31:40,033] INFO Setting framework ID to
>>> 20150310-112310-320783114-5050-24289-0000
>>> (mesosphere.marathon.MarathonSchedulerService:73)
>>> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is
>>> not a candidate for leader, current voting: [member_0000000001,
>>> member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
>>> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering
>>> leadership (mesosphere.marathon.MarathonSchedulerService:341)
>>> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003
>>> (com.twitter.common.zookeeper.Group:426)
>>> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is
>>> not a candidate for leader, current voting: [member_0000000001,
>>> member_0000000002, member_0000000003]
>>> (com.twitter.common.zookeeper.CandidateImpl:144)
>>> [2015-03-10 11:31:40,568] INFO Candidate
>>> /marathon/leader/member_0000000003 waiting for the next leader election,
>>> current voting: [member_0000000001, member_0000000002, member_0000000003]
>>> (com.twitter.common.zookeeper.CandidateImpl:165)
>>>
>>>
>>> *From new Marathon leader 10.195.30.20 <http://10.195.30.20>:*
>>> [2015-03-10 11:31:40,029] INFO Candidate
>>> /marathon/leader/member_0000000001 is now leader of group:
>>> [member_0000000001, member_0000000002]
>>> (com.twitter.common.zookeeper.CandidateImpl:152)
>>> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface)
>>> (mesosphere.marathon.MarathonSchedulerService:253)
>>> [2015-03-10 11:31:40,044] INFO Elect leadership
>>> (mesosphere.marathon.MarathonSchedulerService:299)
>>> [2015-03-10 11:31:40,044] INFO Running driver
>>> (mesosphere.marathon.MarathonSchedulerService:184)
>>> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712:
>>> Client environment:zookeeper.version=zookeeper C client 3.4.5
>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716:
>>> Client environment:host.name=srv-d2u-9-virtip20
>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723:
>>> Client environment:os.name=Linux
>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724:
>>> Client environment:os.arch=3.13.0-44-generic
>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725:
>>> Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733:
>>> Client environment:user.name=(null)
>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741:
>>> Client environment:user.home=/root
>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753:
>>> Client environment:user.dir=/
>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786:
>>> Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,
>>> 10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0
>>> sessionId=0 sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
>>> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff
>>> (mesosphere.marathon.MarathonSchedulerService:329)
>>> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703:
>>> initiated connection to server [10.195.30.19:2181]
>>> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750:
>>> session establishment complete on server [10.195.30.19:2181],
>>> sessionId=0x14c0335ad7e000d, negotiated timeout=10000
>>> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@
>>> 10.195.30.20:45771) connected to ZooKeeper
>>> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations:
>>> queue size (joins, cancels, datas) = (0, 0, 0)
>>> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path
>>> '/mesos' in ZooKeeper
>>> [INFO] [03/10/2015 11:31:40.047]
>>> [marathon-akka.actor.default-dispatcher-2]
>>> [akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
>>> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader:
>>> (id='2')
>>> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get
>>> '/mesos/info_0000000002' in ZooKeeper
>>> [2015-03-10 11:31:40,053] INFO Migration successfully applied for
>>> version Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
>>> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master (UPID=
>>> master@10.195.30.21:5050) is detected
>>> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at
>>> master@10.195.30.21:5050
>>> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided.
>>> Attempting to register without authentication
>>> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with
>>> 20150310-112310-354337546-5050-895-0000
>>> [2015-03-10 11:31:40,056] INFO Registered as
>>> 20150310-112310-354337546-5050-895-0000 to master
>>> '20150310-112310-354337546-5050-895'
>>> (mesosphere.marathon.MarathonScheduler:72)
>>> [2015-03-10 11:31:40,063] INFO Stored framework ID
>>> '20150310-112310-354337546-5050-895-0000'
>>> (mesosphere.mesos.util.FrameworkIdUtil:49)
>>> [INFO] [03/10/2015 11:31:40.065]
>>> [marathon-akka.actor.default-dispatcher-6]
>>> [akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
>>> [INFO] [03/10/2015 11:31:40.067]
>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>>> POSTing to all endpoints.
>>> ...
>>> ...
>>> ...
>>> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps
>>> (mesosphere.marathon.SchedulerActions:403)
>>> [INFO] [03/10/2015 11:31:55.053]
>>> [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>>> Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
>>> encountered. This logging can be turned off or adjusted with configuration
>>> settings 'akka.log-dead-letters' and
>>> 'akka.log-dead-letters-during-shutdown'.
>>> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
>>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>>> [2015-03-10 11:31:55,064] INFO Received status update for task
>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
>>> TASK_LOST (Reconciliation: Task is unknown to the slave)
>>> (mesosphere.marathon.MarathonScheduler:148)
>>> [2015-03-10 11:31:55,069] INFO Need to scale
>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>>> (mesosphere.marathon.SchedulerActions:488)
>>> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
>>> /ffaas-backoffice-app-nopersist (0 queued)
>>> (mesosphere.marathon.SchedulerActions:494)
>>> [2015-03-10 11:31:55,069] INFO Task
>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
>>> expunged and removed from TaskTracker
>>> (mesosphere.marathon.tasks.TaskTracker:107)
>>> [2015-03-10 11:31:55,070] INFO Sending event notification.
>>> (mesosphere.marathon.MarathonScheduler:262)
>>> [INFO] [03/10/2015 11:31:55.072]
>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>>> POSTing to all endpoints.
>>> [2015-03-10 11:31:55,073] INFO Need to scale
>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>>> (mesosphere.marathon.SchedulerActions:488)
>>> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
>>> /ffaas-backoffice-app-nopersist. Not scaling.
>>> (mesosphere.marathon.SchedulerActions:498)
>>> ...
>>> ...
>>> ...
>>> [2015-03-10 11:31:57,682] INFO Received status update for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>> TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
>>> [2015-03-10 11:31:57,694] INFO Sending event notification.
>>> (mesosphere.marathon.MarathonScheduler:262)
>>> [INFO] [03/10/2015 11:31:57.694]
>>> [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
>>> POSTing to all endpoints.
>>> ...
>>> ...
>>> ...
>>> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
>>> (mesosphere.marathon.tasks.TaskTracker:170)
>>> [INFO] [03/10/2015 11:36:55.050]
>>> [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters]
>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>>> Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
>>> encountered. This logging can be turned off or adjusted with configuration
>>> settings 'akka.log-dead-letters' and
>>> 'akka.log-dead-letters-during-shutdown'.
>>> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
>>> (mesosphere.marathon.SchedulerActions:403)
>>> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
>>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>>> [2015-03-10 11:36:55,063] INFO Received status update for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>> TASK_RUNNING (Reconciliation: Latest task state)
>>> (mesosphere.marathon.MarathonScheduler:148)
>>> [2015-03-10 11:36:55,065] INFO Received status update for task
>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>> TASK_RUNNING (Reconciliation: Latest task state)
>>> (mesosphere.marathon.MarathonScheduler:148)
>>> [2015-03-10 11:36:55,066] INFO Already running 1 instances of
>>> /ffaas-backoffice-app-nopersist. Not scaling.
>>> (mesosphere.marathon.SchedulerActions:512)
>>>
>>>
>>>
>>> -- End of logs
>>>
>>>
>>>
>>> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <ad...@mesosphere.io>:
>>>
>>>> This is certainly not the expected/desired behavior when failing over a
>>>> mesos master in HA mode. In addition to the master logs Alex requested, can
>>>> you also provide relevant portions of the slave logs for these tasks? If
>>>> the slave processes themselves never failed over, checkpointing and slave
>>>> recovery should be irrelevant. Are you running the mesos-slave itself
>>>> inside a Docker container, or using any other non-traditional setup?
>>>>
>>>> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
>>>> defaults to "reconnect", and --strict defaults to true, so none of those
>>>> are necessary.
>>>>
>>>> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <al...@mesosphere.io>
>>>> wrote:
>>>>
>>>>> Geoffroy,
>>>>>
>>>>> could you please provide master logs (both from killed and taking over
>>>>> masters)?
>>>>>
>>>>> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <
>>>>> geoffroy.jabouley@gmail.com> wrote:
>>>>>
>>>>>> Hello
>>>>>>
>>>>>> we are facing some unexpected issues when testing the high
>>>>>> availability behavior of our mesos cluster.
>>>>>>
>>>>>> *Our use case:*
>>>>>>
>>>>>> *State*: the mesos cluster is up (3 machines), 1 docker task is
>>>>>> running on each slave (started from marathon)
>>>>>>
>>>>>> *Action*: stop the mesos master leader process
>>>>>>
>>>>>> *Expected*: mesos master leader has changed, *active tasks remain
>>>>>> unchanged*
>>>>>>
>>>>>> *Seen*: mesos master leader has changed, *all active tasks are now
>>>>>> FAILED but docker containers are still running*, marathon detects
>>>>>> FAILED tasks and starts new tasks. We end with 2 docker containers running
>>>>>> on each machine, but only one is linked to a RUNNING mesos task.
>>>>>>
>>>>>>
>>>>>> Is the seen behavior correct?
>>>>>>
>>>>>> Have we misunderstood the high availability concept? We thought that
>>>>>> doing this use case would not have any impact on the current cluster state
>>>>>> (except leader re-election)
>>>>>>
>>>>>> Thanks in advance for your help
>>>>>> Regards
>>>>>>
>>>>>> ---------------------------------------------------
>>>>>>
>>>>>> our setup is the following:
>>>>>> 3 identical mesos nodes with:
>>>>>> + zookeeper
>>>>>> + docker 1.5
>>>>>> + mesos master 0.21.1 configured in HA mode
>>>>>> + mesos slave 0.21.1 configured with checkpointing, strict and
>>>>>> reconnect
>>>>>> + marathon 0.8.0 configured in HA mode with checkpointing
>>>>>>
>>>>>> ---------------------------------------------------
>>>>>>
>>>>>> Command lines:
>>>>>>
>>>>>>
>>>>>> *mesos-master*usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
>>>>>> --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
>>>>>> --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
>>>>>>
>>>>>> *mesos-slave*
>>>>>> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,
>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint
>>>>>> --containerizers=docker,mesos --executor_registration_timeout=5mins
>>>>>> --hostname=10.195.30.19 --ip=10.195.30.19
>>>>>> --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
>>>>>> --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>>>>>>
>>>>>> *marathon*
>>>>>> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
>>>>>> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
>>>>>> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
>>>>>> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
>>>>>> --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
>>>>>> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
>>>>>> 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
Re: Weird behavior when stopping the mesos master leader of a HA
mesos cluster
Posted by Geoffroy Jabouley <ge...@gmail.com>.
Thanks Alex for your answer. I will have a look.
Would it be better to (cross-)post this discussion on the Marathon mailing
list?
Anyway, the issue is "fixed" for 0.8.0, which is the version I'm using.
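One quick way to confirm the diagnosis of a register-instead-of-re-register is to compare the framework IDs reported by the master's state endpoint before and after the failover. The sketch below is illustrative, not from the thread: it assumes the `/master/state.json` payload shape of Mesos 0.21.x, and the sample IDs are copied from the logs later in this thread (in a real check you would fetch `http://<leader>:5050/master/state.json`).

```python
import json


def framework_ids(state):
    """Return the IDs of all frameworks listed in a master state.json payload."""
    return [fw["id"] for fw in state.get("frameworks", [])]


# Tiny inline sample mirroring the two marathon framework IDs seen in the
# logs of this thread; a real check would fetch the endpoint over HTTP.
sample = json.loads("""
{
  "frameworks": [
    {"id": "20150310-112310-354337546-5050-895-0000", "name": "marathon"}
  ]
}
""")

print(framework_ids(sample))  # → ['20150310-112310-354337546-5050-895-0000']
```

If the ID listed under "frameworks" after the failover differs from the one your tasks were launched with, the scheduler registered as a brand-new framework rather than re-registering, and the master treats the old tasks as orphaned.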
2015-03-11 22:18 GMT+01:00 Alex Rukletsov <al...@mesosphere.io>:
> Geoffroy,
>
> most probably you're hitting this bug:
> https://github.com/mesosphere/marathon/issues/1063. The problem is that
> Marathon can register instead of re-registering when a master fails
> over. From the master's point of view, it is then a new framework, which
> is why the previous task is gone and a new one (that technically belongs
> to a new framework) is started. You can see in your example that the
> framework has two different IDs (compare the log lines at
> 11:31:40.055496 and 11:31:40.785038).
>
> Hope that helps,
> Alex
>
> On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley <
> geoffroy.jabouley@gmail.com> wrote:
>
>> Hello
>>
>> thanks for your interest. Following are the requested logs, which will
>> result in a pretty big mail.
>>
>> Mesos/Marathon are *NOT running inside docker*; we only use Docker as
>> the Mesos containerizer.
>>
>> For reminder, here is the use case performed to get the logs file:
>>
>> --------------------------------
>>
>> Our cluster: 3 identical mesos nodes with:
>> + zookeeper
>> + docker 1.5
>> + mesos master 0.21.1 configured in HA mode
>> + mesos slave 0.21.1 configured with checkpointing, strict and
>> reconnect
>> + marathon 0.8.0 configured in HA mode with checkpointing
>>
>> --------------------------------
>>
>> *Begin State: *
>> + the mesos cluster is up (3 machines)
>> + mesos master leader is 10.195.30.19
>> + marathon leader is 10.195.30.21
>> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
>>
>> *Action*: stop the mesos master leader process (sudo stop mesos-master)
>>
>> *Expected*: mesos master leader has changed, active tasks / frameworks
>> remain unchanged
>>
>> *End state: *
>> + mesos master leader *has changed, now 10.195.30.21*
>> + the previously running APPTASK on slave 10.195.30.21 has "disappeared"
>> (it no longer shows in the Mesos UI), but *its docker container is still
>> running*
>> + a *new APPTASK is now running on slave 10.195.30.19*
>> + marathon framework "registration time" in mesos UI shows "Just now"
>> + marathon leader *has changed, now 10.195.30.20*
>>
>>
>> --------------------------------
>>
>> Now come the 6 requested logs, which might contain interesting/relevant
>> information, but as a newcomer to Mesos I find them hard to read...
>>
>>
>> *from previous MESOS master leader 10.195.30.19 <http://10.195.30.19>:*
>> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM
>> from process 1 of user 0; exiting
>>
>>
>> *from new MESOS master leader 10.195.30.21 <http://10.195.30.21>:*
>> I0310 11:31:40.011545 922 detector.cpp:138] Detected a new leader:
>> (id='2')
>> I0310 11:31:40.011823 922 group.cpp:659] Trying to get
>> '/mesos/info_0000000002' in ZooKeeper
>> I0310 11:31:40.015496 915 network.hpp:424] ZooKeeper group memberships
>> changed
>> I0310 11:31:40.015847 915 group.cpp:659] Trying to get
>> '/mesos/log_replicas/0000000000' in ZooKeeper
>> I0310 11:31:40.016047 922 detector.cpp:433] A new leading master (UPID=
>> master@10.195.30.21:5050) is detected
>> I0310 11:31:40.016074 922 master.cpp:1263] The newly elected leader is
>> master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
>> I0310 11:31:40.016089 922 master.cpp:1276] Elected as the leading
>> master!
>> I0310 11:31:40.016108 922 master.cpp:1094] Recovering from registrar
>> I0310 11:31:40.016188 918 registrar.cpp:313] Recovering registrar
>> I0310 11:31:40.016542 918 log.cpp:656] Attempting to start the writer
>> I0310 11:31:40.016918 918 replica.cpp:474] Replica received implicit
>> promise request with proposal 2
>> I0310 11:31:40.017503 915 group.cpp:659] Trying to get
>> '/mesos/log_replicas/0000000003' in ZooKeeper
>> I0310 11:31:40.017832 918 leveldb.cpp:306] Persisting metadata (8
>> bytes) to leveldb took 893672ns
>> I0310 11:31:40.017848 918 replica.cpp:342] Persisted promised to 2
>> I0310 11:31:40.018817 915 network.hpp:466] ZooKeeper group PIDs: {
>> log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
>> I0310 11:31:40.023022 923 coordinator.cpp:230] Coordinator attemping to
>> fill missing position
>> I0310 11:31:40.023110 923 log.cpp:672] Writer started with ending
>> position 8
>> I0310 11:31:40.023293 923 leveldb.cpp:438] Reading position from
>> leveldb took 13195ns
>> I0310 11:31:40.023309 923 leveldb.cpp:438] Reading position from
>> leveldb took 3120ns
>> I0310 11:31:40.023619 922 registrar.cpp:346] Successfully fetched the
>> registry (610B) in 7.385856ms
>> I0310 11:31:40.023679 922 registrar.cpp:445] Applied 1 operations in
>> 9263ns; attempting to update the 'registry'
>> I0310 11:31:40.024238 922 log.cpp:680] Attempting to append 647 bytes
>> to the log
>> I0310 11:31:40.024279 923 coordinator.cpp:340] Coordinator attempting
>> to write APPEND action at position 9
>> I0310 11:31:40.024435 923 replica.cpp:508] Replica received write
>> request for position 9
>> I0310 11:31:40.025707 923 leveldb.cpp:343] Persisting action (666
>> bytes) to leveldb took 1.259338ms
>> I0310 11:31:40.025722 923 replica.cpp:676] Persisted action at 9
>> I0310 11:31:40.026074 923 replica.cpp:655] Replica received learned
>> notice for position 9
>> I0310 11:31:40.026495 923 leveldb.cpp:343] Persisting action (668
>> bytes) to leveldb took 404795ns
>> I0310 11:31:40.026507 923 replica.cpp:676] Persisted action at 9
>> I0310 11:31:40.026511 923 replica.cpp:661] Replica learned APPEND
>> action at position 9
>> I0310 11:31:40.026726 923 registrar.cpp:490] Successfully updated the
>> 'registry' in 3.029248ms
>> I0310 11:31:40.026765 923 registrar.cpp:376] Successfully recovered
>> registrar
>> I0310 11:31:40.026814 923 log.cpp:699] Attempting to truncate the log
>> to 9
>> I0310 11:31:40.026880 923 master.cpp:1121] Recovered 3 slaves from the
>> Registry (608B) ; allowing 1days for slaves to re-register
>> I0310 11:31:40.026897 923 coordinator.cpp:340] Coordinator attempting
>> to write TRUNCATE action at position 10
>> I0310 11:31:40.026988 923 replica.cpp:508] Replica received write
>> request for position 10
>> I0310 11:31:40.027640 923 leveldb.cpp:343] Persisting action (16 bytes)
>> to leveldb took 641018ns
>> I0310 11:31:40.027652 923 replica.cpp:676] Persisted action at 10
>> I0310 11:31:40.030848 923 replica.cpp:655] Replica received learned
>> notice for position 10
>> I0310 11:31:40.031883 923 leveldb.cpp:343] Persisting action (18 bytes)
>> to leveldb took 1.008914ms
>> I0310 11:31:40.031963 923 leveldb.cpp:401] Deleting ~2 keys from
>> leveldb took 46724ns
>> I0310 11:31:40.031977 923 replica.cpp:676] Persisted action at 10
>> I0310 11:31:40.031986 923 replica.cpp:661] Replica learned TRUNCATE
>> action at position 10
>> I0310 11:31:40.055415 918 master.cpp:1383] Received registration
>> request for framework 'marathon' at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.055496 918 master.cpp:1447] Registering framework
>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.055642 919 hierarchical_allocator_process.hpp:329] Added
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:40.189151 919 master.cpp:3246] Re-registering slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19)
>> I0310 11:31:40.189280 919 registrar.cpp:445] Applied 1 operations in
>> 15452ns; attempting to update the 'registry'
>> I0310 11:31:40.189949 919 log.cpp:680] Attempting to append 647 bytes
>> to the log
>> I0310 11:31:40.189978 919 coordinator.cpp:340] Coordinator attempting
>> to write APPEND action at position 11
>> I0310 11:31:40.190112 919 replica.cpp:508] Replica received write
>> request for position 11
>> I0310 11:31:40.190563 919 leveldb.cpp:343] Persisting action (666
>> bytes) to leveldb took 437440ns
>> I0310 11:31:40.190577 919 replica.cpp:676] Persisted action at 11
>> I0310 11:31:40.191249 921 replica.cpp:655] Replica received learned
>> notice for position 11
>> I0310 11:31:40.192159 921 leveldb.cpp:343] Persisting action (668
>> bytes) to leveldb took 892767ns
>> I0310 11:31:40.192178 921 replica.cpp:676] Persisted action at 11
>> I0310 11:31:40.192184 921 replica.cpp:661] Replica learned APPEND
>> action at position 11
>> I0310 11:31:40.192350 921 registrar.cpp:490] Successfully updated the
>> 'registry' in 3.0528ms
>> I0310 11:31:40.192387 919 log.cpp:699] Attempting to truncate the log
>> to 11
>> I0310 11:31:40.192415 919 coordinator.cpp:340] Coordinator attempting
>> to write TRUNCATE action at position 12
>> I0310 11:31:40.192539 915 replica.cpp:508] Replica received write
>> request for position 12
>> I0310 11:31:40.192600 921 master.cpp:3314] Re-registered slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148
>> I0310 11:31:40.192680 917 hierarchical_allocator_process.hpp:442] Added
>> slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with
>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>> available)
>> I0310 11:31:40.192847 917 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.193164 915 leveldb.cpp:343] Persisting action (16 bytes)
>> to leveldb took 610664ns
>> I0310 11:31:40.193181 915 replica.cpp:676] Persisted action at 12
>> I0310 11:31:40.193568 915 replica.cpp:655] Replica received learned
>> notice for position 12
>> I0310 11:31:40.193948 915 leveldb.cpp:343] Persisting action (18 bytes)
>> to leveldb took 364062ns
>> I0310 11:31:40.193979 915 leveldb.cpp:401] Deleting ~2 keys from
>> leveldb took 12256ns
>> I0310 11:31:40.193985 915 replica.cpp:676] Persisted action at 12
>> I0310 11:31:40.193990 915 replica.cpp:661] Replica learned TRUNCATE
>> action at position 12
>> I0310 11:31:40.248615 915 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O0 ] on slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.248744 915 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S1 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:40.774416 915 master.cpp:3246] Re-registering slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21)
>> I0310 11:31:40.774976 915 registrar.cpp:445] Applied 1 operations in
>> 42342ns; attempting to update the 'registry'
>> I0310 11:31:40.777273 921 log.cpp:680] Attempting to append 647 bytes
>> to the log
>> I0310 11:31:40.777436 921 coordinator.cpp:340] Coordinator attempting
>> to write APPEND action at position 13
>> I0310 11:31:40.777989 921 replica.cpp:508] Replica received write
>> request for position 13
>> I0310 11:31:40.779558 921 leveldb.cpp:343] Persisting action (666
>> bytes) to leveldb took 1.513714ms
>> I0310 11:31:40.779633 921 replica.cpp:676] Persisted action at 13
>> I0310 11:31:40.781821 919 replica.cpp:655] Replica received learned
>> notice for position 13
>> I0310 11:31:40.784417 919 leveldb.cpp:343] Persisting action (668
>> bytes) to leveldb took 2.542036ms
>> I0310 11:31:40.784446 919 replica.cpp:676] Persisted action at 13
>> I0310 11:31:40.784452 919 replica.cpp:661] Replica learned APPEND
>> action at position 13
>> I0310 11:31:40.784711 920 registrar.cpp:490] Successfully updated the
>> 'registry' in 9.68192ms
>> I0310 11:31:40.784762 917 log.cpp:699] Attempting to truncate the log
>> to 13
>> I0310 11:31:40.784808 920 coordinator.cpp:340] Coordinator attempting
>> to write TRUNCATE action at position 14
>> I0310 11:31:40.784865 917 master.hpp:877] Adding task
>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with
>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>> 20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
>> I0310 11:31:40.784955 919 replica.cpp:508] Replica received write
>> request for position 14
>> W0310 11:31:40.785038 917 master.cpp:4468] Possibly orphaned task
>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of
>> framework 20150310-112310-320783114-5050-24289-0000 running on slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21)
>> I0310 11:31:40.785105 917 master.cpp:3314] Re-registered slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148
>> I0310 11:31:40.785162 920 hierarchical_allocator_process.hpp:442] Added
>> slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with
>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>> ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148
>> available)
>> I0310 11:31:40.785679 921 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.786429 919 leveldb.cpp:343] Persisting action (16 bytes)
>> to leveldb took 1.454211ms
>> I0310 11:31:40.786455 919 replica.cpp:676] Persisted action at 14
>> I0310 11:31:40.786782 919 replica.cpp:655] Replica received learned
>> notice for position 14
>> I0310 11:31:40.787833 919 leveldb.cpp:343] Persisting action (18 bytes)
>> to leveldb took 1.027014ms
>> I0310 11:31:40.787873 919 leveldb.cpp:401] Deleting ~2 keys from
>> leveldb took 14085ns
>> I0310 11:31:40.787883 919 replica.cpp:676] Persisted action at 14
>> I0310 11:31:40.787889 919 replica.cpp:661] Replica learned TRUNCATE
>> action at position 14
>> I0310 11:31:40.792536 922 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O1 ] on slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.792670 922 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S2 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:40.819602 921 master.cpp:3246] Re-registering slave
>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>> (10.195.30.20)
>> I0310 11:31:40.819736 921 registrar.cpp:445] Applied 1 operations in
>> 16656ns; attempting to update the 'registry'
>> I0310 11:31:40.820617 921 log.cpp:680] Attempting to append 647 bytes
>> to the log
>> I0310 11:31:40.820726 918 coordinator.cpp:340] Coordinator attempting
>> to write APPEND action at position 15
>> I0310 11:31:40.820938 918 replica.cpp:508] Replica received write
>> request for position 15
>> I0310 11:31:40.821641 918 leveldb.cpp:343] Persisting action (666
>> bytes) to leveldb took 670583ns
>> I0310 11:31:40.821663 918 replica.cpp:676] Persisted action at 15
>> I0310 11:31:40.822265 917 replica.cpp:655] Replica received learned
>> notice for position 15
>> I0310 11:31:40.823463 917 leveldb.cpp:343] Persisting action (668
>> bytes) to leveldb took 1.178687ms
>> I0310 11:31:40.823490 917 replica.cpp:676] Persisted action at 15
>> I0310 11:31:40.823498 917 replica.cpp:661] Replica learned APPEND
>> action at position 15
>> I0310 11:31:40.823755 917 registrar.cpp:490] Successfully updated the
>> 'registry' in 3.97696ms
>> I0310 11:31:40.823823 917 log.cpp:699] Attempting to truncate the log
>> to 15
>> I0310 11:31:40.824147 922 coordinator.cpp:340] Coordinator attempting
>> to write TRUNCATE action at position 16
>> I0310 11:31:40.824482 922 hierarchical_allocator_process.hpp:442] Added
>> slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with
>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>> available)
>> I0310 11:31:40.824597 921 replica.cpp:508] Replica received write
>> request for position 16
>> I0310 11:31:40.824128 917 master.cpp:3314] Re-registered slave
>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>> (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148
>> I0310 11:31:40.824975 917 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.831900 921 leveldb.cpp:343] Persisting action (16 bytes)
>> to leveldb took 7.228682ms
>> I0310 11:31:40.832031 921 replica.cpp:676] Persisted action at 16
>> I0310 11:31:40.832456 917 replica.cpp:655] Replica received learned
>> notice for position 16
>> I0310 11:31:40.835178 917 leveldb.cpp:343] Persisting action (18 bytes)
>> to leveldb took 2.674392ms
>> I0310 11:31:40.835297 917 leveldb.cpp:401] Deleting ~2 keys from
>> leveldb took 45220ns
>> I0310 11:31:40.835322 917 replica.cpp:676] Persisted action at 16
>> I0310 11:31:40.835341 917 replica.cpp:661] Replica learned TRUNCATE
>> action at position 16
>> I0310 11:31:40.838281 923 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O2 ] on slave
>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:40.838389 923 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S0 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:40.948725 919 http.cpp:344] HTTP request for
>> '/master/redirect'
>> I0310 11:31:41.479118 918 http.cpp:478] HTTP request for
>> '/master/state.json'
>> I0310 11:31:45.368074 918 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:45.385144 917 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O3 ] on slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:45.385292 917 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S1 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:46.368450 917 master.cpp:3843] Sending 2 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:46.375222 920 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O4 ] on slave
>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:46.375360 920 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O5 ] on slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:46.375530 920 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S0 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:46.375599 920 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S2 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:48.031230 915 http.cpp:478] HTTP request for
>> '/master/state.json'
>> I0310 11:31:51.374285 922 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:51.379391 921 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O6 ] on slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:51.379487 921 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S1 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:51.482094 923 http.cpp:478] HTTP request for
>> '/master/state.json'
>> I0310 11:31:52.375326 917 master.cpp:3843] Sending 2 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:52.391376 919 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O7 ] on slave
>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:52.391512 919 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S2 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:52.391659 921 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O8 ] on slave
>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:52.391751 921 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S0 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:55.062060 918 master.cpp:3611] Performing explicit task
>> state reconciliation for 1 tasks of framework
>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:55.062588 919 master.cpp:3556] Performing implicit task
>> state reconciliation for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:56.140990 923 http.cpp:344] HTTP request for
>> '/master/redirect'
>> I0310 11:31:57.379288 918 master.cpp:3843] Sending 1 offers to
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:57.430888 918 master.cpp:2344] Processing reply for offers:
>> [ 20150310-112310-354337546-5050-895-O9 ] on slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>> (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>> I0310 11:31:57.431068 918 master.hpp:877] Adding task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with
>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>> 20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
>> I0310 11:31:57.431089 918 master.cpp:2503] Launching task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with
>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19)
>> I0310 11:31:57.431205 918 hierarchical_allocator_process.hpp:563]
>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>> 20150310-112310-320783114-5050-24289-S1 from framework
>> 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.682133 919 master.cpp:3446] Forwarding status update
>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.682186 919 master.cpp:3418] Status update TASK_RUNNING
>> (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 from slave
>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>> (10.195.30.19)
>> I0310 11:31:57.682199 919 master.cpp:4693] Updating the latest state of
>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
>>
>>
>> *from MESOS slave 10.195.30.21*
>> I0310 11:31:28.750200 1074 slave.cpp:2623] master@10.195.30.19:5050
>> exited
>> W0310 11:31:28.750249 1074 slave.cpp:2626] Master disconnected! Waiting
>> for a new master to be elected
>> I0310 11:31:40.012516 1075 detector.cpp:138] Detected a new leader:
>> (id='2')
>> I0310 11:31:40.012899 1073 group.cpp:659] Trying to get
>> '/mesos/info_0000000002' in ZooKeeper
>> I0310 11:31:40.017143 1072 detector.cpp:433] A new leading master (UPID=
>> master@10.195.30.21:5050) is detected
>> I0310 11:31:40.017408 1072 slave.cpp:602] New master detected at
>> master@10.195.30.21:5050
>> I0310 11:31:40.017546 1076 status_update_manager.cpp:171] Pausing
>> sending status updates
>> I0310 11:31:40.018673 1072 slave.cpp:627] No credentials provided.
>> Attempting to register without authentication
>> I0310 11:31:40.018689 1072 slave.cpp:638] Detecting new master
>> I0310 11:31:40.785364 1075 slave.cpp:824] Re-registered with master
>> master@10.195.30.21:5050
>> I0310 11:31:40.785398 1075 status_update_manager.cpp:178] Resuming
>> sending status updates
>> I0310 11:32:10.639506 1075 slave.cpp:3321] Current usage 12.27%. Max
>> allowed age: 5.441217749539572days
>>
>>
>> *from MESOS slave 10.195.30.19*
>> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050
>> exited
>> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected! Waiting
>> for a new master to be elected
>> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader:
>> (id='2')
>> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get
>> '/mesos/info_0000000002' in ZooKeeper
>> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master (UPID=
>> master@10.195.30.21:5050) is detected
>> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at
>> master@10.195.30.21:5050
>> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided.
>> Attempting to register without authentication
>> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
>> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing
>> sending status updates
>> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master
>> master@10.195.30.21:5050
>> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming
>> sending status updates
>> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 in work directory
>> '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for
>> executor
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework '20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container
>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and
>> executor
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of
>> framework '20150310-112310-354337546-5050-895-0000'
>> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to
>> '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
>> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>> framework '20150310-112310-354337546-5050-895-0000' in container
>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for executor
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>> 10.195.30.19:56574
>> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>> executor
>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update
>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>> 10.195.30.19:56574
>> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received
>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.678411 24461 status_update_manager.hpp:346] Checkpointing
>> UPDATE for status update TASK_RUNNING (UUID:
>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update
>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 to
>> master@10.195.30.21:5050
>> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for
>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000 to executor(1)@
>> 10.195.30.19:56574
>> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received
>> status update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e)
>> for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.689028 24461 status_update_manager.hpp:346] Checkpointing
>> ACK for status update TASK_RUNNING (UUID:
>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>> framework 20150310-112310-354337546-5050-895-0000
>> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to 204
>> at
>> /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated
>> 'memory.soft_limit_in_bytes' to 160MB for container
>> a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated
>> 'memory.limit_in_bytes' to 160MB at
>> /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max
>> allowed age: 5.555425200437824days
>>
>>
>> *From previous Marathon leader 10.195.30.21:*
>> I0310 11:31:40.017248 1115 detector.cpp:138] Detected a new leader:
>> (id='2')
>> I0310 11:31:40.017334 1115 group.cpp:659] Trying to get
>> '/mesos/info_0000000002' in ZooKeeper
>> I0310 11:31:40.017727 1115 detector.cpp:433] A new leading master (UPID=
>> master@10.195.30.21:5050) is detected
>> [2015-03-10 11:31:40,017] WARN Disconnected
>> (mesosphere.marathon.MarathonScheduler:224)
>> [2015-03-10 11:31:40,019] INFO Abdicating
>> (mesosphere.marathon.MarathonSchedulerService:312)
>> [2015-03-10 11:31:40,019] INFO Defeat leadership
>> (mesosphere.marathon.MarathonSchedulerService:285)
>> [INFO] [03/10/2015 11:31:40.019]
>> [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b]
>> POSTing to all endpoints.
>> [INFO] [03/10/2015 11:31:40.019]
>> [marathon-akka.actor.default-dispatcher-5]
>> [akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
>> [2015-03-10 11:31:40,021] INFO Stopping driver
>> (mesosphere.marathon.MarathonSchedulerService:221)
>> I0310 11:31:40.022001 1115 sched.cpp:1286] Asked to stop the driver
>> [2015-03-10 11:31:40,024] INFO Setting framework ID to
>> 20150310-112310-320783114-5050-24289-0000
>> (mesosphere.marathon.MarathonSchedulerService:73)
>> I0310 11:31:40.026274 1115 sched.cpp:234] New master detected at
>> master@10.195.30.21:5050
>> I0310 11:31:40.026418 1115 sched.cpp:242] No credentials provided.
>> Attempting to register without authentication
>> I0310 11:31:40.026458 1115 sched.cpp:752] Stopping framework
>> '20150310-112310-320783114-5050-24289-0000'
>> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing
>> optional abdication command.
>> (mesosphere.marathon.MarathonSchedulerService:192)
>> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface)
>> (mesosphere.marathon.MarathonSchedulerService:246)
>> [2015-03-10 11:31:40,032] INFO Defeat leadership
>> (mesosphere.marathon.MarathonSchedulerService:285)
>> [2015-03-10 11:31:40,032] INFO Stopping driver
>> (mesosphere.marathon.MarathonSchedulerService:221)
>> I0310 11:31:40.032588 1107 sched.cpp:1286] Asked to stop the driver
>> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500
>> milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334)
>> [2015-03-10 11:31:40,033] INFO Setting framework ID to
>> 20150310-112310-320783114-5050-24289-0000
>> (mesosphere.marathon.MarathonSchedulerService:73)
>> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is
>> not a candidate for leader, current voting: [member_0000000001,
>> member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
>> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering leadership
>> (mesosphere.marathon.MarathonSchedulerService:341)
>> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003
>> (com.twitter.common.zookeeper.Group:426)
>> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is
>> not a candidate for leader, current voting: [member_0000000001,
>> member_0000000002, member_0000000003]
>> (com.twitter.common.zookeeper.CandidateImpl:144)
>> [2015-03-10 11:31:40,568] INFO Candidate
>> /marathon/leader/member_0000000003 waiting for the next leader election,
>> current voting: [member_0000000001, member_0000000002, member_0000000003]
>> (com.twitter.common.zookeeper.CandidateImpl:165)
>>
>>
>> *From new Marathon leader 10.195.30.20:*
>> [2015-03-10 11:31:40,029] INFO Candidate
>> /marathon/leader/member_0000000001 is now leader of group:
>> [member_0000000001, member_0000000002]
>> (com.twitter.common.zookeeper.CandidateImpl:152)
>> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface)
>> (mesosphere.marathon.MarathonSchedulerService:253)
>> [2015-03-10 11:31:40,044] INFO Elect leadership
>> (mesosphere.marathon.MarathonSchedulerService:299)
>> [2015-03-10 11:31:40,044] INFO Running driver
>> (mesosphere.marathon.MarathonSchedulerService:184)
>> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712:
>> Client environment:zookeeper.version=zookeeper C client 3.4.5
>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716:
>> Client environment:host.name=srv-d2u-9-virtip20
>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723:
>> Client environment:os.name=Linux
>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724:
>> Client environment:os.arch=3.13.0-44-generic
>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725:
>> Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733:
>> Client environment:user.name=(null)
>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741:
>> Client environment:user.home=/root
>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753:
>> Client environment:user.dir=/
>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786:
>> Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,
>> 10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0
>> sessionId=0 sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
>> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff
>> (mesosphere.marathon.MarathonSchedulerService:329)
>> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703:
>> initiated connection to server [10.195.30.19:2181]
>> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750:
>> session establishment complete on server [10.195.30.19:2181],
>> sessionId=0x14c0335ad7e000d, negotiated timeout=10000
>> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@
>> 10.195.30.20:45771) connected to ZooKeeper
>> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations:
>> queue size (joins, cancels, datas) = (0, 0, 0)
>> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path '/mesos'
>> in ZooKeeper
>> [INFO] [03/10/2015 11:31:40.047]
>> [marathon-akka.actor.default-dispatcher-2]
>> [akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
>> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader:
>> (id='2')
>> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get
>> '/mesos/info_0000000002' in ZooKeeper
>> [2015-03-10 11:31:40,053] INFO Migration successfully applied for version
>> Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
>> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master (UPID=
>> master@10.195.30.21:5050) is detected
>> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at
>> master@10.195.30.21:5050
>> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided.
>> Attempting to register without authentication
>> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with
>> 20150310-112310-354337546-5050-895-0000
>> [2015-03-10 11:31:40,056] INFO Registered as
>> 20150310-112310-354337546-5050-895-0000 to master
>> '20150310-112310-354337546-5050-895'
>> (mesosphere.marathon.MarathonScheduler:72)
>> [2015-03-10 11:31:40,063] INFO Stored framework ID
>> '20150310-112310-354337546-5050-895-0000'
>> (mesosphere.mesos.util.FrameworkIdUtil:49)
>> [INFO] [03/10/2015 11:31:40.065]
>> [marathon-akka.actor.default-dispatcher-6]
>> [akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
>> [INFO] [03/10/2015 11:31:40.067]
>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>> POSTing to all endpoints.
>> ...
>> ...
>> ...
>> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps
>> (mesosphere.marathon.SchedulerActions:403)
>> [INFO] [03/10/2015 11:31:55.053]
>> [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>> Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
>> encountered. This logging can be turned off or adjusted with configuration
>> settings 'akka.log-dead-letters' and
>> 'akka.log-dead-letters-during-shutdown'.
>> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>> [2015-03-10 11:31:55,064] INFO Received status update for task
>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
>> TASK_LOST (Reconciliation: Task is unknown to the slave)
>> (mesosphere.marathon.MarathonScheduler:148)
>> [2015-03-10 11:31:55,069] INFO Need to scale
>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>> (mesosphere.marathon.SchedulerActions:488)
>> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
>> /ffaas-backoffice-app-nopersist (0 queued)
>> (mesosphere.marathon.SchedulerActions:494)
>> [2015-03-10 11:31:55,069] INFO Task
>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
>> expunged and removed from TaskTracker
>> (mesosphere.marathon.tasks.TaskTracker:107)
>> [2015-03-10 11:31:55,070] INFO Sending event notification.
>> (mesosphere.marathon.MarathonScheduler:262)
>> [INFO] [03/10/2015 11:31:55.072]
>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>> POSTing to all endpoints.
>> [2015-03-10 11:31:55,073] INFO Need to scale
>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>> (mesosphere.marathon.SchedulerActions:488)
>> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
>> /ffaas-backoffice-app-nopersist. Not scaling.
>> (mesosphere.marathon.SchedulerActions:498)
>> ...
>> ...
>> ...
>> [2015-03-10 11:31:57,682] INFO Received status update for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>> TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
>> [2015-03-10 11:31:57,694] INFO Sending event notification.
>> (mesosphere.marathon.MarathonScheduler:262)
>> [INFO] [03/10/2015 11:31:57.694]
>> [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
>> POSTing to all endpoints.
>> ...
>> ...
>> ...
>> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
>> (mesosphere.marathon.tasks.TaskTracker:170)
>> [INFO] [03/10/2015 11:36:55.050]
>> [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters]
>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>> Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
>> encountered. This logging can be turned off or adjusted with configuration
>> settings 'akka.log-dead-letters' and
>> 'akka.log-dead-letters-during-shutdown'.
>> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
>> (mesosphere.marathon.SchedulerActions:403)
>> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>> [2015-03-10 11:36:55,063] INFO Received status update for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>> TASK_RUNNING (Reconciliation: Latest task state)
>> (mesosphere.marathon.MarathonScheduler:148)
>> [2015-03-10 11:36:55,065] INFO Received status update for task
>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>> TASK_RUNNING (Reconciliation: Latest task state)
>> (mesosphere.marathon.MarathonScheduler:148)
>> [2015-03-10 11:36:55,066] INFO Already running 1 instances of
>> /ffaas-backoffice-app-nopersist. Not scaling.
>> (mesosphere.marathon.SchedulerActions:512)
>>
>>
>>
>> -- End of logs
>>
>>
>>
>> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <ad...@mesosphere.io>:
>>
>>> This is certainly not the expected/desired behavior when failing over a
>>> mesos master in HA mode. In addition to the master logs Alex requested, can
>>> you also provide relevant portions of the slave logs for these tasks? If
>>> the slave processes themselves never failed over, checkpointing and slave
>>> recovery should be irrelevant. Are you running the mesos-slave itself
>>> inside a Docker container, or using any other non-traditional setup?
>>>
>>> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
>>> defaults to "reconnect", and --strict defaults to true, so none of those
>>> are necessary.
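
To make that concrete, here is what the slave command line quoted later in this thread reduces to once those default flags are dropped (illustrative only; the flag defaults are assumed per Mesos 0.21.x as described above, and the addresses are the ones from the thread):

```shell
# Illustrative: the thread's mesos-slave invocation with the redundant flags
# (--checkpoint, --strict, --recover=reconnect) removed, since those are
# already the 0.21.x defaults. Adjust addresses/resources for your cluster.
/usr/sbin/mesos-slave \
  --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos \
  --containerizers=docker,mesos \
  --executor_registration_timeout=5mins \
  --hostname=10.195.30.19 --ip=10.195.30.19 \
  --isolation=cgroups/cpu,cgroups/mem \
  --recovery_timeout=120mins \
  --resources='ports:[31000-32000,80,443]'
```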
>>>
>>> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <al...@mesosphere.io>
>>> wrote:
>>>
>>>> Geoffroy,
>>>>
>>>> could you please provide master logs (both from killed and taking over
>>>> masters)?
>>>>
>>>> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <
>>>> geoffroy.jabouley@gmail.com> wrote:
>>>>
>>>>> Hello
>>>>>
>>>>> we are facing some unexpected issues when testing high availability
>>>>> behaviors of our mesos cluster.
>>>>>
>>>>> *Our use case:*
>>>>>
>>>>> *State*: the mesos cluster is up (3 machines), 1 docker task is
>>>>> running on each slave (started from marathon)
>>>>>
>>>>> *Action*: stop the mesos master leader process
>>>>>
>>>>> *Expected*: mesos master leader has changed, *active tasks remain
>>>>> unchanged*
>>>>>
>>>>> *Seen*: mesos master leader has changed, *all active tasks are now
>>>>> FAILED but docker containers are still running*, marathon detects
>>>>> FAILED tasks and starts new tasks. We end with 2 docker containers running
>>>>> on each machine, but only one is linked to a RUNNING mesos task.
>>>>>
>>>>>
>>>>> Is the seen behavior correct?
>>>>>
>>>>> Have we misunderstood the high availability concept? We thought that
>>>>> doing this use case would not have any impact on the current cluster state
>>>>> (except leader re-election)
>>>>>
>>>>> Thanks in advance for your help
>>>>> Regards
>>>>>
>>>>> ---------------------------------------------------
>>>>>
>>>>> our setup is the following:
>>>>> 3 identical mesos nodes with:
>>>>> + zookeeper
>>>>> + docker 1.5
>>>>> + mesos master 0.21.1 configured in HA mode
>>>>> + mesos slave 0.21.1 configured with checkpointing, strict and
>>>>> reconnect
>>>>> + marathon 0.8.0 configured in HA mode with checkpointing
>>>>>
>>>>> ---------------------------------------------------
>>>>>
>>>>> Command lines:
>>>>>
>>>>>
>>>>> *mesos-master*usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
>>>>> --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
>>>>> --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
>>>>>
>>>>> *mesos-slave*
>>>>> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,
>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint
>>>>> --containerizers=docker,mesos --executor_registration_timeout=5mins
>>>>> --hostname=10.195.30.19 --ip=10.195.30.19
>>>>> --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
>>>>> --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>>>>>
>>>>> *marathon*
>>>>> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
>>>>> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
>>>>> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
>>>>> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
>>>>> --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
>>>>> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
>>>>> 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos
>>>>>
>>>>
>>>>
>>>
>>
>
Re: Weird behavior when stopping the mesos master leader of a HA
mesos cluster
Posted by Alex Rukletsov <al...@mesosphere.io>.
Geoffroy,
most probably you're hitting this bug:
https://github.com/mesosphere/marathon/issues/1063. The problem is that
Marathon can register instead of re-registering when a master fails
over. From the master's point of view, it's a new framework, which is why the
previous task is gone and a new one (that technically belongs to the new
framework) is started. You can see that the frameworks have two different IDs
(check the log lines timestamped 11:31:40.055496 and 11:31:40.785038) in your example.
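
One way to sanity-check this explanation against a master log is to diff the framework IDs; a hedged sketch (the embedded sample lines are trimmed from the log output quoted in this thread, and the grep patterns are ad-hoc, not any official Mesos tooling):

```shell
# Sketch: extract the framework ID the master newly registered after failover,
# and the framework ID of the task it flagged as possibly orphaned. If they
# differ, Marathon registered as a brand-new framework instead of re-registering.
# The sample lines below are trimmed from this thread; in practice, point 'log'
# at the leading master's real log file instead.
log=$(mktemp)
cat > "$log" <<'EOF'
I0310 11:31:40.055496 918 master.cpp:1447] Registering framework 20150310-112310-354337546-5050-895-0000 (marathon) at scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
W0310 11:31:40.785038 917 master.cpp:4468] Possibly orphaned task ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of framework 20150310-112310-320783114-5050-24289-0000 running on slave 20150310-112310-320783114-5050-24289-S2
EOF
new_id=$(grep -oE 'Registering framework [0-9-]+' "$log" | awk '{print $3}')
old_id=$(grep -oE 'framework [0-9-]+ running' "$log" | awk '{print $2}')
if [ "$new_id" != "$old_id" ]; then
  echo "framework came back with a NEW ID: $new_id (previous: $old_id)"
fi
rm -f "$log"
```

On a real cluster you would run the same greps against the leading master's log; identical IDs before and after failover would indicate a clean re-registration.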
Hope that helps,
Alex
On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley <
geoffroy.jabouley@gmail.com> wrote:
> Hello
>
> thanks for your interest. Following are the requested logs, which will
> result in a pretty big mail.
>
> Mesos/Marathon are *NOT running inside Docker*; we only use Docker as our
> mesos containerizer.
>
> For reminder, here is the use case performed to get the logs file:
>
> --------------------------------
>
> Our cluster: 3 identical mesos nodes with:
> + zookeeper
> + docker 1.5
> + mesos master 0.21.1 configured in HA mode
> + mesos slave 0.21.1 configured with checkpointing, strict and
> reconnect
> + marathon 0.8.0 configured in HA mode with checkpointing
>
> --------------------------------
>
> *Begin State: *
> + the mesos cluster is up (3 machines)
> + mesos master leader is 10.195.30.19
> + marathon leader is 10.195.30.21
> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
>
> *Action*: stop the mesos master leader process (sudo stop mesos-master)
>
> *Expected*: mesos master leader has changed, active tasks / frameworks
> remain unchanged
>
> *End state: *
> + mesos master leader *has changed, now 10.195.30.21*
> + the previously running APPTASK on slave 10.195.30.21 has "disappeared"
> (it no longer shows in the mesos UI), but its *docker container is still
> running*
> + a *new APPTASK is now running on slave 10.195.30.19*
> + marathon framework "registration time" in mesos UI shows "Just now"
> + marathon leader *has changed, now 10.195.30.20*
>
>
> --------------------------------
>
> Now come the 6 requested logs, which might contain interesting/relevant
> information, but as a newcomer to mesos I find them hard to read...
>
>
> *from previous MESOS master leader 10.195.30.19:*
> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM
> from process 1 of user 0; exiting
>
>
> *from new MESOS master leader 10.195.30.21:*
> I0310 11:31:40.011545 922 detector.cpp:138] Detected a new leader:
> (id='2')
> I0310 11:31:40.011823 922 group.cpp:659] Trying to get
> '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.015496 915 network.hpp:424] ZooKeeper group memberships
> changed
> I0310 11:31:40.015847 915 group.cpp:659] Trying to get
> '/mesos/log_replicas/0000000000' in ZooKeeper
> I0310 11:31:40.016047 922 detector.cpp:433] A new leading master (UPID=
> master@10.195.30.21:5050) is detected
> I0310 11:31:40.016074 922 master.cpp:1263] The newly elected leader is
> master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
> I0310 11:31:40.016089 922 master.cpp:1276] Elected as the leading master!
> I0310 11:31:40.016108 922 master.cpp:1094] Recovering from registrar
> I0310 11:31:40.016188 918 registrar.cpp:313] Recovering registrar
> I0310 11:31:40.016542 918 log.cpp:656] Attempting to start the writer
> I0310 11:31:40.016918 918 replica.cpp:474] Replica received implicit
> promise request with proposal 2
> I0310 11:31:40.017503 915 group.cpp:659] Trying to get
> '/mesos/log_replicas/0000000003' in ZooKeeper
> I0310 11:31:40.017832 918 leveldb.cpp:306] Persisting metadata (8 bytes)
> to leveldb took 893672ns
> I0310 11:31:40.017848 918 replica.cpp:342] Persisted promised to 2
> I0310 11:31:40.018817 915 network.hpp:466] ZooKeeper group PIDs: {
> log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
> I0310 11:31:40.023022 923 coordinator.cpp:230] Coordinator attemping to
> fill missing position
> I0310 11:31:40.023110 923 log.cpp:672] Writer started with ending
> position 8
> I0310 11:31:40.023293 923 leveldb.cpp:438] Reading position from leveldb
> took 13195ns
> I0310 11:31:40.023309 923 leveldb.cpp:438] Reading position from leveldb
> took 3120ns
> I0310 11:31:40.023619 922 registrar.cpp:346] Successfully fetched the
> registry (610B) in 7.385856ms
> I0310 11:31:40.023679 922 registrar.cpp:445] Applied 1 operations in
> 9263ns; attempting to update the 'registry'
> I0310 11:31:40.024238 922 log.cpp:680] Attempting to append 647 bytes to
> the log
> I0310 11:31:40.024279 923 coordinator.cpp:340] Coordinator attempting to
> write APPEND action at position 9
> I0310 11:31:40.024435 923 replica.cpp:508] Replica received write
> request for position 9
> I0310 11:31:40.025707 923 leveldb.cpp:343] Persisting action (666 bytes)
> to leveldb took 1.259338ms
> I0310 11:31:40.025722 923 replica.cpp:676] Persisted action at 9
> I0310 11:31:40.026074 923 replica.cpp:655] Replica received learned
> notice for position 9
> I0310 11:31:40.026495 923 leveldb.cpp:343] Persisting action (668 bytes)
> to leveldb took 404795ns
> I0310 11:31:40.026507 923 replica.cpp:676] Persisted action at 9
> I0310 11:31:40.026511 923 replica.cpp:661] Replica learned APPEND action
> at position 9
> I0310 11:31:40.026726 923 registrar.cpp:490] Successfully updated the
> 'registry' in 3.029248ms
> I0310 11:31:40.026765 923 registrar.cpp:376] Successfully recovered
> registrar
> I0310 11:31:40.026814 923 log.cpp:699] Attempting to truncate the log to
> 9
> I0310 11:31:40.026880 923 master.cpp:1121] Recovered 3 slaves from the
> Registry (608B) ; allowing 1days for slaves to re-register
> I0310 11:31:40.026897 923 coordinator.cpp:340] Coordinator attempting to
> write TRUNCATE action at position 10
> I0310 11:31:40.026988 923 replica.cpp:508] Replica received write
> request for position 10
> I0310 11:31:40.027640 923 leveldb.cpp:343] Persisting action (16 bytes)
> to leveldb took 641018ns
> I0310 11:31:40.027652 923 replica.cpp:676] Persisted action at 10
> I0310 11:31:40.030848 923 replica.cpp:655] Replica received learned
> notice for position 10
> I0310 11:31:40.031883 923 leveldb.cpp:343] Persisting action (18 bytes)
> to leveldb took 1.008914ms
> I0310 11:31:40.031963 923 leveldb.cpp:401] Deleting ~2 keys from leveldb
> took 46724ns
> I0310 11:31:40.031977 923 replica.cpp:676] Persisted action at 10
> I0310 11:31:40.031986 923 replica.cpp:661] Replica learned TRUNCATE
> action at position 10
> I0310 11:31:40.055415 918 master.cpp:1383] Received registration request
> for framework 'marathon' at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.055496 918 master.cpp:1447] Registering framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.055642 919 hierarchical_allocator_process.hpp:329] Added
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.189151 919 master.cpp:3246] Re-registering slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19)
> I0310 11:31:40.189280 919 registrar.cpp:445] Applied 1 operations in
> 15452ns; attempting to update the 'registry'
> I0310 11:31:40.189949 919 log.cpp:680] Attempting to append 647 bytes to
> the log
> I0310 11:31:40.189978 919 coordinator.cpp:340] Coordinator attempting to
> write APPEND action at position 11
> I0310 11:31:40.190112 919 replica.cpp:508] Replica received write
> request for position 11
> I0310 11:31:40.190563 919 leveldb.cpp:343] Persisting action (666 bytes)
> to leveldb took 437440ns
> I0310 11:31:40.190577 919 replica.cpp:676] Persisted action at 11
> I0310 11:31:40.191249 921 replica.cpp:655] Replica received learned
> notice for position 11
> I0310 11:31:40.192159 921 leveldb.cpp:343] Persisting action (668 bytes)
> to leveldb took 892767ns
> I0310 11:31:40.192178 921 replica.cpp:676] Persisted action at 11
> I0310 11:31:40.192184 921 replica.cpp:661] Replica learned APPEND action
> at position 11
> I0310 11:31:40.192350 921 registrar.cpp:490] Successfully updated the
> 'registry' in 3.0528ms
> I0310 11:31:40.192387 919 log.cpp:699] Attempting to truncate the log to
> 11
> I0310 11:31:40.192415 919 coordinator.cpp:340] Coordinator attempting to
> write TRUNCATE action at position 12
> I0310 11:31:40.192539 915 replica.cpp:508] Replica received write
> request for position 12
> I0310 11:31:40.192600 921 master.cpp:3314] Re-registered slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148
> I0310 11:31:40.192680 917 hierarchical_allocator_process.hpp:442] Added
> slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with
> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
> available)
> I0310 11:31:40.192847 917 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.193164 915 leveldb.cpp:343] Persisting action (16 bytes)
> to leveldb took 610664ns
> I0310 11:31:40.193181 915 replica.cpp:676] Persisted action at 12
> I0310 11:31:40.193568 915 replica.cpp:655] Replica received learned
> notice for position 12
> I0310 11:31:40.193948 915 leveldb.cpp:343] Persisting action (18 bytes)
> to leveldb took 364062ns
> I0310 11:31:40.193979 915 leveldb.cpp:401] Deleting ~2 keys from leveldb
> took 12256ns
> I0310 11:31:40.193985 915 replica.cpp:676] Persisted action at 12
> I0310 11:31:40.193990 915 replica.cpp:661] Replica learned TRUNCATE
> action at position 12
> I0310 11:31:40.248615 915 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O0 ] on slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.248744 915 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S1 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.774416 915 master.cpp:3246] Re-registering slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21)
> I0310 11:31:40.774976 915 registrar.cpp:445] Applied 1 operations in
> 42342ns; attempting to update the 'registry'
> I0310 11:31:40.777273 921 log.cpp:680] Attempting to append 647 bytes to
> the log
> I0310 11:31:40.777436 921 coordinator.cpp:340] Coordinator attempting to
> write APPEND action at position 13
> I0310 11:31:40.777989 921 replica.cpp:508] Replica received write
> request for position 13
> I0310 11:31:40.779558 921 leveldb.cpp:343] Persisting action (666 bytes)
> to leveldb took 1.513714ms
> I0310 11:31:40.779633 921 replica.cpp:676] Persisted action at 13
> I0310 11:31:40.781821 919 replica.cpp:655] Replica received learned
> notice for position 13
> I0310 11:31:40.784417 919 leveldb.cpp:343] Persisting action (668 bytes)
> to leveldb took 2.542036ms
> I0310 11:31:40.784446 919 replica.cpp:676] Persisted action at 13
> I0310 11:31:40.784452 919 replica.cpp:661] Replica learned APPEND action
> at position 13
> I0310 11:31:40.784711 920 registrar.cpp:490] Successfully updated the
> 'registry' in 9.68192ms
> I0310 11:31:40.784762 917 log.cpp:699] Attempting to truncate the log to
> 13
> I0310 11:31:40.784808 920 coordinator.cpp:340] Coordinator attempting to
> write TRUNCATE action at position 14
> I0310 11:31:40.784865 917 master.hpp:877] Adding task
> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with
> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
> 20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
> I0310 11:31:40.784955 919 replica.cpp:508] Replica received write
> request for position 14
> W0310 11:31:40.785038 917 master.cpp:4468] Possibly orphaned task
> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of
> framework 20150310-112310-320783114-5050-24289-0000 running on slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21)
> I0310 11:31:40.785105 917 master.cpp:3314] Re-registered slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148
> I0310 11:31:40.785162 920 hierarchical_allocator_process.hpp:442] Added
> slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with
> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
> ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148
> available)
> I0310 11:31:40.785679 921 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.786429 919 leveldb.cpp:343] Persisting action (16 bytes)
> to leveldb took 1.454211ms
> I0310 11:31:40.786455 919 replica.cpp:676] Persisted action at 14
> I0310 11:31:40.786782 919 replica.cpp:655] Replica received learned
> notice for position 14
> I0310 11:31:40.787833 919 leveldb.cpp:343] Persisting action (18 bytes)
> to leveldb took 1.027014ms
> I0310 11:31:40.787873 919 leveldb.cpp:401] Deleting ~2 keys from leveldb
> took 14085ns
> I0310 11:31:40.787883 919 replica.cpp:676] Persisted action at 14
> I0310 11:31:40.787889 919 replica.cpp:661] Replica learned TRUNCATE
> action at position 14
> I0310 11:31:40.792536 922 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O1 ] on slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.792670 922 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S2 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.819602 921 master.cpp:3246] Re-registering slave
> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
> (10.195.30.20)
> I0310 11:31:40.819736 921 registrar.cpp:445] Applied 1 operations in
> 16656ns; attempting to update the 'registry'
> I0310 11:31:40.820617 921 log.cpp:680] Attempting to append 647 bytes to
> the log
> I0310 11:31:40.820726 918 coordinator.cpp:340] Coordinator attempting to
> write APPEND action at position 15
> I0310 11:31:40.820938 918 replica.cpp:508] Replica received write
> request for position 15
> I0310 11:31:40.821641 918 leveldb.cpp:343] Persisting action (666 bytes)
> to leveldb took 670583ns
> I0310 11:31:40.821663 918 replica.cpp:676] Persisted action at 15
> I0310 11:31:40.822265 917 replica.cpp:655] Replica received learned
> notice for position 15
> I0310 11:31:40.823463 917 leveldb.cpp:343] Persisting action (668 bytes)
> to leveldb took 1.178687ms
> I0310 11:31:40.823490 917 replica.cpp:676] Persisted action at 15
> I0310 11:31:40.823498 917 replica.cpp:661] Replica learned APPEND action
> at position 15
> I0310 11:31:40.823755 917 registrar.cpp:490] Successfully updated the
> 'registry' in 3.97696ms
> I0310 11:31:40.823823 917 log.cpp:699] Attempting to truncate the log to
> 15
> I0310 11:31:40.824147 922 coordinator.cpp:340] Coordinator attempting to
> write TRUNCATE action at position 16
> I0310 11:31:40.824482 922 hierarchical_allocator_process.hpp:442] Added
> slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with
> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
> available)
> I0310 11:31:40.824597 921 replica.cpp:508] Replica received write
> request for position 16
> I0310 11:31:40.824128 917 master.cpp:3314] Re-registered slave
> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
> (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148
> I0310 11:31:40.824975 917 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.831900 921 leveldb.cpp:343] Persisting action (16 bytes)
> to leveldb took 7.228682ms
> I0310 11:31:40.832031 921 replica.cpp:676] Persisted action at 16
> I0310 11:31:40.832456 917 replica.cpp:655] Replica received learned
> notice for position 16
> I0310 11:31:40.835178 917 leveldb.cpp:343] Persisting action (18 bytes)
> to leveldb took 2.674392ms
> I0310 11:31:40.835297 917 leveldb.cpp:401] Deleting ~2 keys from leveldb
> took 45220ns
> I0310 11:31:40.835322 917 replica.cpp:676] Persisted action at 16
> I0310 11:31:40.835341 917 replica.cpp:661] Replica learned TRUNCATE
> action at position 16
> I0310 11:31:40.838281 923 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O2 ] on slave
> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:40.838389 923 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S0 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:40.948725 919 http.cpp:344] HTTP request for
> '/master/redirect'
> I0310 11:31:41.479118 918 http.cpp:478] HTTP request for
> '/master/state.json'
> I0310 11:31:45.368074 918 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:45.385144 917 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O3 ] on slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:45.385292 917 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S1 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:46.368450 917 master.cpp:3843] Sending 2 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:46.375222 920 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O4 ] on slave
> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:46.375360 920 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O5 ] on slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:46.375530 920 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S0 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:46.375599 920 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S2 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:48.031230 915 http.cpp:478] HTTP request for
> '/master/state.json'
> I0310 11:31:51.374285 922 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:51.379391 921 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O6 ] on slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:51.379487 921 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S1 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:51.482094 923 http.cpp:478] HTTP request for
> '/master/state.json'
> I0310 11:31:52.375326 917 master.cpp:3843] Sending 2 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:52.391376 919 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O7 ] on slave
> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:52.391512 919 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S2 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:52.391659 921 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O8 ] on slave
> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:52.391751 921 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
> cpus(*):2; mem(*):6961; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S0 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:55.062060 918 master.cpp:3611] Performing explicit task
> state reconciliation for 1 tasks of framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:55.062588 919 master.cpp:3556] Performing implicit task
> state reconciliation for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:56.140990 923 http.cpp:344] HTTP request for
> '/master/redirect'
> I0310 11:31:57.379288 918 master.cpp:3843] Sending 1 offers to framework
> 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:57.430888 918 master.cpp:2344] Processing reply for offers:
> [ 20150310-112310-354337546-5050-895-O9 ] on slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
> (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
> I0310 11:31:57.431068 918 master.hpp:877] Adding task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with
> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
> 20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
> I0310 11:31:57.431089 918 master.cpp:2503] Launching task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with
> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19)
> I0310 11:31:57.431205 918 hierarchical_allocator_process.hpp:563]
> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
> 20150310-112310-320783114-5050-24289-S1 from framework
> 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.682133 919 master.cpp:3446] Forwarding status update
> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.682186 919 master.cpp:3418] Status update TASK_RUNNING
> (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 from slave
> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
> (10.195.30.19)
> I0310 11:31:57.682199 919 master.cpp:4693] Updating the latest state of
> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
>
>
> *from MESOS slave 10.195.30.21*
> I0310 11:31:28.750200 1074 slave.cpp:2623] master@10.195.30.19:5050
> exited
> W0310 11:31:28.750249 1074 slave.cpp:2626] Master disconnected! Waiting
> for a new master to be elected
> I0310 11:31:40.012516 1075 detector.cpp:138] Detected a new leader:
> (id='2')
> I0310 11:31:40.012899 1073 group.cpp:659] Trying to get
> '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.017143 1072 detector.cpp:433] A new leading master (UPID=
> master@10.195.30.21:5050) is detected
> I0310 11:31:40.017408 1072 slave.cpp:602] New master detected at
> master@10.195.30.21:5050
> I0310 11:31:40.017546 1076 status_update_manager.cpp:171] Pausing sending
> status updates
> I0310 11:31:40.018673 1072 slave.cpp:627] No credentials provided.
> Attempting to register without authentication
> I0310 11:31:40.018689 1072 slave.cpp:638] Detecting new master
> I0310 11:31:40.785364 1075 slave.cpp:824] Re-registered with master
> master@10.195.30.21:5050
> I0310 11:31:40.785398 1075 status_update_manager.cpp:178] Resuming
> sending status updates
> I0310 11:32:10.639506 1075 slave.cpp:3321] Current usage 12.27%. Max
> allowed age: 5.441217749539572days
>
>
> *from MESOS slave 10.195.30.19*
> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050
> exited
> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected! Waiting
> for a new master to be elected
> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader:
> (id='2')
> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get
> '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master (UPID=
> master@10.195.30.21:5050) is detected
> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at
> master@10.195.30.21:5050
> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided.
> Attempting to register without authentication
> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing sending
> status updates
> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master
> master@10.195.30.21:5050
> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming
> sending status updates
> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 in work directory
> '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for
> executor
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework '20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container
> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and
> executor
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of
> framework '20150310-112310-354337546-5050-895-0000'
> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to
> '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
> framework '20150310-112310-354337546-5050-895-0000' in container
> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for executor
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
> 10.195.30.19:56574
> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
> executor
> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update
> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
> 10.195.30.19:56574
> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received status
> update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.678411 24461 status_update_manager.hpp:346] Checkpointing
> UPDATE for status update TASK_RUNNING (UUID:
> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update
> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 to
> master@10.195.30.21:5050
> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for
> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000 to executor(1)@
> 10.195.30.19:56574
> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received status
> update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.689028 24461 status_update_manager.hpp:346] Checkpointing
> ACK for status update TASK_RUNNING (UUID:
> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
> framework 20150310-112310-354337546-5050-895-0000
> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to 204
> at
> /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated
> 'memory.soft_limit_in_bytes' to 160MB for container
> a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated
> 'memory.limit_in_bytes' to 160MB at
> /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max
> allowed age: 5.555425200437824days
>
>
> *From previous Marathon leader 10.195.30.21:*
> I0310 11:31:40.017248 1115 detector.cpp:138] Detected a new leader:
> (id='2')
> I0310 11:31:40.017334 1115 group.cpp:659] Trying to get
> '/mesos/info_0000000002' in ZooKeeper
> I0310 11:31:40.017727 1115 detector.cpp:433] A new leading master (UPID=
> master@10.195.30.21:5050) is detected
> [2015-03-10 11:31:40,017] WARN Disconnected
> (mesosphere.marathon.MarathonScheduler:224)
> [2015-03-10 11:31:40,019] INFO Abdicating
> (mesosphere.marathon.MarathonSchedulerService:312)
> [2015-03-10 11:31:40,019] INFO Defeat leadership
> (mesosphere.marathon.MarathonSchedulerService:285)
> [INFO] [03/10/2015 11:31:40.019]
> [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b]
> POSTing to all endpoints.
> [INFO] [03/10/2015 11:31:40.019]
> [marathon-akka.actor.default-dispatcher-5]
> [akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
> [2015-03-10 11:31:40,021] INFO Stopping driver
> (mesosphere.marathon.MarathonSchedulerService:221)
> I0310 11:31:40.022001 1115 sched.cpp:1286] Asked to stop the driver
> [2015-03-10 11:31:40,024] INFO Setting framework ID to
> 20150310-112310-320783114-5050-24289-0000
> (mesosphere.marathon.MarathonSchedulerService:73)
> I0310 11:31:40.026274 1115 sched.cpp:234] New master detected at
> master@10.195.30.21:5050
> I0310 11:31:40.026418 1115 sched.cpp:242] No credentials provided.
> Attempting to register without authentication
> I0310 11:31:40.026458 1115 sched.cpp:752] Stopping framework
> '20150310-112310-320783114-5050-24289-0000'
> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing optional
> abdication command. (mesosphere.marathon.MarathonSchedulerService:192)
> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface)
> (mesosphere.marathon.MarathonSchedulerService:246)
> [2015-03-10 11:31:40,032] INFO Defeat leadership
> (mesosphere.marathon.MarathonSchedulerService:285)
> [2015-03-10 11:31:40,032] INFO Stopping driver
> (mesosphere.marathon.MarathonSchedulerService:221)
> I0310 11:31:40.032588 1107 sched.cpp:1286] Asked to stop the driver
> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500
> milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334)
> [2015-03-10 11:31:40,033] INFO Setting framework ID to
> 20150310-112310-320783114-5050-24289-0000
> (mesosphere.marathon.MarathonSchedulerService:73)
> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is not
> a candidate for leader, current voting: [member_0000000001,
> member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering leadership
> (mesosphere.marathon.MarathonSchedulerService:341)
> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003
> (com.twitter.common.zookeeper.Group:426)
> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is not
> a candidate for leader, current voting: [member_0000000001,
> member_0000000002, member_0000000003]
> (com.twitter.common.zookeeper.CandidateImpl:144)
> [2015-03-10 11:31:40,568] INFO Candidate
> /marathon/leader/member_0000000003 waiting for the next leader election,
> current voting: [member_0000000001, member_0000000002, member_0000000003]
> (com.twitter.common.zookeeper.CandidateImpl:165)
>
>
> *From new Marathon leader 10.195.30.20:*
> [2015-03-10 11:31:40,029] INFO Candidate
> /marathon/leader/member_0000000001 is now leader of group:
> [member_0000000001, member_0000000002]
> (com.twitter.common.zookeeper.CandidateImpl:152)
> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface)
> (mesosphere.marathon.MarathonSchedulerService:253)
> [2015-03-10 11:31:40,044] INFO Elect leadership
> (mesosphere.marathon.MarathonSchedulerService:299)
> [2015-03-10 11:31:40,044] INFO Running driver
> (mesosphere.marathon.MarathonSchedulerService:184)
> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712:
> Client environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716:
> Client environment:host.name=srv-d2u-9-virtip20
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723:
> Client environment:os.name=Linux
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724:
> Client environment:os.arch=3.13.0-44-generic
> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725:
> Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733:
> Client environment:user.name=(null)
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741:
> Client environment:user.home=/root
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753:
> Client environment:user.dir=/
> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786:
> Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,
> 10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0 sessionId=0
> sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff
> (mesosphere.marathon.MarathonSchedulerService:329)
> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703:
> initiated connection to server [10.195.30.19:2181]
> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750:
> session establishment complete on server [10.195.30.19:2181],
> sessionId=0x14c0335ad7e000d, negotiated timeout=10000
> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@
> 10.195.30.20:45771) connected to ZooKeeper
> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path '/mesos'
> in ZooKeeper
> [INFO] [03/10/2015 11:31:40.047]
> [marathon-akka.actor.default-dispatcher-2]
> [akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader:
> (id='2')
> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get
> '/mesos/info_0000000002' in ZooKeeper
> [2015-03-10 11:31:40,053] INFO Migration successfully applied for version
> Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master (UPID=
> master@10.195.30.21:5050) is detected
> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at
> master@10.195.30.21:5050
> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided.
> Attempting to register without authentication
> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with
> 20150310-112310-354337546-5050-895-0000
> [2015-03-10 11:31:40,056] INFO Registered as
> 20150310-112310-354337546-5050-895-0000 to master
> '20150310-112310-354337546-5050-895'
> (mesosphere.marathon.MarathonScheduler:72)
> [2015-03-10 11:31:40,063] INFO Stored framework ID
> '20150310-112310-354337546-5050-895-0000'
> (mesosphere.mesos.util.FrameworkIdUtil:49)
> [INFO] [03/10/2015 11:31:40.065]
> [marathon-akka.actor.default-dispatcher-6]
> [akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
> [INFO] [03/10/2015 11:31:40.067]
> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
> POSTing to all endpoints.
> ...
> ...
> ...
> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps
> (mesosphere.marathon.SchedulerActions:403)
> [INFO] [03/10/2015 11:31:55.053]
> [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
> Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
> encountered. This logging can be turned off or adjusted with configuration
> settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
> Mesos master (mesosphere.marathon.SchedulerActions:430)
> [2015-03-10 11:31:55,064] INFO Received status update for task
> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
> TASK_LOST (Reconciliation: Task is unknown to the slave)
> (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:31:55,069] INFO Need to scale
> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
> (mesosphere.marathon.SchedulerActions:488)
> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
> /ffaas-backoffice-app-nopersist (0 queued)
> (mesosphere.marathon.SchedulerActions:494)
> [2015-03-10 11:31:55,069] INFO Task
> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
> expunged and removed from TaskTracker
> (mesosphere.marathon.tasks.TaskTracker:107)
> [2015-03-10 11:31:55,070] INFO Sending event notification.
> (mesosphere.marathon.MarathonScheduler:262)
> [INFO] [03/10/2015 11:31:55.072]
> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
> POSTing to all endpoints.
> [2015-03-10 11:31:55,073] INFO Need to scale
> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
> (mesosphere.marathon.SchedulerActions:488)
> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
> /ffaas-backoffice-app-nopersist. Not scaling.
> (mesosphere.marathon.SchedulerActions:498)
> ...
> ...
> ...
> [2015-03-10 11:31:57,682] INFO Received status update for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
> TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:31:57,694] INFO Sending event notification.
> (mesosphere.marathon.MarathonScheduler:262)
> [INFO] [03/10/2015 11:31:57.694]
> [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
> POSTing to all endpoints.
> ...
> ...
> ...
> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
> (mesosphere.marathon.tasks.TaskTracker:170)
> [INFO] [03/10/2015 11:36:55.050]
> [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters]
> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
> Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
> encountered. This logging can be turned off or adjusted with configuration
> settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
> (mesosphere.marathon.SchedulerActions:403)
> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
> Mesos master (mesosphere.marathon.SchedulerActions:430)
> [2015-03-10 11:36:55,063] INFO Received status update for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
> TASK_RUNNING (Reconciliation: Latest task state)
> (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:36:55,065] INFO Received status update for task
> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
> TASK_RUNNING (Reconciliation: Latest task state)
> (mesosphere.marathon.MarathonScheduler:148)
> [2015-03-10 11:36:55,066] INFO Already running 1 instances of
> /ffaas-backoffice-app-nopersist. Not scaling.
> (mesosphere.marathon.SchedulerActions:512)
>
>
>
> -- End of logs
>
>
>
> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <ad...@mesosphere.io>:
>
>> This is certainly not the expected/desired behavior when failing over a
>> mesos master in HA mode. In addition to the master logs Alex requested, can
>> you also provide relevant portions of the slave logs for these tasks? If
>> the slave processes themselves never failed over, checkpointing and slave
>> recovery should be irrelevant. Are you running the mesos-slave itself
>> inside a Docker, or any other non-traditional setup?
>>
>> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
>> defaults to "reconnect", and --strict defaults to true, so none of those
>> are necessary.
>>
>> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <al...@mesosphere.io>
>> wrote:
>>
>>> Geoffroy,
>>>
>>> could you please provide master logs (both from killed and taking over
>>> masters)?
>>>
>>> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <
>>> geoffroy.jabouley@gmail.com> wrote:
>>>
>>>> Hello
>>>>
>>>> we are facing some unexpected issues when testing the high availability
>>>> behavior of our mesos cluster.
>>>>
>>>> *Our use case:*
>>>>
>>>> *State*: the mesos cluster is up (3 machines), 1 docker task is
>>>> running on each slave (started from marathon)
>>>>
>>>> *Action*: stop the mesos master leader process
>>>>
>>>> *Expected*: mesos master leader has changed, *active tasks remain
>>>> unchanged*
>>>>
>>>> *Seen*: mesos master leader has changed, *all active tasks are now
>>>> FAILED but docker containers are still running*, marathon detects
>>>> FAILED tasks and starts new tasks. We end with 2 docker containers running
>>>> on each machine, but only one is linked to a RUNNING mesos task.
>>>>
>>>>
>>>> Is the seen behavior correct?
>>>>
>>>> Have we misunderstood the high availability concept? We thought that
>>>> doing this use case would not have any impact on the current cluster state
>>>> (except leader re-election)
>>>>
>>>> Thanks in advance for your help
>>>> Regards
>>>>
>>>> ---------------------------------------------------
>>>>
>>>> our setup is the following:
>>>> 3 identical mesos nodes with:
>>>> + zookeeper
>>>> + docker 1.5
>>>> + mesos master 0.21.1 configured in HA mode
>>>> + mesos slave 0.21.1 configured with checkpointing, strict and
>>>> reconnect
>>>> + marathon 0.8.0 configured in HA mode with checkpointing
>>>>
>>>> ---------------------------------------------------
>>>>
>>>> Command lines:
>>>>
>>>>
>>>> *mesos-master*usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
>>>> --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
>>>> --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
>>>>
>>>> *mesos-slave*
>>>> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181
>>>> ,10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos
>>>> --executor_registration_timeout=5mins --hostname=10.195.30.19
>>>> --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
>>>> --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>>>>
>>>> *marathon*
>>>> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
>>>> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
>>>> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
>>>> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
>>>> --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
>>>> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
>>>> 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos
>>>>
>>>
>>>
>>
>
Re: Weird behavior when stopping the mesos master leader of a HA
mesos cluster
Posted by Geoffroy Jabouley <ge...@gmail.com>.
Hello
thanks for your interest. Following are the requested logs, which will
make for a pretty big mail.
Mesos/Marathon are *NOT running inside Docker*; we only use Docker as our
Mesos containerizer.
As a reminder, here is the use case performed to get the logs:
--------------------------------
Our cluster: 3 identical mesos nodes with:
+ zookeeper
+ docker 1.5
+ mesos master 0.21.1 configured in HA mode
+ mesos slave 0.21.1 configured with checkpointing, strict and reconnect
+ marathon 0.8.0 configured in HA mode with checkpointing
--------------------------------
*Begin State: *
+ the mesos cluster is up (3 machines)
+ mesos master leader is 10.195.30.19
+ marathon leader is 10.195.30.21
+ 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
*Action*: stop the mesos master leader process (sudo stop mesos-master)
*Expected*: mesos master leader has changed, active tasks / frameworks
remain unchanged
*End state: *
+ mesos master leader *has changed, now 10.195.30.21*
+ the previously running APPTASK on slave 10.195.30.21 has "disappeared" (no
longer shown in the Mesos UI), but *its docker container is still running*
+ a *new APPTASK is now running on slave 10.195.30.19*
+ marathon framework "registration time" in mesos UI shows "Just now"
+ marathon leader *has changed, now 10.195.30.20*
--------------------------------
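To compare the leader and the set of known tasks before and after the failover, one way is to poll /master/state.json on any master node (the same endpoint the Mesos UI uses, visible in the logs below). The following is a minimal sketch, not part of the original report; the field names ("leader", "frameworks", "tasks", "id", "state") match what Mesos 0.21 serves, but treat them as an assumption for other versions, and the sample payload is illustrative:

```python
def summarize_state(state):
    """Extract (leader_pid, {task_id: task_state}) from a Mesos
    /master/state.json payload. Field names are assumed to follow
    the Mesos 0.21 format."""
    tasks = {}
    for framework in state.get("frameworks", []):
        for task in framework.get("tasks", []):
            tasks[task["id"]] = task["state"]
    return state.get("leader"), tasks

# Illustrative payload shaped like the cluster above (in practice,
# fetch it from http://<any-master>:5050/master/state.json):
sample = {
    "leader": "master@10.195.30.21:5050",
    "frameworks": [{
        "name": "marathon",
        "tasks": [{
            "id": "ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799",
            "state": "TASK_RUNNING",
        }],
    }],
}

leader, tasks = summarize_state(sample)
print(leader, tasks)
```

Running this before and after killing the leader would show whether task IDs survived the re-election or were replaced, as happened here.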
Now come the 6 requested logs, which might contain interesting/relevant
information, but as a newcomer to Mesos I find them hard to read...
*from previous MESOS master leader 10.195.30.19:*
W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM
from process 1 of user 0; exiting
*from new MESOS master leader 10.195.30.21:*
I0310 11:31:40.011545 922 detector.cpp:138] Detected a new leader:
(id='2')
I0310 11:31:40.011823 922 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
I0310 11:31:40.015496 915 network.hpp:424] ZooKeeper group memberships
changed
I0310 11:31:40.015847 915 group.cpp:659] Trying to get
'/mesos/log_replicas/0000000000' in ZooKeeper
I0310 11:31:40.016047 922 detector.cpp:433] A new leading master (UPID=
master@10.195.30.21:5050) is detected
I0310 11:31:40.016074 922 master.cpp:1263] The newly elected leader is
master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
I0310 11:31:40.016089 922 master.cpp:1276] Elected as the leading master!
I0310 11:31:40.016108 922 master.cpp:1094] Recovering from registrar
I0310 11:31:40.016188 918 registrar.cpp:313] Recovering registrar
I0310 11:31:40.016542 918 log.cpp:656] Attempting to start the writer
I0310 11:31:40.016918 918 replica.cpp:474] Replica received implicit
promise request with proposal 2
I0310 11:31:40.017503 915 group.cpp:659] Trying to get
'/mesos/log_replicas/0000000003' in ZooKeeper
I0310 11:31:40.017832 918 leveldb.cpp:306] Persisting metadata (8 bytes)
to leveldb took 893672ns
I0310 11:31:40.017848 918 replica.cpp:342] Persisted promised to 2
I0310 11:31:40.018817 915 network.hpp:466] ZooKeeper group PIDs: {
log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
I0310 11:31:40.023022 923 coordinator.cpp:230] Coordinator attemping to
fill missing position
I0310 11:31:40.023110 923 log.cpp:672] Writer started with ending
position 8
I0310 11:31:40.023293 923 leveldb.cpp:438] Reading position from leveldb
took 13195ns
I0310 11:31:40.023309 923 leveldb.cpp:438] Reading position from leveldb
took 3120ns
I0310 11:31:40.023619 922 registrar.cpp:346] Successfully fetched the
registry (610B) in 7.385856ms
I0310 11:31:40.023679 922 registrar.cpp:445] Applied 1 operations in
9263ns; attempting to update the 'registry'
I0310 11:31:40.024238 922 log.cpp:680] Attempting to append 647 bytes to
the log
I0310 11:31:40.024279 923 coordinator.cpp:340] Coordinator attempting to
write APPEND action at position 9
I0310 11:31:40.024435 923 replica.cpp:508] Replica received write request
for position 9
I0310 11:31:40.025707 923 leveldb.cpp:343] Persisting action (666 bytes)
to leveldb took 1.259338ms
I0310 11:31:40.025722 923 replica.cpp:676] Persisted action at 9
I0310 11:31:40.026074 923 replica.cpp:655] Replica received learned
notice for position 9
I0310 11:31:40.026495 923 leveldb.cpp:343] Persisting action (668 bytes)
to leveldb took 404795ns
I0310 11:31:40.026507 923 replica.cpp:676] Persisted action at 9
I0310 11:31:40.026511 923 replica.cpp:661] Replica learned APPEND action
at position 9
I0310 11:31:40.026726 923 registrar.cpp:490] Successfully updated the
'registry' in 3.029248ms
I0310 11:31:40.026765 923 registrar.cpp:376] Successfully recovered
registrar
I0310 11:31:40.026814 923 log.cpp:699] Attempting to truncate the log to 9
I0310 11:31:40.026880 923 master.cpp:1121] Recovered 3 slaves from the
Registry (608B) ; allowing 1days for slaves to re-register
I0310 11:31:40.026897 923 coordinator.cpp:340] Coordinator attempting to
write TRUNCATE action at position 10
I0310 11:31:40.026988 923 replica.cpp:508] Replica received write request
for position 10
I0310 11:31:40.027640 923 leveldb.cpp:343] Persisting action (16 bytes)
to leveldb took 641018ns
I0310 11:31:40.027652 923 replica.cpp:676] Persisted action at 10
I0310 11:31:40.030848 923 replica.cpp:655] Replica received learned
notice for position 10
I0310 11:31:40.031883 923 leveldb.cpp:343] Persisting action (18 bytes)
to leveldb took 1.008914ms
I0310 11:31:40.031963 923 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 46724ns
I0310 11:31:40.031977 923 replica.cpp:676] Persisted action at 10
I0310 11:31:40.031986 923 replica.cpp:661] Replica learned TRUNCATE
action at position 10
I0310 11:31:40.055415 918 master.cpp:1383] Received registration request
for framework 'marathon' at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.055496 918 master.cpp:1447] Registering framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.055642 919 hierarchical_allocator_process.hpp:329] Added
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:40.189151 919 master.cpp:3246] Re-registering slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19)
I0310 11:31:40.189280 919 registrar.cpp:445] Applied 1 operations in
15452ns; attempting to update the 'registry'
I0310 11:31:40.189949 919 log.cpp:680] Attempting to append 647 bytes to
the log
I0310 11:31:40.189978 919 coordinator.cpp:340] Coordinator attempting to
write APPEND action at position 11
I0310 11:31:40.190112 919 replica.cpp:508] Replica received write request
for position 11
I0310 11:31:40.190563 919 leveldb.cpp:343] Persisting action (666 bytes)
to leveldb took 437440ns
I0310 11:31:40.190577 919 replica.cpp:676] Persisted action at 11
I0310 11:31:40.191249 921 replica.cpp:655] Replica received learned
notice for position 11
I0310 11:31:40.192159 921 leveldb.cpp:343] Persisting action (668 bytes)
to leveldb took 892767ns
I0310 11:31:40.192178 921 replica.cpp:676] Persisted action at 11
I0310 11:31:40.192184 921 replica.cpp:661] Replica learned APPEND action
at position 11
I0310 11:31:40.192350 921 registrar.cpp:490] Successfully updated the
'registry' in 3.0528ms
I0310 11:31:40.192387 919 log.cpp:699] Attempting to truncate the log to
11
I0310 11:31:40.192415 919 coordinator.cpp:340] Coordinator attempting to
write TRUNCATE action at position 12
I0310 11:31:40.192539 915 replica.cpp:508] Replica received write request
for position 12
I0310 11:31:40.192600 921 master.cpp:3314] Re-registered slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148
I0310 11:31:40.192680 917 hierarchical_allocator_process.hpp:442] Added
slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
available)
I0310 11:31:40.192847 917 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.193164 915 leveldb.cpp:343] Persisting action (16 bytes)
to leveldb took 610664ns
I0310 11:31:40.193181 915 replica.cpp:676] Persisted action at 12
I0310 11:31:40.193568 915 replica.cpp:655] Replica received learned
notice for position 12
I0310 11:31:40.193948 915 leveldb.cpp:343] Persisting action (18 bytes)
to leveldb took 364062ns
I0310 11:31:40.193979 915 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 12256ns
I0310 11:31:40.193985 915 replica.cpp:676] Persisted action at 12
I0310 11:31:40.193990 915 replica.cpp:661] Replica learned TRUNCATE
action at position 12
I0310 11:31:40.248615 915 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O0 ] on slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.248744 915 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S1 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:40.774416 915 master.cpp:3246] Re-registering slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21)
I0310 11:31:40.774976 915 registrar.cpp:445] Applied 1 operations in
42342ns; attempting to update the 'registry'
I0310 11:31:40.777273 921 log.cpp:680] Attempting to append 647 bytes to
the log
I0310 11:31:40.777436 921 coordinator.cpp:340] Coordinator attempting to
write APPEND action at position 13
I0310 11:31:40.777989 921 replica.cpp:508] Replica received write request
for position 13
I0310 11:31:40.779558 921 leveldb.cpp:343] Persisting action (666 bytes)
to leveldb took 1.513714ms
I0310 11:31:40.779633 921 replica.cpp:676] Persisted action at 13
I0310 11:31:40.781821 919 replica.cpp:655] Replica received learned
notice for position 13
I0310 11:31:40.784417 919 leveldb.cpp:343] Persisting action (668 bytes)
to leveldb took 2.542036ms
I0310 11:31:40.784446 919 replica.cpp:676] Persisted action at 13
I0310 11:31:40.784452 919 replica.cpp:661] Replica learned APPEND action
at position 13
I0310 11:31:40.784711 920 registrar.cpp:490] Successfully updated the
'registry' in 9.68192ms
I0310 11:31:40.784762 917 log.cpp:699] Attempting to truncate the log to
13
I0310 11:31:40.784808 920 coordinator.cpp:340] Coordinator attempting to
write TRUNCATE action at position 14
I0310 11:31:40.784865 917 master.hpp:877] Adding task
ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with
resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
I0310 11:31:40.784955 919 replica.cpp:508] Replica received write request
for position 14
W0310 11:31:40.785038 917 master.cpp:4468] Possibly orphaned task
ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of
framework 20150310-112310-320783114-5050-24289-0000 running on slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21)
I0310 11:31:40.785105 917 master.cpp:3314] Re-registered slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148
I0310 11:31:40.785162 920 hierarchical_allocator_process.hpp:442] Added
slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148
available)
I0310 11:31:40.785679 921 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.786429 919 leveldb.cpp:343] Persisting action (16 bytes)
to leveldb took 1.454211ms
I0310 11:31:40.786455 919 replica.cpp:676] Persisted action at 14
I0310 11:31:40.786782 919 replica.cpp:655] Replica received learned
notice for position 14
I0310 11:31:40.787833 919 leveldb.cpp:343] Persisting action (18 bytes)
to leveldb took 1.027014ms
I0310 11:31:40.787873 919 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 14085ns
I0310 11:31:40.787883 919 replica.cpp:676] Persisted action at 14
I0310 11:31:40.787889 919 replica.cpp:661] Replica learned TRUNCATE
action at position 14
I0310 11:31:40.792536 922 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O1 ] on slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.792670 922 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S2 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:40.819602 921 master.cpp:3246] Re-registering slave
20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
(10.195.30.20)
I0310 11:31:40.819736 921 registrar.cpp:445] Applied 1 operations in
16656ns; attempting to update the 'registry'
I0310 11:31:40.820617 921 log.cpp:680] Attempting to append 647 bytes to
the log
I0310 11:31:40.820726 918 coordinator.cpp:340] Coordinator attempting to
write APPEND action at position 15
I0310 11:31:40.820938 918 replica.cpp:508] Replica received write request
for position 15
I0310 11:31:40.821641 918 leveldb.cpp:343] Persisting action (666 bytes)
to leveldb took 670583ns
I0310 11:31:40.821663 918 replica.cpp:676] Persisted action at 15
I0310 11:31:40.822265 917 replica.cpp:655] Replica received learned
notice for position 15
I0310 11:31:40.823463 917 leveldb.cpp:343] Persisting action (668 bytes)
to leveldb took 1.178687ms
I0310 11:31:40.823490 917 replica.cpp:676] Persisted action at 15
I0310 11:31:40.823498 917 replica.cpp:661] Replica learned APPEND action
at position 15
I0310 11:31:40.823755 917 registrar.cpp:490] Successfully updated the
'registry' in 3.97696ms
I0310 11:31:40.823823 917 log.cpp:699] Attempting to truncate the log to
15
I0310 11:31:40.824147 922 coordinator.cpp:340] Coordinator attempting to
write TRUNCATE action at position 16
I0310 11:31:40.824482 922 hierarchical_allocator_process.hpp:442] Added
slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
available)
I0310 11:31:40.824597 921 replica.cpp:508] Replica received write request
for position 16
I0310 11:31:40.824128 917 master.cpp:3314] Re-registered slave
20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
(10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148
I0310 11:31:40.824975 917 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.831900 921 leveldb.cpp:343] Persisting action (16 bytes)
to leveldb took 7.228682ms
I0310 11:31:40.832031 921 replica.cpp:676] Persisted action at 16
I0310 11:31:40.832456 917 replica.cpp:655] Replica received learned
notice for position 16
I0310 11:31:40.835178 917 leveldb.cpp:343] Persisting action (18 bytes)
to leveldb took 2.674392ms
I0310 11:31:40.835297 917 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 45220ns
I0310 11:31:40.835322 917 replica.cpp:676] Persisted action at 16
I0310 11:31:40.835341 917 replica.cpp:661] Replica learned TRUNCATE
action at position 16
I0310 11:31:40.838281 923 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O2 ] on slave
20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
(10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:40.838389 923 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S0 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:40.948725 919 http.cpp:344] HTTP request for
'/master/redirect'
I0310 11:31:41.479118 918 http.cpp:478] HTTP request for
'/master/state.json'
I0310 11:31:45.368074 918 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:45.385144 917 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O3 ] on slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:45.385292 917 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S1 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:46.368450 917 master.cpp:3843] Sending 2 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:46.375222 920 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O4 ] on slave
20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
(10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:46.375360 920 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O5 ] on slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:46.375530 920 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S0 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:46.375599 920 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S2 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:48.031230 915 http.cpp:478] HTTP request for
'/master/state.json'
I0310 11:31:51.374285 922 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:51.379391 921 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O6 ] on slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:51.379487 921 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S1 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:51.482094 923 http.cpp:478] HTTP request for
'/master/state.json'
I0310 11:31:52.375326 917 master.cpp:3843] Sending 2 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:52.391376 919 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O7 ] on slave
20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
(10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:52.391512 919 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S2 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:52.391659 921 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O8 ] on slave
20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
(10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:52.391751 921 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
cpus(*):2; mem(*):6961; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S0 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:55.062060 918 master.cpp:3611] Performing explicit task state
reconciliation for 1 tasks of framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:55.062588 919 master.cpp:3556] Performing implicit task state
reconciliation for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:56.140990 923 http.cpp:344] HTTP request for
'/master/redirect'
I0310 11:31:57.379288 918 master.cpp:3843] Sending 1 offers to framework
20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:57.430888 918 master.cpp:2344] Processing reply for offers: [
20150310-112310-354337546-5050-895-O9 ] on slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
(marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
I0310 11:31:57.431068 918 master.hpp:877] Adding task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with
resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
I0310 11:31:57.431089 918 master.cpp:2503] Launching task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 (marathon) at
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with
resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19)
I0310 11:31:57.431205 918 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
20150310-112310-320783114-5050-24289-S1 from framework
20150310-112310-354337546-5050-895-0000
I0310 11:31:57.682133 919 master.cpp:3446] Forwarding status update
TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.682186 919 master.cpp:3418] Status update TASK_RUNNING
(UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 from slave
20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
(10.195.30.19)
I0310 11:31:57.682199 919 master.cpp:4693] Updating the latest state of
task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
*from MESOS slave 10.195.30.21*
I0310 11:31:28.750200 1074 slave.cpp:2623] master@10.195.30.19:5050 exited
W0310 11:31:28.750249 1074 slave.cpp:2626] Master disconnected! Waiting
for a new master to be elected
I0310 11:31:40.012516 1075 detector.cpp:138] Detected a new leader:
(id='2')
I0310 11:31:40.012899 1073 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
I0310 11:31:40.017143 1072 detector.cpp:433] A new leading master (UPID=
master@10.195.30.21:5050) is detected
I0310 11:31:40.017408 1072 slave.cpp:602] New master detected at
master@10.195.30.21:5050
I0310 11:31:40.017546 1076 status_update_manager.cpp:171] Pausing sending
status updates
I0310 11:31:40.018673 1072 slave.cpp:627] No credentials provided.
Attempting to register without authentication
I0310 11:31:40.018689 1072 slave.cpp:638] Detecting new master
I0310 11:31:40.785364 1075 slave.cpp:824] Re-registered with master
master@10.195.30.21:5050
I0310 11:31:40.785398 1075 status_update_manager.cpp:178] Resuming sending
status updates
I0310 11:32:10.639506 1075 slave.cpp:3321] Current usage 12.27%. Max
allowed age: 5.441217749539572days
*from MESOS slave 10.195.30.19*
I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050 exited
W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected! Waiting
for a new master to be elected
I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader:
(id='2')
I0310 11:31:40.013530 24458 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master (UPID=
master@10.195.30.21:5050) is detected
I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at
master@10.195.30.21:5050
I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided.
Attempting to register without authentication
I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing sending
status updates
I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master
master@10.195.30.21:5050
I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming sending
status updates
I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 in work directory
'/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for
executor
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework '20150310-112310-354337546-5050-895-0000
I0310 11:31:57.434217 24461 docker.cpp:927] Starting container
'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and
executor
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of
framework '20150310-112310-354337546-5050-895-0000'
I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to
'/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
framework '20150310-112310-354337546-5050-895-0000' in container
'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for executor
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
10.195.30.19:56574
I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
executor
'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update
TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
10.195.30.19:56574
I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received status
update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.678411 24461 status_update_manager.hpp:346] Checkpointing
UPDATE for status update TASK_RUNNING (UUID:
2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update
TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 to
master@10.195.30.21:5050
I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for
status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000 to executor(1)@
10.195.30.19:56574
I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received status
update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.689028 24461 status_update_manager.hpp:346] Checkpointing
ACK for status update TASK_RUNNING (UUID:
2afb200e-a172-49ec-b807-dc47ea381e1e) for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
framework 20150310-112310-354337546-5050-895-0000
I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to 204 at
/sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
I0310 11:31:57.755570 24461 docker.cpp:1333] Updated
'memory.soft_limit_in_bytes' to 160MB for container
a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
I0310 11:31:57.756013 24461 docker.cpp:1359] Updated
'memory.limit_in_bytes' to 160MB at
/sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max
allowed age: 5.555425200437824days
*From previous Marathon leader 10.195.30.21 <http://10.195.30.21>:*
I0310 11:31:40.017248 1115 detector.cpp:138] Detected a new leader:
(id='2')
I0310 11:31:40.017334 1115 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
I0310 11:31:40.017727 1115 detector.cpp:433] A new leading master (UPID=
master@10.195.30.21:5050) is detected
[2015-03-10 11:31:40,017] WARN Disconnected
(mesosphere.marathon.MarathonScheduler:224)
[2015-03-10 11:31:40,019] INFO Abdicating
(mesosphere.marathon.MarathonSchedulerService:312)
[2015-03-10 11:31:40,019] INFO Defeat leadership
(mesosphere.marathon.MarathonSchedulerService:285)
[INFO] [03/10/2015 11:31:40.019] [marathon-akka.actor.default-dispatcher-6]
[akka://marathon/user/$b] POSTing to all endpoints.
[INFO] [03/10/2015 11:31:40.019] [marathon-akka.actor.default-dispatcher-5]
[akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
[2015-03-10 11:31:40,021] INFO Stopping driver
(mesosphere.marathon.MarathonSchedulerService:221)
I0310 11:31:40.022001 1115 sched.cpp:1286] Asked to stop the driver
[2015-03-10 11:31:40,024] INFO Setting framework ID to
20150310-112310-320783114-5050-24289-0000
(mesosphere.marathon.MarathonSchedulerService:73)
I0310 11:31:40.026274 1115 sched.cpp:234] New master detected at
master@10.195.30.21:5050
I0310 11:31:40.026418 1115 sched.cpp:242] No credentials provided.
Attempting to register without authentication
I0310 11:31:40.026458 1115 sched.cpp:752] Stopping framework
'20150310-112310-320783114-5050-24289-0000'
[2015-03-10 11:31:40,026] INFO Driver future completed. Executing optional
abdication command. (mesosphere.marathon.MarathonSchedulerService:192)
[2015-03-10 11:31:40,032] INFO Defeated (Leader Interface)
(mesosphere.marathon.MarathonSchedulerService:246)
[2015-03-10 11:31:40,032] INFO Defeat leadership
(mesosphere.marathon.MarathonSchedulerService:285)
[2015-03-10 11:31:40,032] INFO Stopping driver
(mesosphere.marathon.MarathonSchedulerService:221)
I0310 11:31:40.032588 1107 sched.cpp:1286] Asked to stop the driver
[2015-03-10 11:31:40,033] INFO Will offer leadership after 500 milliseconds
backoff (mesosphere.marathon.MarathonSchedulerService:334)
[2015-03-10 11:31:40,033] INFO Setting framework ID to
20150310-112310-320783114-5050-24289-0000
(mesosphere.marathon.MarathonSchedulerService:73)
[2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is not
a candidate for leader, current voting: [member_0000000001,
member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
[2015-03-10 11:31:40,552] INFO Using HA and therefore offering leadership
(mesosphere.marathon.MarathonSchedulerService:341)
[2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003
(com.twitter.common.zookeeper.Group:426)
[2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is not
a candidate for leader, current voting: [member_0000000001,
member_0000000002, member_0000000003]
(com.twitter.common.zookeeper.CandidateImpl:144)
[2015-03-10 11:31:40,568] INFO Candidate /marathon/leader/member_0000000003
waiting for the next leader election, current voting: [member_0000000001,
member_0000000002, member_0000000003]
(com.twitter.common.zookeeper.CandidateImpl:165)
*From new Marathon leader 10.195.30.20 <http://10.195.30.20>:*
[2015-03-10 11:31:40,029] INFO Candidate /marathon/leader/member_0000000001
is now leader of group: [member_0000000001, member_0000000002]
(com.twitter.common.zookeeper.CandidateImpl:152)
[2015-03-10 11:31:40,030] INFO Elected (Leader Interface)
(mesosphere.marathon.MarathonSchedulerService:253)
[2015-03-10 11:31:40,044] INFO Elect leadership
(mesosphere.marathon.MarathonSchedulerService:299)
[2015-03-10 11:31:40,044] INFO Running driver
(mesosphere.marathon.MarathonSchedulerService:184)
I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712: Client
environment:zookeeper.version=zookeeper C client 3.4.5
2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716: Client
environment:host.name=srv-d2u-9-virtip20
2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723: Client
environment:os.name=Linux
2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724: Client
environment:os.arch=3.13.0-44-generic
2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725: Client
environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733: Client
environment:user.name=(null)
2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741: Client
environment:user.home=/root
2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753: Client
environment:user.dir=/
2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786:
Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,
10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0 sessionId=0
sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
[2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff
(mesosphere.marathon.MarathonSchedulerService:329)
2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703:
initiated connection to server [10.195.30.19:2181]
2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750:
session establishment complete on server [10.195.30.19:2181],
sessionId=0x14c0335ad7e000d, negotiated timeout=10000
I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@
10.195.30.20:45771) connected to ZooKeeper
I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path '/mesos'
in ZooKeeper
[INFO] [03/10/2015 11:31:40.047] [marathon-akka.actor.default-dispatcher-2]
[akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader:
(id='2')
I0310 11:31:40.053530 22641 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
[2015-03-10 11:31:40,053] INFO Migration successfully applied for version
Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master (UPID=
master@10.195.30.21:5050) is detected
I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at
master@10.195.30.21:5050
I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided.
Attempting to register without authentication
I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with
20150310-112310-354337546-5050-895-0000
[2015-03-10 11:31:40,056] INFO Registered as
20150310-112310-354337546-5050-895-0000 to master
'20150310-112310-354337546-5050-895'
(mesosphere.marathon.MarathonScheduler:72)
[2015-03-10 11:31:40,063] INFO Stored framework ID
'20150310-112310-354337546-5050-895-0000'
(mesosphere.mesos.util.FrameworkIdUtil:49)
[INFO] [03/10/2015 11:31:40.065] [marathon-akka.actor.default-dispatcher-6]
[akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
[INFO] [03/10/2015 11:31:40.067] [marathon-akka.actor.default-dispatcher-7]
[akka://marathon/user/$b] POSTing to all endpoints.
...
...
...
[2015-03-10 11:31:55,052] INFO Syncing tasks for all apps
(mesosphere.marathon.SchedulerActions:403)
[INFO] [03/10/2015 11:31:55.053]
[marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
encountered. This logging can be turned off or adjusted with configuration
settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
[2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
Mesos master (mesosphere.marathon.SchedulerActions:430)
[2015-03-10 11:31:55,064] INFO Received status update for task
ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
TASK_LOST (Reconciliation: Task is unknown to the slave)
(mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:31:55,069] INFO Need to scale
/ffaas-backoffice-app-nopersist from 0 up to 1 instances
(mesosphere.marathon.SchedulerActions:488)
[2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
/ffaas-backoffice-app-nopersist (0 queued)
(mesosphere.marathon.SchedulerActions:494)
[2015-03-10 11:31:55,069] INFO Task
ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
expunged and removed from TaskTracker
(mesosphere.marathon.tasks.TaskTracker:107)
[2015-03-10 11:31:55,070] INFO Sending event notification.
(mesosphere.marathon.MarathonScheduler:262)
[INFO] [03/10/2015 11:31:55.072] [marathon-akka.actor.default-dispatcher-7]
[akka://marathon/user/$b] POSTing to all endpoints.
[2015-03-10 11:31:55,073] INFO Need to scale
/ffaas-backoffice-app-nopersist from 0 up to 1 instances
(mesosphere.marathon.SchedulerActions:488)
[2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
/ffaas-backoffice-app-nopersist. Not scaling.
(mesosphere.marathon.SchedulerActions:498)
...
...
...
[2015-03-10 11:31:57,682] INFO Received status update for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:31:57,694] INFO Sending event notification.
(mesosphere.marathon.MarathonScheduler:262)
[INFO] [03/10/2015 11:31:57.694]
[marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
POSTing to all endpoints.
...
...
...
[2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
(mesosphere.marathon.tasks.TaskTracker:170)
[INFO] [03/10/2015 11:36:55.050] [marathon-akka.actor.default-dispatcher-2]
[akka://marathon/deadLetters] Message
[mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
encountered. This logging can be turned off or adjusted with configuration
settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
[2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
(mesosphere.marathon.SchedulerActions:403)
[2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
Mesos master (mesosphere.marathon.SchedulerActions:430)
[2015-03-10 11:36:55,063] INFO Received status update for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
TASK_RUNNING (Reconciliation: Latest task state)
(mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:36:55,065] INFO Received status update for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
TASK_RUNNING (Reconciliation: Latest task state)
(mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:36:55,066] INFO Already running 1 instances of
/ffaas-backoffice-app-nopersist. Not scaling.
(mesosphere.marathon.SchedulerActions:512)
-- End of logs
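The reconciliation sequence visible in the Marathon log above can be modeled with a small sketch (illustrative only, not actual Mesos/Marathon code; the task IDs are hypothetical): a failed-over master that never learned about a task answers reconciliation with TASK_LOST, Marathon expunges the task and scales the app back up, even though the Docker container itself keeps running on the slave.

```python
# Illustrative model (not actual Mesos/Marathon code) of the reconciliation
# behavior seen in the logs: the new master answers reconciliation requests
# from its own (possibly empty) task table, and the framework reacts.

def reconcile(master_known_tasks, framework_tasks):
    """Return the status the master reports for each framework task."""
    return {
        task: ("TASK_RUNNING" if task in master_known_tasks else "TASK_LOST")
        for task in framework_tasks
    }

def marathon_sync(statuses, desired_instances):
    """Expunge lost tasks, then decide how many replacement tasks to queue."""
    running = [t for t, s in statuses.items() if s == "TASK_RUNNING"]
    to_start = max(0, desired_instances - len(running))
    return running, to_start

# The failed-over master never learned about the old task ("t1" is made up):
statuses = reconcile(master_known_tasks=set(), framework_tasks={"t1"})
running, to_start = marathon_sync(statuses, desired_instances=1)
print(statuses)   # {'t1': 'TASK_LOST'}
print(to_start)   # 1 -> Marathon queues a replacement; the container still runs
```

This is exactly the "Need to scale ... from 0 up to 1 instances" pattern in the log: the lost task leaves 0 known running instances, so one new task is queued while the orphaned container stays up.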
Re: Weird behavior when stopping the mesos master leader of a HA
mesos cluster
Posted by Adam Bordelon <ad...@mesosphere.io>.
This is certainly not the expected/desired behavior when failing over a
mesos master in HA mode. In addition to the master logs Alex requested, can
you also provide relevant portions of the slave logs for these tasks? If
the slave processes themselves never failed over, checkpointing and slave
recovery should be irrelevant. Are you running the mesos-slave itself
inside a Docker, or any other non-traditional setup?
FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
defaults to "reconnect", and --strict defaults to true, so none of those
are necessary.
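Given those defaults, a minimal sketch of an equivalent slave command line (dropping only the redundant flags, everything else as in the original setup) would be:

```shell
# Sketch only: in Mesos 0.21.x, --checkpoint, --recover=reconnect and
# --strict are already the defaults, so they can be dropped without
# changing behavior.
/usr/sbin/mesos-slave \
  --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos \
  --containerizers=docker,mesos \
  --executor_registration_timeout=5mins \
  --hostname=10.195.30.19 --ip=10.195.30.19 \
  --isolation=cgroups/cpu,cgroups/mem \
  --recovery_timeout=120mins \
  --resources="ports:[31000-32000,80,443]"
```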
Re: Weird behavior when stopping the mesos master leader of a HA
mesos cluster
Posted by Alex Rukletsov <al...@mesosphere.io>.
Geoffroy,
could you please provide master logs (both from killed and taking over
masters)?