Posted to dev@mesos.apache.org by "Adam B (JIRA)" <ji...@apache.org> on 2014/04/30 22:29:16 UTC
[jira] [Comment Edited] (MESOS-1264) Slave authentication retries
can trigger TASK_LOST for non-checkpointing frameworks.
[ https://issues.apache.org/jira/browse/MESOS-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985104#comment-13985104 ]
Adam B edited comment on MESOS-1264 at 4/30/14 8:28 PM:
--------------------------------------------------------
TL;DR: If the slave is recovering after dying, we should removeFrameworks/removeOffers, but if the slave is reregistering to the same master from the same process (ZK failover?), we probably should not removeFrameworks (sends TASK_LOST). How can the master tell which scenario it is?
Previously, this disconnect(Slave) code was only called for checkpointing slaves in Master::exited(). I pulled all of that code out and reused it at the start of Master::authenticate() to disconnect an authenticating slave before re-authenticating/re-registering. The disconnect(Slave) code currently does 3 things: 1) disable the slave in the allocator so it makes no more offers, 2) remove (non-checkpointing) frameworks, which sends the TASK_LOST message you were seeing, and 3) remove all offers on that slave and have the allocator recover those resources.
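For discussion's sake, here is a minimal sketch of those three steps using hypothetical, simplified stand-in types (not the real Mesos classes; names just mirror the description above):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the allocator's slave-activation state.
struct Allocator {
  std::set<std::string> deactivated;
  void slaveDeactivated(const std::string& slaveId) {
    deactivated.insert(slaveId);  // step 1: no more offers from this slave
  }
  void resourcesRecovered(const std::string& slaveId) {
    // step 3: return the rescinded offers' resources to the pool
    (void)slaveId;
  }
};

// Hypothetical per-slave state tracked by the master.
struct Slave {
  std::string id;
  std::vector<std::string> nonCheckpointingFrameworks;
  std::vector<std::string> offers;  // outstanding offer ids
};

struct Master {
  Allocator allocator;
  std::vector<std::string> taskLostSentTo;  // frameworks that got TASK_LOST

  // The disconnect(Slave) logic under discussion, as three steps.
  void disconnect(Slave& slave) {
    // 1) Disable the slave in the allocator during re-registration.
    allocator.slaveDeactivated(slave.id);

    // 2) Remove non-checkpointing frameworks -- this is the step
    //    that sends the TASK_LOST updates.
    for (const auto& framework : slave.nonCheckpointingFrameworks) {
      taskLostSentTo.push_back(framework);
    }
    slave.nonCheckpointingFrameworks.clear();

    // 3) Rescind outstanding offers and recover their resources.
    slave.offers.clear();
    allocator.resourcesRecovered(slave.id);
  }
};
```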
1) I definitely want the slave disabled in the allocator, so that we don't offer any of this slave's resources in the middle of re-registration. It will be re-enabled once (re)registration completes.
2) Removing non-checkpointed frameworks could be unnecessary/overkill for a live re-registration, but would need to happen eventually if the re-auth/registration fails (sends ShutdownMessage, which should trigger Master::exited?). Successful re-registrations will reconcile the tasks with the slave.
3) I am less sure about removing/rescinding the slave's offers, since a launchTasks during re-registration would probably fail anyway, but the slave's actual resources could have changed between registrations. The offers are rescinded during deactivate(Framework), so maybe it should happen in disconnect(Slave) too.
I can easily make it so that only the desired steps (1 and maybe 3) happen on re-authentication, and we can hope/expect/test that failed auth/reg will clean up and send TASK_LOST correctly.
Even so, all 3 of these disconnect(Slave) steps were originally intended for checkpointing slaves. Should non-checkpointing slaves behave any differently on re-authentication/registration?
The slave authenticate()s with the master on A) recovery or B) detection of a new master (and it would only trigger disconnect when the "new" master is actually the same).
A) If the slave process actually shut down and a new (recovering) process is re-registering while the master thinks the old one is still active, a non-checkpointing slave would have lost all its tasks, and a checkpointing slave would have lost all tasks from non-checkpointing frameworks. So non-checkpointing slaves should send TASK_LOST for all frameworks.
B) But if the slave for some reason detected a master change to the same master (as the ReconcileIncompleteTasks does), disconnect(Slave) will be triggered, and the (non)checkpointed state should be irrelevant. In this case, the tasks are not actually lost, so sending the TASK_LOST is incorrect.
Is there a way to tell whether a re-authenticating/registering slave is the same process that was previously registered? The pid is a libprocess pid, so it should remain the same.
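If the pid does stay stable, one possible check (sketched here with a hypothetical UPID stand-in, not the real libprocess type) would be to compare the stored pid of the registered slave against the pid of the re-authenticating one: equality suggests scenario B (same process, don't send TASK_LOST), inequality suggests scenario A (restarted process, TASK_LOST is correct):

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a libprocess UPID, formatted "id@ip:port".
struct UPID {
  std::string value;
  bool operator==(const UPID& other) const { return value == other.value; }
};

// Scenario B: same process re-authenticating (e.g. master re-detection),
// so tasks survive and TASK_LOST would be wrong.
// Scenario A: a different (restarted) process, so TASK_LOST is correct
// for non-checkpointed state.
bool isSameSlaveProcess(const UPID& registeredPid,
                        const UPID& authenticatingPid) {
  return registeredPid == authenticatingPid;
}
```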
> Slave authentication retries can trigger TASK_LOST for non-checkpointing frameworks.
> ------------------------------------------------------------------------------------
>
> Key: MESOS-1264
> URL: https://issues.apache.org/jira/browse/MESOS-1264
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.19.0
> Reporter: Benjamin Mahler
> Assignee: Adam B
>
> Looks like there is a regression with slave authentication that is making the FaultToleranceTest.ReconcileIncompleteTasks flaky.
> The following is what appears to be happening in the test and seems to be a regression:
> 1. Slave re-detects leading Master.
> 2. Slave re-authenticates with Master.
> 3. Master sees slave as already activated, calls disconnect().
> 4. For non-checkpointing frameworks, this call to disconnect() assumes the slave has exited, and will send TASK_LOST.
> 5. In the case where the slave did not exit, the tasks are not lost.
> Either we can consider this case to be valid as TASK_LOST, similarly to when the slave connection closes in Master::exited(UPID), updating the test as necessary.
> Or we can try to avoid sending TASK_LOST for authentication retries.
> {noformat}
> [ RUN ] FaultToleranceTest.ReconcileIncompleteTasks
> Using temporary directory '/tmp/FaultToleranceTest_ReconcileIncompleteTasks_NvDg7P'
> I0428 22:54:27.878968 533 leveldb.cpp:174] Opened db in 152.905109ms
> I0428 22:54:27.901624 533 leveldb.cpp:181] Compacted db in 22.554687ms
> I0428 22:54:27.901720 533 leveldb.cpp:196] Created db iterator in 5625ns
> I0428 22:54:27.901761 533 leveldb.cpp:202] Seeked to beginning of db in 930ns
> I0428 22:54:27.901922 533 leveldb.cpp:271] Iterated through 0 keys in the db in 363ns
> I0428 22:54:27.901958 533 replica.cpp:729] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> I0428 22:54:27.902427 559 recover.cpp:425] Starting replica recovery
> I0428 22:54:27.902935 561 recover.cpp:451] Replica is in EMPTY status
> I0428 22:54:27.904425 556 master.cpp:266] Master 20140428-225427-1740121354-35964-533 (HOSTNAME) started on 10.37.184.103:35964
> I0428 22:54:27.904470 556 master.cpp:303] Master only allowing authenticated frameworks to register
> I0428 22:54:27.904491 556 master.cpp:308] Master only allowing authenticated slaves to register
> I0428 22:54:27.904515 556 credentials.hpp:35] Loading credentials for authentication
> I0428 22:54:27.904561 551 replica.cpp:626] Replica in EMPTY status received a broadcasted recover request
> W0428 22:54:27.904619 556 credentials.hpp:48] Failed to stat credentials file 'file:///tmp/FaultToleranceTest_ReconcileIncompleteTasks_NvDg7P/credentials': No such file or directory
> I0428 22:54:27.904887 550 recover.cpp:188] Received a recover response from a replica in EMPTY status
> I0428 22:54:27.905383 568 recover.cpp:542] Updating replica status to STARTING
> I0428 22:54:27.906673 558 master.cpp:922] The newly elected leader is master@10.37.184.103:35964 with id 20140428-225427-1740121354-35964-533
> I0428 22:54:27.906708 558 master.cpp:932] Elected as the leading master!
> I0428 22:54:27.906728 558 master.cpp:753] Recovering from registrar
> I0428 22:54:27.906842 556 registrar.cpp:275] Recovering registrar
> I0428 22:54:27.937743 561 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 31.905754ms
> I0428 22:54:27.937796 561 replica.cpp:320] Persisted replica status to STARTING
> I0428 22:54:27.938019 561 recover.cpp:451] Replica is in STARTING status
> I0428 22:54:27.939944 569 replica.cpp:626] Replica in STARTING status received a broadcasted recover request
> I0428 22:54:27.940160 565 recover.cpp:188] Received a recover response from a replica in STARTING status
> I0428 22:54:27.941345 558 recover.cpp:542] Updating replica status to VOTING
> I0428 22:54:27.962074 553 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 20.298725ms
> I0428 22:54:27.962128 553 replica.cpp:320] Persisted replica status to VOTING
> I0428 22:54:27.962266 553 recover.cpp:556] Successfully joined the Paxos group
> I0428 22:54:27.962432 553 recover.cpp:440] Recover process terminated
> I0428 22:54:27.962946 566 log.cpp:656] Attempting to start the writer
> I0428 22:54:27.964815 547 replica.cpp:474] Replica received implicit promise request with proposal 1
> I0428 22:54:27.970388 547 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 5.516929ms
> I0428 22:54:27.970440 547 replica.cpp:342] Persisted promised to 1
> I0428 22:54:27.971122 569 coordinator.cpp:229] Coordinator attemping to fill missing position
> I0428 22:54:27.972815 570 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2
> I0428 22:54:27.978729 570 leveldb.cpp:341] Persisting action (8 bytes) to leveldb took 5.798681ms
> I0428 22:54:27.978806 570 replica.cpp:664] Persisted action at 0
> I0428 22:54:27.980425 547 replica.cpp:508] Replica received write request for position 0
> I0428 22:54:27.980496 547 leveldb.cpp:436] Reading position from leveldb took 21889ns
> I0428 22:54:27.987072 547 leveldb.cpp:341] Persisting action (14 bytes) to leveldb took 6.529569ms
> I0428 22:54:27.987125 547 replica.cpp:664] Persisted action at 0
> I0428 22:54:27.987746 559 replica.cpp:643] Replica received learned notice for position 0
> I0428 22:54:27.995412 559 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 7.515296ms
> I0428 22:54:27.995465 559 replica.cpp:664] Persisted action at 0
> I0428 22:54:27.995499 559 replica.cpp:649] Replica learned NOP action at position 0
> I0428 22:54:27.996111 560 log.cpp:672] Writer started with ending position 0
> I0428 22:54:27.997576 562 leveldb.cpp:436] Reading position from leveldb took 23701ns
> I0428 22:54:28.000907 550 registrar.cpp:308] Successfully recovered registrar
> I0428 22:54:28.001000 550 registrar.cpp:379] Attempting to update the 'registry'
> I0428 22:54:28.003427 567 log.cpp:680] Attempting to append 153 bytes to the log
> I0428 22:54:28.003566 548 coordinator.cpp:339] Coordinator attempting to write APPEND action at position 1
> I0428 22:54:28.004590 552 replica.cpp:508] Replica received write request for position 1
> I0428 22:54:28.079499 552 leveldb.cpp:341] Persisting action (172 bytes) to leveldb took 74.874271ms
> I0428 22:54:28.079545 552 replica.cpp:664] Persisted action at 1
> I0428 22:54:28.080243 548 replica.cpp:643] Replica received learned notice for position 1
> I0428 22:54:28.103667 548 leveldb.cpp:341] Persisting action (174 bytes) to leveldb took 23.258596ms
> I0428 22:54:28.103731 548 replica.cpp:664] Persisted action at 1
> I0428 22:54:28.103847 548 replica.cpp:649] Replica learned APPEND action at position 1
> I0428 22:54:28.105062 568 registrar.cpp:427] Successfully updated 'registry'
> I0428 22:54:28.105399 561 log.cpp:699] Attempting to truncate the log to 1
> I0428 22:54:28.105626 559 master.cpp:780] Recovered 0 slaves from the Registry (115B) ; allowing 10mins for slaves to re-register
> I0428 22:54:28.105702 547 coordinator.cpp:339] Coordinator attempting to write TRUNCATE action at position 2
> I0428 22:54:28.107192 548 replica.cpp:508] Replica received write request for position 2
> I0428 22:54:28.110117 547 slave.cpp:140] Slave started on 4)@10.37.184.103:35964
> I0428 22:54:28.110178 547 slave.cpp:149] Moving slave process into its own cgroup
> I0428 22:54:28.111990 548 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 4.755207ms
> I0428 22:54:28.112031 548 replica.cpp:664] Persisted action at 2
> I0428 22:54:28.112576 558 replica.cpp:643] Replica received learned notice for position 2
> I0428 22:54:28.113647 533 sched.cpp:121] Version: 0.19.0
> I0428 22:54:28.114310 549 sched.cpp:217] New master detected at master@10.37.184.103:35964
> I0428 22:54:28.114351 549 sched.cpp:268] Authenticating with master master@10.37.184.103:35964
> I0428 22:54:28.114554 561 authenticatee.hpp:128] Creating new client SASL connection
> I0428 22:54:28.114681 549 master.cpp:2795] Authenticating scheduler(4)@10.37.184.103:35964
> I0428 22:54:28.114882 570 authenticator.hpp:148] Creating new server SASL connection
> I0428 22:54:28.115015 570 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5
> I0428 22:54:28.115047 570 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5'
> I0428 22:54:28.115114 570 authenticator.hpp:254] Received SASL authentication start
> I0428 22:54:28.115195 570 authenticator.hpp:342] Authentication requires more steps
> I0428 22:54:28.115279 548 authenticatee.hpp:265] Received SASL authentication step
> I0428 22:54:28.115485 556 authenticator.hpp:282] Received SASL authentication step
> I0428 22:54:28.115576 556 authenticator.hpp:334] Authentication success
> I0428 22:54:28.115681 555 authenticatee.hpp:305] Authentication success
> I0428 22:54:28.115736 549 master.cpp:2835] Successfully authenticated scheduler(4)@10.37.184.103:35964
> I0428 22:54:28.116276 566 sched.cpp:342] Successfully authenticated with master master@10.37.184.103:35964
> I0428 22:54:28.116435 561 master.cpp:981] Received registration request from scheduler(4)@10.37.184.103:35964
> I0428 22:54:28.116557 561 master.cpp:999] Registering framework 20140428-225427-1740121354-35964-533-0000 at scheduler(4)@10.37.184.103:35964
> I0428 22:54:28.116726 552 sched.cpp:392] Framework registered with 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.116801 561 hierarchical_allocator_process.hpp:332] Added framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.120318 558 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 7.6976ms
> I0428 22:54:28.120383 558 leveldb.cpp:399] Deleting ~1 keys from leveldb took 18353ns
> I0428 22:54:28.120409 558 replica.cpp:664] Persisted action at 2
> I0428 22:54:28.120434 558 replica.cpp:649] Replica learned TRUNCATE action at position 2
> I0428 22:54:28.135422 547 slave.cpp:149] Moving slave process into its own cgroup
> I0428 22:54:28.155840 547 credentials.hpp:35] Loading credentials for authentication
> W0428 22:54:28.155907 547 credentials.hpp:48] Failed to stat credentials file 'file:///tmp/FaultToleranceTest_ReconcileIncompleteTasks_aPd9kr/credential': No such file or directory
> I0428 22:54:28.155962 547 slave.cpp:231] Slave using credential for: test-principal
> I0428 22:54:28.156137 547 slave.cpp:244] Slave resources: cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
> I0428 22:54:28.156743 547 slave.cpp:272] Slave hostname: HOSTNAME
> I0428 22:54:28.156797 547 slave.cpp:273] Slave checkpoint: false
> I0428 22:54:28.157765 567 state.cpp:33] Recovering state from '/tmp/FaultToleranceTest_ReconcileIncompleteTasks_aPd9kr/meta'
> I0428 22:54:28.158015 564 status_update_manager.cpp:193] Recovering status update manager
> I0428 22:54:28.158342 548 slave.cpp:2943] Finished recovery
> I0428 22:54:28.158860 566 slave.cpp:525] New master detected at master@10.37.184.103:35964
> I0428 22:54:28.159013 566 slave.cpp:585] Authenticating with master master@10.37.184.103:35964
> I0428 22:54:28.159083 548 status_update_manager.cpp:167] New master detected at master@10.37.184.103:35964
> I0428 22:54:28.159139 566 slave.cpp:558] Detecting new master
> I0428 22:54:28.159340 567 authenticatee.hpp:128] Creating new client SASL connection
> I0428 22:54:28.159664 567 master.cpp:2795] Authenticating slave(4)@10.37.184.103:35964
> I0428 22:54:28.159989 559 authenticator.hpp:148] Creating new server SASL connection
> I0428 22:54:28.160220 548 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5
> I0428 22:54:28.160253 548 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5'
> I0428 22:54:28.160328 548 authenticator.hpp:254] Received SASL authentication start
> I0428 22:54:28.160426 548 authenticator.hpp:342] Authentication requires more steps
> I0428 22:54:28.160490 548 authenticatee.hpp:265] Received SASL authentication step
> I0428 22:54:28.160568 548 authenticator.hpp:282] Received SASL authentication step
> I0428 22:54:28.160673 548 authenticator.hpp:334] Authentication success
> I0428 22:54:28.160758 558 authenticatee.hpp:305] Authentication success
> I0428 22:54:28.160787 548 master.cpp:2835] Successfully authenticated slave(4)@10.37.184.103:35964
> I0428 22:54:28.161053 570 slave.cpp:642] Successfully authenticated with master master@10.37.184.103:35964
> I0428 22:54:28.161638 556 registrar.cpp:379] Attempting to update the 'registry'
> I0428 22:54:28.164188 570 log.cpp:680] Attempting to append 378 bytes to the log
> I0428 22:54:28.164309 558 coordinator.cpp:339] Coordinator attempting to write APPEND action at position 3
> I0428 22:54:28.165631 569 replica.cpp:508] Replica received write request for position 3
> I0428 22:54:28.186933 569 leveldb.cpp:341] Persisting action (397 bytes) to leveldb took 21.263363ms
> I0428 22:54:28.186987 569 replica.cpp:664] Persisted action at 3
> I0428 22:54:28.187626 555 replica.cpp:643] Replica received learned notice for position 3
> I0428 22:54:28.195307 555 leveldb.cpp:341] Persisting action (399 bytes) to leveldb took 7.628475ms
> I0428 22:54:28.195361 555 replica.cpp:664] Persisted action at 3
> I0428 22:54:28.195397 555 replica.cpp:649] Replica learned APPEND action at position 3
> I0428 22:54:28.196445 547 registrar.cpp:427] Successfully updated 'registry'
> I0428 22:54:28.196738 553 log.cpp:699] Attempting to truncate the log to 3
> I0428 22:54:28.197031 552 master.cpp:2169] Admitted slave on HOSTNAME at slave(4)@IP:35964
> I0428 22:54:28.197108 552 master.cpp:3283] Adding slave 20140428-225427-1740121354-35964-533-0 at HOST with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
> I0428 22:54:28.196979 568 coordinator.cpp:339] Coordinator attempting to write TRUNCATE action at position 4
> I0428 22:54:28.197506 560 slave.cpp:675] Registered with master master@10.37.184.103:35964; given slave ID 20140428-225427-1740121354-35964-533-0
> I0428 22:54:28.197764 547 hierarchical_allocator_process.hpp:445] Added slave 20140428-225427-1740121354-35964-533-0 (HOST) with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] (and cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] available)
> I0428 22:54:28.198402 565 master.cpp:2744] Sending 1 offers to framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.199038 568 replica.cpp:508] Replica received write request for position 4
> I0428 22:54:28.200683 548 master.cpp:1806] Processing reply for offers: [ 20140428-225427-1740121354-35964-533-0 ] on slave 20140428-225427-1740121354-35964-533-0 (HOST) for framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.201052 548 master.hpp:558] Adding task 1 with resources cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 20140428-225427-1740121354-35964-533-0 (HOST)
> I0428 22:54:28.201151 548 master.cpp:2919] Launching task 1 of framework 20140428-225427-1740121354-35964-533-0000 with resources cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 20140428-225427-1740121354-35964-533-0 (HOST)
> I0428 22:54:28.201370 555 slave.cpp:905] Got assigned task 1 for framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.201853 555 slave.cpp:1015] Launching task 1 for framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.203640 568 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 4.423216ms
> I0428 22:54:28.203698 568 replica.cpp:664] Persisted action at 4
> I0428 22:54:28.204627 547 replica.cpp:643] Replica received learned notice for position 4
> I0428 22:54:28.206099 555 exec.cpp:131] Version: 0.19.0
> I0428 22:54:28.206481 555 slave.cpp:1125] Queuing task '1' for executor default of framework '20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.206670 555 slave.cpp:2282] Monitoring executor 'default' of framework '20140428-225427-1740121354-35964-533-0000' in container '77e0cf10-1c8d-47a7-946d-eb56f26e339d'
> I0428 22:54:28.206925 555 slave.cpp:1598] Got registration for executor 'default' of framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.207355 555 slave.cpp:1717] Flushing queued task 1 for executor 'default' of framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.207478 557 exec.cpp:205] Executor registered on slave 20140428-225427-1740121354-35964-533-0
> I0428 22:54:28.211570 557 slave.cpp:1953] Handling status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 from executor(4)@10.37.184.103:35964
> I0428 22:54:28.211958 547 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 7.29439ms
> I0428 22:54:28.211987 559 status_update_manager.cpp:320] Received status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.212195 547 leveldb.cpp:399] Deleting ~2 keys from leveldb took 56641ns
> I0428 22:54:28.212311 547 replica.cpp:664] Persisted action at 4
> I0428 22:54:28.212358 547 replica.cpp:649] Replica learned TRUNCATE action at position 4
> I0428 22:54:28.212357 559 status_update_manager.cpp:373] Forwarding status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 to master@10.37.184.103:35964
> I0428 22:54:28.212858 559 slave.cpp:2076] Sending acknowledgement for status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 to executor(4)@10.37.184.103:35964
> I0428 22:54:28.222946 565 slave.cpp:525] New master detected at master@10.37.184.103:35964
> I0428 22:54:28.223008 565 slave.cpp:585] Authenticating with master master@10.37.184.103:35964
> I0428 22:54:28.223079 564 status_update_manager.cpp:167] New master detected at master@10.37.184.103:35964
> I0428 22:54:28.223125 565 slave.cpp:558] Detecting new master
> W0428 22:54:28.223120 564 status_update_manager.cpp:181] Resending status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.223142 568 authenticatee.hpp:128] Creating new client SASL connection
> I0428 22:54:28.223157 564 status_update_manager.cpp:373] Forwarding status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 to master@10.37.184.103:35964
> I0428 22:54:28.223357 565 master.cpp:1263] Disconnecting slave 20140428-225427-1740121354-35964-533-0
> I0428 22:54:28.223460 566 hierarchical_allocator_process.hpp:484] Slave 20140428-225427-1740121354-35964-533-0 disconnected
> I0428 22:54:28.223486 565 master.cpp:1283] Removing non-checkpointing framework 20140428-225427-1740121354-35964-533-0000 from disconnected slave 20140428-225427-1740121354-35964-533-0(HOST)
> I0428 22:54:28.223505 565 master.cpp:3235] Removing framework 20140428-225427-1740121354-35964-533-0000 from slave 20140428-225427-1740121354-35964-533-0 (HOST)
> I0428 22:54:28.225428 565 master.cpp:2444] Status update TASK_LOST (UUID: dfa02d31-0a93-4808-924c-0969412c7e52) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 from @0.0.0.0:0
> I0428 22:54:28.225581 565 master.hpp:576] Removing task 1 with resources cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 20140428-225427-1740121354-35964-533-0 (HOST)
> I0428 22:54:28.225934 565 master.cpp:2795] Authenticating slave(4)@10.37.184.103:35964
> I0428 22:54:28.225975 564 hierarchical_allocator_process.hpp:637] Recovered cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] (total allocatable: cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]) on slave 20140428-225427-1740121354-35964-533-0 from framework 20140428-225427-1740121354-35964-533-0000
> W0428 22:54:28.226312 565 master.cpp:2437] Status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 from slave(4)@10.37.184.103:35964 (HOST): error, couldn't lookup task
> I0428 22:54:28.226378 563 authenticator.hpp:148] Creating new server SASL connection
> I0428 22:54:28.226554 554 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5
> I0428 22:54:28.226663 554 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5'
> I0428 22:54:28.226667 553 status_update_manager.cpp:398] Received status update acknowledgement (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.226872 552 authenticator.hpp:254] Received SASL authentication start
> I0428 22:54:28.226994 552 authenticator.hpp:342] Authentication requires more steps
> I0428 22:54:28.227147 556 authenticatee.hpp:265] Received SASL authentication step
> I0428 22:54:28.227288 556 authenticator.hpp:282] Received SASL authentication step
> I0428 22:54:28.227378 556 authenticator.hpp:334] Authentication success
> I0428 22:54:28.227519 548 authenticatee.hpp:305] Authentication success
> I0428 22:54:28.227524 556 master.cpp:2835] Successfully authenticated slave(4)@10.37.184.103:35964
> I0428 22:54:28.227975 556 slave.cpp:642] Successfully authenticated with master master@10.37.184.103:35964
> W0428 22:54:28.228243 568 master.cpp:2244] Slave at slave(4)@10.37.184.103:35964 (HOST) is being allowed to re-register with an already in use id (20140428-225427-1740121354-35964-533-0)
> I0428 22:54:28.228421 556 slave.cpp:725] Re-registered with master master@10.37.184.103:35964
> I0428 22:54:28.228457 553 hierarchical_allocator_process.hpp:498] Slave 20140428-225427-1740121354-35964-533-0 reconnected
> ../../src/tests/fault_tolerance_tests.cpp:2084: Failure
> Value of: status.get().state()
> Actual: TASK_LOST
> Expected: TASK_FINISHED
> /var/tmp/sclly9xeu: line 8: 533 Segmentation fault './bin/mesos-tests.sh' '--gtest_filter=FaultToleranceTest.ReconcileIncompleteTasks' '--gtest_repeat=-1' '--gtest_break_on_failure' '--verbose'
> [bmahler@HOST build]$ Write failed: Broken pipe
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.2#6252)