Posted to dev@mesos.apache.org by "Adam B (JIRA)" <ji...@apache.org> on 2014/04/30 22:29:16 UTC

[jira] [Comment Edited] (MESOS-1264) Slave authentication retries can trigger TASK_LOST for non-checkpointing frameworks.

    [ https://issues.apache.org/jira/browse/MESOS-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985104#comment-13985104 ] 

Adam B edited comment on MESOS-1264 at 4/30/14 8:28 PM:
--------------------------------------------------------

TL;DR: If the slave is recovering after dying, we should removeFrameworks/removeOffers, but if the slave is re-registering with the same master from the same process (ZK failover?), we probably should not removeFrameworks (which sends TASK_LOST). How can the master tell which scenario it is?

Previously, this disconnect(Slave) code was only called for checkpointing slaves in Master::exited(). I pulled all of that code out and reused it at the start of Master::authenticate() to disconnect an authenticating slave before re-authenticating/re-registering. The disconnect(Slave) code currently does 3 things (sketched after this list): 1) disable the slave in the allocator so it makes no more offers, 2) remove (non-checkpointing) frameworks, which sends the TASK_LOST message you were seeing, and 3) remove all offers on that slave and have the allocator recover those resources.
1) I definitely want the slave disabled in the allocator, so that we don't offer any of this slave's resources in the middle of re-registration. It will be re-enabled once (re)registration completes.
2) Removing non-checkpointed frameworks could be unnecessary/overkill for a live re-registration, but would need to happen eventually if the re-auth/registration fails (sends ShutdownMessage, which should trigger Master::exited?). Successful re-registrations will reconcile the tasks with the slave.
3) I am less sure about removing/rescinding the slave's offers, since a launchTasks during re-registration would probably fail anyway, but the slave's actual resources could have changed between registrations. The offers are rescinded during deactivate(Framework), so maybe it should happen in disconnect(Slave) too.
I can easily make it so that only the desired steps (1 and maybe 3) happen on re-authentication, and we can hope/expect/test that a failed auth/registration will still clean up and send TASK_LOST correctly.
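For reference, here is roughly what disconnect(Slave) does today. This is only a sketch in the terms used above; the helper names approximate the real code rather than quoting it:
{noformat}
// Sketch of the current disconnect(Slave) behavior; helper names are
// approximations of the real code, not exact signatures.
void Master::disconnect(Slave* slave)
{
  // 1) Disable the slave in the allocator so that none of its
  //    resources are offered while it (re-)registers.
  allocator->slaveDisconnected(slave->id);

  // 2) Remove each non-checkpointing framework from this slave;
  //    removeFramework() is what sends the TASK_LOST updates.
  foreachkey (const FrameworkID& frameworkId, utils::copy(slave->tasks)) {
    Framework* framework = getFramework(frameworkId);
    if (framework != NULL && !framework->info.checkpoint()) {
      removeFramework(slave, framework);
    }
  }

  // 3) Rescind all outstanding offers on this slave and have the
  //    allocator recover those resources.
  foreach (Offer* offer, utils::copy(slave->offers)) {
    allocator->resourcesRecovered(
        offer->framework_id(), slave->id, offer->resources());
    removeOffer(offer, true); // true => rescind the offer.
  }
}
{noformat}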

Even so, all 3 of these disconnect(Slave) steps were originally intended for checkpointing slaves. Should non-checkpointing slaves behave any differently on re-authentication/registration?
The slave authenticate()s with the master on A) recovery or B) detection of a new master (and disconnect is only triggered when the "new" master is actually the same one).
A) If the slave process actually did shut down and a new (recovering) process is re-registering while the master thinks the old one is still active, a non-checkpointing slave will have lost all of its tasks, and a checkpointing slave will have lost all tasks from non-checkpointing frameworks. So for non-checkpointing slaves, TASK_LOST should be sent for all frameworks.
B) But if the slave for some reason detected a master change to the same master (as happens in the ReconcileIncompleteTasks test), disconnect(Slave) will be triggered, and the (non-)checkpointed state should be irrelevant. In this case the tasks are not actually lost, so sending TASK_LOST is incorrect.
Is there a way to tell whether a re-authenticating/registering slave is the same process that was previously registered? The pid is a libprocess pid, so it should remain the same.
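If the pid does differ across restarts, the check in Master::authenticate() could be as simple as the following sketch (findSlaveByAddress is a hypothetical helper that looks up a registered slave by the ip:port of a UPID; we don't have it today):
{noformat}
// Sketch only: use the libprocess pid to tell a live re-authentication
// apart from a restarted slave process.
void Master::authenticate(const UPID& from)
{
  Slave* slave = findSlaveByAddress(from); // hypothetical lookup
  if (slave != NULL) {
    if (slave->pid == from) {
      // Same libprocess pid (e.g. "slave(4)@10.37.184.103:35964"):
      // same process, so only disable it in the allocator (step 1,
      // maybe step 3); do not removeFrameworks / send TASK_LOST.
      allocator->slaveDisconnected(slave->id);
    } else {
      // Different pid at the same address: the old process is gone,
      // so the full disconnect (step 2 included) is appropriate.
      disconnect(slave);
    }
  }
  // ... continue with SASL authentication as before ...
}
{noformat}
The catch: a restarted slave could come back with an identical pid (same host, port, and libprocess id counter), in which case this comparison alone cannot tell the two scenarios apart.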



> Slave authentication retries can trigger TASK_LOST for non-checkpointing frameworks.
> ------------------------------------------------------------------------------------
>
>                 Key: MESOS-1264
>                 URL: https://issues.apache.org/jira/browse/MESOS-1264
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Benjamin Mahler
>            Assignee: Adam B
>
> Looks like there is a regression with slave authentication that is making the FaultToleranceTest.ReconcileIncompleteTasks flaky.
> The following is what appears to be happening in the test:
> 1. Slave re-detects leading Master.
> 2. Slave re-authenticates with Master.
> 3. Master sees slave as already activated, calls disconnect().
> 4. For non-checkpointing frameworks, this call to disconnect() assumes the slave has exited, and will send TASK_LOST.
> 5. In the case where the slave did not exit, the tasks are not lost.
> Either we can consider TASK_LOST valid in this case, similar to when the slave connection closes in Master::exited(UPID), and update the test as necessary.
> Or we can try to avoid sending TASK_LOST for authentication retries.
> {noformat}
> [ RUN      ] FaultToleranceTest.ReconcileIncompleteTasks
> Using temporary directory '/tmp/FaultToleranceTest_ReconcileIncompleteTasks_NvDg7P'
> I0428 22:54:27.878968   533 leveldb.cpp:174] Opened db in 152.905109ms
> I0428 22:54:27.901624   533 leveldb.cpp:181] Compacted db in 22.554687ms
> I0428 22:54:27.901720   533 leveldb.cpp:196] Created db iterator in 5625ns
> I0428 22:54:27.901761   533 leveldb.cpp:202] Seeked to beginning of db in 930ns
> I0428 22:54:27.901922   533 leveldb.cpp:271] Iterated through 0 keys in the db in 363ns
> I0428 22:54:27.901958   533 replica.cpp:729] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> I0428 22:54:27.902427   559 recover.cpp:425] Starting replica recovery
> I0428 22:54:27.902935   561 recover.cpp:451] Replica is in EMPTY status
> I0428 22:54:27.904425   556 master.cpp:266] Master 20140428-225427-1740121354-35964-533 (HOSTNAME) started on 10.37.184.103:35964
> I0428 22:54:27.904470   556 master.cpp:303] Master only allowing authenticated frameworks to register
> I0428 22:54:27.904491   556 master.cpp:308] Master only allowing authenticated slaves to register
> I0428 22:54:27.904515   556 credentials.hpp:35] Loading credentials for authentication
> I0428 22:54:27.904561   551 replica.cpp:626] Replica in EMPTY status received a broadcasted recover request
> W0428 22:54:27.904619   556 credentials.hpp:48] Failed to stat credentials file 'file:///tmp/FaultToleranceTest_ReconcileIncompleteTasks_NvDg7P/credentials': No such file or directory
> I0428 22:54:27.904887   550 recover.cpp:188] Received a recover response from a replica in EMPTY status
> I0428 22:54:27.905383   568 recover.cpp:542] Updating replica status to STARTING
> I0428 22:54:27.906673   558 master.cpp:922] The newly elected leader is master@10.37.184.103:35964 with id 20140428-225427-1740121354-35964-533
> I0428 22:54:27.906708   558 master.cpp:932] Elected as the leading master!
> I0428 22:54:27.906728   558 master.cpp:753] Recovering from registrar
> I0428 22:54:27.906842   556 registrar.cpp:275] Recovering registrar
> I0428 22:54:27.937743   561 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 31.905754ms
> I0428 22:54:27.937796   561 replica.cpp:320] Persisted replica status to STARTING
> I0428 22:54:27.938019   561 recover.cpp:451] Replica is in STARTING status
> I0428 22:54:27.939944   569 replica.cpp:626] Replica in STARTING status received a broadcasted recover request
> I0428 22:54:27.940160   565 recover.cpp:188] Received a recover response from a replica in STARTING status
> I0428 22:54:27.941345   558 recover.cpp:542] Updating replica status to VOTING
> I0428 22:54:27.962074   553 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 20.298725ms
> I0428 22:54:27.962128   553 replica.cpp:320] Persisted replica status to VOTING
> I0428 22:54:27.962266   553 recover.cpp:556] Successfully joined the Paxos group
> I0428 22:54:27.962432   553 recover.cpp:440] Recover process terminated
> I0428 22:54:27.962946   566 log.cpp:656] Attempting to start the writer
> I0428 22:54:27.964815   547 replica.cpp:474] Replica received implicit promise request with proposal 1
> I0428 22:54:27.970388   547 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 5.516929ms
> I0428 22:54:27.970440   547 replica.cpp:342] Persisted promised to 1
> I0428 22:54:27.971122   569 coordinator.cpp:229] Coordinator attemping to fill missing position
> I0428 22:54:27.972815   570 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2
> I0428 22:54:27.978729   570 leveldb.cpp:341] Persisting action (8 bytes) to leveldb took 5.798681ms
> I0428 22:54:27.978806   570 replica.cpp:664] Persisted action at 0
> I0428 22:54:27.980425   547 replica.cpp:508] Replica received write request for position 0
> I0428 22:54:27.980496   547 leveldb.cpp:436] Reading position from leveldb took 21889ns
> I0428 22:54:27.987072   547 leveldb.cpp:341] Persisting action (14 bytes) to leveldb took 6.529569ms
> I0428 22:54:27.987125   547 replica.cpp:664] Persisted action at 0
> I0428 22:54:27.987746   559 replica.cpp:643] Replica received learned notice for position 0
> I0428 22:54:27.995412   559 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 7.515296ms
> I0428 22:54:27.995465   559 replica.cpp:664] Persisted action at 0
> I0428 22:54:27.995499   559 replica.cpp:649] Replica learned NOP action at position 0
> I0428 22:54:27.996111   560 log.cpp:672] Writer started with ending position 0
> I0428 22:54:27.997576   562 leveldb.cpp:436] Reading position from leveldb took 23701ns
> I0428 22:54:28.000907   550 registrar.cpp:308] Successfully recovered registrar
> I0428 22:54:28.001000   550 registrar.cpp:379] Attempting to update the 'registry'
> I0428 22:54:28.003427   567 log.cpp:680] Attempting to append 153 bytes to the log
> I0428 22:54:28.003566   548 coordinator.cpp:339] Coordinator attempting to write APPEND action at position 1
> I0428 22:54:28.004590   552 replica.cpp:508] Replica received write request for position 1
> I0428 22:54:28.079499   552 leveldb.cpp:341] Persisting action (172 bytes) to leveldb took 74.874271ms
> I0428 22:54:28.079545   552 replica.cpp:664] Persisted action at 1
> I0428 22:54:28.080243   548 replica.cpp:643] Replica received learned notice for position 1
> I0428 22:54:28.103667   548 leveldb.cpp:341] Persisting action (174 bytes) to leveldb took 23.258596ms
> I0428 22:54:28.103731   548 replica.cpp:664] Persisted action at 1
> I0428 22:54:28.103847   548 replica.cpp:649] Replica learned APPEND action at position 1
> I0428 22:54:28.105062   568 registrar.cpp:427] Successfully updated 'registry'
> I0428 22:54:28.105399   561 log.cpp:699] Attempting to truncate the log to 1
> I0428 22:54:28.105626   559 master.cpp:780] Recovered 0 slaves from the Registry (115B) ; allowing 10mins for slaves to re-register
> I0428 22:54:28.105702   547 coordinator.cpp:339] Coordinator attempting to write TRUNCATE action at position 2
> I0428 22:54:28.107192   548 replica.cpp:508] Replica received write request for position 2
> I0428 22:54:28.110117   547 slave.cpp:140] Slave started on 4)@10.37.184.103:35964
> I0428 22:54:28.110178   547 slave.cpp:149] Moving slave process into its own cgroup
> I0428 22:54:28.111990   548 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 4.755207ms
> I0428 22:54:28.112031   548 replica.cpp:664] Persisted action at 2
> I0428 22:54:28.112576   558 replica.cpp:643] Replica received learned notice for position 2
> I0428 22:54:28.113647   533 sched.cpp:121] Version: 0.19.0
> I0428 22:54:28.114310   549 sched.cpp:217] New master detected at master@10.37.184.103:35964
> I0428 22:54:28.114351   549 sched.cpp:268] Authenticating with master master@10.37.184.103:35964
> I0428 22:54:28.114554   561 authenticatee.hpp:128] Creating new client SASL connection
> I0428 22:54:28.114681   549 master.cpp:2795] Authenticating scheduler(4)@10.37.184.103:35964
> I0428 22:54:28.114882   570 authenticator.hpp:148] Creating new server SASL connection
> I0428 22:54:28.115015   570 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5
> I0428 22:54:28.115047   570 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5'
> I0428 22:54:28.115114   570 authenticator.hpp:254] Received SASL authentication start
> I0428 22:54:28.115195   570 authenticator.hpp:342] Authentication requires more steps
> I0428 22:54:28.115279   548 authenticatee.hpp:265] Received SASL authentication step
> I0428 22:54:28.115485   556 authenticator.hpp:282] Received SASL authentication step
> I0428 22:54:28.115576   556 authenticator.hpp:334] Authentication success
> I0428 22:54:28.115681   555 authenticatee.hpp:305] Authentication success
> I0428 22:54:28.115736   549 master.cpp:2835] Successfully authenticated scheduler(4)@10.37.184.103:35964
> I0428 22:54:28.116276   566 sched.cpp:342] Successfully authenticated with master master@10.37.184.103:35964
> I0428 22:54:28.116435   561 master.cpp:981] Received registration request from scheduler(4)@10.37.184.103:35964
> I0428 22:54:28.116557   561 master.cpp:999] Registering framework 20140428-225427-1740121354-35964-533-0000 at scheduler(4)@10.37.184.103:35964
> I0428 22:54:28.116726   552 sched.cpp:392] Framework registered with 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.116801   561 hierarchical_allocator_process.hpp:332] Added framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.120318   558 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 7.6976ms
> I0428 22:54:28.120383   558 leveldb.cpp:399] Deleting ~1 keys from leveldb took 18353ns
> I0428 22:54:28.120409   558 replica.cpp:664] Persisted action at 2
> I0428 22:54:28.120434   558 replica.cpp:649] Replica learned TRUNCATE action at position 2
> I0428 22:54:28.135422   547 slave.cpp:149] Moving slave process into its own cgroup
> I0428 22:54:28.155840   547 credentials.hpp:35] Loading credentials for authentication
> W0428 22:54:28.155907   547 credentials.hpp:48] Failed to stat credentials file 'file:///tmp/FaultToleranceTest_ReconcileIncompleteTasks_aPd9kr/credential': No such file or directory
> I0428 22:54:28.155962   547 slave.cpp:231] Slave using credential for: test-principal
> I0428 22:54:28.156137   547 slave.cpp:244] Slave resources: cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
> I0428 22:54:28.156743   547 slave.cpp:272] Slave hostname: HOSTNAME
> I0428 22:54:28.156797   547 slave.cpp:273] Slave checkpoint: false
> I0428 22:54:28.157765   567 state.cpp:33] Recovering state from '/tmp/FaultToleranceTest_ReconcileIncompleteTasks_aPd9kr/meta'
> I0428 22:54:28.158015   564 status_update_manager.cpp:193] Recovering status update manager
> I0428 22:54:28.158342   548 slave.cpp:2943] Finished recovery
> I0428 22:54:28.158860   566 slave.cpp:525] New master detected at master@10.37.184.103:35964
> I0428 22:54:28.159013   566 slave.cpp:585] Authenticating with master master@10.37.184.103:35964
> I0428 22:54:28.159083   548 status_update_manager.cpp:167] New master detected at master@10.37.184.103:35964
> I0428 22:54:28.159139   566 slave.cpp:558] Detecting new master
> I0428 22:54:28.159340   567 authenticatee.hpp:128] Creating new client SASL connection
> I0428 22:54:28.159664   567 master.cpp:2795] Authenticating slave(4)@10.37.184.103:35964
> I0428 22:54:28.159989   559 authenticator.hpp:148] Creating new server SASL connection
> I0428 22:54:28.160220   548 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5
> I0428 22:54:28.160253   548 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5'
> I0428 22:54:28.160328   548 authenticator.hpp:254] Received SASL authentication start
> I0428 22:54:28.160426   548 authenticator.hpp:342] Authentication requires more steps
> I0428 22:54:28.160490   548 authenticatee.hpp:265] Received SASL authentication step
> I0428 22:54:28.160568   548 authenticator.hpp:282] Received SASL authentication step
> I0428 22:54:28.160673   548 authenticator.hpp:334] Authentication success
> I0428 22:54:28.160758   558 authenticatee.hpp:305] Authentication success
> I0428 22:54:28.160787   548 master.cpp:2835] Successfully authenticated slave(4)@10.37.184.103:35964
> I0428 22:54:28.161053   570 slave.cpp:642] Successfully authenticated with master master@10.37.184.103:35964
> I0428 22:54:28.161638   556 registrar.cpp:379] Attempting to update the 'registry'
> I0428 22:54:28.164188   570 log.cpp:680] Attempting to append 378 bytes to the log
> I0428 22:54:28.164309   558 coordinator.cpp:339] Coordinator attempting to write APPEND action at position 3
> I0428 22:54:28.165631   569 replica.cpp:508] Replica received write request for position 3
> I0428 22:54:28.186933   569 leveldb.cpp:341] Persisting action (397 bytes) to leveldb took 21.263363ms
> I0428 22:54:28.186987   569 replica.cpp:664] Persisted action at 3
> I0428 22:54:28.187626   555 replica.cpp:643] Replica received learned notice for position 3
> I0428 22:54:28.195307   555 leveldb.cpp:341] Persisting action (399 bytes) to leveldb took 7.628475ms
> I0428 22:54:28.195361   555 replica.cpp:664] Persisted action at 3
> I0428 22:54:28.195397   555 replica.cpp:649] Replica learned APPEND action at position 3
> I0428 22:54:28.196445   547 registrar.cpp:427] Successfully updated 'registry'
> I0428 22:54:28.196738   553 log.cpp:699] Attempting to truncate the log to 3
> I0428 22:54:28.197031   552 master.cpp:2169] Admitted slave on HOSTNAME at slave(4)@IP:35964
> I0428 22:54:28.197108   552 master.cpp:3283] Adding slave 20140428-225427-1740121354-35964-533-0 at HOST with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
> I0428 22:54:28.196979   568 coordinator.cpp:339] Coordinator attempting to write TRUNCATE action at position 4
> I0428 22:54:28.197506   560 slave.cpp:675] Registered with master master@10.37.184.103:35964; given slave ID 20140428-225427-1740121354-35964-533-0
> I0428 22:54:28.197764   547 hierarchical_allocator_process.hpp:445] Added slave 20140428-225427-1740121354-35964-533-0 (HOST) with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] (and cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] available)
> I0428 22:54:28.198402   565 master.cpp:2744] Sending 1 offers to framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.199038   568 replica.cpp:508] Replica received write request for position 4
> I0428 22:54:28.200683   548 master.cpp:1806] Processing reply for offers: [ 20140428-225427-1740121354-35964-533-0 ] on slave 20140428-225427-1740121354-35964-533-0 (HOST) for framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.201052   548 master.hpp:558] Adding task 1 with resources cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 20140428-225427-1740121354-35964-533-0 (HOST)
> I0428 22:54:28.201151   548 master.cpp:2919] Launching task 1 of framework 20140428-225427-1740121354-35964-533-0000 with resources cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 20140428-225427-1740121354-35964-533-0 (HOST)
> I0428 22:54:28.201370   555 slave.cpp:905] Got assigned task 1 for framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.201853   555 slave.cpp:1015] Launching task 1 for framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.203640   568 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 4.423216ms
> I0428 22:54:28.203698   568 replica.cpp:664] Persisted action at 4
> I0428 22:54:28.204627   547 replica.cpp:643] Replica received learned notice for position 4
> I0428 22:54:28.206099   555 exec.cpp:131] Version: 0.19.0
> I0428 22:54:28.206481   555 slave.cpp:1125] Queuing task '1' for executor default of framework '20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.206670   555 slave.cpp:2282] Monitoring executor 'default' of framework '20140428-225427-1740121354-35964-533-0000' in container '77e0cf10-1c8d-47a7-946d-eb56f26e339d'
> I0428 22:54:28.206925   555 slave.cpp:1598] Got registration for executor 'default' of framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.207355   555 slave.cpp:1717] Flushing queued task 1 for executor 'default' of framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.207478   557 exec.cpp:205] Executor registered on slave 20140428-225427-1740121354-35964-533-0
> I0428 22:54:28.211570   557 slave.cpp:1953] Handling status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 from executor(4)@10.37.184.103:35964
> I0428 22:54:28.211958   547 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 7.29439ms
> I0428 22:54:28.211987   559 status_update_manager.cpp:320] Received status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.212195   547 leveldb.cpp:399] Deleting ~2 keys from leveldb took 56641ns
> I0428 22:54:28.212311   547 replica.cpp:664] Persisted action at 4
> I0428 22:54:28.212358   547 replica.cpp:649] Replica learned TRUNCATE action at position 4
> I0428 22:54:28.212357   559 status_update_manager.cpp:373] Forwarding status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 to master@10.37.184.103:35964
> I0428 22:54:28.212858   559 slave.cpp:2076] Sending acknowledgement for status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 to executor(4)@10.37.184.103:35964
> I0428 22:54:28.222946   565 slave.cpp:525] New master detected at master@10.37.184.103:35964
> I0428 22:54:28.223008   565 slave.cpp:585] Authenticating with master master@10.37.184.103:35964
> I0428 22:54:28.223079   564 status_update_manager.cpp:167] New master detected at master@10.37.184.103:35964
> I0428 22:54:28.223125   565 slave.cpp:558] Detecting new master
> W0428 22:54:28.223120   564 status_update_manager.cpp:181] Resending status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.223142   568 authenticatee.hpp:128] Creating new client SASL connection
> I0428 22:54:28.223157   564 status_update_manager.cpp:373] Forwarding status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 to master@10.37.184.103:35964
> I0428 22:54:28.223357   565 master.cpp:1263] Disconnecting slave 20140428-225427-1740121354-35964-533-0
> I0428 22:54:28.223460   566 hierarchical_allocator_process.hpp:484] Slave 20140428-225427-1740121354-35964-533-0 disconnected
> I0428 22:54:28.223486   565 master.cpp:1283] Removing non-checkpointing framework 20140428-225427-1740121354-35964-533-0000 from disconnected slave 20140428-225427-1740121354-35964-533-0(HOST)
> I0428 22:54:28.223505   565 master.cpp:3235] Removing framework 20140428-225427-1740121354-35964-533-0000 from slave 20140428-225427-1740121354-35964-533-0 (HOST)
> I0428 22:54:28.225428   565 master.cpp:2444] Status update TASK_LOST (UUID: dfa02d31-0a93-4808-924c-0969412c7e52) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 from @0.0.0.0:0
> I0428 22:54:28.225581   565 master.hpp:576] Removing task 1 with resources cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 20140428-225427-1740121354-35964-533-0 (HOST)
> I0428 22:54:28.225934   565 master.cpp:2795] Authenticating slave(4)@10.37.184.103:35964
> I0428 22:54:28.225975   564 hierarchical_allocator_process.hpp:637] Recovered cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] (total allocatable: cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]) on slave 20140428-225427-1740121354-35964-533-0 from framework 20140428-225427-1740121354-35964-533-0000
> W0428 22:54:28.226312   565 master.cpp:2437] Status update TASK_FINISHED (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000 from slave(4)@10.37.184.103:35964 (HOST): error, couldn't lookup task
> I0428 22:54:28.226378   563 authenticator.hpp:148] Creating new server SASL connection
> I0428 22:54:28.226554   554 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5
> I0428 22:54:28.226663   554 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5'
> I0428 22:54:28.226667   553 status_update_manager.cpp:398] Received status update acknowledgement (UUID: e227720d-b0fb-4f5f-a40f-be27f9a3a219) for task 1 of framework 20140428-225427-1740121354-35964-533-0000
> I0428 22:54:28.226872   552 authenticator.hpp:254] Received SASL authentication start
> I0428 22:54:28.226994   552 authenticator.hpp:342] Authentication requires more steps
> I0428 22:54:28.227147   556 authenticatee.hpp:265] Received SASL authentication step
> I0428 22:54:28.227288   556 authenticator.hpp:282] Received SASL authentication step
> I0428 22:54:28.227378   556 authenticator.hpp:334] Authentication success
> I0428 22:54:28.227519   548 authenticatee.hpp:305] Authentication success
> I0428 22:54:28.227524   556 master.cpp:2835] Successfully authenticated slave(4)@10.37.184.103:35964
> I0428 22:54:28.227975   556 slave.cpp:642] Successfully authenticated with master master@10.37.184.103:35964
> W0428 22:54:28.228243   568 master.cpp:2244] Slave at slave(4)@10.37.184.103:35964 (HOST) is being allowed to re-register with an already in use id (20140428-225427-1740121354-35964-533-0)
> I0428 22:54:28.228421   556 slave.cpp:725] Re-registered with master master@10.37.184.103:35964
> I0428 22:54:28.228457   553 hierarchical_allocator_process.hpp:498] Slave 20140428-225427-1740121354-35964-533-0 reconnected
> ../../src/tests/fault_tolerance_tests.cpp:2084: Failure
> Value of: status.get().state()
>   Actual: TASK_LOST
> Expected: TASK_FINISHED
> /var/tmp/sclly9xeu: line 8:   533 Segmentation fault      './bin/mesos-tests.sh' '--gtest_filter=FaultToleranceTest.ReconcileIncompleteTasks' '--gtest_repeat=-1' '--gtest_break_on_failure' '--verbose'
> [bmahler@HOST build]$ Write failed: Broken pipe
> {noformat}


