You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2015/11/12 20:30:11 UTC

[jira] [Created] (MESOS-3907) PersistentVolumeTest.SlaveRecovery test is flaky

Vinod Kone created MESOS-3907:
---------------------------------

             Summary: PersistentVolumeTest.SlaveRecovery test is flaky
                 Key: MESOS-3907
                 URL: https://issues.apache.org/jira/browse/MESOS-3907
             Project: Mesos
          Issue Type: Bug
         Environment: ASF CI
            Reporter: Vinod Kone


Looks like the executor didn't re-register in time after slave restart. The Clock::settle() should've guaranteed that the executor sent re-register and slave processed it, so it's not clear why this race happened.

{code}

I1112 17:49:00.663300 28118 slave.cpp:392] Slave resources: cpus(*):2; mem(*):1024; disk(role1):1024; ports(*):[31000-32000]
I1112 17:49:00.663388 28118 slave.cpp:400] Slave attributes: [  ]
I1112 17:49:00.663406 28118 slave.cpp:405] Slave hostname: f4f9d084d7ba
I1112 17:49:00.663419 28118 slave.cpp:410] Slave checkpoint: true
I1112 17:49:00.669028 28115 state.cpp:54] Recovering state from '/tmp/PersistentVolumeTest_SlaveRecovery_LLnwwj/meta'
I1112 17:49:00.675704 28121 fetcher.cpp:79] Clearing fetcher cache
I1112 17:49:00.675799 28121 slave.cpp:4318] Recovering framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.675891 28121 slave.cpp:5108] Recovering executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.677217 28119 status_update_manager.cpp:202] Recovering status update manager
I1112 17:49:00.677250 28119 status_update_manager.cpp:210] Recovering executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.677326 28119 status_update_manager.cpp:499] Creating StatusUpdate stream for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.677896 28119 status_update_manager.cpp:802] Replaying status update stream for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145
I1112 17:49:00.678973 28121 slave.cpp:682] Successfully attached file '/tmp/PersistentVolumeTest_SlaveRecovery_LLnwwj/slaves/5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0/frameworks/5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000/executors/2b35f3fc-373c-4b5f-bbe6-dc3279117145/runs/b9dbc84e-57fa-49d5-840f-8f6d9f0233fc'
I1112 17:49:00.679440 28121 containerizer.cpp:385] Recovering containerizer
I1112 17:49:00.679527 28121 containerizer.cpp:440] Recovering container 'b9dbc84e-57fa-49d5-840f-8f6d9f0233fc' for executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.685341 28115 slave.cpp:4170] Sending reconnect request to executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 at executor(1)@172.17.6.239:46865
I1112 17:49:00.690004  1529 exec.cpp:256] Received reconnect request from slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0
I1112 17:49:00.703845 28122 hierarchical.cpp:977] No resources available to allocate!
I1112 17:49:00.703917 28122 hierarchical.cpp:1070] No inverse offers to send out!
I1112 17:49:00.704025 28122 hierarchical.cpp:876] Performed allocation for 1 slaves in 271350ns
I1112 17:49:00.703974 28116 slave.cpp:2702] Cleaning up un-reregistered executors
I1112 17:49:00.704107 28116 slave.cpp:2720] Killing un-reregistered executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 at executor(1)@172.17.6.239:46865
I1112 17:49:00.704262 28116 slave.cpp:4230] Finished recovery
I1112 17:49:00.704397 28123 containerizer.cpp:1075] Destroying container 'b9dbc84e-57fa-49d5-840f-8f6d9f0233fc'
I1112 17:49:00.706696 28116 slave.cpp:4387] Querying resource estimator for oversubscribable resources
I1112 17:49:00.707142 28127 slave.cpp:4401] Received oversubscribable resources  from the resource estimator
W1112 17:49:00.707540 28127 slave.cpp:2546] Shutting down executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 because the slave is not in recovery mode
I1112 17:49:00.708133 28116 slave.cpp:729] New master detected at master@172.17.6.239:32961
I1112 17:49:00.708228 28116 slave.cpp:792] Authenticating with master master@172.17.6.239:32961
I1112 17:49:00.708253 28116 slave.cpp:797] Using default CRAM-MD5 authenticatee
I1112 17:49:00.708421 28116 slave.cpp:765] Detecting new master
I1112 17:49:00.708533 28116 status_update_manager.cpp:176] Pausing sending status updates
I1112 17:49:00.708673 28116 authenticatee.cpp:123] Creating new client SASL connection
I1112 17:49:00.709913 28116 master.cpp:5150] Authenticating slave(197)@172.17.6.239:32961
I1112 17:49:00.710218 28116 authenticator.cpp:415] Starting authentication session for crammd5_authenticatee(467)@172.17.6.239:32961
I1112 17:49:00.710438  1516 exec.cpp:383] Executor asked to shutdown
I1112 17:49:00.710582  1516 exec.cpp:398] Executor::shutdown took 35481ns
I1112 17:49:00.710598 28116 authenticator.cpp:100] Creating new server SASL connection
I1112 17:49:00.710705  1516 exec.cpp:82] Scheduling shutdown of the executor
Shutting down
Sending SIGTERM to process tree at pid 1531
I1112 17:49:00.710830 28116 authenticatee.cpp:214] Received SASL authentication mechanisms: CRAM-MD5
I1112 17:49:00.710866 28116 authenticatee.cpp:240] Attempting to authenticate with mechanism 'CRAM-MD5'
I1112 17:49:00.712005 28116 authenticator.cpp:205] Received SASL authentication start
I1112 17:49:00.712069 28116 authenticator.cpp:327] Authentication requires more steps
I1112 17:49:00.712214 28124 authenticatee.cpp:260] Received SASL authentication step
I1112 17:49:00.712357 28116 authenticator.cpp:233] Received SASL authentication step
I1112 17:49:00.712389 28116 auxprop.cpp:109] Request to lookup properties for user: 'test-principal' realm: 'f4f9d084d7ba' server FQDN: 'f4f9d084d7ba' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: false 
I1112 17:49:00.712412 28116 auxprop.cpp:181] Looking up auxiliary property '*userPassword'
I1112 17:49:00.712467 28116 auxprop.cpp:181] Looking up auxiliary property '*cmusaslsecretCRAM-MD5'
I1112 17:49:00.712507 28116 auxprop.cpp:109] Request to lookup properties for user: 'test-principal' realm: 'f4f9d084d7ba' server FQDN: 'f4f9d084d7ba' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: true 
I1112 17:49:00.712524 28116 auxprop.cpp:131] Skipping auxiliary property '*userPassword' since SASL_AUXPROP_AUTHZID == true
I1112 17:49:00.712534 28116 auxprop.cpp:131] Skipping auxiliary property '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
I1112 17:49:00.712555 28116 authenticator.cpp:319] Authentication success
I1112 17:49:00.712708 28128 authenticatee.cpp:300] Authentication success
I1112 17:49:00.721004 28122 master.cpp:5180] Successfully authenticated principal 'test-principal' at slave(197)@172.17.6.239:32961
I1112 17:49:00.721352 28117 authenticator.cpp:433] Authentication session cleanup for crammd5_authenticatee(467)@172.17.6.239:32961
Killing the following process trees:
[ 
--- 1531 sh -c while true; do test -d path1; done 
]
I1112 17:49:00.721968 28128 slave.cpp:860] Successfully authenticated with master master@172.17.6.239:32961
I1112 17:49:00.723456 28130 master.cpp:4019] Re-registering slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(196)@172.17.6.239:32961 (f4f9d084d7ba)
I1112 17:49:00.724056 28126 hierarchical.cpp:417] Slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 reactivated
I1112 17:49:00.723914 28130 master.cpp:4207] Sending updated checkpointed resources disk(role1)[id1:path1]:64 to slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961 (f4f9d084d7ba)
I1112 17:49:00.724306 28128 slave.cpp:1254] Will retry registration in 17.384714ms if necessary
I1112 17:49:00.726639 28128 slave.cpp:1004] Re-registered with master master@172.17.6.239:32961
I1112 17:49:00.726775 28122 master.cpp:3479] Telling slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961 (f4f9d084d7ba) to kill task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 (default) at scheduler-7193507a-d32c-48fe-bfc2-1f594ff8a63a@172.17.6.239:32961
I1112 17:49:00.726794 28127 status_update_manager.cpp:183] Resuming sending status updates
I1112 17:49:00.726913 28128 slave.cpp:1040] Forwarding total oversubscribed resources 
I1112 17:49:00.727146 28126 master.cpp:4269] Received update of slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961 (f4f9d084d7ba) with total oversubscribed resources 
I1112 17:49:00.727447 28126 hierarchical.cpp:400] Slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 (f4f9d084d7ba) updated with oversubscribed resources  (total: cpus(*):2; mem(*):1024; disk(role1):960; ports(*):[31000-32000]; disk(role1)[id1:path1]:64, allocated: cpus(*):1; mem(*):128; disk(role1):32; disk(role1)[id1:path1]:64)
I1112 17:49:00.728248 28126 hierarchical.cpp:1162] Filtered offer with cpus(*):1; mem(*):896; ports(*):[31000-32000]; disk(role1):928 on slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 for framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.728301 28126 hierarchical.cpp:977] No resources available to allocate!
I1112 17:49:00.728350 28126 hierarchical.cpp:1070] No inverse offers to send out!
I1112 17:49:00.728374 28126 hierarchical.cpp:892] Performed allocation for slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 in 865336ns
I1112 17:49:00.728529 28128 slave.cpp:2178] Updating framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 pid to scheduler-7193507a-d32c-48fe-bfc2-1f594ff8a63a@172.17.6.239:32961
I1112 17:49:00.728587 28128 slave.cpp:2194] Checkpointing framework pid 'scheduler-7193507a-d32c-48fe-bfc2-1f594ff8a63a@172.17.6.239:32961' to '/tmp/PersistentVolumeTest_SlaveRecovery_LLnwwj/meta/slaves/5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0/frameworks/5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000/framework.pid'
I1112 17:49:00.729202 28119 status_update_manager.cpp:183] Resuming sending status updates
I1112 17:49:00.729709 28128 slave.cpp:2274] Updated checkpointed resources from disk(role1)[id1:path1]:64 to disk(role1)[id1:path1]:64
I1112 17:49:00.729821 28128 slave.cpp:1821] Asked to kill task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
W1112 17:49:00.729884 28128 slave.cpp:1948] Ignoring kill task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 because the executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 at executor(1)@172.17.6.239:46865 is terminating/terminated
I1112 17:49:00.804975 28117 containerizer.cpp:1258] Executor for container 'b9dbc84e-57fa-49d5-840f-8f6d9f0233fc' has exited
I1112 17:49:00.807514 28117 slave.cpp:3553] Executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 terminated with signal Killed
I1112 17:49:00.807701 28117 slave.cpp:2762] Handling status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 from @0.0.0.0:0
I1112 17:49:00.807884 28117 slave.cpp:5298] Terminating task 2b35f3fc-373c-4b5f-bbe6-dc3279117145
W1112 17:49:00.808377 28117 containerizer.cpp:967] Ignoring update for unknown container: b9dbc84e-57fa-49d5-840f-8f6d9f0233fc
I1112 17:49:00.808811 28117 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.808878 28117 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.814139 28117 status_update_manager.cpp:376] Forwarding update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 to the slave
I1112 17:49:00.814621 28122 slave.cpp:3087] Forwarding the update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 to master@172.17.6.239:32961
I1112 17:49:00.814888 28122 slave.cpp:3005] Status update manager successfully handled status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.815172 28117 master.cpp:4414] Status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 from slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961 (f4f9d084d7ba)
I1112 17:49:00.815218 28117 master.cpp:4462] Forwarding status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.815402 28117 master.cpp:6066] Updating the state of task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 (latest state: TASK_LOST, status update state: TASK_LOST)
I1112 17:49:00.815636 28122 sched.cpp:921] Scheduler::statusUpdate took 135083ns
I1112 17:49:00.816069 28117 master.cpp:3571] Processing ACKNOWLEDGE call 0c911b27-4aa4-4435-965d-1b4b5033a40c for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 (default) at scheduler-7193507a-d32c-48fe-bfc2-1f594ff8a63a@172.17.6.239:32961 on slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0
I1112 17:49:00.816118 28122 hierarchical.cpp:744] Recovered cpus(*):1; mem(*):128; disk(role1):32; disk(role1)[id1:path1]:64 (total: cpus(*):2; mem(*):1024; disk(role1):960; ports(*):[31000-32000]; disk(role1)[id1:path1]:64, allocated: ) on slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 from framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.816128 28117 master.cpp:6132] Removing task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 with resources cpus(*):1; mem(*):128; disk(role1):32; disk(role1)[id1:path1]:64 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 on slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961 (f4f9d084d7ba)
I1112 17:49:00.816721 28127 status_update_manager.cpp:394] Received status update acknowledgement (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.817196 28127 status_update_manager.cpp:826] Checkpointing ACK for status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
../../src/tests/persistent_volume_tests.cpp:709: Failure
Value of: status2.get().state()
  Actual: TASK_LOST
Expected: TASK_KILLED

{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)