You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by "Ian Downes (JIRA)" <ji...@apache.org> on 2014/05/29 22:07:02 UTC

[jira] [Closed] (MESOS-1331) SlaveRecoveryTest/0.NonCheckpointingFramework is flaky.

     [ https://issues.apache.org/jira/browse/MESOS-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Downes closed MESOS-1331.
-----------------------------

    Resolution: Fixed

Branch: refs/heads/master
Commit: b669ee97da5b5ab15751fe41e5dc50180d7d3bce
Parents: b8cbeac
Author: Ian Downes <id...@twitter.com>
Authored: Wed Apr 30 15:29:58 2014 -0700
Committer: Ian Downes <id...@twitter.com>
Committed: Tue May 20 15:26:15 2014 -0700


> SlaveRecoveryTest/0.NonCheckpointingFramework is flaky.
> -------------------------------------------------------
>
>                 Key: MESOS-1331
>                 URL: https://issues.apache.org/jira/browse/MESOS-1331
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.19.0
>            Reporter: Benjamin Mahler
>             Fix For: 0.19.0
>
>
> {noformat}
>  RUN      ] SlaveRecoveryTest/0.NonCheckpointingFramework
> Using temporary directory '/tmp/SlaveRecoveryTest_0_NonCheckpointingFramework_aP60b8'
> I0508 20:05:15.992506 37575 leveldb.cpp:174] Opened db in 171.124155ms
> I0508 20:05:16.017724 37575 leveldb.cpp:181] Compacted db in 25118us
> I0508 20:05:16.017792 37575 leveldb.cpp:196] Created db iterator in 5810ns
> I0508 20:05:16.017822 37575 leveldb.cpp:202] Seeked to beginning of db in 1290ns
> I0508 20:05:16.017853 37575 leveldb.cpp:271] Iterated through 0 keys in the db in 587ns
> I0508 20:05:16.017899 37575 replica.cpp:741] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> I0508 20:05:16.018878 37595 recover.cpp:425] Starting replica recovery
> I0508 20:05:16.019269 37592 recover.cpp:451] Replica is in EMPTY status
> I0508 20:05:16.020742 37593 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I0508 20:05:16.021250 37595 master.cpp:267] Master 20140508-200516-1740121354-60182-37575 (smfd-bkq-03-sr4.devel.twitter.com) started on 10.37.184.103:60182
> I0508 20:05:16.021291 37595 master.cpp:304] Master only allowing authenticated frameworks to register
> I0508 20:05:16.021306 37595 master.cpp:309] Master only allowing authenticated slaves to register
> I0508 20:05:16.021318 37595 credentials.hpp:35] Loading credentials for authentication
> I0508 20:05:16.021481 37612 recover.cpp:188] Received a recover response from a replica in EMPTY status
> W0508 20:05:16.021528 37595 credentials.hpp:48] Failed to stat credentials file 'file:///tmp/SlaveRecoveryTest_0_NonCheckpointingFramework_aP60b8/credentials': No such file or directory
> I0508 20:05:16.022090 37603 recover.cpp:542] Updating replica status to STARTING
> I0508 20:05:16.023128 37606 master.cpp:919] The newly elected leader is master@10.37.184.103:60182 with id 20140508-200516-1740121354-60182-37575
> I0508 20:05:16.023175 37606 master.cpp:929] Elected as the leading master!
> I0508 20:05:16.023207 37606 master.cpp:750] Recovering from registrar
> I0508 20:05:16.023367 37590 registrar.cpp:313] Recovering registrar
> I0508 20:05:16.059332 37611 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 36.962934ms
> I0508 20:05:16.059419 37611 replica.cpp:320] Persisted replica status to STARTING
> I0508 20:05:16.059635 37611 recover.cpp:451] Replica is in STARTING status
> I0508 20:05:16.061924 37598 replica.cpp:638] Replica in STARTING status received a broadcasted recover request
> I0508 20:05:16.062656 37595 recover.cpp:188] Received a recover response from a replica in STARTING status
> I0508 20:05:16.063076 37611 recover.cpp:542] Updating replica status to VOTING
> I0508 20:05:16.081363 37592 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 18.158871ms
> I0508 20:05:16.081449 37592 replica.cpp:320] Persisted replica status to VOTING
> I0508 20:05:16.081564 37592 recover.cpp:556] Successfully joined the Paxos group
> I0508 20:05:16.081925 37592 recover.cpp:440] Recover process terminated
> I0508 20:05:16.082630 37601 log.cpp:656] Attempting to start the writer
> I0508 20:05:16.084640 37601 replica.cpp:474] Replica received implicit promise request with proposal 1
> I0508 20:05:16.089722 37601 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 5.046965ms
> I0508 20:05:16.089864 37601 replica.cpp:342] Persisted promised to 1
> I0508 20:05:16.090682 37598 coordinator.cpp:230] Coordinator attemping to fill missing position
> I0508 20:05:16.092527 37597 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2
> I0508 20:05:16.097997 37597 leveldb.cpp:341] Persisting action (8 bytes) to leveldb took 5.317699ms
> I0508 20:05:16.098073 37597 replica.cpp:676] Persisted action at 0
> I0508 20:05:16.099889 37591 replica.cpp:508] Replica received write request for position 0
> I0508 20:05:16.099982 37591 leveldb.cpp:436] Reading position from leveldb took 41686ns
> I0508 20:05:16.106329 37591 leveldb.cpp:341] Persisting action (14 bytes) to leveldb took 6.286379ms
> I0508 20:05:16.106410 37591 replica.cpp:676] Persisted action at 0
> I0508 20:05:16.107161 37592 replica.cpp:655] Replica received learned notice for position 0
> I0508 20:05:16.114692 37592 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 7.478084ms
> I0508 20:05:16.114835 37592 replica.cpp:676] Persisted action at 0
> I0508 20:05:16.114939 37592 replica.cpp:661] Replica learned NOP action at position 0
> I0508 20:05:16.115711 37604 log.cpp:672] Writer started with ending position 0
> I0508 20:05:16.117002 37599 leveldb.cpp:436] Reading position from leveldb took 18438ns
> I0508 20:05:16.120064 37606 registrar.cpp:346] Successfully fetched the registry (0B)
> I0508 20:05:16.120125 37606 registrar.cpp:422] Attempting to update the 'registry'
> I0508 20:05:16.122580 37608 log.cpp:680] Attempting to append 155 bytes to the log
> I0508 20:05:16.122792 37607 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 1
> I0508 20:05:16.123879 37605 replica.cpp:508] Replica received write request for position 1
> I0508 20:05:16.131319 37605 leveldb.cpp:341] Persisting action (174 bytes) to leveldb took 7.391658ms
> I0508 20:05:16.131402 37605 replica.cpp:676] Persisted action at 1
> I0508 20:05:16.132503 37595 replica.cpp:655] Replica received learned notice for position 1
> I0508 20:05:16.139683 37595 leveldb.cpp:341] Persisting action (176 bytes) to leveldb took 7.108266ms
> I0508 20:05:16.139770 37595 replica.cpp:676] Persisted action at 1
> I0508 20:05:16.139932 37595 replica.cpp:661] Replica learned APPEND action at position 1
> I0508 20:05:16.140727 37598 registrar.cpp:479] Successfully updated 'registry'
> I0508 20:05:16.140985 37598 registrar.cpp:372] Successfully recovered registrar
> I0508 20:05:16.141175 37591 log.cpp:699] Attempting to truncate the log to 1
> I0508 20:05:16.141296 37601 master.cpp:777] Recovered 0 slaves from the Registry (117B) ; allowing 10mins for slaves to re-register
> I0508 20:05:16.141649 37610 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 2
> I0508 20:05:16.143034 37609 replica.cpp:508] Replica received write request for position 2
> I0508 20:05:16.144410 37575 mesos_containerizer.cpp:122] Using isolation: cgroups/cpu,cgroups/mem
> I0508 20:05:16.148017 37609 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 4.943311ms
> I0508 20:05:16.148066 37609 replica.cpp:676] Persisted action at 2
> I0508 20:05:16.148726 37589 replica.cpp:655] Replica received learned notice for position 2
> I0508 20:05:16.156313 37589 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 7.453827ms
> I0508 20:05:16.156415 37589 leveldb.cpp:399] Deleting ~1 keys from leveldb took 24402ns
> I0508 20:05:16.156452 37589 replica.cpp:676] Persisted action at 2
> I0508 20:05:16.156484 37589 replica.cpp:661] Replica learned TRUNCATE action at position 2
> I0508 20:05:16.168103 37575 linux_launcher.cpp:66] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0508 20:05:16.171707 37609 slave.cpp:140] Slave started on 6)@10.37.184.103:60182
> I0508 20:05:16.171931 37609 slave.cpp:149] Moving slave process into its own cgroup
> I0508 20:05:16.174898 37575 sched.cpp:121] Version: 0.19.0
> I0508 20:05:16.175509 37607 sched.cpp:217] New master detected at master@10.37.184.103:60182
> I0508 20:05:16.175560 37607 sched.cpp:268] Authenticating with master master@10.37.184.103:60182
> I0508 20:05:16.175889 37606 authenticatee.hpp:128] Creating new client SASL connection
> I0508 20:05:16.176017 37606 master.cpp:2796] Authenticating scheduler(6)@10.37.184.103:60182
> I0508 20:05:16.176208 37599 authenticator.hpp:148] Creating new server SASL connection
> I0508 20:05:16.176324 37599 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5
> I0508 20:05:16.176353 37599 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5'
> I0508 20:05:16.176414 37599 authenticator.hpp:254] Received SASL authentication start
> I0508 20:05:16.176493 37599 authenticator.hpp:342] Authentication requires more steps
> I0508 20:05:16.176558 37599 authenticatee.hpp:265] Received SASL authentication step
> I0508 20:05:16.176681 37603 authenticator.hpp:282] Received SASL authentication step
> I0508 20:05:16.176782 37603 authenticator.hpp:334] Authentication success
> I0508 20:05:16.176908 37603 authenticatee.hpp:305] Authentication success
> I0508 20:05:16.176935 37601 master.cpp:2836] Successfully authenticated scheduler(6)@10.37.184.103:60182
> I0508 20:05:16.177193 37603 sched.cpp:342] Successfully authenticated with master master@10.37.184.103:60182
> I0508 20:05:16.177511 37604 master.cpp:978] Received registration request from scheduler(6)@10.37.184.103:60182
> I0508 20:05:16.177633 37604 master.cpp:996] Registering framework 20140508-200516-1740121354-60182-37575-0000 at scheduler(6)@10.37.184.103:60182
> I0508 20:05:16.177810 37604 sched.cpp:392] Framework registered with 20140508-200516-1740121354-60182-37575-0000
> I0508 20:05:16.177891 37595 hierarchical_allocator_process.hpp:331] Added framework 20140508-200516-1740121354-60182-37575-0000
> I0508 20:05:16.180985 37609 slave.cpp:149] Moving slave process into its own cgroup
> I0508 20:05:16.186306 37609 credentials.hpp:35] Loading credentials for authentication
> W0508 20:05:16.186379 37609 credentials.hpp:48] Failed to stat credentials file 'file:///tmp/SlaveRecoveryTest_0_NonCheckpointingFramework_TXCwTS/credential': No such file or directory
> I0508 20:05:16.186441 37609 slave.cpp:230] Slave using credential for: test-principal
> I0508 20:05:16.186610 37609 slave.cpp:243] Slave resources: cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
> I0508 20:05:16.187018 37609 slave.cpp:271] Slave hostname: smfd-bkq-03-sr4.devel.twitter.com
> I0508 20:05:16.187150 37609 slave.cpp:272] Slave checkpoint: true
> I0508 20:05:16.187824 37597 state.cpp:33] Recovering state from '/tmp/SlaveRecoveryTest_0_NonCheckpointingFramework_TXCwTS/meta'
> I0508 20:05:16.188226 37612 status_update_manager.cpp:193] Recovering status update manager
> I0508 20:05:16.188405 37612 mesos_containerizer.cpp:279] Recovering containerizer
> I0508 20:05:16.190035 37594 mem.cpp:165] Removing orphaned cgroup 'mesos_test_39687eb8-5053-4094-846f-d737be84f6e4/slave'
> I0508 20:05:16.190281 37608 cpushare.cpp:189] Removing orphaned cgroup 'cpuacct/mesos_test_39687eb8-5053-4094-846f-d737be84f6e4/slave'
> I0508 20:05:16.192466 37599 slave.cpp:2943] Finished recovery
> I0508 20:05:16.192909 37599 slave.cpp:524] New master detected at master@10.37.184.103:60182
> I0508 20:05:16.192977 37599 slave.cpp:584] Authenticating with master master@10.37.184.103:60182
> I0508 20:05:16.193020 37598 status_update_manager.cpp:167] New master detected at master@10.37.184.103:60182
> I0508 20:05:16.193830 37599 slave.cpp:557] Detecting new master
> I0508 20:05:16.193846 37589 authenticatee.hpp:128] Creating new client SASL connection
> I0508 20:05:16.194098 37592 master.cpp:2796] Authenticating slave(6)@10.37.184.103:60182
> I0508 20:05:16.194519 37602 authenticator.hpp:148] Creating new server SASL connection
> I0508 20:05:16.194674 37590 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5
> I0508 20:05:16.194768 37590 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5'
> I0508 20:05:16.195200 37607 authenticator.hpp:254] Received SASL authentication start
> I0508 20:05:16.195261 37607 authenticator.hpp:342] Authentication requires more steps
> I0508 20:05:16.195386 37606 authenticatee.hpp:265] Received SASL authentication step
> I0508 20:05:16.195545 37604 authenticator.hpp:282] Received SASL authentication step
> I0508 20:05:16.195613 37604 authenticator.hpp:334] Authentication success
> I0508 20:05:16.195688 37606 authenticatee.hpp:305] Authentication success
> I0508 20:05:16.195700 37604 master.cpp:2836] Successfully authenticated slave(6)@10.37.184.103:60182
> I0508 20:05:16.196056 37606 slave.cpp:641] Successfully authenticated with master master@10.37.184.103:60182
> I0508 20:05:16.196316 37594 master.cpp:2132] Registering slave at slave(6)@10.37.184.103:60182 (smfd-bkq-03-sr4.devel.twitter.com) with id 20140508-200516-1740121354-60182-37575-0
> I0508 20:05:16.196648 37603 registrar.cpp:422] Attempting to update the 'registry'
> I0508 20:05:16.199156 37607 log.cpp:680] Attempting to append 382 bytes to the log
> I0508 20:05:16.199429 37603 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 3
> I0508 20:05:16.200638 37602 replica.cpp:508] Replica received write request for position 3
> I0508 20:05:16.222914 37602 leveldb.cpp:341] Persisting action (401 bytes) to leveldb took 22.226643ms
> I0508 20:05:16.222998 37602 replica.cpp:676] Persisted action at 3
> I0508 20:05:16.223551 37605 replica.cpp:655] Replica received learned notice for position 3
> I0508 20:05:16.231291 37605 leveldb.cpp:341] Persisting action (403 bytes) to leveldb took 7.699023ms
> I0508 20:05:16.231374 37605 replica.cpp:676] Persisted action at 3
> I0508 20:05:16.231400 37605 replica.cpp:661] Replica learned APPEND action at position 3
> I0508 20:05:16.232504 37589 registrar.cpp:479] Successfully updated 'registry'
> I0508 20:05:16.232722 37594 log.cpp:699] Attempting to truncate the log to 3
> I0508 20:05:16.232893 37591 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 4
> I0508 20:05:16.233067 37610 master.cpp:2172] Registered slave 20140508-200516-1740121354-60182-37575-0 at slave(6)@10.37.184.103:60182 (smfd-bkq-03-sr4.devel.twitter.com)
> I0508 20:05:16.233126 37610 master.cpp:3281] Adding slave 20140508-200516-1740121354-60182-37575-0 at slave(6)@10.37.184.103:60182 (smfd-bkq-03-sr4.devel.twitter.com) with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
> I0508 20:05:16.233392 37598 slave.cpp:674] Registered with master master@10.37.184.103:60182; given slave ID 20140508-200516-1740121354-60182-37575-0
> I0508 20:05:16.233686 37598 slave.cpp:687] Checkpointing SlaveInfo to '/tmp/SlaveRecoveryTest_0_NonCheckpointingFramework_TXCwTS/meta/slaves/20140508-200516-1740121354-60182-37575-0/slave.info'
> I0508 20:05:16.233767 37600 hierarchical_allocator_process.hpp:444] Added slave 20140508-200516-1740121354-60182-37575-0 (smfd-bkq-03-sr4.devel.twitter.com) with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] (and cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] available)
> I0508 20:05:16.233808 37605 replica.cpp:508] Replica received write request for position 4
> I0508 20:05:16.234395 37595 master.cpp:2745] Sending 1 offers to framework 20140508-200516-1740121354-60182-37575-0000
> I0508 20:05:16.237540 37602 master.cpp:1810] Processing reply for offers: [ 20140508-200516-1740121354-60182-37575-0 ] on slave 20140508-200516-1740121354-60182-37575-0 at slave(6)@10.37.184.103:60182 (smfd-bkq-03-sr4.devel.twitter.com) for framework 20140508-200516-1740121354-60182-37575-0000
> I0508 20:05:16.237859 37602 master.hpp:584] Adding task 7d10204d-a2f6-4b6c-b4f0-922de0923728 with resources cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 20140508-200516-1740121354-60182-37575-0 (smfd-bkq-03-sr4.devel.twitter.com)
> I0508 20:05:16.237992 37602 master.cpp:2920] Launching task 7d10204d-a2f6-4b6c-b4f0-922de0923728 of framework 20140508-200516-1740121354-60182-37575-0000 with resources cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 20140508-200516-1740121354-60182-37575-0 at slave(6)@10.37.184.103:60182 (smfd-bkq-03-sr4.devel.twitter.com)
> I0508 20:05:16.238281 37612 slave.cpp:904] Got assigned task 7d10204d-a2f6-4b6c-b4f0-922de0923728 for framework 20140508-200516-1740121354-60182-37575-0000
> I0508 20:05:16.238638 37612 slave.cpp:1014] Launching task 7d10204d-a2f6-4b6c-b4f0-922de0923728 for framework 20140508-200516-1740121354-60182-37575-0000
> I0508 20:05:16.239637 37605 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 5.773305ms
> I0508 20:05:16.239725 37605 replica.cpp:676] Persisted action at 4
> I0508 20:05:16.240603 37594 replica.cpp:655] Replica received learned notice for position 4
> I0508 20:05:16.241828 37592 mesos_containerizer.cpp:523] Starting container '22acc835-edb6-4771-b9d6-0a11ca542aee' for executor '7d10204d-a2f6-4b6c-b4f0-922de0923728' of framework '20140508-200516-1740121354-60182-37575-0000'
> I0508 20:05:16.241981 37612 slave.cpp:1124] Queuing task '7d10204d-a2f6-4b6c-b4f0-922de0923728' for executor 7d10204d-a2f6-4b6c-b4f0-922de0923728 of framework '20140508-200516-1740121354-60182-37575-0000
> I0508 20:05:16.244410 37601 mem.cpp:413] Started listening for OOM events for container 22acc835-edb6-4771-b9d6-0a11ca542aee
> I0508 20:05:16.245115 37601 mem.cpp:277] Updated 'memory.soft_limit_in_bytes' to 1GB for container 22acc835-edb6-4771-b9d6-0a11ca542aee
> I0508 20:05:16.245321 37612 cpushare.cpp:334] Updated 'cpu.shares' to 2048 (cpus 2) for container 22acc835-edb6-4771-b9d6-0a11ca542aee
> I0508 20:05:16.246103 37601 mem.cpp:307] Updated 'memory.limit_in_bytes' to 1GB for container 22acc835-edb6-4771-b9d6-0a11ca542aee
> I0508 20:05:16.247835 37591 linux_launcher.cpp:222] Cloning child process with flags = 0
> I0508 20:05:16.247961 37594 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 7.292704ms
> I0508 20:05:16.248569 37594 leveldb.cpp:399] Deleting ~2 keys from leveldb took 132796ns
> I0508 20:05:16.248649 37594 replica.cpp:676] Persisted action at 4
> I0508 20:05:16.248749 37594 replica.cpp:661] Replica learned TRUNCATE action at position 4
> I0508 20:05:16.427090 37598 mesos_containerizer.cpp:623] Fetching URIs for container '22acc835-edb6-4771-b9d6-0a11ca542aee' using command '/home/bmahler/git/mesos/build/src/mesos-fetcher'
> I0508 20:05:16.761432 37594 slave.cpp:2281] Monitoring executor '7d10204d-a2f6-4b6c-b4f0-922de0923728' of framework '20140508-200516-1740121354-60182-37575-0000' in container '22acc835-edb6-4771-b9d6-0a11ca542aee'
> ../../src/tests/slave_recovery_tests.cpp:1167: Failure
> Failed to wait 10secs for update
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)