Posted to issues@mesos.apache.org by "Neil Conway (JIRA)" <ji...@apache.org> on 2017/06/02 23:38:04 UTC

[jira] [Commented] (MESOS-1606) Slave failed to checkpoint on Mac OS X

    [ https://issues.apache.org/jira/browse/MESOS-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035631#comment-16035631 ] 

Neil Conway commented on MESOS-1606:
------------------------------------

Perhaps a disk I/O error, e.g., due to a flaky disk?
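
One way to check that guess: the fatal line in the quoted log reports "No such file or directory" (ENOENT) rather than an I/O error (EIO). With an open-for-write, ENOENT typically means that some directory component of the target path does not exist at the moment of the open. The following is a minimal Python sketch of that distinction, not Mesos code; `try_checkpoint` and the temporary directory layout are hypothetical names used only for illustration.

```python
import errno
import os
import tempfile


def try_checkpoint(path, value):
    """Write value to path; return None on success, else the errno name."""
    try:
        with open(path, "w") as f:
            f.write(value)
        return None
    except OSError as e:
        # Map the numeric errno to its symbolic name, e.g. ENOENT or EIO.
        return errno.errorcode.get(e.errno, str(e.errno))


# Hypothetical layout that mimics '<work_dir>/0/meta/boot_id'.
base = tempfile.mkdtemp()
target = os.path.join(base, "0", "meta", "boot_id")

# The parent directories have not been created yet, so the open fails.
result_missing = try_checkpoint(target, "1405473915")

# Once the parent directories exist, the same write succeeds.
os.makedirs(os.path.dirname(target))
result_present = try_checkpoint(target, "1405473915")

print(result_missing, result_present)
```

If the failure can be reproduced, checking whether the `meta` directory still exists when the checkpoint runs would help separate a missing-directory problem from a genuinely failing disk.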

> Slave failed to checkpoint on Mac OS X
> --------------------------------------
>
>                 Key: MESOS-1606
>                 URL: https://issues.apache.org/jira/browse/MESOS-1606
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>         Environment: Mac OS X, Darwin Kernel Version 13.3.0
>            Reporter: Zuyu Zhang
>
> {noformat}
> This bug happens to test_framework and LowLevelSchedulerLibprocess as well.
> [ RUN      ] ExamplesTest.LowLevelSchedulerPthread
> Using temporary directory '/tmp/ExamplesTest_LowLevelSchedulerPthread_SCL6Al'
> Enabling authentication for the scheduler
> I0715 19:03:59.296200 2019271440 scheduler.cpp:132] Version: 0.20.0
> I0715 19:03:59.300429 2019271440 leveldb.cpp:176] Opened db in 1982us
> I0715 19:03:59.300900 2019271440 leveldb.cpp:183] Compacted db in 447us
> I0715 19:03:59.300946 2019271440 leveldb.cpp:198] Created db iterator in 27us
> I0715 19:03:59.300978 2019271440 leveldb.cpp:204] Seeked to beginning of db in 16us
> I0715 19:03:59.301007 2019271440 leveldb.cpp:273] Iterated through 0 keys in the db in 20us
> I0715 19:03:59.301053 2019271440 replica.cpp:741] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> I0715 19:03:59.301713 222965760 recover.cpp:425] Starting replica recovery
> I0715 19:03:59.301914 222965760 recover.cpp:451] Replica is in EMPTY status
> I0715 19:03:59.302671 221892608 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I0715 19:03:59.302781 224575488 recover.cpp:188] Received a recover response from a replica in EMPTY status
> I0715 19:03:59.303050 225112064 recover.cpp:542] Updating replica status to STARTING
> I0715 19:03:59.303432 222965760 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 298us
> I0715 19:03:59.303475 222965760 replica.cpp:320] Persisted replica status to STARTING
> I0715 19:03:59.303540 221356032 recover.cpp:451] Replica is in STARTING status
> I0715 19:03:59.303797 224575488 master.cpp:288] Master 20140715-190359-16777343-64313-60122 (localhost) started on 127.0.0.1:64313
> I0715 19:03:59.303848 224575488 master.cpp:325] Master only allowing authenticated frameworks to register
> I0715 19:03:59.303865 224575488 master.cpp:332] Master allowing unauthenticated slaves to register
> I0715 19:03:59.303884 224575488 credentials.hpp:36] Loading credentials for authentication from '/tmp/ExamplesTest_LowLevelSchedulerPthread_SCL6Al/credentials'
> W0715 19:03:59.303961 224575488 credentials.hpp:51] Permissions on credentials file '/tmp/ExamplesTest_LowLevelSchedulerPthread_SCL6Al/credentials' are too open. It is recommended that your credentials file is NOT accessible by others.
> I0715 19:03:59.304028 224575488 master.cpp:359] Authorization enabled
> I0715 19:03:59.304379 223502336 replica.cpp:638] Replica in STARTING status received a broadcasted recover request
> I0715 19:03:59.304505 2019271440 containerizer.cpp:124] Using isolation: posix/cpu,posix/mem
> I0715 19:03:59.304666 223502336 recover.cpp:188] Received a recover response from a replica in STARTING status
> I0715 19:03:59.304805 223502336 recover.cpp:542] Updating replica status to VOTING
> I0715 19:03:59.305186 223502336 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 214us
> I0715 19:03:59.305219 223502336 replica.cpp:320] Persisted replica status to VOTING
> I0715 19:03:59.305250 223502336 recover.cpp:556] Successfully joined the Paxos group
> I0715 19:03:59.305361 223502336 recover.cpp:440] Recover process terminated
> I0715 19:03:59.305927 224038912 slave.cpp:168] Slave started on 1)@127.0.0.1:64313
> I0715 19:03:59.306221 224038912 slave.cpp:279] Slave resources: cpus(*):4; mem(*):7168; disk(*):470714; ports(*):[31000-32000]
> I0715 19:03:59.306234 2019271440 containerizer.cpp:124] Using isolation: posix/cpu,posix/mem
> I0715 19:03:59.306248 223502336 master.cpp:1128] The newly elected leader is master@127.0.0.1:64313 with id 20140715-190359-16777343-64313-60122
> I0715 19:03:59.306269 223502336 master.cpp:1141] Elected as the leading master!
> I0715 19:03:59.306293 223502336 master.cpp:959] Recovering from registrar
> I0715 19:03:59.306395 225112064 registrar.cpp:313] Recovering registrar
> I0715 19:03:59.306617 221892608 log.cpp:656] Attempting to start the writer
> I0715 19:03:59.306952 224575488 slave.cpp:168] Slave started on 2)@127.0.0.1:64313
> I0715 19:03:59.307158 224575488 slave.cpp:279] Slave resources: cpus(*):4; mem(*):7168; disk(*):470714; ports(*):[31000-32000]
> I0715 19:03:59.307207 222965760 replica.cpp:474] Replica received implicit promise request with proposal 1
> I0715 19:03:59.307401 224038912 slave.cpp:324] Slave hostname: localhost
> I0715 19:03:59.307459 224038912 slave.cpp:325] Slave checkpoint: true
> I0715 19:03:59.307446 222965760 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 232us
> I0715 19:03:59.307512 222965760 replica.cpp:342] Persisted promised to 1
> I0715 19:03:59.307615 224575488 slave.cpp:324] Slave hostname: localhost
> I0715 19:03:59.307631 224575488 slave.cpp:325] Slave checkpoint: true
> I0715 19:03:59.307802 222965760 coordinator.cpp:230] Coordinator attemping to fill missing position
> I0715 19:03:59.307924 223502336 state.cpp:33] Recovering state from '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/0/meta'
> I0715 19:03:59.308027 2019271440 containerizer.cpp:124] Using isolation: posix/cpu,posix/mem
> I0715 19:03:59.308171 222429184 status_update_manager.cpp:193] Recovering status update manager
> I0715 19:03:59.308205 225112064 state.cpp:33] Recovering state from '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/1/meta'
> I0715 19:03:59.308316 221892608 containerizer.cpp:287] Recovering containerizer
> I0715 19:03:59.308384 221356032 status_update_manager.cpp:193] Recovering status update manager
> I0715 19:03:59.308575 225112064 containerizer.cpp:287] Recovering containerizer
> I0715 19:03:59.309072 222429184 slave.cpp:3130] Finished recovery
> I0715 19:03:59.309079 223502336 slave.cpp:3130] Finished recovery
> F0715 19:03:59.309267 222429184 slave.cpp:3141] CHECK_SOME(state::checkpoint(path, bootId.get())): Failed to checkpoint '1405473915' to '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/0/meta/boot_id': Failed to open file '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/0/meta/boot_id': No such file or directory
> *** Check failure stack trace: ***
> I0715 19:03:59.309270 221892608 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2
> I0715 19:03:59.309516 221892608 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 219us
> I0715 19:03:59.309502 223502336 slave.cpp:168] Slave started on 3)@127.0.0.1:64313
> I0715 19:03:59.309582 222965760 slave.cpp:603] New master detected at master@127.0.0.1:64313
> I0715 19:03:59.309588 221892608 replica.cpp:676] Persisted action at 0
> I0715 19:03:59.309665 222965760 slave.cpp:639] No credentials provided. Attempting to register without authentication
> I0715 19:03:59.309685 225112064 status_update_manager.cpp:167] New master detected at master@127.0.0.1:64313
> I0715 19:03:59.309798 223502336 slave.cpp:279] Slave resources: cpus(*):4; mem(*):7168; disk(*):470714; ports(*):[31000-32000]
> I0715 19:03:59.310104 224038912 replica.cpp:508] Replica received write request for position 0
> I0715 19:03:59.310331 222965760 slave.cpp:652] Detecting new master
> I0715 19:03:59.310395 224038912 leveldb.cpp:438] Reading position from leveldb took 30us
> I0715 19:03:59.310642 223502336 slave.cpp:324] Slave hostname: localhost
> I0715 19:03:59.310657 223502336 slave.cpp:325] Slave checkpoint: true
> I0715 19:03:59.310689 224038912 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 227us
> I0715 19:03:59.310722 224038912 replica.cpp:676] Persisted action at 0
> I0715 19:03:59.310936 222965760 replica.cpp:655] Replica received learned notice for position 0
> I0715 19:03:59.311103 222965760 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 160us
>     @        0x10b3d54f9  google::LogMessage::SendToLog()
> I0715 19:03:59.311158 221892608 state.cpp:33] Recovering state from '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/2/meta'
> I0715 19:03:59.311436 222965760 replica.cpp:676] Persisted action at 0
> I0715 19:03:59.311514 222965760 replica.cpp:661] Replica learned NOP action at position 0
> I0715 19:03:59.311544 221892608 status_update_manager.cpp:193] Recovering status update manager
> I0715 19:03:59.311612 221892608 containerizer.cpp:287] Recovering containerizer
> I0715 19:03:59.311643 222965760 log.cpp:672] Writer started with ending position 0
>     @        0x10b3d5a24  google::LogMessage::Flush()
> I0715 19:03:59.311983 225112064 slave.cpp:3130] Finished recovery
>     @        0x10b3d8b0f  google::LogMessageFatal::~LogMessageFatal()
> I0715 19:03:59.312419 224038912 leveldb.cpp:438] Reading position from leveldb took 43us
> I0715 19:03:59.312515 222965760 slave.cpp:603] New master detected at master@127.0.0.1:64313
> I0715 19:03:59.312854 222965760 slave.cpp:639] No credentials provided. Attempting to register without authentication
> I0715 19:03:59.312891 222965760 slave.cpp:652] Detecting new master
> I0715 19:03:59.312924 222965760 status_update_manager.cpp:167] New master detected at master@127.0.0.1:64313
>     @        0x10b3d60f9  google::LogMessageFatal::~LogMessageFatal()
>     @        0x10ad381b3  _CheckFatal::~_CheckFatal()
>     @        0x10ad37a29  _CheckFatal::~_CheckFatal()
>     @        0x10af8371f  mesos::internal::slave::Slave::__recover()
>     @        0x10b30df43  process::ProcessBase::visit()
>     @        0x10b304d44  process::ProcessManager::resume()
>     @        0x10b30488f  process::schedule()
>     @     0x7fff907b0899  _pthread_body
>     @     0x7fff907b072a  _pthread_start
>     @     0x7fff907b4fc9  thread_start
> ../../src/tests/script.cpp:85: Failure
> Failed
> low_level_scheduler_pthread_test.sh terminated with signal Abort trap: 6
> make[3]: *** [check-local] Segmentation fault: 11
> make[2]: *** [check-am] Error 2
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)