Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2013/03/14 03:02:12 UTC

[jira] [Created] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.

Benjamin Mahler created MESOS-391:
-------------------------------------

             Summary: Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
                 Key: MESOS-391
                 URL: https://issues.apache.org/jira/browse/MESOS-391
             Project: Mesos
          Issue Type: Bug
            Reporter: Benjamin Mahler


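The slave garbage collector does not take into account the number of links on a directory when determining removal time. Each executor directory we create adds a link to its parent, so if we create a lot of executor directories (up to LINK_MAX), nothing guarantees that we GC before the limit is reached.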

As a result of this, the slave crashes:

F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-1937777162-5050-38880-267/frameworks/201103282247-0000000019-0000/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links
*** Check failure stack trace: ***
    @     0x7f9320f82f9d  google::LogMessage::Fail()
    @     0x7f9320f88c07  google::LogMessage::SendToLog()
    @     0x7f9320f8484c  google::LogMessage::Flush()
    @     0x7f9320f84ab6  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f9320c70312  _CheckSome::~_CheckSome()
    @     0x7f9320c9dd5c  mesos::internal::slave::paths::createExecutorDirectory()
    @     0x7f9320c9e60d  mesos::internal::slave::Framework::createExecutor()
    @     0x7f9320c7a7f7  mesos::internal::slave::Slave::runTask()
    @     0x7f9320c9cb43  ProtobufProcess<>::handler4<>()
    @     0x7f9320c8678b  std::tr1::_Function_handler<>::_M_invoke()
    @     0x7f9320c9d1ab  ProtobufProcess<>::visit()
    @     0x7f9320e4c774  process::MessageEvent::visit()
    @     0x7f9320e40a1d  process::ProcessManager::resume()
    @     0x7f9320e41268  process::schedule()
    @     0x7f932055973d  start_thread
    @     0x7f931ef3df6d  clone

The fix here is to take into account the number of links (st_nlink) when determining whether we need to GC.
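
A minimal sketch of what such a check could look like, assuming a POSIX system. The helper names (linkUsage, shouldGcForLinks) and the 0.9 threshold are hypothetical and are not part of the Mesos code base; only stat(), pathconf(), st_nlink and _PC_LINK_MAX are real POSIX names.

```cpp
#include <sys/stat.h>
#include <unistd.h>

#include <string>

// Hypothetical helper (not part of Mesos): returns the fraction of the
// filesystem's link limit currently used by 'dir', or -1.0 if either the
// link count or the limit cannot be determined.
double linkUsage(const std::string& dir)
{
  struct stat s;
  if (::stat(dir.c_str(), &s) != 0) {
    return -1.0;
  }

  // pathconf() returns -1 on error or when the limit is indeterminate.
  long max = ::pathconf(dir.c_str(), _PC_LINK_MAX);
  if (max <= 0) {
    return -1.0;
  }

  return static_cast<double>(s.st_nlink) / static_cast<double>(max);
}

// Hypothetical policy: ask for an early GC once link usage crosses
// 'threshold'. The default of 0.9 is an assumption, not a Mesos value.
bool shouldGcForLinks(const std::string& dir, double threshold = 0.9)
{
  double usage = linkUsage(dir);
  return usage >= 0.0 && usage >= threshold;
}
```

The garbage collector could call something like shouldGcForLinks() on the parent directory (for example an executor's 'runs' directory) before creating a new subdirectory, and shorten the removal delay when it returns true. Note that some filesystems do not count subdirectories in st_nlink, so this check would only help where they do.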
