Posted to issues@mesos.apache.org by "Ritwik Yadav (JIRA)" <ji...@apache.org> on 2014/12/18 20:39:14 UTC

[jira] [Commented] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.

    [ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252117#comment-14252117 ] 

Ritwik Yadav commented on MESOS-391:
------------------------------------

I read through the garbage collector implementation in the slave process. It seems to take the amount of disk space consumed into account when determining the delay before garbage collection. From my understanding:
The amount of disk space used is directly proportional to the number of directories (i.e. the number of links). In that case, aren't we already taking this into account?
I guess the trouble arises when the slave creates a lot of links (new directories) within the disk_watch_interval, causing it to crash before the next disk check runs. Am I right?
If so, a first-level approach that comes to mind is to add a flag, say link_max, and associate a counter with every slave. On every creation of a new link, we check whether the counter exceeds link_max and, if it does, trigger garbage collection immediately (rough sketch below).
Please let me know if I have understood the problem correctly, so that I can go ahead and make the necessary changes.
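Roughly what I have in mind (just a sketch with made-up names -- LinkCounter, link_max, onLinkCreated -- not the actual slave code):

    #include <cstddef>

    // Rough sketch of the counter idea described above. The class and
    // member names are made up for illustration and are not part of the
    // actual Mesos slave code.
    class LinkCounter
    {
    public:
      explicit LinkCounter(size_t linkMax) : linkMax(linkMax), count(0) {}

      // Called each time the slave creates a new executor directory
      // (i.e. a new link). Returns true when garbage collection should
      // be triggered immediately instead of waiting for the next
      // disk_watch_interval.
      bool onLinkCreated()
      {
        if (++count >= linkMax) {
          count = 0;
          return true;
        }
        return false;
      }

    private:
      const size_t linkMax;
      size_t count;
    };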

> Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-391
>                 URL: https://issues.apache.org/jira/browse/MESOS-391
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>              Labels: newbie, twitter
>
> The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC.
> As a result of this, the slave crashes:
> F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-1937777162-5050-38880-267/frameworks/201103282247-0000000019-0000/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links
> *** Check failure stack trace: ***
>     @     0x7f9320f82f9d  google::LogMessage::Fail()
>     @     0x7f9320f88c07  google::LogMessage::SendToLog()
>     @     0x7f9320f8484c  google::LogMessage::Flush()
>     @     0x7f9320f84ab6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f9320c70312  _CheckSome::~_CheckSome()
>     @     0x7f9320c9dd5c  mesos::internal::slave::paths::createExecutorDirectory()
>     @     0x7f9320c9e60d  mesos::internal::slave::Framework::createExecutor()
>     @     0x7f9320c7a7f7  mesos::internal::slave::Slave::runTask()
>     @     0x7f9320c9cb43  ProtobufProcess<>::handler4<>()
>     @     0x7f9320c8678b  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f9320c9d1ab  ProtobufProcess<>::visit()
>     @     0x7f9320e4c774  process::MessageEvent::visit()
>     @     0x7f9320e40a1d  process::ProcessManager::resume()
>     @     0x7f9320e41268  process::schedule()
>     @     0x7f932055973d  start_thread
>     @     0x7f931ef3df6d  clone
> The fix here is to take into account the number of links (st_nlink) when determining whether we need to GC.
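For illustration, a minimal sketch of the kind of check the description points at: reading the parent directory's hard-link count via stat(2) and comparing it against a limit. The helper name and threshold parameter are hypothetical, not actual Mesos code.

    #include <sys/types.h>
    #include <sys/stat.h>

    #include <string>

    // Sketch only: check the hard-link count of the directory that holds
    // the executor directories. Every subdirectory adds a link to its
    // parent, so st_nlink on the parent tracks how many executor
    // directories exist underneath it.
    static bool nearLinkLimit(const std::string& dir, nlink_t threshold)
    {
      struct stat s;
      if (::stat(dir.c_str(), &s) != 0) {
        return false; // stat failed; let the caller fall back to disk-based GC.
      }
      return s.st_nlink >= threshold;
    }

The GC delay could then be shortened as st_nlink approaches the filesystem's link limit, analogous to how it is already shortened as free disk space shrinks.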


