Posted to issues@mesos.apache.org by "Ritwik Yadav (JIRA)" <ji...@apache.org> on 2014/12/21 15:01:13 UTC
[jira] [Issue Comment Deleted] (MESOS-391) Slave GarbageCollector
needs to also take into account the number of links, when determining
removal time.
[ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ritwik Yadav updated MESOS-391:
-------------------------------
Comment: was deleted
(was: On second thought, why should there be a global link_max flag? Wouldn't different slaves have different resources available to them? A global flag would effectively have to be set to min(available_disk_on_slave1, available_disk_on_slave2, available_disk_on_slave3, ..., available_disk_on_slaveN). Wouldn't it make more sense to determine on the fly, after every call to checkDiskUsage, how many extra links a slave can create based on how much space is left, and to cache that number for use when creating new links until the next call to checkDiskUsage?)
> Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
> ----------------------------------------------------------------------------------------------------------
>
> Key: MESOS-391
> URL: https://issues.apache.org/jira/browse/MESOS-391
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Mahler
> Assignee: Ritwik Yadav
> Labels: newbie, twitter
>
> The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC.
> As a result of this, the slave crashes:
> F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-1937777162-5050-38880-267/frameworks/201103282247-0000000019-0000/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links
> *** Check failure stack trace: ***
> @ 0x7f9320f82f9d google::LogMessage::Fail()
> @ 0x7f9320f88c07 google::LogMessage::SendToLog()
> @ 0x7f9320f8484c google::LogMessage::Flush()
> @ 0x7f9320f84ab6 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f9320c70312 _CheckSome::~_CheckSome()
> @ 0x7f9320c9dd5c mesos::internal::slave::paths::createExecutorDirectory()
> @ 0x7f9320c9e60d mesos::internal::slave::Framework::createExecutor()
> @ 0x7f9320c7a7f7 mesos::internal::slave::Slave::runTask()
> @ 0x7f9320c9cb43 ProtobufProcess<>::handler4<>()
> @ 0x7f9320c8678b std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7f9320c9d1ab ProtobufProcess<>::visit()
> @ 0x7f9320e4c774 process::MessageEvent::visit()
> @ 0x7f9320e40a1d process::ProcessManager::resume()
> @ 0x7f9320e41268 process::schedule()
> @ 0x7f932055973d start_thread
> @ 0x7f931ef3df6d clone
> The fix here is to take into account the number of links (st_nlink) when determining whether we need to GC.
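A minimal sketch of the check described above, assuming a POSIX system. It is not the Mesos implementation: linkUsage, shouldGcForLinks, and the 0.9 threshold are hypothetical names and values chosen for illustration. Only stat(), st_nlink, and pathconf() with _PC_LINK_MAX are standard POSIX. On many traditional filesystems a directory's st_nlink grows by one per subdirectory, which is why creating many executor directories can end in "Too many links".

```cpp
#include <sys/stat.h>
#include <unistd.h>

#include <string>

// Hypothetical helper (not part of Mesos): returns the fraction of the
// filesystem's link limit already used by 'directory', or -1.0 if either
// the link count or the limit cannot be determined.
double linkUsage(const std::string& directory)
{
  struct stat s;
  if (::stat(directory.c_str(), &s) != 0) {
    return -1.0; // Path missing or not accessible.
  }

  // pathconf() returns -1 both on error and when no limit is defined;
  // this sketch treats both cases as "unknown".
  long linkMax = ::pathconf(directory.c_str(), _PC_LINK_MAX);
  if (linkMax <= 0) {
    return -1.0;
  }

  return static_cast<double>(s.st_nlink) / static_cast<double>(linkMax);
}

// Hypothetical policy: request GC once the link count reaches 'threshold'
// (an assumed default of 90%) of the limit. Unknown usage never triggers GC.
bool shouldGcForLinks(const std::string& directory, double threshold = 0.9)
{
  double usage = linkUsage(directory);
  return usage >= 0.0 && usage >= threshold;
}
```

A garbage collector could call shouldGcForLinks on a directory such as the "executors" or "runs" parent from the failing path, alongside its existing disk usage check, and schedule removal early when it returns true.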
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)