Posted to issues@mesos.apache.org by "Bernd Mathiske (JIRA)" <ji...@apache.org> on 2015/02/27 15:59:04 UTC

[jira] [Commented] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.

    [ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340223#comment-14340223 ] 

Bernd Mathiske commented on MESOS-391:
--------------------------------------

I am not sure we have fully identified the root cause yet. For typical Linux file systems, the error message "Too many links" from mkdir() is AFAIK supposed to hint at too many entries in a directory, despite the somewhat misleading phrasing. This would mean a flat entry count of the parent directory where we call mkdir(). However, after scanning all kinds of documentation on this, I see no clear picture, and the behavior may vary by OS and file system. We might also see this error due to a recursive count from higher up in the directory tree, which would ultimately mean that we are running into a volume limit.

I suppose reading Linux and ext3 source code could be next? But what if we run on a different system next time?
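
Short of reading the source, the flat-count hypothesis could be probed empirically on the machine in question. POSIX specifies that mkdir() fails with EMLINK when the link count of the parent directory would exceed {LINK_MAX}, and on classic Unix file systems each subdirectory's ".." entry counts as one link to its parent. The following is only a sketch under those assumptions; the helper name probeSubdirLinkDelta and the scratch directory name are made up for illustration and are not part of Mesos.

```cpp
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstdlib>
#include <string>

// Hypothetical helper (not a Mesos API): creates a scratch directory under
// `base`, adds one subdirectory to it, and stores in *delta how much the
// scratch directory's st_nlink changed. Returns false if any step fails.
bool probeSubdirLinkDelta(const std::string& base, long* delta)
{
  // mkdtemp() replaces the trailing XXXXXX in place.
  std::string scratch = base + "/nlink-probe-XXXXXX";
  if (::mkdtemp(&scratch[0]) == nullptr) {
    return false;
  }

  const std::string child = scratch + "/child";
  struct stat before;
  struct stat after;

  const bool ok = ::stat(scratch.c_str(), &before) == 0 &&
                  ::mkdir(child.c_str(), 0755) == 0 &&
                  ::stat(scratch.c_str(), &after) == 0;

  if (ok) {
    *delta = static_cast<long>(after.st_nlink) -
             static_cast<long>(before.st_nlink);
  }

  // Best-effort cleanup; rmdir() on a missing child fails harmlessly.
  ::rmdir(child.c_str());
  ::rmdir(scratch.c_str());
  return ok;
}
```

A delta of 1 would support the flat-count reading for that file system: every new subdirectory consumes one link of its parent. A delta of 0 would suggest the file system does not track subdirectories in st_nlink, in which case the limit must come from somewhere else.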

Agreed, it is extremely likely that the problem is caused by executor/sandbox directory creation outrunning GC, and that speeding up GC early enough might help. But what should we measure to anticipate this? To answer that, I suggest we first reproduce the problem. To do so, I would like to ask [~bmahler], who reported the issue, for more specifics about how this was run:

- What OS and file system was this running on? Knowing the system setup in question may help answer most if not all of the questions in this post.
- Were there many runs with the same executor? This would hint at a flat directory entry count problem.
- Or was it a different executor at least most of the time? This would hint at a volume limit.

BTW, LINK_MAX is defined differently in different contexts. I have seen values of 127 and 32000 and others, for different purposes. Which definition of LINK_MAX is meant here? 
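
Since the compile-time constant is ambiguous, one option might be to ask at run time what limit applies to a particular path. POSIX provides pathconf() with _PC_LINK_MAX for this. A minimal sketch, assuming the reported value is trustworthy for the file system in use (which would need verifying); queryLinkMax is a hypothetical name:

```cpp
#include <unistd.h>

#include <cerrno>
#include <string>

// Hypothetical helper (not a Mesos API): stores in *limit the maximum link
// count reported for `path`. A stored value of -1 means no fixed limit was
// reported. Returns false if pathconf() itself failed.
bool queryLinkMax(const std::string& path, long* limit)
{
  // pathconf() returns -1 both on error and when the limit is
  // indeterminate; only the error case sets errno.
  errno = 0;
  const long result = ::pathconf(path.c_str(), _PC_LINK_MAX);
  if (result == -1 && errno != 0) {
    return false;
  }

  *limit = result;
  return true;
}
```

Calling this with the directory that mkdir() failed in would at least tell us which number the C library believes applies there.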

If we take into account "st_nlinks" (the field in struct stat is actually named st_nlink) as suggested, then for which path? For the whole sandbox volume?
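
If the flat-count reading turns out to be right, one candidate answer is the directory in which mkdir() is called, since that is the directory whose link count grows. The sketch below shows what such a measurement might look like; it is an assumption about a possible approach, not the agreed fix, and linkHeadroom is a hypothetical name.

```cpp
#include <sys/stat.h>
#include <unistd.h>

#include <cerrno>
#include <climits>
#include <string>

// Hypothetical helper (not a Mesos API): stores in *headroom how many more
// links `dir` can take before reaching the limit reported by pathconf().
// Stores LONG_MAX when no fixed limit is reported. Returns false on error.
bool linkHeadroom(const std::string& dir, long* headroom)
{
  struct stat s;
  if (::stat(dir.c_str(), &s) != 0) {
    return false;
  }

  // pathconf() returns -1 with errno untouched when there is no fixed limit.
  errno = 0;
  const long limit = ::pathconf(dir.c_str(), _PC_LINK_MAX);
  if (limit == -1) {
    if (errno != 0) {
      return false;
    }
    *headroom = LONG_MAX;
    return true;
  }

  *headroom = limit - static_cast<long>(s.st_nlink);
  return true;
}
```

The GC could then be expedited once the headroom of a directory falls below some threshold. What that threshold should be, and which directories to check, are open questions.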



> Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-391
>                 URL: https://issues.apache.org/jira/browse/MESOS-391
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Ritwik Yadav
>              Labels: newbie, twitter
>
> The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC.
> As a result of this, the slave crashes:
> F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-1937777162-5050-38880-267/frameworks/201103282247-0000000019-0000/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links
> *** Check failure stack trace: ***
>     @     0x7f9320f82f9d  google::LogMessage::Fail()
>     @     0x7f9320f88c07  google::LogMessage::SendToLog()
>     @     0x7f9320f8484c  google::LogMessage::Flush()
>     @     0x7f9320f84ab6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f9320c70312  _CheckSome::~_CheckSome()
>     @     0x7f9320c9dd5c  mesos::internal::slave::paths::createExecutorDirectory()
>     @     0x7f9320c9e60d  mesos::internal::slave::Framework::createExecutor()
>     @     0x7f9320c7a7f7  mesos::internal::slave::Slave::runTask()
>     @     0x7f9320c9cb43  ProtobufProcess<>::handler4<>()
>     @     0x7f9320c8678b  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f9320c9d1ab  ProtobufProcess<>::visit()
>     @     0x7f9320e4c774  process::MessageEvent::visit()
>     @     0x7f9320e40a1d  process::ProcessManager::resume()
>     @     0x7f9320e41268  process::schedule()
>     @     0x7f932055973d  start_thread
>     @     0x7f931ef3df6d  clone
> The fix here is to take into account the number of links (st_nlinks), when determining whether we need to GC.


