Posted to issues@mesos.apache.org by "Joris Van Remoortere (JIRA)" <ji...@apache.org> on 2015/05/25 22:34:36 UTC
[jira] [Issue Comment Deleted] (MESOS-2254) Posix CPU isolator usage call introduces high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van Remoortere updated MESOS-2254:
----------------------------------------
Comment: was deleted
(was: [~marco-mesos] The endpoint is already rate-limited using a {{process::RateLimiter}} that permits 2 calls per second. The main concern is that even a single call to this API gets more expensive as N executors scan all P processes on the system (N*P work per call).
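The rate limiting described above can be approximated by a small sliding-window limiter. This is a hypothetical, self-contained sketch (the class and method names are illustrative, not the actual {{process::RateLimiter}} API), showing the "2 permits per interval" behavior:

```cpp
#include <chrono>
#include <deque>

// Sketch of a rate limiter permitting at most `permits` calls per `interval`,
// mirroring the behavior of a limiter configured for 2 calls per second.
// Illustrative only; not the libprocess implementation.
class SimpleRateLimiter {
public:
  SimpleRateLimiter(int permits, std::chrono::milliseconds interval)
    : permits_(permits), interval_(interval) {}

  // Returns true if a call is admitted at time `now`, false if throttled.
  bool tryAcquire(std::chrono::steady_clock::time_point now) {
    // Drop timestamps that have aged out of the sliding window.
    while (!recent_.empty() && now - recent_.front() >= interval_) {
      recent_.pop_front();
    }
    if (static_cast<int>(recent_.size()) >= permits_) {
      return false;  // window is full: reject this call
    }
    recent_.push_back(now);
    return true;
  }

private:
  const int permits_;
  const std::chrono::milliseconds interval_;
  std::deque<std::chrono::steady_clock::time_point> recent_;
};
```

Note that such a limiter only bounds the call frequency; it does nothing about the per-call N*P cost, which is the actual concern here.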
There are opportunities to cache; however, caching introduces decisions about when to invalidate the cache (on a time-based interval? after some number of requests?) as well as the risk of serving stale data. Since the intent of this call is to get a current snapshot of usage data, I would prefer to avoid introducing explicit caching, and instead pass along enough "information" to allow re-use of the data within the same "call" (batching).
In this particular case, the reason we are performing the (N*P) work is that the containerizer calls the usage function on the isolator once for each container. In my opinion this is the cleanest place to "cache", although I would prefer to call it "batch". The isolator loses the "information" that we are asking for a snapshot of all containers; rather, it thinks we are asking for N independent snapshots.
My proposal would be to modify the interface to allow a batched version of the call, so that the usage call can re-use any data it collects. I think this is the cleanest way to control when we recompute / invalidate the data.
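The N-scans-vs-one-scan difference behind the batching proposal can be sketched in a few lines. This is a hypothetical model, not Mesos code: the struct and function names are illustrative, and a fake in-memory process table stands in for scanning all P processes on the host:

```cpp
#include <map>
#include <set>
#include <vector>

// One entry per process on the host (stand-in for /proc or sysctl data).
struct ProcessInfo {
  int pid;
  int containerId;   // which container this process belongs to
  double cpuMillis;  // CPU time consumed
};

// Counts full table scans, to make the N-vs-1 cost visible.
static int scans = 0;

std::vector<ProcessInfo> scanAllProcesses(
    const std::vector<ProcessInfo>& table) {
  ++scans;  // each call models one expensive walk of all P processes
  return table;
}

// Per-container usage: the containerizer calls this once per container,
// so N containers cost N full scans -- the N*P behavior described above.
double usage(const std::vector<ProcessInfo>& table, int containerId) {
  double total = 0.0;
  for (const ProcessInfo& p : scanAllProcesses(table)) {
    if (p.containerId == containerId) total += p.cpuMillis;
  }
  return total;
}

// Batched usage: one scan serves every container in the same "call",
// re-using the collected data instead of recomputing it N times.
std::map<int, double> usageBatch(const std::vector<ProcessInfo>& table,
                                 const std::set<int>& containerIds) {
  std::map<int, double> totals;
  for (int id : containerIds) totals[id] = 0.0;
  for (const ProcessInfo& p : scanAllProcesses(table)) {
    auto it = totals.find(p.containerId);
    if (it != totals.end()) it->second += p.cpuMillis;
  }
  return totals;
}
```

With N containers, the per-container path performs N scans while the batched path performs exactly one, which is the invalidation point as well: the snapshot lives only for the duration of that single batched call.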
There is also the opportunity to reduce the full stats parsing to just the subset of pids that we are interested in. That alone would provide a ~30x improvement.
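The subset idea can be modeled the same way. This is again a hypothetical sketch with illustrative names: a counter stands in for the per-pid cost of parsing /proc/<pid>/stat (or the sysctl equivalent on OS X), so the savings from statting only container pids are visible:

```cpp
#include <set>
#include <vector>

// Counts per-pid parses, modeling the cost of one stat/sysctl lookup.
static int parsed = 0;

// Models parsing the statistics of a single pid. Illustrative stand-in.
int parseStat(int pid) {
  ++parsed;
  return pid;  // stand-in for the parsed statistics
}

// Current behavior: stat every process on the system (P parses),
// as os::processes() effectively does today.
std::vector<int> statAll(const std::vector<int>& allPids) {
  std::vector<int> out;
  for (int pid : allPids) out.push_back(parseStat(pid));
  return out;
}

// Proposed behavior: stat only the pids belonging to our containers
// (|pids| parses instead of P).
std::vector<int> statSubset(const std::set<int>& interesting) {
  std::vector<int> out;
  for (int pid : interesting) out.push_back(parseStat(pid));
  return out;
}
```

For example, with ~3000 processes on the host but only ~100 processes across all containers, the subset version does 30x fewer parses, matching the ~30x estimate above.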
P.S. this problem can also be completely avoided by calling into a kernel module that exposes the right information efficiently ;-))
> Posix CPU isolator usage call introduces high cpu load
> ------------------------------------------------------
>
> Key: MESOS-2254
> URL: https://issues.apache.org/jira/browse/MESOS-2254
> Project: Mesos
> Issue Type: Bug
> Reporter: Niklas Quarfot Nielsen
>
> With more than 20 executors running on a slave with the posix isolator, we have seen a very high cpu load (over 200%).
> Profiling one of the two threads that were consuming all the CPU time (over 200% in total) shows:
> {code}
> Running Time Self Symbol Name
> 27133.0ms 47.8% 0.0 _pthread_body 0x1adb50
> 27133.0ms 47.8% 0.0 thread_start
> 27133.0ms 47.8% 0.0 _pthread_start
> 27133.0ms 47.8% 0.0 _pthread_body
> 27133.0ms 47.8% 0.0 process::schedule(void*)
> 27133.0ms 47.8% 2.0 process::ProcessManager::resume(process::ProcessBase*)
> 27126.0ms 47.8% 1.0 process::ProcessBase::serve(process::Event const&)
> 27125.0ms 47.8% 0.0 process::DispatchEvent::visit(process::EventVisitor*) const
> 27125.0ms 47.8% 0.0 process::ProcessBase::visit(process::DispatchEvent const&)
> 27125.0ms 47.8% 0.0 std::__1::function<void (process::ProcessBase*)>::operator()(process::ProcessBase*) const
> 27124.0ms 47.8% 0.0 std::__1::__function::__func<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*), std::__1::allocator<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void (process::ProcessBase*)>::operator()(process::ProcessBase*&&)
> 27124.0ms 47.8% 1.0 process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*) const
> 27060.0ms 47.7% 1.0 mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID const&)
> 27046.0ms 47.7% 2.0 mesos::internal::usage(int, bool, bool)
> 27023.0ms 47.6% 2.0 os::pstree(Option<int>)
> 26748.0ms 47.1% 23.0 os::processes()
> 24809.0ms 43.7% 349.0 os::process(int)
> 8199.0ms 14.4% 47.0 os::sysctl::string() const
> 7562.0ms 13.3% 7562.0 __sysctl
> {code}
> We could see that usage() in usage/usage.cpp is causing this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)