Posted to issues@mesos.apache.org by "Alexander Rojas (JIRA)" <ji...@apache.org> on 2017/02/03 10:36:51 UTC
[jira] [Comment Edited] (MESOS-7036) Rate limiter deadlocks during
IO Switchboard-related tests
[ https://issues.apache.org/jira/browse/MESOS-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851315#comment-15851315 ]
Alexander Rojas edited comment on MESOS-7036 at 2/3/17 10:35 AM:
-----------------------------------------------------------------
I've been giving this a lot of thought, and two things are definitely clear to me:
1. This is not a bug in libprocess; it is the equivalent of a thread calling {{std::thread::join()}} on itself, which is a bug on the part of the library's user, not a library bug. Since we can detect the deadlock, though, I would suggest aborting in those situations, much as {{std::thread}} reacts when you join with yourself (it actually throws an exception of type {{system_error}} with code {{resource_deadlock_would_occur}}).
2. There are two kinds of fixes for this problem, based on the example made by [~alexr]. The first is to make wrapper classes uncopyable. As a rule of thumb, if a class has a {{delete}} in its destructor (or a {{terminate}}), it must not be copyable, since the first copy to be destroyed invalidates all the others. The second is to use a shared pointer as an RAII manager, e.g.:
{code}
class RateLimiter
{
public:
  RateLimiter(int permits, const Duration& duration);
  explicit RateLimiter(double permitsPerSecond);
  virtual ~RateLimiter();

  // Returns a future that becomes ready when the permit is acquired.
  // Discarding this future cancels this acquisition.
  virtual Future<Nothing> acquire() const;

private:
  // Not copyable, not assignable.
  RateLimiter(const RateLimiter&);
  RateLimiter& operator=(const RateLimiter&);

  std::shared_ptr<RateLimiterProcess> process;
};

RateLimiter::RateLimiter(int permits, const Duration& duration)
{
  // Custom destructor for `process` which will terminate and wait on the
  // process.
  process.reset(new RateLimiterProcess(...), [](RateLimiterProcess *process) {
    process::terminate(process);
    process::wait(process);
    delete process;
  });
}
{code}
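To make the difference between the two fixes concrete without pulling in libprocess, here is a minimal standalone sketch. Everything in it is hypothetical: {{FakeProcess}}, {{UncopyableWrapper}}, {{SharedWrapper}} and {{cleanupsAfterCopying()}} are illustration names only, and incrementing a counter stands in for the {{terminate()}} plus {{wait()}} step. It assumes C++11.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for `RateLimiterProcess`; it only records cleanups.
struct FakeProcess
{
  explicit FakeProcess(int* _cleanups) : cleanups(_cleanups) {}
  int* cleanups;
};

// Fix 1: the wrapper owns a raw pointer, so copying is forbidden outright.
class UncopyableWrapper
{
public:
  explicit UncopyableWrapper(int* cleanups)
    : process(new FakeProcess(cleanups)) {}

  ~UncopyableWrapper()
  {
    ++(*process->cleanups); // Stands in for terminate() + wait().
    delete process;
  }

  UncopyableWrapper(const UncopyableWrapper&) = delete;
  UncopyableWrapper& operator=(const UncopyableWrapper&) = delete;

private:
  FakeProcess* process;
};

// Fix 2: copies share ownership; the deleter runs once, for the last copy.
class SharedWrapper
{
public:
  explicit SharedWrapper(int* cleanups)
    : process(new FakeProcess(cleanups), [](FakeProcess* p) {
        ++(*p->cleanups); // Stands in for terminate() + wait().
        delete p;
      }) {}

private:
  std::shared_ptr<FakeProcess> process;
};

static_assert(
    !std::is_copy_constructible<UncopyableWrapper>::value,
    "copying the raw-pointer wrapper must not compile");

// Copies a `SharedWrapper` twice and returns how many cleanups ran.
int cleanupsAfterCopying()
{
  int cleanups = 0;
  {
    SharedWrapper first(&cleanups);
    SharedWrapper second = first;
    SharedWrapper third = second;
    assert(cleanups == 0); // Nothing is cleaned up while a copy is alive.
  } // All three copies are destroyed here.
  return cleanups;
}
```

With the shared pointer, the cleanup runs once no matter how many copies existed; with the deleted copy operations, the question never comes up because the copy does not compile.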
Note that neither of these fixes completely rules out calling {{await()}} and {{terminate()}} on yourself, but they will reduce the chances that we do so in our code base. So if the actual cause is more complicated, the deadlock may still show up.
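For reference, the {{std::thread}} behaviour mentioned in point 1 can be observed with a small standalone program. This is only a sketch: {{selfJoinError()}} is a made-up helper name, and it relies on the implementation reporting a self-join as an error (the C++ standard lists {{resource_deadlock_would_occur}} as the error condition for {{join()}} when a thread joins itself).

```cpp
#include <cassert>
#include <future>
#include <system_error>
#include <thread>

// Returns the error code observed when a thread tries to join itself.
std::error_code selfJoinError()
{
  std::thread* self = nullptr;
  std::promise<void> ready;
  std::future<void> start = ready.get_future();
  std::error_code observed;

  std::thread worker([&]() {
    start.wait(); // Wait until `self` points at this thread's handle.
    try {
      self->join(); // Joining ourselves could never complete.
    } catch (const std::system_error& e) {
      observed = e.code();
    }
  });

  self = &worker;
  ready.set_value();

  worker.join(); // Joining from another thread works as usual.
  assert(!worker.joinable());

  return observed;
}
```

Comparing the returned code against {{std::errc::resource_deadlock_would_occur}} shows the library reporting the misuse rather than hanging, which is the behaviour I am suggesting we mirror (by aborting) when libprocess detects the same situation.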
> Rate limiter deadlocks during IO Switchboard-related tests
> ----------------------------------------------------------
>
> Key: MESOS-7036
> URL: https://issues.apache.org/jira/browse/MESOS-7036
> Project: Mesos
> Issue Type: Bug
> Components: test, tests
> Environment: ASF CI
> Reporter: Greg Mann
> Priority: Critical
> Labels: flaky, mesosphere
> Attachments: AgentAPITest.LaunchNestedContainerSessionWithTTY.txt
>
>
> This has been observed a number of times recently on the ASF CI. While I didn't look through every single failed test log, I've noticed the failure occur during the following tests:
> {code}
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> IOSwitchboardTest.ContainerAttachAfterSlaveRestart
> ContentType/AgentAPITest.LaunchNestedContainerSession/1
> ContentType/AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> ContentType/AgentAPIStreamingTest.AttachContainerInput/0
> IOSwitchboardTest.ContainerAttach
> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession/0
> {code}
> In all cases, we see the following:
> {code}
> **** DEADLOCK DETECTED! ****
> You are waiting on process __limiter__(518)@172.17.0.3:35849 that it is currently executing.
> {code}
> Find attached an entire example log.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)