Posted to issues@mesos.apache.org by "Alexander Rojas (JIRA)" <ji...@apache.org> on 2017/02/03 10:36:51 UTC

[jira] [Comment Edited] (MESOS-7036) Rate limiter deadlocks during IO Switchboard-related tests

    [ https://issues.apache.org/jira/browse/MESOS-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851315#comment-15851315 ] 

Alexander Rojas edited comment on MESOS-7036 at 2/3/17 10:35 AM:
-----------------------------------------------------------------

I've given this a lot of thought, and two things are clear to me:

1. This is not a bug in libprocess; it is the equivalent of calling {{std::thread::join()}} on a thread from within that same thread. That is a bug on the part of the library's user, not a bug in the library. Since we can detect the deadlock, though, I would suggest aborting in those situations, much as {{std::thread}} reports an error when a thread joins itself (it actually throws an exception of type {{std::system_error}} with code {{resource_deadlock_would_occur}}).
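
For reference, here is a minimal sketch of the {{std::thread}} behaviour mentioned in point 1. It uses only the C++11 standard library and no libprocess code; the function name {{selfJoinThrows}} and the promise-based hand-off are my own choices for illustration. It assumes a standard-conforming implementation that reports the self-join rather than blocking.

```cpp
#include <future>
#include <system_error>
#include <thread>

// Returns true if a thread that tries to join itself gets a
// std::system_error whose code is resource_deadlock_would_occur.
bool selfJoinThrows()
{
  std::promise<std::thread*> handle;
  std::promise<bool> outcome;
  std::future<std::thread*> handleFuture = handle.get_future();
  std::future<bool> outcomeFuture = outcome.get_future();

  std::thread worker([&handleFuture, &outcome]() {
    // Wait until the main thread hands over a pointer to our own
    // std::thread object, so it is fully constructed before we use it.
    std::thread* self = handleFuture.get();

    bool detected = false;
    try {
      self->join();  // Joining ourselves: the library should refuse.
    } catch (const std::system_error& e) {
      detected = (e.code() == std::errc::resource_deadlock_would_occur);
    }
    outcome.set_value(detected);
  });

  handle.set_value(&worker);

  // Only join from the outside after the worker has finished its attempt.
  bool detected = outcomeFuture.get();
  worker.join();
  return detected;
}
```

Calling {{selfJoinThrows()}} from {{main}} should return {{true}} on such an implementation, which is the kind of loud failure I would like libprocess to produce instead of hanging.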

2. There are two kinds of fixes for this problem, based on the example made by [~alexr]. The first is to make wrapper classes uncopyable. As a rule of thumb, if a class calls {{delete}} (or {{terminate}}) in its destructor, it must not be copyable, since the first copy to be destroyed invalidates all the others. The second is to use a shared pointer as an RAII manager, e.g.:

{code}
class RateLimiter
{
public:
  RateLimiter(int permits, const Duration& duration);
  explicit RateLimiter(double permitsPerSecond);
  virtual ~RateLimiter();

  // Returns a future that becomes ready when the permit is acquired.
  // Discarding this future cancels this acquisition.
  virtual Future<Nothing> acquire() const;

private:
  // Not copyable, not assignable.
  RateLimiter(const RateLimiter&);
  RateLimiter& operator=(const RateLimiter&);

  std::shared_ptr<RateLimiterProcess> process;
};

RateLimiter::RateLimiter(int permits, const Duration& duration)
{
  // Custom deleter for `process` which terminates the process, waits
  // for it to exit, and only then frees it.
  process.reset(new RateLimiterProcess(...), [](RateLimiterProcess *process) {
    process::terminate(process);
    process::wait(process);
    delete process;
  });
}
{code}
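
To make the two options concrete without pulling in libprocess, here is a small self-contained sketch. Everything in it is hypothetical: {{FakeProcess}} stands in for a real process class, {{shutdownProcess}} stands in for the terminate/wait/delete sequence, and the two wrapper classes are not the real {{RateLimiter}}. It only illustrates that deleted copy operations reject copies at compile time, and that a shared pointer with a custom deleter runs the shutdown exactly once no matter how many copies exist.

```cpp
#include <memory>
#include <type_traits>

// Hypothetical stand-in for a libprocess process; not a real libprocess type.
struct FakeProcess {};

// Counts how many times the shutdown path has run.
int shutdowns = 0;

// Stand-in for the terminate + wait + delete sequence.
void shutdownProcess(FakeProcess* process)
{
  ++shutdowns;
  delete process;
}

// Fix 1: the wrapper frees the process in its destructor, so copying
// is disabled outright.
class UncopyableLimiter
{
public:
  UncopyableLimiter() : process(new FakeProcess()) {}
  ~UncopyableLimiter() { shutdownProcess(process); }

  UncopyableLimiter(const UncopyableLimiter&) = delete;
  UncopyableLimiter& operator=(const UncopyableLimiter&) = delete;

private:
  FakeProcess* process;
};

static_assert(!std::is_copy_constructible<UncopyableLimiter>::value,
              "copies must be rejected at compile time");
static_assert(!std::is_copy_assignable<UncopyableLimiter>::value,
              "assignment must be rejected at compile time");

// Fix 2: the shared pointer owns the process and runs the custom
// deleter once, when the last copy of the wrapper goes away.
class SharedLimiter
{
public:
  SharedLimiter() : process(new FakeProcess(), shutdownProcess) {}

private:
  std::shared_ptr<FakeProcess> process;
};

// Copies a SharedLimiter twice and reports how often shutdown ran.
int shutdownsAfterCopies()
{
  shutdowns = 0;
  {
    SharedLimiter a;
    SharedLimiter b = a;
    SharedLimiter c = b;
  }  // a, b and c are destroyed here; the deleter runs for the last one.
  return shutdowns;
}
```

With three copies in scope, {{shutdownsAfterCopies()}} returns 1, whereas a naive copyable wrapper around a raw pointer would have run the shutdown three times on the same process.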

Note that neither of these fixes completely rules out calling {{await()}} and {{terminate()}} from within the process itself, but they reduce the chances that we make such calls in our code base. If the actual cause is more complicated, these fixes may not be enough.


> Rate limiter deadlocks during IO Switchboard-related tests
> ----------------------------------------------------------
>
>                 Key: MESOS-7036
>                 URL: https://issues.apache.org/jira/browse/MESOS-7036
>             Project: Mesos
>          Issue Type: Bug
>          Components: test, tests
>         Environment: ASF CI
>            Reporter: Greg Mann
>            Priority: Critical
>              Labels: flaky, mesosphere
>         Attachments: AgentAPITest.LaunchNestedContainerSessionWithTTY.txt
>
>
> This has been observed a number of times recently on the ASF CI. While I didn't look through every single failed test log, I've noticed the failure occur during the following tests:
> {code}
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> IOSwitchboardTest.ContainerAttachAfterSlaveRestart
> ContentType/AgentAPITest.LaunchNestedContainerSession/1
> ContentType/AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> ContentType/AgentAPIStreamingTest.AttachContainerInput/0
> IOSwitchboardTest.ContainerAttach
> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession/0
> {code}
> In all cases, we see the following:
> {code}
> **** DEADLOCK DETECTED! ****
> You are waiting on process __limiter__(518)@172.17.0.3:35849 that it is currently executing.
> {code}
> Find attached an entire example log.


