You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Anand Mazumdar <an...@apache.org> on 2017/01/04 00:38:32 UTC

Review Request 55157: Fixed a bug in the default executor around not committing suicide.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55157/
-----------------------------------------------------------

Review request for mesos and Vinod Kone.


Bugs: MESOS-6848
    https://issues.apache.org/jira/browse/MESOS-6848


Repository: mesos


Description
-------

This bug is only observed when the task group contains a single task.
The default executor was not committing suicide when this single task
used to exit with a non-zero status code as per the default restart
policy.


Diffs
-----

  src/launcher/default_executor.cpp fd9963996c82461b60888397989e309d51b60aaa 
  src/tests/default_executor_tests.cpp 340e8c8b36a6a3cc6e5bae021e19dfa7a54852c3 

Diff: https://reviews.apache.org/r/55157/diff/


Testing
-------

make check + added a test


Thanks,

Anand Mazumdar


Re: Review Request 55157: Fixed a bug in the default executor around not committing suicide.

Posted by Anand Mazumdar <an...@apache.org>.

> On Jan. 9, 2017, 2:10 a.m., Vinod Kone wrote:
> > src/launcher/default_executor.cpp, lines 663-683
> > <https://reviews.apache.org/r/55157/diff/1/?file=1596482#file1596482line663>
> >
> >     How about
> >     
> >     ```
> >     // Check to see if the executor needs to shutdown.
> >     
> >     // If the executor is already in the process of shutting down, return.
> >     if (shuttingDown) {
> >       return;
> >     }
> >     
> >     // If there are no active containers, shutdown the executor.
> >     if (containers.empty()) {
> >       shutdown();
> >       return;
> >     }
> >     
> >     // If this container exited with non-zero status, kill all the other containers,
> >     // per the default policy.
> >     if (taskState == TASK_FAILED) {
> >     
> >       // TODO(Anand): ...
> >       shutdown();
> >     }
> >     
> >     
> >     ```
> 
> Anand Mazumdar wrote:
>     hmm,  I had refrained from splitting the no active containers check and moving it above was because I still wanted to log why the single task pod failed. With the above suggested snippet, we would need to do the logging for both the cases where we return early if the task failed? Another alternative can be that I return early when the executor is shutting down while keeping the rest of the logic same:
>     
>     ```cpp
>         // Ignore if the executor is already in the process of shutting down.
>         if (shuttingDown) {
>           return;
>         }
>     
>         // The default restart policy for a task group is to kill all the
>         // remaining child containers if one of them terminated with a
>         // non-zero exit code.
>         if (taskState == TASK_FAILED) {
>           LOG(ERROR)
>             << "Child container " << containerId << " terminated with status "
>             << (status.isSome() ? WSTRINGIFY(status.get()) : "unknown");
>     
>           // Kill all the other active containers.
>           //
>           // TODO(anand): Invoke `kill()` once per active container
>           // instead of directly invoking `shutdown()`.
>           if (!containers.empty()) {
>             shutdown();
>             return;
>           }
>         }
>     
>         // Shutdown the executor if all the active child containers have terminated.
>         if (containers.empty()) {
>           __shutdown();
>         }
>     ```
>     
>     What do you think?
> 
> Vinod Kone wrote:
>     I think we can print the exit code/reason in the LOG(INFO) message at #656. It is useful irrespective of task status?
>     
>     I moved up `if (containers.empty())` check because it was not very intuitive to me why you had a `!containers.empty()` check inside that if statement. I had to read the `shutdown()` code to understand that. As a side note, it sounds like `shutdown()` should have the logic to call `__shutdown()` if containers is empty.

hmm, I couldn't think of any possible other task states where it might be useful other than `TASK_FAILED` to log the exit status. There might be some `TASK_KILLED` instances (OOM) where we might still be able to retrieve the status? But, neverthless it seemed useful to log the exit status reason.

Can you take a look at the review again?


- Anand


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55157/#review160849
-----------------------------------------------------------


On Jan. 9, 2017, 7:37 p.m., Anand Mazumdar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55157/
> -----------------------------------------------------------
> 
> (Updated Jan. 9, 2017, 7:37 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6848
>     https://issues.apache.org/jira/browse/MESOS-6848
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This bug is only observed when the task group contains a single task.
> The default executor was not committing suicide when this single task
> used to exit with a non-zero status code as per the default restart
> policy.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp fd9963996c82461b60888397989e309d51b60aaa 
>   src/tests/default_executor_tests.cpp 340e8c8b36a6a3cc6e5bae021e19dfa7a54852c3 
> 
> Diff: https://reviews.apache.org/r/55157/diff/
> 
> 
> Testing
> -------
> 
> make check + added a test
> 
> 
> Thanks,
> 
> Anand Mazumdar
> 
>


Re: Review Request 55157: Fixed a bug in the default executor around not committing suicide.

Posted by Anand Mazumdar <an...@apache.org>.

> On Jan. 9, 2017, 2:10 a.m., Vinod Kone wrote:
> > src/launcher/default_executor.cpp, lines 663-683
> > <https://reviews.apache.org/r/55157/diff/1/?file=1596482#file1596482line663>
> >
> >     How about
> >     
> >     ```
> >     // Check to see if the executor needs to shutdown.
> >     
> >     // If the executor is already in the process of shutting down, return.
> >     if (shuttingDown) {
> >       return;
> >     }
> >     
> >     // If there are no active containers, shutdown the executor.
> >     if (containers.empty()) {
> >       shutdown();
> >       return;
> >     }
> >     
> >     // If this container exited with non-zero status, kill all the other containers,
> >     // per the default policy.
> >     if (taskState == TASK_FAILED) {
> >     
> >       // TODO(Anand): ...
> >       shutdown();
> >     }
> >     
> >     
> >     ```

hmm,  I had refrained from splitting the no active containers check and moving it above was because I still wanted to log why the single task pod failed. With the above suggested snippet, we would need to do the logging for both the cases where we return early if the task failed? Another alternative can be that I return early when the executor is shutting down while keeping the rest of the logic same:

```cpp
    // Ignore if the executor is already in the process of shutting down.
    if (shuttingDown) {
      return;
    }

    // The default restart policy for a task group is to kill all the
    // remaining child containers if one of them terminated with a
    // non-zero exit code.
    if (taskState == TASK_FAILED) {
      LOG(ERROR)
        << "Child container " << containerId << " terminated with status "
        << (status.isSome() ? WSTRINGIFY(status.get()) : "unknown");

      // Kill all the other active containers.
      //
      // TODO(anand): Invoke `kill()` once per active container
      // instead of directly invoking `shutdown()`.
      if (!containers.empty()) {
        shutdown();
        return;
      }
    }

    // Shutdown the executor if all the active child containers have terminated.
    if (containers.empty()) {
      __shutdown();
    }
```

What do you think?


- Anand


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55157/#review160849
-----------------------------------------------------------


On Jan. 4, 2017, 12:38 a.m., Anand Mazumdar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55157/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 12:38 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6848
>     https://issues.apache.org/jira/browse/MESOS-6848
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This bug is only observed when the task group contains a single task.
> The default executor was not committing suicide when this single task
> used to exit with a non-zero status code as per the default restart
> policy.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp fd9963996c82461b60888397989e309d51b60aaa 
>   src/tests/default_executor_tests.cpp 340e8c8b36a6a3cc6e5bae021e19dfa7a54852c3 
> 
> Diff: https://reviews.apache.org/r/55157/diff/
> 
> 
> Testing
> -------
> 
> make check + added a test
> 
> 
> Thanks,
> 
> Anand Mazumdar
> 
>


Re: Review Request 55157: Fixed a bug in the default executor around not committing suicide.

Posted by Vinod Kone <vi...@gmail.com>.

> On Jan. 9, 2017, 2:10 a.m., Vinod Kone wrote:
> > src/launcher/default_executor.cpp, lines 663-683
> > <https://reviews.apache.org/r/55157/diff/1/?file=1596482#file1596482line663>
> >
> >     How about
> >     
> >     ```
> >     // Check to see if the executor needs to shutdown.
> >     
> >     // If the executor is already in the process of shutting down, return.
> >     if (shuttingDown) {
> >       return;
> >     }
> >     
> >     // If there are no active containers, shutdown the executor.
> >     if (containers.empty()) {
> >       shutdown();
> >       return;
> >     }
> >     
> >     // If this container exited with non-zero status, kill all the other containers,
> >     // per the default policy.
> >     if (taskState == TASK_FAILED) {
> >     
> >       // TODO(Anand): ...
> >       shutdown();
> >     }
> >     
> >     
> >     ```
> 
> Anand Mazumdar wrote:
>     hmm,  I had refrained from splitting the no active containers check and moving it above was because I still wanted to log why the single task pod failed. With the above suggested snippet, we would need to do the logging for both the cases where we return early if the task failed? Another alternative can be that I return early when the executor is shutting down while keeping the rest of the logic same:
>     
>     ```cpp
>         // Ignore if the executor is already in the process of shutting down.
>         if (shuttingDown) {
>           return;
>         }
>     
>         // The default restart policy for a task group is to kill all the
>         // remaining child containers if one of them terminated with a
>         // non-zero exit code.
>         if (taskState == TASK_FAILED) {
>           LOG(ERROR)
>             << "Child container " << containerId << " terminated with status "
>             << (status.isSome() ? WSTRINGIFY(status.get()) : "unknown");
>     
>           // Kill all the other active containers.
>           //
>           // TODO(anand): Invoke `kill()` once per active container
>           // instead of directly invoking `shutdown()`.
>           if (!containers.empty()) {
>             shutdown();
>             return;
>           }
>         }
>     
>         // Shutdown the executor if all the active child containers have terminated.
>         if (containers.empty()) {
>           __shutdown();
>         }
>     ```
>     
>     What do you think?

I think we can print the exit code/reason in the LOG(INFO) message at #656. It is useful irrespective of task status?

I moved up `if (containers.empty())` check because it was not very intuitive to me why you had a `!containers.empty()` check inside that if statement. I had to read the `shutdown()` code to understand that. As a side note, it sounds like `shutdown()` should have the logic to call `__shutdown()` if containers is empty.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55157/#review160849
-----------------------------------------------------------


On Jan. 4, 2017, 12:38 a.m., Anand Mazumdar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55157/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 12:38 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6848
>     https://issues.apache.org/jira/browse/MESOS-6848
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This bug is only observed when the task group contains a single task.
> The default executor was not committing suicide when this single task
> used to exit with a non-zero status code as per the default restart
> policy.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp fd9963996c82461b60888397989e309d51b60aaa 
>   src/tests/default_executor_tests.cpp 340e8c8b36a6a3cc6e5bae021e19dfa7a54852c3 
> 
> Diff: https://reviews.apache.org/r/55157/diff/
> 
> 
> Testing
> -------
> 
> make check + added a test
> 
> 
> Thanks,
> 
> Anand Mazumdar
> 
>


Re: Review Request 55157: Fixed a bug in the default executor around not committing suicide.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55157/#review160849
-----------------------------------------------------------


Fix it, then Ship it!





src/launcher/default_executor.cpp (lines 663 - 683)
<https://reviews.apache.org/r/55157/#comment232083>

    How about
    
    ```
    // Check to see if the executor needs to shutdown.
    
    // If the executor is already in the process of shutting down, return.
    if (shuttingDown) {
      return;
    }
    
    // If there are no active containers, shutdown the executor.
    if (containers.empty()) {
      shutdown();
      return;
    }
    
    // If this container exited with non-zero status, kill all the other containers,
    // per the default policy.
    if (taskState == TASK_FAILED) {
    
      // TODO(Anand): ...
      shutdown();
    }
    
    ```


- Vinod Kone


On Jan. 4, 2017, 12:38 a.m., Anand Mazumdar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55157/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 12:38 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6848
>     https://issues.apache.org/jira/browse/MESOS-6848
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This bug is only observed when the task group contains a single task.
> The default executor was not committing suicide when this single task
> used to exit with a non-zero status code as per the default restart
> policy.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp fd9963996c82461b60888397989e309d51b60aaa 
>   src/tests/default_executor_tests.cpp 340e8c8b36a6a3cc6e5bae021e19dfa7a54852c3 
> 
> Diff: https://reviews.apache.org/r/55157/diff/
> 
> 
> Testing
> -------
> 
> make check + added a test
> 
> 
> Thanks,
> 
> Anand Mazumdar
> 
>


Re: Review Request 55157: Fixed a bug in the default executor around not committing suicide.

Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55157/#review160967
-----------------------------------------------------------



Patch looks great!

Reviews applied: [55157]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh

- Mesos ReviewBot


On Jan. 9, 2017, 7:37 p.m., Anand Mazumdar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55157/
> -----------------------------------------------------------
> 
> (Updated Jan. 9, 2017, 7:37 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6848
>     https://issues.apache.org/jira/browse/MESOS-6848
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This bug is only observed when the task group contains a single task.
> The default executor was not committing suicide when this single task
> used to exit with a non-zero status code as per the default restart
> policy.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp fd9963996c82461b60888397989e309d51b60aaa 
>   src/tests/default_executor_tests.cpp 340e8c8b36a6a3cc6e5bae021e19dfa7a54852c3 
> 
> Diff: https://reviews.apache.org/r/55157/diff/
> 
> 
> Testing
> -------
> 
> make check + added a test
> 
> 
> Thanks,
> 
> Anand Mazumdar
> 
>


Re: Review Request 55157: Fixed a bug in the default executor around not committing suicide.

Posted by Anand Mazumdar <an...@apache.org>.

> On Jan. 9, 2017, 7:47 p.m., Vinod Kone wrote:
> > src/launcher/default_executor.cpp, lines 672-679
> > <https://reviews.apache.org/r/55157/diff/1-2/?file=1596482#file1596482line672>
> >
> >     shouldn't these be re-ordered? i'm assuming we don't need to call `__shutdown()` if a shutdown is already in progress?

As per our offline discussion, we can't reorder them as `shutdown()` doesn't invoke `__shutdown` implicitly to commit suicide. We wait for all our wait connections to return and commit suicide thereafter in `waited()`. This allows us to keep all the restart/kill policy logic in one place in `waited()`. 

I would add a comment in `shutdown()` in a separate review for posterity.


- Anand


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55157/#review160946
-----------------------------------------------------------


On Jan. 9, 2017, 7:37 p.m., Anand Mazumdar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55157/
> -----------------------------------------------------------
> 
> (Updated Jan. 9, 2017, 7:37 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6848
>     https://issues.apache.org/jira/browse/MESOS-6848
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This bug is only observed when the task group contains a single task.
> The default executor was not committing suicide when this single task
> used to exit with a non-zero status code as per the default restart
> policy.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp fd9963996c82461b60888397989e309d51b60aaa 
>   src/tests/default_executor_tests.cpp 340e8c8b36a6a3cc6e5bae021e19dfa7a54852c3 
> 
> Diff: https://reviews.apache.org/r/55157/diff/
> 
> 
> Testing
> -------
> 
> make check + added a test
> 
> 
> Thanks,
> 
> Anand Mazumdar
> 
>


Re: Review Request 55157: Fixed a bug in the default executor around not committing suicide.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55157/#review160946
-----------------------------------------------------------




src/launcher/default_executor.cpp (lines 665 - 672)
<https://reviews.apache.org/r/55157/#comment232184>

    shouldn't these be re-ordered? i'm assuming we don't need to call `__shutdown()` if a shutdown is already in progress?


- Vinod Kone


On Jan. 9, 2017, 7:37 p.m., Anand Mazumdar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55157/
> -----------------------------------------------------------
> 
> (Updated Jan. 9, 2017, 7:37 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6848
>     https://issues.apache.org/jira/browse/MESOS-6848
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This bug is only observed when the task group contains a single task.
> The default executor was not committing suicide when this single task
> used to exit with a non-zero status code as per the default restart
> policy.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp fd9963996c82461b60888397989e309d51b60aaa 
>   src/tests/default_executor_tests.cpp 340e8c8b36a6a3cc6e5bae021e19dfa7a54852c3 
> 
> Diff: https://reviews.apache.org/r/55157/diff/
> 
> 
> Testing
> -------
> 
> make check + added a test
> 
> 
> Thanks,
> 
> Anand Mazumdar
> 
>


Re: Review Request 55157: Fixed a bug in the default executor around not committing suicide.

Posted by Anand Mazumdar <an...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55157/
-----------------------------------------------------------

(Updated Jan. 9, 2017, 7:37 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Review comments from Vinod


Bugs: MESOS-6848
    https://issues.apache.org/jira/browse/MESOS-6848


Repository: mesos


Description
-------

This bug is only observed when the task group contains a single task.
The default executor was not committing suicide when this single task
used to exit with a non-zero status code as per the default restart
policy.


Diffs (updated)
-----

  src/launcher/default_executor.cpp fd9963996c82461b60888397989e309d51b60aaa 
  src/tests/default_executor_tests.cpp 340e8c8b36a6a3cc6e5bae021e19dfa7a54852c3 

Diff: https://reviews.apache.org/r/55157/diff/


Testing
-------

make check + added a test


Thanks,

Anand Mazumdar