Posted to reviews@mesos.apache.org by Jay Buffington <me...@jaybuff.com> on 2015/05/01 02:51:08 UTC

Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure


> On April 21, 2015, 4:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > <https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065>
> >
> >     Instead of doing it this way, can we just try to make sure `containerizer->wait` here returns a failure (or a Termination with some reason) when `containerizer->launch` fails? In that way, `executorTerminated` will properly send status updates (TASK_LOST/TASK_FAILED) to the scheduler.
> >     
> >     Or am I missing something?
> 
> Jie Yu wrote:
>     OK, I think I got confused by the ticket. There are actually two problems here. The problem I am referring to is the fact that we don't send a status update to the scheduler when a containerizer launch fails until the executor reregistration timeout happens. Since someone might use a very large timeout value for the docker containerizer, the slave should ideally send a status update to the scheduler right after the containerizer launch fails.
>     
>     After chatting with Jay, the problem you guys are referring to is the fact that the scheduler cannot distinguish the case where the task has failed from the case where the task's configuration is incorrect, because in both cases the scheduler will receive a TASK_FAILED/TASK_LOST.
> 
> Jie Yu wrote:
>     To address the first problem, I think the simplest way is to add a containerizer->destroy(..) call in executorLaunched when containerizer->launch fails. That will trigger containerizer->wait and thus send a status update to the scheduler.
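>     
>     In pseudocode (illustrative only, not the actual Mesos code), the proposed flow in the slave would be roughly:
>     
>     ```
>     Slave::executorLaunched(frameworkId, executorId, containerId, future):
>         if future is failed:
>             log(future.failure())
>             containerizer->destroy(containerId)  // completes the Termination
>             return                               // the pending containerizer->wait
>                                                  // fires and executorTerminated
>                                                  // sends the status update
>         ...
>     ```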
> 
> Jie Yu wrote:
>     Regarding the second problem, IMO we should include a reason field in Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let sendExecutorTerminatedStatusUpdate propagate the termination reason to the scheduler.
> 
> Timothy Chen wrote:
>     The reason field sounds good; I think what you proposed makes sense. In the docker containerizer, at least, we also need to make sure the termination message is set correctly, since currently it doesn't contain all of the error information that we pass back to the launch future.
> 
> Jay Buffington wrote:
>     Sorry for the confusion.   There are three problems that are all related.  Yes, we need to send statusUpdates as soon as containerizer launch fails and yes, we need to set the reason so the scheduler can distinguish a slave failure from a bad request.  However, my intention with this patch is not to address either of those two problems.
>     
>     My goal with this patch is to simply send the containerizer launch failure message back to the scheduler.  I am using Aurora with the docker containerizer.  There are a myriad of reasons that dockerd could fail to "docker run" a container.  Currently, when that fails the user sees a useless and incorrect "Abnormal Executor Termination" message in the Aurora web ui.  With this patch they see the stderr output of "docker run."  This way they can report meaningful error messages to the operations team.
>     
>     I can update this patch to address the other two issues, but the key is that when the containerizer launch fails we have to send a statusUpdate with a message that contains future.failure().  Before this patch we were only logging it.  The scheduler needs to get that error message.
> 
> Jie Yu wrote:
>     Thanks for clarifying it, Jay! In fact, my proposal above should be able to solve the third problem cleanly. Check out `Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should properly set the message and reason fields in the Termination protobuf (basically why the container gets terminated and what's the error message). The slave will forward the reason and message to the scheduler through status update.
>     
>     I chatted with BenM about this yesterday, and there are a couple of notes I want to write down here.
>     
>     1. We probably need multiple levels for TaskStatus::Reason. In other words, we probably need a "repeated Reason reasons" field in status update message. For instance, for a containerizer launch failure, we probably need two reasons: 1) the top level reason REASON_EXECUTOR_TERMINATED; 2) the second level reason REASON_DOCKER_PULL_FAILURE;
>     
>     2. We probably want to allow extensions to the TaskStatus::Reason enum. For example, some framework/executor may want to add customized reasons. We could leverage protobuf extension support to achieve that (https://developers.google.com/protocol-buffers/docs/proto#extensions).
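>     
>     A rough sketch of what (1) and (2) might look like (illustrative field numbers and names, not a concrete proposal):
>     
>     ```
>     message TaskStatus {
>       enum Reason {
>         REASON_EXECUTOR_TERMINATED = 0;
>         REASON_DOCKER_PULL_FAILURE = 1;
>         // ...
>       }
>     
>       // (1) Multiple levels of reasons, most general first.
>       repeated Reason reasons = 100;
>     
>       // (2) Room for framework-/executor-specific additions
>       //     via protobuf extensions.
>       extensions 1000 to max;
>     }
>     ```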
>     
>     3. The semantics around Termination are currently broken and we need to clean them up. For instance, the following code is broken,
>     ```
>     void Slave::sendExecutorTerminatedStatusUpdate(...)
>     {
>       ...
>       if (termination.isReady() && termination.get().killed()) {
>         taskState = TASK_FAILED;
>         // TODO(dhamon): MESOS-2035: Add 'reason' to containerizer::Termination.
>         reason = TaskStatus::REASON_MEMORY_LIMIT;
>       }
>     }
>     ```
>     because we now have a disk limit as well.
>     
>     Another issue with Termination is that containerizer->wait sometimes returns a Failure (e.g., launch failed, or destroy failed), which means we cannot get the reason and message fields from the Termination anymore. Right now, if that happens, we always set the reason to REASON_EXECUTOR_TERMINATED and the message to "Abnormal executor termination" in the status update. I think this is a hack; IMO, the containerizer should always return a valid Termination protobuf to the slave so that the slave can send a meaningful status update to the framework.
> 
> Jay Buffington wrote:
>     Thanks for these comments, Jie.  I'll work on implementing your proposal.
>     
>     I was unaware of the existence of the Termination protobuf.  I am confused by these comments (which I think are just flat-out wrong): https://github.com/apache/mesos/blob/c36d5996327ca765f49c211d489371c99ef8e090/src/slave/slave.cpp#L3177 which say:
>     
>        // A termination failure indicates the containerizer could not destroy a
>        // container.
>        // TODO(idownes): This is a serious error so consider aborting the slave if
>        // this occurs.
>        
>     I don't understand when you would ever return Failure() in your containerizer without doing container->termination.set(termination);
>     
>     Thanks!
> 
> Jie Yu wrote:
>     > I don't understand when you would ever return Failure() in your containerizer without doing container->termination.set(termination);
>     
>     Yeah, I think that's tech debt and we should correct it. Let me know if you need any help with this.

I finally sat down to work through what you said, Jie.

Check this out:
https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1194-L1202

When pull() fails, destroy() is called.  Destroy correctly sets the Termination, but then it deletes the container!  Then, when we call containerizer->wait() in slave.cpp, we get back a Failure("Unknown container: ...") instead of the termination future as we should.

IMHO the solution here is to split up the code from destroy() into two methods: one that sets the termination (call this launchFailed()) and one that just erases the container (keep calling this destroy()).

We should then only call launchFailed() from launch (here: https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L636).

When launch fails, the slave should call containerizer->destroy() from its executorLaunched method.
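To make the intended ordering concrete, here is a toy model of that split (illustrative names and types only, not the real Mesos Future/Owned API):

```cpp
// Toy model of the proposed split: launchFailed() latches the
// Termination *before* destroy() erases the container, so a later
// wait() still finds it instead of failing with "Unknown container".
#include <cassert>
#include <map>
#include <string>

struct Termination {
  std::string message;  // e.g. the stderr of a failed "docker run"
  bool set = false;
};

class Containerizer {
public:
  void launch(const std::string& id) {
    containers[id] = Termination();
  }

  // Proposed: record why the launch failed and latch the termination.
  void launchFailed(const std::string& id, const std::string& error) {
    auto it = containers.find(id);
    if (it != containers.end()) {
      it->second.message = error;
      it->second.set = true;
      terminations[id] = it->second;  // latched before the erase below
    }
  }

  // Proposed: destroy() only erases the container.
  void destroy(const std::string& id) {
    containers.erase(id);
  }

  // wait() returns the latched termination even after destroy(),
  // rather than the equivalent of Failure("Unknown container: ...").
  bool wait(const std::string& id, Termination* out) const {
    auto it = terminations.find(id);
    if (it == terminations.end()) {
      return false;
    }
    *out = it->second;
    return true;
  }

private:
  std::map<std::string, Termination> containers;
  std::map<std::string, Termination> terminations;
};
```

With this ordering, the slave can call destroy() on a launch failure and the still-pending wait() surfaces the real error message instead of "Abnormal executor termination".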


- Jay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review81090
-----------------------------------------------------------


On April 21, 2015, 10:14 a.m., Jay Buffington wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> -----------------------------------------------------------
> 
> (Updated April 21, 2015, 10:14 a.m.)
> 
> 
> Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
> 
> 
> Bugs: MESOS-2020
>     https://issues.apache.org/jira/browse/MESOS-2020
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When mesos is unable to launch the containerizer the scheduler should
> get a TASK_FAILED with a status message that includes the error the
> containerizer encountered when trying to launch.
> 
> Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
>   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd 
>   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> -------
> 
> I added a test case to slave_tests.cpp.  I also tried this with Aurora: I supplied a bogus docker image URL and saw the "docker pull" failure stderr message in Aurora's web UI.
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>