Posted to issues@mesos.apache.org by "Chun-Hung Hsiao (JIRA)" <ji...@apache.org> on 2019/03/29 19:25:02 UTC

[jira] [Comment Edited] (MESOS-9667) Check failure when executor for task using resource provider resources subscribes before agent is registered

    [ https://issues.apache.org/jira/browse/MESOS-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805365#comment-16805365 ] 

Chun-Hung Hsiao edited comment on MESOS-9667 at 3/29/19 7:24 PM:
-----------------------------------------------------------------

Let's consider the following scenario:

 # The agent receives {{RunTaskGroupMessage}} with two tasks {{foo}} and {{bar}} using RP resources, and launches an executor.
 # Upon executor subscription, the agent performs the following steps to launch the tasks:
    2.1 Publish all resources *allocated to the executor* (i.e., including queued tasks).
    2.2 Ask the containerizer to update the executor container with all resources allocated to the executor.
    2.3 Send a {{LAUNCH_GROUP}} event containing tasks {{foo}} and {{bar}} to the executor.
 # The executor launches task {{foo}} through {{LAUNCH_NESTED_CONTAINER}}.
 # The agent receives {{TASK_STARTING}} for {{foo}} and dequeues the task.
 # The agent restarts and receives an executor resubscription.
 # Upon executor resubscription, the agent "recovers" the executor through the following steps:
    6.1 Publish all resources *allocated to the executor*.
    6.2 Ask the containerizer to update the executor container with all resources allocated to the executor.
    6.3 Send a {{LAUNCH_GROUP}} event containing the pending task {{bar}} to the executor.
 # The executor launches task {{bar}} through {{LAUNCH_NESTED_CONTAINER}}.

The problem described in this ticket is that Step 6.1 would crash if the agent hasn't reregistered yet (and thus the RP manager is not initialized). However, to me the actual problem is broader than just RP manager initialization. Essentially, there will be a period of time before the RP resubscribes during which _the resources allocated to the executor are not contained in the agent's total resources!_
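
To make that window concrete, here is a minimal Python sketch. It is not Mesos code: the resource names ({{cpus}}, {{rp_disk}}), the quantities, and the {{contains}} helper are all hypothetical. It models resources as name-to-quantity maps and checks whether the executor's allocation is covered by the agent's total before and after the RP resubscribes.

```python
from collections import Counter


def contains(total, allocated):
    """Return True if every allocated quantity is covered by `total`."""
    return all(total[name] >= amount for name, amount in allocated.items())


# Hypothetical resources: `cpus` belongs to the agent itself, `rp_disk` comes from an RP.
agent_resources = Counter({"cpus": 4})
rp_resources = Counter({"rp_disk": 100})
executor_allocated = Counter({"cpus": 1, "rp_disk": 20})

# Assumption: after a restart, the agent's total only includes RP resources
# once the RP has resubscribed.
total_before_rp = agent_resources
total_after_rp = agent_resources + rp_resources

print(contains(total_before_rp, executor_allocated))  # False
print(contains(total_after_rp, executor_allocated))   # True
```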

We have a few options here:

* *Initialize the RP manager as early as possible*
  If we initialize the RP manager when the agent recovers its ID from the checkpointed state, the CHECK failure would go away. But if the executor resubscribes before the RP does, the agent would fail the executor and transition all tasks, including the running task {{foo}}, to {{TASK_GONE}}.

* *Block executor reregistration until agent recovery*
  This is basically similar to the above option, but I'm not sure if there's any concern w.r.t. agent recovery. IMO this is inferior to the above option.

* *Publish allocated resources before Steps 2.3 and 6.3 and remove Steps 2.1 and 6.1*
  The idea here is that since task {{foo}} is already running, the RP resources must have been ready, so it's really not necessary to publish the resources again. Only task {{bar}} would fail if the RP is not subscribed before Step 6.3. Here we could either fail the resource publishing if the RP manager is not ready in Step 6.3, or initialize the RP manager early.
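
A rough sketch of the third option follows, in Python rather than the agent's C++ and with hypothetical names throughout ({{RPManager}}, {{resubscribe_executor}}, the {{FAIL_PENDING}} label, the task dictionaries). It assumes the variant where publishing fails while the RP is not ready: running tasks are left alone, and only the pending tasks are affected.

```python
class RPManager:
    """Hypothetical stand-in for the resource provider manager."""

    def __init__(self):
        self.subscribed = False
        self.published = []

    def publish(self, resources):
        # Assumed behavior: publishing fails until the RP has resubscribed.
        if not self.subscribed:
            return False
        self.published.append(resources)
        return True


def resubscribe_executor(rp_manager, running, pending):
    """Model of executor resubscription with publishing moved right before LAUNCH_GROUP."""
    events = []
    if pending:
        names = [task["name"] for task in pending]
        # Only pending tasks need their resources published; running tasks
        # already had theirs published before they were launched.
        if all(rp_manager.publish(task["resources"]) for task in pending):
            events.append(("LAUNCH_GROUP", names))
        else:
            events.append(("FAIL_PENDING", names))
    return {"running": [task["name"] for task in running], "events": events}


foo = {"name": "foo", "resources": {"rp_disk": 10}}
bar = {"name": "bar", "resources": {"rp_disk": 10}}

manager = RPManager()
before = resubscribe_executor(manager, running=[foo], pending=[bar])
manager.subscribed = True
after = resubscribe_executor(manager, running=[foo], pending=[bar])
print(before)
print(after)
```

In both calls {{foo}} stays in the running list; only the outcome for {{bar}} depends on whether the RP has resubscribed.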

Judging from the master code, it seems okay if we recover allocated resources in the master before RP subscriptions: resources not in an agent's total resources won't be considered available and thus won't be offered.
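
As an illustration of that reasoning only (this is not the master's code, and it assumes offerable resources are computed as the agent's total minus what is allocated, never going below zero), a small Python model with hypothetical resource names:

```python
from collections import Counter


def offerable(total, allocated):
    # Counter subtraction drops entries that would be zero or negative, so
    # anything allocated but missing from `total` simply disappears.
    return total - allocated


allocated = Counter({"cpus": 1, "rp_disk": 20})

# Before the RP resubscribes, the agent's total has no RP resources.
total_before_rp = Counter({"cpus": 4})
# After the RP resubscribes, its resources are part of the total again.
total_after_rp = Counter({"cpus": 4, "rp_disk": 100})

print(offerable(total_before_rp, allocated))
print(offerable(total_after_rp, allocated))
```

Under this assumption the recovered RP allocation never shows up as offerable before the RP resubscribes, and afterwards only the unallocated remainder does.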


> Check failure when executor for task using resource provider resources subscribes before agent is registered
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9667
>                 URL: https://issues.apache.org/jira/browse/MESOS-9667
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.8.0
>            Reporter: Benjamin Bannier
>            Priority: Blocker
>              Labels: foundations, mesosphere, mesosphere-dss-ga
>
> When an executor for a task using resource provider resources subscribes before the agent has registered with the master, we trigger a fatal assertion,
> {code:java}
> Mar 21 13:42:47 agent1 mesos-agent[17277]: F0321 13:42:46.845535 17295 slave.cpp:8834] Check failed: 'resourceProviderManager.get()' Must be non NULL
> Mar 21 13:42:47 agent1 mesos-agent[17277]: *** Check failure stack trace: ***
> {code}
> The reason for this failure is that we attempt to publish resources to the resource provider via the resource provider manager, but the resource provider manager is only created once the agent has registered with the master.
> As a workaround one can terminate the executors and their tasks, and let the framework relaunch the tasks (provided it supports that).
> A possible workaround could be to prevent such executors from subscribing until the resource provider manager is available.


