Posted to issues@mesos.apache.org by "Chun-Hung Hsiao (JIRA)" <ji...@apache.org> on 2018/11/21 18:42:00 UTC

[jira] [Commented] (MESOS-9387) Surface errors for publishing CSI volumes in task status updates.

    [ https://issues.apache.org/jira/browse/MESOS-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695092#comment-16695092 ] 

Chun-Hung Hsiao commented on MESOS-9387:
----------------------------------------

Thoughts dump:

At a minimum, we need to deliver the error message through the RP API.
However, since we're changing the API anyway, I'm thinking about something more aggressive and more in line with the ERP story in the RP API:
{noformat}
message Event {
  message PublishResources {
    required UUID uuid = 1;

    // The set of resources that are required to be published.
    repeated Resource required = 2;

    // The set of resources that are allowed to be published. Any resource
    // beyond this set should be unpublished. This set should contain the set of
    // required resources.
    repeated Resource allowed = 3;
  }
}

message Call {
  enum Type {
    UPDATE_PUBLISHED_RESOURCES = 4; // See 'UpdatePublishedResources'.
  }

  message UpdatePublishedResources {
    enum Status {
      UNKNOWN = 0;

      // All required resources are published and all resources that are not in
      // the set of allowed resources are unpublished. In this case, the set of
      // published resources in the `resources` field would be a superset of the
      // required resources and a subset of the allowed resources.
      OK = 1;

      // The resource provider fails to publish certain required resources, or
      // fails to unpublish certain resources that are not in the set of allowed
      // resources. In this case, the set of published resources should still be
      // reported through the `resources` field, and more human-readable
      // information should be provided in the `message` field.
      FAILED = 2;
    }

    required UUID uuid = 1;
    required Status status = 2;
    repeated Resource resources = 3;
    optional string message = 4;
  }

  optional UpdatePublishedResources update_published_resources = 6;
}
{noformat}
The {{UpdatePublishedResources}} message is backward-compatible with the original {{UpdatePublishResourcesStatus}} message.
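
To make the {{OK}} / {{FAILED}} semantics above concrete, here is a minimal Python sketch of the check a resource provider could apply before reporting a status. It is not Mesos code: resources are modeled as a {{Counter}} (a multiset keyed by a made-up resource name) so that resources without identifiers can still be compared by quantity, and the helper names {{contains}} and {{expected_status}} are hypothetical.

```python
from collections import Counter


def contains(superset: Counter, subset: Counter) -> bool:
    """True if `superset` holds at least as much of every resource in `subset`."""
    return all(superset[name] >= amount for name, amount in subset.items())


def expected_status(required: Counter, allowed: Counter, published: Counter) -> str:
    """Status implied by the proposal: OK only if required <= published <= allowed."""
    if contains(published, required) and contains(allowed, published):
        return "OK"
    # Either a required resource is missing, or something outside `allowed`
    # is still published.
    return "FAILED"


required = Counter({"vol-a": 1})
allowed = Counter({"vol-a": 1, "vol-b": 1})

print(expected_status(required, allowed, Counter({"vol-a": 1, "vol-b": 1})))
print(expected_status(required, allowed, Counter({"vol-b": 1})))
```

In both cases the provider would still report the actual published set in {{resources}}; only the status and the {{message}} field differ.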

The reason for the change in {{PublishResources}} and {{UpdatePublishedResources}} is to make the call idempotent and applicable to resources without identifiers.
The agent should keep track of the current set of published resources through the {{UpdatePublishedResources.resources}} field.
When launching a task, it should send a {{PUBLISH_RESOURCES}} event with {{required}} set to the resources already in use plus the resources for the new task,
and {{allowed}} set to the required resources plus the currently published resources,
then examine the new set of published resources to determine whether the task can be launched, even if it receives a {{FAILED}} status.
If the new set of published resources does not contain the resources used by the task, the error message is surfaced in the task status update.
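
The agent-side flow described above can be sketched as follows. This is an illustrative Python model, not the agent implementation: resources are multisets ({{Counter}}) keyed by made-up names, the {{UpdatePublishedResources}} dataclass only mirrors the proposed message fields, and {{build_publish_request}} and {{task_launch_error}} are hypothetical helpers. One interpretive assumption: the "+" used for {{allowed}} is read as a union, since the published set can overlap with the required set.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class UpdatePublishedResources:
    """Stand-in for the proposed call; field names follow the proto sketch."""
    status: str          # "OK" or "FAILED"
    resources: Counter   # the set of resources actually published
    message: str = ""


def contains(superset: Counter, subset: Counter) -> bool:
    """True if `superset` holds at least as much of every resource in `subset`."""
    return all(superset[name] >= amount for name, amount in subset.items())


def build_publish_request(used: Counter, task: Counter,
                          published: Counter) -> Tuple[Counter, Counter]:
    """Compute the `required` and `allowed` sets for a new task launch."""
    required = used + task            # resources in use plus the new task's
    allowed = required | published   # union, so `allowed` contains `required`
    return required, allowed


def task_launch_error(task: Counter,
                      response: UpdatePublishedResources) -> Optional[str]:
    """Return None if the task can launch, else the message to surface."""
    # The decision is based on the published set, not on the status alone.
    if contains(response.resources, task):
        return None
    # Placeholder fallback text (an assumption) if the provider sent no message.
    return response.message or "Failed to publish resources"


used = Counter({"disk-1": 1})
task = Counter({"disk-2": 1})
published = Counter({"disk-1": 1, "disk-3": 1})
required, allowed = build_publish_request(used, task, published)

# FAILED only because a stale resource could not be unpublished: still launchable.
stale = UpdatePublishedResources("FAILED", allowed, "failed to unpublish disk-3")
print(task_launch_error(task, stale))

# FAILED because the task's own volume was not published: surface the message.
broken = UpdatePublishedResources("FAILED", Counter(used), "mkfs failed")
print(task_launch_error(task, broken))
```

Because the request carries the full desired state rather than a delta, resending the same {{required}} and {{allowed}} sets is harmless, which is what makes the call idempotent.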

> Surface errors for publishing CSI volumes in task status updates.
> -----------------------------------------------------------------
>
>                 Key: MESOS-9387
>                 URL: https://issues.apache.org/jira/browse/MESOS-9387
>             Project: Mesos
>          Issue Type: Improvement
>          Components: storage
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Critical
>              Labels: mesosphere, storage
>
> Currently, if a CSI volume fails to publish (e.g., due to {{mkfs}} errors), the framework gets a {{TASK_FAILED}} with reason {{REASON_CONTAINER_LAUNCH_FAILED}} or {{REASON_CONTAINER_UPDATE_FAILED}} and the message "{{Failed to publish resources for resource provider XXX: Received FAILED status}}", which is not informative. We should surface the actual error message in the {{TASK_FAILED}} status update.


