Posted to issues@mesos.apache.org by "James Peach (JIRA)" <ji...@apache.org> on 2019/07/02 07:09:00 UTC

[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail

    [ https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876736#comment-16876736 ] 

James Peach commented on MESOS-9875:
------------------------------------

{{f9330006-d885-4ef0-b2c7-c9c6fcc239e5}} is the persistence ID.
{{5fa5c810-2dd3-41cb-9633-a3ef404b08c4}} is the operation UUID.
{{honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14}} is the operation ID.

{noformat}

I0627 22:03:17.360236 3529210 slave.cpp:4282] Updated checkpointed operations from [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: OPERATION_FINISHED) ] to [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: OPERATION_FINISHED), 5fa5c810-2dd3-41cb-9633-a3ef404b08c4 (CREATE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14, latest state: OPERATION_PENDING) ]
...
I0627 22:03:17.360723 3529210 slave.cpp:8670] Updating the state of operation 'honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14' (uuid: 5fa5c810-2dd3-41cb-9633-a3ef404b08c4) for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)
...
E0627 22:03:17.365811 3529210 slave.cpp:4257] EXIT with status 1: Failed to sync checkpointed resources: Failed to create the persistent volume f9330006-d885-4ef0-b2c7-c9c6fcc239e5 at '/srv/mesos/work/volumes/roles/test-3/f9330006-d885-4ef0-b2c7-c9c6fcc239e5': Operation not permitted
{noformat}


The relevant code sequence is in {{Slave::applyOperation}}, and looks roughly like this:

{noformat}
    track the new operation

    checkpointResourceState() (1)

    apply the operation (2)
    report that the operation was applied

    checkpointResourceState() (3)
{noformat}

The operation is checkpointed as pending in (1), but no resource changes are made yet. In (2), the operation is applied by making changes to the agent resources, and it is then reported as applied, which matches the {{OPERATION_FINISHED}} update in the log above. At (3), the discrepancy with the checkpointed resources is discovered; the agent tries to create the persistent volume, fails, and exits.
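
To make the ordering problem concrete, here is a toy model of the sequence above. It is not Mesos code: the class, method, and field names are all hypothetical, and it assumes (as described above) that applying the operation only changes the agent's in-memory resources, while the volume directory is created later during the checkpoint sync.

```python
class AgentExit(Exception):
    """Stands in for the agent's fatal EXIT (hypothetical name)."""


class ToyAgent:
    def __init__(self, can_create_volumes):
        # False models the immutable volumes directory from the bug report.
        self.can_create_volumes = can_create_volumes
        self.operations = {}          # operation UUID -> latest state
        self.resources = set()        # persistence IDs in agent memory
        self.volumes_on_disk = set()  # persistence IDs that exist on disk
        self.status_updates = []      # (uuid, state) feedback, in send order

    def checkpoint_resource_state(self):
        # Sync the disk with the in-memory resources: create missing volumes.
        for volume in sorted(self.resources - self.volumes_on_disk):
            if not self.can_create_volumes:
                raise AgentExit(
                    "Failed to create the persistent volume " + volume)
            self.volumes_on_disk.add(volume)

    def apply_operation(self, uuid, volume):
        self.operations[uuid] = "OPERATION_PENDING"  # track the new operation
        self.checkpoint_resource_state()             # (1) nothing to sync yet
        self.resources.add(volume)                   # (2) in-memory change only
        self.operations[uuid] = "OPERATION_FINISHED"
        self.status_updates.append((uuid, "OPERATION_FINISHED"))  # report
        self.checkpoint_resource_state()             # (3) discrepancy found


agent = ToyAgent(can_create_volumes=False)
try:
    agent.apply_operation(
        "5fa5c810-2dd3-41cb-9633-a3ef404b08c4",   # operation UUID
        "f9330006-d885-4ef0-b2c7-c9c6fcc239e5")   # persistence ID
    outcome = "ok"
except AgentExit as error:
    outcome = str(error)

print(agent.status_updates)
print(outcome)
```

In this model the {{OPERATION_FINISHED}} feedback has already been sent by the time the sync at (3) fails, which is the same ordering the log shows.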


> Mesos did not respond correctly when operations should fail
> -----------------------------------------------------------
>
>                 Key: MESOS-9875
>                 URL: https://issues.apache.org/jira/browse/MESOS-9875
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yifan Xing
>            Priority: Major
>
> To test persistent volumes with `OPERATION_FAILED/ERROR` feedback, we SSHed into the Mesos agent and made it unable to create subdirectories in /srv/mesos/work/volumes. However, Mesos did not send any operation-failed response; instead, we received `OPERATION_FINISHED` feedback.
> Steps to recreate the issue:
> 1. SSH into a Mesos agent.
> 2. Make it impossible to create a persistent volume (we expect the agent to crash and reregister, and the master to realize that the operation is `OPERATION_DROPPED`):
>  * cd /srv/mesos/work (if it doesn't exist mkdir /srv/mesos/work/volumes)
>  * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes, constrained to use only the agent modified above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)