You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Benjamin Bannier (JIRA)" <ji...@apache.org> on 2018/01/10 14:48:00 UTC

[jira] [Commented] (MESOS-8422) Master's UpdateSlave handler not correctly updating terminated operations

    [ https://issues.apache.org/jira/browse/MESOS-8422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320357#comment-16320357 ] 

Benjamin Bannier commented on MESOS-8422:
-----------------------------------------

The issue here seems to be that resources of the resource provider sent as part of the {{UpdateSlaveMessage}} after failover already had the operation applied; applying the same operation again when an offer operation status update is received cannot work.

We should update the master to transition newly terminal offer operations when it learns about that from {{UpdateSlaveMessage}}. Since the master will only update resource state in the offer operation status update handler when the operation is _newly_ terminal, this would prevent this setup from becoming a problem.

> Master's UpdateSlave handler not correctly updating terminated operations
> -------------------------------------------------------------------------
>
>                 Key: MESOS-8422
>                 URL: https://issues.apache.org/jira/browse/MESOS-8422
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Gastón Kleiman
>            Assignee: Benjamin Bannier
>              Labels: mesosphere
>
> I created a test that verifies that operation status updates are resent to the master after being dropped en route to it (MESOS-8420).
> The test does the following:
> # Creates a volume from a RAW disk resource.
> # Drops the first `UpdateOperationStatusMessage` message from the agent to the master, so that it isn't acknowledged by the master.
> # Restarts the agent.
> # Verifies that the agent resends the operation status update.
> The good news are that the agent is resending the operation status update, the bad news are that it triggers a CHECK failure that crashes the master.
> Here are the relevant sections of the log produced by the test:
> {noformat}
> [ RUN      ] StorageLocalResourceProviderTest.ROOT_RetryOperationStatusUpdateAfterRecovery
> [...]
> I0109 16:36:08.515882 24106 master.cpp:4284] Processing ACCEPT call for offers: [ 046b3f21-6e97-4a56-9a13-773f7d481efd-O0 ] on agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (default) at scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681
> I0109 16:36:08.516487 24106 master.cpp:5260] Processing CREATE_VOLUME operation with source disk(allocated: storage)(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096 from framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (default) at scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681 to agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
> I0109 16:36:08.518704 24106 master.cpp:10622] Sending operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) to agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
> I0109 16:36:08.521210 24130 provider.cpp:504] Received APPLY_OPERATION event
> I0109 16:36:08.521276 24130 provider.cpp:1368] Received CREATE_VOLUME operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> I0109 16:36:08.523131 24432 test_csi_plugin.cpp:305] CreateVolumeRequest '{"version":{"minor":1},"name":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","capacityRange":{"requiredBytes":"4294967296","limitBytes":"4294967296"},"volumeCapabilities":[{"mount":{},"accessMode":{"mode":"SINGLE_NODE_WRITER"}}]}'
> I0109 16:36:08.525806 24152 provider.cpp:2635] Applying conversion from 'disk(allocated: storage)(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096' to 'disk(allocated: storage)(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096' for operation (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> I0109 16:36:08.528725 24134 status_update_manager_process.hpp:152] Received operation status update OPERATION_FINISHED (Status UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.529207 24134 status_update_manager_process.hpp:929] Checkpointing UPDATE for operation status update OPERATION_FINISHED (Status UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.573177 24150 http.cpp:1185] HTTP POST for /slave(2)/api/v1/resource_provider from 10.0.49.2:53598
> I0109 16:36:08.573974 24139 slave.cpp:7065] Handling resource provider message 'UPDATE_OPERATION_STATUS: (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)'
> I0109 16:36:08.574154 24139 slave.cpp:7409] Updating the state of operation ' with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)
> I0109 16:36:08.574785 24139 slave.cpp:7249] Forwarding status update of operation with no ID (operation_uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000
> I0109 16:36:08.583748 24084 slave.cpp:931] Agent terminating
> I0109 16:36:08.584115 24144 master.cpp:1305] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) disconnected
> [...]
> I0109 16:36:08.655766 24140 slave.cpp:1378] Re-registered with master master@10.0.49.2:40681
> I0109 16:36:08.655936 24117 task_status_update_manager.cpp:188] Resuming sending task status updates
> I0109 16:36:08.655995 24149 hierarchical.cpp:669] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> I0109 16:36:08.656008 24140 slave.cpp:1423] Forwarding agent update {"operations":{},"resource_version_uuid":{"value":"icuAKyO6TymMt2Y9vyF6Jg=="},"slave_id":{"value":"046b3f21-6e97-4a56-9a13-773f7d481efd-S0"},"update_oversubscribed_resources":true}
> I0109 16:36:08.656121 24149 hierarchical.cpp:754] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 reactivated
> W0109 16:36:08.656481 24113 master.cpp:7277] !!!! update slave message: slave_id {
>   value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
> }
> update_oversubscribed_resources: true
> operations {
> }
> resource_version_uuid {
>   value: "\211\313\200+#\272O)\214\267f=\277!z&"
> }
> I0109 16:36:08.656637 24113 master.cpp:7320] Received update of agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(3)@10.0.49.2:40681 (core-dev) with total oversubscribed resources {}
> W0109 16:36:08.657387 24113 master.cpp:7704] Performing explicit reconciliation with agent for known operation 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 since it was not present in original reconciliation message from agent
> I0109 16:36:08.657917 24133 hierarchical.cpp:669] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> W0109 16:36:08.658048 24125 manager.cpp:472] Dropping operation reconciliation message with operation_uuid 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 because resource provider 605b22f5-e39d-4d9f-950a-e7f44d202c01 is not subscribed
> I0109 16:36:08.658609 24143 container_daemon.cpp:119] Launching container 'org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-slrp_test--CONTROLLER_SERVICE-NODE_SERVICE'
> [...]
> I0109 16:36:08.689859 24130 provider.cpp:3066] Sending UPDATE_STATE call with resources 'disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096' and 1 operations to agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.690449 24130 provider.cpp:1042] Resource provider 605b22f5-e39d-4d9f-950a-e7f44d202c01 is in READY state
> I0109 16:36:08.690491 24105 status_update_manager_process.hpp:385] Resuming operation status update manager
> I0109 16:36:08.690640 24105 status_update_manager_process.hpp:394] Sending operation status update OPERATION_FINISHED (Status UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.693244 24131 http.cpp:1185] HTTP POST for /slave(3)/api/v1/resource_provider from 10.0.49.2:53606
> I0109 16:36:08.693912 24140 http.cpp:1185] HTTP POST for /slave(3)/api/v1/resource_provider from 10.0.49.2:53606
> I0109 16:36:08.693974 24115 manager.cpp:677] Received UPDATE_STATE call with resources '[{"disk":{"source":{"id":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","metadata":{"labels":[{"key":"path","value":"\/tmp\/n5thZ3\/test\/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"}]},"mount":{"root":".\/csi\/org.apache.mesos.csi.test\/slrp_test\/mounts\/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"},"profile":"volume-default","type":"MOUNT"}},"name":"disk","provider_id":{"value":"605b22f5-e39d-4d9f-950a-e7f44d202c01"},"reservations":[{"role":"storage","type":"DYNAMIC"}],"scalar":{"value":4096.0},"type":"SCALAR"}]' and 1 operations from resource provider 605b22f5-e39d-4d9f-950a-e7f44d202c01
> I0109 16:36:08.694897 24144 slave.cpp:7065] Handling resource provider message 'UPDATE_STATE: 605b22f5-e39d-4d9f-950a-e7f44d202c01 disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
> I0109 16:36:08.695184 24144 slave.cpp:7182] Forwarding new total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]; disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096
> I0109 16:36:08.696467 24144 slave.cpp:7065] Handling resource provider message 'UPDATE_OPERATION_STATUS: (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)'
> I0109 16:36:08.696594 24144 slave.cpp:7409] Updating the state of operation ' with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)
> I0109 16:36:08.696666 24144 slave.cpp:7249] Forwarding status update of operation with no ID (operation_uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000
> W0109 16:36:08.697093 24142 master.cpp:7277] !!!! update slave message: slave_id {
>   value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
> }
> update_oversubscribed_resources: false
> operations {
> }
> resource_version_uuid {
>   value: "\211\313\200+#\272O)\214\267f=\277!z&"
> }
> resource_providers {
>   providers {
>     info {
>       id {
>         value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
>       }
>       type: "org.apache.mesos.rp.local.storage"
>       name: "test"
>       default_reservations {
>         role: "storage"
>         type: DYNAMIC
>       }
>       storage {
>         plugin {
>           type: "org.apache.mesos.csi.test"
>           name: "slrp_test"
>           containers {
>             [...]
>           }
>         }
>       }
>     }
>     total_resources {
>       name: "disk"
>       type: SCALAR
>       scalar {
>         value: 4096
>       }
>       disk {
>         source {
>           type: MOUNT
>           mount {
>             root: "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
>           }
>           id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
>           metadata {
>             labels {
>               key: "path"
>               value: "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
>             }
>           }
>           profile: "volume-default"
>         }
>       }
>       provider_id {
>         value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
>       }
>       reservations {
>         role: "storage"
>         type: DYNAMIC
>       }
>     }
>     operations {
>       operations {
>         framework_id {
>           value: "046b3f21-6e97-4a56-9a13-773f7d481efd-0000"
>         }
>         slave_id {
>           value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
>         }
>         info {
>           type: CREATE_VOLUME
>           create_volume {
>             source {
>               name: "disk"
>               type: SCALAR
>               scalar {
>                 value: 4096
>               }
>               disk {
>                 source {
>                   type: RAW
>                   profile: "volume-default"
>                 }
>               }
>               allocation_info {
>                 role: "storage"
>               }
>               provider_id {
>                 value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
>               }
>               reservations {
>                 role: "storage"
>                 type: DYNAMIC
>               }
>             }
>             target_type: MOUNT
>           }
>         }
>         latest_status {
>           state: OPERATION_FINISHED
>           converted_resources {
>             name: "disk"
>             type: SCALAR
>             scalar {
>               value: 4096
>             }
>             disk {
>               source {
>                 type: MOUNT
>                 mount {
>                   root: "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
>                 }
>                 id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
>                 metadata {
>                   labels {
>                     key: "path"
>                     value: "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
>                   }
>                 }
>                 profile: "volume-default"
>               }
>             }
>             allocation_info {
>               role: "storage"
>             }
>             provider_id {
>               value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
>             }
>             reservations {
>               role: "storage"
>               type: DYNAMIC
>             }
>           }
>           uuid {
>             value: "\014y\315\362\270\235E;\273bWvn\226\215\320"
>           }
>         }
>         statuses {
>           state: OPERATION_FINISHED
>           converted_resources {
>             name: "disk"
>             type: SCALAR
>             scalar {
>               value: 4096
>             }
>             disk {
>               source {
>                 type: MOUNT
>                 mount {
>                   root: "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
>                 }
>                 id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
>                 metadata {
>                   labels {
>                     key: "path"
>                     value: "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
>                   }
>                 }
>                 profile: "volume-default"
>               }
>             }
>             allocation_info {
>               role: "storage"
>             }
>             provider_id {
>               value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
>             }
>             reservations {
>               role: "storage"
>               type: DYNAMIC
>             }
>           }
>           uuid {
>             value: "\014y\315\362\270\235E;\273bWvn\226\215\320"
>           }
>         }
>         uuid {
>           value: "\030\264\304\245\321bM\317\273!\241<n\340\364\010"
>         }
>       }
>     }
>     resource_version_uuid {
>       value: "M\250\313j\320\301IG\262\0164e\004\367\304\333"
>     }
>   }
> }
> I0109 16:36:08.700137 24142 master.cpp:10411] Updating the state of operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) of framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)
> I0109 16:36:08.700417 24146 hierarchical.cpp:669] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total resources disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> F0109 16:36:08.700610 24142 master.cpp:11687] CHECK_SOME(resources): disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096; cpus:2; mem:1024; disk:1024; ports:[31000-32000] does not contain disk(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096
> *** Check failure stack trace: ***
> F0109 16:36:08.700896 24146 hierarchical.cpp:908] CHECK_SOME(updatedTotal): disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096; cpus:2; mem:1024; disk:1024; ports:[31000-32000] does not contain disk(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096
> *** Check failure stack trace: ***
>     @     0x7ff06d3bbe7e  (unknown)
>     @     0x7ff06d3bbe7e  (unknown)
>     @     0x7ff06d3bbddd  (unknown)
>     @     0x7ff06d3bbddd  (unknown)
>     @     0x7ff06d3bb7ee  (unknown)
>     @     0x7ff06d3bb7ee  (unknown)
>     @     0x7ff06d3be522  (unknown)
>     @     0x55c1c6c2be77  _ZTSN6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureISt4listIN5mesos5slave13QoSCorrectionESaISF_EEEEEclINS0_IFSI_vEEEEESI_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISH_EESt14default_deleteISU_EEOSM_S3_E_ISX_SM_St12_PlaceholderILi1EEEEEEE
>     @     0x7ff06d3be522  (unknown)
>     @     0x55c1c6c2be77  _ZTSN6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureISt4listIN5mesos5slave13QoSCorrectionESaISF_EEEEEclINS0_IFSI_vEEEEESI_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISH_EESt14default_deleteISU_EEOSM_S3_E_ISX_SM_St12_PlaceholderILi1EEEEEEE
>     @     0x7ff06b729277  (unknown)
>     @     0x55c1c6f3be8a  _ZTSN6lambda12CallableOnceIFvRK6ResultIN5mesos2v117resource_provider5EventEEEE10CallableFnINS_8internal7PartialIZNK7process6FutureIS6_E7onReadyISt5_BindIFSt7_Mem_fnIMSG_FbS8_EESG_St12_PlaceholderILi1EEEEbEERKSG_OT_NSG_6PreferEEUlOSQ_S8_E_ISQ_SO_EEEEE
> {noformat}
> We can see that once the SLRP reregisters with the agent, the following happens:
> # The agent will send an {{UpdateSlave}} message to the master including the converted resources and the {{CREATE_VOLUME}} operation with the status {{OPERATION_FINISHED}}.
> # The master will update the agent's resources, including the volume created by the operation.
> # The agent will resend the operation status update.
> # The master will try to apply the operation and crash, because it already updated the agent's resources on step #2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)