Posted to issues@mesos.apache.org by "Alexander Rukletsov (JIRA)" <ji...@apache.org> on 2017/10/06 12:29:00 UTC

[jira] [Commented] (MESOS-7997) ContentType/MasterAPITest.CreateAndDestroyVolumes is flaky.

    [ https://issues.apache.org/jira/browse/MESOS-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194534#comment-16194534 ] 

Alexander Rukletsov commented on MESOS-7997:
--------------------------------------------

I was able to reproduce an instance of this issue locally with some extra logging added. As we can see below, sometimes the {{updateAvailable}} call can sneak in between and lead to an offer without the persistent volume, even though we know the volume has been created. Compare the offer from a "bad" run:
{noformat}
[{"allocation_info":{"role":"role1"},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"cpus","scalar":{"value":8.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"mem","scalar":{"value":15360.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"}]
Allocated: disk(allocated: role1)(reservations: [(STATIC,role1)])[id1:path1]:64
{noformat}
and from a "good" run:
{noformat}
[{"allocation_info":{"role":"role1"},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":960.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"cpus","scalar":{"value":8.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"mem","scalar":{"value":15360.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"},{"allocation_info":{"role":"role1"},"disk":{"persistence":{"id":"id1","principal":"test-principal"},"volume":{"container_path":"path1","mode":"RW"}},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":64.0},"type":"SCALAR"}]
Allocated: disk(allocated: role1)(reservations: [(STATIC,role1)])[id1:path1]:64
{noformat}
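For context, the assertion that fails (quoted verbatim in the issue description below) is a containment check on the offered resources: the "bad" offer still carries the full 1024 MB of disk with no {{DiskInfo}}, so the allocated 64 MB volume is not contained in it. A minimal restatement, assuming the usual GoogleTest macros used throughout the Mesos test suite (whether the test uses {{EXPECT_TRUE}} or {{ASSERT_TRUE}}, the check is the same):
{code}
// The expectation from ../../src/tests/api_tests.cpp (see the failure output
// in the issue description below): the offer must contain the persistent
// volume allocated to the framework's role. In the "bad" run the 64 MB disk
// with persistence id "id1" is absent from the offer, so contains() is false.
EXPECT_TRUE(Resources(offer.resources()).contains(
    allocatedResources(volume, frameworkInfo.role())));
{code}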

A closer look at a "bad" run reveals a race. An agent [sends an update|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1260-L1265] with its total resources after it has registered with the master, and this update can arrive at the master _after_ the persistent volume has been created. We can fix the test by waiting for the {{UpdateSlaveMessage}} to arrive before requesting the volume (see the sketch after the log excerpt below), but I think we should fix the race in the agent-master communication itself. The regression was likely introduced in [2af9a5b07dc80151154264e974d03f56a1c25838|https://github.com/apache/mesos/commit/2af9a5b07dc80151154264e974d03f56a1c25838]. The relevant log excerpt from the "bad" run:
{noformat}
I1005 17:30:05.343058 2138112 hierarchical.cpp:593] Added agent c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 (localhost) with disk(reservations: [(STATIC,role1)]):1024; cpus:8; mem:15360; ports:[31000-32000] (allocated: {})
I1005 17:30:05.343085 528384 status_update_manager.cpp:184] Resuming sending status updates
I1005 17:30:05.342851 4284416 master.cpp:6032] Registered agent c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330 (localhost) with [{"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"name":"cpus","scalar":{"value":8.0},"type":"SCALAR"},{"name":"mem","scalar":{"value":15360.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"}]
I1005 17:30:05.343230 2138112 hierarchical.cpp:1945] No allocations performed
I1005 17:30:05.343252 2138112 hierarchical.cpp:1488] Performed allocation for 1 agents in 94us
I1005 17:30:05.344882 3747840 process.cpp:3929] Handling HTTP event for process 'master' with path: '/master/api/v1'
I1005 17:30:05.345293 2674688 slave.cpp:1213] Checkpointing SlaveInfo to '/var/folders/h3/8j18s1cx2bn78ms99d3lz4jh0000gq/T/ContentType_MasterAPITest_CreateAndDestroyVolumes_1_RWSUpI/meta/slaves/c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0/slave.info'
I1005 17:30:05.345422 2138112 http.cpp:1185] HTTP POST for /master/api/v1 from 127.0.0.1:61576
I1005 17:30:05.345844 2138112 http.cpp:673] Processing call CREATE_VOLUMES
I1005 17:30:05.346055 2138112 master.cpp:3758] Authorizing principal 'test-principal' to create volumes '[{"disk":{"persistence":{"id":"id1","principal":"test-principal"},"volume":{"container_path":"path1","mode":"RW"}},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":64.0},"type":"SCALAR"}]'
I1005 17:30:05.347273 4284416 master.cpp:9335] Sending updated checkpointed resources disk(reservations: [(STATIC,role1)])[id1:path1]:64 to agent c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330 (localhost)
W1005 17:30:05.347931 2025730048 sched.cpp:1711] 
**************************************************
Scheduler driver bound to loopback interface! Cannot communicate with remote master(s). You might want to set 'LIBPROCESS_IP' environment variable to use a routable IP address.
**************************************************
W1005 17:30:05.347995 2025730048 process.cpp:3194] Attempted to spawn already running process version@127.0.0.1:60330
I1005 17:30:05.348299 2674688 slave.cpp:1263] Forwarding total resources disk(reservations: [(STATIC,role1)]):1024; cpus:8; mem:15360; ports:[31000-32000]
I1005 17:30:05.348356 2674688 slave.cpp:1272] Forwarding total oversubscribed resources {}
I1005 17:30:05.348551 2138112 master.cpp:6804] Received update of agent c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330 (localhost) with total resources disk(reservations: [(STATIC,role1)]):1024; cpus:8; mem:15360; ports:[31000-32000]
I1005 17:30:05.348595 2138112 master.cpp:6817] Received update of agent c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330 (localhost) with total oversubscribed resources {}
I1005 17:30:05.349814 2674688 slave.cpp:3535] Updated checkpointed resources from {} to disk(reservations: [(STATIC,role1)])[id1:path1]:64
{noformat}
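As mentioned above, a test-side workaround is to wait for the agent's {{UpdateSlaveMessage}} to reach the master before issuing {{CREATE_VOLUMES}}, so the update can no longer overwrite the volume. Below is only a rough sketch using the standard libprocess test helpers ({{FUTURE_PROTOBUF}}, {{AWAIT_READY}}); the exact variable names and surrounding setup in {{api_tests.cpp}} may differ:
{code}
// Sketch of the test-side workaround (not the agent-master fix): expect the
// agent's UpdateSlaveMessage before starting the agent, then block until the
// master has received it before sending the CREATE_VOLUMES call.
Future<UpdateSlaveMessage> updateSlaveMessage =
  FUTURE_PROTOBUF(UpdateSlaveMessage(), _, master.get()->pid);

Owned<MasterDetector> detector = master.get()->createDetector();
Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), slaveFlags);
ASSERT_SOME(slave);

// The agent forwards its total resources right after registration (see the
// slave.cpp link above); waiting here guarantees the master has processed
// that update before we create the persistent volume.
AWAIT_READY(updateSlaveMessage);

// ... now send the v1 CREATE_VOLUMES call as the test already does.
{code}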

> ContentType/MasterAPITest.CreateAndDestroyVolumes is flaky.
> -----------------------------------------------------------
>
>                 Key: MESOS-7997
>                 URL: https://issues.apache.org/jira/browse/MESOS-7997
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.5.0
>         Environment: Ubuntu 17.04 with SSL
>            Reporter: Alexander Rukletsov
>              Labels: flaky-test, mesosphere
>         Attachments: CreateAndDestroyVolumes_badrun.txt, CreateAndDestroyVolumes_goodrun.txt
>
>
> Observed a failure on the internal CI:
> {noformat}
> ../../src/tests/api_tests.cpp:3052
> Value of: Resources(offer.resources()).contains( allocatedResources(volume, frameworkInfo.role()))
>   Actual: false
> Expected: true
> {noformat}
> Full log attached.


