You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Jan Schlicht <ja...@mesosphere.io> on 2019/03/11 15:39:04 UTC

Review Request 70184: Fixed flakiness in 'RetryRpcWithExponentialBackoff'.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70184/
-----------------------------------------------------------

Review request for mesos and Chun-Hung Hsiao.


Summary (updated)
-----------------

Fixed flakiness in 'RetryRpcWithExponentialBackoff'.


Bugs: MESOS-9594
    https://issues.apache.org/jira/browse/MESOS-9594


Repository: mesos


Description (updated)
-------

Under some circumstances, offers would be filtered, resulting in the
test being stuck while waiting for offers. This has been resolved by
removing all offer filters before waiting for offers.


Diffs (updated)
-----

  src/tests/storage_local_resource_provider_tests.cpp 7945384867f26fa15dc734a235ae509d5d6d350f 


Diff: https://reviews.apache.org/r/70184/diff/1/


Testing (updated)
-------

make check

I could not reproduce the flaky behavior of this test case, hence only assume that this patch resolves the flakiness.


Thanks,

Jan Schlicht


Re: Review Request 70184: Fixed flakiness in 'RetryRpcWithExponentialBackoff'.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

> On March 27, 2019, 5:10 p.m., Chun-Hung Hsiao wrote:
> > src/tests/storage_local_resource_provider_tests.cpp
> > Lines 4926 (patched)
> > <https://reviews.apache.org/r/70184/diff/1/?file=2130933#file2130933line4926>
> >
> >     I'm not sure if this is the most appropriate fix before understanding why the offer was filtered. Can you elaborate more on why this is needed?

Okay I think I now understand what happened here.

It was not because that the offer filter prevented the framework from getting the offer. No filter is set to refuse the newly created volume.

What actually happened was that, after the master received `OPERATION_FINISHED`, it would call `allocator->updateAllocation(...)` and `allocator->recoverResources(...)`:
https://github.com/apache/mesos/blob/cb4c9660eecae59bc2acbeb37ab19214f7b0e19b/src/master/master.cpp#L11962-L11972
Since these steps were done asynchronously, they raced with the clock advanced that triggered the batch allocation after L4926 in this test.
So in the failed run, the batch allocation was done before the resource (i.e., the created volume) was recovered, so the offer allocated to the framework did not have any "new" resource and thus filtered.
This can be reproduced by adding an `os::sleep(Milliseconds(10))` right before `allocator->updateAllocation(...)` above.

Adding a `REVIVE` call does the trick, because the revive call, which was processed sequentially *after* `OPERATION_FINISHED`, would asynchronuosly queue up an event-based allocation after the above two allocator calls.
So even if the race happens, it ensures that another allocation would be triggered after the created volume is recovered.
However it seems to me that a more proper fix is the following:
```
   EXPECT_EQ(OPERATION_FINISHED, updateOperationStatus->status().state());
-  // Advance the clock to trigger a batch allocation.
+  // Settle the clock to recover the created disk then advance the clock to
+  // trigger a batch allocation.
+  Clock::settle();
   Clock::advance(masterFlags.allocation_interval);

   AWAIT_READY(offers);
```


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70184/#review214121
-----------------------------------------------------------


On March 11, 2019, 3:39 p.m., Jan Schlicht wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70184/
> -----------------------------------------------------------
> 
> (Updated March 11, 2019, 3:39 p.m.)
> 
> 
> Review request for mesos and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9594
>     https://issues.apache.org/jira/browse/MESOS-9594
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Under some circumstances, offers would be filtered, resulting in the
> test being stuck while waiting for offers. This has been resolved by
> removing all offer filters before waiting for offers.
> 
> 
> Diffs
> -----
> 
>   src/tests/storage_local_resource_provider_tests.cpp 7945384867f26fa15dc734a235ae509d5d6d350f 
> 
> 
> Diff: https://reviews.apache.org/r/70184/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> I could not reproduce the flaky behavior of this test case, hence only assume that this patch resolves the flakiness.
> 
> 
> Thanks,
> 
> Jan Schlicht
> 
>


Re: Review Request 70184: Fixed flakiness in 'RetryRpcWithExponentialBackoff'.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70184/#review214121
-----------------------------------------------------------




src/tests/storage_local_resource_provider_tests.cpp
Lines 4926 (patched)
<https://reviews.apache.org/r/70184/#comment300291>

    I'm not sure if this is the most appropriate fix before understanding why the offer was filtered. Can you elaborate more on why this is needed?


- Chun-Hung Hsiao


On March 11, 2019, 3:39 p.m., Jan Schlicht wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70184/
> -----------------------------------------------------------
> 
> (Updated March 11, 2019, 3:39 p.m.)
> 
> 
> Review request for mesos and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9594
>     https://issues.apache.org/jira/browse/MESOS-9594
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Under some circumstances, offers would be filtered, resulting in the
> test being stuck while waiting for offers. This has been resolved by
> removing all offer filters before waiting for offers.
> 
> 
> Diffs
> -----
> 
>   src/tests/storage_local_resource_provider_tests.cpp 7945384867f26fa15dc734a235ae509d5d6d350f 
> 
> 
> Diff: https://reviews.apache.org/r/70184/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> I could not reproduce the flaky behavior of this test case, hence only assume that this patch resolves the flakiness.
> 
> 
> Thanks,
> 
> Jan Schlicht
> 
>


Re: Review Request 70184: Fixed flakiness in 'RetryRpcWithExponentialBackoff'.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70184/#review213678
-----------------------------------------------------------



PASS: Mesos patch 70184 was successfully built and tested.

Reviews applied: `['70184']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2933/mesos-review-70184

- Mesos Reviewbot Windows


On March 11, 2019, 8:39 a.m., Jan Schlicht wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70184/
> -----------------------------------------------------------
> 
> (Updated March 11, 2019, 8:39 a.m.)
> 
> 
> Review request for mesos and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9594
>     https://issues.apache.org/jira/browse/MESOS-9594
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Under some circumstances, offers would be filtered, resulting in the
> test being stuck while waiting for offers. This has been resolved by
> removing all offer filters before waiting for offers.
> 
> 
> Diffs
> -----
> 
>   src/tests/storage_local_resource_provider_tests.cpp 7945384867f26fa15dc734a235ae509d5d6d350f 
> 
> 
> Diff: https://reviews.apache.org/r/70184/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> I could not reproduce the flaky behavior of this test case, hence only assume that this patch resolves the flakiness.
> 
> 
> Thanks,
> 
> Jan Schlicht
> 
>


Re: Review Request 70184: Fixed flakiness in 'RetryRpcWithExponentialBackoff'.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70184/#review215120
-----------------------------------------------------------


Ship it!




Ship It!

- Chun-Hung Hsiao


On May 6, 2019, 9:48 a.m., Jan Schlicht wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70184/
> -----------------------------------------------------------
> 
> (Updated May 6, 2019, 9:48 a.m.)
> 
> 
> Review request for mesos and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9594
>     https://issues.apache.org/jira/browse/MESOS-9594
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Under some circumstances, offers would be filtered, resulting in the
> test being stuck while waiting for offers. This has been resolved by
> settling the clock before accepting new offers.
> 
> 
> Diffs
> -----
> 
>   src/tests/storage_local_resource_provider_tests.cpp ec2222d7aeef1cb3d1a5b4b3419dfd912d41a8c6 
> 
> 
> Diff: https://reviews.apache.org/r/70184/diff/2/
> 
> 
> Testing
> -------
> 
> make check
> 
> I could not reproduce the flaky behavior of this test case, hence only assume that this patch resolves the flakiness.
> 
> 
> Thanks,
> 
> Jan Schlicht
> 
>


Re: Review Request 70184: Fixed flakiness in 'RetryRpcWithExponentialBackoff'.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70184/#review215060
-----------------------------------------------------------



Patch looks great!

Reviews applied: [70184]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers --disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On May 6, 2019, 9:48 a.m., Jan Schlicht wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70184/
> -----------------------------------------------------------
> 
> (Updated May 6, 2019, 9:48 a.m.)
> 
> 
> Review request for mesos and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9594
>     https://issues.apache.org/jira/browse/MESOS-9594
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Under some circumstances, offers would be filtered, resulting in the
> test being stuck while waiting for offers. This has been resolved by
> settling the clock before accepting new offers.
> 
> 
> Diffs
> -----
> 
>   src/tests/storage_local_resource_provider_tests.cpp ec2222d7aeef1cb3d1a5b4b3419dfd912d41a8c6 
> 
> 
> Diff: https://reviews.apache.org/r/70184/diff/2/
> 
> 
> Testing
> -------
> 
> make check
> 
> I could not reproduce the flaky behavior of this test case, hence only assume that this patch resolves the flakiness.
> 
> 
> Thanks,
> 
> Jan Schlicht
> 
>


Re: Review Request 70184: Fixed flakiness in 'RetryRpcWithExponentialBackoff'.

Posted by Jan Schlicht <ja...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70184/
-----------------------------------------------------------

(Updated May 6, 2019, 11:48 a.m.)


Review request for mesos and Chun-Hung Hsiao.


Changes
-------

Addressed issues.


Bugs: MESOS-9594
    https://issues.apache.org/jira/browse/MESOS-9594


Repository: mesos


Description (updated)
-------

Under some circumstances, offers would be filtered, resulting in the
test being stuck while waiting for offers. This has been resolved by
settling the clock before accepting new offers.


Diffs (updated)
-----

  src/tests/storage_local_resource_provider_tests.cpp ec2222d7aeef1cb3d1a5b4b3419dfd912d41a8c6 


Diff: https://reviews.apache.org/r/70184/diff/2/

Changes: https://reviews.apache.org/r/70184/diff/1-2/


Testing
-------

make check

I could not reproduce the flaky behavior of this test case, hence only assume that this patch resolves the flakiness.


Thanks,

Jan Schlicht


Re: Review Request 70184: Fixed flakiness in 'RetryRpcWithExponentialBackoff'.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70184/#review213597
-----------------------------------------------------------



Patch looks great!

Reviews applied: [70184]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers --disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On March 11, 2019, 3:39 p.m., Jan Schlicht wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70184/
> -----------------------------------------------------------
> 
> (Updated March 11, 2019, 3:39 p.m.)
> 
> 
> Review request for mesos and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9594
>     https://issues.apache.org/jira/browse/MESOS-9594
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Under some circumstances, offers would be filtered, resulting in the
> test being stuck while waiting for offers. This has been resolved by
> removing all offer filters before waiting for offers.
> 
> 
> Diffs
> -----
> 
>   src/tests/storage_local_resource_provider_tests.cpp 7945384867f26fa15dc734a235ae509d5d6d350f 
> 
> 
> Diff: https://reviews.apache.org/r/70184/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> I could not reproduce the flaky behavior of this test case, hence only assume that this patch resolves the flakiness.
> 
> 
> Thanks,
> 
> Jan Schlicht
> 
>