Posted to reviews@mesos.apache.org by Chun-Hung Hsiao <ch...@apache.org> on 2018/06/14 03:47:40 UTC

Review Request 67596: Fixed the flakiness in the `NVIDIA_GPU_NvidiaDockerImage` test.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67596/
-----------------------------------------------------------

Review request for mesos, Jie Yu, Joseph Wu, and Kevin Klues.


Bugs: MESOS-6622
    https://issues.apache.org/jira/browse/MESOS-6622


Repository: mesos


Description
-------

This test is flaky because it tries to download the 1GB 'nvidia/cuda'
image from Docker Hub, which might take more than 1 minute and prevent
the command executor from registering in time.

This patch fixes this problem by using the default executor, which does
not wait for task images to be fetched before registering. If the image
fetch stalls (i.e. makes no progress) for more than 1 minute, the
container will fail because of the `--fetcher_stall_timeout` agent flag.

The time we wait for `TASK_FINISHED` is also extended to 180 seconds.
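
To illustrate the distinction the description relies on, here is a minimal, self-contained sketch of a stall timeout versus a fixed deadline. It is not the Mesos fetcher's implementation: `ProgressSample`, `stalled`, and `exceedsDeadline` are hypothetical names used only for this example, and "stall" is modelled simply as the byte count not growing.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical type: one observation of a download, i.e. the time elapsed
// since the download started and the total bytes received so far.
struct ProgressSample {
  std::chrono::seconds elapsed;
  std::uint64_t bytesReceived;
};

// Returns true if the byte count failed to grow for longer than
// `stallTimeout`. Samples are assumed to be ordered by time.
bool stalled(const std::vector<ProgressSample>& samples,
             std::chrono::seconds stallTimeout)
{
  assert(stallTimeout.count() > 0);

  if (samples.empty()) {
    return false;
  }

  std::chrono::seconds lastProgress = samples.front().elapsed;
  std::uint64_t lastBytes = samples.front().bytesReceived;

  for (const ProgressSample& sample : samples) {
    if (sample.bytesReceived > lastBytes) {
      // Progress was made: restart the stall window.
      lastBytes = sample.bytesReceived;
      lastProgress = sample.elapsed;
    } else if (sample.elapsed - lastProgress > stallTimeout) {
      return true;
    }
  }

  return false;
}

// A fixed deadline, by contrast, rejects any download whose total
// duration exceeds the limit, even if it is progressing steadily.
bool exceedsDeadline(const std::vector<ProgressSample>& samples,
                     std::chrono::seconds deadline)
{
  return !samples.empty() &&
         samples.back().elapsed - samples.front().elapsed > deadline;
}
```

Under this model, a download that takes two minutes but keeps receiving bytes would exceed a 60-second deadline without ever counting as stalled, which is the behaviour the test wants from a slow but healthy image pull.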


Diffs
-----

  src/tests/containerizer/nvidia_gpu_isolator_tests.cpp d8c3e6d08a70bd129d8ac9c336be7a2bf7a4b0b2 


Diff: https://reviews.apache.org/r/67596/diff/1/


Testing
-------

sudo make check


Thanks,

Chun-Hung Hsiao


Re: Review Request 67596: Fixed the flakiness in the `NVIDIA_GPU_NvidiaDockerImage` test.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67596/#review204759
-----------------------------------------------------------



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['67596']`

Failed command: `Start-MesosCITesting`

All the build artifacts are available at: http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/67596

Relevant logs:

- [mesos-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/67596/logs/mesos-tests-stdout.log):

```
[       OK ] Endpoint/SlaveEndpointTest.NoAuthorizer/2 (116 ms)
[----------] 9 tests from Endpoint/SlaveEndpointTest (1071 ms total)

[----------] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest
[ RUN      ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0
[       OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0 (35 ms)
[ RUN      ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1
[       OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1 (40 ms)
[----------] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest (77 ms total)

[----------] 1 test from IsolationFlag/CpuIsolatorTest
[ RUN      ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0
[       OK ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0 (948 ms)
[----------] 1 test from IsolationFlag/CpuIsolatorTest (972 ms total)

[----------] 1 test from IsolationFlag/MemoryIsolatorTest
[ RUN      ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0
[       OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (947 ms)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest (980 ms total)

[----------] Global test environment tear-down
[==========] 988 tests from 97 test cases ran. (499224 ms total)
[  PASSED  ] 987 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SlaveTest.RestartSlaveRequireExecutorAuthentication

 1 FAILED TEST
  YOU HAVE 220 DISABLED TESTS

```

- [mesos-tests-stderr.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/67596/logs/mesos-tests-stderr.log):

```
I0614 05:15:00.428949  7368 slave.cpp:3939] Shutting down framework f9a51d4a-3627-4a63-88a1-3dabd4550477-0000
I0614 05:15:00.428949  7368 slave.cpp:6660] Shutting down executor 'e7c13234-6d2f-4a58-959b-25de9b872617' of framework f9a51d4a-3627-4a63-88a1-3dabd4550477-0000 at executor(1)@192.10.1.5:65477
I0614 05:15:00.430958  7368 slave.cpp:931] Agent terminating
W0614 05:15:00.430958  7368 slave.cpp:3935] Ignoring shutdown framework f9a51d4a-3627-4a63-88a1-3dabd4550477-0000 because it is terminating
I0614 05:15:00.431969  9044 master.cpp:10962] Removing task e7c13234-6d2f-4a58-959b-25de9b872617 with resources cpus(allocated: *):4; mem(allocated: *):2048; disk(allocated: *):1024; ports(allocated: *):[31000-32000] of framework f9a51d4a-3627-4a63-88a1-3dabd4550477-0000 on agent f9a51d4a-3627-4a63-88a1-3dabd4550477-S0 at slave(449)@192.10.1.5:65456 (windows-01.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net)
I0614 05:15:00.155944  6992 exec.cpp:162] Version: 1.7.0
I0614 05:15:00.181944  9176 exec.cpp:236] Executor registered on agent f9a51d4a-3627-4a63-88a1-3dabd4550477-S0
I0614 05:15:00.186944  8916 executor.cpp:178] Received SUBSCRIBED event
I0614 05:15:00.191944  8916 executor.cpp:182] Subscribed executor on windows-01.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net
I0614 05:15:00.191944  8916 executor.cpp:178] Received LAUNCH event
I0614 05:15:00.196947  8916 executor.cpp:665] Starting task e7c13234-6d2f-4a58-959b-25de9b872617
I0614 05:15:00.280943  8916 executor.cpp:485] Running 'D:\DCOS\mesos\src\mesos-containerizer.exe launch <POSSIBLY-SENSITIVE-DATA>'
I0614 05:15:00.393949  8916 executor.cpp:678] Forked command at 872
I0614 05:15:00.431969  7660 exec.cpp:445] Executor asked to shutdown
I0614 05:15:00.433944  8916 executor.cpp:178] Received SHUTDOWN event
I0614 05:15:00.433944  8916 executor.cpp:781] Shutting down
I0614 05:15:00.433944  8916 executor.cpp:894] Sending SIGTERM to process tree at pid 872
I0614 05:15:00.437954  9044 master.cpp:1293] Agent f9a51d4a-3627-4a63-88a1-3dabd4550477-S0 at slave(449)@192.10.1.5:65456 (windows-01.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net) disconnected
I0614 05:15:00.437954  9044 master.cpp:3303] Disconnecting agent f9a51d4a-3627-4a63-88a1-3dabd4550477-S0 at slave(449)@192.10.1.5:65456 (windows-01.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net)
I0614 05:15:00.437954  9044 master.cpp:3322] Deactivating agent f9a51d4a-3627-4a63-88a1-3dabd4550477-S0 at slave(449)@192.10.1.5:65456 (windows-01.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net)
I0614 05:15:00.438943  4204 containerizer.cpp:2405] Destroying container a485ff4d-78d0-4d1a-ac6d-0921e2270d50 in RUNNING state
I0614 05:15:00.438943  5712 hierarchical.cpp:344] Removed framework f9a51d4a-3627-4a63-88a1-3dabd4550477-0000
I0614 05:15:00.438943  4204 containerizer.cpp:3019] Transitioning the state of container a485ff4d-78d0-4d1a-ac6d-0921e2270d50 from RUNNING to DESTROYING
I0614 05:15:00.438943  5712 hierarchical.cpp:766] Agent f9a51d4a-3627-4a63-88a1-3dabd4550477-S0 deactivated
I0614 05:15:00.440083  4204 launcher.cpp:155] Asked to destroy container a485ff4d-78d0-4d1a-ac6d-0921e2270d50
I0614 05:15:00.504943  5712 containerizer.cpp:2858] Container a485ff4d-78d0-4d1a-ac6d-0921e2270d50 has exited
I0614 05:15:00.539944  8364 master.cpp:1135] Master terminating
I0614 05:15:00.543944  7992 hierarchical.cpp:609] Removed agent f9a51d4a-3627-4a63-88a1-3dabd4550477-S0
I0614 05:15:00.869982  6420 process.cpp:940] Stopped the socket accept loop
```

- Mesos Reviewbot Windows




Re: Review Request 67596: Fixed the flakiness in the `NVIDIA_GPU_NvidiaDockerImage` test.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67596/
-----------------------------------------------------------

(Updated June 15, 2018, 11:29 p.m.)


Review request for mesos, Jie Yu, Joseph Wu, and Kevin Klues.


Changes
-------

Improved test isolation.


Bugs: MESOS-6622
    https://issues.apache.org/jira/browse/MESOS-6622


Repository: mesos


Description (updated)
-------

This test is flaky because it tries to download the 1GB 'nvidia/cuda'
image from Docker Hub, which might take more than 1 minute and prevent
the command executor from registering in time.

This patch fixes this problem by using the default executor, which does
not wait for task images to be fetched before registering. If the image
fetch stalls for more than 1 minute, the container will fail because of
the `--fetcher_stall_timeout` agent flag.

The time we wait for `TASK_FINISHED` is also extended to 180 seconds.


Diffs (updated)
-----

  src/tests/containerizer/nvidia_gpu_isolator_tests.cpp d8c3e6d08a70bd129d8ac9c336be7a2bf7a4b0b2 


Diff: https://reviews.apache.org/r/67596/diff/2/

Changes: https://reviews.apache.org/r/67596/diff/1-2/


Testing
-------

sudo make check


Thanks,

Chun-Hung Hsiao


Re: Review Request 67596: Fixed the flakiness in the `NVIDIA_GPU_NvidiaDockerImage` test.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67596/#review204883
-----------------------------------------------------------


Ship It!

- Jie Yu

