Posted to reviews@mesos.apache.org by Jan Schlicht <ja...@mesosphere.io> on 2016/04/04 16:05:40 UTC
Re: Review Request 44571: Added timeout for destroying Docker containers.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44571/
-----------------------------------------------------------
(Updated April 4, 2016, 4:05 p.m.)
Review request for mesos, Jie Yu and Joris Van Remoortere.
Changes
-------
Changed order of continuations.
Bugs: MESOS-4673
https://issues.apache.org/jira/browse/MESOS-4673
Repository: mesos
Description
-------
Commands issued to the Docker daemon can hang, causing problems within Mesos.
For example, a hanging 'docker stop' can result in an unresponsive executor,
causing the Mesos agent to run a 'docker stop' itself, which in turn might
result in an unresponsive agent (see MESOS-4673).
Adding a timeout can be used as a workaround.
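As a rough illustration of the workaround idea (not the patch itself, which changes src/slave/containerizer/docker.cpp), the sketch below bounds a possibly hanging command with a timeout and escalates to SIGKILL if the command ignores SIGTERM. It assumes GNU coreutils `timeout` is available; the function name `run_with_timeout`, the 1-second escalation delay, and the container name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command, but give up after $1 seconds.
# Assumes GNU coreutils `timeout` is installed.
run_with_timeout() {
  secs="$1"
  shift
  # Send SIGTERM after $secs seconds; if the command has still not
  # exited one second later, send SIGKILL (-k 1).
  timeout -k 1 "$secs" "$@"
  status=$?
  case "$status" in
    124) echo "timed out after ${secs}s: $*" >&2 ;;
    137) echo "killed with SIGKILL: $*" >&2 ;;
  esac
  return "$status"
}

# Example usage (container name is a placeholder):
# run_with_timeout 10 docker stop my-container
```

A caller can treat exit status 124 (timed out) or 137 (killed with SIGKILL) as "the Docker CLI hung" and fall back to another cleanup path.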
Diffs (updated)
-----
src/slave/containerizer/docker.hpp 89d450e10a84f24ddd46d517e2b4b46ab02c4fda
src/slave/containerizer/docker.cpp 9314d1f9e0b6077fe7c48b860783ab21acc48be6
Diff: https://reviews.apache.org/r/44571/diff/
Testing
-------
sudo ./bin/mesos-tests.sh (to test if existing tests break due to the changed behavior)
Because docker must hang for both the Mesos agent as well as the `mesos-docker-executor`, it can't currently be tested as part of the Mesos integration tests. Here's how to test that the timeout works:
Run with Fedora 23 (Kernel 4.2.3, Docker 1.9.1)
# Start a master
./bin/mesos-master.sh --work_dir=/tmp/mesos &
# Start an agent
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --containerizers=docker &
# Run a task using the docker containerizer
./src/mesos-execute --containerizer=docker --docker_image=alpine --master=127.0.0.1:5050 --name="sleep" --command="sleep 1000" &
# Note the pid of `mesos-execute` as well as the pid of the sleep task run by docker (e.g. 3323 and 3474)
# Have mesos run `docker inspect` to gather the pid of the docker task
curl -X GET localhost:5051/monitor/statistics
# Now overload docker by trying to run a lot of tasks in parallel
for i in `seq 1 100`; do sudo docker run --rm alpine sleep 60 & done
# Wait until the first of these docker tasks finishes; `sudo docker ps` should be unresponsive now
# Kill the `mesos-execute` task (e.g. 3323)
kill 3323
# Watch the logs of the Mesos agent. At some point it will send a SIGKILL to the docker task (e.g. 3474)
# Make sure that the docker task is indeed terminated (using `ps fax` or the like)
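The final check can also be scripted instead of reading `ps fax` output by hand. The following is a minimal sketch; the helper name `wait_for_exit` and the 60-second limit in the usage comment are hypothetical. It polls with `kill -0`, which only probes whether a signal could be delivered, so it must run as a user allowed to signal the task (e.g. root).

```shell
#!/bin/sh
# Hypothetical helper: wait up to $2 seconds for the process with
# PID $1 to disappear. Returns 0 once it is gone, 1 if it is still
# present when the time is up.
wait_for_exit() {
  pid="$1"
  remaining="$2"
  # `kill -0` sends no signal; it only checks that the PID exists
  # and that we are allowed to signal it.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$remaining" -le 0 ]; then
      return 1
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  return 0
}

# Example usage with the docker task pid from the steps above:
# wait_for_exit 3474 60 && echo "docker task is gone"
```

Note that a PID can be reused by an unrelated process, so this check is only meaningful shortly after the kill.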
Thanks,
Jan Schlicht
Re: Review Request 44571: Added timeout for destroying Docker containers.
Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44571/#review127808
-----------------------------------------------------------
Patch looks great!
Reviews applied: [44571]
Passed command: export OS='ubuntu:14.04' CONFIGURATION='--verbose' COMPILER='gcc' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh
- Mesos ReviewBot
On April 8, 2016, 11:20 a.m., Jan Schlicht wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/44571/
> -----------------------------------------------------------
>
> (Updated April 8, 2016, 11:20 a.m.)
>
>
> Review request for mesos, Jie Yu and Joris Van Remoortere.
>
>
> Bugs: MESOS-4673
> https://issues.apache.org/jira/browse/MESOS-4673
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Commands issued to the Docker daemon can hang, causing problems within
> Mesos. For example, a hanging 'docker stop' can result in an unresponsive
> executor, causing the Mesos agent to run a 'docker stop' itself, which
> in turn might result in an unresponsive agent (see MESOS-4673).
> Adding a timeout can be used as a workaround.
>
>
> Diffs
> -----
>
> src/slave/constants.hpp 449c8cd9f43f71b4612023eb463969e9db0bc960
> src/slave/containerizer/docker.hpp 35673214ab4bf50151f15e3fad10ff374cda3bbc
> src/slave/containerizer/docker.cpp 5755effec065650aac4473e4b622f4342ad020a3
>
> Diff: https://reviews.apache.org/r/44571/diff/
>
>
> Testing
> -------
>
> sudo ./bin/mesos-tests.sh (to test if existing tests break due to the changed behavior)
>
> Because docker must hang for both the Mesos agent as well as the `mesos-docker-executor`, it can't currently be tested as part of the Mesos integration tests. Here's how to test that the timeout works:
> Run with Fedora 23 (Kernel 4.2.3, Docker 1.9.1)
> # Start a master
> ./bin/mesos-master.sh --work_dir=/tmp/mesos &
>
> # Start an agent
> sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --containerizers=docker &
>
> # Run a task using the docker containerizer
> ./src/mesos-execute --containerizer=docker --docker_image=alpine --master=127.0.0.1:5050 --name="sleep" --command="sleep 1000" &
> # Note the pid of `mesos-execute` as well as the pid of the sleep task run by docker (e.g. 3323 and 3474)
>
> # Have mesos run `docker inspect` to gather the pid of the docker task
> curl -X GET localhost:5051/monitor/statistics
>
> # Now overload docker by trying to run a lot of tasks in parallel
> for i in `seq 1 100`; do sudo docker run --rm alpine sleep 60 & done
>
> # Wait until the first of these docker tasks finishes; `sudo docker ps` should be unresponsive now
> # Kill the `mesos-execute` task (e.g. 3323)
> kill 3323
>
> # Watch the logs of the Mesos agent. At some point it will send a SIGKILL to the docker task (e.g. 3474)
> # Make sure that the docker task is indeed terminated (using `ps fax` or the like)
>
>
> Thanks,
>
> Jan Schlicht
>
>
Re: Review Request 44571: Added timeout for destroying Docker containers.
Posted by Joris Van Remoortere <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44571/#review127841
-----------------------------------------------------------
Ship it!
Just renamed the constant and clarified the comment a little.
- Joris Van Remoortere
Re: Review Request 44571: Added timeout for destroying Docker containers.
Posted by Jan Schlicht <ja...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44571/
-----------------------------------------------------------
(Updated April 8, 2016, 1:20 p.m.)
Review request for mesos, Jie Yu and Joris Van Remoortere.
Changes
-------
Addressed issues.
Bugs: MESOS-4673
https://issues.apache.org/jira/browse/MESOS-4673
Repository: mesos
Description (updated)
-------
Commands issued to the Docker daemon can hang, causing problems within
Mesos. For example, a hanging 'docker stop' can result in an unresponsive
executor, causing the Mesos agent to run a 'docker stop' itself, which
in turn might result in an unresponsive agent (see MESOS-4673).
Adding a timeout can be used as a workaround.
Diffs (updated)
-----
src/slave/constants.hpp 449c8cd9f43f71b4612023eb463969e9db0bc960
src/slave/containerizer/docker.hpp 35673214ab4bf50151f15e3fad10ff374cda3bbc
src/slave/containerizer/docker.cpp 5755effec065650aac4473e4b622f4342ad020a3
Diff: https://reviews.apache.org/r/44571/diff/
Testing
-------
sudo ./bin/mesos-tests.sh (to test if existing tests break due to the changed behavior)
Because docker must hang for both the Mesos agent as well as the `mesos-docker-executor`, it can't currently be tested as part of the Mesos integration tests. Here's how to test that the timeout works:
Run with Fedora 23 (Kernel 4.2.3, Docker 1.9.1)
# Start a master
./bin/mesos-master.sh --work_dir=/tmp/mesos &
# Start an agent
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --containerizers=docker &
# Run a task using the docker containerizer
./src/mesos-execute --containerizer=docker --docker_image=alpine --master=127.0.0.1:5050 --name="sleep" --command="sleep 1000" &
# Note the pid of `mesos-execute` as well as the pid of the sleep task run by docker (e.g. 3323 and 3474)
# Have mesos run `docker inspect` to gather the pid of the docker task
curl -X GET localhost:5051/monitor/statistics
# Now overload docker by trying to run a lot of tasks in parallel
for i in `seq 1 100`; do sudo docker run --rm alpine sleep 60 & done
# Wait until the first of these docker tasks finishes; `sudo docker ps` should be unresponsive now
# Kill the `mesos-execute` task (e.g. 3323)
kill 3323
# Watch the logs of the Mesos agent. At some point it will send a SIGKILL to the docker task (e.g. 3474)
# Make sure that the docker task is indeed terminated (using `ps fax` or the like)
Thanks,
Jan Schlicht
Re: Review Request 44571: Added timeout for destroying Docker containers.
Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44571/#review127685
-----------------------------------------------------------
Fix it, then Ship it!
src/slave/containerizer/docker.cpp (line 1854)
<https://reviews.apache.org/r/44571/#comment191073>
The 1 second grace period here is pretty arbitrary. Can you make it a constant?
src/slave/containerizer/docker.cpp (line 1954)
<https://reviews.apache.org/r/44571/#comment191074>
This is useless since `docker->stop` does not handle `discard()` properly. We still rely on the subprocess to terminate anyway.
To be clear, calling `discard()` on a future does not necessarily mean that the future will end up in the DISCARDED state. It's up to the owner of the promise to decide (by calling `promise->discard()`).
- Jie Yu
On April 4, 2016, 2:05 p.m., Jan Schlicht wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/44571/
> -----------------------------------------------------------
>
> (Updated April 4, 2016, 2:05 p.m.)
>
>
> Review request for mesos, Jie Yu and Joris Van Remoortere.
>
>
> Bugs: MESOS-4673
> https://issues.apache.org/jira/browse/MESOS-4673
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Commands issued to the Docker daemon can hang, causing problems within Mesos.
> For example, a hanging 'docker stop' can result in an unresponsive executor,
> causing the Mesos agent to run a 'docker stop' itself, which in turn might
> result in an unresponsive agent (see MESOS-4673).
> Adding a timeout can be used as a workaround.
>
>
> Diffs
> -----
>
> src/slave/containerizer/docker.hpp 89d450e10a84f24ddd46d517e2b4b46ab02c4fda
> src/slave/containerizer/docker.cpp 9314d1f9e0b6077fe7c48b860783ab21acc48be6
>
> Diff: https://reviews.apache.org/r/44571/diff/
>
>
> Testing
> -------
>
> sudo ./bin/mesos-tests.sh (to test if existing tests break due to the changed behavior)
>
> Because docker must hang for both the Mesos agent as well as the `mesos-docker-executor`, it can't currently be tested as part of the Mesos integration tests. Here's how to test that the timeout works:
> Run with Fedora 23 (Kernel 4.2.3, Docker 1.9.1)
> # Start a master
> ./bin/mesos-master.sh --work_dir=/tmp/mesos &
>
> # Start an agent
> sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --containerizers=docker &
>
> # Run a task using the docker containerizer
> ./src/mesos-execute --containerizer=docker --docker_image=alpine --master=127.0.0.1:5050 --name="sleep" --command="sleep 1000" &
> # Note the pid of `mesos-execute` as well as the pid of the sleep task run by docker (e.g. 3323 and 3474)
>
> # Have mesos run `docker inspect` to gather the pid of the docker task
> curl -X GET localhost:5051/monitor/statistics
>
> # Now overload docker by trying to run a lot of tasks in parallel
> for i in `seq 1 100`; do sudo docker run --rm alpine sleep 60 & done
>
> # Wait until the first of these docker tasks finishes; `sudo docker ps` should be unresponsive now
> # Kill the `mesos-execute` task (e.g. 3323)
> kill 3323
>
> # Watch the logs of the Mesos agent. At some point it will send a SIGKILL to the docker task (e.g. 3474)
> # Make sure that the docker task is indeed terminated (using `ps fax` or the like)
>
>
> Thanks,
>
> Jan Schlicht
>
>
Re: Review Request 44571: Added timeout for destroying Docker containers.
Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44571/#review126848
-----------------------------------------------------------
Patch looks great!
Reviews applied: [44571]
Passed command: export OS='ubuntu:14.04' CONFIGURATION='--verbose' COMPILER='gcc' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh
- Mesos ReviewBot