Posted to user@mesos.apache.org by Jay Taylor <ou...@gmail.com> on 2015/10/08 21:45:27 UTC

Tasks with failed health-checks intermittently not restarted

Using the following health-check parameters:

cmd="exit 1"
delay=5.0
grace-period=10.0
interval=10.0
timeout=10.0
consecutiveFailures=3

Sometimes the tasks are correctly identified as failing and restarted;
at other times, however, the health-check command exits non-zero yet the
task is left in a running state and the failure is ignored.
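For reference, here is a minimal sketch (in Python, with illustrative names; this is a model of the documented semantics, not Mesos source code) of how these parameters should interact: failures during the grace period are ignored, a success resets the failure streak, and the task is killed once `consecutiveFailures` failures occur in a row.

```python
def kill_time(results, delay, grace_period, interval, consecutive_failures):
    """Return the elapsed seconds at which the task should be killed,
    or None if it should keep running.

    `results` holds one boolean per health-check run (True = passed).
    The first check runs after `delay`; later checks run every `interval`.
    Failures occurring within `grace_period` are ignored.
    Illustrative model only, not Mesos code.
    """
    failures = 0
    t = delay
    for healthy in results:
        if healthy:
            failures = 0                      # a success resets the streak
        elif t > grace_period:
            failures += 1
            if failures >= consecutive_failures:
                return t                      # expected: task killed here
        t += interval
    return None
```

With the parameters above (delay=5, grace-period=10, interval=10, consecutiveFailures=3) and a check that always fails, the failure at t=5s falls inside the grace period and is ignored, and the kill is expected around t=35s, consistent with the roughly 36-second gap between executor start and the final warning in the STDERR log.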

Sample of failed Mesos task log:

STDOUT:

--container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
> --stop_timeout="0ns"
> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
> --stop_timeout="0ns"
> Registered docker executor on mesos-worker2a
> Starting task hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
> Launching health check process: /usr/libexec/mesos/mesos-health-check
> --executor=(1)@192.168.225.59:38776
> --health_check_json={"command":{"shell":true,"value":"docker exec
> mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662
> sh -c \" exit 1
> \""},"consecutive_failures":3,"delay_seconds":5.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":10.0}
> --task_id=hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>
> Health check process launched at pid: 7525
> Received task health update, healthy: false
> Received task health update, healthy: false



STDERR:

I1008 19:30:02.569856  7408 exec.cpp:134] Version: 0.26.0
> I1008 19:30:02.571815  7411 exec.cpp:208] Executor registered on slave
> 61373c0e-7349-4173-ab8d-9d7b260e8a30-S1
> WARNING: Your kernel does not support swap limit capabilities, memory
> limited without swap.
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I1008 19:30:08.527354  7533 main.cpp:100] Ignoring failure as health check
> still in grace period
> W1008 19:30:38.912325  7525 main.cpp:375] Health check failed Health
> command check exited with status 1


Screenshot of the task still running despite the health check having
exited with status code 1:

http://i.imgur.com/zx9GQuo.png

The expected behavior when the health-check binary exits with non-zero
status is that the task is killed and restarted (rather than continuing
to run, as shown above).

-----
Additional note: after hard-coding the "path" string for the health-check
binary's parent directory into b/src/docker/executor.cpp, I am at least
able to test the functionality.  The other issue, health checks for Docker
tasks failing to start at all, is still unresolved due to the unpropagated
MESOS_LAUNCH_DIR issue.
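The workaround described above can be pictured as follows (Python for brevity; the real code is C++ in src/docker/executor.cpp, and the function name and default path here are illustrative assumptions, not Mesos's actual API):

```python
import os

def health_check_binary(default_dir="/usr/libexec/mesos"):
    """Locate the mesos-health-check binary.

    Hypothetical illustration of the workaround: use the MESOS_LAUNCH_DIR
    environment variable when the agent propagates it to the executor,
    otherwise fall back to a hard-coded parent directory (the hard-coding
    is what the note above describes). Not Mesos source code.
    """
    launch_dir = os.environ.get("MESOS_LAUNCH_DIR", default_dir)
    return os.path.join(launch_dir, "mesos-health-check")
```

When MESOS_LAUNCH_DIR is not propagated into the executor's environment, only the hard-coded fallback makes the binary findable, which is why the hack enables testing at all.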

Re: Tasks with failed health-checks intermittently not restarted

Posted by Jay Taylor <ou...@gmail.com>.
Interesting, I just got the new cluster up.  I'll re-run both of these scenarios and let you know what I find.

What you're saying makes a lot of sense: many different versions of Mesos were installed on the old cluster!  Hopefully that is the cause of the strangeness.

Thanks again for all your help, Haosdent!




Re: Tasks with failed health-checks intermittently not restarted

Posted by haosdent <ha...@gmail.com>.
I guess your environment has some problems, similar to the strange
behaviour you saw with launcher_dir. I remember once seeing strange things
on my laptop because I had run `make install` for an older version of
Mesos while launching a newer version. If you run `make uninstall`, do a
fresh build following
http://mesos.apache.org/documentation/latest/getting-started/, and still
see these problems, we can be more confident it is a bug. But so far I
have not been able to reproduce your problems after repeated tests, so I
still think it may be caused by an unclean build environment or incorrect
configuration.

-- 
Best Regards,
Haosdent Huang

Re: Tasks with failed health-checks intermittently not restarted

Posted by Jay Taylor <ou...@gmail.com>.
I was testing on a set of VMs on a LAN with ideal, very low latency.




Re: Tasks with failed health-checks intermittently not restarted

Posted by haosdent <ha...@gmail.com>.
My stdout looks like:

```
Launching health check process:
/home/ld-sgdev/huangh/mesos/build/src/.libs/mesos-health-check
--executor=(1)@xxxx
--health_check_json={"command":{"shell":true,"value":"docker exec
mesos-90f9f9ac-e2e4-4d7d-8d07-50e37f727b91-S0.d8c73ee2-deca-46b6-ab51-d7815bded2d4
sh -c \" exit 1
\""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":20.0}
--task_id=test-health-check.901b938b-7002-11e5-a62b-0a0027000000
Health check process launched at pid: 24720
Received task health update, healthy: false
Received task health update, healthy: false
Received task health update, healthy: false
Killing docker task
Shutting down
```

Does your network latency affect this result?


-- 
Best Regards,
Haosdent Huang

Re: Tasks with failed health-checks intermittently not restarted

Posted by haosdent <ha...@gmail.com>.
I could not reproduce your problem on my side. But I guess it may be
related to this ticket: MESOS-1613
<https://issues.apache.org/jira/browse/MESOS-1613>
"HealthCheckTest.ConsecutiveFailures is flaky".

On Fri, Oct 9, 2015 at 12:13 PM, haosdent <ha...@gmail.com> wrote:

> I think it maybe because health check exit before executor receive
> the TaskHealthStatus. I would try "exit 1" and give your feedback later.
>
> On Fri, Oct 9, 2015 at 11:30 AM, Jay Taylor <ou...@gmail.com> wrote:
>
>> Following up on this:
>>
>> This problem is reproducible when the command is "exit 1".
>>
>> Once I set it to a real curl cmd the intermittent failures stopped and
>> health checks worked as advertised.
>>
>>
>> On Oct 8, 2015, at 12:45 PM, Jay Taylor <ou...@gmail.com> wrote:
>>
>> Using the health-check following parameters:
>>
>> cmd="exit 1"
>> delay=5.0
>> grace-period=10.0
>> interval=10.0
>> timeout=10.0
>> consecutiveFailures=3
>>
>> Sometimes the tasks are successfully identified as failing and restarted,
>> however other times the health-check command exits yet the task is left in
>> a running state and the failure is ignored.
>>
>> Sample of failed Mesos task log:
>>
>> STDOUT:
>>
>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --stop_timeout="0ns"
>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --stop_timeout="0ns"
>>> Registered docker executor on mesos-worker2a
>>> Starting task hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>> Launching health check process: /usr/libexec/mesos/mesos-health-check
>>> --executor=(1)@192.168.225.59:38776
>>> --health_check_json={"command":{"shell":true,"value":"docker exec
>>> mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662
>>> sh -c \" exit 1
>>> \""},"consecutive_failures":3,"delay_seconds":5.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":10.0}
>>> --task_id=hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>>
>>> *Health check process launched at pid: 7525*
>>> *Received task health update, healthy: false**Received task health
>>> update, healthy: false*
>>
>>
>>
>> STDERR:
>>
>> I1008 19:30:02.569856  7408 exec.cpp:134] Version: 0.26.0
>>> I1008 19:30:02.571815  7411 exec.cpp:208] Executor registered on slave
>>> 61373c0e-7349-4173-ab8d-9d7b260e8a30-S1
>>> WARNING: Your kernel does not support swap limit capabilities, memory
>>> limited without swap.
>>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>>> I1008 19:30:08.527354  7533 main.cpp:100] Ignoring failure as health
>>> check still in grace period
>>> W1008 19:30:38.912325  7525 main.cpp:375] Health check failed Health
>>> command check exited with status 1
>>
>>
>> Screenshot of the task still running despite health-check exited with
>> status code 1:
>>
>> http://i.imgur.com/zx9GQuo.png
>>
>> The expected behavior when the health-check binary has exited with a
>> non-zero status is that the task would be killed and restarted (rather
>> than continuing to run as outlined above).
>>
>> -----
>> Additional note: After hard-coding the "path" string of the health-check
>> binary parent dir into b/src/docker/executor.cpp, I am able to at least
>> test the functionality.  The other issue of health-checks for docker tasks
>> failing to start is still unresolved due to the unpropagated
>> MESOS_LAUNCH_DIR issue.
>>
>>
>
>
> --
> Best Regards,
> Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang

Re: Tasks with failed health-checks intermittently not restarted

Posted by haosdent <ha...@gmail.com>.
I think it may be because the health check exits before the executor receives
the TaskHealthStatus. I will try "exit 1" and report back later.
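The suspected race can be sketched as a toy model (illustrative only, not Mesos source; the function and its flow are assumptions): the health-check helper sends its failure updates and exits as soon as the failure limit is hit, and if the final update is not delivered before the helper exits, the executor never sees enough failures to kill the task.

```python
# Toy model of the suspected race (illustrative only, not Mesos source).
# The mesos-health-check helper fires TaskHealthStatus updates at the
# executor and exits once the failure limit is hit; delivery is
# fire-and-forget, so the final update can be lost if the helper exits
# before it lands.

CONSECUTIVE_FAILURES = 3

def executor_kills_task(final_update_delivered):
    """Return True if the executor observes enough failures to kill
    the task under this model."""
    failures_seen = 0
    for attempt in range(CONSECUTIVE_FAILURES):
        last = (attempt == CONSECUTIVE_FAILURES - 1)
        if last and not final_update_delivered:
            # Helper exited before the last TaskHealthStatus arrived.
            break
        failures_seen += 1
    return failures_seen >= CONSECUTIVE_FAILURES

# A fast-failing check like "exit 1" widens the window for the lost
# update; a slower command (e.g. a real curl) narrows it.
print(executor_kills_task(True))   # True  -> task killed as expected
print(executor_kills_task(False))  # False -> failure silently ignored
```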

On Fri, Oct 9, 2015 at 11:30 AM, Jay Taylor <ou...@gmail.com> wrote:

> Following up on this:
>
> This problem is reproducible when the command is "exit 1".
>
> Once I set it to a real curl command, the intermittent failures stopped and
> health checks worked as advertised.
>
>
> On Oct 8, 2015, at 12:45 PM, Jay Taylor <ou...@gmail.com> wrote:
>
> Using the following health-check parameters:
>
> cmd="exit 1"
> delay=5.0
> grace-period=10.0
> interval=10.0
> timeout=10.0
> consecutiveFailures=3
>
> Sometimes the tasks are successfully identified as failing and restarted;
> other times, however, the health-check command fails yet the task is left in
> a running state and the failure is ignored.
>
> [...]
>
>


-- 
Best Regards,
Haosdent Huang

Re: Tasks with failed health-checks intermittently not restarted

Posted by Jay Taylor <ou...@gmail.com>.
Following up on this:

This problem is reproducible when the command is "exit 1".

Once I set it to a real curl command, the intermittent failures stopped and health checks worked as advertised.
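For reference, with the parameters from the original report (delay=5, grace-period=10, interval=10, consecutiveFailures=3), a rough timing model is consistent with the STDERR timestamps in that report (executor registered at 19:30:02, final warning at 19:30:38). The model assumes checks run once per interval starting after the delay, and that failures inside the grace period are ignored, per the "Ignoring failure as health check still in grace period" log line:

```python
# Rough timing model for the reported parameters (assumptions: checks
# run every `interval` seconds starting `delay` seconds after launch,
# and failures during the grace period are not counted, matching the
# "Ignoring failure as health check still in grace period" log line).
delay = 5.0
grace_period = 10.0
interval = 10.0
consecutive_failures = 3

t = delay          # time of the first check, relative to launch
counted = 0
while counted < consecutive_failures:
    if t >= grace_period:   # failures inside the grace window don't count
        counted += 1
    if counted < consecutive_failures:
        t += interval

print(t)  # 35.0 -> roughly the gap between executor registration
          # (19:30:02) and the final warning (19:30:38)
```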


> On Oct 8, 2015, at 12:45 PM, Jay Taylor <ou...@gmail.com> wrote:
> 
> Using the following health-check parameters:
> 
> cmd="exit 1"
> delay=5.0
> grace-period=10.0
> interval=10.0
> timeout=10.0
> consecutiveFailures=3
> 
> Sometimes the tasks are successfully identified as failing and restarted; other times, however, the health-check command fails yet the task is left in a running state and the failure is ignored.
> 
> [...]