Posted to common-issues@hadoop.apache.org by "Weiwei Yang (JIRA)" <ji...@apache.org> on 2017/02/16 05:15:42 UTC

[jira] [Commented] (HADOOP-13837) Always get unable to kill error message even the hadoop process was successfully killed

    [ https://issues.apache.org/jira/browse/HADOOP-13837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869219#comment-15869219 ] 

Weiwei Yang commented on HADOOP-13837:
--------------------------------------

Hi [~aw], [~ajisakaa]

This one seems to have gone stale :( , so I'd like to summarize the issue again to avoid distractions and try to get this done.

*Issue Summary*

Currently, when hadoop-functions.sh kills a process forcibly, it checks the result immediately, without waiting at all. This causes the check to always fail with {{ERROR: Unable to kill daemon_pid}}. This is a false alarm: the process actually gets killed successfully.
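The race is easy to reproduce outside Hadoop. A minimal standalone sketch (illustrative only, not the Hadoop script): {{kill -9}} delivers the signal asynchronously, so a {{ps -p}} issued immediately afterwards can still find the pid (often as a {{<defunct>}} zombie) and the check reports a bogus failure.

{code}
#!/usr/bin/env bash
# Minimal reproduction of the false alarm (stand-alone sketch).
sleep 60 &                              # stand-in for the daemon process
pid=$!
kill -9 "${pid}" >/dev/null 2>&1
# Checking immediately can still see the pid (it may linger as a zombie
# until the parent reaps it), producing the bogus error message.
if ps -p "${pid}" >/dev/null 2>&1; then
  echo "ERROR: Unable to kill ${pid}"   # false alarm: the kill did succeed
fi
sleep 1                                 # give the shell time to reap the child
ps -p "${pid}" >/dev/null 2>&1 || echo "process ${pid} is gone"
{code}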

*The Fix*

The patch simply fixes two things:
# Sleep 3 seconds after kill -9
# Replace pid check with existing function hadoop_status_daemon
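In sketch form, the patched section would look roughly like the following. This is paraphrased from the two points above, not copied from the patch; {{hadoop_status_daemon}} and {{hadoop_error}} are stubbed here so the sketch runs standalone, and the stub assumes {{hadoop_status_daemon}} takes the pid file and returns 0 while the process is alive.

{code}
#!/usr/bin/env bash
# Stubs standing in for the real helpers in hadoop-functions.sh.
hadoop_error() { echo "$@" >&2; }
hadoop_status_daemon() {            # returns 0 while the pid in the file is alive
  local daemon_pid
  daemon_pid=$(cat "$1")
  ps -p "${daemon_pid}" >/dev/null 2>&1
}

sleep 60 &                          # stand-in daemon
pid=$!
pidfile=$(mktemp)
echo "${pid}" > "${pidfile}"

kill -9 "${pid}" >/dev/null 2>&1
sleep 3                             # fix 1: wait before re-checking
if hadoop_status_daemon "${pidfile}" >/dev/null 2>&1; then
  hadoop_error "ERROR: Unable to kill ${pid}"   # fix 2: reuse the status helper
else
  echo "killed ${pid}"
fi
rm -f "${pidfile}"
{code}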

Hopefully we can get this fixed in 3.0 alpha3, thanks!

> Always get unable to kill error message even the hadoop process was successfully killed
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13837
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: scripts
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>         Attachments: check_proc.sh, HADOOP-13837.01.patch, HADOOP-13837.02.patch, HADOOP-13837.03.patch, HADOOP-13837.04.patch
>
>
> *Reproduce steps*
> # Setup a hadoop cluster
> # Stop resource manager : yarn --daemon stop resourcemanager
> # Stop node manager : yarn --daemon stop nodemanager
> {noformat}
> WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
> ERROR: Unable to kill 20325
> {noformat}
> It always gets an "Unable to kill <nm_pid>" error message. This gives the user the impression that something is wrong with the node manager process because it could not be forcibly killed, but in fact the kill command works as expected.
> This is because hadoop-functions.sh does not check process existence properly after the kill. Currently it checks process liveness immediately after the kill command:
> {code}
> ...
> kill -9 "${pid}" >/dev/null 2>&1
> if ps -p "${pid}" > /dev/null 2>&1; then
>       hadoop_error "ERROR: Unable to kill ${pid}"
> ...
> {code}
> When the resource manager is stopped before the node managers, it always takes some additional time for the node manager process to terminate completely. I printed the output of {{ps -p <nm_pid>}} in a while loop after {{kill -9}} and found the following:
> {noformat}
> 16212 ?        00:00:11 java <defunct>
> 0
>   PID TTY          TIME CMD
> 16212 ?        00:00:11 java <defunct>
> 0
>   PID TTY          TIME CMD
> 16212 ?        00:00:11 java <defunct>
> 0
>   PID TTY          TIME CMD
> 1
>   PID TTY          TIME CMD
> 1
>   PID TTY          TIME CMD
> 1
>   PID TTY          TIME CMD
> ...
> {noformat}
> In the first 3 iterations of the loop, the process had not yet terminated, so the exit code of {{ps -p}} was still {{0}}; once the process was gone, the exit code became {{1}}.
> *Proposal of a fix*
> At first I considered adding a more comprehensive pid check that polls pid liveness until {{HADOOP_STOP_TIMEOUT}} is reached, but that seemed to add too much complexity. The second option was to simply add a {{sleep 3}} after {{kill -9}}; it should fix the error in most cases with relatively small changes to the script.
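For reference, the polling alternative mentioned in the quoted proposal could look roughly like this. Only {{HADOOP_STOP_TIMEOUT}} comes from the real script (it defaults to the 5-second window seen in the warning above); {{wait_for_exit}} is an illustrative name, not an existing helper.

{code}
#!/usr/bin/env bash
# Illustrative polling loop (not the committed fix): re-check the pid once
# per second until it disappears or HADOOP_STOP_TIMEOUT seconds elapse.
wait_for_exit() {
  local pid=$1
  local timeout=${HADOOP_STOP_TIMEOUT:-5}
  local waited=0
  while ps -p "${pid}" >/dev/null 2>&1; do
    if [[ ${waited} -ge ${timeout} ]]; then
      return 1                      # still alive: report the kill failure
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0                          # process is gone
}

sleep 60 &                          # stand-in daemon
pid=$!
kill -9 "${pid}" >/dev/null 2>&1
if ! wait_for_exit "${pid}"; then
  echo "ERROR: Unable to kill ${pid}"
fi
{code}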



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
