You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Vinod K V (JIRA)" <ji...@apache.org> on 2009/05/08 07:36:45 UTC

[jira] Commented: (HADOOP-5420) Support killing of process groups in LinuxTaskController binary

    [ https://issues.apache.org/jira/browse/HADOOP-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707242#action_12707242 ] 

Vinod K V commented on HADOOP-5420:
-----------------------------------

Looked at the patch. Some comments:
 - The patch needs sync-up with the latest patch on HADOOP-5771
 - Overall, I think the terms terminate and kill are better than the current usage of kill and destroy.

TaskController.java
 - killTask, destroyTask can better be terminateTask, killTask. And accordingly javadoc should be more explicit.
 - killTaskJVM: Null check of context.task belongs to LinuxTaskController, need it to be moved there.

DefaultTaskController.java
 - For WINDOWS process.destroy() is again not needed in destroyTask()

LinuxTaskController.java
 - We completely bring down the TT if we fail to write the pidFile to any one of the disks. This should change. We should fail only if we cannot write to any of the disks.
 - javadoc of setup() is incomplete. It should also quote the additional steps of writing TT pid-file.
 - Log message at +384 can be better: LOG.info("Written :: " + pid + "  to file " + ttPidFile);
 - TT's pidfile should not be made world writable (changeDirectoryPermissions does this). In fact, it can be set 000 permissions right-away. This would still work as we run as root to read it.
 - killHelper can better be something like finishTask()
 - What should we do when the commands for SIGTERM/SIGKILL fail?
 - buildTaskControllerExecutor() : For taskControllerCmd, you can instead use an Array
                                                            Why the usage of command.ordinal() in this method?
 - I think it would be clear to have a separate buildKillTaskArgs(). It would definitely help to also 'javadoc' the task-controller's command-line syntax for buildLaunchTaskArgs and this new buildKillTaskArgs(). With this you can completely remove the killHelper method()

ProcessTree.java
 - killProcessGroup() can be renamed to terminateProcessGroup(), killProcess() to terminateProcess(), sigKillProcessGroup to killProcessGroup() and sigKillProcess to killProcess().
 - Also, I think, with these semantics, we need a ProcessTree.isProcessGroupAlive(pgrpid) along with the current ProcessTree.isAlive(pid)

main.c
 - The argc length check should be command specific.
 - job_pid var can better be task_pid
 - You have a printf statement in the case DESTROY_TASK_JVM. This can be removed or redirected to log if needed.
 - Return values of verify_parent like UNABLE_TO_READ_TT_PID_FILE, OUT_OF_MEMORY are lost, as we are only returning INVALID_PROCESS_LAUNCHING_TASKCONTROLLER in error scenarios. Can't we do something here?

taskcontroller.c
 - Documentation of kill_user_task() should be fixed. Currently, it only refers to SIGTERM.
 - The get_pid_path() function can be removed. Also references to it from code documentation of run_task_as_user, kill_user_task can be removed too along with the TT_SYS_DIR var from taskcontroller.h
 - +343 "//Don't continue if the process is not alive anymore." should instead be "  // Don't continue if the process-group is not alive anymore."

TestKillSubProcessesWithTaskController
 - Rename to TestKillSubProcessesWithLinuxTaskController
 - Remove unused imports
 - Add Apache license header and add javadoc similar to other tests with LinuxTaskController
 - Just like in HADOOP-5771, we may want to verify job-ownership in this test too.
 - The test ran successfully, but it is running the jobs as the same user as the user running the cluster. Need to fix this.
 - Also, I see a lot of messages like this : "WARN  mapred.LinuxTaskController (LinuxTaskController.java:destroyTask(454)) - Exception thrown while sending destroy to the Task VM org.apache.hadoop.util.Shell$ExitCodeException:". Need to debug the cause for these.

We also need a test to verify cleanup of tasks that ignore SIGTERM.


> Support killing of process groups in LinuxTaskController binary
> ---------------------------------------------------------------
>
>                 Key: HADOOP-5420
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5420
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Sreekanth Ramakrishnan
>            Assignee: Sreekanth Ramakrishnan
>         Attachments: hadoop-5420-1.patch, hadoop-5420-2.patch, hadoop-5420-3.patch, hadoop-5420.patch
>
>
> Support setsid based kill in LinuxTaskController.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.