Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2021/05/12 11:08:18 UTC

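[GitHub] [dolphinscheduler] quanzhian opened a new issue #5461: The worker node sometimes fails to get the PID when performing tasks, resulting in an error in the kill -9 command, causing the UI interface task status to display failure, and the actual task job execution is successful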

quanzhian opened a new issue #5461:
URL: https://github.com/apache/dolphinscheduler/issues/5461


   In the class org.apache.dolphinscheduler.server.utils.ProcessUtils.java:
   
   
       public static void kill(TaskExecutionContext taskExecutionContext) {
           try {
               int processId = taskExecutionContext.getProcessId();
               if (processId == 0) {
                   logger.error("process kill failed, process id :{}, task id:{}",
                       processId, taskExecutionContext.getTaskInstanceId());
                   return;
               }
               // getPidsStr(processId) sometimes returns no PIDs here, so the kill -9 command fails; the maintainers need to add an empty-value check
               String cmd = String.format("sudo kill -9 %s", getPidsStr(processId));
   
               logger.info("process id:{}, cmd:{}", processId, cmd);
   
               OSUtils.exeCmd(cmd);
   
           } catch (Exception e) {
               logger.error("kill task failed", e);
           }
           // find log and kill yarn job
           killYarnJob(taskExecutionContext);
       }
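
The empty-value check requested in the comment above could look roughly like the following. This is a minimal sketch, not the project's actual fix: the class `KillCommandSketch` and the method `buildKillCommand` are hypothetical names, and the `pidsStr` parameter stands in for whatever `getPidsStr(processId)` returns. It only shows the idea of refusing to build `sudo kill -9` when no PIDs were found.

```java
import java.util.Optional;

// Hypothetical helper, not part of DolphinScheduler.
class KillCommandSketch {

    /**
     * Builds the kill command only when the PID string is usable.
     *
     * @param pidsStr assumed to be the space-separated PID list that
     *                getPidsStr(processId) would return; may be null or blank
     * @return the command to run, or Optional.empty() if there is nothing to kill
     */
    static Optional<String> buildKillCommand(String pidsStr) {
        // A null or blank PID list would yield "sudo kill -9 " with no
        // argument, which makes kill print its usage text and exit non-zero.
        if (pidsStr == null || pidsStr.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(String.format("sudo kill -9 %s", pidsStr.trim()));
    }
}
```

A caller would run the command only when the returned `Optional` is present, and otherwise log that the process has already exited and skip the kill.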
   
   
   
   The exception log is as follows:
   
   
   [INFO] 2021-05-12 18:28:32.942  - [taskAppId=TASK-107-71-109]:[347] - task run command:
   sudo -u dolphinscheduler sh /tmp/dolphinscheduler/exec/process/2/107/71/109/107_71_109.command
   [INFO] 2021-05-12 18:28:32.942  - [taskAppId=TASK-107-71-109]:[228] - process start, process id is: 11319
   [INFO] 2021-05-12 18:28:32.942  - [taskAppId=TASK-107-71-109]:[237] - process has exited, execute path:/tmp/dolphinscheduler/exec/process/2/107/71/109, processId:11319 ,exitStatusCode:0
   [ERROR] 2021-05-12 18:28:32.942  - [taskAppId=TASK-107-71-109]:[256] - process has failure , exitStatusCode : 0 , ready to kill ...
   [INFO] 2021-05-12 18:28:32.969 org.apache.dolphinscheduler.server.utils.ProcessUtils:[373] - process id:11319, cmd:sudo kill -9 
   [ERROR] 2021-05-12 18:28:32.981 org.apache.dolphinscheduler.server.utils.ProcessUtils:[378] - kill task failed
   org.apache.dolphinscheduler.common.shell.AbstractShell$ExitCodeException: 
   Usage:
    kill [options] <pid|name> [...]
   
   Options:
    -a, --all              do not restrict the name-to-pid conversion to processes
                           with the same uid as the present process
    -s, --signal <sig>     send specified signal
    -q, --queue <sig>      use sigqueue(2) rather than kill(2)
    -p, --pid              print pids without signaling them
    -l, --list [=<signal>] list signal names, or convert one to a name
    -L, --table            list signal names and numbers
   
    -h, --help     display this help and exit
    -V, --version  output version information and exit
   
   For more details see kill(1).
   
   	at org.apache.dolphinscheduler.common.shell.AbstractShell.runCommand(AbstractShell.java:209)
   	at org.apache.dolphinscheduler.common.shell.AbstractShell.run(AbstractShell.java:124)
   	at org.apache.dolphinscheduler.common.shell.ShellExecutor.execute(ShellExecutor.java:127)
   	at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:104)
   	at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:87)
   	at org.apache.dolphinscheduler.common.utils.OSUtils.exeShell(OSUtils.java:394)
   	at org.apache.dolphinscheduler.common.utils.OSUtils.exeCmd(OSUtils.java:384)
   	at org.apache.dolphinscheduler.server.utils.ProcessUtils.kill(ProcessUtils.java:375)
   	at org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.run(AbstractCommandExecutor.java:257)
   	at org.apache.dolphinscheduler.server.worker.task.qtdataIntegration.QtDiTask.handle(QtDiTask.java:166)
   	at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:134)
   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   [INFO] 2021-05-12 18:28:33.943  - [taskAppId=TASK-107-71-109]:[129] -  -> flinkx starting ...
   	18:28:33.378 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
   	18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
   	18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
   	18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
   	18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
   	18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
   [INFO] 2021-05-12 18:28:33.982 org.apache.dolphinscheduler.service.log.LogClientService:[100] - view log path /mnt/services/dolphinscheduler136/logs/107/71/109.log
   [INFO] 2021-05-12 18:28:33.988 org.apache.dolphinscheduler.remote.NettyRemotingClient:[403] - netty client closed
   [INFO] 2021-05-12 18:28:33.988 org.apache.dolphinscheduler.service.log.LogClientService:[59] - logger client closed
   [INFO] 2021-05-12 18:28:33.989 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[142] - task instance id : 109,task final status : FAILURE
   [INFO] 2021-05-12 18:28:33.989 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[162] - develop mode is: false
   [INFO] 2021-05-12 18:28:33.989 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[180] - exec local path: /tmp/dolphinscheduler/exec/process/2/107/71/109 cleared.
   [INFO] 2021-05-12 18:28:34.944  - [taskAppId=TASK-107-71-109]:[129] -  -> 18:28:34.159 [main] INFO com.dtstack.flinkx.launcher.perjob.PerJobSubmitter - start to submit per-job task, LauncherOptions = Options{mode='yarnPer', job='/tmp/dolphinscheduler/exec/process/2/107/71/109/107_71_109_job.json', monitor='null', jobid='Flink Job', flinkconf='/mnt/services/flink-1.8.8/conf', pluginRoot='/mnt/services/flinkx/plugins', remotePluginPath='null', yarnconf='/etc/hadoop/conf', parallelism='1', priority='1', queue='default', flinkLibJar='/mnt/services/flink-1.8.8/lib', confProp='{"flink.checkpoint.interval":60000}', p='', s='null', pluginLoadMode='shipfile', appId='null'}
   	18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
   	18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
   	18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
   	18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
   	18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
   	18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
   	18:28:34.305 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   	18:28:34.360 [main] INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to dolphinscheduler (auth:SIMPLE)
   	log4j:WARN No appenders could be found for logger (org.apache.hadoop.yarn.ipc.YarnRPC).
   	log4j:WARN Please initialize the log4j system properly.
   	log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
   	18:28:34.543 [main] INFO com.dtstack.flinkx.launcher.perjob.PerJobClusterClientBuilder - ----init yarn success ----
   	18:28:34.666 [main] INFO org.apache.hadoop.conf.Configuration - resource-types.xml not found
   	18:28:34.666 [main] INFO org.apache.hadoop.yarn.util.resource.ResourceUtils - Unable to find 'resource-types.xml'.
   	18:28:34.704 [main] WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The JobManager or TaskManager memory is below the smallest possible YARN Container size. The value of 'yarn.scheduler.minimum-allocation-mb' is '1024'. Please increase the memory size.YARN will allocate the smaller containers but the scheduler will account for the minimum-allocation-mb, maybe not all instances you requested will start.
   	18:28:34.704 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=1024, numberTaskManagers=1, slotsPerTaskManager=1}
   [INFO] 2021-05-12 18:28:35.945  - [taskAppId=TASK-107-71-109]:[129] -  -> 18:28:35.024 [main] WARN org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
   	18:28:35.033 [main] WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/mnt/services/flink-1.8.8/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
   [INFO] 2021-05-12 18:28:36.946  - [taskAppId=TASK-107-71-109]:[129] -  -> 18:28:36.716 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Submitting application master application_1609329939009_5348
   	18:28:36.741 [main] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1609329939009_5348
   	18:28:36.742 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Waiting for the cluster to be allocated
   	18:28:36.744 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Deploying cluster, current state ACCEPTED
   [INFO] 2021-05-12 18:28:40.946  - [taskAppId=TASK-107-71-109]:[129] -  -> 18:28:40.025 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - YARN application has been deployed successfully.
   	18:28:40.320 [main] INFO org.apache.flink.runtime.rest.RestClient - Rest client endpoint started.
   	18:28:40.323 [main] INFO com.dtstack.flinkx.util.YarnUtil - HADOOP_CONF_DIR:/etc/hadoop/conf
   	18:28:40.372 [main] INFO com.dtstack.flinkx.util.YarnUtil - get 1080 config from /etc/hadoop/conf/core-site.xml
   	18:28:40.380 [main] INFO com.dtstack.flinkx.util.YarnUtil - get 23 config from /etc/hadoop/conf/hdfs-site.xml
   	18:28:40.400 [main] INFO com.dtstack.flinkx.util.YarnUtil - hdfs path:hdfs:///apps/flinkx/2021-05-12/816d1ef47c5a5cbd5557580126b17f22
   	18:28:40.401 [main] INFO com.dtstack.flinkx.util.YarnUtil - monitorUrl:bigdata-master01:8088/proxy/application_1609329939009_5348
   	18:28:40.421 [main] INFO com.dtstack.flinkx.launcher.perjob.PerJobSubmitter - deploy per_job with appId: application_1609329939009_5348}, jobId: 816d1ef47c5a5cbd5557580126b17f22
   [INFO] 2021-05-12 18:28:40.947  - [taskAppId=TASK-107-71-109]:[127] - FINALIZE_SESSION
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [dolphinscheduler] zhuangchong commented on issue #5461: The worker node sometimes fails to get the PID when performing tasks, resulting in an error in the kill -9 command, causing the UI interface task status to display failure, and the actual task job execution is successful

Posted by GitBox <gi...@apache.org>.
zhuangchong commented on issue #5461:
URL: https://github.com/apache/dolphinscheduler/issues/5461#issuecomment-839733814


   Which version? This problem has been fixed in the dev branch.
   
   ![image](https://user-images.githubusercontent.com/37063904/117975484-4eca1c00-b361-11eb-9ff1-394ebca4b247.png)
   





[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #5461: The worker node sometimes fails to get the PID when performing tasks, resulting in an error in the kill -9 command, causing the UI interface task status to display failure, and the actual task job execution is successful

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #5461:
URL: https://github.com/apache/dolphinscheduler/issues/5461#issuecomment-839685099


   Hi:
   * Thank you for your feedback. We have received your issue; please wait patiently for a reply.
   * So that we can understand your request as soon as possible, please provide detailed information, the version, or pictures.
   * If you haven't received a reply for a long time, you can subscribe to the developer mailing list (subscription steps: https://dolphinscheduler.apache.org/zh-cn/community/development/subscribe.html ), then write the issue URL in the email body and send your question to dev@dolphinscheduler.apache.org.





[GitHub] [dolphinscheduler] quanzhian closed issue #5461: The worker node sometimes fails to get the PID when performing tasks, resulting in an error in the kill -9 command, causing the UI interface task status to display failure, and the actual task job execution is successful

Posted by GitBox <gi...@apache.org>.
quanzhian closed issue #5461:
URL: https://github.com/apache/dolphinscheduler/issues/5461


   





[GitHub] [dolphinscheduler] quanzhian commented on issue #5461: The worker node sometimes fails to get the PID when performing tasks, resulting in an error in the kill -9 command, causing the UI interface task status to display failure, and the actual task job execution is successful

Posted by GitBox <gi...@apache.org>.
quanzhian commented on issue #5461:
URL: https://github.com/apache/dolphinscheduler/issues/5461#issuecomment-840246957


   version: 1.3.6

