Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2021/05/12 11:08:18 UTC
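[GitHub] [dolphinscheduler] quanzhian opened a new issue #5461: The worker node sometimes fails to get the PID when performing tasks, resulting in an error in the kill -9 command, causing the UI interface task status to display failure, and the actual task job execution is successful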
quanzhian opened a new issue #5461:
URL: https://github.com/apache/dolphinscheduler/issues/5461
In the class org.apache.dolphinscheduler.server.utils.ProcessUtils (ProcessUtils.java):
    public static void kill(TaskExecutionContext taskExecutionContext) {
        try {
            int processId = taskExecutionContext.getProcessId();
            if (processId == 0) {
                logger.error("process kill failed, process id :{}, task id:{}",
                    processId, taskExecutionContext.getTaskInstanceId());
                return;
            }
            // getPidsStr(processId) sometimes fails to return any PID here, so the
            // kill -9 command is run without an argument and errors out; the
            // maintainers need to add an empty-value check.
            String cmd = String.format("sudo kill -9 %s", getPidsStr(processId));
            logger.info("process id:{}, cmd:{}", processId, cmd);
            OSUtils.exeCmd(cmd);
        } catch (Exception e) {
            logger.error("kill task failed", e);
        }
        // find log and kill yarn job
        killYarnJob(taskExecutionContext);
    }
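The empty-value check requested in the comment above could look roughly like the following. This is only a minimal sketch, not the fix that landed in the dev branch: the class KillCommandSketch and the method buildKillCommand are hypothetical names made up for illustration, and the pidsStr parameter stands in for whatever getPidsStr(processId) returns.

```java
import java.util.Optional;

// Hypothetical helper, not part of DolphinScheduler.
final class KillCommandSketch {

    /**
     * Builds the kill command only when at least one PID is present.
     * pidsStr is assumed to be the space-separated PID list that
     * getPidsStr(processId) would return; it may be null or blank.
     */
    static Optional<String> buildKillCommand(String pidsStr) {
        if (pidsStr == null || pidsStr.trim().isEmpty()) {
            // Nothing to kill: skip the command instead of running
            // "sudo kill -9 " with no operand.
            return Optional.empty();
        }
        return Optional.of(String.format("sudo kill -9 %s", pidsStr.trim()));
    }

    public static void main(String[] args) {
        // PID list present: the command is built as before.
        System.out.println(buildKillCommand("11319 11320").orElse("<skip kill>"));
        // Empty PID list (the case in this report): no command is built.
        System.out.println(buildKillCommand("").orElse("<skip kill>"));
    }
}
```

With a guard like this, the caller would only invoke OSUtils.exeCmd when a command is actually returned, so an already-exited process would no longer trigger the kill usage error shown in the log.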
The exception log is as follows:
[INFO] 2021-05-12 18:28:32.942 - [taskAppId=TASK-107-71-109]:[347] - task run command:
sudo -u dolphinscheduler sh /tmp/dolphinscheduler/exec/process/2/107/71/109/107_71_109.command
[INFO] 2021-05-12 18:28:32.942 - [taskAppId=TASK-107-71-109]:[228] - process start, process id is: 11319
[INFO] 2021-05-12 18:28:32.942 - [taskAppId=TASK-107-71-109]:[237] - process has exited, execute path:/tmp/dolphinscheduler/exec/process/2/107/71/109, processId:11319 ,exitStatusCode:0
[ERROR] 2021-05-12 18:28:32.942 - [taskAppId=TASK-107-71-109]:[256] - process has failure , exitStatusCode : 0 , ready to kill ...
[INFO] 2021-05-12 18:28:32.969 org.apache.dolphinscheduler.server.utils.ProcessUtils:[373] - process id:11319, cmd:sudo kill -9
[ERROR] 2021-05-12 18:28:32.981 org.apache.dolphinscheduler.server.utils.ProcessUtils:[378] - kill task failed
org.apache.dolphinscheduler.common.shell.AbstractShell$ExitCodeException:
Usage:
kill [options] <pid|name> [...]
Options:
-a, --all do not restrict the name-to-pid conversion to processes
with the same uid as the present process
-s, --signal <sig> send specified signal
-q, --queue <sig> use sigqueue(2) rather than kill(2)
-p, --pid print pids without signaling them
-l, --list [=<signal>] list signal names, or convert one to a name
-L, --table list signal names and numbers
-h, --help display this help and exit
-V, --version output version information and exit
For more details see kill(1).
at org.apache.dolphinscheduler.common.shell.AbstractShell.runCommand(AbstractShell.java:209)
at org.apache.dolphinscheduler.common.shell.AbstractShell.run(AbstractShell.java:124)
at org.apache.dolphinscheduler.common.shell.ShellExecutor.execute(ShellExecutor.java:127)
at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:104)
at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:87)
at org.apache.dolphinscheduler.common.utils.OSUtils.exeShell(OSUtils.java:394)
at org.apache.dolphinscheduler.common.utils.OSUtils.exeCmd(OSUtils.java:384)
at org.apache.dolphinscheduler.server.utils.ProcessUtils.kill(ProcessUtils.java:375)
at org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.run(AbstractCommandExecutor.java:257)
at org.apache.dolphinscheduler.server.worker.task.qtdataIntegration.QtDiTask.handle(QtDiTask.java:166)
at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:134)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[INFO] 2021-05-12 18:28:33.943 - [taskAppId=TASK-107-71-109]:[129] - -> flinkx starting ...
18:28:33.378 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
[INFO] 2021-05-12 18:28:33.982 org.apache.dolphinscheduler.service.log.LogClientService:[100] - view log path /mnt/services/dolphinscheduler136/logs/107/71/109.log
[INFO] 2021-05-12 18:28:33.988 org.apache.dolphinscheduler.remote.NettyRemotingClient:[403] - netty client closed
[INFO] 2021-05-12 18:28:33.988 org.apache.dolphinscheduler.service.log.LogClientService:[59] - logger client closed
[INFO] 2021-05-12 18:28:33.989 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[142] - task instance id : 109,task final status : FAILURE
[INFO] 2021-05-12 18:28:33.989 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[162] - develop mode is: false
[INFO] 2021-05-12 18:28:33.989 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[180] - exec local path: /tmp/dolphinscheduler/exec/process/2/107/71/109 cleared.
[INFO] 2021-05-12 18:28:34.944 - [taskAppId=TASK-107-71-109]:[129] - -> 18:28:34.159 [main] INFO com.dtstack.flinkx.launcher.perjob.PerJobSubmitter - start to submit per-job task, LauncherOptions = Options{mode='yarnPer', job='/tmp/dolphinscheduler/exec/process/2/107/71/109/107_71_109_job.json', monitor='null', jobid='Flink Job', flinkconf='/mnt/services/flink-1.8.8/conf', pluginRoot='/mnt/services/flinkx/plugins', remotePluginPath='null', yarnconf='/etc/hadoop/conf', parallelism='1', priority='1', queue='default', flinkLibJar='/mnt/services/flink-1.8.8/lib', confProp='{"flink.checkpoint.interval":60000}', p='', s='null', pluginLoadMode='shipfile', appId='null'}
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
18:28:34.305 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:28:34.360 [main] INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to dolphinscheduler (auth:SIMPLE)
log4j:WARN No appenders could be found for logger (org.apache.hadoop.yarn.ipc.YarnRPC).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
18:28:34.543 [main] INFO com.dtstack.flinkx.launcher.perjob.PerJobClusterClientBuilder - ----init yarn success ----
18:28:34.666 [main] INFO org.apache.hadoop.conf.Configuration - resource-types.xml not found
18:28:34.666 [main] INFO org.apache.hadoop.yarn.util.resource.ResourceUtils - Unable to find 'resource-types.xml'.
18:28:34.704 [main] WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The JobManager or TaskManager memory is below the smallest possible YARN Container size. The value of 'yarn.scheduler.minimum-allocation-mb' is '1024'. Please increase the memory size.YARN will allocate the smaller containers but the scheduler will account for the minimum-allocation-mb, maybe not all instances you requested will start.
18:28:34.704 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=1024, numberTaskManagers=1, slotsPerTaskManager=1}
[INFO] 2021-05-12 18:28:35.945 - [taskAppId=TASK-107-71-109]:[129] - -> 18:28:35.024 [main] WARN org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18:28:35.033 [main] WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/mnt/services/flink-1.8.8/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
[INFO] 2021-05-12 18:28:36.946 - [taskAppId=TASK-107-71-109]:[129] - -> 18:28:36.716 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Submitting application master application_1609329939009_5348
18:28:36.741 [main] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1609329939009_5348
18:28:36.742 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Waiting for the cluster to be allocated
18:28:36.744 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Deploying cluster, current state ACCEPTED
[INFO] 2021-05-12 18:28:40.946 - [taskAppId=TASK-107-71-109]:[129] - -> 18:28:40.025 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - YARN application has been deployed successfully.
18:28:40.320 [main] INFO org.apache.flink.runtime.rest.RestClient - Rest client endpoint started.
18:28:40.323 [main] INFO com.dtstack.flinkx.util.YarnUtil - HADOOP_CONF_DIR:/etc/hadoop/conf
18:28:40.372 [main] INFO com.dtstack.flinkx.util.YarnUtil - get 1080 config from /etc/hadoop/conf/core-site.xml
18:28:40.380 [main] INFO com.dtstack.flinkx.util.YarnUtil - get 23 config from /etc/hadoop/conf/hdfs-site.xml
18:28:40.400 [main] INFO com.dtstack.flinkx.util.YarnUtil - hdfs path:hdfs:///apps/flinkx/2021-05-12/816d1ef47c5a5cbd5557580126b17f22
18:28:40.401 [main] INFO com.dtstack.flinkx.util.YarnUtil - monitorUrl:bigdata-master01:8088/proxy/application_1609329939009_5348
18:28:40.421 [main] INFO com.dtstack.flinkx.launcher.perjob.PerJobSubmitter - deploy per_job with appId: application_1609329939009_5348}, jobId: 816d1ef47c5a5cbd5557580126b17f22
[INFO] 2021-05-12 18:28:40.947 - [taskAppId=TASK-107-71-109]:[127] - FINALIZE_SESSION
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [dolphinscheduler] zhuangchong commented on issue #5461: The worker node sometimes fails to get the PID when performing tasks, resulting in an error in the kill -9 command, causing the UI interface task status to display failure, and the actual task job execution is successful
Posted by GitBox <gi...@apache.org>.
zhuangchong commented on issue #5461:
URL: https://github.com/apache/dolphinscheduler/issues/5461#issuecomment-839733814
Which version? This problem has been fixed in the dev branch.
![image](https://user-images.githubusercontent.com/37063904/117975484-4eca1c00-b361-11eb-9ff1-394ebca4b247.png)
[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #5461: The worker node sometimes fails to get the PID when performing tasks, resulting in an error in the kill -9 command, causing the UI interface task status to display failure, and the actual task job execution is successful
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #5461:
URL: https://github.com/apache/dolphinscheduler/issues/5461#issuecomment-839685099
Hi:
* Thank you for your feedback. We have received your issue; please wait patiently for a reply.
* To help us understand your request as soon as possible, please provide detailed information, the version, or pictures.
* If you haven't received a reply for a long time, you can subscribe to the developer mailing list (subscription steps: https://dolphinscheduler.apache.org/zh-cn/community/development/subscribe.html ), then write the issue URL in an email and send your question to dev@dolphinscheduler.apache.org.
[GitHub] [dolphinscheduler] quanzhian closed issue #5461: The worker node sometimes fails to get the PID when performing tasks, resulting in an error in the kill -9 command, causing the UI interface task status to display failure, and the actual task job execution is successful
Posted by GitBox <gi...@apache.org>.
quanzhian closed issue #5461:
URL: https://github.com/apache/dolphinscheduler/issues/5461
[GitHub] [dolphinscheduler] quanzhian commented on issue #5461: The worker node sometimes fails to get the PID when performing tasks, resulting in an error in the kill -9 command, causing the UI interface task status to display failure, and the actual task job execution is successful
Posted by GitBox <gi...@apache.org>.
quanzhian commented on issue #5461:
URL: https://github.com/apache/dolphinscheduler/issues/5461#issuecomment-840246957
version: 1.3.6