Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2020/10/19 06:50:10 UTC

[GitHub] [incubator-dolphinscheduler] shiliquan opened a new issue #3946: The work process is dead

shiliquan opened a new issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946


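   A task failed during scheduled execution, but no task scheduling log could be found. After I restarted the worker process and reran the task, it succeeded.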
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-dolphinscheduler] whitelowrie commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
whitelowrie commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-737796116


   @shiliquan I have the same issue. I guess your Kerberos user's credentials became invalid when you logged in as another Kerberos user, so I suggest creating a new system user that only runs DolphinScheduler.





[GitHub] [incubator-dolphinscheduler] xingchun-chen commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
xingchun-chen commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-731071950


   Who can look into this problem?





[GitHub] [incubator-dolphinscheduler] shiliquan commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
shiliquan commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-750825604


   @whitelowrie Hello! I don't quite understand what you mean. Regarding the DS authentication ticket: I created a scheduled task under the DS user, and it runs every hour. Could you explain in detail how to resolve this problem? Thank you!





[GitHub] [incubator-dolphinscheduler] phoenixhadoop commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
phoenixhadoop commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-730114741


   Is it solved?





[GitHub] [incubator-dolphinscheduler] yh2388 commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
yh2388 commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-750830353


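   > No, the problem still exists. I also found that it does not seem to be a limit on authentication time; it looks more like a limit on the number of runs, because it only happens when I enable scheduled execution for the task (mine runs every 5 minutes). If I do not enable scheduling, running the task manually a week later works without any problem.
   > Restarting every day is not realistic in a production environment. How do you handle the tasks that fail because of the restart? Do you rerun them manually?
   
   Did you run `kinit` in your shell script?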





[GitHub] [incubator-dolphinscheduler] shiliquan commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
shiliquan commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-751647875


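   @chengshiwen I have already added it to the scheduled task script:
   ![image](https://user-images.githubusercontent.com/42087586/103204210-c1e27100-4931-11eb-8317-4ae653b8c0f3.png)
   I tested this earlier and confirmed that the problem only occurs with scheduled tasks. Without scheduling, triggering the task again after some time works normally.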





[GitHub] [incubator-dolphinscheduler] shiliquan commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
shiliquan commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-730209163


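   No, the problem still exists. I also found that it does not seem to be a limit on authentication time; it looks more like a limit on the number of runs, because it only happens when I enable scheduled execution for the task (mine runs every 5 minutes). If I do not enable scheduling, running the task manually a week later works without any problem.
   Restarting every day is not realistic in a production environment. How do you handle the tasks that fail because of the restart? Do you rerun them manually?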





[GitHub] [incubator-dolphinscheduler] chengshiwen edited a comment on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
chengshiwen edited a comment on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-750839630


   @shiliquan Did you add `kinit -kt keytab principal` in the shell script?
   And could you add `klist`, `whoami` and `ls -l /tmp` to troubleshoot the problem under non-crontab and crontab mode respectively?
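
The checks requested above could be collected at the top of the task's shell script along these lines. This is a minimal sketch and was not posted in the thread: the `run_diag` helper, the `KEYTAB` and `PRINCIPAL` variables, and their placeholder values are hypothetical and must be replaced with the real keytab path and principal. Only `kinit -kt`, `klist`, `whoami` and `ls -l /tmp` come from the comment itself.

```shell
#!/bin/sh
# Hypothetical placeholders: substitute the real keytab path and principal.
KEYTAB="${KEYTAB:-/path/to/user.keytab}"
PRINCIPAL="${PRINCIPAL:-user@EXAMPLE.COM}"

# Print a labelled header, then run the command if it is installed.
# A failing command is reported but does not abort the script.
run_diag() {
    echo "== $* =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(exit status $?)"
    else
        echo "(command not found: $1)"
    fi
}

run_diag whoami                            # which OS user runs the task
run_diag kinit -kt "$KEYTAB" "$PRINCIPAL"  # obtain a ticket from the keytab
run_diag klist                             # show the credential cache and ticket lifetime
run_diag ls -l /tmp                        # credential caches often live here (e.g. krb5cc_*)
```

Running the same script once as a manually triggered task and once as a scheduled one, then comparing the two task logs, should show whether the user, the ticket, or the cache file differs between the two modes.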





[GitHub] [incubator-dolphinscheduler] chengshiwen commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
chengshiwen commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-750839630


   @shiliquan Did you add `kinit -kt keytab principal` in the shell script?
   And could you add `whoami` and `ls -l /tmp` to troubleshoot the problem under non-crontab and crontab mode respectively?





[GitHub] [incubator-dolphinscheduler] xingchun-chen commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
xingchun-chen commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-711796594


   1. Which version are you using?
   2. What is the workflow instance type when it fails?
   3. What is the task type?
   Please also attach a screenshot of the process instance DAG.





[GitHub] [incubator-dolphinscheduler] phoenixhadoop commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
phoenixhadoop commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-730196220


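   This problem still does not seem to be resolved. We have to restart the DS cluster at 23:00 every night to work around this authentication issue.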





[GitHub] [incubator-dolphinscheduler] samz406 commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
samz406 commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-712644716


   Does running `sh show_tables.sh` directly in a shell on the worker server succeed?





[GitHub] [incubator-dolphinscheduler] shiliquan commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
shiliquan commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-750831967


   Yes, I've also found that it seems to happen only after scheduling is turned on. I haven't found any other solution so far, so I'm going to try it locally instead. @xingchun-chen, please follow up and fix it. It seems I'm not the only one with this problem. Is it a software bug?





[GitHub] [incubator-dolphinscheduler] shiliquan commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
shiliquan commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-711852259


   1. Version: 1.3.2
   2. The workflow instance type is a scheduled shell task that executes hsql statements.
   Error reported by the master:
   ![image](https://user-images.githubusercontent.com/42087586/96418482-63a85c00-1225-11eb-9e32-c59b48bd3d9b.png)
   Worker: no log at all:
   ![image](https://user-images.githubusercontent.com/42087586/96418620-a23e1680-1225-11eb-9486-2cdca811fbed.png)
   Task instance: nothing reported either:
   ![image](https://user-images.githubusercontent.com/42087586/96418695-c0a41200-1225-11eb-8201-ec93bbee2347.png)
   Task instance execution record:
   ![image](https://user-images.githubusercontent.com/42087586/96418810-e204fe00-1225-11eb-8d64-de2b9e2f9b9a.png)
   DAG screenshot:
   ![image](https://user-images.githubusercontent.com/42087586/96419891-542a1280-1227-11eb-8057-791edc1f3999.png)
   
   The problem was only resolved after I restarted the worker process, but a similar problem occurred again today.
   Task instance record:
   ![image](https://user-images.githubusercontent.com/42087586/96420592-43c66780-1228-11eb-9f9c-711bbded8e2a.png)
   The error is as follows:
   [INFO] 2020-10-19 10:56:59.863 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[93] - script path : /tmp/dolphinscheduler/exec/process/2/14/3420/7820
   [INFO] 2020-10-19 10:56:59.864 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[246] - get resource file from hdfs :/home/dolphinscheduler/dolphinscheduler/dolphinscheduler/resources/hive/show_tables.sh
   [ERROR] 2020-10-19 10:56:59.884 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[249] - Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hdpv3test05.cnbdcu.com/xx.xx.xx.xx"; destination host is: "hdpv3test04.cnbdcu.com":8020; 
   java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hdpv3test05.cnbdcu.com/xx.xx.xx.xx"; destination host is: "hdpv3test04.cnbdcu.com":8020; 
           at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
           at org.apache.hadoop.ipc.Client.call(Client.java:1479)
           at org.apache.hadoop.ipc.Client.call(Client.java:1412)
           at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
           at com.sun.proxy.$Proxy134.getFileInfo(Unknown Source)
           at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
           at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
           at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
           at com.sun.proxy.$Proxy135.getFileInfo(Unknown Source)
           at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
           at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
           at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
           at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
           at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
           at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:464)
           at org.apache.dolphinscheduler.common.utils.HadoopUtils.copyHdfsToLocal(HadoopUtils.java:333)
           at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.downloadResource(TaskExecuteThread.java:247)
           at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:98)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
           at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
           at java.security.AccessController.doPrivileged(Native Method)
           at javax.security.auth.Subject.doAs(Subject.java:422)
           at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
           at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
           at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
           at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
           at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
           at org.apache.hadoop.ipc.Client.call(Client.java:1451)
           ... 24 common frames omitted
   Caused by: javax.security.sasl.SaslException: GSS initiate failed
           at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
           at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:414)
           at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560)
           at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375)
           at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:729)
           at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
           at java.security.AccessController.doPrivileged(Native Method)
           at javax.security.auth.Subject.doAs(Subject.java:422)
           at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
           at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725)
           ... 27 common frames omitted
   Caused by: org.ietf.jgss.GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
           at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
           at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
           at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
           at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
           at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
           at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
           at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
           ... 36 common frames omitted
   [ERROR] 2020-10-19 10:56:59.884 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[140] - task scheduler failure
   java.lang.RuntimeException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hdpv3test05.cnbdcu.com/xx.xx.xx.xx"; destination host is: "hdpv3test04.cnbdcu.com":8020; 
           at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.downloadResource(TaskExecuteThread.java:250)
           at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:98)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   
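   Again, the fix was to restart the worker process, after which the rerun succeeded.
   
   If the worker is dead, why are tasks still being sent to that host continuously? Isn't that unreasonable? Also, the error above does not look like it was caused by a hung process: other commands, such as shell commands (`ls`, `pwd`) and MySQL queries, still run. Yet everything works again once the worker is restarted. What is the actual cause? It seems to be related to Kerberos authentication, but ticket authentication directly on that host works fine. And why does a restart fix it? Does DS have some other expiry period for its Kerberos authentication?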
   
   





[GitHub] [incubator-dolphinscheduler] shiliquan commented on issue #3946: The work process is dead

Posted by GitBox <gi...@apache.org>.
shiliquan commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-712647831


   Yes, it works.
   ![image](https://user-images.githubusercontent.com/42087586/96553209-68354900-12e7-11eb-8ba4-6baecf7eef81.png)
   

