Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2020/10/19 08:37:00 UTC

[GitHub] [incubator-dolphinscheduler] shiliquan commented on issue #3946: The work process is dead

shiliquan commented on issue #3946:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/3946#issuecomment-711852259


   1. Version: 1.3.2
   2. The workflow instance is a scheduled shell task that executes hsql statements.
   Error reported by the master:
   ![image](https://user-images.githubusercontent.com/42087586/96418482-63a85c00-1225-11eb-9e32-c59b48bd3d9b.png)
   On the worker side, there is no log:
   ![image](https://user-images.githubusercontent.com/42087586/96418620-a23e1680-1225-11eb-9486-2cdca811fbed.png)
   The task instance shows nothing at all:
   ![image](https://user-images.githubusercontent.com/42087586/96418695-c0a41200-1225-11eb-8201-ec93bbee2347.png)
   Task instance execution record:
   ![image](https://user-images.githubusercontent.com/42087586/96418810-e204fe00-1225-11eb-8d64-de2b9e2f9b9a.png)
   DAG screenshot:
   ![image](https://user-images.githubusercontent.com/42087586/96419891-542a1280-1227-11eb-8057-791edc1f3999.png)
   
   This was only resolved by restarting the worker process, but a similar problem appeared again today.
   Task instance record:
   ![image](https://user-images.githubusercontent.com/42087586/96420592-43c66780-1228-11eb-9f9c-711bbded8e2a.png)
   The error is as follows:
   [INFO] 2020-10-19 10:56:59.863 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[93] - script path : /tmp/dolphinscheduler/exec/process/2/14/3420/7820
   [INFO] 2020-10-19 10:56:59.864 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[246] - get resource file from hdfs :/home/dolphinscheduler/dolphinscheduler/dolphinscheduler/resources/hive/show_tables.sh
   [ERROR] 2020-10-19 10:56:59.884 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[249] - Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hdpv3test05.cnbdcu.com/xx.xx.xx.xx"; destination host is: "hdpv3test04.cnbdcu.com":8020; 
   java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hdpv3test05.cnbdcu.com/xx.xx.xx.xx"; destination host is: "hdpv3test04.cnbdcu.com":8020; 
           at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
           at org.apache.hadoop.ipc.Client.call(Client.java:1479)
           at org.apache.hadoop.ipc.Client.call(Client.java:1412)
           at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
           at com.sun.proxy.$Proxy134.getFileInfo(Unknown Source)
           at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
           at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
           at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
           at com.sun.proxy.$Proxy135.getFileInfo(Unknown Source)
           at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
           at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
           at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
           at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
           at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
           at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:464)
           at org.apache.dolphinscheduler.common.utils.HadoopUtils.copyHdfsToLocal(HadoopUtils.java:333)
           at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.downloadResource(TaskExecuteThread.java:247)
           at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:98)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
           at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
           at java.security.AccessController.doPrivileged(Native Method)
           at javax.security.auth.Subject.doAs(Subject.java:422)
           at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
           at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
           at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
           at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
           at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
           at org.apache.hadoop.ipc.Client.call(Client.java:1451)
           ... 24 common frames omitted
   Caused by: javax.security.sasl.SaslException: GSS initiate failed
           at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
           at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:414)
           at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560)
           at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375)
           at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:729)
           at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
           at java.security.AccessController.doPrivileged(Native Method)
           at javax.security.auth.Subject.doAs(Subject.java:422)
           at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
           at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725)
           ... 27 common frames omitted
   Caused by: org.ietf.jgss.GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
           at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
           at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
           at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
           at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
           at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
           at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
           at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
           ... 36 common frames omitted
   [ERROR] 2020-10-19 10:56:59.884 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[140] - task scheduler failure
   java.lang.RuntimeException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hdpv3test05.cnbdcu.com/xx.xx.xx.xx"; destination host is: "hdpv3test04.cnbdcu.com":8020; 
           at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.downloadResource(TaskExecuteThread.java:250)
           at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:98)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   
   Once again the fix was to restart the worker process, after which the rerun succeeded.
   
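   If the worker is hung, why do tasks keep being sent to that host? That seems unreasonable. Also, the error above does not look like it was caused by a hung process: other tasks, such as shell commands (ls, pwd) and MySQL queries, still run fine on it, yet everything works again once the worker is restarted. What is the actual cause? It looks related to Kerberos authentication, but authenticating with a ticket directly on that host works without problems. And why does a restart fix it? Does DolphinScheduler apply some separate expiry period to its own Kerberos authentication?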
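
   One hypothesis that would fit these symptoms, sketched below as a toy model rather than as DolphinScheduler's actual code: a long-running process that logs in to Kerberos once at startup keeps its own credentials in memory, so a healthy ticket on the host says little about the ticket held by the worker. When that ticket's lifetime runs out, HDFS calls fail with "Failed to find any Kerberos tgt" while tasks that never touch HDFS keep working, and a restart performs a fresh login. The class `CachedLogin`, the function `run_task`, and the 24-hour lifetime are all assumptions made for illustration. Hadoop's `UserGroupInformation` does provide `checkTGTAndReloginFromKeytab()` for this kind of refresh; whether the worker in 1.3.2 calls it has not been verified here.

```python
from dataclasses import dataclass


@dataclass
class CachedLogin:
    """Toy model of a process holding a ticket with a fixed lifetime."""
    issued_at: float  # seconds on an arbitrary clock
    lifetime: float   # ticket lifetime in seconds (assumed value)

    def is_valid(self, now: float) -> bool:
        return now < self.issued_at + self.lifetime

    def relogin(self, now: float) -> None:
        # Stands in for a fresh keytab login, or for restarting the process.
        self.issued_at = now


def run_task(login: CachedLogin, now: float, refresh: bool) -> str:
    """Return 'ok' if the ticket is usable at `now`, else 'no valid tgt'."""
    if refresh and not login.is_valid(now):
        login.relogin(now)
    return "ok" if login.is_valid(now) else "no valid tgt"


DAY = 24 * 3600.0

# Worker started at t=0 with a 24h ticket and never refreshes it.
stale = CachedLogin(issued_at=0.0, lifetime=DAY)
print(run_task(stale, now=3600.0, refresh=False))      # within the lifetime
print(run_task(stale, now=DAY + 60.0, refresh=False))  # after expiry

# A "restart" is modelled as a new login at the current time.
stale.relogin(now=DAY + 60.0)
print(run_task(stale, now=DAY + 120.0, refresh=False))

# Same timeline, but the ticket is re-checked before each task.
fresh = CachedLogin(issued_at=0.0, lifetime=DAY)
print(run_task(fresh, now=DAY + 60.0, refresh=True))
```

   If this model matches what happens in the worker, the failures should begin roughly one ticket lifetime after each worker restart, which can be checked against the worker's start time and the timestamps in the log above.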
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org