Posted to dev@oozie.apache.org by "Prabhu Joseph (JIRA)" <ji...@apache.org> on 2017/05/11 08:36:04 UTC
[jira] [Created] (OOZIE-2887) Oozie Server hangs when there is a user job has wrong namenode address
Prabhu Joseph created OOZIE-2887:
------------------------------------
Summary: Oozie Server hangs when there is a user job has wrong namenode address
Key: OOZIE-2887
URL: https://issues.apache.org/jira/browse/OOZIE-2887
Project: Oozie
Issue Type: Bug
Components: core
Affects Versions: 4.3.0
Reporter: Prabhu Joseph
Priority: Critical
All Oozie jobs go to the PREP state when a user job tries to connect to a wrong NameNode address by mistake. The jstack output shows that every thread that tries to submit a job is waiting to lock a "java.util.ServiceLoader".
{code}
"pool-2-thread-19" #47 prio=5 os_prio=0 tid=0x00007f8c08734000 nid=0xb468 waiting for monitor entry [0x00007f8bf207a000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:89)
- waiting to lock <0x0000000081b29098> (a java.util.ServiceLoader)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1255)
- locked <0x0000000082fd6b30> (a org.apache.hadoop.mapreduce.Job)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1187)
at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
at org.apache.oozie.command.XCommand.call(XCommand.java:287)
at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
The thread that is trying to connect to the wrong NameNode address has acquired that lock and keeps retrying the connection to the NameNode forever.
{code}
"pool-2-thread-20" #48 prio=5 os_prio=0 tid=0x00007f8c08736000 nid=0xb469 waiting on condition [0x00007f8bf1f78000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:899)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:666)
- locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
- locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
at org.apache.hadoop.ipc.Client.call(Client.java:1449)
at org.apache.hadoop.ipc.Client.call(Client.java:1396)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at com.sun.proxy.$Proxy31.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
at com.sun.proxy.$Proxy32.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
at org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:124)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:317)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:309)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
- locked <0x0000000083b422f8> (a java.lang.Object)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:199)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
- locked <0x0000000083c498f0> (a java.lang.Object)
at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceStart(ResourceMgrDelegate.java:109)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
- locked <0x0000000083c49950> (a java.lang.Object)
at org.apache.hadoop.mapred.ResourceMgrDelegate.<init>(ResourceMgrDelegate.java:98)
at org.apache.hadoop.mapred.YARNRunner.<init>(YARNRunner.java:112)
at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
- locked <0x0000000081b29098> (a java.util.ServiceLoader)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475)
at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454)
at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:526)
at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:524)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:524)
at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1416)
at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1137)
at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
at org.apache.oozie.command.XCommand.call(XCommand.java:287)
at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
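The two dumps above fit a common pattern: one thread performs a slow operation while holding a monitor, and every other thread that needs the same monitor ends up BLOCKED. The following is a minimal, self-contained Java sketch of that pattern. It does not use any Hadoop or Oozie code; the class name SharedLockStall, the SHARED_LOCK object, and the sleep that stands in for the connection retries are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class SharedLockStall {
    // Hypothetical stand-in for the shared java.util.ServiceLoader monitor in the dumps.
    private static final Object SHARED_LOCK = new Object();

    /**
     * Starts a "holder" thread that sleeps while owning SHARED_LOCK, then starts a
     * "waiter" thread that needs the same lock, and returns the waiter's observed state.
     */
    static Thread.State observeWaiterState(long holdMillis) {
        CountDownLatch lockHeld = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                lockHeld.countDown();
                try {
                    Thread.sleep(holdMillis); // stands in for the connection retry loop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");

        Thread waiter = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                // A real worker would continue with its own job submission here.
            }
        }, "waiter");

        try {
            holder.start();
            lockHeld.await();   // make sure the holder owns the lock first
            waiter.start();

            // Poll until the waiter is BLOCKED, or give up after half the hold time.
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(holdMillis / 2);
            Thread.State state = waiter.getState();
            while (state != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(5);
                state = waiter.getState();
            }

            holder.interrupt(); // end the simulated retry loop so the lock is released
            holder.join();
            waiter.join();
            return state;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state while holder sleeps: " + observeWaiterState(2000));
    }
}
```

In this sketch the waiter stays BLOCKED for as long as the holder sleeps, which mirrors how "pool-2-thread-19" waits on the monitor that "pool-2-thread-20" locked.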
The Oozie logs show that the job keeps retrying to connect to the wrong NameNode address.
{code}
2017-05-10 05:38:23,194 INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: hpchdp2e.hpc.ford.com/19.5.224.16:8020. Already tried 333 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
2017-05-10 05:38:24,574 INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: hpchdp2e.hpc.ford.com/19.5.224.16:8020. Already tried 334 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
2017-05-10 05:38:26,918 INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: hpchdp2e.hpc.ford.com/19.5.224.16:8020. Already tried 335 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
2017-05-10 05:38:28,659 INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: hpchdp2e.hpc.ford.com/19.5.224.16:8020. Already tried 336 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
{code}
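To get a feel for how long a single attempt can hold the lock, the retry policy printed in the log lines above can be turned into a rough time budget. The sketch below assumes that MultipleLinearRandomRetry[500x2000ms] means up to 500 retries with a nominal sleep of 2000 ms each; that reading is an assumption, and the timestamps above (gaps of roughly 1.4 to 2.3 seconds) suggest the real sleep is randomized around that value. The class and method names are hypothetical.

```java
import java.time.Duration;

final class RetryBudget {
    /**
     * Nominal total sleep time for the given number of retries, assuming each
     * retry sleeps for sleepMillis (randomization and connect time are ignored).
     */
    static Duration nominalWait(int retries, long sleepMillis) {
        return Duration.ofMillis(Math.multiplyExact((long) retries, sleepMillis));
    }

    public static void main(String[] args) {
        // Values taken from the policy string in the log: 500 x 2000 ms.
        Duration d = nominalWait(500, 2000);
        System.out.println(d.toMinutes() + " min " + (d.getSeconds() % 60) + " s");
        // prints "16 min 40 s"
    }
}
```

Under these assumptions one pass through the policy takes on the order of a quarter of an hour, and any further retries layered on top of it would extend that, during which all other submitting threads stay blocked.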
There should be some way to prevent the Oozie Server from falling into this trap when a user supplies wrong details.
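One generic way to keep a single bad request from stalling a worker is to bound the blocking call with a timeout. The sketch below uses only java.util.concurrent and is not what Oozie or Hadoop does, nor a fix proposed in this issue; the class BoundedCall, its methods, and the sleep that stands in for a client connection are hypothetical. It only helps if the blocked call reacts to interruption, and by itself it does not change how the shared lock is taken.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedCall {
    /**
     * Runs the task on the executor and waits at most timeoutMillis for its result.
     * On timeout the task is cancelled (its thread is interrupted) and the
     * TimeoutException is rethrown to the caller.
     */
    static <T> T callWithTimeout(ExecutorService executor, Callable<T> task, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it can stop retrying
            throw e;
        }
    }

    /** Demo helper: returns true if a task sleeping taskMillis exceeds timeoutMillis. */
    static boolean timesOut(long taskMillis, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            callWithTimeout(executor, () -> {
                Thread.sleep(taskMillis); // stands in for a slow connection attempt
                return "connected";
            }, timeoutMillis);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (ExecutionException | InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("slow call timed out: " + timesOut(5000, 100));
        System.out.println("fast call timed out: " + timesOut(10, 2000));
    }
}
```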