Posted to dev@oozie.apache.org by "Prabhu Joseph (JIRA)" <ji...@apache.org> on 2017/08/03 08:51:00 UTC

[jira] [Commented] (OOZIE-2887) Oozie Server hangs when there is a user job has wrong namenode address

    [ https://issues.apache.org/jira/browse/OOZIE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112427#comment-16112427 ] 

Prabhu Joseph commented on OOZIE-2887:
--------------------------------------

The issue also occurs when job.properties has the correct NameNode address but the NameNode nn1 machine is down. The whitelist configuration does not help here.

{code}
Repro:

NameNode HA - nn1, nn2 
Shutdown nn1
yarn.timeline-service.enabled true
Now all Oozie jobs go to PREP: one thread keeps retrying the connection to the nn1 node while the other threads wait to lock the object.
{code}
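
The stall described in the repro can be illustrated outside Oozie. The following is only a minimal sketch, not Oozie or Hadoop code: SharedLockStall, SHARED_LOCK and observeWaiterState are hypothetical names, and Thread.sleep stands in for the connect-retry loop. One thread holds a shared monitor while it "retries", and a second thread that needs the same monitor shows up as BLOCKED, as in the jstack output.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SharedLockStall {
    // Hypothetical stand-in for the shared object every submitting thread locks.
    private static final Object SHARED_LOCK = new Object();

    /** Reports the state of a second thread while the first one holds the lock. */
    static Thread.State observeWaiterState(long holdMillis) {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                lockHeld.countDown();
                try {
                    Thread.sleep(holdMillis); // simulates the connect-retry loop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");
        Thread waiter = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                // A real submitter would create its client here.
            }
        }, "waiter");
        try {
            holder.start();
            lockHeld.await(); // make sure the holder owns the lock first
            waiter.start();
            // Poll until the waiter is parked on the monitor or the hold time is over.
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(holdMillis);
            Thread.State state = waiter.getState();
            while (state != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(5);
                state = waiter.getState();
            }
            holder.join();
            waiter.join();
            return state;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing threads", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state while lock is held: " + observeWaiterState(500));
    }
}
```

In the sketch the holder releases the lock after holdMillis, so the waiter eventually proceeds. In the repro the holder never stops retrying, so every other submitting thread stays blocked.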


> Oozie Server hangs when there is a user job has wrong namenode address 
> -----------------------------------------------------------------------
>
>                 Key: OOZIE-2887
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2887
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 4.3.0
>            Reporter: Prabhu Joseph
>            Priority: Critical
>
> All Oozie jobs go to the PREP state when a user job mistakenly tries to connect to a wrong NameNode address. The jstack output shows that all the threads trying to submit a job are waiting to lock a "java.util.ServiceLoader" object:
> {code}
> "pool-2-thread-19" #47 prio=5 os_prio=0 tid=0x00007f8c08734000 nid=0xb468 waiting for monitor entry [0x00007f8bf207a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:89)
>         - waiting to lock <0x0000000081b29098> (a java.util.ServiceLoader)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
>         at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
>         at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.hadoop.mapreduce.Job.connect(Job.java:1255)
>         - locked <0x0000000082fd6b30> (a org.apache.hadoop.mapreduce.Job)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1187)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
>         at org.apache.oozie.command.XCommand.call(XCommand.java:287)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The thread that is trying to connect to the wrong NameNode address has acquired the lock and keeps retrying the connection forever:
> {code}
> "pool-2-thread-20" #48 prio=5 os_prio=0 tid=0x00007f8c08736000 nid=0xb469 waiting on condition [0x00007f8bf1f78000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:899)
>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:666)
>         - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
>         - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1449)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1396)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>         at com.sun.proxy.$Proxy31.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
>         at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
>         at com.sun.proxy.$Proxy32.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
>         at org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:124)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:317)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:309)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         - locked <0x0000000083b422f8> (a java.lang.Object)
>         at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:199)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         - locked <0x0000000083c498f0> (a java.lang.Object)
>         at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceStart(ResourceMgrDelegate.java:109)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         - locked <0x0000000083c49950> (a java.lang.Object)
>         at org.apache.hadoop.mapred.ResourceMgrDelegate.<init>(ResourceMgrDelegate.java:98)
>         at org.apache.hadoop.mapred.YARNRunner.<init>(YARNRunner.java:112)
>         at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
>         at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
>         - locked <0x0000000081b29098> (a java.util.ServiceLoader)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
>         at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475)
>         at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454)
>         at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:526)
>         at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:524)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:524)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1416)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1137)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
>         at org.apache.oozie.command.XCommand.call(XCommand.java:287)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The Oozie logs show the job repeatedly retrying the connection to the wrong NameNode address:
> {code}
> 2017-05-10 05:38:23,194  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 333 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:24,574  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 334 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:26,918  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 335 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:28,659  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 336 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> {code}
> There should be some way to prevent the Oozie Server from falling into this trap when a user supplies wrong details.
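
To get a feel for how long one retry cycle like the one in the log above holds the lock, the numbers in a log line can be turned into a rough estimate. This is a sketch under assumptions: it reads "500x2000ms" as up to 500 retries with a nominal 2000 ms sleep each, ignores the time spent on each connection attempt, and ignores sleep randomization (the gaps between the four logged retries range from about 1.4 s to 2.3 s). RetryBudget and nominalRemainingMillis are hypothetical names, not Hadoop APIs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RetryBudget {
    // Patterns matched against the "Retrying connect to server" log lines.
    private static final Pattern ATTEMPTS =
            Pattern.compile("Already tried (\\d+) time\\(s\\)");
    private static final Pattern SPEC =
            Pattern.compile("MultipleLinearRandomRetry\\[(\\d+)x(\\d+)ms\\]");

    /** Nominal sleep time left in the current cycle, in milliseconds. */
    static long nominalRemainingMillis(String logLine) {
        Matcher attempts = ATTEMPTS.matcher(logLine);
        Matcher spec = SPEC.matcher(logLine);
        if (!attempts.find() || !spec.find()) {
            throw new IllegalArgumentException("not a retry log line: " + logLine);
        }
        long tried = Long.parseLong(attempts.group(1));
        long maxRetries = Long.parseLong(spec.group(1));   // assumed meaning of "500"
        long sleepMillis = Long.parseLong(spec.group(2));  // assumed meaning of "2000ms"
        return Math.max(0, maxRetries - tried) * sleepMillis;
    }

    public static void main(String[] args) {
        String line = "Retrying connect to server: prabhu1/172.26.98.45:8020. "
                + "Already tried 336 time(s); retry policy is "
                + "RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]";
        System.out.println("nominal ms left in this cycle: " + nominalRemainingMillis(line));
    }
}
```

Under the same reading, a full cycle is 500 x 2000 ms = 1,000,000 ms, roughly 17 minutes of sleep alone. Whether the client actually gives up after one cycle is not established here; the report above describes the retries as continuing indefinitely.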


