Posted to dev@oozie.apache.org by "Prabhu Joseph (JIRA)" <ji...@apache.org> on 2017/05/11 15:33:04 UTC

[jira] [Updated] (OOZIE-2887) Oozie Server hangs when there is a user job has wrong namenode address

     [ https://issues.apache.org/jira/browse/OOZIE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prabhu Joseph updated OOZIE-2887:
---------------------------------
    Description: 
All Oozie jobs go to the PREP state when a user job tries to connect to a wrong NameNode address by mistake. In the jstack output, all the threads that try to submit a job are waiting to lock a "java.util.ServiceLoader".

{code}

"pool-2-thread-19" #47 prio=5 os_prio=0 tid=0x00007f8c08734000 nid=0xb468 waiting for monitor entry [0x00007f8bf207a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:89)
        - waiting to lock <0x0000000081b29098> (a java.util.ServiceLoader)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
        at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
        at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.mapreduce.Job.connect(Job.java:1255)
        - locked <0x0000000082fd6b30> (a org.apache.hadoop.mapreduce.Job)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1187)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
        at org.apache.oozie.command.XCommand.call(XCommand.java:287)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}

The thread that is trying to connect to the wrong NameNode address has acquired that lock and keeps retrying the connection to the NameNode forever.

{code}
"pool-2-thread-20" #48 prio=5 os_prio=0 tid=0x00007f8c08736000 nid=0xb469 waiting on condition [0x00007f8bf1f78000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:899)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:666)
        - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
        - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
        at org.apache.hadoop.ipc.Client.call(Client.java:1449)
        at org.apache.hadoop.ipc.Client.call(Client.java:1396)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy31.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
        at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
        at com.sun.proxy.$Proxy32.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
        at org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:124)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:317)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:309)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        - locked <0x0000000083b422f8> (a java.lang.Object)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:199)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        - locked <0x0000000083c498f0> (a java.lang.Object)
        at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceStart(ResourceMgrDelegate.java:109)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        - locked <0x0000000083c49950> (a java.lang.Object)
        at org.apache.hadoop.mapred.ResourceMgrDelegate.<init>(ResourceMgrDelegate.java:98)
        at org.apache.hadoop.mapred.YARNRunner.<init>(YARNRunner.java:112)
        at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
        at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
        - locked <0x0000000081b29098> (a java.util.ServiceLoader)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
        at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475)
        at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454)
        at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:526)
        at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:524)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:524)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1416)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1137)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
        at org.apache.oozie.command.XCommand.call(XCommand.java:287)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

{code}
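
The pattern in the two dumps above can be reproduced outside Oozie. The following is a minimal sketch, not Oozie or Hadoop code: a plain Object (SHARED_LOCK, a hypothetical name) stands in for the java.util.ServiceLoader monitor, and Thread.sleep stands in for the connection retry loop. While the "holder" thread sleeps inside the synchronized block, the "waiter" thread should report BLOCKED, the same state as pool-2-thread-19.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class SharedLockStallDemo {
    // Hypothetical stand-in for the shared java.util.ServiceLoader monitor in the dumps.
    private static final Object SHARED_LOCK = new Object();

    /** Returns the state of a second thread while a first thread holds the lock and sleeps. */
    static Thread.State observeWaiterState(long holderSleepMillis) {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                lockHeld.countDown();
                try {
                    Thread.sleep(holderSleepMillis); // stands in for the connect retry loop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");
        Thread waiter = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                // A real worker would submit its job here.
            }
        }, "waiter");
        try {
            holder.start();
            lockHeld.await();          // make sure the holder owns the monitor first
            waiter.start();
            // Poll briefly: the waiter is RUNNABLE for an instant before it blocks.
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(holderSleepMillis / 2);
            Thread.State state = waiter.getState();
            while (state != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(5);
                state = waiter.getState();
            }
            holder.join();
            waiter.join();
            return state;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state while holder sleeps: " + observeWaiterState(1000));
    }
}
```

Every additional worker that reaches the same synchronized block queues up behind the holder in the same way, which matches all submitting threads being parked on one monitor.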

The Oozie logs show the job repeatedly retrying the connection to the wrong NameNode address.

{code}
2017-05-10 05:38:23,194  INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 333 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
2017-05-10 05:38:24,574  INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 334 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
2017-05-10 05:38:26,918  INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 335 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
2017-05-10 05:38:28,659  INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 336 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
{code}
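
For a rough sense of how long one such connection attempt can hold the lock, the sketch below multiplies out the pair printed in the retry policy. It assumes that "500x2000ms" means up to 500 retries with a nominal 2000 ms sleep each; the gaps between the log timestamps above (about 1.4 to 2.3 seconds) suggest the actual sleep varies around that value. RetryBudget and nominalMillis are hypothetical names, not Hadoop APIs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RetryBudget {
    // Matches a pair such as "500x2000ms" as printed in the log line's retry policy.
    private static final Pattern PAIR = Pattern.compile("(\\d+)x(\\d+)ms");

    /** Nominal total sleep time in milliseconds for one "<retries>x<sleep>ms" pair. */
    static long nominalMillis(String pair) {
        Matcher m = PAIR.matcher(pair);
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognized retry pair: " + pair);
        }
        long retries = Long.parseLong(m.group(1));
        long sleepMillis = Long.parseLong(m.group(2));
        return Math.multiplyExact(retries, sleepMillis);
    }

    public static void main(String[] args) {
        long total = nominalMillis("500x2000ms");
        System.out.println(total + " ms nominal, " + (total / 60_000) + " whole minutes");
    }
}
```

Under that assumption, 500 x 2000 ms comes to 1,000,000 ms, or about 16.7 minutes, for a single pass through the policy. Whether the caller then starts another pass is not visible in these logs.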

There should be some way to prevent the Oozie Server from falling into this trap when a user supplies wrong details.
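
One generic way to keep a single stuck caller from stalling every other caller is to bound the wait for a shared lock, so that waiters give up and report an error rather than block indefinitely. The sketch below illustrates that idea with java.util.concurrent.locks.ReentrantLock.tryLock. It is not taken from Oozie or Hadoop, BoundedLockWait and its methods are hypothetical, and it does not help the stuck thread itself, which would still need a bounded connection retry.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class BoundedLockWait {
    private final ReentrantLock lock = new ReentrantLock();

    /** Runs the task under the lock, or returns false if the lock is not free in time. */
    boolean runIfLockFree(Runnable task, long timeoutMillis) {
        boolean acquired = false;
        try {
            acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                task.run();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (acquired) {
                lock.unlock();
            }
        }
    }

    /** Demo: reports whether a second caller gets the lock while a first caller is stuck. */
    static boolean secondCallerGetsLock(long stuckMillis, long waitMillis) {
        BoundedLockWait guard = new BoundedLockWait();
        CountDownLatch held = new CountDownLatch(1);
        Thread stuck = new Thread(() -> guard.runIfLockFree(() -> {
            held.countDown();
            try {
                Thread.sleep(stuckMillis); // stands in for the endless connect retry
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 0));
        stuck.start();
        try {
            held.await(); // the first caller now owns the lock
            boolean result = guard.runIfLockFree(() -> { }, waitMillis);
            stuck.join();
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second caller got the lock: " + secondCallerGetsLock(500, 50));
    }
}
```

With a bounded wait, a caller that cannot get the lock can fail its own action and free its worker thread, so one misconfigured job affects fewer of the others.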


  was:
All Oozie jobs go to the PREP state when a user job tries to connect to a wrong NameNode address by mistake. In the jstack output, all the threads that try to submit a job are waiting to lock a "java.util.ServiceLoader".

{code}

"pool-2-thread-19" #47 prio=5 os_prio=0 tid=0x00007f8c08734000 nid=0xb468 waiting for monitor entry [0x00007f8bf207a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:89)
        - waiting to lock <0x0000000081b29098> (a java.util.ServiceLoader)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
        at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
        at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.mapreduce.Job.connect(Job.java:1255)
        - locked <0x0000000082fd6b30> (a org.apache.hadoop.mapreduce.Job)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1187)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
        at org.apache.oozie.command.XCommand.call(XCommand.java:287)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}

The thread that is trying to connect to the wrong NameNode address has acquired that lock and keeps retrying the connection to the NameNode forever.

{code}
"pool-2-thread-20" #48 prio=5 os_prio=0 tid=0x00007f8c08736000 nid=0xb469 waiting on condition [0x00007f8bf1f78000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:899)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:666)
        - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
        - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
        at org.apache.hadoop.ipc.Client.call(Client.java:1449)
        at org.apache.hadoop.ipc.Client.call(Client.java:1396)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy31.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
        at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
        at com.sun.proxy.$Proxy32.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
        at org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:124)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:317)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:309)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        - locked <0x0000000083b422f8> (a java.lang.Object)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:199)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        - locked <0x0000000083c498f0> (a java.lang.Object)
        at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceStart(ResourceMgrDelegate.java:109)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        - locked <0x0000000083c49950> (a java.lang.Object)
        at org.apache.hadoop.mapred.ResourceMgrDelegate.<init>(ResourceMgrDelegate.java:98)
        at org.apache.hadoop.mapred.YARNRunner.<init>(YARNRunner.java:112)
        at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
        at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
        - locked <0x0000000081b29098> (a java.util.ServiceLoader)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
        at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475)
        at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454)
        at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:526)
        at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:524)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:524)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1416)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1137)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
        at org.apache.oozie.command.XCommand.call(XCommand.java:287)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

{code}

The Oozie logs show the job repeatedly retrying the connection to the wrong NameNode address.

{code}
2017-05-10 05:38:23,194  INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: hpchdp2e.hpc.ford.com/19.5.224.16:8020. Already tried 333 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
2017-05-10 05:38:24,574  INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: hpchdp2e.hpc.ford.com/19.5.224.16:8020. Already tried 334 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
2017-05-10 05:38:26,918  INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: hpchdp2e.hpc.ford.com/19.5.224.16:8020. Already tried 335 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
2017-05-10 05:38:28,659  INFO Client:904 - SERVER[hpchdp2.hpc.ford.com] Retrying connect to server: hpchdp2e.hpc.ford.com/19.5.224.16:8020. Already tried 336 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
{code}

There should be some way to prevent the Oozie Server from falling into this trap when a user supplies wrong details.



> Oozie Server hangs when there is a user job has wrong namenode address 
> -----------------------------------------------------------------------
>
>                 Key: OOZIE-2887
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2887
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 4.3.0
>            Reporter: Prabhu Joseph
>            Priority: Critical
>


