You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Khaja Hussain (JIRA)" <ji...@apache.org> on 2015/05/01 17:56:07 UTC
[jira] [Commented] (MAPREDUCE-6312) Hive fails due to stale proxy
in ClientServiceDelegate
[ https://issues.apache.org/jira/browse/MAPREDUCE-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523360#comment-14523360 ]
Khaja Hussain commented on MAPREDUCE-6312:
------------------------------------------
We had very similar issue. And found out the /var directory on namenode was full and namenode came down which caused our hive query to fail with this error. I have the details below.
Error Log from Hive:
**************************
2015-05-01 08:51:35,774 Stage-4 map = 100%, reduce = 100%, Cumulative CPU 1818.78 sec
java.io.IOException: Could not find status of job:job_1422659026472_30162
at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:294)
at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:547)
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:426)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1508)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1275)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1093)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:359)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:456)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:466)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:749)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Ended Job = job_1422659026472_30162 with exception 'java.io.IOException(Could not find status of job:job_1422659026472_30162)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 3 Reduce: 5 Cumulative CPU: 5376.95 sec HDFS Read: 4371724652 HDFS Write: 3100170286 SUCCESS
Log from NameNode:
*****************************
log4j:ERROR Failed to flush writer,
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:282)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:59)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:324)
at org.apache.log4j.RollingFileAppender.subAppend(RollingFileAppender.java:276)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:176)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2027)
log4j:ERROR Failed to flush writer,
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:282)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:59)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:324)
at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:369)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:176)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$DefaultAuditLogger.logAuditMessage(FSNamesystem.java:7792)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$DefaultAuditLogger.logAuditEvent(FSNamesystem.java:7787)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logAuditEvent(FSNamesystem.java:340)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logAuditEvent(FSNamesystem.java:320)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logAuditEvent(FSNamesystem.java:314)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3532)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileLinkInfo(NameNodeRpcServer.java:796)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileLinkInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:780)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
Logs from YARN from one of our datanodes:
*********************************************************
2015-05-01 08:47:39,234 ERROR logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:uploadLogsForContainer(115)) - Cannot create writer for app application_1422659026472_30160. Disabling log-aggregation for this app.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create file/app-logs/aiqa/logs/application_1422659026472_30160/lpwhp6.npd.com_45454.tmp. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1207)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2235)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2190)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:526)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:354)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at $Proxy75.create(Unknown Source)
at sun.reflect.GeneratedMethodAccessor58.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at $Proxy75.create(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:258)
> Hive fails due to stale proxy in ClientServiceDelegate
> ------------------------------------------------------
>
> Key: MAPREDUCE-6312
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6312
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: client
> Affects Versions: 2.5.0
> Reporter: Radim Kubacki
>
> ClientServiceDelegate initializes its realProxy field to AMProxy for a new or running job. Later when the job finishes it will not update this proxy to query history server and AM will not return valid data for this job.
> We found this while investigating https://issues.cloudera.org/browse/DISTRO-631 that describes Hive failure because it uses loop like
> {code}
> progress(JobClient jc, RunningJob rj) { ...
> while (!rj.isComplete() || (extraRounds > 0)) {
> try {
> Thread.sleep(1000);
> } catch (InterruptedException e) {
> }
> RunningJob newRj = jc.getJob(rj.getID());
> if (newRj == null) {
> // under exceptional load, hadoop may not be able to look up status
> // of finished jobs (because it has purged them from memory). From
> // hive's perspective - it's equivalent to the job having failed.
> // So raise a meaningful exception
> throw new IOException("Could not find status of job:" + rj.getID());
> } else {
> rj = newRj;
> }
> }
> {code}
> In this snippet JobClient.getJob will try to create RunningJob instance referring to job file in /user/$USER/.staging even when job is finished and the file is moved to /user/history/done (or /user/history/done_intermediate).
> Note that Hive queries can succeed if there is a timing where HDFS performs actual file delete with a delay.
> We can try to write a patch if there is an agreement that this should be fixed.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)