Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2014/09/26 20:52:33 UTC

[jira] [Comment Edited] (TEZ-1621) Should report error to AM before shuting down TezChild

    [ https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149794#comment-14149794 ] 

Bikas Saha edited comment on TEZ-1621 at 9/26/14 6:51 PM:
----------------------------------------------------------

This will break when we stop sending the failure notification to the AM inline (which currently blocks the System.exit() until the AM has been notified). IMO we should remove the System.exit() calls from these places and move them up to TezChild, so that TezChild can observe the Exception/Error and determine whether it needs to exit. If it does need to exit, it can make sure all pending notifications are complete and that the AM gets a proper error/diagnostic before exiting. The current exit()s sprayed across the code make graceful cleanup hard to do, and are probably the cause of this jira. If doing the global System.exit() is difficult in this jira, then we should at least remove the current System.exit()s and open a follow-up jira to handle Error and exit in one place. That will remove the need to special-case local mode everywhere in this jira. Ideally, all of the Tez code should use a common util to handle shutdown, one that exits in non-local mode and does not exit in local mode.
The change to report the error looks good. The above comments are about the existing System.exit()s. Let me know what you think.
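
To make the "common util" idea concrete, here is a rough sketch of what such a helper might look like. Everything in it is hypothetical: the class name ShutdownHandler, the track() and handleFatal() methods, and the exit code 1 are illustration only and are not existing Tez APIs. It assumes pending AM notifications can be represented as futures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.IntConsumer;

/** Hypothetical single place to decide whether the process exits. Not Tez code. */
final class ShutdownHandler {
    private final boolean localMode;
    private final IntConsumer exitFunction; // injectable so tests never kill the JVM
    private final List<CompletableFuture<Void>> pending = new ArrayList<>();

    ShutdownHandler(boolean localMode, IntConsumer exitFunction) {
        this.localMode = localMode;
        this.exitFunction = exitFunction;
    }

    /** Convenience factory that really calls System.exit() in non-local mode. */
    static ShutdownHandler forProcess(boolean localMode) {
        return new ShutdownHandler(localMode, System::exit);
    }

    /** Register an in-flight notification (e.g. a failure report to the AM). */
    synchronized void track(CompletableFuture<Void> notification) {
        pending.add(notification);
    }

    /**
     * Waits (bounded by timeoutMs) for tracked notifications, then exits only
     * in non-local mode. Returns true if the exit function was invoked.
     */
    boolean handleFatal(Throwable cause, long timeoutMs) {
        System.err.println("Fatal error, starting shutdown: " + cause);
        List<CompletableFuture<Void>> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(pending);
        }
        try {
            CompletableFuture
                .allOf(snapshot.toArray(new CompletableFuture<?>[0]))
                .get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
        } catch (ExecutionException | TimeoutException e) {
            // Best effort: a failed or slow notification must not block shutdown forever.
        }
        if (localMode) {
            return false; // local mode shares the JVM with the client, so never exit
        }
        exitFunction.accept(1);
        return true;
    }
}
```

With something like this, the scattered exit sites would rethrow or hand the Error up, and only the top-level caller would invoke handleFatal(), so the local-mode check lives in one place.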


> Should report error to AM before shuting down TezChild
> ------------------------------------------------------
>
>                 Key: TEZ-1621
>                 URL: https://issues.apache.org/jira/browse/TEZ-1621
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Deepesh Khandelwal
>            Assignee: Jeff Zhang
>         Attachments: Tez-1621.patch, app_logs.txt, console.txt
>
>
> While running the testorderedwordcount example in session mode, the DAG failed with the following error on the console:
> {noformat}
> 14/09/25 01:55:53 INFO examples.TestOrderedWordCount: DAG 1 diagnostics: [Vertex failed, vertexName=initialmap, vertexId=vertex_1411586515507_0110_1_00, diagnostics=[Task failed, taskId=task_1411586515507_0110_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Container container_1411586515507_0110_01_000002 finished with diagnostics set to [Container failed. Exception from container-launch.
> Container id: container_1411586515507_0110_01_000002
> Exit code: 255
> Stack trace: ExitCodeException exitCode=255:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:290)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This wasn't very helpful; the root cause is in the application log:
> {noformat}
> 2014-09-25 01:55:41,246 ERROR [TezChild] org.apache.tez.runtime.task.TezTaskRunner: Exception of type Error. Exiting now
> java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(IILjava/nio/ByteBuffer;ILjava/nio/ByteBuffer;IILjava/lang/String;J)V
>         at org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(Native Method)
>         at org.apache.hadoop.util.NativeCrc32.verifyChunkedSums(NativeCrc32.java:57)
>         at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:291)
>         at org.apache.hadoop.hdfs.BlockReaderLocal.fillBuffer(BlockReaderLocal.java:344)
>         at org.apache.hadoop.hdfs.BlockReaderLocal.fillDataBuf(BlockReaderLocal.java:444)
>         at org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:575)
>         at org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:539)
>         at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:683)
>         at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:739)
>         at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:796)
>         at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
>         at java.io.DataInputStream.read(DataInputStream.java:100)
>         at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
>         at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>         at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
>         at org.apache.hadoop.mapreduce.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.nextKeyValue(TezGroupedSplitsInputFormat.java:167)
>         at org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:116)
>         at org.apache.tez.mapreduce.processor.map.MapProcessor$NewRecordReader.nextKeyValue(MapProcessor.java:266)
>         at org.apache.tez.mapreduce.hadoop.mapreduce.MapContextImpl.nextKeyValue(MapContextImpl.java:81)
>         at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.tez.mapreduce.processor.map.MapProcessor.runNewMapper(MapProcessor.java:237)
>         at org.apache.tez.mapreduce.processor.map.MapProcessor.run(MapProcessor.java:124)
>         at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:172)
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:167)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> 2014-09-25 01:55:41,250 INFO [TezChild] org.apache.hadoop.util.ExitUtil: Exiting with status -1
> {noformat}
> The complete console log and application log are attached.


