Posted to issues@tez.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2017/09/18 19:51:00 UTC
[jira] [Commented] (TEZ-3835) Failure during startup and shutdown caused DAGAppMaster to fail subsequent hive attempts

    [ https://issues.apache.org/jira/browse/TEZ-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16170565#comment-16170565 ] 

Jason Lowe commented on TEZ-3835:
---------------------------------

I think this is essentially a duplicate of TEZ-3834.  The staging directory was deleted because DAGAppMaster mistakenly believed the task schedulers had unregistered successfully.  Normally a successful unregistration means there can be no further attempts, but hasUnregistered mishandled errors and defaulted to true, which led to the staging directory being deleted before the app was complete.
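
To illustrate the failure mode, here is a minimal, self-contained sketch. It is not the actual Tez code: the class UnregisterSketch, both methods, and the use of Callable to stand in for a task scheduler are hypothetical names chosen for illustration. It only contrasts a check that stays true when an error is swallowed with one that treats any error as "not unregistered".

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these names come from the Tez code base.
public class UnregisterSketch {

    // Buggy pattern: an exception from a scheduler is swallowed, so the
    // result keeps its optimistic default of true.
    public static boolean buggyHasUnregistered(List<Callable<Boolean>> schedulers) {
        boolean result = true;
        for (Callable<Boolean> scheduler : schedulers) {
            try {
                result = result && scheduler.call();
            } catch (Exception e) {
                // Error ignored: result is left unchanged.
            }
        }
        return result;
    }

    // Safer pattern: any failure to confirm unregistration yields false,
    // so the caller assumes another attempt may still run.
    public static boolean safeHasUnregistered(List<Callable<Boolean>> schedulers) {
        for (Callable<Boolean> scheduler : schedulers) {
            try {
                if (!scheduler.call()) {
                    return false;
                }
            } catch (Exception e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A stand-in scheduler that fails, as one might after a startup error.
        Callable<Boolean> failing = () -> {
            throw new IllegalStateException("scheduler failed during startup");
        };
        List<Callable<Boolean>> schedulers = Collections.singletonList(failing);

        // In this sketch the staging directory would be deleted only when
        // the check returns true.
        System.out.println("buggy check: " + buggyHasUnregistered(schedulers));
        System.out.println("safe check:  " + safeHasUnregistered(schedulers));
    }
}
```

With the failing scheduler, the first check reports true and the second reports false. Under the assumption that staging directory cleanup is gated on this result, only the second keeps the directory available for a later AM attempt.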

> Failure during startup and shutdown caused DAGAppMaster to fail subsequent hive attempts
> ----------------------------------------------------------------------------------------
>
>                 Key: TEZ-3835
>                 URL: https://issues.apache.org/jira/browse/TEZ-3835
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>
> The staging directory is being deleted as part of shutdown. Hive (and not Pig) uses the staging directory to specify resources that the NM's ContainerLocalizer should download in order to start up the second AM attempt.
> {noformat:title=NM log exception}
> Failing this attempt.Diagnostics: File does not exist: <app dir>tez.session.local-resources.pb
> java.io.FileNotFoundException: File does not exist: <app dir>tez.session.local-resources.pb
> at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1440)
> at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1433)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1448)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
> at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
> at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
> at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)