Posted to issues@tez.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2014/11/19 14:03:33 UTC

[jira] [Commented] (TEZ-1790) DeallocationTaskRequest may been handled before corresponding AllocationTaskRequest in local mode

    [ https://issues.apache.org/jira/browse/TEZ-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217863#comment-14217863 ] 

Jeff Zhang commented on TEZ-1790:
---------------------------------

Attached the patch.  [~seth.siddharth@gmail.com], [~bikassaha], [~hitesh] please help review it.

* Remove the corresponding AllocationTaskRequest from the queue if the DeallocationTaskRequest is handled first.
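
For reviewers, here is a minimal sketch of the idea. It is not the code in the attached patch. The class PendingAllocations, its methods, and the use of a plain task object as the key are all hypothetical names chosen for illustration. It assumes pending allocation requests sit in a thread-safe queue until a handler picks them up, so a deallocation that arrives first can simply drop the matching entry.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for the scheduler's queue of pending allocation requests.
class PendingAllocations {
    // Hypothetical allocation request, identified only by the task it is for.
    static final class AllocationRequest {
        final Object task;
        AllocationRequest(Object task) { this.task = task; }
    }

    private final LinkedBlockingQueue<AllocationRequest> pending =
        new LinkedBlockingQueue<>();

    // Called when an allocation is requested; the request waits in the queue.
    void submitAllocation(Object task) {
        pending.add(new AllocationRequest(task));
    }

    // Called when a deallocation is handled. If the matching allocation has not
    // been handled yet, remove it from the queue so the task is never launched.
    // Returns true if a queued allocation was found and dropped.
    boolean deallocate(Object task) {
        return pending.removeIf(r -> r.task.equals(task));
    }

    int pendingCount() {
        return pending.size();
    }
}
```

With this shape, a deallocation that wins the race leaves nothing behind in the queue, so no task attempt is launched later for a DAG that has already been killed.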

> DeallocationTaskRequest may been handled before corresponding AllocationTaskRequest in local mode
> -------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1790
>                 URL: https://issues.apache.org/jira/browse/TEZ-1790
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-1790.patch
>
>
> In Tez local mode, when a DAG is killed, the DeallocationTaskRequest may be handled before the corresponding AllocationTaskRequest. In that case the TaskRequest is not really deallocated, because the AllocationTaskRequest is handled after the DeallocationTaskRequest. In local session mode, the DAG is killed but its TaskRequest is still there and goes on to launch the task attempt. The task attempt then starts heartbeating with the AM, while the AM has already started a new DAG. This causes the following exception (the task attempt is heartbeating against the wrong DAG, because its own DAG has been killed):
> {code}
> 15:38:24,208 - Thread(TaskHeartbeatThread) - (TezTaskRunner.java:333) - TaskReporter reported error
> java.lang.NullPointerException
> 	at org.apache.tez.dag.app.TaskAttemptListenerImpTezDag.heartbeat(TaskAttemptListenerImpTezDag.java:514)
> 	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.heartbeat(TaskReporter.java:249)
> 	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:176)
> 	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> This error causes the TezChild to be interrupted:
> {code}
> 16:04:26,718 - Thread(TezChild) - (TezTaskRunner.java:221) - Encounted an error while executing task: attempt_1416384252992_0001_2_00_000000_0
> java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
> 	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> 	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
> 	at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> 	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.initialize(LogicalIOProcessorRuntimeTask.java:211)
> 	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:173)
> 	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> 	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
> 	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> This issue sometimes causes TestExceptionPropagation to time out, especially on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)