Posted to issues@tez.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2015/05/25 07:03:17 UTC

[jira] [Commented] (TEZ-2475) Tez local mode hanging in big testsuite

    [ https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557976#comment-14557976 ] 

Jeff Zhang commented on TEZ-2475:
---------------------------------

[~fs111] It looks like TezChild is hanging while waiting to get a task from the AM.
{noformat}
"LocalTaskExecutionThread [0]" daemon prio=5 tid=0x00007fe36db49000 nid=0x221170b waiting on condition [0x00000001186f5000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000007d3f068d0> (a com.google.common.util.concurrent.ListenableFutureTask)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425)
	at java.util.concurrent.FutureTask.get(FutureTask.java:187)
	at org.apache.tez.runtime.task.TezChild.run(TezChild.java:189)
	at org.apache.tez.dag.app.launcher.LocalContainerLauncher$1.call(LocalContainerLauncher.java:318)
	at org.apache.tez.dag.app.launcher.LocalContainerLauncher$1.call(LocalContainerLauncher.java:307)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{noformat}
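
For context on why the thread shows up as WAITING (parking): an untimed FutureTask.get() parks the caller until the submitted work completes, so if that work never finishes, neither does the caller. The sketch below is only an illustration of that behaviour using a timed get() so a stuck future can be detected; the class and method names are hypothetical and none of this is Tez code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration class, not part of Tez.
public class FutureGetSketch {

    // Waits for the future with an upper bound instead of parking forever.
    public static String waitBounded(Future<?> future, long timeoutMillis) {
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "done";
        } catch (TimeoutException e) {
            // An untimed get() would still be parked at this point.
            return "timeout";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed";
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch never = new CountDownLatch(1);
        // This task blocks until interrupted, standing in for work that never finishes.
        Future<?> stuck = pool.submit(() -> {
            never.await();
            return null;
        });
        System.out.println(waitBounded(stuck, 100));
        pool.shutdownNow(); // interrupts the blocked worker so the JVM can exit
    }
}
```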

I suspect TezChild is stuck in the infinite loop below. BTW, did you turn off some logs? It's weird that I couldn't find any log messages about the ContainerReport. If so, could you turn those logs back on and run it again? This loop logs a message, so we can confirm whether it hangs here.
{code}
    for (int idle = 1; containerTask == null; idle++) {
      long sleepTimeMilliSecs = Math.min(idle * 10, getTaskMaxSleepTime);
      maybeLogSleepMessage(sleepTimeMilliSecs);
      TimeUnit.MILLISECONDS.sleep(sleepTimeMilliSecs);
      containerTask = umbilical.getTask(containerContext);
    }
{code}
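
To make the behaviour of that loop easier to reason about, here is a minimal standalone sketch of the same linear-backoff polling pattern. The class name, the Supplier-based task source, and the attempt cap are hypothetical additions for illustration; the real loop polls umbilical.getTask(containerContext) and has no cap, which is why it can spin forever if the AM never hands out a task.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical illustration class, not part of Tez.
public class IdlePollSketch {

    // Same formula as the quoted loop: 10 ms per idle round, capped at maxSleepMillis.
    public static long sleepTimeMillis(int idle, long maxSleepMillis) {
        return Math.min(idle * 10L, maxSleepMillis);
    }

    // Polls taskSource until it yields a non-null task or maxAttempts is reached.
    // Returns null if no task arrived or the thread was interrupted.
    public static <T> T pollForTask(Supplier<T> taskSource, long maxSleepMillis, int maxAttempts) {
        T task = null;
        for (int idle = 1; task == null && idle <= maxAttempts; idle++) {
            long sleep = sleepTimeMillis(idle, maxSleepMillis);
            // Logging every round makes it visible in the logs that the loop is still spinning.
            System.out.println("No task yet, sleeping " + sleep + " ms (attempt " + idle + ")");
            try {
                TimeUnit.MILLISECONDS.sleep(sleep);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                return null;
            }
            task = taskSource.get();
        }
        return task;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in task source: returns null twice, then a task.
        String task = pollForTask(() -> ++calls[0] < 3 ? null : "task-1", 20, 10);
        System.out.println("Got: " + task);
    }
}
```

If messages like the "No task yet" line keep appearing in the log, the thread is alive and polling rather than deadlocked, which is the distinction the logging in the real loop would let us confirm.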

> Tez local mode hanging in big testsuite
> ---------------------------------------
>
>                 Key: TEZ-2475
>                 URL: https://issues.apache.org/jira/browse/TEZ-2475
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0, 0.6.1
>            Reporter: André Kelpe
>         Attachments: 2015-05-21_15-55-20_buildLog.log.gz
>
>
> We have a big test suite for Lingual, our SQL layer for Cascading. We are trying very hard to make it work correctly on Tez, but I am stuck:
> The setup is a huge suite of SQL-based tests (6000+), which are executed in order in local mode. At certain moments the whole process just stops and nothing gets executed any longer. This does not happen on every run, but quite often. Note that it does not happen at the same line of code but at random, which makes it quite hard to debug.
> What I am seeing are stack traces like this in the middle of the run:
> 2015-05-21 16:07:42,413 ERROR [TaskHeartbeatThread] task.TezTaskRunner (TezTaskRunner.java:reportError(333)) - TaskReporter reported error
>     java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2188)
>         at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:187)
>         at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> This looks like it could be related to the hang, but the hang does not happen immediately afterwards, only some time later.
> I have gone through quite a few JIRAs and saw that there were problems with locks and hanging threads before, which should have been fixed, but the hang still happens.
> I have tried 0.6.1 and 0.7.0. Both show the same behaviour.
> This gist contains a thread dump of a hanging build: https://gist.github.com/fs111/1ee44469bf5cc31e5a52


