You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Matt Kent <ma...@persai.com> on 2008/03/18 07:08:57 UTC

[Fwd: Re: runtime exceptions not killing job]

Or maybe I can't use attachments, so here's the stack traces inline:

--------------------------task tracker----------------------------

2008-03-17 21:58:30
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed 
mode):

"Attach Listener" daemon prio=10 tid=0x00002aab1205c400 nid=0x523d 
waiting on condition [0x0000000000000000..0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"IPC Client connection to bigmike.internal.persai.com/192.168.1.3:9001" 
daemon prio=10 tid=0x00002aab14317000 nid=0x5230 in Object.wait() 
[0x0000000041c44000..0x0000000041c44ba0]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaaf304da08> (a 
org.apache.hadoop.ipc.Client$Connection)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
    - locked <0x00002aaaf304da08> (a 
org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)

"process reaper" daemon prio=10 tid=0x00002aab1205bc00 nid=0x51c6 
runnable [0x0000000041f47000..0x0000000041f47da0]
   java.lang.Thread.State: RUNNABLE
    at java.lang.UNIXProcess.waitForProcessExit(Native Method)
    at java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
    at java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)

"Thread-408" prio=10 tid=0x00002aab14316000 nid=0x51c5 in Object.wait() 
[0x0000000041d45000..0x0000000041d45a20]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaaf2cf0948> (a java.lang.UNIXProcess)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
    - locked <0x00002aaaf2cf0948> (a java.lang.UNIXProcess)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:152)
    at org.apache.hadoop.util.Shell.run(Shell.java:100)
    at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:252)
    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:456)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:379)

"SocketListener0-0" prio=10 tid=0x00002aab1205e400 nid=0x519d in 
Object.wait() [0x0000000041038000..0x0000000041038da0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaaf2650c20> (a 
org.mortbay.util.ThreadPool$PoolThread)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
    - locked <0x00002aaaf2650c20> (a org.mortbay.util.ThreadPool$PoolThread)

"org.apache.hadoop.dfs.DFSClient$LeaseChecker@64a7c45e" daemon prio=10 
tid=0x00002aab183a9000 nid=0x46f5 waiting on condition 
[0x0000000041a42000..0x0000000041a42aa0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
    at java.lang.Thread.run(Thread.java:619)

"org.apache.hadoop.dfs.DFSClient$LeaseChecker@7bcd280b" daemon prio=10 
tid=0x00002aab183ce000 nid=0x46ef waiting on condition 
[0x0000000041840000..0x0000000041840c20]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
    at java.lang.Thread.run(Thread.java:619)

"Map-events fetcher for all reduce tasks on 
tracker_kentbox.internal.persai.com:localhost/127.0.0.1:43477" daemon 
prio=10 tid=0x00002aab18438400 nid=0x4631 in Object.wait() 
[0x000000004173f000..0x000000004173fda0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaab3f0ace0> (a java.lang.Object)
    at 
org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:534)
    - locked <0x00002aaab3f0ace0> (a java.lang.Object)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 
tid=0x00002aab18427400 nid=0x462f waiting on condition 
[0x000000004153d000..0x000000004153daa0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:423)

"IPC Server handler 1 on 43477" daemon prio=10 tid=0x00002aab18476c00 
nid=0x462e in Object.wait() [0x000000004143c000..0x000000004143cb20]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaab41356b0> (a java.util.LinkedList)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
    - locked <0x00002aaab41356b0> (a java.util.LinkedList)

"IPC Server handler 0 on 43477" daemon prio=10 tid=0x00002aab18389c00 
nid=0x462d in Object.wait() [0x000000004133b000..0x000000004133bba0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaab41356b0> (a java.util.LinkedList)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
    - locked <0x00002aaab41356b0> (a java.util.LinkedList)

"IPC Server listener on 43477" daemon prio=10 tid=0x00002aab183ddc00 
nid=0x462c runnable [0x000000004123a000..0x000000004123ac20]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:184)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    - locked <0x00002aaab4136b80> (a sun.nio.ch.Util$1)
    - locked <0x00002aaab4136b68> (a java.util.Collections$UnmodifiableSet)
    - locked <0x00002aaab41328c8> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
    at org.apache.hadoop.ipc.Server$Listener.run(Server.java:308)

"IPC Server Responder" daemon prio=10 tid=0x00002aab183dd400 nid=0x462b 
runnable [0x0000000041139000..0x0000000041139ca0]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:184)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    - locked <0x00002aaab41362d0> (a sun.nio.ch.Util$1)
    - locked <0x00002aaab41362b8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x00002aaab4135f10> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at org.apache.hadoop.ipc.Server$Responder.run(Server.java:450)

"Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50060]" 
prio=10 tid=0x00002aab1838b400 nid=0x4629 runnable 
[0x0000000040f37000..0x0000000040f37da0]
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)
    - locked <0x00002aaab4136e78> (a java.net.SocksSocketImpl)
    at java.net.ServerSocket.implAccept(ServerSocket.java:453)
    at java.net.ServerSocket.accept(ServerSocket.java:421)
    at org.mortbay.util.ThreadedServer.acceptSocket(ThreadedServer.java:432)
    at org.mortbay.util.ThreadedServer$Acceptor.run(ThreadedServer.java:631)

"SessionScavenger" daemon prio=10 tid=0x00002aab183c7400 nid=0x4628 
waiting on condition [0x0000000040e36000..0x0000000040e36a20]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at 
org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger.run(AbstractSessionManager.java:587)

"taskCleanup" daemon prio=10 tid=0x00002aab18383000 nid=0x4623 waiting 
on condition [0x0000000040d35000..0x0000000040d35aa0]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00002aaab3f0ac10> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
    at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
    at org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:300)
    at java.lang.Thread.run(Thread.java:619)

"Low Memory Detector" daemon prio=10 tid=0x00002aab122b7000 nid=0x461f 
runnable [0x0000000000000000..0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x00002aab122b4800 nid=0x461e 
waiting on condition [0x0000000000000000..0x0000000040a31c10]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x00002aab122b2c00 nid=0x461d 
waiting on condition [0x0000000000000000..0x0000000040930c90]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00002aab1209e800 nid=0x461c 
runnable [0x0000000000000000..0x00000000408309b0]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00002aab12089400 nid=0x461b in 
Object.wait() [0x000000004072f000..0x000000004072fda0]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaab3e6b448> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
    - locked <0x00002aaab3e6b448> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00002aab12088400 nid=0x461a in 
Object.wait() [0x000000004062e000..0x000000004062ea20]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaab3e6f298> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <0x00002aaab3e6f298> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x0000000040117000 nid=0x4616 in Object.wait() 
[0x000000004022a000..0x000000004022aee0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaab3f09ff8> (a [I)
    at 
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:898)
    - locked <0x00002aaab3f09ff8> (a [I)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2210)

"VM Thread" prio=10 tid=0x00002aab12083800 nid=0x4619 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000040121400 
nid=0x4617 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000040122800 
nid=0x4618 runnable

"VM Periodic Task Thread" prio=10 tid=0x00002aab122b8c00 nid=0x4620 
waiting on condition

JNI global references: 831


----------------------------------------task 
child--------------------------------------------------

2008-03-17 21:58:45
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed 
mode):

"Attach Listener" daemon prio=10 tid=0x00002aaad440f000 nid=0x5254 
runnable [0x0000000000000000..0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"DestroyJavaVM" prio=10 tid=0x00002aaad8045800 nid=0x51c9 waiting on 
condition [0x0000000000000000..0x000000004032ad10]
   java.lang.Thread.State: RUNNABLE

"Thread-12" prio=10 tid=0x00002aaad8046400 nid=0x51fc in Object.wait() 
[0x0000000042940000..0x0000000042940da0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaac900c9f0> (a java.util.LinkedList)
    at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1577)
    - locked <0x00002aaac900c9f0> (a java.util.LinkedList)

"Comm thread for task_200803172138_0005_r_000003_0" daemon prio=10 
tid=0x00002aaad8032400 nid=0x51e3 waiting on condition 
[0x0000000041b39000..0x0000000041b39d20]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.mapred.Task$1.run(Task.java:282)
    at java.lang.Thread.run(Thread.java:619)

"org.apache.hadoop.dfs.DFSClient$LeaseChecker@7c29e357" daemon prio=10 
tid=0x00002aaad802e800 nid=0x51db waiting on condition 
[0x0000000041938000..0x0000000041938ca0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
    at java.lang.Thread.run(Thread.java:619)

"IPC Client connection to bigmike.internal.persai.com/192.168.1.3:9000" 
daemon prio=10 tid=0x00002aaad8024000 nid=0x51da in Object.wait() 
[0x0000000041737000..0x0000000041737c20]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaab3f9d468> (a 
org.apache.hadoop.ipc.Client$Connection)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
    - locked <0x00002aaab3f9d468> (a 
org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)

"IPC Client connection to /127.0.0.1:43477" daemon prio=10 
tid=0x00002aaad800b400 nid=0x51d9 in Object.wait() 
[0x0000000041536000..0x0000000041536ba0]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaab3f9dc88> (a 
org.apache.hadoop.ipc.Client$Connection)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
    - locked <0x00002aaab3f9dc88> (a 
org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 
tid=0x00002aaad8004000 nid=0x51d8 waiting on condition 
[0x0000000041335000..0x0000000041335b20]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:423)

"Low Memory Detector" daemon prio=10 tid=0x00002aaad3776400 nid=0x51d2 
runnable [0x0000000000000000..0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x00002aaad3773c00 nid=0x51d1 
waiting on condition [0x0000000000000000..0x0000000040e31d00]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x00002aaad3772000 nid=0x51d0 
waiting on condition [0x0000000000000000..0x0000000040d30c90]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00002aaad3770c00 nid=0x51cf 
runnable [0x0000000000000000..0x0000000040c30930]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00002aaad3548800 nid=0x51ce in 
Object.wait() [0x0000000040a2f000..0x0000000040a2fc20]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaab3e63330> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
    - locked <0x00002aaab3e63330> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00002aaad3547c00 nid=0x51cd in 
Object.wait() [0x000000004082e000..0x000000004082eba0]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaab3e66f50> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <0x00002aaab3e66f50> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x00002aaad3543000 nid=0x51cc runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000040121c00 
nid=0x51ca runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000040123000 
nid=0x51cb runnable

"VM Periodic Task Thread" prio=10 tid=0x00002aaad3778400 nid=0x51d3 
waiting on condition

JNI global references: 764


-------- Original Message --------
Subject: 	Re: runtime exceptions not killing job
Date: 	Mon, 17 Mar 2008 22:01:44 -0700
From: 	Matt Kent <ma...@persai.com>
To: 	core-user@hadoop.apache.org
References: 	<47...@persai.com> 
<2B...@yahoo-inc.com>



It seems to happen only with reduce tasks, not map tasks. I reproduced 
it by having a dummy reduce task throw an NPE immediately. The error is 
shown on the reduce details page but the job does not register the task 
as failed. I've attached the task tracker stack trace, the child stack 
trace and a screenshot of the task list page.

Matt

Owen O'Malley wrote:
>
> On Mar 17, 2008, at 3:14 PM, Matt Kent wrote:
>
>> I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 
>> 0.14.1, if a map or reduce task threw a runtime exception such as an 
>> NPE, the task, and ultimately the job, would fail in short order. I 
>> was running on job on my local 0.16.1 cluster today, and when the 
>> reduce tasks started throwing NPEs, the tasks just hung. Eventually 
>> they timed out and were killed, but is this expected behavior in 
>> 0.16.1? I'd prefer the job to fail quickly if NPEs are being thrown.
>
> This sounds like a bug. Tasks should certainly fail immediately if an 
> exception is thrown. Do you know where the exception is being thrown? 
> Can you get a stack trace of the task from jstack after the exception 
> and before the task times out?
>
> Thanks,
>    Owen
>


-- 
Matt Kent
Co-Founder
Persai
1221 40th St #113
Emeryville, CA 94608
matt@persai.com




-- 
Matt Kent
Co-Founder
Persai
1221 40th St #113
Emeryville, CA 94608
matt@persai.com


Re: [Fwd: Re: runtime exceptions not killing job]

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
Thanks Matt for info.
I raised a Jira for this at 
https://issues.apache.org/jira/browse/HADOOP-3039

Thanks
Amareshwari
Matt Kent wrote:
> Or maybe I can't use attachments, so here's the stack traces inline:
>
> --------------------------task tracker----------------------------
>
> 2008-03-17 21:58:30
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed 
> mode):
>
> "Attach Listener" daemon prio=10 tid=0x00002aab1205c400 nid=0x523d 
> waiting on condition [0x0000000000000000..0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "IPC Client connection to 
> bigmike.internal.persai.com/192.168.1.3:9001" daemon prio=10 
> tid=0x00002aab14317000 nid=0x5230 in Object.wait() 
> [0x0000000041c44000..0x0000000041c44ba0]
>   java.lang.Thread.State: WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaaf304da08> (a 
> org.apache.hadoop.ipc.Client$Connection)
>    at java.lang.Object.wait(Object.java:485)
>    at 
> org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
>    - locked <0x00002aaaf304da08> (a 
> org.apache.hadoop.ipc.Client$Connection)
>    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)
>
> "process reaper" daemon prio=10 tid=0x00002aab1205bc00 nid=0x51c6 
> runnable [0x0000000041f47000..0x0000000041f47da0]
>   java.lang.Thread.State: RUNNABLE
>    at java.lang.UNIXProcess.waitForProcessExit(Native Method)
>    at java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>    at java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>
> "Thread-408" prio=10 tid=0x00002aab14316000 nid=0x51c5 in 
> Object.wait() [0x0000000041d45000..0x0000000041d45a20]
>   java.lang.Thread.State: WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaaf2cf0948> (a java.lang.UNIXProcess)
>    at java.lang.Object.wait(Object.java:485)
>    at java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>    - locked <0x00002aaaf2cf0948> (a java.lang.UNIXProcess)
>    at org.apache.hadoop.util.Shell.runCommand(Shell.java:152)
>    at org.apache.hadoop.util.Shell.run(Shell.java:100)
>    at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:252)
>    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:456)
>    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:379)
>
> "SocketListener0-0" prio=10 tid=0x00002aab1205e400 nid=0x519d in 
> Object.wait() [0x0000000041038000..0x0000000041038da0]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaaf2650c20> (a 
> org.mortbay.util.ThreadPool$PoolThread)
>    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
>    - locked <0x00002aaaf2650c20> (a 
> org.mortbay.util.ThreadPool$PoolThread)
>
> "org.apache.hadoop.dfs.DFSClient$LeaseChecker@64a7c45e" daemon prio=10 
> tid=0x00002aab183a9000 nid=0x46f5 waiting on condition 
> [0x0000000041a42000..0x0000000041a42aa0]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>    at java.lang.Thread.sleep(Native Method)
>    at 
> org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
>    at java.lang.Thread.run(Thread.java:619)
>
> "org.apache.hadoop.dfs.DFSClient$LeaseChecker@7bcd280b" daemon prio=10 
> tid=0x00002aab183ce000 nid=0x46ef waiting on condition 
> [0x0000000041840000..0x0000000041840c20]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>    at java.lang.Thread.sleep(Native Method)
>    at 
> org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
>    at java.lang.Thread.run(Thread.java:619)
>
> "Map-events fetcher for all reduce tasks on 
> tracker_kentbox.internal.persai.com:localhost/127.0.0.1:43477" daemon 
> prio=10 tid=0x00002aab18438400 nid=0x4631 in Object.wait() 
> [0x000000004173f000..0x000000004173fda0]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaab3f0ace0> (a java.lang.Object)
>    at 
> org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:534) 
>
>    - locked <0x00002aaab3f0ace0> (a java.lang.Object)
>
> "org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 
> tid=0x00002aab18427400 nid=0x462f waiting on condition 
> [0x000000004153d000..0x000000004153daa0]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>    at java.lang.Thread.sleep(Native Method)
>    at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:423)
>
> "IPC Server handler 1 on 43477" daemon prio=10 tid=0x00002aab18476c00 
> nid=0x462e in Object.wait() [0x000000004143c000..0x000000004143cb20]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaab41356b0> (a java.util.LinkedList)
>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
>    - locked <0x00002aaab41356b0> (a java.util.LinkedList)
>
> "IPC Server handler 0 on 43477" daemon prio=10 tid=0x00002aab18389c00 
> nid=0x462d in Object.wait() [0x000000004133b000..0x000000004133bba0]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaab41356b0> (a java.util.LinkedList)
>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
>    - locked <0x00002aaab41356b0> (a java.util.LinkedList)
>
> "IPC Server listener on 43477" daemon prio=10 tid=0x00002aab183ddc00 
> nid=0x462c runnable [0x000000004123a000..0x000000004123ac20]
>   java.lang.Thread.State: RUNNABLE
>    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:184)
>    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>    - locked <0x00002aaab4136b80> (a sun.nio.ch.Util$1)
>    - locked <0x00002aaab4136b68> (a 
> java.util.Collections$UnmodifiableSet)
>    - locked <0x00002aaab41328c8> (a sun.nio.ch.EPollSelectorImpl)
>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
>    at org.apache.hadoop.ipc.Server$Listener.run(Server.java:308)
>
> "IPC Server Responder" daemon prio=10 tid=0x00002aab183dd400 
> nid=0x462b runnable [0x0000000041139000..0x0000000041139ca0]
>   java.lang.Thread.State: RUNNABLE
>    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:184)
>    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>    - locked <0x00002aaab41362d0> (a sun.nio.ch.Util$1)
>    - locked <0x00002aaab41362b8> (a 
> java.util.Collections$UnmodifiableSet)
>    - locked <0x00002aaab4135f10> (a sun.nio.ch.EPollSelectorImpl)
>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>    at org.apache.hadoop.ipc.Server$Responder.run(Server.java:450)
>
> "Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50060]" 
> prio=10 tid=0x00002aab1838b400 nid=0x4629 runnable 
> [0x0000000040f37000..0x0000000040f37da0]
>   java.lang.Thread.State: RUNNABLE
>    at java.net.PlainSocketImpl.socketAccept(Native Method)
>    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)
>    - locked <0x00002aaab4136e78> (a java.net.SocksSocketImpl)
>    at java.net.ServerSocket.implAccept(ServerSocket.java:453)
>    at java.net.ServerSocket.accept(ServerSocket.java:421)
>    at 
> org.mortbay.util.ThreadedServer.acceptSocket(ThreadedServer.java:432)
>    at 
> org.mortbay.util.ThreadedServer$Acceptor.run(ThreadedServer.java:631)
>
> "SessionScavenger" daemon prio=10 tid=0x00002aab183c7400 nid=0x4628 
> waiting on condition [0x0000000040e36000..0x0000000040e36a20]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>    at java.lang.Thread.sleep(Native Method)
>    at 
> org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger.run(AbstractSessionManager.java:587) 
>
>
> "taskCleanup" daemon prio=10 tid=0x00002aab18383000 nid=0x4623 waiting 
> on condition [0x0000000040d35000..0x0000000040d35aa0]
>   java.lang.Thread.State: WAITING (parking)
>    at sun.misc.Unsafe.park(Native Method)
>    - parking to wait for  <0x00002aaab3f0ac10> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>    at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) 
>
>    at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) 
>
>    at org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:300)
>    at java.lang.Thread.run(Thread.java:619)
>
> "Low Memory Detector" daemon prio=10 tid=0x00002aab122b7000 nid=0x461f 
> runnable [0x0000000000000000..0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread1" daemon prio=10 tid=0x00002aab122b4800 nid=0x461e 
> waiting on condition [0x0000000000000000..0x0000000040a31c10]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x00002aab122b2c00 nid=0x461d 
> waiting on condition [0x0000000000000000..0x0000000040930c90]
>   java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x00002aab1209e800 nid=0x461c 
> runnable [0x0000000000000000..0x00000000408309b0]
>   java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x00002aab12089400 nid=0x461b in 
> Object.wait() [0x000000004072f000..0x000000004072fda0]
>   java.lang.Thread.State: WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaab3e6b448> (a 
> java.lang.ref.ReferenceQueue$Lock)
>    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>    - locked <0x00002aaab3e6b448> (a java.lang.ref.ReferenceQueue$Lock)
>    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x00002aab12088400 nid=0x461a 
> in Object.wait() [0x000000004062e000..0x000000004062ea20]
>   java.lang.Thread.State: WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaab3e6f298> (a java.lang.ref.Reference$Lock)
>    at java.lang.Object.wait(Object.java:485)
>    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>    - locked <0x00002aaab3e6f298> (a java.lang.ref.Reference$Lock)
>
> "main" prio=10 tid=0x0000000040117000 nid=0x4616 in Object.wait() 
> [0x000000004022a000..0x000000004022aee0]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaab3f09ff8> (a [I)
>    at 
> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:898)
>    - locked <0x00002aaab3f09ff8> (a [I)
>    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
>    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2210)
>
> "VM Thread" prio=10 tid=0x00002aab12083800 nid=0x4619 runnable
>
> "GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000040121400 
> nid=0x4617 runnable
>
> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000040122800 
> nid=0x4618 runnable
>
> "VM Periodic Task Thread" prio=10 tid=0x00002aab122b8c00 nid=0x4620 
> waiting on condition
>
> JNI global references: 831
>
>
> ----------------------------------------task 
> child--------------------------------------------------
>
> 2008-03-17 21:58:45
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed 
> mode):
>
> "Attach Listener" daemon prio=10 tid=0x00002aaad440f000 nid=0x5254 
> runnable [0x0000000000000000..0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "DestroyJavaVM" prio=10 tid=0x00002aaad8045800 nid=0x51c9 waiting on 
> condition [0x0000000000000000..0x000000004032ad10]
>   java.lang.Thread.State: RUNNABLE
>
> "Thread-12" prio=10 tid=0x00002aaad8046400 nid=0x51fc in Object.wait() 
> [0x0000000042940000..0x0000000042940da0]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaac900c9f0> (a java.util.LinkedList)
>    at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1577) 
>
>    - locked <0x00002aaac900c9f0> (a java.util.LinkedList)
>
> "Comm thread for task_200803172138_0005_r_000003_0" daemon prio=10 
> tid=0x00002aaad8032400 nid=0x51e3 waiting on condition 
> [0x0000000041b39000..0x0000000041b39d20]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>    at java.lang.Thread.sleep(Native Method)
>    at org.apache.hadoop.mapred.Task$1.run(Task.java:282)
>    at java.lang.Thread.run(Thread.java:619)
>
> "org.apache.hadoop.dfs.DFSClient$LeaseChecker@7c29e357" daemon prio=10 
> tid=0x00002aaad802e800 nid=0x51db waiting on condition 
> [0x0000000041938000..0x0000000041938ca0]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>    at java.lang.Thread.sleep(Native Method)
>    at 
> org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
>    at java.lang.Thread.run(Thread.java:619)
>
> "IPC Client connection to 
> bigmike.internal.persai.com/192.168.1.3:9000" daemon prio=10 
> tid=0x00002aaad8024000 nid=0x51da in Object.wait() 
> [0x0000000041737000..0x0000000041737c20]
>   java.lang.Thread.State: WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaab3f9d468> (a 
> org.apache.hadoop.ipc.Client$Connection)
>    at java.lang.Object.wait(Object.java:485)
>    at 
> org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
>    - locked <0x00002aaab3f9d468> (a 
> org.apache.hadoop.ipc.Client$Connection)
>    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)
>
> "IPC Client connection to /127.0.0.1:43477" daemon prio=10 
> tid=0x00002aaad800b400 nid=0x51d9 in Object.wait() 
> [0x0000000041536000..0x0000000041536ba0]
>   java.lang.Thread.State: WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaab3f9dc88> (a 
> org.apache.hadoop.ipc.Client$Connection)
>    at java.lang.Object.wait(Object.java:485)
>    at 
> org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
>    - locked <0x00002aaab3f9dc88> (a 
> org.apache.hadoop.ipc.Client$Connection)
>    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)
>
> "org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 
> tid=0x00002aaad8004000 nid=0x51d8 waiting on condition 
> [0x0000000041335000..0x0000000041335b20]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>    at java.lang.Thread.sleep(Native Method)
>    at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:423)
>
> "Low Memory Detector" daemon prio=10 tid=0x00002aaad3776400 nid=0x51d2 
> runnable [0x0000000000000000..0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread1" daemon prio=10 tid=0x00002aaad3773c00 nid=0x51d1 
> waiting on condition [0x0000000000000000..0x0000000040e31d00]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x00002aaad3772000 nid=0x51d0 
> waiting on condition [0x0000000000000000..0x0000000040d30c90]
>   java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x00002aaad3770c00 nid=0x51cf 
> runnable [0x0000000000000000..0x0000000040c30930]
>   java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x00002aaad3548800 nid=0x51ce in 
> Object.wait() [0x0000000040a2f000..0x0000000040a2fc20]
>   java.lang.Thread.State: WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaab3e63330> (a 
> java.lang.ref.ReferenceQueue$Lock)
>    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>    - locked <0x00002aaab3e63330> (a java.lang.ref.ReferenceQueue$Lock)
>    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x00002aaad3547c00 nid=0x51cd 
> in Object.wait() [0x000000004082e000..0x000000004082eba0]
>   java.lang.Thread.State: WAITING (on object monitor)
>    at java.lang.Object.wait(Native Method)
>    - waiting on <0x00002aaab3e66f50> (a java.lang.ref.Reference$Lock)
>    at java.lang.Object.wait(Object.java:485)
>    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>    - locked <0x00002aaab3e66f50> (a java.lang.ref.Reference$Lock)
>
> "VM Thread" prio=10 tid=0x00002aaad3543000 nid=0x51cc runnable
>
> "GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000040121c00 
> nid=0x51ca runnable
>
> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000040123000 
> nid=0x51cb runnable
>
> "VM Periodic Task Thread" prio=10 tid=0x00002aaad3778400 nid=0x51d3 
> waiting on condition
>
> JNI global references: 764
>
>
> -------- Original Message --------
> Subject:     Re: runtime exceptions not killing job
> Date:     Mon, 17 Mar 2008 22:01:44 -0700
> From:     Matt Kent <ma...@persai.com>
> To:     core-user@hadoop.apache.org
> References:     <47...@persai.com> 
> <2B...@yahoo-inc.com>
>
>
>
> It seems to happen only with reduce tasks, not map tasks. I reproduced 
> it by having a dummy reduce task throw an NPE immediately. The error 
> is shown on the reduce details page but the job does not register the 
> task as failed. I've attached the task tracker stack trace, the child 
> stack trace and a screenshot of the task list page.
>
> Matt
>
> Owen O'Malley wrote:
>>
>> On Mar 17, 2008, at 3:14 PM, Matt Kent wrote:
>>
>>> I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 
>>> 0.14.1, if a map or reduce task threw a runtime exception such as an 
>>> NPE, the task, and ultimately the job, would fail in short order. I 
>>> was running on job on my local 0.16.1 cluster today, and when the 
>>> reduce tasks started throwing NPEs, the tasks just hung. Eventually 
>>> they timed out and were killed, but is this expected behavior in 
>>> 0.16.1? I'd prefer the job to fail quickly if NPEs are being thrown.
>>
>> This sounds like a bug. Tasks should certainly fail immediately if an 
>> exception is thrown. Do you know where the exception is being thrown? 
>> Can you get a stack trace of the task from jstack after the exception 
>> and before the task times out?
>>
>> Thanks,
>>    Owen
>>
>
>