Posted to common-user@hadoop.apache.org by Joseph Stein <cr...@gmail.com> on 2010/05/07 16:31:59 UTC

Re: hanging task

You need to either report status or increment a counter from within
your task. In your Java code there is a little trick to let the
cluster know that a task is not dead but just working hard.  While a
task is executing there is no built-in reporting that it is running as
expected unless it is writing output.  This means that if your tasks
spend a lot of time doing work without emitting anything, the cluster
may treat them as hung and kill them (based on the mapred.task.timeout
setting).

Have no fear, there is a way to tell the cluster that your task is doing
just fine.  You have two ways to do this: you can either report status
or increment a counter.  Either of these lets the tasktracker know the
task is OK, and the jobtracker sees that in turn.  Both options are
explained in the JavaDoc:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html
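For example, here is a rough sketch against the old org.apache.hadoop.mapred API
(the class name, counter, and status text below are made up for illustration, not
taken from the job discussed in this thread):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SlowMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  // Counter group/name are arbitrary; incrementing any counter counts as progress.
  enum MyCounters { RECORDS_PROCESSED }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // ... long-running per-record work goes here ...

    // Any one of these keeps the tasktracker from marking the task as hung:
    reporter.progress();                                    // just say "still alive"
    reporter.incrCounter(MyCounters.RECORDS_PROCESSED, 1);  // or bump a counter
    reporter.setStatus("processed record at offset " + key.get());  // or set a status string
  }
}

If the expensive work is a loop over many sub-steps inside a single map() call, it
is worth calling reporter.progress() inside that loop rather than only once per record.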

Here are some pointers in general you may find useful
http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/

On Fri, May 7, 2010 at 10:26 AM, Johannes Zillmann
<jz...@googlemail.com> wrote:
> Hi hadoop folks,
>
> I'm encountering the following problem on a 4-node cluster running hadoop-0.20.2.
>
> It's a map-only job reading about 9 GB of data from outside of Hadoop: 31 map tasks in total, with 12 map tasks running at a time.
> The first wave of mappers finishes successfully.
> Later on, the first tasks start failing, and they do so shortly before finishing:
> Task attempt_201005061210_0002_m_000001_0 failed to report status for 602 seconds. Killing!
> Task attempt_201005061210_0002_m_000001_1 failed to report status for 600 seconds. Killing!
> Task attempt_201005061210_0002_m_000001_2 failed to report status for 602 seconds. Killing!
>
> The unusual thing is that I do not find any sign of the job code in the thread dump the tasktracker takes automatically:
> ----------------------------------------------------------------
> 2010-05-05 00:59:03,515 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201004301437_0050_m_000001_0: Task attempt_201004301437_0050_m_000001_0 failed to report status for 601 seconds. Killing!
> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task
> 34 active threads
> Thread 29555 (process reaper):
>  State: RUNNABLE
>  Blocked count: 0
>  Waited count: 0
>  Stack:
>    java.lang.UNIXProcess.waitForProcessExit(Native Method)
>    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 29554 (JVM Runner jvm_201004301437_0050_m_1465855495 spawned.):
>  State: WAITING
>  Blocked count: 1
>  Waited count: 2
>  Waiting on java.lang.UNIXProcess@34b56f69
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>    org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>    org.apache.hadoop.util.Shell.run(Shell.java:134)
>    org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>    org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>    org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> Thread 29550 (Thread-15193):
>  State: WAITING
>  Blocked count: 13
>  Waited count: 14
>  Waiting on java.lang.Object@72c09161
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> Thread 29501 (process reaper):
>  State: RUNNABLE
>  Blocked count: 0
>  Waited count: 0
>  Stack:
>    java.lang.UNIXProcess.waitForProcessExit(Native Method)
>    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 29500 (JVM Runner jvm_201004301437_0050_m_861348535 spawned.):
>  State: WAITING
>  Blocked count: 1
>  Waited count: 2
>  Waiting on java.lang.UNIXProcess@56175cd4
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>    org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>    org.apache.hadoop.util.Shell.run(Shell.java:134)
>    org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>    org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>    org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> Thread 29496 (Thread-15162):
>  State: WAITING
>  Blocked count: 13
>  Waited count: 14
>  Waiting on java.lang.Object@3b916de2
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> Thread 29225 (process reaper):
>  State: RUNNABLE
>  Blocked count: 0
>  Waited count: 0
>  Stack:
>    java.lang.UNIXProcess.waitForProcessExit(Native Method)
>    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 29224 (JVM Runner jvm_201004301437_0050_m_-205889314 spawned.):
>  State: WAITING
>  Blocked count: 1
>  Waited count: 2
>  Waiting on java.lang.UNIXProcess@39874d3b
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>    org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>    org.apache.hadoop.util.Shell.run(Shell.java:134)
>    org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>    org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>    org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> Thread 29217 (process reaper):
>  State: RUNNABLE
>  Blocked count: 0
>  Waited count: 0
>  Stack:
>    java.lang.UNIXProcess.waitForProcessExit(Native Method)
>    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 29215 (JVM Runner jvm_201004301437_0050_m_-1797469636 spawned.):
>  State: WAITING
>  Blocked count: 1
>  Waited count: 2
>  Waiting on java.lang.UNIXProcess@2c39220f
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>    org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>    org.apache.hadoop.util.Shell.run(Shell.java:134)
>    org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>    org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>    org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> Thread 29208 (Thread-15014):
>  State: WAITING
>  Blocked count: 18
>  Waited count: 14
>  Waiting on java.lang.Object@7ddbbae2
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> Thread 29203 (Thread-15011):
>  State: WAITING
>  Blocked count: 13
>  Waited count: 14
>  Waiting on java.lang.Object@2dac3f6f
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> Thread 26714 (IPC Client (47) connection to /10.0.11.64:9001 from hadoop):
>  State: TIMED_WAITING
>  Blocked count: 10075
>  Waited count: 10075
>  Stack:
>    java.lang.Object.wait(Native Method)
>    org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
>    org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)
> Thread 33 (Directory/File cleanup thread):
>  State: WAITING
>  Blocked count: 2
>  Waited count: 615
>  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4d6c68b3
>  Stack:
>    sun.misc.Unsafe.park(Native Method)
>    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>    org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:89)
> Thread 9 (taskCleanup):
>  State: WAITING
>  Blocked count: 30
>  Waited count: 92
>  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3298407f
>  Stack:
>    sun.misc.Unsafe.park(Native Method)
>    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>    org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:317)
>    java.lang.Thread.run(Thread.java:619)
> Thread 32 (TaskLauncher for task):
>  State: WAITING
>  Blocked count: 127
>  Waited count: 123
>  Waiting on java.util.LinkedList@c33377
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
> Thread 31 (TaskLauncher for task):
>  State: WAITING
>  Blocked count: 3237
>  Waited count: 3147
>  Waiting on java.util.LinkedList@67001629
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
> Thread 30 (Map-events fetcher for all reduce tasks on tracker_node1t:localhost/127.0.0.1:48722):
>  State: WAITING
>  Blocked count: 712
>  Waited count: 804
>  Waiting on java.util.TreeMap@1fec8cf1
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:606)
> Thread 28 (IPC Server handler 7 on 48722):
>  State: WAITING
>  Blocked count: 10
>  Waited count: 3199
>  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>  Stack:
>    sun.misc.Unsafe.park(Native Method)
>    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>    org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> Thread 27 (IPC Server handler 6 on 48722):
>  State: WAITING
>  Blocked count: 265
>  Waited count: 3207
>  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>  Stack:
>    sun.misc.Unsafe.park(Native Method)
>    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>    org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> Thread 26 (IPC Server handler 5 on 48722):
>  State: WAITING
>  Blocked count: 296
>  Waited count: 3206
>  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>  Stack:
>    sun.misc.Unsafe.park(Native Method)
>    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>    org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> Thread 25 (IPC Server handler 4 on 48722):
>  State: WAITING
>  Blocked count: 199
>  Waited count: 3202
>  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>  Stack:
>    sun.misc.Unsafe.park(Native Method)
>    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>    org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> Thread 24 (IPC Server handler 3 on 48722):
>  State: WAITING
>  Blocked count: 216
>  Waited count: 3201
>  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>  Stack:
>    sun.misc.Unsafe.park(Native Method)
>    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>    org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> Thread 23 (IPC Server handler 2 on 48722):
>  State: WAITING
>  Blocked count: 119
>  Waited count: 3199
>  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>  Stack:
>    sun.misc.Unsafe.park(Native Method)
>    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>    org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> Thread 22 (IPC Server handler 1 on 48722):
>  State: WAITING
>  Blocked count: 193
>  Waited count: 3207
>  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>  Stack:
>    sun.misc.Unsafe.park(Native Method)
>    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>    org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> Thread 21 (IPC Server handler 0 on 48722):
>  State: WAITING
>  Blocked count: 285
>  Waited count: 3203
>  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>  Stack:
>    sun.misc.Unsafe.park(Native Method)
>    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>    org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> Thread 18 (IPC Server listener on 48722):
>  State: RUNNABLE
>  Blocked count: 0
>  Waited count: 0
>  Stack:
>    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
>    org.apache.hadoop.ipc.Server$Listener.run(Server.java:318)
> Thread 20 (IPC Server Responder):
>  State: RUNNABLE
>  Blocked count: 0
>  Waited count: 0
>  Stack:
>    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>    org.apache.hadoop.ipc.Server$Responder.run(Server.java:478)
> Thread 17 (Timer-0):
>  State: TIMED_WAITING
>  Blocked count: 1
>  Waited count: 12764
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.util.TimerThread.mainLoop(Timer.java:509)
>    java.util.TimerThread.run(Timer.java:462)
> Thread 16 (295726648@qtp0-0 - Acceptor0 SelectChannelConnector@0.0.0.0:50060):
>  State: RUNNABLE
>  Blocked count: 1310
>  Waited count: 1
>  Stack:
>    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>    org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
>    org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
>    org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
>    org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
>    org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> Thread 4 (Signal Dispatcher):
>  State: RUNNABLE
>  Blocked count: 0
>  Waited count: 0
>  Stack:
> Thread 3 (Finalizer):
>  State: WAITING
>  Blocked count: 1337
>  Waited count: 1336
>  Waiting on java.lang.ref.ReferenceQueue$Lock@5e77533b
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> Thread 2 (Reference Handler):
>  State: WAITING
>  Blocked count: 1544
>  Waited count: 1336
>  Waiting on java.lang.ref.Reference$Lock@46efbdf1
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> Thread 1 (main):
>  State: RUNNABLE
>  Blocked count: 152594
>  Waited count: 279661
>  Stack:
>    sun.management.ThreadImpl.getThreadInfo0(Native Method)
>    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>    org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
>    org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
>    org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1323)
>    org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1106)
>    org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
>    org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
>
> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201004301437_0050_m_000001_0
> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 1
> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201004301437_0050_m_000001_0 done; removing files.
> 2010-05-05 00:59:03,521 INFO org.apache.hadoop.mapred.IndexCache: Map ID attempt_201004301437_0050_m_000001_0 not found in cache
> 2010-05-05 00:59:05,012 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201004301437_0050_m_-205889314 exited. Number of tasks it ran: 0
> 2010-05-05 00:59:06,521 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201004301437_0050_m_000001_0 done; removing files.
> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201004301437_0050_m_000001_0 task's state:FAILED_UNCLEAN
> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201004301437_0050_m_000001_0
> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_201004301437_0050_m_000001_0
> 2010-05-05 00:59:08,606 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201004301437_0050_m_-1095145752 given task: attempt_201004301437_0050_m_000001_0
> 2010-05-05 00:59:09,222 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201004301437_0050_m_000001_0 0.0%
> 2010-05-05 00:59:09,560 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201004301437_0050_m_000001_0 0.0% cleanup
> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201004301437_0050_m_000001_0 is done.
> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201004301437_0050_m_000001_0  was 0
> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201004301437_0050_m_000001_0 done; removing files.
> ----------------------------------------------------------------
>
> The syslog of the attempt just looks like:
> ----------------------------------------------------------------
> 2010-05-04 23:48:03,365 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2010-05-04 23:48:03,674 WARN datameer.dap.sdk.util.ManifestMetaData: Failed to get version from the manifest file of jar 'file:/data/drive0/mapred/tmp/taskTracker/archive/nurago64.local/das/jobjars/61a2133b7887b9be48d42e49002c85e0/stripped-dap-0.24.dev-job.jar/stripped-dap-0.24.dev-job.jar' : null
> 2010-05-04 23:48:04,691 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
> 2010-05-04 23:48:04,872 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> 2010-05-04 23:48:04,873 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2010-05-04 23:48:04,874 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
> 2010-05-05 00:59:08,753 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=CLEANUP, sessionId=
> 2010-05-05 00:59:09,223 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
> 2010-05-05 00:59:09,442 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201004301437_0050_m_000001_0 is done. And is in the process of commiting
> 2010-05-05 00:59:09,562 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201004301437_0050_m_000001_0' done.
> ----------------------------------------------------------------
>
> Any ideas?
>
> best regards
> Johannes



-- 
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

Re: hanging task

Posted by Johannes Zillmann <jz...@googlemail.com>.
Hi Raghava,

it was a specific business-logic problem, not a commonly applicable one!

Johannes

On May 28, 2010, at 6:18 PM, Raghava Mutharaju wrote:

> Hello Johannes,
> 
>        Is this about the common tip of using StringBuilder/StringBuffer
> instead of '+' when concatenating Strings? I would like to know more about
> the problem and how you resolved it.
> 
> 
> Regards,
> Raghava.
> 
> On Fri, May 28, 2010 at 6:43 AM, Johannes Zillmann <jzillmann@googlemail.com
>> wrote:
> 
>> Thanks for the valuable input!
>> It was some string concatenation I did in each map call, and it got super
>> long. Resolving this made the problem go away!
>> 
>> cheers
>> Johannes
>> 
>> On May 7, 2010, at 6:35 PM, Michael Segel wrote:
>> 
>>> 
>>> There's a couple of things...
>>> 
>>> 1) Inside the map() method, if you're taking longer than the timeout, you
>>> will fail.
>>> 2) You could also fail if you run into GC problems that cause timeout
>>> failures.
>>> 
>>> What sometimes happens in long-running jobs is that you tend to see your
>>> GC take longer as the job runs.
>>> 
>>> Do you log a start and end time in each iteration of map()? This will
>>> tell you if the individual time spent in a map() method call is too long.
>>> 
>>> Also look at GC. Trying to go through all the logs can really drive one
>>> to drink. :-)
>>> 
>>> HTH
>>> 
>>> -Mike
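To make the timing suggestion above concrete, here is a rough sketch (hypothetical
class name and threshold, old org.apache.hadoop.mapred API assumed) of logging how
long each map() call takes so slow records or creeping GC pauses show up in the
attempt's syslog:

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TimedMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Log LOG = LogFactory.getLog(TimedMapper.class);
  private static final long SLOW_CALL_MS = 60 * 1000;  // arbitrary threshold

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long start = System.currentTimeMillis();

    // ... the real per-record work would go here ...

    long elapsed = System.currentTimeMillis() - start;
    if (elapsed > SLOW_CALL_MS) {
      // A single slow call (or a long GC pause) stands out in the task logs.
      LOG.info("map() for offset " + key.get() + " took " + elapsed + " ms");
    }
  }
}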
>>> 
>>> 
>>>> Subject: Re: hanging task
>>>> From: jzillmann@googlemail.com
>>>> Date: Fri, 7 May 2010 17:59:01 +0200
>>>> To: common-user@hadoop.apache.org
>>>> 
>>>> Sorry, forgot to say that one map task runs for about an hour or more.
>>>> So I do not think it's the progress thing. There are successful map
>>>> tasks, and those which are failing fail long after the 10 min timeout.
>>>> And the question is why there is not any stack of the map task visible
>>>> in the thread dump.
>>>> Is it swallowed, or is Hadoop stuck right after the mapper is done?
>>>> 
>>>> Johannes
>>>> 
>>>> 
>>>> On May 7, 2010, at 5:16 PM, Raghava Mutharaju wrote:
>>>> 
>>>>> Hello Johannes,
>>>>> 
>>>>> I had a similar problem and I used the first approach suggested by
>>>>> Joseph, i.e. to report the status back. I used the progress() method as
>>>>> well as the setStatus() method. The progress() method javadoc link is
>>>>> given below.
>>>>> 
>>>>> 
>>>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Progressable.html#progress%28%29
>>>>> 
>>>>> This should solve your problem. If you come to know of the reason why it
>>>>> is taking more time than you expected, please mention it here.
>>>>> 
>>>>> Regards,
>>>>> Raghava.
>>>>> 


Re: hanging task

Posted by Raghava Mutharaju <m....@gmail.com>.
Hello Johannes,

        Is this about the common tip of using StringBuilder/StringBuffer
instead of '+' when concatenating Strings? I would like to know more about
the problem and how you resolved it.


Regards,
Raghava.

On Fri, May 28, 2010 at 6:43 AM, Johannes Zillmann <jzillmann@googlemail.com
> wrote:

> Thanks for the valuable input!
> It was some string concatenation I did in each map call, and it got super
> long. Resolving this made the problem go away!
>
> cheers
> Johannes
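As a purely illustrative aside on the '+' vs. StringBuilder point Raghava raises
above (made-up numbers, and not necessarily what the actual job did): appending to
a String with '+' in a loop copies the whole accumulated string on every iteration,
so the work grows quadratically as the string gets long, whereas StringBuilder keeps
each append cheap.

public class ConcatDemo {
  public static void main(String[] args) {
    int n = 100000;

    long t0 = System.currentTimeMillis();
    String s = "";
    for (int i = 0; i < n; i++) {
      s = s + "x";                    // re-copies the whole string each time
    }
    System.out.println("'+' concat:    " + (System.currentTimeMillis() - t0) + " ms");

    t0 = System.currentTimeMillis();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < n; i++) {
      sb.append("x");                 // amortized constant-time append
    }
    System.out.println("StringBuilder: " + (System.currentTimeMillis() - t0) + " ms");
  }
}

In a long-running map() the same effect shows up as calls that get slower and slower
until the task misses the status timeout.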
> >>>>>  java.lang.Object.wait(Object.java:485)
> >>>>>  java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
> >>>>>  org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
> >>>>>  org.apache.hadoop.util.Shell.run(Shell.java:134)
> >>>>>
> >>>>
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
> >>>>>
> >>>>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
> >>>>>
> >>>>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> >>>>> Thread 29208 (Thread-15014):
> >>>>> State: WAITING
> >>>>> Blocked count: 18
> >>>>> Waited count: 14
> >>>>> Waiting on java.lang.Object@7ddbbae2
> >>>>> Stack:
> >>>>>  java.lang.Object.wait(Native Method)
> >>>>>  java.lang.Object.wait(Object.java:485)
> >>>>>  org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> >>>>> Thread 29203 (Thread-15011):
> >>>>> State: WAITING
> >>>>> Blocked count: 13
> >>>>> Waited count: 14
> >>>>> Waiting on java.lang.Object@2dac3f6f
> >>>>> Stack:
> >>>>>  java.lang.Object.wait(Native Method)
> >>>>>  java.lang.Object.wait(Object.java:485)
> >>>>>  org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> >>>>> Thread 26714 (IPC Client (47) connection to /10.0.11.64:9001 from
> >>>> hadoop):
> >>>>> State: TIMED_WAITING
> >>>>> Blocked count: 10075
> >>>>> Waited count: 10075
> >>>>> Stack:
> >>>>>  java.lang.Object.wait(Native Method)
> >>>>>  org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
> >>>>>  org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)
> >>>>> Thread 33 (Directory/File cleanup thread):
> >>>>> State: WAITING
> >>>>> Blocked count: 2
> >>>>> Waited count: 615
> >>>>> Waiting on
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4d6c68b3
> >>>>> Stack:
> >>>>>  sun.misc.Unsafe.park(Native Method)
> >>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>>>>
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>>>>
> >>>>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>>>
> >>>>
> org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:89)
> >>>>> Thread 9 (taskCleanup):
> >>>>> State: WAITING
> >>>>> Blocked count: 30
> >>>>> Waited count: 92
> >>>>> Waiting on
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3298407f
> >>>>> Stack:
> >>>>>  sun.misc.Unsafe.park(Native Method)
> >>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>>>>
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>>>>
> >>>>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>>>  org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:317)
> >>>>>  java.lang.Thread.run(Thread.java:619)
> >>>>> Thread 32 (TaskLauncher for task):
> >>>>> State: WAITING
> >>>>> Blocked count: 127
> >>>>> Waited count: 123
> >>>>> Waiting on java.util.LinkedList@c33377
> >>>>> Stack:
> >>>>>  java.lang.Object.wait(Native Method)
> >>>>>  java.lang.Object.wait(Object.java:485)
> >>>>>
> >>>>
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
> >>>>> Thread 31 (TaskLauncher for task):
> >>>>> State: WAITING
> >>>>> Blocked count: 3237
> >>>>> Waited count: 3147
> >>>>> Waiting on java.util.LinkedList@67001629
> >>>>> Stack:
> >>>>>  java.lang.Object.wait(Native Method)
> >>>>>  java.lang.Object.wait(Object.java:485)
> >>>>>
> >>>>
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
> >>>>> Thread 30 (Map-events fetcher for all reduce tasks on
> >>>> tracker_node1t:localhost/127.0.0.1:48722):
> >>>>> State: WAITING
> >>>>> Blocked count: 712
> >>>>> Waited count: 804
> >>>>> Waiting on java.util.TreeMap@1fec8cf1
> >>>>> Stack:
> >>>>>  java.lang.Object.wait(Native Method)
> >>>>>  java.lang.Object.wait(Object.java:485)
> >>>>>
> >>>>
> org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:606)
> >>>>> Thread 28 (IPC Server handler 7 on 48722):
> >>>>> State: WAITING
> >>>>> Blocked count: 10
> >>>>> Waited count: 3199
> >>>>> Waiting on
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>>>> Stack:
> >>>>>  sun.misc.Unsafe.park(Native Method)
> >>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>>>>
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>>>>
> >>>>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>>>> Thread 27 (IPC Server handler 6 on 48722):
> >>>>> State: WAITING
> >>>>> Blocked count: 265
> >>>>> Waited count: 3207
> >>>>> Waiting on
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>>>> Stack:
> >>>>>  sun.misc.Unsafe.park(Native Method)
> >>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>>>>
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>>>>
> >>>>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>>>> Thread 26 (IPC Server handler 5 on 48722):
> >>>>> State: WAITING
> >>>>> Blocked count: 296
> >>>>> Waited count: 3206
> >>>>> Waiting on
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>>>> Stack:
> >>>>>  sun.misc.Unsafe.park(Native Method)
> >>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>>>>
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>>>>
> >>>>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>>>> Thread 25 (IPC Server handler 4 on 48722):
> >>>>> State: WAITING
> >>>>> Blocked count: 199
> >>>>> Waited count: 3202
> >>>>> Waiting on
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>>>> Stack:
> >>>>>  sun.misc.Unsafe.park(Native Method)
> >>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>>>>
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>>>>
> >>>>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>>>> Thread 24 (IPC Server handler 3 on 48722):
> >>>>> State: WAITING
> >>>>> Blocked count: 216
> >>>>> Waited count: 3201
> >>>>> Waiting on
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>>>> Stack:
> >>>>>  sun.misc.Unsafe.park(Native Method)
> >>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>>>>
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>>>>
> >>>>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>>>> Thread 23 (IPC Server handler 2 on 48722):
> >>>>> State: WAITING
> >>>>> Blocked count: 119
> >>>>> Waited count: 3199
> >>>>> Waiting on
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>>>> Stack:
> >>>>>  sun.misc.Unsafe.park(Native Method)
> >>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>>>>
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>>>>
> >>>>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>>>> Thread 22 (IPC Server handler 1 on 48722):
> >>>>> State: WAITING
> >>>>> Blocked count: 193
> >>>>> Waited count: 3207
> >>>>> Waiting on
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>>>> Stack:
> >>>>>  sun.misc.Unsafe.park(Native Method)
> >>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>>>>
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>>>>
> >>>>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>>>> Thread 21 (IPC Server handler 0 on 48722):
> >>>>> State: WAITING
> >>>>> Blocked count: 285
> >>>>> Waited count: 3203
> >>>>> Waiting on
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>>>> Stack:
> >>>>>  sun.misc.Unsafe.park(Native Method)
> >>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>>>>
> >>>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>>>>
> >>>>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>>>> Thread 18 (IPC Server listener on 48722):
> >>>>> State: RUNNABLE
> >>>>> Blocked count: 0
> >>>>> Waited count: 0
> >>>>> Stack:
> >>>>>  sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>>>  sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> >>>>>  sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> >>>>>  sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> >>>>>  sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> >>>>>  sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
> >>>>>  org.apache.hadoop.ipc.Server$Listener.run(Server.java:318)
> >>>>> Thread 20 (IPC Server Responder):
> >>>>> State: RUNNABLE
> >>>>> Blocked count: 0
> >>>>> Waited count: 0
> >>>>> Stack:
> >>>>>  sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>>>  sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> >>>>>  sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> >>>>>  sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> >>>>>  sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> >>>>>  org.apache.hadoop.ipc.Server$Responder.run(Server.java:478)
> >>>>> Thread 17 (Timer-0):
> >>>>> State: TIMED_WAITING
> >>>>> Blocked count: 1
> >>>>> Waited count: 12764
> >>>>> Stack:
> >>>>>  java.lang.Object.wait(Native Method)
> >>>>>  java.util.TimerThread.mainLoop(Timer.java:509)
> >>>>>  java.util.TimerThread.run(Timer.java:462)
> >>>>> Thread 16 (295726648@qtp0-0 - Acceptor0
> >>>> SelectChannelConnector@0.0.0.0:50060):
> >>>>> State: RUNNABLE
> >>>>> Blocked count: 1310
> >>>>> Waited count: 1
> >>>>> Stack:
> >>>>>  sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>>>  sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> >>>>>  sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> >>>>>  sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> >>>>>  sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> >>>>>
> >>>>
> org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
> >>>>>
>  org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
> >>>>>
> >>>>
> org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
> >>>>>
> >>>>
> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
> >>>>>
> >>>>
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> >>>>> Thread 4 (Signal Dispatcher):
> >>>>> State: RUNNABLE
> >>>>> Blocked count: 0
> >>>>> Waited count: 0
> >>>>> Stack:
> >>>>> Thread 3 (Finalizer):
> >>>>> State: WAITING
> >>>>> Blocked count: 1337
> >>>>> Waited count: 1336
> >>>>> Waiting on java.lang.ref.ReferenceQueue$Lock@5e77533b
> >>>>> Stack:
> >>>>>  java.lang.Object.wait(Native Method)
> >>>>>  java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
> >>>>>  java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
> >>>>>  java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> >>>>> Thread 2 (Reference Handler):
> >>>>> State: WAITING
> >>>>> Blocked count: 1544
> >>>>> Waited count: 1336
> >>>>> Waiting on java.lang.ref.Reference$Lock@46efbdf1
> >>>>> Stack:
> >>>>>  java.lang.Object.wait(Native Method)
> >>>>>  java.lang.Object.wait(Object.java:485)
> >>>>>  java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> >>>>> Thread 1 (main):
> >>>>> State: RUNNABLE
> >>>>> Blocked count: 152594
> >>>>> Waited count: 279661
> >>>>> Stack:
> >>>>>  sun.management.ThreadImpl.getThreadInfo0(Native Method)
> >>>>>  sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
> >>>>>  sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
> >>>>>
> >>>>
> org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
> >>>>>
> >>>>
> org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
> >>>>>
> >>>>
> org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1323)
> >>>>>
> >>>>
> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1106)
> >>>>>  org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
> >>>>>  org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
> >>>>>
> >>>>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker:
> About
> >>>> to purge task: attempt_201004301437_0050_m_000001_0
> >>>>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker:
> >>>> addFreeSlot : current free slots : 1
> >>>>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskRunner:
> >>>> attempt_201004301437_0050_m_000001_0 done; removing files.
> >>>>> 2010-05-05 00:59:03,521 INFO org.apache.hadoop.mapred.IndexCache: Map
> ID
> >>>> attempt_201004301437_0050_m_000001_0 not found in cache
> >>>>> 2010-05-05 00:59:05,012 INFO org.apache.hadoop.mapred.JvmManager: JVM
> :
> >>>> jvm_201004301437_0050_m_-205889314 exited. Number of tasks it ran: 0
> >>>>> 2010-05-05 00:59:06,521 INFO org.apache.hadoop.mapred.TaskRunner:
> >>>> attempt_201004301437_0050_m_000001_0 done; removing files.
> >>>>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker:
> >>>> LaunchTaskAction (registerTask): attempt_201004301437_0050_m_000001_0
> task's
> >>>> state:FAILED_UNCLEAN
> >>>>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker:
> Trying
> >>>> to launch : attempt_201004301437_0050_m_000001_0
> >>>>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: In
> >>>> TaskLauncher, current free slots : 1 and trying to launch
> >>>> attempt_201004301437_0050_m_000001_0
> >>>>> 2010-05-05 00:59:08,606 INFO org.apache.hadoop.mapred.TaskTracker:
> JVM
> >>>> with ID: jvm_201004301437_0050_m_-1095145752 given task:
> >>>> attempt_201004301437_0050_m_000001_0
> >>>>> 2010-05-05 00:59:09,222 INFO org.apache.hadoop.mapred.TaskTracker:
> >>>> attempt_201004301437_0050_m_000001_0 0.0%
> >>>>> 2010-05-05 00:59:09,560 INFO org.apache.hadoop.mapred.TaskTracker:
> >>>> attempt_201004301437_0050_m_000001_0 0.0% cleanup
> >>>>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker:
> Task
> >>>> attempt_201004301437_0050_m_000001_0 is done.
> >>>>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker:
> >>>> reported output size for attempt_201004301437_0050_m_000001_0  was 0
> >>>>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskRunner:
> >>>> attempt_201004301437_0050_m_000001_0 done; removing files.
> >>>>> ----------------------------------------------------------------
> >>>>>
> >>>>> The syslog of the attempt just looks like:
> >>>>> ----------------------------------------------------------------
> >>>>> 2010-05-04 23:48:03,365 INFO
> org.apache.hadoop.metrics.jvm.JvmMetrics:
> >>>> Initializing JVM Metrics with processName=MAP, sessionId=
> >>>>> 2010-05-04 23:48:03,674 WARN datameer.dap.sdk.util.ManifestMetaData:
> >>>> Failed to get version from the manifest file of jar
> >>>>
> 'file:/data/drive0/mapred/tmp/taskTracker/archive/nurago64.local/das/jobjars/61a2133b7887b9be48d42e49002c85e0/stripped-dap-0.24.dev-job.jar/stripped-dap-0.24.dev-job.jar'
> >>>> : null
> >>>>> 2010-05-04 23:48:04,691 INFO org.apache.hadoop.mapred.MapTask:
> >>>> numReduceTasks: 0
> >>>>> 2010-05-04 23:48:04,872 INFO org.apache.hadoop.util.NativeCodeLoader:
> >>>> Loaded the native-hadoop library
> >>>>> 2010-05-04 23:48:04,873 INFO
> >>>> org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded &
> >>>> initialized native-zlib library
> >>>>> 2010-05-04 23:48:04,874 INFO org.apache.hadoop.io.compress.CodecPool:
> Got
> >>>> brand-new compressor
> >>>>> 2010-05-05 00:59:08,753 INFO
> org.apache.hadoop.metrics.jvm.JvmMetrics:
> >>>> Initializing JVM Metrics with processName=CLEANUP, sessionId=
> >>>>> 2010-05-05 00:59:09,223 INFO org.apache.hadoop.mapred.TaskRunner:
> >>>> Runnning cleanup for the task
> >>>>> 2010-05-05 00:59:09,442 INFO org.apache.hadoop.mapred.TaskRunner:
> >>>> Task:attempt_201004301437_0050_m_000001_0 is done. And is in the
> process of
> >>>> commiting
> >>>>> 2010-05-05 00:59:09,562 INFO org.apache.hadoop.mapred.TaskRunner:
> Task
> >>>> 'attempt_201004301437_0050_m_000001_0' done.
> >>>>> ----------------------------------------------------------------
> >>>>>
> >>>>> Any ideas ?
> >>>>>
> >>>>> best regards
> >>>>> Johannes
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> /*
> >>>> Joe Stein
> >>>> http://www.linkedin.com/in/charmalloc
> >>>> */
> >>>>
> >>
> >
>
>

Re: hanging task

Posted by Johannes Zillmann <jz...@googlemail.com>.
Thanks for the valuable input!
It was some string concatenation I did in each map call, and the concatenated string got super long. Resolving this made the problem go away!
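
For anyone hitting the same thing, here is a minimal sketch of that kind of fix (assuming a map-only job on the old org.apache.hadoop.mapred API; the class and field names below are made up for illustration, not taken from the actual job): build the per-record string in a StringBuilder instead of repeated String concatenation, and call reporter.progress() from time to time so a long map call is not marked as failed.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ConcatMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private long records = 0;

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Anti-pattern: doing result += field in a loop copies the whole string on
    // every append, so each map call gets slower as the string grows.
    StringBuilder line = new StringBuilder();   // fresh builder per record
    for (String field : value.toString().split("\t")) {
      line.append(field).append('|');
    }
    output.collect(new Text(key.toString()), new Text(line.toString()));

    // Tell the TaskTracker the task is still alive, even during long stretches
    // with no output (see the Reporter javadoc linked above).
    if (++records % 1000 == 0) {
      reporter.progress();                      // or reporter.setStatus("...")
    }
  }
}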

cheers
Johannes 

On May 7, 2010, at 6:35 PM, Michael Segel wrote:

> 
> There's a couple of things...
> 
> 1) Inside the map() method, if you're taking longer than the timeout, you will fail.
> 2) You could also fail if you run into GC problems that cause timeout failures.
> 
> What sometimes happens in long-running jobs is that you tend to see your GC take longer as the job runs.
> 
> Do you log a start and end time in each iteration of map()? This will tell you if the individual time spent in a map() method call is too long. 
> 
> Also look at GC. Trying to go through all the logs can really drive one to drink. :-)
> 
> HTH
> 
> -Mike
> 
> 
>> Subject: Re: hanging task
>> From: jzillmann@googlemail.com
>> Date: Fri, 7 May 2010 17:59:01 +0200
>> To: common-user@hadoop.apache.org
>> 
>> Sorry, forgot to say that one map task runs for more than one hour.
>> So I do not think it's the progress thing. There are successful map tasks, and those which are failing fail long after the 10 min timeout.
>> And the question is why there is not any stack of the map task visible in the thread dump.
>> Is it swallowed, or is Hadoop stuck right after the mapper is done?
>> 
>> Johannes
>> 
>> 
>> On May 7, 2010, at 5:16 PM, Raghava Mutharaju wrote:
>> 
>>> Hello Johannes,
>>> 
>>> I had a similar problem and I used the first approach suggested by Joseph
>>> i.e. to report the status back. I used the progress() method as well as the
>>> setStatus() method. The progress() method Javadoc link is given below.
>>> 
>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Progressable.html#progress%28%29
>>> 
>>> This should solve your problem. If you come to know of the reason why it is
>>> taking more time than you expected, please mention it here.
>>> 
>>> Regards,
>>> Raghava.
>>> 
>>> On Fri, May 7, 2010 at 10:31 AM, Joseph Stein <cr...@gmail.com> wrote:
>>> 
>>>> You need to either report status or increment a counter from within
>>>> your task. In your Java code there is a little trick to help the job
>>>> be “aware” within the cluster of tasks that are not dead but just
>>>> working hard.  During execution of a task there is no built in
>>>> reporting that the job is running as expected if it is not writing
>>>> out.  So this means that if your tasks are taking up a lot of time
>>>> doing work it is possible the cluster will see that task as failed
>>>> (based on the mapred.task.tracker.expiry.interval setting).
>>>> 
>>>> Have no fear there is a way to tell cluster that your task is doing
>>>> just fine.  You have 2 ways todo this you can either report the status
>>>> or increment a counter.  Both of these will cause the task tracker to
>>>> properly know the task is ok and this will get seen by the jobtracker
>>>> in turn.  Both of these options are explained in the JavaDoc
>>>> 
>>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html
>>>> 
>>>> Here are some pointers in general you may find useful
>>>> 
>>>> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
>>>> 
>>>> On Fri, May 7, 2010 at 10:26 AM, Johannes Zillmann
>>>> <jz...@googlemail.com> wrote:
>>>>> Hi hadoop folks,
>>>>> 
>>>>> i'm encountering following problem on a 4 node cluster running
>>>> hadoop-0.20.2.
>>>>> 
>>>>> Its a map only job reading about 9 GB data from outside of hadoop. 31 map
>>>> tasks at all while 12 map tasks running at a time.
>>>>> The first wave of mappers finishes successfully.
>>>>> Later on the first tasks are failing and they do that shortly before
>>>> finishing:
>>>>> Task attempt_201005061210_0002_m_000001_0 failed to report status for 602
>>>> seconds. Killing!
>>>>> Task attempt_201005061210_0002_m_000001_1 failed to report status for 600
>>>> seconds. Killing!
>>>>> Task attempt_201005061210_0002_m_000001_2 failed to report status for 602
>>>> seconds. Killing!
>>>>> 
>>>>> The unusual is that i do not find any signs of the job code in the thread
>>>> dump the tasktracker takes automatically:
>>>>> ----------------------------------------------------------------
>>>>> 2010-05-05 00:59:03,515 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> attempt_201004301437_0050_m_000001_0: Task
>>>> attempt_201004301437_0050_m_000001_0 failed to report status for 601
>>>> seconds. Killing!
>>>>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Process Thread Dump: lost task
>>>>> 34 active threads
>>>>> Thread 29555 (process reaper):
>>>>> State: RUNNABLE
>>>>> Blocked count: 0
>>>>> Waited count: 0
>>>>> Stack:
>>>>>  java.lang.UNIXProcess.waitForProcessExit(Native Method)
>>>>>  java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>>>  java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>>>> Thread 29554 (JVM Runner jvm_201004301437_0050_m_1465855495 spawned.):
>>>>> State: WAITING
>>>>> Blocked count: 1
>>>>> Waited count: 2
>>>>> Waiting on java.lang.UNIXProcess@34b56f69
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>>  java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>>>>>  org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>>>>>  org.apache.hadoop.util.Shell.run(Shell.java:134)
>>>>> 
>>>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>>>>> 
>>>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>>>>> 
>>>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
>>>>> Thread 29550 (Thread-15193):
>>>>> State: WAITING
>>>>> Blocked count: 13
>>>>> Waited count: 14
>>>>> Waiting on java.lang.Object@72c09161
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>>  org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
>>>>> Thread 29501 (process reaper):
>>>>> State: RUNNABLE
>>>>> Blocked count: 0
>>>>> Waited count: 0
>>>>> Stack:
>>>>>  java.lang.UNIXProcess.waitForProcessExit(Native Method)
>>>>>  java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>>>  java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>>>> Thread 29500 (JVM Runner jvm_201004301437_0050_m_861348535 spawned.):
>>>>> State: WAITING
>>>>> Blocked count: 1
>>>>> Waited count: 2
>>>>> Waiting on java.lang.UNIXProcess@56175cd4
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>>  java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>>>>>  org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>>>>>  org.apache.hadoop.util.Shell.run(Shell.java:134)
>>>>> 
>>>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>>>>> 
>>>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>>>>> 
>>>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
>>>>> Thread 29496 (Thread-15162):
>>>>> State: WAITING
>>>>> Blocked count: 13
>>>>> Waited count: 14
>>>>> Waiting on java.lang.Object@3b916de2
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>>  org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
>>>>> Thread 29225 (process reaper):
>>>>> State: RUNNABLE
>>>>> Blocked count: 0
>>>>> Waited count: 0
>>>>> Stack:
>>>>>  java.lang.UNIXProcess.waitForProcessExit(Native Method)
>>>>>  java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>>>  java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>>>> Thread 29224 (JVM Runner jvm_201004301437_0050_m_-205889314 spawned.):
>>>>> State: WAITING
>>>>> Blocked count: 1
>>>>> Waited count: 2
>>>>> Waiting on java.lang.UNIXProcess@39874d3b
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>>  java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>>>>>  org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>>>>>  org.apache.hadoop.util.Shell.run(Shell.java:134)
>>>>> 
>>>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>>>>> 
>>>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>>>>> 
>>>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
>>>>> Thread 29217 (process reaper):
>>>>> State: RUNNABLE
>>>>> Blocked count: 0
>>>>> Waited count: 0
>>>>> Stack:
>>>>>  java.lang.UNIXProcess.waitForProcessExit(Native Method)
>>>>>  java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>>>  java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>>>> Thread 29215 (JVM Runner jvm_201004301437_0050_m_-1797469636 spawned.):
>>>>> State: WAITING
>>>>> Blocked count: 1
>>>>> Waited count: 2
>>>>> Waiting on java.lang.UNIXProcess@2c39220f
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>>  java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>>>>>  org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>>>>>  org.apache.hadoop.util.Shell.run(Shell.java:134)
>>>>> 
>>>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>>>>> 
>>>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>>>>> 
>>>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
>>>>> Thread 29208 (Thread-15014):
>>>>> State: WAITING
>>>>> Blocked count: 18
>>>>> Waited count: 14
>>>>> Waiting on java.lang.Object@7ddbbae2
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>>  org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
>>>>> Thread 29203 (Thread-15011):
>>>>> State: WAITING
>>>>> Blocked count: 13
>>>>> Waited count: 14
>>>>> Waiting on java.lang.Object@2dac3f6f
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>>  org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
>>>>> Thread 26714 (IPC Client (47) connection to /10.0.11.64:9001 from
>>>> hadoop):
>>>>> State: TIMED_WAITING
>>>>> Blocked count: 10075
>>>>> Waited count: 10075
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
>>>>>  org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)
>>>>> Thread 33 (Directory/File cleanup thread):
>>>>> State: WAITING
>>>>> Blocked count: 2
>>>>> Waited count: 615
>>>>> Waiting on
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4d6c68b3
>>>>> Stack:
>>>>>  sun.misc.Unsafe.park(Native Method)
>>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>>> 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>>>> 
>>>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>>> 
>>>> org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:89)
>>>>> Thread 9 (taskCleanup):
>>>>> State: WAITING
>>>>> Blocked count: 30
>>>>> Waited count: 92
>>>>> Waiting on
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3298407f
>>>>> Stack:
>>>>>  sun.misc.Unsafe.park(Native Method)
>>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>>> 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>>>> 
>>>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>>>  org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:317)
>>>>>  java.lang.Thread.run(Thread.java:619)
>>>>> Thread 32 (TaskLauncher for task):
>>>>> State: WAITING
>>>>> Blocked count: 127
>>>>> Waited count: 123
>>>>> Waiting on java.util.LinkedList@c33377
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>> 
>>>> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
>>>>> Thread 31 (TaskLauncher for task):
>>>>> State: WAITING
>>>>> Blocked count: 3237
>>>>> Waited count: 3147
>>>>> Waiting on java.util.LinkedList@67001629
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>> 
>>>> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
>>>>> Thread 30 (Map-events fetcher for all reduce tasks on
>>>> tracker_node1t:localhost/127.0.0.1:48722):
>>>>> State: WAITING
>>>>> Blocked count: 712
>>>>> Waited count: 804
>>>>> Waiting on java.util.TreeMap@1fec8cf1
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>> 
>>>> org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:606)
>>>>> Thread 28 (IPC Server handler 7 on 48722):
>>>>> State: WAITING
>>>>> Blocked count: 10
>>>>> Waited count: 3199
>>>>> Waiting on
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>>>> Stack:
>>>>>  sun.misc.Unsafe.park(Native Method)
>>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>>> 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>>>> 
>>>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>>>> Thread 27 (IPC Server handler 6 on 48722):
>>>>> State: WAITING
>>>>> Blocked count: 265
>>>>> Waited count: 3207
>>>>> Waiting on
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>>>> Stack:
>>>>>  sun.misc.Unsafe.park(Native Method)
>>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>>> 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>>>> 
>>>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>>>> Thread 26 (IPC Server handler 5 on 48722):
>>>>> State: WAITING
>>>>> Blocked count: 296
>>>>> Waited count: 3206
>>>>> Waiting on
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>>>> Stack:
>>>>>  sun.misc.Unsafe.park(Native Method)
>>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>>> 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>>>> 
>>>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>>>> Thread 25 (IPC Server handler 4 on 48722):
>>>>> State: WAITING
>>>>> Blocked count: 199
>>>>> Waited count: 3202
>>>>> Waiting on
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>>>> Stack:
>>>>>  sun.misc.Unsafe.park(Native Method)
>>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>>> 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>>>> 
>>>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>>>> Thread 24 (IPC Server handler 3 on 48722):
>>>>> State: WAITING
>>>>> Blocked count: 216
>>>>> Waited count: 3201
>>>>> Waiting on
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>>>> Stack:
>>>>>  sun.misc.Unsafe.park(Native Method)
>>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>>> 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>>>> 
>>>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>>>> Thread 23 (IPC Server handler 2 on 48722):
>>>>> State: WAITING
>>>>> Blocked count: 119
>>>>> Waited count: 3199
>>>>> Waiting on
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>>>> Stack:
>>>>>  sun.misc.Unsafe.park(Native Method)
>>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>>> 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>>>> 
>>>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>>>> Thread 22 (IPC Server handler 1 on 48722):
>>>>> State: WAITING
>>>>> Blocked count: 193
>>>>> Waited count: 3207
>>>>> Waiting on
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>>>> Stack:
>>>>>  sun.misc.Unsafe.park(Native Method)
>>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>>> 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>>>> 
>>>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>>>> Thread 21 (IPC Server handler 0 on 48722):
>>>>> State: WAITING
>>>>> Blocked count: 285
>>>>> Waited count: 3203
>>>>> Waiting on
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>>>> Stack:
>>>>>  sun.misc.Unsafe.park(Native Method)
>>>>>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>>> 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>>>> 
>>>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>>>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>>>> Thread 18 (IPC Server listener on 48722):
>>>>> State: RUNNABLE
>>>>> Blocked count: 0
>>>>> Waited count: 0
>>>>> Stack:
>>>>>  sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>>  sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>>>>>  sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>>>>>  sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>>>>>  sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>>>>>  sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
>>>>>  org.apache.hadoop.ipc.Server$Listener.run(Server.java:318)
>>>>> Thread 20 (IPC Server Responder):
>>>>> State: RUNNABLE
>>>>> Blocked count: 0
>>>>> Waited count: 0
>>>>> Stack:
>>>>>  sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>>  sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>>>>>  sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>>>>>  sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>>>>>  sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>>>>>  org.apache.hadoop.ipc.Server$Responder.run(Server.java:478)
>>>>> Thread 17 (Timer-0):
>>>>> State: TIMED_WAITING
>>>>> Blocked count: 1
>>>>> Waited count: 12764
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.util.TimerThread.mainLoop(Timer.java:509)
>>>>>  java.util.TimerThread.run(Timer.java:462)
>>>>> Thread 16 (295726648@qtp0-0 - Acceptor0
>>>> SelectChannelConnector@0.0.0.0:50060):
>>>>> State: RUNNABLE
>>>>> Blocked count: 1310
>>>>> Waited count: 1
>>>>> Stack:
>>>>>  sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>>  sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>>>>>  sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>>>>>  sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>>>>>  sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>>>>> 
>>>> org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
>>>>>  org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
>>>>> 
>>>> org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
>>>>> 
>>>> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
>>>>> 
>>>> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
>>>>> Thread 4 (Signal Dispatcher):
>>>>> State: RUNNABLE
>>>>> Blocked count: 0
>>>>> Waited count: 0
>>>>> Stack:
>>>>> Thread 3 (Finalizer):
>>>>> State: WAITING
>>>>> Blocked count: 1337
>>>>> Waited count: 1336
>>>>> Waiting on java.lang.ref.ReferenceQueue$Lock@5e77533b
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>>>>>  java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>>>>>  java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>>>>> Thread 2 (Reference Handler):
>>>>> State: WAITING
>>>>> Blocked count: 1544
>>>>> Waited count: 1336
>>>>> Waiting on java.lang.ref.Reference$Lock@46efbdf1
>>>>> Stack:
>>>>>  java.lang.Object.wait(Native Method)
>>>>>  java.lang.Object.wait(Object.java:485)
>>>>>  java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>>>>> Thread 1 (main):
>>>>> State: RUNNABLE
>>>>> Blocked count: 152594
>>>>> Waited count: 279661
>>>>> Stack:
>>>>>  sun.management.ThreadImpl.getThreadInfo0(Native Method)
>>>>>  sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>>>>>  sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>>>>> 
>>>> org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
>>>>> 
>>>> org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
>>>>> 
>>>> org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1323)
>>>>> 
>>>> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1106)
>>>>>  org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
>>>>>  org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
>>>>> 
>>>>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker: About
>>>> to purge task: attempt_201004301437_0050_m_000001_0
>>>>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> addFreeSlot : current free slots : 1
>>>>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskRunner:
>>>> attempt_201004301437_0050_m_000001_0 done; removing files.
>>>>> 2010-05-05 00:59:03,521 INFO org.apache.hadoop.mapred.IndexCache: Map ID
>>>> attempt_201004301437_0050_m_000001_0 not found in cache
>>>>> 2010-05-05 00:59:05,012 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>>>> jvm_201004301437_0050_m_-205889314 exited. Number of tasks it ran: 0
>>>>> 2010-05-05 00:59:06,521 INFO org.apache.hadoop.mapred.TaskRunner:
>>>> attempt_201004301437_0050_m_000001_0 done; removing files.
>>>>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> LaunchTaskAction (registerTask): attempt_201004301437_0050_m_000001_0 task's
>>>> state:FAILED_UNCLEAN
>>>>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: Trying
>>>> to launch : attempt_201004301437_0050_m_000001_0
>>>>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: In
>>>> TaskLauncher, current free slots : 1 and trying to launch
>>>> attempt_201004301437_0050_m_000001_0
>>>>> 2010-05-05 00:59:08,606 INFO org.apache.hadoop.mapred.TaskTracker: JVM
>>>> with ID: jvm_201004301437_0050_m_-1095145752 given task:
>>>> attempt_201004301437_0050_m_000001_0
>>>>> 2010-05-05 00:59:09,222 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> attempt_201004301437_0050_m_000001_0 0.0%
>>>>> 2010-05-05 00:59:09,560 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> attempt_201004301437_0050_m_000001_0 0.0% cleanup
>>>>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker: Task
>>>> attempt_201004301437_0050_m_000001_0 is done.
>>>>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> reported output size for attempt_201004301437_0050_m_000001_0  was 0
>>>>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskRunner:
>>>> attempt_201004301437_0050_m_000001_0 done; removing files.
>>>>> ----------------------------------------------------------------
>>>>> 
>>>>> The syslog of the attempt just looks like:
>>>>> ----------------------------------------------------------------
>>>>> 2010-05-04 23:48:03,365 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
>>>> Initializing JVM Metrics with processName=MAP, sessionId=
>>>>> 2010-05-04 23:48:03,674 WARN datameer.dap.sdk.util.ManifestMetaData:
>>>> Failed to get version from the manifest file of jar
>>>> 'file:/data/drive0/mapred/tmp/taskTracker/archive/nurago64.local/das/jobjars/61a2133b7887b9be48d42e49002c85e0/stripped-dap-0.24.dev-job.jar/stripped-dap-0.24.dev-job.jar'
>>>> : null
>>>>> 2010-05-04 23:48:04,691 INFO org.apache.hadoop.mapred.MapTask:
>>>> numReduceTasks: 0
>>>>> 2010-05-04 23:48:04,872 INFO org.apache.hadoop.util.NativeCodeLoader:
>>>> Loaded the native-hadoop library
>>>>> 2010-05-04 23:48:04,873 INFO
>>>> org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded &
>>>> initialized native-zlib library
>>>>> 2010-05-04 23:48:04,874 INFO org.apache.hadoop.io.compress.CodecPool: Got
>>>> brand-new compressor
>>>>> 2010-05-05 00:59:08,753 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
>>>> Initializing JVM Metrics with processName=CLEANUP, sessionId=
>>>>> 2010-05-05 00:59:09,223 INFO org.apache.hadoop.mapred.TaskRunner:
>>>> Runnning cleanup for the task
>>>>> 2010-05-05 00:59:09,442 INFO org.apache.hadoop.mapred.TaskRunner:
>>>> Task:attempt_201004301437_0050_m_000001_0 is done. And is in the process of
>>>> commiting
>>>>> 2010-05-05 00:59:09,562 INFO org.apache.hadoop.mapred.TaskRunner: Task
>>>> 'attempt_201004301437_0050_m_000001_0' done.
>>>>> ----------------------------------------------------------------
>>>>> 
>>>>> Any ideas ?
>>>>> 
>>>>> best regards
>>>>> Johannes
>>>> 
>>>> 
>>>> 
>>>> --
>>>> /*
>>>> Joe Stein
>>>> http://www.linkedin.com/in/charmalloc
>>>> */
>>>> 
>> 


Re: hanging task

Posted by Raghava Mutharaju <m....@gmail.com>.
Hi Mike,

       I think GC stands for Garbage Collection, right? How can one look at
GC?

Regards,
Raghava.
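
One way to look at it (just a sketch, assuming Hadoop 0.20.x on a Sun/HotSpot JVM; the flags are standard JVM options, the -Xmx value is only an example, and the class name is made up): pass verbose-GC options to the child task JVMs via mapred.child.java.opts, then read the GC lines in each attempt's stdout log (typically under logs/userlogs/<attempt-id>/stdout on the tasktracker node).

import org.apache.hadoop.mapred.JobConf;

public class GcLoggingJob {
  public static void main(String[] args) {
    JobConf conf = new JobConf(GcLoggingJob.class);
    // Same effect as setting mapred.child.java.opts in mapred-site.xml;
    // every map/reduce child JVM then prints a line per collection to its stdout.
    conf.set("mapred.child.java.opts",
        "-Xmx512m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
    // ... set mapper, input/output paths, etc., then submit with JobClient.runJob(conf)
  }
}

Long or increasingly frequent pauses in that output would point to the kind of GC trouble Mike describes.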

On Fri, May 7, 2010 at 12:35 PM, Michael Segel <mi...@hotmail.com> wrote:

>
> There's a couple of things...
>
> 1) Inside the map() method, if you're taking longer than the timeout, you
> will fail.
> 2) You could also fail if you run into GC problems that cause timeout
> failures.
>
> What sometimes happens in long-running jobs is that you tend to see your GC
> take longer as the job runs.
>
> Do you log a start and end time in each iteration of map()? This will tell
> you if the individual time spent in a map() method call is too long.
>
> Also look at GC. Trying to go through all the logs can really drive one to
> drink. :-)
>
> HTH
>
> -Mike
>
>
> > Subject: Re: hanging task
> > From: jzillmann@googlemail.com
> > Date: Fri, 7 May 2010 17:59:01 +0200
> > To: common-user@hadoop.apache.org
> >
> > Sorry, forgot to say that one map task runs for more than one hour.
> > So I do not think it's the progress thing. There are successful map
> > tasks, and those which are failing fail long after the 10 min timeout.
> > And the question is why there is not any stack of the map task visible in
> > the thread dump.
> > Is it swallowed, or is Hadoop stuck right after the mapper is done?
> >
> > Johannes
> >
> >
> > On May 7, 2010, at 5:16 PM, Raghava Mutharaju wrote:
> >
> > > Hello Johannes,
> > >
> > > I had a similar problem and I used the first approach suggested by
> Joseph
> > > i.e. to report the status back. I used the progress() method as well as the
> > > setStatus() method. The progress() method Javadoc link is given below.
> > >
> > >
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Progressable.html#progress%28%29
> > >
> > > This should solve your problem. If you come to know of the reason why
> it is
> > > taking more time than you expected, please mention it here.
> > >
> > > Regards,
> > > Raghava.
> > >
> > > On Fri, May 7, 2010 at 10:31 AM, Joseph Stein <cr...@gmail.com>
> wrote:
> > >
> > >> You need to either report status or increment a counter from within
> > >> your task. In your Java code there is a little trick to help the job
> > >> be “aware” within the cluster of tasks that are not dead but just
> > >> working hard.  During execution of a task there is no built in
> > >> reporting that the job is running as expected if it is not writing
> > >> out.  So this means that if your tasks are taking up a lot of time
> > >> doing work it is possible the cluster will see that task as failed
> > >> (based on the mapred.task.tracker.expiry.interval setting).
> > >>
> > >> Have no fear there is a way to tell cluster that your task is doing
> > >> just fine.  You have 2 ways todo this you can either report the status
> > >> or increment a counter.  Both of these will cause the task tracker to
> > >> properly know the task is ok and this will get seen by the jobtracker
> > >> in turn.  Both of these options are explained in the JavaDoc
> > >>
> > >>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html
> > >>
> > >> Here are some pointers in general you may find useful
> > >>
> > >>
> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
> > >>
> > >> On Fri, May 7, 2010 at 10:26 AM, Johannes Zillmann
> > >> <jz...@googlemail.com> wrote:
> > >>> Hi hadoop folks,
> > >>>
> > >>> i'm encountering following problem on a 4 node cluster running
> > >> hadoop-0.20.2.
> > >>>
> > >>> Its a map only job reading about 9 GB data from outside of hadoop. 31
> map
> > >> tasks at all while 12 map tasks running at a time.
> > >>> The first wave of mappers finishes successfully.
> > >>> Later on the first tasks are failing and they do that shortly before
> > >> finishing:
> > >>> Task attempt_201005061210_0002_m_000001_0 failed to report status for
> 602
> > >> seconds. Killing!
> > >>> Task attempt_201005061210_0002_m_000001_1 failed to report status for
> 600
> > >> seconds. Killing!
> > >>> Task attempt_201005061210_0002_m_000001_2 failed to report status for
> 602
> > >> seconds. Killing!
> > >>>
> > >>> The unusual is that i do not find any signs of the job code in the
> thread
> > >> dump the tasktracker takes automatically:
> > >>> ----------------------------------------------------------------
> > >>> 2010-05-05 00:59:03,515 INFO org.apache.hadoop.mapred.TaskTracker:
> > >> attempt_201004301437_0050_m_000001_0: Task
> > >> attempt_201004301437_0050_m_000001_0 failed to report status for 601
> > >> seconds. Killing!
> > >>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker:
> > >> Process Thread Dump: lost task
> > >>> 34 active threads
> > >>> Thread 29555 (process reaper):
> > >>> State: RUNNABLE
> > >>> Blocked count: 0
> > >>> Waited count: 0
> > >>> Stack:
> > >>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
> > >>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > >>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > >>> Thread 29554 (JVM Runner jvm_201004301437_0050_m_1465855495
> spawned.):
> > >>> State: WAITING
> > >>> Blocked count: 1
> > >>> Waited count: 2
> > >>> Waiting on java.lang.UNIXProcess@34b56f69
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
> > >>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
> > >>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
> > >>>
> > >>
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
> > >>>
> > >>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
> > >>>
> > >>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> > >>> Thread 29550 (Thread-15193):
> > >>> State: WAITING
> > >>> Blocked count: 13
> > >>> Waited count: 14
> > >>> Waiting on java.lang.Object@72c09161
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> > >>> Thread 29501 (process reaper):
> > >>> State: RUNNABLE
> > >>> Blocked count: 0
> > >>> Waited count: 0
> > >>> Stack:
> > >>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
> > >>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > >>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > >>> Thread 29500 (JVM Runner jvm_201004301437_0050_m_861348535 spawned.):
> > >>> State: WAITING
> > >>> Blocked count: 1
> > >>> Waited count: 2
> > >>> Waiting on java.lang.UNIXProcess@56175cd4
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
> > >>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
> > >>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
> > >>>
> > >>
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
> > >>>
> > >>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
> > >>>
> > >>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> > >>> Thread 29496 (Thread-15162):
> > >>> State: WAITING
> > >>> Blocked count: 13
> > >>> Waited count: 14
> > >>> Waiting on java.lang.Object@3b916de2
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> > >>> Thread 29225 (process reaper):
> > >>> State: RUNNABLE
> > >>> Blocked count: 0
> > >>> Waited count: 0
> > >>> Stack:
> > >>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
> > >>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > >>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > >>> Thread 29224 (JVM Runner jvm_201004301437_0050_m_-205889314
> spawned.):
> > >>> State: WAITING
> > >>> Blocked count: 1
> > >>> Waited count: 2
> > >>> Waiting on java.lang.UNIXProcess@39874d3b
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
> > >>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
> > >>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
> > >>>
> > >>
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
> > >>>
> > >>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
> > >>>
> > >>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> > >>> Thread 29217 (process reaper):
> > >>> State: RUNNABLE
> > >>> Blocked count: 0
> > >>> Waited count: 0
> > >>> Stack:
> > >>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
> > >>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > >>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > >>> Thread 29215 (JVM Runner jvm_201004301437_0050_m_-1797469636 spawned.):
> > >>> State: WAITING
> > >>> Blocked count: 1
> > >>> Waited count: 2
> > >>> Waiting on java.lang.UNIXProcess@2c39220f
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
> > >>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
> > >>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
> > >>>   org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
> > >>>   org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
> > >>>   org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> > >>> Thread 29208 (Thread-15014):
> > >>> State: WAITING
> > >>> Blocked count: 18
> > >>> Waited count: 14
> > >>> Waiting on java.lang.Object@7ddbbae2
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> > >>> Thread 29203 (Thread-15011):
> > >>> State: WAITING
> > >>> Blocked count: 13
> > >>> Waited count: 14
> > >>> Waiting on java.lang.Object@2dac3f6f
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> > >>> Thread 26714 (IPC Client (47) connection to /10.0.11.64:9001 from hadoop):
> > >>> State: TIMED_WAITING
> > >>> Blocked count: 10075
> > >>> Waited count: 10075
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
> > >>>   org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)
> > >>> Thread 33 (Directory/File cleanup thread):
> > >>> State: WAITING
> > >>> Blocked count: 2
> > >>> Waited count: 615
> > >>> Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4d6c68b3
> > >>> Stack:
> > >>>   sun.misc.Unsafe.park(Native Method)
> > >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >>>   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> > >>>   java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> > >>>   org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:89)
> > >>> Thread 9 (taskCleanup):
> > >>> State: WAITING
> > >>> Blocked count: 30
> > >>> Waited count: 92
> > >>> Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3298407f
> > >>> Stack:
> > >>>   sun.misc.Unsafe.park(Native Method)
> > >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >>>   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> > >>>   java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> > >>>   org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:317)
> > >>>   java.lang.Thread.run(Thread.java:619)
> > >>> Thread 32 (TaskLauncher for task):
> > >>> State: WAITING
> > >>> Blocked count: 127
> > >>> Waited count: 123
> > >>> Waiting on java.util.LinkedList@c33377
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
> > >>> Thread 31 (TaskLauncher for task):
> > >>> State: WAITING
> > >>> Blocked count: 3237
> > >>> Waited count: 3147
> > >>> Waiting on java.util.LinkedList@67001629
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
> > >>> Thread 30 (Map-events fetcher for all reduce tasks on tracker_node1t:localhost/127.0.0.1:48722):
> > >>> State: WAITING
> > >>> Blocked count: 712
> > >>> Waited count: 804
> > >>> Waiting on java.util.TreeMap@1fec8cf1
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:606)
> > >>> Thread 28 (IPC Server handler 7 on 48722):
> > >>> State: WAITING
> > >>> Blocked count: 10
> > >>> Waited count: 3199
> > >>> Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> > >>> Stack:
> > >>>   sun.misc.Unsafe.park(Native Method)
> > >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >>>   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> > >>>   java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> > >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> > >>> Thread 27 (IPC Server handler 6 on 48722):
> > >>> State: WAITING
> > >>> Blocked count: 265
> > >>> Waited count: 3207
> > >>> Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> > >>> Stack:
> > >>>   sun.misc.Unsafe.park(Native Method)
> > >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >>>   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> > >>>   java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> > >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> > >>> Thread 26 (IPC Server handler 5 on 48722):
> > >>> State: WAITING
> > >>> Blocked count: 296
> > >>> Waited count: 3206
> > >>> Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> > >>> Stack:
> > >>>   sun.misc.Unsafe.park(Native Method)
> > >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >>>   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> > >>>   java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> > >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> > >>> Thread 25 (IPC Server handler 4 on 48722):
> > >>> State: WAITING
> > >>> Blocked count: 199
> > >>> Waited count: 3202
> > >>> Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> > >>> Stack:
> > >>>   sun.misc.Unsafe.park(Native Method)
> > >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >>>   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> > >>>   java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> > >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> > >>> Thread 24 (IPC Server handler 3 on 48722):
> > >>> State: WAITING
> > >>> Blocked count: 216
> > >>> Waited count: 3201
> > >>> Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> > >>> Stack:
> > >>>   sun.misc.Unsafe.park(Native Method)
> > >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >>>   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> > >>>   java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> > >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> > >>> Thread 23 (IPC Server handler 2 on 48722):
> > >>> State: WAITING
> > >>> Blocked count: 119
> > >>> Waited count: 3199
> > >>> Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> > >>> Stack:
> > >>>   sun.misc.Unsafe.park(Native Method)
> > >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >>>   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> > >>>   java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> > >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> > >>> Thread 22 (IPC Server handler 1 on 48722):
> > >>> State: WAITING
> > >>> Blocked count: 193
> > >>> Waited count: 3207
> > >>> Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> > >>> Stack:
> > >>>   sun.misc.Unsafe.park(Native Method)
> > >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >>>   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> > >>>   java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> > >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> > >>> Thread 21 (IPC Server handler 0 on 48722):
> > >>> State: WAITING
> > >>> Blocked count: 285
> > >>> Waited count: 3203
> > >>> Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> > >>> Stack:
> > >>>   sun.misc.Unsafe.park(Native Method)
> > >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >>>   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> > >>>   java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> > >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> > >>> Thread 18 (IPC Server listener on 48722):
> > >>> State: RUNNABLE
> > >>> Blocked count: 0
> > >>> Waited count: 0
> > >>> Stack:
> > >>>   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> > >>>   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> > >>>   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> > >>>   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> > >>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> > >>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
> > >>>   org.apache.hadoop.ipc.Server$Listener.run(Server.java:318)
> > >>> Thread 20 (IPC Server Responder):
> > >>> State: RUNNABLE
> > >>> Blocked count: 0
> > >>> Waited count: 0
> > >>> Stack:
> > >>>   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> > >>>   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> > >>>   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> > >>>   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> > >>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> > >>>   org.apache.hadoop.ipc.Server$Responder.run(Server.java:478)
> > >>> Thread 17 (Timer-0):
> > >>> State: TIMED_WAITING
> > >>> Blocked count: 1
> > >>> Waited count: 12764
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.util.TimerThread.mainLoop(Timer.java:509)
> > >>>   java.util.TimerThread.run(Timer.java:462)
> > >>> Thread 16 (295726648@qtp0-0 - Acceptor0 SelectChannelConnector@0.0.0.0:50060):
> > >>> State: RUNNABLE
> > >>> Blocked count: 1310
> > >>> Waited count: 1
> > >>> Stack:
> > >>>   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> > >>>   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> > >>>   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> > >>>   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> > >>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> > >>>   org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
> > >>>   org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
> > >>>   org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
> > >>>   org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
> > >>>   org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> > >>> Thread 4 (Signal Dispatcher):
> > >>> State: RUNNABLE
> > >>> Blocked count: 0
> > >>> Waited count: 0
> > >>> Stack:
> > >>> Thread 3 (Finalizer):
> > >>> State: WAITING
> > >>> Blocked count: 1337
> > >>> Waited count: 1336
> > >>> Waiting on java.lang.ref.ReferenceQueue$Lock@5e77533b
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
> > >>>   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
> > >>>   java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> > >>> Thread 2 (Reference Handler):
> > >>> State: WAITING
> > >>> Blocked count: 1544
> > >>> Waited count: 1336
> > >>> Waiting on java.lang.ref.Reference$Lock@46efbdf1
> > >>> Stack:
> > >>>   java.lang.Object.wait(Native Method)
> > >>>   java.lang.Object.wait(Object.java:485)
> > >>>   java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> > >>> Thread 1 (main):
> > >>> State: RUNNABLE
> > >>> Blocked count: 152594
> > >>> Waited count: 279661
> > >>> Stack:
> > >>>   sun.management.ThreadImpl.getThreadInfo0(Native Method)
> > >>>   sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
> > >>>   sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
> > >>>   org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
> > >>>   org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
> > >>>   org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1323)
> > >>>   org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1106)
> > >>>   org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
> > >>>   org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
> > >>>
> > >>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201004301437_0050_m_000001_0
> > >>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 1
> > >>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201004301437_0050_m_000001_0 done; removing files.
> > >>> 2010-05-05 00:59:03,521 INFO org.apache.hadoop.mapred.IndexCache: Map ID attempt_201004301437_0050_m_000001_0 not found in cache
> > >>> 2010-05-05 00:59:05,012 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201004301437_0050_m_-205889314 exited. Number of tasks it ran: 0
> > >>> 2010-05-05 00:59:06,521 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201004301437_0050_m_000001_0 done; removing files.
> > >>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201004301437_0050_m_000001_0 task's state:FAILED_UNCLEAN
> > >>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201004301437_0050_m_000001_0
> > >>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_201004301437_0050_m_000001_0
> > >>> 2010-05-05 00:59:08,606 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201004301437_0050_m_-1095145752 given task: attempt_201004301437_0050_m_000001_0
> > >>> 2010-05-05 00:59:09,222 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201004301437_0050_m_000001_0 0.0%
> > >>> 2010-05-05 00:59:09,560 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201004301437_0050_m_000001_0 0.0% cleanup
> > >>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201004301437_0050_m_000001_0 is done.
> > >>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201004301437_0050_m_000001_0  was 0
> > >>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201004301437_0050_m_000001_0 done; removing files.
> > >>> ----------------------------------------------------------------
> > >>>
> > >>> The syslog of the attempt just looks like:
> > >>> ----------------------------------------------------------------
> > >>> 2010-05-04 23:48:03,365 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> > >>> 2010-05-04 23:48:03,674 WARN datameer.dap.sdk.util.ManifestMetaData: Failed to get version from the manifest file of jar 'file:/data/drive0/mapred/tmp/taskTracker/archive/nurago64.local/das/jobjars/61a2133b7887b9be48d42e49002c85e0/stripped-dap-0.24.dev-job.jar/stripped-dap-0.24.dev-job.jar' : null
> > >>> 2010-05-04 23:48:04,691 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
> > >>> 2010-05-04 23:48:04,872 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> > >>> 2010-05-04 23:48:04,873 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> > >>> 2010-05-04 23:48:04,874 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
> > >>> 2010-05-05 00:59:08,753 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=CLEANUP, sessionId=
> > >>> 2010-05-05 00:59:09,223 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
> > >>> 2010-05-05 00:59:09,442 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201004301437_0050_m_000001_0 is done. And is in the process of commiting
> > >>> 2010-05-05 00:59:09,562 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201004301437_0050_m_000001_0' done.
> > >>> ----------------------------------------------------------------
> > >>>
> > >>> Any ideas ?
> > >>>
> > >>> best regards
> > >>> Johannes
> > >>
> > >>
> > >>
> > >> --
> > >> /*
> > >> Joe Stein
> > >> http://www.linkedin.com/in/charmalloc
> > >> */
> > >>
> >
>

RE: hanging task

Posted by Michael Segel <mi...@hotmail.com>.
There are a couple of things...

1) Inside the map() method, if a single call takes longer than the task timeout, the attempt will fail.
2) You could also fail if you run into GC problems that eat into that timeout.

What sometimes happens in long-running jobs is that GC pauses get longer as the job runs.

Do you log a start and end time for each invocation of map()? That will tell you whether the time spent in an individual map() call is too long.
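Something along these lines is what I mean (just a rough sketch against the old 0.20 mapred API -- the class name, counter group and one-minute threshold are made up, not the actual job code). It times every map() call and pings the TaskTracker, so a call that legitimately takes a long time is not treated as lost:
----------------------------------------------------------------
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: times each map() call and reports liveness to the TaskTracker.
public class TimedMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Log LOG = LogFactory.getLog(TimedMapper.class);

  public void map(LongWritable key, Text value,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long start = System.currentTimeMillis();

    // ... the real per-record work goes here ...

    // Either of these resets the task's expiry clock, even if nothing is written out.
    reporter.progress();
    reporter.incrCounter("timing", "mapCalls", 1);

    long elapsed = System.currentTimeMillis() - start;
    if (elapsed > 60000L) { // hypothetical threshold: one minute per record
      LOG.warn("map() call for key " + key + " took " + elapsed + " ms");
      reporter.setStatus("slow record at key " + key);
    }
  }
}
----------------------------------------------------------------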

Also look at GC. Trying to go through all the logs can really drive one to drink. :-)
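If GC is the suspect, one way to get data (a sketch, not something from this job -- the heap size and gc-log path are placeholders) is to turn on GC logging for the child task JVMs through mapred.child.java.opts; the @taskid@ token is expanded per attempt:
----------------------------------------------------------------
import org.apache.hadoop.mapred.JobConf;

public class ChildGcLogging {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Placeholders: size the heap for your nodes; each attempt gets its own gc log.
    conf.set("mapred.child.java.opts",
        "-Xmx1024m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
            + " -Xloggc:/tmp/@taskid@.gc");
    // ... configure input/output and submit the job with this conf as usual ...
  }
}
----------------------------------------------------------------
Long pauses lining up with the 600 second mark would point at GC rather than the mapper itself.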

HTH

-Mike


> Subject: Re: hanging task
> From: jzillmann@googlemail.com
> Date: Fri, 7 May 2010 17:59:01 +0200
> To: common-user@hadoop.apache.org
> 
> Sorry, forgot to say that on map task runs about more then one hour.
> So i do not think its the progress thing. So there are successful map tasks and those which are failing failing long after the 10 min timeout.
> And the question is why there is not any stack of the map task visibly in the thread dump.
> Is it swallowed or is hadoop stuck right after the mapper is done ?
> 
> Johannes
> 
> 
> On May 7, 2010, at 5:16 PM, Raghava Mutharaju wrote:
> 
> > Hello Johannes,
> > 
> > I had a similar problem and I used the first approach suggested by Joseph
> > i.e. to report the status back. I used progress() method as well as the
> > setStatus() method. progress() method javadoc link is given below.
> > 
> > http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Progressable.html#progress%28%29
> > 
> > This should solve your problem. If you come to know of the reason why it is
> > taking more time than you expected, please mention it here.
> > 
> > Regards,
> > Raghava.
> > 
> > On Fri, May 7, 2010 at 10:31 AM, Joseph Stein <cr...@gmail.com> wrote:
> > 
> >> You need to either report status or increment a counter from within
> >> your task. In your Java code there is a little trick to help the job
> >> be “aware” within the cluster of tasks that are not dead but just
> >> working hard.  During execution of a task there is no built in
> >> reporting that the job is running as expected if it is not writing
> >> out.  So this means that if your tasks are taking up a lot of time
> >> doing work it is possible the cluster will see that task as failed
> >> (based on the mapred.task.tracker.expiry.interval setting).
> >> 
> >> Have no fear there is a way to tell cluster that your task is doing
> >> just fine.  You have 2 ways todo this you can either report the status
> >> or increment a counter.  Both of these will cause the task tracker to
> >> properly know the task is ok and this will get seen by the jobtracker
> >> in turn.  Both of these options are explained in the JavaDoc
> >> 
> >> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html
> >> 
> >> Here are some pointers in general you may find useful
> >> 
> >> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
> >> 
> >> On Fri, May 7, 2010 at 10:26 AM, Johannes Zillmann
> >> <jz...@googlemail.com> wrote:
> >>> Hi hadoop folks,
> >>> 
> >>> i'm encountering following problem on a 4 node cluster running
> >> hadoop-0.20.2.
> >>> 
> >>> Its a map only job reading about 9 GB data from outside of hadoop. 31 map
> >> tasks at all while 12 map tasks running at a time.
> >>> The first wave of mappers finishes successfully.
> >>> Later on the first tasks are failing and they do that shortly before
> >> finishing:
> >>> Task attempt_201005061210_0002_m_000001_0 failed to report status for 602
> >> seconds. Killing!
> >>> Task attempt_201005061210_0002_m_000001_1 failed to report status for 600
> >> seconds. Killing!
> >>> Task attempt_201005061210_0002_m_000001_2 failed to report status for 602
> >> seconds. Killing!
> >>> 
> >>> The unusual is that i do not find any signs of the job code in the thread
> >> dump the tasktracker takes automatically:
> >>> ----------------------------------------------------------------
> >>> 2010-05-05 00:59:03,515 INFO org.apache.hadoop.mapred.TaskTracker:
> >> attempt_201004301437_0050_m_000001_0: Task
> >> attempt_201004301437_0050_m_000001_0 failed to report status for 601
> >> seconds. Killing!
> >>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker:
> >> Process Thread Dump: lost task
> >>> 34 active threads
> >>> Thread 29555 (process reaper):
> >>> State: RUNNABLE
> >>> Blocked count: 0
> >>> Waited count: 0
> >>> Stack:
> >>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
> >>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> >>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> >>> Thread 29554 (JVM Runner jvm_201004301437_0050_m_1465855495 spawned.):
> >>> State: WAITING
> >>> Blocked count: 1
> >>> Waited count: 2
> >>> Waiting on java.lang.UNIXProcess@34b56f69
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
> >>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
> >>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
> >>> 
> >> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
> >>> 
> >> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
> >>> 
> >> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> >>> Thread 29550 (Thread-15193):
> >>> State: WAITING
> >>> Blocked count: 13
> >>> Waited count: 14
> >>> Waiting on java.lang.Object@72c09161
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> >>> Thread 29501 (process reaper):
> >>> State: RUNNABLE
> >>> Blocked count: 0
> >>> Waited count: 0
> >>> Stack:
> >>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
> >>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> >>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> >>> Thread 29500 (JVM Runner jvm_201004301437_0050_m_861348535 spawned.):
> >>> State: WAITING
> >>> Blocked count: 1
> >>> Waited count: 2
> >>> Waiting on java.lang.UNIXProcess@56175cd4
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
> >>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
> >>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
> >>> 
> >> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
> >>> 
> >> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
> >>> 
> >> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> >>> Thread 29496 (Thread-15162):
> >>> State: WAITING
> >>> Blocked count: 13
> >>> Waited count: 14
> >>> Waiting on java.lang.Object@3b916de2
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> >>> Thread 29225 (process reaper):
> >>> State: RUNNABLE
> >>> Blocked count: 0
> >>> Waited count: 0
> >>> Stack:
> >>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
> >>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> >>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> >>> Thread 29224 (JVM Runner jvm_201004301437_0050_m_-205889314 spawned.):
> >>> State: WAITING
> >>> Blocked count: 1
> >>> Waited count: 2
> >>> Waiting on java.lang.UNIXProcess@39874d3b
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
> >>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
> >>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
> >>> 
> >> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
> >>> 
> >> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
> >>> 
> >> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> >>> Thread 29217 (process reaper):
> >>> State: RUNNABLE
> >>> Blocked count: 0
> >>> Waited count: 0
> >>> Stack:
> >>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
> >>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> >>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> >>> Thread 29215 (JVM Runner jvm_201004301437_0050_m_-1797469636 spawned.):
> >>> State: WAITING
> >>> Blocked count: 1
> >>> Waited count: 2
> >>> Waiting on java.lang.UNIXProcess@2c39220f
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
> >>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
> >>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
> >>> 
> >> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
> >>> 
> >> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
> >>> 
> >> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
> >>> Thread 29208 (Thread-15014):
> >>> State: WAITING
> >>> Blocked count: 18
> >>> Waited count: 14
> >>> Waiting on java.lang.Object@7ddbbae2
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> >>> Thread 29203 (Thread-15011):
> >>> State: WAITING
> >>> Blocked count: 13
> >>> Waited count: 14
> >>> Waiting on java.lang.Object@2dac3f6f
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
> >>> Thread 26714 (IPC Client (47) connection to /10.0.11.64:9001 from
> >> hadoop):
> >>> State: TIMED_WAITING
> >>> Blocked count: 10075
> >>> Waited count: 10075
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
> >>>   org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)
> >>> Thread 33 (Directory/File cleanup thread):
> >>> State: WAITING
> >>> Blocked count: 2
> >>> Waited count: 615
> >>> Waiting on
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4d6c68b3
> >>> Stack:
> >>>   sun.misc.Unsafe.park(Native Method)
> >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>> 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>> 
> >> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>> 
> >> org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:89)
> >>> Thread 9 (taskCleanup):
> >>> State: WAITING
> >>> Blocked count: 30
> >>> Waited count: 92
> >>> Waiting on
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3298407f
> >>> Stack:
> >>>   sun.misc.Unsafe.park(Native Method)
> >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>> 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>> 
> >> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>   org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:317)
> >>>   java.lang.Thread.run(Thread.java:619)
> >>> Thread 32 (TaskLauncher for task):
> >>> State: WAITING
> >>> Blocked count: 127
> >>> Waited count: 123
> >>> Waiting on java.util.LinkedList@c33377
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>> 
> >> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
> >>> Thread 31 (TaskLauncher for task):
> >>> State: WAITING
> >>> Blocked count: 3237
> >>> Waited count: 3147
> >>> Waiting on java.util.LinkedList@67001629
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>> 
> >> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
> >>> Thread 30 (Map-events fetcher for all reduce tasks on
> >> tracker_node1t:localhost/127.0.0.1:48722):
> >>> State: WAITING
> >>> Blocked count: 712
> >>> Waited count: 804
> >>> Waiting on java.util.TreeMap@1fec8cf1
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>> 
> >> org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:606)
> >>> Thread 28 (IPC Server handler 7 on 48722):
> >>> State: WAITING
> >>> Blocked count: 10
> >>> Waited count: 3199
> >>> Waiting on
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>> Stack:
> >>>   sun.misc.Unsafe.park(Native Method)
> >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>> 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>> 
> >> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>> Thread 27 (IPC Server handler 6 on 48722):
> >>> State: WAITING
> >>> Blocked count: 265
> >>> Waited count: 3207
> >>> Waiting on
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>> Stack:
> >>>   sun.misc.Unsafe.park(Native Method)
> >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>> 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>> 
> >> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>> Thread 26 (IPC Server handler 5 on 48722):
> >>> State: WAITING
> >>> Blocked count: 296
> >>> Waited count: 3206
> >>> Waiting on
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>> Stack:
> >>>   sun.misc.Unsafe.park(Native Method)
> >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>> 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>> 
> >> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>> Thread 25 (IPC Server handler 4 on 48722):
> >>> State: WAITING
> >>> Blocked count: 199
> >>> Waited count: 3202
> >>> Waiting on
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>> Stack:
> >>>   sun.misc.Unsafe.park(Native Method)
> >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>> 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>> 
> >> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>> Thread 24 (IPC Server handler 3 on 48722):
> >>> State: WAITING
> >>> Blocked count: 216
> >>> Waited count: 3201
> >>> Waiting on
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>> Stack:
> >>>   sun.misc.Unsafe.park(Native Method)
> >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>> 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>> 
> >> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>> Thread 23 (IPC Server handler 2 on 48722):
> >>> State: WAITING
> >>> Blocked count: 119
> >>> Waited count: 3199
> >>> Waiting on
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>> Stack:
> >>>   sun.misc.Unsafe.park(Native Method)
> >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>> 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>> 
> >> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>> Thread 22 (IPC Server handler 1 on 48722):
> >>> State: WAITING
> >>> Blocked count: 193
> >>> Waited count: 3207
> >>> Waiting on
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>> Stack:
> >>>   sun.misc.Unsafe.park(Native Method)
> >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>> 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>> 
> >> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>> Thread 21 (IPC Server handler 0 on 48722):
> >>> State: WAITING
> >>> Blocked count: 285
> >>> Waited count: 3203
> >>> Waiting on
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
> >>> Stack:
> >>>   sun.misc.Unsafe.park(Native Method)
> >>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> >>> 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> >>> 
> >> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> >>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
> >>> Thread 18 (IPC Server listener on 48722):
> >>> State: RUNNABLE
> >>> Blocked count: 0
> >>> Waited count: 0
> >>> Stack:
> >>>   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> >>>   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> >>>   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> >>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> >>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
> >>>   org.apache.hadoop.ipc.Server$Listener.run(Server.java:318)
> >>> Thread 20 (IPC Server Responder):
> >>> State: RUNNABLE
> >>> Blocked count: 0
> >>> Waited count: 0
> >>> Stack:
> >>>   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> >>>   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> >>>   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> >>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> >>>   org.apache.hadoop.ipc.Server$Responder.run(Server.java:478)
> >>> Thread 17 (Timer-0):
> >>> State: TIMED_WAITING
> >>> Blocked count: 1
> >>> Waited count: 12764
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.util.TimerThread.mainLoop(Timer.java:509)
> >>>   java.util.TimerThread.run(Timer.java:462)
> >>> Thread 16 (295726648@qtp0-0 - Acceptor0
> >> SelectChannelConnector@0.0.0.0:50060):
> >>> State: RUNNABLE
> >>> Blocked count: 1310
> >>> Waited count: 1
> >>> Stack:
> >>>   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> >>>   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> >>>   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> >>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
> >>> 
> >> org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
> >>>   org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
> >>> 
> >> org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
> >>> 
> >> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
> >>> 
> >> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> >>> Thread 4 (Signal Dispatcher):
> >>> State: RUNNABLE
> >>> Blocked count: 0
> >>> Waited count: 0
> >>> Stack:
> >>> Thread 3 (Finalizer):
> >>> State: WAITING
> >>> Blocked count: 1337
> >>> Waited count: 1336
> >>> Waiting on java.lang.ref.ReferenceQueue$Lock@5e77533b
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
> >>>   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
> >>>   java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> >>> Thread 2 (Reference Handler):
> >>> State: WAITING
> >>> Blocked count: 1544
> >>> Waited count: 1336
> >>> Waiting on java.lang.ref.Reference$Lock@46efbdf1
> >>> Stack:
> >>>   java.lang.Object.wait(Native Method)
> >>>   java.lang.Object.wait(Object.java:485)
> >>>   java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> >>> Thread 1 (main):
> >>> State: RUNNABLE
> >>> Blocked count: 152594
> >>> Waited count: 279661
> >>> Stack:
> >>>   sun.management.ThreadImpl.getThreadInfo0(Native Method)
> >>>   sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
> >>>   sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
> >>> 
> >> org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
> >>> 
> >> org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
> >>> 
> >> org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1323)
> >>> 
> >> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1106)
> >>>   org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
> >>>   org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
> >>> 
> >>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker: About
> >> to purge task: attempt_201004301437_0050_m_000001_0
> >>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker:
> >> addFreeSlot : current free slots : 1
> >>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskRunner:
> >> attempt_201004301437_0050_m_000001_0 done; removing files.
> >>> 2010-05-05 00:59:03,521 INFO org.apache.hadoop.mapred.IndexCache: Map ID
> >> attempt_201004301437_0050_m_000001_0 not found in cache
> >>> 2010-05-05 00:59:05,012 INFO org.apache.hadoop.mapred.JvmManager: JVM :
> >> jvm_201004301437_0050_m_-205889314 exited. Number of tasks it ran: 0
> >>> 2010-05-05 00:59:06,521 INFO org.apache.hadoop.mapred.TaskRunner:
> >> attempt_201004301437_0050_m_000001_0 done; removing files.
> >>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker:
> >> LaunchTaskAction (registerTask): attempt_201004301437_0050_m_000001_0 task's
> >> state:FAILED_UNCLEAN
> >>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: Trying
> >> to launch : attempt_201004301437_0050_m_000001_0
> >>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: In
> >> TaskLauncher, current free slots : 1 and trying to launch
> >> attempt_201004301437_0050_m_000001_0
> >>> 2010-05-05 00:59:08,606 INFO org.apache.hadoop.mapred.TaskTracker: JVM
> >> with ID: jvm_201004301437_0050_m_-1095145752 given task:
> >> attempt_201004301437_0050_m_000001_0
> >>> 2010-05-05 00:59:09,222 INFO org.apache.hadoop.mapred.TaskTracker:
> >> attempt_201004301437_0050_m_000001_0 0.0%
> >>> 2010-05-05 00:59:09,560 INFO org.apache.hadoop.mapred.TaskTracker:
> >> attempt_201004301437_0050_m_000001_0 0.0% cleanup
> >>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker: Task
> >> attempt_201004301437_0050_m_000001_0 is done.
> >>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker:
> >> reported output size for attempt_201004301437_0050_m_000001_0  was 0
> >>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskRunner:
> >> attempt_201004301437_0050_m_000001_0 done; removing files.
> >>> ----------------------------------------------------------------
> >>> 
> >>> The syslog of the attempt just looks like:
> >>> ----------------------------------------------------------------
> >>> 2010-05-04 23:48:03,365 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> >> Initializing JVM Metrics with processName=MAP, sessionId=
> >>> 2010-05-04 23:48:03,674 WARN datameer.dap.sdk.util.ManifestMetaData:
> >> Failed to get version from the manifest file of jar
> >> 'file:/data/drive0/mapred/tmp/taskTracker/archive/nurago64.local/das/jobjars/61a2133b7887b9be48d42e49002c85e0/stripped-dap-0.24.dev-job.jar/stripped-dap-0.24.dev-job.jar'
> >> : null
> >>> 2010-05-04 23:48:04,691 INFO org.apache.hadoop.mapred.MapTask:
> >> numReduceTasks: 0
> >>> 2010-05-04 23:48:04,872 INFO org.apache.hadoop.util.NativeCodeLoader:
> >> Loaded the native-hadoop library
> >>> 2010-05-04 23:48:04,873 INFO
> >> org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded &
> >> initialized native-zlib library
> >>> 2010-05-04 23:48:04,874 INFO org.apache.hadoop.io.compress.CodecPool: Got
> >> brand-new compressor
> >>> 2010-05-05 00:59:08,753 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> >> Initializing JVM Metrics with processName=CLEANUP, sessionId=
> >>> 2010-05-05 00:59:09,223 INFO org.apache.hadoop.mapred.TaskRunner:
> >> Runnning cleanup for the task
> >>> 2010-05-05 00:59:09,442 INFO org.apache.hadoop.mapred.TaskRunner:
> >> Task:attempt_201004301437_0050_m_000001_0 is done. And is in the process of
> >> commiting
> >>> 2010-05-05 00:59:09,562 INFO org.apache.hadoop.mapred.TaskRunner: Task
> >> 'attempt_201004301437_0050_m_000001_0' done.
> >>> ----------------------------------------------------------------
> >>> 
> >>> Any ideas ?
> >>> 
> >>> best regards
> >>> Johannes
> >> 
> >> 
> >> 
> >> --
> >> /*
> >> Joe Stein
> >> http://www.linkedin.com/in/charmalloc
> >> */
> >> 
> 
 		 	   		  

Re: hanging task

Posted by Johannes Zillmann <jz...@googlemail.com>.
Sorry, forgot to say that a single map task runs for about an hour or more.
So I do not think it's the progress thing. There are successful map tasks, and the ones that fail do so long after the 10 minute timeout.
And the question is why there is no stack of the map task visible in the thread dump.
Is it swallowed, or is Hadoop stuck right after the mapper is done?

Johannes


On May 7, 2010, at 5:16 PM, Raghava Mutharaju wrote:

> Hello Johannes,
> 
> I had a similar problem and I used the first approach suggested by Joseph
> i.e. to report the status back. I used progress() method as well as the
> setStatus() method. progress() method javadoc link is given below.
> 
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Progressable.html#progress%28%29
> 
> This should solve your problem. If you come to know of the reason why it is
> taking more time than you expected, please mention it here.
> 
> Regards,
> Raghava.
> 
> On Fri, May 7, 2010 at 10:31 AM, Joseph Stein <cr...@gmail.com> wrote:
> 
>> You need to either report status or increment a counter from within
>> your task. In your Java code there is a little trick to help the job
>> be “aware” within the cluster of tasks that are not dead but just
>> working hard.  During execution of a task there is no built in
>> reporting that the job is running as expected if it is not writing
>> out.  So this means that if your tasks are taking up a lot of time
>> doing work it is possible the cluster will see that task as failed
>> (based on the mapred.task.tracker.expiry.interval setting).
>> 
>> Have no fear there is a way to tell cluster that your task is doing
>> just fine.  You have 2 ways todo this you can either report the status
>> or increment a counter.  Both of these will cause the task tracker to
>> properly know the task is ok and this will get seen by the jobtracker
>> in turn.  Both of these options are explained in the JavaDoc
>> 
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html
>> 
>> Here are some pointers in general you may find useful
>> 
>> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
>> 
>> On Fri, May 7, 2010 at 10:26 AM, Johannes Zillmann
>> <jz...@googlemail.com> wrote:
>>> Hi hadoop folks,
>>> 
>>> i'm encountering following problem on a 4 node cluster running
>> hadoop-0.20.2.
>>> 
>>> Its a map only job reading about 9 GB data from outside of hadoop. 31 map
>> tasks at all while 12 map tasks running at a time.
>>> The first wave of mappers finishes successfully.
>>> Later on the first tasks are failing and they do that shortly before
>> finishing:
>>> Task attempt_201005061210_0002_m_000001_0 failed to report status for 602
>> seconds. Killing!
>>> Task attempt_201005061210_0002_m_000001_1 failed to report status for 600
>> seconds. Killing!
>>> Task attempt_201005061210_0002_m_000001_2 failed to report status for 602
>> seconds. Killing!
>>> 
>>> The unusual is that i do not find any signs of the job code in the thread
>> dump the tasktracker takes automatically:
>>> ----------------------------------------------------------------
>>> 2010-05-05 00:59:03,515 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201004301437_0050_m_000001_0: Task
>> attempt_201004301437_0050_m_000001_0 failed to report status for 601
>> seconds. Killing!
>>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker:
>> Process Thread Dump: lost task
>>> 34 active threads
>>> Thread 29555 (process reaper):
>>> State: RUNNABLE
>>> Blocked count: 0
>>> Waited count: 0
>>> Stack:
>>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
>>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>> Thread 29554 (JVM Runner jvm_201004301437_0050_m_1465855495 spawned.):
>>> State: WAITING
>>> Blocked count: 1
>>> Waited count: 2
>>> Waiting on java.lang.UNIXProcess@34b56f69
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
>>> 
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>>> 
>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>>> 
>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
>>> Thread 29550 (Thread-15193):
>>> State: WAITING
>>> Blocked count: 13
>>> Waited count: 14
>>> Waiting on java.lang.Object@72c09161
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
>>> Thread 29501 (process reaper):
>>> State: RUNNABLE
>>> Blocked count: 0
>>> Waited count: 0
>>> Stack:
>>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
>>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>> Thread 29500 (JVM Runner jvm_201004301437_0050_m_861348535 spawned.):
>>> State: WAITING
>>> Blocked count: 1
>>> Waited count: 2
>>> Waiting on java.lang.UNIXProcess@56175cd4
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
>>> 
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>>> 
>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>>> 
>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
>>> Thread 29496 (Thread-15162):
>>> State: WAITING
>>> Blocked count: 13
>>> Waited count: 14
>>> Waiting on java.lang.Object@3b916de2
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
>>> Thread 29225 (process reaper):
>>> State: RUNNABLE
>>> Blocked count: 0
>>> Waited count: 0
>>> Stack:
>>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
>>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>> Thread 29224 (JVM Runner jvm_201004301437_0050_m_-205889314 spawned.):
>>> State: WAITING
>>> Blocked count: 1
>>> Waited count: 2
>>> Waiting on java.lang.UNIXProcess@39874d3b
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
>>> 
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>>> 
>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>>> 
>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
>>> Thread 29217 (process reaper):
>>> State: RUNNABLE
>>> Blocked count: 0
>>> Waited count: 0
>>> Stack:
>>>   java.lang.UNIXProcess.waitForProcessExit(Native Method)
>>>   java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>   java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>> Thread 29215 (JVM Runner jvm_201004301437_0050_m_-1797469636 spawned.):
>>> State: WAITING
>>> Blocked count: 1
>>> Waited count: 2
>>> Waiting on java.lang.UNIXProcess@2c39220f
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>>   java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>>>   org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
>>>   org.apache.hadoop.util.Shell.run(Shell.java:134)
>>> 
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>>> 
>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
>>> 
>> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
>>> Thread 29208 (Thread-15014):
>>> State: WAITING
>>> Blocked count: 18
>>> Waited count: 14
>>> Waiting on java.lang.Object@7ddbbae2
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
>>> Thread 29203 (Thread-15011):
>>> State: WAITING
>>> Blocked count: 13
>>> Waited count: 14
>>> Waiting on java.lang.Object@2dac3f6f
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>>   org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
>>> Thread 26714 (IPC Client (47) connection to /10.0.11.64:9001 from
>> hadoop):
>>> State: TIMED_WAITING
>>> Blocked count: 10075
>>> Waited count: 10075
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
>>>   org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)
>>> Thread 33 (Directory/File cleanup thread):
>>> State: WAITING
>>> Blocked count: 2
>>> Waited count: 615
>>> Waiting on
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4d6c68b3
>>> Stack:
>>>   sun.misc.Unsafe.park(Native Method)
>>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>> 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>> 
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>> 
>> org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:89)
>>> Thread 9 (taskCleanup):
>>> State: WAITING
>>> Blocked count: 30
>>> Waited count: 92
>>> Waiting on
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3298407f
>>> Stack:
>>>   sun.misc.Unsafe.park(Native Method)
>>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>> 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>> 
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>   org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:317)
>>>   java.lang.Thread.run(Thread.java:619)
>>> Thread 32 (TaskLauncher for task):
>>> State: WAITING
>>> Blocked count: 127
>>> Waited count: 123
>>> Waiting on java.util.LinkedList@c33377
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>> 
>> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
>>> Thread 31 (TaskLauncher for task):
>>> State: WAITING
>>> Blocked count: 3237
>>> Waited count: 3147
>>> Waiting on java.util.LinkedList@67001629
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>> 
>> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
>>> Thread 30 (Map-events fetcher for all reduce tasks on
>> tracker_node1t:localhost/127.0.0.1:48722):
>>> State: WAITING
>>> Blocked count: 712
>>> Waited count: 804
>>> Waiting on java.util.TreeMap@1fec8cf1
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>> 
>> org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:606)
>>> Thread 28 (IPC Server handler 7 on 48722):
>>> State: WAITING
>>> Blocked count: 10
>>> Waited count: 3199
>>> Waiting on
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>> Stack:
>>>   sun.misc.Unsafe.park(Native Method)
>>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>> 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>> 
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>> Thread 27 (IPC Server handler 6 on 48722):
>>> State: WAITING
>>> Blocked count: 265
>>> Waited count: 3207
>>> Waiting on
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>> Stack:
>>>   sun.misc.Unsafe.park(Native Method)
>>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>> 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>> 
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>> Thread 26 (IPC Server handler 5 on 48722):
>>> State: WAITING
>>> Blocked count: 296
>>> Waited count: 3206
>>> Waiting on
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>> Stack:
>>>   sun.misc.Unsafe.park(Native Method)
>>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>> 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>> 
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>> Thread 25 (IPC Server handler 4 on 48722):
>>> State: WAITING
>>> Blocked count: 199
>>> Waited count: 3202
>>> Waiting on
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>> Stack:
>>>   sun.misc.Unsafe.park(Native Method)
>>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>> 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>> 
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>> Thread 24 (IPC Server handler 3 on 48722):
>>> State: WAITING
>>> Blocked count: 216
>>> Waited count: 3201
>>> Waiting on
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>> Stack:
>>>   sun.misc.Unsafe.park(Native Method)
>>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>> 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>> 
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>> Thread 23 (IPC Server handler 2 on 48722):
>>> State: WAITING
>>> Blocked count: 119
>>> Waited count: 3199
>>> Waiting on
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>> Stack:
>>>   sun.misc.Unsafe.park(Native Method)
>>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>> 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>> 
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>> Thread 22 (IPC Server handler 1 on 48722):
>>> State: WAITING
>>> Blocked count: 193
>>> Waited count: 3207
>>> Waiting on
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>> Stack:
>>>   sun.misc.Unsafe.park(Native Method)
>>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>> 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>> 
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>> Thread 21 (IPC Server handler 0 on 48722):
>>> State: WAITING
>>> Blocked count: 285
>>> Waited count: 3203
>>> Waiting on
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@b4848ae
>>> Stack:
>>>   sun.misc.Unsafe.park(Native Method)
>>>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>> 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>> 
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>>>   org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
>>> Thread 18 (IPC Server listener on 48722):
>>> State: RUNNABLE
>>> Blocked count: 0
>>> Waited count: 0
>>> Stack:
>>>   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>>>   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>>>   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
>>>   org.apache.hadoop.ipc.Server$Listener.run(Server.java:318)
>>> Thread 20 (IPC Server Responder):
>>> State: RUNNABLE
>>> Blocked count: 0
>>> Waited count: 0
>>> Stack:
>>>   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>>>   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>>>   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>>>   org.apache.hadoop.ipc.Server$Responder.run(Server.java:478)
>>> Thread 17 (Timer-0):
>>> State: TIMED_WAITING
>>> Blocked count: 1
>>> Waited count: 12764
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.util.TimerThread.mainLoop(Timer.java:509)
>>>   java.util.TimerThread.run(Timer.java:462)
>>> Thread 16 (295726648@qtp0-0 - Acceptor0
>> SelectChannelConnector@0.0.0.0:50060):
>>> State: RUNNABLE
>>> Blocked count: 1310
>>> Waited count: 1
>>> Stack:
>>>   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>>>   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>>>   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>>>   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>>> 
>> org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
>>>   org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
>>> 
>> org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
>>> 
>> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
>>> 
>> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
>>> Thread 4 (Signal Dispatcher):
>>> State: RUNNABLE
>>> Blocked count: 0
>>> Waited count: 0
>>> Stack:
>>> Thread 3 (Finalizer):
>>> State: WAITING
>>> Blocked count: 1337
>>> Waited count: 1336
>>> Waiting on java.lang.ref.ReferenceQueue$Lock@5e77533b
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>>>   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>>>   java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>>> Thread 2 (Reference Handler):
>>> State: WAITING
>>> Blocked count: 1544
>>> Waited count: 1336
>>> Waiting on java.lang.ref.Reference$Lock@46efbdf1
>>> Stack:
>>>   java.lang.Object.wait(Native Method)
>>>   java.lang.Object.wait(Object.java:485)
>>>   java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>>> Thread 1 (main):
>>> State: RUNNABLE
>>> Blocked count: 152594
>>> Waited count: 279661
>>> Stack:
>>>   sun.management.ThreadImpl.getThreadInfo0(Native Method)
>>>   sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>>>   sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>>> 
>> org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
>>> 
>> org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
>>> 
>> org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1323)
>>> 
>> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1106)
>>>   org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
>>>   org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
>>> 
>>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker: About
>> to purge task: attempt_201004301437_0050_m_000001_0
>>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker:
>> addFreeSlot : current free slots : 1
>>> 2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskRunner:
>> attempt_201004301437_0050_m_000001_0 done; removing files.
>>> 2010-05-05 00:59:03,521 INFO org.apache.hadoop.mapred.IndexCache: Map ID
>> attempt_201004301437_0050_m_000001_0 not found in cache
>>> 2010-05-05 00:59:05,012 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>> jvm_201004301437_0050_m_-205889314 exited. Number of tasks it ran: 0
>>> 2010-05-05 00:59:06,521 INFO org.apache.hadoop.mapred.TaskRunner:
>> attempt_201004301437_0050_m_000001_0 done; removing files.
>>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker:
>> LaunchTaskAction (registerTask): attempt_201004301437_0050_m_000001_0 task's
>> state:FAILED_UNCLEAN
>>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: Trying
>> to launch : attempt_201004301437_0050_m_000001_0
>>> 2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: In
>> TaskLauncher, current free slots : 1 and trying to launch
>> attempt_201004301437_0050_m_000001_0
>>> 2010-05-05 00:59:08,606 INFO org.apache.hadoop.mapred.TaskTracker: JVM
>> with ID: jvm_201004301437_0050_m_-1095145752 given task:
>> attempt_201004301437_0050_m_000001_0
>>> 2010-05-05 00:59:09,222 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201004301437_0050_m_000001_0 0.0%
>>> 2010-05-05 00:59:09,560 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201004301437_0050_m_000001_0 0.0% cleanup
>>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker: Task
>> attempt_201004301437_0050_m_000001_0 is done.
>>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker:
>> reported output size for attempt_201004301437_0050_m_000001_0  was 0
>>> 2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskRunner:
>> attempt_201004301437_0050_m_000001_0 done; removing files.
>>> ----------------------------------------------------------------
>>> 
>>> The syslog of the attempt just looks like:
>>> ----------------------------------------------------------------
>>> 2010-05-04 23:48:03,365 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
>> Initializing JVM Metrics with processName=MAP, sessionId=
>>> 2010-05-04 23:48:03,674 WARN datameer.dap.sdk.util.ManifestMetaData:
>> Failed to get version from the manifest file of jar
>> 'file:/data/drive0/mapred/tmp/taskTracker/archive/nurago64.local/das/jobjars/61a2133b7887b9be48d42e49002c85e0/stripped-dap-0.24.dev-job.jar/stripped-dap-0.24.dev-job.jar'
>> : null
>>> 2010-05-04 23:48:04,691 INFO org.apache.hadoop.mapred.MapTask:
>> numReduceTasks: 0
>>> 2010-05-04 23:48:04,872 INFO org.apache.hadoop.util.NativeCodeLoader:
>> Loaded the native-hadoop library
>>> 2010-05-04 23:48:04,873 INFO
>> org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded &
>> initialized native-zlib library
>>> 2010-05-04 23:48:04,874 INFO org.apache.hadoop.io.compress.CodecPool: Got
>> brand-new compressor
>>> 2010-05-05 00:59:08,753 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
>> Initializing JVM Metrics with processName=CLEANUP, sessionId=
>>> 2010-05-05 00:59:09,223 INFO org.apache.hadoop.mapred.TaskRunner:
>> Runnning cleanup for the task
>>> 2010-05-05 00:59:09,442 INFO org.apache.hadoop.mapred.TaskRunner:
>> Task:attempt_201004301437_0050_m_000001_0 is done. And is in the process of
>> commiting
>>> 2010-05-05 00:59:09,562 INFO org.apache.hadoop.mapred.TaskRunner: Task
>> 'attempt_201004301437_0050_m_000001_0' done.
>>> ----------------------------------------------------------------
>>> 
>>> Any ideas ?
>>> 
>>> best regards
>>> Johannes
>> 
>> 
>> 
>> --
>> /*
>> Joe Stein
>> http://www.linkedin.com/in/charmalloc
>> */
>> 


Re: hanging task

Posted by Raghava Mutharaju <m....@gmail.com>.
Hello Johannes,

I had a similar problem and used the first approach suggested by Joseph,
i.e. reporting the status back. I used the progress() method as well as the
setStatus() method. The javadoc for progress() is linked below.

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Progressable.html#progress%28%29

This should solve your problem. If you find out why the tasks are taking
more time than you expected, please mention it here.
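
As an illustration, here is a minimal sketch using the old
org.apache.hadoop.mapred API (the class name, the batch size and the
fetchExternalRecords() helper are made up for the example, not taken from
your job):

    import java.io.IOException;
    import java.util.Collections;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LongRunningMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, NullWritable> output,
                      Reporter reporter) throws IOException {
        int processed = 0;
        for (String record : fetchExternalRecords(value.toString())) {
          // ... long-running work that produces no output for a while ...
          processed++;
          if (processed % 1000 == 0) {
            // Any of these resets the tasktracker's "unresponsive task" timer:
            reporter.progress();
            reporter.setStatus("processed " + processed + " records");
            // or: reporter.incrCounter("myjob", "records", 1000);
          }
          output.collect(new Text(record), NullWritable.get());
        }
      }

      // Placeholder for the external read the real job performs.
      private Iterable<String> fetchExternalRecords(String source) {
        return Collections.emptyList();
      }
    }

Reporting every few thousand records is cheap; it just has to happen more
often than the ~600 second timeout shown in your logs.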

Regards,
Raghava.

On Fri, May 7, 2010 at 10:31 AM, Joseph Stein <cr...@gmail.com> wrote:

> You need to either report status or increment a counter from within
> your task. In your Java code there is a little trick to help the job
> be “aware” within the cluster of tasks that are not dead but just
> working hard.  During execution of a task there is no built in
> reporting that the job is running as expected if it is not writing
> out.  So this means that if your tasks are taking up a lot of time
> doing work it is possible the cluster will see that task as failed
> (based on the mapred.task.tracker.expiry.interval setting).
>
> Have no fear there is a way to tell cluster that your task is doing
> just fine.  You have 2 ways todo this you can either report the status
> or increment a counter.  Both of these will cause the task tracker to
> properly know the task is ok and this will get seen by the jobtracker
> in turn.  Both of these options are explained in the JavaDoc
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html
>
> Here are some pointers in general you may find useful
>
> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
>
> On Fri, May 7, 2010 at 10:26 AM, Johannes Zillmann
> <jz...@googlemail.com> wrote:
> > Hi hadoop folks,
> >
> > i'm encountering following problem on a 4 node cluster running
> hadoop-0.20.2.
> >
> > Its a map only job reading about 9 GB data from outside of hadoop. 31 map
> tasks at all while 12 map tasks running at a time.
> > The first wave of mappers finishes successfully.
> > Later on the first tasks are failing and they do that shortly before
> finishing:
> > Task attempt_201005061210_0002_m_000001_0 failed to report status for 602
> seconds. Killing!
> > Task attempt_201005061210_0002_m_000001_1 failed to report status for 600
> seconds. Killing!
> > Task attempt_201005061210_0002_m_000001_2 failed to report status for 602
> seconds. Killing!
> >
> > The unusual is that i do not find any signs of the job code in the thread
> dump the tasktracker takes automatically:
> > [... full thread dump and task-attempt logs snipped; they are quoted in
> > full earlier in this thread ...]
> >
> > Any ideas ?
> >
> > best regards
> > Johannes
>
>
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
>